An open-source software framework for storing and processing massive datasets across clusters of computers. It uses a distributed architecture, breaking data into pieces and processing them in parallel.
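
The split-then-process-in-parallel idea can be illustrated with a minimal single-machine sketch. This is not the framework's API: the function names (`split_into_chunks`, `count_words`, `parallel_word_count`) and the word-count task are assumptions chosen for illustration, and a thread pool stands in for the cluster's worker machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break the dataset into fixed-size pieces (the last one may be shorter)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_words(chunk):
    """Process one piece independently: count the words in its lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, chunk_size=2, workers=4):
    """Hand each piece to a worker, then merge the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    # Threads stand in for cluster nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    lines = ["big data big clusters", "data in pieces", "pieces in parallel"]
    print(parallel_word_count(lines))
```

Each piece is processed without reference to any other, which is what allows the work to be spread across many machines. In a real cluster the pieces would also be stored on different nodes, so the computation can run where the data already resides; this sketch shows only the structure of the computation, not the storage or the actual speedup.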