MapReduce

A programming model for parallel programs over sharded data (partitioned across multiple shared-nothing servers), together with the distributed system architecture that runs them

System handles data distribution, parallelization, fault tolerance under the hood
Map(): process one record independently
- could be batch of multiple examples
- dependences across mappers not allowed
- allows for diff. input & output data types
Reduce(): gather all Map outputs across workers sharing same key into an iterator
- agg. function on iterator
Ex: count word occurrences in corpus
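
A minimal single-machine sketch of the word-count example, to make the Map() / Reduce() split concrete. The names map_fn, shuffle, reduce_fn and word_count are made up for illustration; in a real system the grouping step is done by the framework across workers, not by user code.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(record: str) -> Iterator[tuple[str, int]]:
    """Map(): process one record independently, emit (word, 1) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all Map outputs by key (handled by the system in practice)."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key: str, values: Iterator[int]) -> tuple[str, int]:
    """Reduce(): aggregate the iterator of values that share one key."""
    return key, sum(values)


def word_count(corpus: Iterable[str]) -> dict[str, int]:
    mapped = (pair for record in corpus for pair in map_fn(record))
    grouped = shuffle(mapped)
    return dict(reduce_fn(k, iter(vs)) for k, vs in grouped.items())


corpus = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(corpus)
```

Note the input type (a line of text) differs from the output type ((word, count) pairs), which is the "diff. input & output data types" point above.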

Each mapper and reducer is a separate process. Reducers wait at a synchronization barrier until all mappers have finished (Bulk Synchronous Parallel model)
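
A rough sketch of the barrier between the map and reduce phases. Assumptions: threads stand in for the separate worker processes, and run_mapper / run_reducer are hypothetical names. Each mapper here returns partial counts for its own shard, and no reducer starts until every mapper has returned.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapper(shard):
    """One mapper: count words in its own shard only (no cross-mapper state)."""
    local = defaultdict(int)
    for line in shard:
        for word in line.split():
            local[word] += 1
    return dict(local)


def run_reducer(item):
    """One reducer: sum the partial counts collected for a single key."""
    key, partials = item
    return key, sum(partials)


shards = [["a b a"], ["b c"], ["a c c"]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Superstep 1: all mappers run in parallel.
    # list() blocks until every mapper has returned: this acts as the barrier.
    map_outputs = list(pool.map(run_mapper, shards))

    # Shuffle: regroup the partial counts by key.
    by_key = defaultdict(list)
    for output in map_outputs:
        for key, count in output.items():
            by_key[key].append(count)

    # Superstep 2: reducers start only after the barrier has been passed.
    totals = dict(pool.map(run_reducer, by_key.items()))
```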

Pros

Map() and Reduce() are highly general for diff. data structures, ETL
Native scalability, large cluster parallelism
Fault Tolerance automatically handled
Some jobs are map-only (Reduce() not needed when no cross-shard aggregation is required)
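
A small sketch of a map-only job, assuming a hypothetical record-cleaning task (parse_record and the CSV layout are invented for illustration). Each output shard depends only on its own input shard, so there is no shuffle and no Reduce().

```python
def parse_record(line: str) -> dict:
    """Map(): turn one raw CSV line into a cleaned record."""
    name, age = line.split(",")
    return {"name": name.strip().title(), "age": int(age)}


# Two input shards; in a real cluster each would live on a different server.
shards = [["alice, 30", "bob,25"], ["  carol ,41"]]

# Every shard is transformed independently and written back out as-is.
output_shards = [[parse_record(line) for line in shard] for shard in shards]
```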

Cons

Ex: emulating in SQL

[Screenshot: SQL emulation example]
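
One possible reading of "emulating in SQL", sketched under assumptions: the word count from earlier written as a single SQL aggregate, where GROUP BY plays the role of grouping by key and COUNT(*) plays the role of Reduce(). The table name words and the sample data are invented, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

# One row per word occurrence, i.e. what the mappers would have emitted as keys.
tokens = "the quick fox the lazy fox the".split()
conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in tokens])

# GROUP BY gathers rows sharing a key; COUNT(*) is the aggregation per key.
rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
sql_counts = dict(rows)
conn.close()
```

The contrast is that the declarative query is one line, while the MapReduce version needs hand-written map and reduce functions.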

Spark

Dataflow programming model built from chained operations (similar in style to Pandas method chaining) that exploits distributed memory to cache data

Architecture

[Screenshot: Spark architecture diagram]