https://haojian.github.io/DSC102SP24/static_files/presentations/10DataflowSystems.pdf
A programming model for parallel programs on sharded data (partitioned across multiple shared-nothing servers) & distributed system architecture
System handles data distribution, parallelization, fault tolerance under the hood
Map(): process one record independently
Reduce(): gather all Map outputs across workers sharing same key into an iterator
Ex: count word occurences in corpus
each mapper and reducer is a separate process, reducers face barrier synchronization by Bulk Synchronous Parallelism
Pros
Cons
Ex: emulating in SQL
Dataflow programming model inspired by Pandas chaining functions that exploits distributed memory to cache data
Architecture