https://haojian.github.io/DSC102SP24/static_files/presentations/10DataflowSystems.pdf

MapReduce

A programming model for writing parallel programs over sharded data (data partitioned across multiple shared-nothing servers), paired with a distributed system architecture that executes those programs
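
A rough sketch of the model, using word count as the example (not taken from the slides). It runs in a single process, so the "shards" are just Python lists standing in for partitions on separate servers. The helper names `map_phase`, `shuffle`, and `reduce_phase` are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(shard):
    """Map: emit a (word, 1) pair for every word in one shard."""
    return [(word, 1) for line in shard for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Two toy shards; in a real deployment each would live on a different server.
shards = [["a b a"], ["b c"]]

# Each shard is mapped independently, which is what makes the map step parallelizable.
mapped = chain.from_iterable(map_phase(shard) for shard in shards)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The map step only ever looks at one shard and the reduce step only ever looks at one key, so both can be spread across servers; the shuffle is the one step that has to move data between them.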

Pros

Cons

Ex: emulating in SQL

Screen Shot 2024-06-08 at 6.12.52 PM.png
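
One possible way to see the SQL emulation, sketched with Python's built-in `sqlite3` module. This is an assumption about what such an emulation looks like, not a transcription of the screenshot: the table name `words` and the word-count task are invented for illustration. Selecting the key column plays the role of map, `GROUP BY` plays the role of shuffle, and the aggregate plays the role of reduce.

```python
import sqlite3

# In-memory database with a hypothetical one-column table of words.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [(w,) for w in "a b a b c".split()],
)

# GROUP BY groups rows by key (shuffle); COUNT(*) aggregates each group (reduce).
rows = conn.execute(
    "SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY word"
).fetchall()
word_counts = dict(rows)
conn.close()

print(word_counts)
```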

Spark

A dataflow programming model, inspired by Pandas-style function chaining, that exploits distributed memory to cache data
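
A toy sketch of the two ideas in that sentence: chained transformations and caching an intermediate result in memory. This is NOT Spark's API. `LazyDataset` is a hypothetical single-process class written only for this note, and nothing here is distributed.

```python
class LazyDataset:
    """Hypothetical stand-in for a dataflow dataset: transformations are
    recorded lazily and only run when collect() is called."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that produces a list
        self._use_cache = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        """Mark this dataset so its result is kept in memory after first use."""
        self._use_cache = True
        return self

    def collect(self):
        if not self._use_cache:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return list(self._cached)  # copy so callers cannot mutate the cache


stats = {"calls": 0}


def square(x):
    stats["calls"] += 1  # count how often the transformation actually runs
    return x * x


# Chained calls build the dataflow; cache() keeps the squares for reuse.
squares = LazyDataset.from_list(range(5)).map(square).cache()
evens = squares.filter(lambda x: x % 2 == 0).collect()
total = sum(squares.collect())
print(evens, total, stats["calls"])
```

Because `squares` is cached, the second `collect()` reuses the stored result instead of re-running `square`. Without the `cache()` call, every downstream use would recompute the whole chain.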

Architecture

Screen Shot 2024-06-08 at 6.22.37 PM.png