https://haojian.github.io/DSC102SP24/static_files/presentations/9DataParallelismReplication.pdf
3 Paradigms of Multi-Node Parallelism
Shared-Nothing Data Parallelism
- Sharding: partition a data file across nodes
- ETL: processing done to the file before its ready for query, analyze, etc
Data Partitioning Strategies
- Round-robin: assign tupule i to node i MOD k
- Hashing: needs hash partitioning attributes
- Range-based: needs ordinal partitioning attributes
Tradeoffs:
- hashing most common for RA/SQL
- Range-based often goodfor range predicates in RA/SQL
Replication of partition across nodes is common to enable fault tolerance and better parallel runtime
Cluster Architectures
Manager-Worker: Manager tells workers what to do and
when to talk to other nodes
- Most common in data systems (e.g., Dask, Spark, par. RDBMS, etc.)
Peer-to-Peer: workers talk to each other directly (decentralized)
Quantifying Speedup from Parallelism
