5/20 - Parallelism Architectures | Notion

https://haojian.github.io/DSC102SP24/static_files/presentations/9DataParallelismReplication.pdf

3 Paradigms of Multi-Node Parallelism

Shared-Nothing Data Parallelism

Sharding: partition a data file across nodes
ETL: processing done to the file before its ready for query, analyze, etc

Data Partitioning Strategies

Round-robin: assign tupule i to node i MOD k
Hashing: needs hash partitioning attributes
Range-based: needs ordinal partitioning attributes

Tradeoffs:

hashing most common for RA/SQL
Range-based often goodfor range predicates in RA/SQL

Replication of partition across nodes is common to enable fault tolerance and better parallel runtime

Cluster Architectures

Manager-Worker: Manager tells workers what to do and when to talk to other nodes

Most common in data systems (e.g., Dask, Spark, par. RDBMS, etc.)

Peer-to-Peer: workers talk to each other directly (decentralized)

Quantifying Speedup from Parallelism

Screen Shot 2024-05-22 at 3.11.36 PM.png