5/22 - Parallel Operations TODO


Select, Project, Aggregate, Groupby

Basic Idea: manager splits the work → each worker node does its local work → manager unifies the results

Screen Shot 2024-05-29 at 3.18.26 PM.png

Green marks what each worker sends to the manager

Example Query: Select ‘3b’

I/O cost: 6 pages, 0 workers (already on disk)
Example Query: Select C from R

I/O cost: 6 pages, 0 workers (already on disk)
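A minimal sketch of how the select and project examples above could run on sharded data. The relation contents, the key column `A`, and the helper names (`local_select`, `local_project`, `run_on_workers`) are all hypothetical; the sketch also assumes '3b' is a value matched by an equality predicate. Workers are simulated with a plain loop rather than real processes.

```python
# Hypothetical toy relation R(A, C), split into 3 shards (one per worker).
shards = [
    [{"A": "1a", "C": 10}, {"A": "1b", "C": 11}],
    [{"A": "2a", "C": 20}, {"A": "2b", "C": 21}],
    [{"A": "3a", "C": 30}, {"A": "3b", "C": 31}],
]

def local_select(shard, predicate):
    """Worker-side select: keep only the rows that satisfy the predicate."""
    return [row for row in shard if predicate(row)]

def local_project(shard, columns):
    """Worker-side project: keep only the requested columns of every row."""
    return [{c: row[c] for c in columns} for row in shard]

def run_on_workers(shards, local_fn):
    """Manager side: run local_fn on every shard and concatenate the outputs."""
    result = []
    for shard in shards:  # stands in for work done in parallel
        result.extend(local_fn(shard))
    return result

selected = run_on_workers(shards, lambda s: local_select(s, lambda r: r["A"] == "3b"))
projected = run_on_workers(shards, lambda s: local_project(s, ["C"]))
```

Neither operation needs a combining step beyond concatenation, since each output row depends on exactly one input row.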
Example Query: Select MAX(A) from R

I/O cost: 6 pages, 3 workers (each worker must compute the max over its local pages)
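The MAX example can be sketched as a two-level aggregate: every worker reduces its own pages to one value and the manager takes the max of those partial results. The page contents and function names below are made up for illustration, and the workers are simulated sequentially.

```python
# Hypothetical layout: 3 workers, each holding 2 pages of column A values.
worker_pages = [
    [[3, 8], [1, 5]],
    [[15, 4], [2, 7]],
    [[9, 11], [6, 0]],
]

def local_max(pages):
    """Worker side: scan the local pages and return a single partial max."""
    return max(value for page in pages for value in page)

def parallel_max(worker_pages):
    """Manager side: collect one partial max per worker, then combine them."""
    partials = [local_max(pages) for pages in worker_pages]
    return max(partials)

result = parallel_max(worker_pages)
```

Each worker sends only one number to the manager, so the network traffic does not grow with the number of rows.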
Example Query: Select A, SUM(D) from R GroupBy A
- each worker builds a partial hash table from its local shard and sends it to the manager
- the manager unifies the partial hash tables
- network I/O depends on the dataset size
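The GroupBy steps above can be sketched as follows. The shard contents and the names `local_groupby_sum` and `unify` are assumptions for illustration; each row is reduced to an `(A, D)` pair, and workers are simulated with a loop.

```python
from collections import defaultdict

# Hypothetical shards of R, keeping only the (A, D) pair of each row.
shards = [
    [("x", 1.0), ("y", 2.0)],
    [("x", 3.0), ("z", 4.0)],
    [("y", 5.0), ("x", 6.0)],
]

def local_groupby_sum(shard):
    """Worker side: build a partial hash table {A: SUM(D)} over the local shard."""
    table = defaultdict(float)
    for a, d in shard:
        table[a] += d
    return dict(table)

def unify(partial_tables):
    """Manager side: merge the partial hash tables by adding sums per key."""
    total = defaultdict(float)
    for table in partial_tables:
        for a, partial_sum in table.items():
            total[a] += partial_sum
    return dict(total)

partials = [local_groupby_sum(shard) for shard in shards]
result = unify(partials)
```

In this sketch each worker ships one entry per distinct key it has seen, so the amount sent over the network grows with the data on each shard.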

Matrix Sum / Norm

Example Query: |M|^2

I/O cost: 6 pages, 3 workers (one to compute sum of squares)
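A small sketch of the norm computation, assuming |M|^2 means the sum of the squared entries of M (the squared Frobenius norm). The tiles and function names are hypothetical, and pure Python lists stand in for the matrix pages held by each worker.

```python
# Hypothetical: M is split into 3 tiles, one per worker.
tiles = [
    [[1, 2], [3, 4]],
    [[0, -1], [2, 2]],
    [[5, 0], [0, 1]],
]

def local_sum_of_squares(tile):
    """Worker side: sum of squared entries over the local tile."""
    return sum(v * v for row in tile for v in row)

def squared_norm(tiles):
    """Manager side: add up the partial sums of squares from all workers."""
    partials = [local_sum_of_squares(tile) for tile in tiles]
    return sum(partials)

result = squared_norm(tiles)
```

As with MAX, every worker sends back a single number, because addition is commutative and associative and the partial sums can be combined in any order.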

Scalable SGD

An SGD epoch is similar to a SQL aggregate but is more complex, since it performs multiple mini-batch updates

The updates are not commutative, so an epoch is hard to parallelize

Solution: multiple managers, each responsible for part of the weight vector (multi-server manager)

Workers send gradients to the managers for updates at each mini-batch (high I/O)
Model params may get out of sync, but since SGD is robust, multiple updates / epochs will still lead to convergence
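A rough sketch of the multi-manager idea, under several assumptions: a linear model with squared-error loss, two managers that each own half of a 4-element weight vector, and workers simulated one after another. The class and function names are hypothetical. The loop is synchronous, so it does not model the out-of-sync parameters mentioned above; it only shows how gradients are sliced and routed to the manager that owns each part of the weights.

```python
class ManagerShard:
    """Hypothetical manager owning the weight slice [start, stop)."""
    def __init__(self, start, stop, lr):
        self.start, self.stop, self.lr = start, stop, lr
        self.weights = [0.0] * (stop - start)

    def apply_gradient(self, grad_slice):
        # One SGD update on this manager's part of the weight vector.
        for i, g in enumerate(grad_slice):
            self.weights[i] -= self.lr * g

def pull_weights(managers):
    """Worker side: fetch and concatenate the current weights from all managers."""
    weights = []
    for m in managers:
        weights.extend(m.weights)
    return weights

def minibatch_gradient(w, batch):
    """Worker side: mean squared-error gradient of y ~ w.x over one mini-batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * err * xj / len(batch)
    return grad

def train(worker_batches, managers, epochs):
    for _ in range(epochs):
        for batch in worker_batches:  # one mini-batch per worker
            w = pull_weights(managers)
            grad = minibatch_gradient(w, batch)
            for m in managers:  # send each manager only its slice
                m.apply_gradient(grad[m.start:m.stop])
    return pull_weights(managers)

# Toy data (assumed): unit-vector inputs whose targets are the true weights.
true_w = [1.0, -2.0, 0.5, 3.0]
worker_batches = [
    [([1, 0, 0, 0], 1.0), ([0, 1, 0, 0], -2.0)],
    [([0, 0, 1, 0], 0.5), ([0, 0, 0, 1], 3.0)],
]
managers = [ManagerShard(0, 2, lr=0.5), ManagerShard(2, 4, lr=0.5)]
learned = train(worker_batches, managers, epochs=40)
```

Every mini-batch triggers one pull and one push per manager, which is where the high I/O cost noted above comes from.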

Review

1. What are the 4 main regimes of scalable data access?
2. Briefly explain 1 pro and 1 con of scaling with local disk vs. scaling with remote reads.
3. You are given a DataFrame serialized as a 100 GB Parquet columnar file. It has 20 columns, all of the same fixed-length data type. You compute a sum over 4 columns. What is the I/O cost (in GB)?
4. Which is the most flexible data layout format for 2-D structured data?
5. You lay out a 1 TB matrix in tile format with a shape 2000x500. What is the I/O cost (in GB) of computing its full matrix sum?
6. Briefly explain 1 pro and 1 con of SGD vs. BGD.
7. Suppose you use scalable SGD to train a DL model. The dataset has 100 million examples. Mini-batch size is set to 50. How many iterations (number of model update steps) will SGD finish in 20 epochs?
8. What is the precise runtime tradeoff involved in shuffle-once-upfront vs. shuffle-every-epoch for SGD?