
Select, Project, Aggregate, Group By

Basic idea: manager splits the work across nodes → each node does its work locally → manager unifies the partial results
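The split / local work / unify pattern can be sketched in a single process for a group-by aggregate. This is only an illustration: the function names (`split`, `local_groupby_sum`, `unify`) and the round-robin partitioning are assumptions, not something from the lecture. Select and project are simpler, since the manager only has to concatenate what the nodes return.

```python
from collections import defaultdict

def split(rows, num_nodes):
    """Manager: partition the rows round-robin across the nodes (assumed scheme)."""
    return [rows[i::num_nodes] for i in range(num_nodes)]

def local_groupby_sum(partition):
    """Node: compute a partial (sum, count) per key on its own partition only."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)

def unify(partials):
    """Manager: merge the small per-node partials into the final answer."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            total[key][0] += s
            total[key][1] += c
    return {k: {"sum": s, "count": c, "avg": s / c} for k, (s, c) in total.items()}

rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0), ("a", 5.0)]
partials = [local_groupby_sum(p) for p in split(rows, 2)]  # one call per node
result = unify(partials)
```

Only the per-key partials travel to the manager, not the raw rows. Note that the average is rebuilt from partial sums and counts; averaging the per-node averages would give the wrong answer when partitions differ in size.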

[Figure: Screen Shot 2024-05-29 at 3.18.26 PM.png] Green marks what is sent to the manager.

Matrix sum / norm are okay too (same pattern)
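Matrix sum and norm fit the same manager/node pattern because both reduce to adding up per-node partials. A minimal sketch, assuming the matrix is split into row blocks and that "norm" means the Frobenius norm (the notes do not say which norm); the helper names are hypothetical.

```python
import math

def split_rows(matrix, num_nodes):
    """Manager: cut the matrix into contiguous row blocks, one per node."""
    chunk = math.ceil(len(matrix) / num_nodes)
    return [matrix[i:i + chunk] for i in range(0, len(matrix), chunk)]

def local_stats(block):
    """Node: return (partial sum, partial sum of squares) for its block."""
    s = sum(x for row in block for x in row)
    sq = sum(x * x for row in block for x in row)
    return s, sq

def unify(partials):
    """Manager: add the partials; the square root is taken once, at the end."""
    total = sum(s for s, _ in partials)
    frobenius = math.sqrt(sum(sq for _, sq in partials))
    return total, frobenius

matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
partials = [local_stats(block) for block in split_rows(matrix, 2)]
matrix_sum, frobenius_norm = unify(partials)
```

Each node sends just two numbers to the manager regardless of how large its block is.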

Scalable SGD

An SGD epoch is similar to a SQL aggregate, but more complex: instead of one combine step at the end, it involves many mini-batch updates.

Solution: use multiple managers, each responsible for one part of the weight vector (multi-server manager)
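The idea of several managers each owning a slice of the weight vector can be sketched as a single-process simulation. Everything here is an assumption made for illustration: the `Manager` class, the contiguous sharding in `make_shards`, the linear model with squared error, and the toy data. A real system would run managers and workers on separate machines and exchange slices over the network; this arrangement is commonly called a parameter server.

```python
def make_shards(dim, num_managers):
    """Split weight indices [0, dim) into contiguous ranges, one per manager."""
    base, extra = divmod(dim, num_managers)
    bounds, start = [], 0
    for m in range(num_managers):
        end = start + base + (1 if m < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

class Manager:
    """Owns one slice of the weight vector and applies updates to that slice only."""
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.weights = [0.0] * (end - start)

    def apply(self, grad_slice, lr):
        self.weights = [w - lr * g for w, g in zip(self.weights, grad_slice)]

def gather(managers):
    """Worker: pull the current slices and assemble the full weight vector."""
    return [w for m in managers for w in m.weights]

def minibatch_gradient(weights, batch):
    """Worker: gradient of mean squared error for a linear model on one mini-batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(batch)
    return grad

def sgd_epoch(managers, batches, lr):
    for batch in batches:
        grad = minibatch_gradient(gather(managers), batch)
        for m in managers:  # each manager receives only its part of the gradient
            m.apply(grad[m.start:m.end], lr)

# Toy data whose exact solution is w = [1, 2, 3].
data = [([1.0, 0.0, 0.0], 1.0), ([0.0, 1.0, 0.0], 2.0), ([0.0, 0.0, 1.0], 3.0)] * 2
batches = [data[i:i + 2] for i in range(0, len(data), 2)]
managers = [Manager(s, e) for s, e in make_shards(3, 2)]
for _ in range(200):
    sgd_epoch(managers, batches, lr=0.1)
weights = gather(managers)
```

Because no single manager holds or updates the whole vector, the per-update traffic and memory are spread across managers instead of concentrating on one.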

Review

1. What are the 4 main regimes of scalable data access?
2. Briefly explain 1 pro and 1 con of scaling with local disk vs. scaling with remote reads.
3. You are given a DataFrame serialized as a 100 GB Parquet columnar file. It has 20 columns, all of the same fixed-length data type. You compute a sum over 4 columns. What is the I/O cost (in GB)?
4. Which is the most flexible data layout format for 2-D structured data?
5. You lay out a 1 TB matrix in tile format with a shape 2000x500. What is the I/O cost (in GB) of computing its full matrix sum?
6. Briefly explain 1 pro and 1 con of SGD vs. BGD.
7. Suppose you use scalable SGD to train a DL model. The dataset has 100 million examples. Mini-batch size is set to 50. How many iterations (number of model update steps) will SGD finish in 20 epochs?
8. What is the precise runtime tradeoff involved in shuffle-once-upfront vs. shuffle-every-epoch for SGD?
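The arithmetic behind questions 3 and 7 can be checked with a few lines. This is a sketch under stated assumptions, not an official answer key: for question 3 it assumes a columnar read touches only the requested columns, that all columns are the same size, and that compression and file metadata are ignored; for question 7 it assumes one model update per mini-batch.

```python
# Question 3: columnar file, only the summed columns are read (assumption).
file_gb = 100
total_cols = 20
cols_read = 4
columnar_io_gb = file_gb * cols_read / total_cols

# Question 7: one update step per mini-batch, repeated for every epoch.
num_examples = 100_000_000
batch_size = 50
epochs = 20
iterations_per_epoch = num_examples // batch_size
total_iterations = iterations_per_epoch * epochs

print(columnar_io_gb, total_iterations)
```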