Basic Idea: Manager splits the work → each worker node does its local work → manager unifies the partial results
Green marks what each worker sends to the manager
Example Query: Select ‘3b’
I/O cost: 6 pages, 0 workers (already on disk)
Example Query: Select C from R
I/O cost: 6 pages, 0 workers (already on disk)
Example Query: Select MAX(A) from R
I/O cost: 6 pages, 3 workers (each worker must compute the max over its own pages; the manager takes the max of those)
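A minimal sketch of the MAX case, assuming 6 pages split over 3 workers. The page contents, the round-robin split, and all function names are made up for illustration and are not from the notes:

```python
# Hypothetical data: 6 pages holding values of column A.
pages = [[3, 7], [1, 9], [4, 4], [8, 2], [6, 5], [0, 10]]
num_workers = 3


def split_pages(pages, num_workers):
    """Manager step: assign pages to workers round-robin (an assumed policy)."""
    return [pages[i::num_workers] for i in range(num_workers)]


def local_max(worker_pages):
    """Worker step: scan only the local pages and return one number."""
    return max(max(page) for page in worker_pages)


def distributed_max(pages, num_workers):
    partitions = split_pages(pages, num_workers)
    # Each partial result is a single value; this is all that goes to the manager.
    partials = [local_max(p) for p in partitions]
    # Manager step: unify the partial results.
    return max(partials)


result = distributed_max(pages, num_workers)
```

All 6 pages are still read once in total, but each worker ships back only one value rather than its pages.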
Example Query: Select A, SUM(D) from R Group By A
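A sketch of how the group-by aggregate could follow the same pattern: each worker builds partial sums per value of A, and the manager adds up partials that share a key. The rows, the partitioning, and the function names are assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical (A, D) rows of R, already partitioned across 3 workers.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def local_group_sum(rows):
    """Worker step: partial SUM(D) per value of A over the local rows."""
    sums = defaultdict(int)
    for a, d in rows:
        sums[a] += d
    return dict(sums)


def merge_group_sums(partials):
    """Manager step: add up the partial sums that share the same key."""
    total = defaultdict(int)
    for partial in partials:
        for a, s in partial.items():
            total[a] += s
    return dict(total)


result = merge_group_sums([local_group_sum(p) for p in partitions])
```

Unlike MAX, each worker here sends one value per distinct group it saw, so the amount sent to the manager grows with the number of groups.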
Example Query: |M|^2
I/O cost: 6 pages, 3 workers (one to compute sum of squares)
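A sketch under the assumption that |M|^2 means the sum of squared entries of a matrix M, with each row standing in for one page. The matrix values and the contiguous row-block split are hypothetical:

```python
# Hypothetical matrix M with 6 rows (one row per page in this sketch).
M = [[1, 2], [3, 4], [0, 1], [2, 2], [1, 1], [3, 0]]
num_workers = 3


def local_sum_of_squares(rows):
    """Worker step: sum of squares over the local block of rows."""
    return sum(x * x for row in rows for x in row)


# Manager step 1: split M into contiguous row blocks, one per worker.
rows_per_worker = len(M) // num_workers
blocks = [
    M[i * rows_per_worker:(i + 1) * rows_per_worker]
    for i in range(num_workers)
]

# Each worker returns a single scalar.
partials = [local_sum_of_squares(block) for block in blocks]

# Manager step 2: add the partial sums.
norm_squared = sum(partials)
```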
An SGD epoch is similar to SQL aggregates but more complex, because it involves multiple mini-batch updates
Solution: multiple managers, each responsible for a part of the weight vector (multi-server manager)
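A toy, single-process sketch of the idea, in the style often called a parameter server: the weight vector is cut into contiguous shards, each held by its own manager, and one mini-batch update touches every shard. The class and function names, the linear model with squared error, and the data are all assumptions for illustration, not the notes' method:

```python
class ShardServer:
    """One manager holding the slice [start, stop) of the weight vector."""

    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.weights = [0.0] * (stop - start)

    def pull(self):
        return list(self.weights)

    def push(self, grad_slice, lr):
        # Apply the SGD update to this shard only.
        self.weights = [w - lr * g for w, g in zip(self.weights, grad_slice)]


def pull_all(servers):
    """Worker step: assemble the full weight vector from all shards."""
    w = []
    for server in servers:
        w.extend(server.pull())
    return w


def minibatch_gradient(w, batch):
    """Gradient of mean squared error for a linear model y ~ w.x (assumed model)."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * err * xj / len(batch)
    return grad


def sgd_step(servers, batch, lr):
    w = pull_all(servers)
    grad = minibatch_gradient(w, batch)
    # Each manager receives only the gradient slice for its own shard.
    for server in servers:
        server.push(grad[server.start:server.stop], lr)


# Two managers, each owning half of a 4-dimensional weight vector.
servers = [ShardServer(0, 2), ShardServer(2, 4)]
batch = [([1.0, 0.0, 0.0, 0.0], 1.0), ([0.0, 0.0, 1.0, 0.0], 2.0)]
sgd_step(servers, batch, lr=0.5)
```

An epoch would repeat `sgd_step` once per mini-batch, which is why it is heavier than a one-shot SQL aggregate: the shards change after every step.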
1. What are the 4 main regimes of scalable data access?
2. Briefly explain 1 pro and 1 con of scaling with local disk vs. scaling with remote reads.
3. You are given a DataFrame serialized as a 100 GB Parquet columnar file. It has 20 columns, all of the same fixed-length data type. You compute a sum over 4 columns. What is the I/O cost (in GB)?
4. Which is the most flexible data layout format for 2-D structured data?
5. You lay out a 1 TB matrix in tile format with a shape 2000x500. What is the I/O cost (in GB) of computing its full matrix sum?
6. Briefly explain 1 pro and 1 con of SGD vs. BGD.
7. Suppose you use scalable SGD to train a DL model. The dataset has 100 million examples. Mini-batch size is set to 50. How many iterations (number of model update steps) will SGD finish in 20 epochs?
8. What is the precise runtime tradeoff involved in shuffle-once-upfront vs. shuffle-every-epoch for SGD?