https://canvas.ucsd.edu/courses/54740/external_tools/82
https://haojian.github.io/DSC102SP24/lectures/
You are given a large training dataset with schema D (Y, X1, X2, …, Xd). The target Y and all features are numeric and stored as float64.
The number of examples is n. The dataset is laid out in row-store format on disk (no compression) on your single machine. No batching of file scans or caching of data is performed.
Step 1) You build linear models using BGD. You try 3 different learning rates and train each model for 5 epochs (the loop is sketched below).
Step 2) You now read the data and append quadratic (order-2) feature interactions as explained in the lecture, then write the output file to disk. All features are still stored as float64.
Step 3) You rebuild the linear models on the output file of Step 2 using the same model-selection process as in Step 1.
Given: n = 1 billion, d = 3
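A minimal sketch of the Step 1 model-selection loop, assuming a tiny in-memory stand-in for the data; the variable names and the specific learning-rate values are illustrative, not from the course materials. The point is the loop structure: with no caching, each epoch below would be one full filescan of the on-disk file.

```python
import numpy as np

# Tiny in-memory stand-in for the n = 1 billion, d = 3 dataset.
rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

learning_rates = [0.1, 0.01, 0.001]   # 3 candidate learning rates (illustrative values)
epochs = 5                            # 5 epochs per candidate

for lr in learning_rates:
    w = np.zeros(d)
    for epoch in range(epochs):       # each epoch = one full pass over all n examples
        grad = X.T @ (X @ w - y) / n  # batch gradient (BGD uses the whole dataset)
        w -= lr * grad
# 3 learning rates x 5 epochs = 15 full passes over the data in total.
```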
n × (d + 1) matrix (3 features + the target Y per row)
= 4 billion entries × 8 bytes per entry
= 32 billion bytes ÷ (1000 bytes/KB × 1000 KB/MB × 1000 MB/GB)
= 32 GB file
5 epochs × 3 learning rates = 15 total epochs, each epoch reading the data in once
32 GB × 15 total epochs (the file must be re-read from disk every epoch, since no data is cached) = 480 GB read
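A quick arithmetic check of the file size and the Step 1 read cost, using only the numbers given in the problem (metric units, 1 GB = 10^9 bytes):

```python
n = 1_000_000_000               # examples
d = 3                           # features
bytes_per_value = 8             # float64

entries = n * (d + 1)                    # 4 billion values: 3 features + Y per row
file_gb = entries * bytes_per_value / 1000**3
print(file_gb)                           # 32.0 GB file

epochs_total = 5 * 3                     # 5 epochs x 3 learning rates
print(file_gb * epochs_total)            # 480.0 GB read from disk in Step 1
```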
Quadratic interactions over the d = 3 features (squares included):
= d(d + 1)/2 = 3(3 + 1)/2 = 12/2 = 6 new features + 3 original features = 9 features
Step 2 reads the original 3-feature file once: 32 GB read.
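One way to see where the 6 quadratic terms come from, assuming the lecture's convention counts squares like X1·X1 among the interactions (which the d(d+1)/2 formula implies); the feature names here are placeholders:

```python
from itertools import combinations_with_replacement

features = ["X1", "X2", "X3"]
# Pairs WITH replacement, so squares (X1*X1, ...) are included,
# matching the d(d+1)/2 = 6 count above.
quadratic = [f"{a}*{b}" for a, b in combinations_with_replacement(features, 2)]
print(quadratic)       # ['X1*X1', 'X1*X2', 'X1*X3', 'X2*X2', 'X2*X3', 'X3*X3']
print(len(quadratic))  # 6 new features; 6 + 3 originals = 9 total
```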
Now, d = 9
= n × (d + 1) entries (9 features + Y per row)
= 1 billion × 10 entries
= 10 billion entries × 8 bytes per entry
= 80 billion bytes
= 80 GB written out in Step 2
15 total training epochs = 15 full reads × 80 GB per read
= 1200 GB (1.2 TB) read in Step 3
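The same arithmetic check for the Step 2 output size and the Step 3 read cost:

```python
n = 1_000_000_000
d_new = 9                          # 3 original + 6 quadratic features
bytes_per_value = 8

new_file_gb = n * (d_new + 1) * bytes_per_value / 1000**3
print(new_file_gb)                 # 80.0 GB written out in Step 2

epochs_total = 5 * 3
print(new_file_gb * epochs_total)  # 1200.0 GB = 1.2 TB read in Step 3
```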
What is the degree of parallelism of this entire 3-step workflow if you were to run it in a task-parallel manner?
Step 2 doesn’t depend on Step 1 (it reads only the original file, not Step 1's models), so Steps 1 and 2 can run in parallel.
Step 3 depends on Step 2's output file.
T1 → finish
T2 → T3 → finish
At most two tasks run at once, so the degree of parallelism is 2.
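A minimal sketch of running this task graph, with `step1`, `step2`, and `step3` as hypothetical stand-ins for the three steps; it only illustrates that T1 can overlap with the T2 → T3 chain:

```python
from concurrent.futures import ThreadPoolExecutor

def step1():          # T1: train on the original 32 GB file
    return "models on original features"

def step2():          # T2: read original file, append quadratic features, write 80 GB file
    return "expanded file"

def step3(expanded):  # T3: retrain on Step 2's output
    return f"models on {expanded}"

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(step1)               # T1 starts immediately...
    f2 = pool.submit(step2)               # ...and T2 runs alongside it
    f3 = pool.submit(step3, f2.result())  # T3 waits for T2 to finish
    print(f1.result(), "|", f3.result())
# At most 2 tasks ever run at once => degree of parallelism = 2.
```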