https://canvas.ucsd.edu/courses/54740/external_tools/82

https://haojian.github.io/DSC102SP24/lectures/

Memory Hierarchy (1) **✅**

Float Representation (1) **✅ include**

Operating Systems (1)

Cloud Computing (2) **✅ include**

Parallelism Basics, Data Access (4) include

Data Parallelism (4) **✅ include**

Dataflow Systems (1)

Model Building (2)

You are given a large training dataset with schema D (Y, X1, X2, …, Xd). The target Y and all features are numeric and stored as float64.

The number of examples is n. The dataset is laid out in row store format on disk (no compression) on your single machine. No batching of filescans or caching of data is performed.

Step 1) You build linear models using BGD. You try 3 different learning rates and train all models for 5 epochs each. Step 2) You now read the data and append quadratic (order 2) feature interactions as explained in the lecture. You write out the output file to disk. All features are still stored as float64. Step 3) You rebuild the linear models on the output file of step 2 using the same model selection process as the first step. Given: (n = 1 billion, d = 3)

  1. What is the size in GB of the file D on disk? (Hint: Do not forget to count Y.)

n x (d + 1) matrix

= 4 billion entries* 8 bytes per entry

= 32 billion bytes / 1000kb/byte x 1000mb/kb x 1000mb/gb

32 GB file

  1. What is the total I/O cost in GB of step 1 for reading data files?

5 epochs x 3 lr = 15 total epochs, reading the data in once

32 GB x 15 total iterations (have to update data every time) = 480GB

  1. What is the size in GB of the file written out in step 2? (Hint: Do not forget to count Y.)

quadratic interations for each feature, d = 4

= 3(3+1) / 2 = 12 / 2 = 6 new features + 3 original features = 9 features

3 features = 32 GB

Now, d = 9

= n x d + 1 entries

= 1 billion x 10 entries

= 10 billion * 8 bytes

= 80 billion bytes

= 80 GB

  1. What is the total I/O cost in GB of step 3 for reading data files

15 total training iterations = 15 different reads x 80 GB per read

= 1200

  1. What is the degree of parallelism of this entire 3-step workflow if you were to run it in a task-parallel manner?

    Step 2 doesn’t depend on step 1 so we can run them in parallel

    Step 3 depends on step 2

    T1 → finish

    T2 → T3 → finish