Why bother with large-scale data? Why does sampling not suffice?

We need a diverse dataset: as the number of training examples grows, the model's accuracy and other evaluation metrics improve (consistent with the bias-variance tradeoff, since more data reduces variance). A small sample may miss rare but important cases.

Parallel Data Processing

Main Idea: A workload takes too long on a single processor

Solution: Split the workload among processors / machines in a divide-and-conquer approach
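A minimal sketch of the divide-and-conquer idea, using Python's `multiprocessing` module (the function names `partial_sum` and `parallel_sum` are illustrative, not from the notes): the data is split into chunks, each worker process sums its own chunk, and the partial results are combined.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # each worker sums only its own slice of the data
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # divide: split the data into roughly equal chunks, one per worker
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # conquer: map chunks onto worker processes, then combine the results
    with Pool(n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    # same answer as the sequential sum(range(1000)), but computed in parallel
    print(parallel_sum(list(range(1000))))
```

The same pattern generalizes from a single machine's processors to a cluster of machines; only the "map chunks to workers" step changes.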

Threads

Recall: threads share the address space of their parent process, so several threads can cooperate on one workload without copying data between them
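A small sketch of thread-level parallelism in Python (the `worker` function and the two-slot `results` list are illustrative assumptions): each thread computes a partial sum over its own slice, writing to a distinct slot so no lock is needed.

```python
import threading

results = [0, 0]

def worker(idx, chunk):
    # each thread writes to its own slot in the shared list,
    # so no lock is required for this particular access pattern
    results[idx] = sum(chunk)

data = list(range(100))
threads = [
    threading.Thread(target=worker, args=(0, data[:50])),
    threading.Thread(target=worker, args=(1, data[50:])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for both threads to finish

total = sum(results)  # combine the per-thread partial sums
```

Note that when threads write to *shared* state (rather than disjoint slots as here), access must be synchronized, e.g. with a `threading.Lock`.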

Dataflow

A directed graph representation of a program, where vertices represent operators and edges represent the flow of data between them
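As a toy illustration of this idea (the graph contents and the `run` helper are assumptions, not from the notes), a dataflow program can be encoded as a mapping from each vertex to its operator and its input edges, then executed by evaluating each vertex after its inputs:

```python
# Hypothetical mini dataflow graph: each vertex maps to
# (operator, list of input vertices). Edges carry data
# from producer vertices to consumer vertices.
graph = {
    "load":   (lambda: list(range(10)), []),              # source vertex
    "square": (lambda xs: [x * x for x in xs], ["load"]),
    "total":  (lambda xs: sum(xs), ["square"]),           # sink vertex
}

def run(graph, vertex, cache=None):
    # evaluate a vertex by first (recursively) evaluating its inputs;
    # the cache ensures each vertex runs at most once
    cache = {} if cache is None else cache
    if vertex not in cache:
        fn, inputs = graph[vertex]
        cache[vertex] = fn(*(run(graph, v, cache) for v in inputs))
    return cache[vertex]

result = run(graph, "total")  # → 285, the sum of squares 0²..9²
```

Systems like Spark and TensorFlow use essentially this structure: the graph describes the computation, and a scheduler decides where and in what order the operators run, which is what enables parallel execution.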
