Why bother with large-scale data? Why does sampling not suffice?

We need a diverse dataset: as the number of training examples grows, the model's accuracy and other evaluation metrics improve (consistent with the bias-variance tradeoff, since more data reduces variance). A small sample may miss rare but important cases.

Parallel Data Processing

Main Idea: A workload takes too long on a single processor

Solution: Split the workload among processors / machines in a divide-and-conquer approach
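A minimal sketch of the divide-and-conquer idea, using Python's `multiprocessing` module (the function names `partial_sum` and `parallel_sum` are illustrative, not from the notes): the data is split into chunks, each worker process sums its own chunk, and the partial results are combined.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # each worker sums only its own slice of the data
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # divide: split the data into roughly equal chunks, one per worker
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # conquer: map chunks onto worker processes, then combine the results
    with Pool(n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    # same answer as the sequential sum(range(1000)), but computed in parallel
    print(parallel_sum(list(range(1000))))
```

The same pattern generalizes from a single machine's processors to a cluster of machines; only the "map chunks to workers" step changes.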

Threads

Recall: threads share the address space of their parent process, so several threads can cooperate on one workload without copying data between them
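A small sketch of thread-level parallelism in Python (the `worker` function and the two-slot `results` list are illustrative assumptions): each thread computes a partial sum over its own slice, writing to a distinct slot so no lock is needed.

```python
import threading

results = [0, 0]

def worker(idx, chunk):
    # each thread writes to its own slot in the shared list,
    # so no lock is required for this particular access pattern
    results[idx] = sum(chunk)

data = list(range(100))
threads = [
    threading.Thread(target=worker, args=(0, data[:50])),
    threading.Thread(target=worker, args=(1, data[50:])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for both threads to finish

total = sum(results)  # combine the per-thread partial sums
```

Note that when threads write to *shared* state (rather than disjoint slots as here), access must be synchronized, e.g. with a `threading.Lock`.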

Dataflow

A directed graph representation of a program, where vertices represent operators and edges represent the flow of data between them
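As a toy illustration of this idea (the graph contents and the `run` helper are assumptions, not from the notes), a dataflow program can be encoded as a mapping from each vertex to its operator and its input edges, then executed by evaluating each vertex after its inputs:

```python
# Hypothetical mini dataflow graph: each vertex maps to
# (operator, list of input vertices). Edges carry data
# from producer vertices to consumer vertices.
graph = {
    "load":   (lambda: list(range(10)), []),              # source vertex
    "square": (lambda xs: [x * x for x in xs], ["load"]),
    "total":  (lambda xs: sum(xs), ["square"]),           # sink vertex
}

def run(graph, vertex, cache=None):
    # evaluate a vertex by first (recursively) evaluating its inputs;
    # the cache ensures each vertex runs at most once
    cache = {} if cache is None else cache
    if vertex not in cache:
        fn, inputs = graph[vertex]
        cache[vertex] = fn(*(run(graph, v, cache) for v in inputs))
    return cache[vertex]

result = run(graph, "total")  # → 285, the sum of squares 0²..9²
```

Systems like Spark and TensorFlow use essentially this structure: the graph describes the computation, and a scheduler decides where and in what order the operators run, which is what enables parallel execution.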
