https://haojian.github.io/DSC102SP24/static_files/presentations/8ParallismDataAccess.pdf

Central Issue: Large data file does not fit entirely in DRAM

Basic Idea: Divide-and-conquer again

4 regimes of scalability

  1. Single Node Disk: paged access from file on local disk
  2. Remote read: paged access from disks over a network
  3. Distributed memory: data fits in a cluster’s total DRAM
  4. Distributed disk: use the entire memory hierarchy of the cluster

Paged Access

Caching: retaining pages from disk in DRAM

Eviction: removing a page frame’s content from DRAM

Spilling: Writing out pages from DRAM to disk

Cache Replacement Policy: the algorithm that chooses which page frames to evict
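
Sketch (not from the slides): a toy buffer pool tying together caching, eviction, spilling, and a replacement policy. Assumptions: the policy is LRU (one common choice; the notes do not name one), a page is spilled only if it was modified while cached, and `BufferPool`, `access`, `disk_reads`, and `spills` are made-up names for illustration.

```python
from collections import OrderedDict


class BufferPool:
    """Toy model of DRAM page frames with LRU replacement (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of page frames available in DRAM
        self.frames = OrderedDict()   # page_id -> dirty flag, least recently used first
        self.disk_reads = 0           # pages read from disk (cache misses)
        self.spills = 0               # dirty pages written back to disk on eviction

    def access(self, page_id, write=False):
        """Touch a page; return True on a cache hit, False on a miss."""
        if page_id in self.frames:
            # Caching: the page is already in DRAM, so no disk I/O is needed.
            self.frames.move_to_end(page_id)
            if write:
                self.frames[page_id] = True
            return True
        if len(self.frames) >= self.capacity:
            # Eviction: LRU picks the least recently used frame as the victim.
            _victim, dirty = self.frames.popitem(last=False)
            if dirty:
                # Spilling: a modified page must be written out before reuse.
                self.spills += 1
        self.disk_reads += 1
        self.frames[page_id] = write
        return False


pool = BufferPool(capacity=2)
pool.access(1)
pool.access(2, write=True)
pool.access(1)   # hit: page 1 becomes most recently used
pool.access(3)   # miss: evicts page 2, which is dirty, so it is spilled
print(pool.disk_reads, pool.spills)
```

Swapping in a different policy (FIFO, MRU, ...) only changes which frame `popitem` removes; the hit/miss accounting stays the same.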

I/O Costs: Disk & Network

Disk cost: count the number of page I/Os, then convert to bytes using the page size

Communication cost: count the number of pages (or bytes) sent to or received from the network
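
Sketch (not from the slides): the page-to-byte conversion as arithmetic. The helper names `page_ios` and `io_bytes` and the sizes used (10,000-byte file, 4 KiB pages) are assumptions for illustration. The same conversion applies to communication cost when counting pages sent over the network.

```python
def page_ios(file_bytes, page_bytes):
    """Number of page I/Os to scan a file once (partial last page counts as one)."""
    return -(-file_bytes // page_bytes)  # integer ceiling division


def io_bytes(num_pages, page_bytes):
    """Map a page count to bytes transferred, given the page size."""
    return num_pages * page_bytes


pages = page_ios(10_000, 4096)   # 3 pages: two full, one partial
print(pages, io_bytes(pages, 4096))
```

Note that the byte cost (12,288) exceeds the file size (10,000) because I/O happens in whole pages.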