https://haojian.github.io/DSC102SP24/static_files/presentations/5OSBasics.pdf
Different RDBMSs and Spark-based tools serialize data in different binary formats
- One file per relation
- RDBMS vendor-specific formats vs open Apache formats, e.g. Parquet
Relation vs Matrix vs DataFrame (DF)
- ordering: matrix and DF have row/col numbers; a relation is orderless
- schema flexibility: matrix cells are numbers; relation tuples conform to a predefined schema; in a DF all rows/cols can have names, and col cells can be mixed types
- transpose: supported by matrix and DF, not supported by relations
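The three differences above can be sketched with plain Python containers. This is only an illustration under stated assumptions: the `Student` schema and all sample values are hypothetical, and a dict of lists stands in for a real DataFrame library.

```python
from collections import namedtuple

# Matrix: a positional grid of numbers; rows and columns are addressed by index.
matrix = [[1, 2, 3],
          [4, 5, 6]]
transposed = [list(col) for col in zip(*matrix)]  # transpose is well defined

# Relation: an unordered set of tuples that all follow one predefined schema.
Student = namedtuple("Student", ["sid", "name", "gpa"])  # hypothetical schema
relation = {Student(1, "Ann", 3.9), Student(2, "Bob", 3.4)}
same_relation = {Student(2, "Bob", 3.4), Student(1, "Ann", 3.9)}  # order is irrelevant

# DataFrame-like table: named columns, positional rows, mixed types in one column.
df = {"label": ["a", "b", "c"], "value": [1, "two", 3.0]}
second_row = {col: cells[1] for col, cells in df.items()}  # row lookup by position
```

The set has no notion of a "first" tuple, which is why positional operations such as transpose have no meaning for a relation.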
Structured Data
- Matrices, DataFrames, Parquet files, relational data model
Unstructured
Data Lake File Format
Data lake: loose coupling between the file format used for storage and the data/query processing stack (vs the tight coupling in an RDBMS)
Tradeoffs of Parquet vs text-based files
- less storage: Parquet stores data in compressed form → less I/O
- column pruning: lets the app read only the needed columns into DRAM → less I/O
- schema on file: rich metadata is stored with the data
- complex types: can be stored in a column
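The first three points can be illustrated with a toy columnar layout built from the Python standard library. This is a sketch, not the actual Parquet format: the sample rows, the `schema` dict, and the `read_column` helper are all hypothetical, and `zlib` stands in for whatever compression a real file would use.

```python
import json
import zlib

# Hypothetical sample data: 1000 rows, three columns.
rows = [(i, ["SD", "LA", "SF"][i % 3], 20.0 + (i % 7) * 0.5) for i in range(1000)]
schema = {"id": "int", "city": "str", "temp": "float"}  # kept alongside the data

# Row-oriented text file: a reader has to scan all of it, even for one column.
csv_bytes = "\n".join(f"{i},{city},{temp}" for i, city, temp in rows).encode()

# Toy columnar file: one independently compressed blob per column.
columns = dict(zip(schema, zip(*rows)))
blobs = {name: zlib.compress(json.dumps(list(vals)).encode())
         for name, vals in columns.items()}

def read_column(name):
    """Decompress and decode a single column, leaving the other blobs untouched."""
    return json.loads(zlib.decompress(blobs[name]))

temps = read_column("temp")  # column pruning: only the "temp" blob is read
bytes_pruned = len(blobs["temp"])
bytes_all_columns = sum(len(b) for b in blobs.values())

print(len(csv_bytes), bytes_all_columns, bytes_pruned)
```

Reading one column touches only that column's compressed bytes, which is where the I/O saving comes from. With a real Parquet file the same idea is typically expressed by passing a column list to the reader, for example the `columns` argument of `pyarrow.parquet.read_table`.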