https://haojian.github.io/DSC102SP24/static_files/presentations/5OSBasics.pdf

Different RDBMSs and Spark-based tools serialize data in different binary formats
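
A minimal sketch of what "different binary formats" means in practice, using only the Python standard library. The records and the fixed-width layout are made up for illustration; no real RDBMS or Spark format is reproduced here. The same logical data ends up as different bytes depending on the serializer, and a reader has to know which format it is looking at.

```python
import json
import pickle
import struct

# Hypothetical records: (id, score) pairs.
records = [(1, 3.5), (2, 4.25), (3, 5.0)]

# Format A (assumed layout): little-endian, 4-byte int + 8-byte float per record.
RECORD_FMT = "<id"
fixed = b"".join(struct.pack(RECORD_FMT, rid, score) for rid, score in records)

# Format B: Python's pickle, a self-describing binary format.
pickled = pickle.dumps(records)

# Format C: JSON, a text format, for contrast.
as_text = json.dumps(records).encode("utf-8")


def read_fixed(buf):
    """Decode Format A; only works if the reader knows the exact layout."""
    size = struct.calcsize(RECORD_FMT)
    return [struct.unpack_from(RECORD_FMT, buf, off) for off in range(0, len(buf), size)]


# Same logical content, three different byte strings.
print(len(fixed), len(pickled), len(as_text))
print(read_fixed(fixed) == pickle.loads(pickled))
```

The fixed-width bytes carry no schema, so they are compact but meaningless without the layout; pickle and JSON carry structure along with the values.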

Relation vs Matrix vs DF

Structured Data

Unstructured Data

Data Lake File Format

Lake: loose coupling between the data file format used for storage and the data/query processing stack (vs. an RDBMS's tight coupling)
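
A rough sketch of the coupling difference, assuming a made-up `sales.csv` file and using SQLite as a stand-in for an RDBMS. In the lake style, a tool queries the open-format file in place. In the RDBMS style, the data first has to be loaded into the engine's own storage before SQL can touch it.

```python
import csv
import os
import sqlite3
import tempfile

# Hypothetical data file stored in an open format.
path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "amount"])
    writer.writerows([["east", 10], ["west", 20], ["east", 5]])


def total_by_region_direct(p):
    """Lake style: process the file where it sits, in its own format."""
    totals = {}
    with open(p, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])
    return totals


def total_by_region_rdbms(p):
    """RDBMS style: load into the engine's internal storage, then query."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    with open(p, newline="") as f:
        rows = [(r["region"], int(r["amount"])) for r in csv.DictReader(f)]
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
    con.close()
    return result


print(total_by_region_direct(path))
print(total_by_region_rdbms(path))
```

Both paths give the same totals; the difference is that the second one needs a load step into a format only that engine reads.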

Tradeoffs of Parquet vs. text-based files
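
One of these tradeoffs can be sketched without any Parquet library. The columnar layout below is a toy (each column packed contiguously with `struct`), not the actual Parquet format, and the table is made up. It shows that a single-column aggregate over row-oriented text must parse every field of every row, while a columnar layout lets the reader decode only the column it needs.

```python
import csv
import io
import struct

# Hypothetical table with columns: id (int), price (float), qty (int).
rows = [(i, i * 1.5, i % 7) for i in range(1000)]
n = len(rows)

# Row-oriented text: CSV. Human-readable, but values are stored as strings.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_bytes = buf.getvalue().encode("utf-8")

# Toy column-oriented binary: one contiguous byte string per column.
ids, prices, qtys = zip(*rows)
col_id = struct.pack(f"<{n}i", *ids)
col_price = struct.pack(f"<{n}d", *prices)
col_qty = struct.pack(f"<{n}i", *qtys)


def sum_price_csv(data):
    """Must read and split every row, even though only one field is used."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    return sum(float(r[1]) for r in reader)


def sum_price_columnar(price_col, count):
    """Touches only the price column's bytes; id and qty are never read."""
    return sum(struct.unpack(f"<{count}d", price_col))


print(sum_price_csv(csv_bytes), sum_price_columnar(col_price, n))
print(len(csv_bytes), len(col_id) + len(col_price) + len(col_qty))
```

The text file stays readable and editable with any tool, while the binary columns need a reader that knows the layout; real Parquet adds embedded schema, compression, and encoding on top of the columnar idea.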