Question

Your dashboard runs `SELECT avg(amount) FROM sales WHERE region='EU'` over 500M rows, but the query touches only two columns, `amount` and `region`. Which format makes this cheapest?

Accepted Answer

Parquet: column-oriented, so the engine scans only the columns the query needs and skips the rest

Answer

CSV

Answer

JSON

Answer

Avro
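
To make the reasoning behind the accepted answer concrete, here is a toy sketch in plain Python. It is not how any real engine or the Parquet format works internally; it only counts how many values a scan has to read when the same data is stored record by record versus column by column. The three sample rows and the extra columns (`order_id`, `customer`, `note`) are invented for illustration.

```python
# Toy model: count the values a scan must read in each layout.
# Sample data and the columns other than `amount` and `region` are made up.
ROWS = [
    {"order_id": 1, "region": "EU", "amount": 10.0, "customer": "a", "note": "x"},
    {"order_id": 2, "region": "US", "amount": 20.0, "customer": "b", "note": "y"},
    {"order_id": 3, "region": "EU", "amount": 30.0, "customer": "c", "note": "z"},
]


def avg_row_layout(rows):
    """Row layout: each whole record is read to reach the two fields we need."""
    values_read = 0
    total, count = 0.0, 0
    for row in rows:
        values_read += len(row)  # every field of the record is parsed
        if row["region"] == "EU":
            total += row["amount"]
            count += 1
    return total / count, values_read


def avg_column_layout(columns):
    """Column layout: only the `region` and `amount` columns are read."""
    region, amount = columns["region"], columns["amount"]
    values_read = len(region) + len(amount)
    picked = [a for r, a in zip(region, amount) if r == "EU"]
    return sum(picked) / len(picked), values_read


# Pivot the same data into one list per column.
COLUMNS = {name: [row[name] for row in ROWS] for name in ROWS[0]}

row_avg, row_reads = avg_row_layout(ROWS)
col_avg, col_reads = avg_column_layout(COLUMNS)
print(row_avg, row_reads)  # 20.0 15
print(col_avg, col_reads)  # 20.0 6
```

Both layouts give the same average, but the row scan reads all 15 values while the column scan reads 6. The gap grows with the number of columns the query ignores, which is why a column-oriented format suits this dashboard query.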

| Format | Layout | Schema | Best for |
|---|---|---|---|
| CSV | row, text | none | One-off exchange, smallish files, human inspection |
| JSON | row, text | self-describing | API payloads, nested records, log lines |
| Avro | row, binary | embedded | Streaming + schema evolution (Kafka) |
| Parquet | column, binary | embedded | Analytics: scan a few columns over many rows |
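
The "Schema" column for the two text formats can be illustrated with a short standard-library sketch. It round-trips one record through CSV and through JSON and shows what survives. The sample record and the `|` separator used to flatten the list are assumptions made for the example; Avro and Parquet are left out because they need third-party libraries.

```python
import csv
import io
import json

# Hypothetical record with a number and a nested list.
record = {"region": "EU", "amount": 12.5, "tags": ["web", "promo"]}

# CSV: flat text with no types; the nested list must be flattened by hand.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["region", "amount", "tags"])
writer.writeheader()
writer.writerow({**record, "tags": "|".join(record["tags"])})
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON: field names travel with each record; numbers and lists survive.
json_row = json.loads(json.dumps(record))

print(type(csv_row["amount"]).__name__, csv_row["tags"])    # str web|promo
print(type(json_row["amount"]).__name__, json_row["tags"])  # float ['web', 'promo']
```

After the CSV round trip the amount is the string `"12.5"` and the reader has to know how to convert it back. The JSON round trip keeps the number and the list, at the cost of repeating the field names in every record.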

CSV, JSON, Avro, Parquet — When to Use Which