CSV, JSON, Avro, Parquet — When to Use Which

Row vs column, text vs binary, schema-embedded vs not.

Pick by query pattern

Four formats, four trade-offs

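| Format  | Layout         | Schema          | Best for                                               |
|---------|----------------|-----------------|--------------------------------------------------------|
| CSV     | row, text      | none            | One-off exchange, smallish files, human inspection     |
| JSON    | row, text      | self-describing | API payloads, nested records, log lines                |
| Avro    | row, binary    | embedded        | Streaming + schema evolution (Kafka)                   |
| Parquet | column, binary | embedded        | Analytics: scan a few columns over many rows           |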
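
The Schema column is easiest to see by round-tripping one record through the two text formats. The following is a minimal sketch using only Python's standard library; the record and its field names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical record, used only for illustration.
record = {"id": 7, "price": 9.5, "active": True}

# CSV: no schema travels with the data, so every value comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON: self-describing, so numbers and booleans keep their types.
json_row = json.loads(json.dumps(record))

print(csv_row)   # all values are str
print(json_row)  # int, float and bool survive the round trip
```

With CSV the reader has to know, out of band, that `id` is an integer. JSON carries that much in the payload itself, but nothing in it declares which fields are required or allowed; that is the gap the embedded schemas of Avro and Parquet fill.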

Rule of thumb: OLTP / streaming → row. OLAP / analytics → column. Anything crossing trust boundaries → schema-embedded.
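
To see why analytics favors columns, it can help to hold the same records in both layouts. This is a pure-Python sketch of the idea only, not how Avro or Parquet actually encode bytes; the `rows` data and its field names are hypothetical.

```python
# Hypothetical records, used only for illustration.
rows = [
    {"user": "a", "country": "DE", "amount": 10},
    {"user": "b", "country": "FR", "amount": 25},
    {"user": "c", "country": "DE", "amount": 7},
]

# Row layout: each record is stored together, so appending or
# reading one whole record is a single operation.
row_store = list(rows)

# Column layout: each field is stored together across all records.
column_store = {key: [r[key] for r in rows] for key in rows[0]}

# Analytics query "total amount": the row layout visits every record,
# while the column layout reads a single list and skips the other fields.
total_from_rows = sum(r["amount"] for r in row_store)
total_from_columns = sum(column_store["amount"])

print(total_from_rows, total_from_columns)
```

Both totals agree; the difference is how much data each layout has to touch to produce them. On disk, a row format has to read past `user` and `country` to reach each `amount`, while a column format can read only the `amount` values.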
