Document & Aggregate Modeling

Embed vs reference — the trade-off between read speed and update fan-out.

Overview

Why it matters

Document stores reward aggregate-oriented design: model the unit that's loaded together. Embed when the data is read together often and updated rarely; reference when the child data changes independently of its parent.
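As a rough illustration, the two shapes might look like this in Python, with plain dictionaries standing in for stored documents. The order, the customer, and every field name are hypothetical, invented for this sketch.

```python
# Hypothetical data: plain dicts stand in for documents in a store.

# Embedded: line items live inside the order, so one lookup returns the aggregate.
order_embedded = {
    "_id": "order-1",
    "customer_id": "cust-7",
    "items": [
        {"sku": "A-100", "qty": 2, "unit_price": 5.00},
        {"sku": "B-200", "qty": 1, "unit_price": 12.50},
    ],
}

# Referenced: the customer changes independently of any order, so the order
# stores only the customer's id and the customer lives in its own collection.
customers = {
    "cust-7": {"_id": "cust-7", "name": "Ada", "email": "ada@example.com"},
}


def order_total(order):
    """Compute the total from the embedded items alone; no second lookup."""
    return sum(item["qty"] * item["unit_price"] for item in order["items"])


def load_order_view(order, customers_by_id):
    """Resolve the customer reference at read time (the extra lookup)."""
    customer = customers_by_id[order["customer_id"]]
    return {
        "order_id": order["_id"],
        "customer_name": customer["name"],
        "total": order_total(order),
    }
```

The items are embedded because an order is rarely useful without them; the customer is referenced because a changed email should not require touching every past order.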

Going deeper

A practical decision matrix for embed vs reference:

Question                                                             Embed   Reference
Always loaded together?                                              yes     no
Bounded size (fits in one document, under MongoDB's 16 MB limit)?    yes     no
Mutated independently of the parent?                                 no      yes
Referenced from many parents (many-to-many)?                         no      yes
Needed in isolation by other queries?                                no      yes
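The five questions can be turned into a first-pass check, sketched below. The `Relationship` class, the `suggest` function, and above all the precedence rule (any signal for referencing wins) are assumptions of this sketch; the matrix itself does not say how to weigh conflicting answers.

```python
from dataclasses import dataclass


@dataclass
class Relationship:
    """Answers to the five questions for one parent-child relationship."""
    always_loaded_together: bool
    bounded_size: bool
    mutated_independently: bool
    many_parents: bool
    needed_in_isolation: bool


def suggest(rel: Relationship) -> str:
    """Return 'embed' or 'reference'.

    Assumed rule: any signal for referencing wins; otherwise embed only
    when both embed signals hold, and fall back to referencing.
    """
    if rel.mutated_independently or rel.many_parents or rel.needed_in_isolation:
        return "reference"
    if rel.always_loaded_together and rel.bounded_size:
        return "embed"
    return "reference"


# Hypothetical examples: a shipping address copied onto an order,
# and tags that are shared by many posts.
shipping_address = Relationship(True, True, False, False, False)
shared_tags = Relationship(True, True, False, True, True)

print(suggest(shipping_address))  # embed
print(suggest(shared_tags))       # reference
```

Treat the result as a starting point for discussion, not a verdict; real schemas often mix both approaches within one aggregate.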

Common mistake: embedding unbounded arrays (comments on a post, events on a user). Each append grows the parent document, which makes every later read and rewrite of it more expensive and eventually busts the size cap. Rule of thumb: if the set of children has no natural upper bound, reference it.
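One way to split an unbounded array is to give each child its own small document that points back at the parent. The sketch below uses in-memory lists and dictionaries as stand-ins for two collections; the names `posts`, `comments`, `add_comment`, and `comments_page`, and the `comment_count` field, are hypothetical.

```python
# Hypothetical in-memory stand-ins for two collections.
posts = {"post-1": {"_id": "post-1", "title": "Hello", "comment_count": 0}}
comments = []  # each comment is its own document and points at its post


def add_comment(post_id, author, text):
    """Store the comment separately; the post only gets a counter bump."""
    comment = {
        "_id": f"c-{len(comments) + 1}",
        "post_id": post_id,
        "author": author,
        "text": text,
    }
    comments.append(comment)
    posts[post_id]["comment_count"] += 1
    return comment


def comments_page(post_id, skip=0, limit=2):
    """Read a bounded slice of comments instead of loading all of them."""
    matching = [c for c in comments if c["post_id"] == post_id]
    return matching[skip:skip + limit]
```

However many comments arrive, the post document keeps the same three fields, and readers fetch comments one page at a time.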

Analogy

Document modelling is like choosing between packing a lunchbox and running a salad bar.

A lunchbox is embedded: sandwich, fruit, juice in one container, handed over in one transaction. Perfect when each kid gets one box and nobody else needs the juice. But if the school changes the juice brand, you re-pack every box.

A salad bar is referenced: lettuce in one tub, tomatoes in another, dressing in a third. Anyone can update the dressing recipe in one place, and 400 plates get the new version. Slower per plate (you assemble at read time) — but mutations are O(1) instead of O(plates).

Document DBs win when your reads naturally match a 'lunchbox': one document = one natural aggregate the application loads as a unit.
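The update fan-out in the analogy can be made concrete by counting writes. This is a toy sketch: the function names and data shapes are hypothetical, and the figure of 400 lunchboxes is assumed here to mirror the 400 plates.

```python
def rename_juice_embedded(lunchboxes, new_brand):
    """Embedded copy: each box carries its own juice, so each box is rewritten."""
    writes = 0
    for box in lunchboxes:
        box["juice"]["brand"] = new_brand
        writes += 1
    return writes


def rename_dressing_referenced(tubs, tub_id, new_recipe):
    """Referenced: plates point at the tub, so a single write covers all of them."""
    tubs[tub_id]["recipe"] = new_recipe
    return 1


def assemble_plate(plate, tubs):
    """The read-time cost of referencing: resolve each tub id when serving."""
    return [tubs[tub_id]["recipe"] for tub_id in plate["tub_ids"]]


lunchboxes = [{"kid": i, "juice": {"brand": "Old"}} for i in range(400)]
tubs = {"dressing": {"recipe": "v1"}}
plates = [{"plate": i, "tub_ids": ["dressing"]} for i in range(400)]

embedded_writes = rename_juice_embedded(lunchboxes, "New")  # one write per box
referenced_writes = rename_dressing_referenced(tubs, "dressing", "v2")  # one write
```

The embedded change costs one write per lunchbox, while the referenced change costs one write in total and a lookup in `assemble_plate` on every read.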

Make it stick

Use the prompts below to anchor document & aggregate modeling to something you actually own.

  • Pick a document collection you've designed. Which embedded array could grow unbounded, and how would you split it?
  • Which 'reference' in your schema is in practice *always* joined at read time? Could a small denormalised cache live alongside the reference for speed?
  • Where would aggregate-oriented thinking simplify your *relational* schema's API layer (one query per screen)?
