When should you reference (vs embed) a sub-document in MongoDB?

When the sub-document mutates independently or is referenced from many parents

Never — embedding is always faster

Only when the parent has more than 16 MB of data

Always — references scale better

Answer: When the sub-document mutates independently or is referenced from many parents

You're modelling a blogging platform in MongoDB. Posts have author info and a potentially-unbounded list of comments. What's the best embed/reference choice?

Embed a tiny denormalised slice of author (name, avatar); reference full author profile; store comments in a separate collection referencing the post

Use one giant document per blog

Embed both author and comments in the post

Answer: Embed a tiny denormalised slice of author (name, avatar); reference full author profile; store comments in a separate collection referencing the post

Document & Aggregate Modeling — Semantic Web Academy

Overview

Document & Aggregate Modeling

Embed vs reference — the trade-off between read speed and update fan-out.

Why it matters

Document stores reward aggregate-oriented design: model the unit that is loaded together as one document. Embed when the data is read with its parent often and updated rarely; reference when the child changes independently of the parent.
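To make the two shapes concrete, here is a minimal sketch that uses plain Python dicts in place of MongoDB collections. The collection names, field names, and sample values are all hypothetical; it only illustrates where the data lives in each design.

```python
# Plain dicts stand in for two collections (hypothetical data).
authors = {
    "a1": {
        "_id": "a1",
        "name": "Ada",
        "avatar": "ada.png",
        "bio": "Writes about databases.",
    },
}

posts = {
    "p1": {
        "_id": "p1",
        "title": "Embed vs reference",
        # Embedded slice: small, read on every page view, rarely changes.
        "author": {"id": "a1", "name": "Ada", "avatar": "ada.png"},
    },
}


def render_post_header(post_id):
    """One lookup is enough: the author's name travels with the post."""
    post = posts[post_id]
    return f'{post["title"]} by {post["author"]["name"]}'


def load_author_profile(post_id):
    """The full profile is referenced, so it costs a second lookup."""
    author_id = posts[post_id]["author"]["id"]
    return authors[author_id]


header = render_post_header("p1")
profile = load_author_profile("p1")
```

The common read (the post header) touches one document, while the rarer read (the full profile) pays for the extra lookup.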

Going deeper

A practical decision matrix for embed vs reference:

Question	Embed	Reference
Always loaded together?	✓	✗
Bounded size (fits in one doc, < 16 MB Mongo limit)?	✓	—
Mutated independently of the parent?	✗	✓
Referenced from many parents (many-to-many)?	✗	✓
Needed in isolation by other queries?	✗	✓
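The matrix can be read as a small decision function. The sketch below is one possible encoding, not an official rule: the `ChildTraits` and `recommend` names are hypothetical, and the fallback to "reference" when no row clearly applies is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ChildTraits:
    """Answers to the five matrix questions for one child entity."""
    always_loaded_together: bool
    bounded_size: bool
    mutated_independently: bool
    shared_by_many_parents: bool
    needed_in_isolation: bool


def recommend(traits: ChildTraits) -> str:
    """Return "embed" or "reference" for a child entity."""
    # Rows where the Reference column is ticked: any one of them is enough.
    if (traits.mutated_independently
            or traits.shared_by_many_parents
            or traits.needed_in_isolation):
        return "reference"
    # Embedding needs both remaining conditions: loaded together and bounded.
    if traits.always_loaded_together and traits.bounded_size:
        return "embed"
    # Assumption: when neither side is clearly favoured, default to reference.
    return "reference"


# Hypothetical examples.
shipping_address = ChildTraits(True, True, False, False, False)
post_comments = ChildTraits(False, False, True, False, True)

address_choice = recommend(shipping_address)
comments_choice = recommend(post_comments)
```

Here `address_choice` comes out as "embed" and `comments_choice` as "reference", matching the rows of the matrix.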

Common mistake: embedding unbounded arrays (comments on a post, events on a user). Each append grows the parent document, makes every rewrite of it more expensive, and eventually hits the 16 MB size cap. Rule of thumb: if the child list has no natural upper bound, store it in its own collection and reference the parent.
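A rough sketch of why the unbounded array hurts, again with plain Python structures standing in for collections. The JSON byte length is only a proxy for the stored document size (BSON sizes differ), and every name and value here is hypothetical.

```python
import json


def doc_size(doc):
    """Rough size proxy: bytes of the JSON encoding (BSON sizes differ)."""
    return len(json.dumps(doc).encode("utf-8"))


# Design A: comments embedded in the post document.
embedded_post = {"_id": "p1", "title": "Hello", "comments": []}

# Design B: comments kept in their own collection, pointing back at the post.
referenced_post = {"_id": "p1", "title": "Hello"}
comments = []  # stands in for a separate comments collection

size_before = doc_size(referenced_post)

for i in range(1000):
    comment = {"author": f"user{i}", "text": "Nice post!"}
    embedded_post["comments"].append(comment)       # parent document grows
    comments.append({"post_id": "p1", **comment})   # parent is untouched

embedded_size = doc_size(embedded_post)
referenced_size = doc_size(referenced_post)

# Reading the comments back is a filter on the reference field.
post_comments = [c for c in comments if c["post_id"] == "p1"]
```

After 1,000 comments the embedded post has grown with every append, while the referenced post is the same size it started at and the comments are fetched by `post_id`.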

Analogy

Document modelling is packing a lunchbox vs running a salad bar.

A lunchbox is embedded: sandwich, fruit, juice in one container, handed over in one transaction. Perfect when each kid gets one box and nobody else needs the juice. But if the school changes the juice brand, you re-pack every box.

A salad bar is referenced: lettuce in one tub, tomatoes in another, dressing in a third. Anyone can update the dressing recipe in one place, and 400 plates get the new version. Slower per plate (you assemble at read time) — but mutations are O(1) instead of O(plates).
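The O(1) versus O(plates) claim can be counted directly. This is a toy sketch of the analogy with hypothetical names, not a benchmark: it only tallies how many records each design has to write when the dressing changes.

```python
def update_embedded(plates, new_dressing):
    """Lunchbox style: each plate holds its own copy, so every plate is written."""
    writes = 0
    for plate in plates:
        plate["dressing"] = new_dressing
        writes += 1
    return writes


def update_referenced(dressings, dressing_id, new_recipe):
    """Salad-bar style: one shared record is written, whatever the plate count."""
    dressings[dressing_id] = new_recipe
    return 1


embedded_plates = [{"id": i, "dressing": "ranch v1"} for i in range(400)]

dressings = {"d1": "ranch v1"}
referenced_plates = [{"id": i, "dressing_id": "d1"} for i in range(400)]

embedded_writes = update_embedded(embedded_plates, "ranch v2")
referenced_writes = update_referenced(dressings, "d1", "ranch v2")

# Each referenced plate resolves the dressing at read time.
served = dressings[referenced_plates[0]["dressing_id"]]
```

The embedded design performs 400 writes and the referenced design performs one, at the price of the extra lookup when a plate is served.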

Document DBs win when your reads naturally match a 'lunchbox': one document = one natural aggregate the application loads as a unit.

Make it stick

Use the prompts below to anchor document & aggregate modeling to something you actually own.

›Pick a document collection you've designed. Which embedded array could grow unbounded, and how would you split it?
›Which 'reference' in your schema is in practice *always* joined at read time? Could a small denormalised cache live alongside the reference for speed?
›Where would aggregate-oriented thinking simplify your *relational* schema's API layer (one query per screen)?
