One chunk of text produces the relation label 'acquired', another semantically-identical chunk produces 'bought out', and a third produces 'took over'. Why does schema-guided extraction (a fixed set of relation types) usually beat free-form extraction here?

It forces all three sentences to map to one consistent relation type, so a graph query for 'acquired' actually finds all of them instead of silently missing two-thirds.

Schema-guided extraction produces fewer output tokens, saving cost.

It runs faster because it avoids using an LLM.

It removes the need for entity resolution entirely.

Why does schema-guided extraction outperform free-form for GraphRAG retrieval?

It yields a consistent, allow-listed set of entity and relation types the retriever can rely on, instead of synonym sprawl.

It produces fewer tokens and is cheaper.

Extracting entities and relations

How an LLM turns unstructured prose into the structured (subject, predicate, object) triples a graph can traverse.

Overview

Why it matters

Building a knowledge graph from raw text starts with a step that has no equivalent in naïve RAG: an LLM reads each chunk and pulls out entities (people, organisations, products) and the relations between them. For example, the sentence 'Acme acquired Widgetco in 2023' becomes the triple (Acme, acquired, Widgetco) with the property date=2023. This step is the bridge that turns an amorphous pile of documents into a queryable graph database.
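
One way to picture the target data structure is the sketch below. It holds the example triple with its date property and indexes it by (subject, predicate), which is the kind of lookup a graph traversal relies on. The `Triple` class and `build_index` helper are illustrative names chosen here, not part of any particular graph library.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Triple:
    """One (subject, predicate, object) fact plus optional edge properties."""
    subject: str
    predicate: str
    obj: str
    properties: dict = field(default_factory=dict)


def build_index(triples):
    """Group triples by (subject, predicate) so a single hop is a dict lookup."""
    index = defaultdict(list)
    for t in triples:
        index[(t.subject, t.predicate)].append(t)
    return index


# 'Acme acquired Widgetco in 2023' as a triple with a date property.
triple = Triple("Acme", "acquired", "Widgetco", {"date": 2023})
index = build_index([triple])

acquired_by_acme = [t.obj for t in index[("Acme", "acquired")]]
print(acquired_by_acme)
```

A real graph database would store the same information as nodes and a labelled edge; the dictionary index only stands in for that here.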

There are two ways to run this extraction, and the difference matters enormously downstream:

Free-form extraction lets the LLM invent whatever entity types and relation labels it wants. One chunk produces acquired, the next produces bought, another produces purchased: three different edge labels that all mean the same thing but are invisible to each other in a graph query.
Schema-guided extraction gives the LLM a fixed, closed set of entity types (Company, Person, Product) and relation types (acquired, worksFor, foundedBy) up front, and constrains it to emit only triples that conform to that schema.

Schema-guided extraction is almost always the right default in production. A graph with many synonymous relation labels for 'acquired' is not reliably traversable: a query for 'who acquired whom' silently misses every acquisition that was extracted under a different label, which is two-thirds of them in the three-label example above. Constraining the schema is what makes the graph reliable for the retriever, even if it occasionally forces a slightly awkward fit.
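
The silent miss is easy to reproduce. The sketch below runs the same exact-match query over free-form edges and over the edges a schema-constrained extractor would be expected to emit for the same sentences. The company names are made up, and the schema-guided output is simulated by relabelling rather than by calling a model.

```python
def query(edges, predicate):
    """Return (subject, object) pairs whose edge label matches exactly."""
    return [(s, o) for s, p, o in edges if p == predicate]


# Free-form output: three labels for the same real-world relation.
# Company names are invented for illustration.
free_form_edges = [
    ("Acme", "acquired", "Widgetco"),
    ("Globex", "bought", "Initech"),
    ("Umbrella", "purchased", "Hooli"),
]

# Simulated schema-guided output: the only allowed label is 'acquired'.
schema_edges = [(s, "acquired", o) for s, _, o in free_form_edges]

free_hits = query(free_form_edges, "acquired")
schema_hits = query(schema_edges, "acquired")
print(len(free_hits), len(schema_hits))  # prints: 1 3
```

The free-form graph raises no error; it simply returns one acquisition out of three, which is why the problem tends to go unnoticed.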

How it actually works

Entity extraction is the bridge from prose to graph: it turns "Alice at Acme approved refunds for Atlas under Policy P-12" into typed nodes (Alice:Person, Acme:Organisation, Atlas:Product, P-12:Policy) and typed edges (Alice-worksFor-Acme, Atlas-governedBy-P-12). Everything downstream inherits the quality of this step.
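
As a rough sketch of the receiving side, assume the extractor returns JSON with a list of typed nodes and a list of edges. That output shape and its field names (`nodes`, `edges`, `source`, `relation`, `target`) are assumptions made for this example, not a standard format. The parser below turns it into typed nodes and edges and refuses edges that point at nodes the model never declared.

```python
import json

# Assumed extractor output for the sentence above; field names are hypothetical.
raw_output = """
{
  "nodes": [
    {"id": "Alice", "type": "Person"},
    {"id": "Acme", "type": "Organisation"},
    {"id": "Atlas", "type": "Product"},
    {"id": "P-12", "type": "Policy"}
  ],
  "edges": [
    {"source": "Alice", "relation": "worksFor", "target": "Acme"},
    {"source": "Atlas", "relation": "governedBy", "target": "P-12"}
  ]
}
"""


def parse_extraction(raw):
    """Return (node_types, edges); raise if an edge uses an undeclared node."""
    data = json.loads(raw)
    node_types = {n["id"]: n["type"] for n in data["nodes"]}
    edges = []
    for e in data["edges"]:
        if e["source"] not in node_types or e["target"] not in node_types:
            raise ValueError(f"edge references an undeclared node: {e}")
        edges.append((e["source"], e["relation"], e["target"]))
    return node_types, edges


node_types, edges = parse_extraction(raw_output)
print(node_types)
print(edges)
```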

Schema-guided beats free-form. If you let the model invent types, you get CEO, Chief Executive and chief_exec as three separate labels for one concept, and a graph nobody can query. A schema constrains output to an allow-list:

{
  "entities": ["Person", "Organisation", "Product", "Policy"],
  "relations": ["worksFor", "owns", "governedBy", "mentions"]
}
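
Prompting with a schema does not guarantee the model obeys it, so it may be worth checking each candidate against the allow-list before it reaches the graph. The following is a minimal sketch that assumes candidates arrive as dictionaries with `subject_type`, `relation` and `object_type` keys; those key names are hypothetical.

```python
# The allow-list from the schema above, as Python sets for fast membership tests.
SCHEMA = {
    "entities": {"Person", "Organisation", "Product", "Policy"},
    "relations": {"worksFor", "owns", "governedBy", "mentions"},
}


def conforms(candidate, schema=SCHEMA):
    """True only if both entity types and the relation are on the allow-list."""
    return (
        candidate["subject_type"] in schema["entities"]
        and candidate["object_type"] in schema["entities"]
        and candidate["relation"] in schema["relations"]
    )


# Hypothetical extractor output: the second candidate uses an invented relation.
candidates = [
    {"subject": "Alice", "subject_type": "Person", "relation": "worksFor",
     "object": "Acme", "object_type": "Organisation"},
    {"subject": "Alice", "subject_type": "Person", "relation": "chief_exec",
     "object": "Acme", "object_type": "Organisation"},
]

accepted = [c for c in candidates if conforms(c)]
rejected = [c for c in candidates if not conforms(c)]
print(len(accepted), len(rejected))  # prints: 1 1
```

Rejected candidates can be logged rather than discarded outright; a relation that keeps getting rejected may be a sign the schema is missing a type.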

Emit provenance and confidence per edge. A production extractor returns, for each triple, the source doc_id + character span and a confidence score. That lets you (a) filter low-confidence edges before they pollute retrieval, (b) route borderline edges to human review, and (c) cite the original text at answer time.
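
The three uses (a) to (c) can be sketched as follows. The `Edge` fields, the two thresholds and the confidence values are all assumptions for illustration; sensible cut-offs depend on how the extractor's scores are calibrated on your own data.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    doc_id: str        # provenance: which document the edge came from
    span_start: int    # provenance: character offsets into that document
    span_end: int
    confidence: float


ACCEPT_AT = 0.8  # assumed threshold: at or above this, keep the edge
REVIEW_AT = 0.5  # assumed threshold: between the two, send to human review


def route(edge):
    """Decide what happens to an edge based on its confidence score."""
    if edge.confidence >= ACCEPT_AT:
        return "accept"
    if edge.confidence >= REVIEW_AT:
        return "review"
    return "drop"


def cite(edge, documents):
    """Return the exact source text the edge was extracted from."""
    return documents[edge.doc_id][edge.span_start:edge.span_end]


text = "Alice at Acme approved refunds for Atlas under Policy P-12"
documents = {"doc-1": text}

edges = [
    Edge("Atlas", "governedBy", "P-12", "doc-1", text.index("Atlas"), len(text), 0.91),
    Edge("Alice", "worksFor", "Acme", "doc-1", 0, text.index(" approved"), 0.62),
    Edge("Acme", "owns", "Atlas", "doc-1", text.index("Acme"), text.index(" under"), 0.30),
]

for e in edges:
    print(route(e), "|", cite(e, documents))
```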

Precision vs recall asymmetry. A wrong edge (false relation) is usually costlier than a missing edge: the wrong edge creates confident wrong multi-hop answers, while the missing edge merely fails to help. So tune extraction toward precision first, then claw back recall once canonicalisation (mapping variant names onto one canonical entity or relation) is solid.
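
One way to tune toward precision is to score extracted edges against a small hand-labelled gold set at several confidence thresholds. The sketch below uses invented edges and scores purely to show the trade-off; `precision_recall` is a helper written for this example.

```python
def precision_recall(scored_edges, gold, threshold):
    """Precision and recall of the edges kept at a given confidence threshold."""
    predicted = {edge for edge, confidence in scored_edges if confidence >= threshold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hand-labelled correct edges (invented for illustration).
gold = {
    ("Alice", "worksFor", "Acme"),
    ("Atlas", "governedBy", "P-12"),
    ("Acme", "owns", "Atlas"),
}

# Extractor output with confidence scores; the last edge is wrong.
scored_edges = [
    (("Alice", "worksFor", "Acme"), 0.95),
    (("Atlas", "governedBy", "P-12"), 0.90),
    (("Acme", "owns", "Atlas"), 0.60),
    (("Alice", "owns", "Acme"), 0.65),
]

loose = precision_recall(scored_edges, gold, threshold=0.5)
strict = precision_recall(scored_edges, gold, threshold=0.7)
print(loose, strict)
```

In this toy data the stricter threshold removes the wrong edge and raises precision from 0.75 to 1.0, at the cost of losing one correct edge.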

Analogy

Extraction is taking minutes of a meeting. Free-form notes ('the boss okayed it') are useless three weeks later; structured minutes with named people, decisions and references are queryable forever. The schema is your minutes template.

Pitfalls & how to avoid them

Free-form relation names. Symptom: synonym explosion. Fix: allow-listed relation vocabulary.
No confidence/provenance. Symptom: can't filter or cite. Fix: emit span + score per edge.
Optimising recall first. Symptom: confident wrong paths. Fix: precision-first, then recall.
Extracting from un-cleaned text. Symptom: headers/footers become entities. Fix: pre-clean before extraction.
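
For the last pitfall, a simple pre-cleaning heuristic is to drop lines that repeat across most pages, along with bare page-number lines, before the text reaches the extractor. This is a sketch under the assumption that the corpus is already split into pages; the 0.6 fraction and the page-number pattern are illustrative defaults, and the sample pages are invented.

```python
import math
import re
from collections import Counter

PAGE_NUMBER = re.compile(r"Page \d+", re.IGNORECASE)


def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove lines that recur on most pages, plus bare page-number lines."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    # A line must appear on at least two pages to count as boilerplate.
    cutoff = max(2, math.ceil(min_fraction * len(pages)))
    boilerplate = {ln for ln, n in counts.items() if n >= cutoff}

    cleaned = []
    for page in pages:
        kept = [
            ln for ln in page.splitlines()
            if ln.strip() not in boilerplate
            and not PAGE_NUMBER.fullmatch(ln.strip())
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Confidential\nAlice works for Acme.\nPage 1",
    "ACME Confidential\nAtlas is governed by Policy P-12.\nPage 2",
    "ACME Confidential\nAcme sells Atlas.\nPage 3",
]

cleaned = strip_repeated_lines(pages)
print(cleaned)
```

Without this step, a running header such as 'ACME Confidential' could plausibly be extracted as an organisation on every page.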

Apply it to your system

Run extraction on one real paragraph from your corpus.

How many distinct relation labels appeared that mean the same thing?
For each edge, can you point to the exact source span it came from?
Which low-confidence edges would you drop or send to human review?
