Where is the *most* important place to redact PII in a RAG pipeline?

At every stage (ingest, retrieval, and egress), as defence in depth.

Inside the embedding model itself.

Nowhere; the LLM handles it safely.

PII redaction in retrieval — Semantic Web Academy

Overview

Redact at ingest, filter at retrieval, redact again at egress, and audit every step.

Why it matters

Once PII reaches the LLM, you can no longer control where it goes. Layering redaction at three points limits the damage when any one of them fails.

How it actually works

A model that has seen PII can echo it, log it, or leak it. Redaction is therefore layered at three points as defence in depth, rather than placed at a single chokepoint.

ingest:     replace emails/phones with stable salted tokens before embedding
retrieval:  filter chunks by viewer entitlement and purpose
generation: redact any PII not present in authorised evidence
audit:      log [original_doc_id, redaction_version, viewer_role]
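
The ingest row above can be sketched roughly as follows. This is a minimal illustration, not a production redactor: the salt value, the regular expressions, the token format, and the function names `stable_token` and `redact_for_ingest` are all assumptions made for the example, and real PII detection needs far more than two patterns.

```python
import hashlib
import hmac
import re

# Hypothetical salt for illustration; a real one would live in a secrets manager.
SALT = b"example-salt-do-not-use"

# Deliberately simple patterns (assumption): one pass, email tried before phone.
PII_RE = re.compile(
    r"(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<PHONE>\+?\d[\d\s().-]{7,}\d)"
)


def stable_token(kind: str, value: str) -> str:
    """Map a PII value to a keyed, deterministic token.

    The same value always yields the same token, so records stay linkable,
    but the raw value is not stored in the text that gets embedded.
    """
    if kind == "PHONE":
        normalised = re.sub(r"\D", "", value)  # keep digits only
    else:
        normalised = value.strip().lower()
    digest = hmac.new(SALT, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<{kind}_{digest[:12]}>"


def redact_for_ingest(text: str) -> str:
    """Replace emails and phone numbers with stable tokens before embedding."""
    return PII_RE.sub(lambda m: stable_token(m.lastgroup, m.group()), text)
```

Using a keyed HMAC rather than a bare hash is one possible design choice: without the salt, common values such as phone numbers could be recovered by hashing guesses.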

Redact at ingest so raw PII never enters the vector store (using stable tokens keeps records linkable without exposing the value). Filter at retrieval so a viewer only ever gets chunks they're entitled to. Redact at egress as the last backstop. Any single layer can fail; all three failing at once is what you're guarding against.
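
The retrieval filter and the egress backstop described above might look something like this. It is a sketch under assumptions: the `Chunk` fields, the role and purpose strings, the `[REDACTED]` marker, and the email-only pattern are hypothetical choices for illustration.

```python
import re
from dataclasses import dataclass

# Email-only pattern for brevity (assumption); a real backstop covers more PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_roles: frozenset
    allowed_purposes: frozenset


def filter_chunks(chunks, viewer_role: str, purpose: str):
    """Retrieval layer: keep only chunks this viewer may see for this purpose."""
    return [
        c for c in chunks
        if viewer_role in c.allowed_roles and purpose in c.allowed_purposes
    ]


def redact_egress(answer: str, evidence) -> str:
    """Egress layer: mask any email that is not present in authorised evidence."""
    allowed = {m.lower() for c in evidence for m in EMAIL_RE.findall(c.text)}

    def _mask(match):
        value = match.group()
        return value if value.lower() in allowed else "[REDACTED]"

    return EMAIL_RE.sub(_mask, answer)
```

In this sketch the egress check only trusts the evidence that survived the retrieval filter, so a value the model produced from anywhere else is masked even if it is a valid address.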

Version your redaction policy. When you change what counts as PII or how you tokenise, you must be able to prove which redactor touched which evidence — redaction_version in the audit log is what makes that defensible to a regulator.
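
One way to make that provable is to stamp every audit entry with the policy version in force at the time. The sketch below uses the three fields from the audit row (original_doc_id, redaction_version, viewer_role); the version string, the extra `logged_at` timestamp, and the JSON-lines format are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical version label; bump it whenever the PII rules or tokeniser change.
REDACTION_VERSION = "policy-v3"


@dataclass(frozen=True)
class AuditRecord:
    original_doc_id: str
    redaction_version: str
    viewer_role: str
    logged_at: str  # assumption: UTC timestamp, not part of the original schema


def audit_line(doc_id: str, viewer_role: str, version: str = REDACTION_VERSION) -> str:
    """Serialise one audit entry as a JSON line ready to append to a log."""
    record = AuditRecord(
        original_doc_id=doc_id,
        redaction_version=version,
        viewer_role=viewer_role,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(record), sort_keys=True)
```

With entries like these, answering "which redactor touched this document?" becomes a filter on `original_doc_id` and `redaction_version`.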

Plan right-to-erasure into the vector store. Deleting a source record means its embeddings and tokens must be purged too, not just the original document — an often-forgotten path that quietly keeps 'deleted' PII searchable.
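
The erasure path can be sketched with a toy in-memory index. `InMemoryIndex` is a hypothetical stand-in, not the API of any real vector store; the point it illustrates is that one erase call has to reach the document, its embeddings, and any tokens that no other document still references.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Toy stand-in for a document store, vector store, and token vault."""
    documents: dict = field(default_factory=dict)   # doc_id -> source text
    vectors: dict = field(default_factory=dict)     # chunk_id -> (doc_id, embedding)
    token_refs: dict = field(default_factory=dict)  # token -> set of doc_ids using it

    def erase(self, doc_id: str) -> dict:
        """Purge a document, its embeddings, and tokens only it referenced."""
        had_doc = self.documents.pop(doc_id, None) is not None

        # Remove every embedding derived from the erased document.
        chunk_ids = [cid for cid, (owner, _) in self.vectors.items() if owner == doc_id]
        for cid in chunk_ids:
            del self.vectors[cid]

        # Drop the document from each token; delete tokens nobody else uses.
        purged_tokens = 0
        for token in list(self.token_refs):
            refs = self.token_refs[token]
            refs.discard(doc_id)
            if not refs:
                del self.token_refs[token]
                purged_tokens += 1

        return {"documents": int(had_doc), "vectors": len(chunk_ids), "tokens": purged_tokens}
```

Returning counts from `erase` gives the caller something concrete to record as evidence that the purge reached all three stores.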

Analogy

PII redaction is airport security in layers: ID check at the door (ingest), boarding-pass gate (retrieval), and a final scan at the jet bridge (egress). No single checkpoint is trusted alone, and there's a logbook (audit) proving who was screened by which policy.

Pitfalls & how to avoid them

Redacting only at egress. Symptom: raw PII in the index. Fix: redact at ingest too.
No retrieval entitlement filter. Symptom: cross-user exposure. Fix: filter by viewer/purpose.
Unversioned redaction. Symptom: can't prove compliance. Fix: log redaction_version.
Erasure that misses embeddings. Symptom: 'deleted' PII still searchable. Fix: purge vectors + tokens.

Apply it to your system

Trace one piece of PII through your system.

›Does raw PII ever reach your vector store today?
›Can you prove which redaction policy version touched a given document?
›When a user invokes right-to-erasure, do the embeddings get purged too?
