Which is a quasi-identifier?

Postcode + date of birth + gender combination

You're about to publish an 'anonymised' dataset that drops names and emails but keeps postcode, exact DOB and gender. What's the realistic re-identification risk?

High — those three quasi-identifiers uniquely identify the majority of any moderate-sized population

Only matters if you publish names too

Zero — GDPR doesn't apply once names are removed

PII Classification — Semantic Web Academy

Overview

PII Classification

Direct / quasi / sensitive — different classes need different controls.

Why it matters

A single attribute such as a postcode rarely re-identifies anyone; postcode + DOB + gender often does, even after names are removed. Quasi-identifiers are why anonymisation is harder than it looks.
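A back-of-the-envelope sketch shows why the combination is the problem rather than any single attribute. It assumes, unrealistically, that people are spread uniformly and independently across every (date of birth, gender) cell within one postcode area. The population and age-range figures are illustrative assumptions, not measurements.

```python
def expected_unique_fraction(population: int, cells: int) -> float:
    """Expected share of people who are alone in their cell,
    assuming each person lands in a cell uniformly at random."""
    # A given person is unique if none of the other (population - 1)
    # people fall into the same cell.
    return (1 - 1 / cells) ** (population - 1)


# Illustrative assumptions: 80 birth years x 365 days x 2 recorded genders.
dob_gender_cells = 80 * 365 * 2      # 58,400 possible cells per postcode area
postcode_population = 20_000         # hypothetical postcode area

share = expected_unique_fraction(postcode_population, dob_gender_cells)
print(f"{share:.0%} of residents expected to be unique")
```

Under these assumptions roughly seven in ten residents are the only person in their cell. Real populations are not uniform, so treat the number as an intuition pump, not an estimate for any actual dataset.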

Going deeper

A practical PII classification policy attaches a class label and a control set to every column:

Class	Examples	Default controls
Direct PII	email, phone, SSN, passport	Encrypt at rest; tokenise for analytics; access logged + reviewed
Quasi-identifier	postcode, DOB, gender, IP, device-id	Generalise (DOB → year, postcode → first 3 chars) before joining wide
Sensitive	health, biometrics, religion, sexuality, finances	Smallest-possible audience, purpose-bound, explicit lawful basis
Non-PII metadata	product SKU, server hostname	Default access

The classification is metadata that lives next to the schema (in the catalog, in dbt tags, in column-level lineage) so DSAR pipelines, masking rules and access reviews can be driven from one source of truth instead of N team conventions.
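As a minimal sketch of what "driven from one source of truth" could look like, the snippet below keeps a column-to-class mapping in one place and derives masking from it. The column names, the `COLUMN_CLASSES` catalog, the `mask_row` helper and the demo key are all hypothetical, and the keyed hash is only a stand-in for whatever tokenisation service you actually use.

```python
import hashlib
import hmac
from datetime import date
from enum import Enum


class PiiClass(Enum):
    DIRECT = "direct"
    QUASI = "quasi"
    SENSITIVE = "sensitive"
    NON_PII = "non_pii"


# Hypothetical catalog entry: one class label per column.
COLUMN_CLASSES = {
    "email": PiiClass.DIRECT,
    "postcode": PiiClass.QUASI,
    "dob": PiiClass.QUASI,
    "diagnosis": PiiClass.SENSITIVE,
    "product_sku": PiiClass.NON_PII,
}

# Generalisation rules from the table: DOB -> year, postcode -> first 3 chars.
GENERALISERS = {
    "dob": lambda d: d.year,
    "postcode": lambda p: p.strip()[:3],
}


def tokenise(value: str, key: bytes = b"demo-key-do-not-use") -> str:
    """Stand-in tokeniser: keyed hash so equal inputs map to equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_row(row: dict) -> dict:
    """Apply the default control for each column's class."""
    masked = {}
    for column, value in row.items():
        pii_class = COLUMN_CLASSES.get(column)
        if pii_class is None:
            # Fail closed: an unclassified column never reaches analytics.
            raise KeyError(f"unclassified column: {column}")
        if pii_class is PiiClass.DIRECT:
            masked[column] = tokenise(value)
        elif pii_class is PiiClass.QUASI:
            masked[column] = GENERALISERS[column](value)
        elif pii_class is PiiClass.SENSITIVE:
            continue  # dropped from this general-purpose view
        else:
            masked[column] = value
    return masked


row = {
    "email": "a@example.com",
    "postcode": "SW1A 1AA",
    "dob": date(1990, 5, 17),
    "diagnosis": "asthma",
    "product_sku": "SKU-1",
}
masked = mask_row(row)
print(masked)
```

Raising on unclassified columns is a design choice in this sketch: a new column has to be labelled before it can flow anywhere, which keeps the catalog honest.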

Analogy

PII classification is threat modelling for personal data.

In security threat modelling, you don't treat every asset the same — a public marketing page and the production credentials store have wildly different blast radii, so they get wildly different controls. PII works the same way:

Direct identifiers (email, SSN, passport number) are the production credentials: leak one and the subject is uniquely identified in one shot.
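Quasi-identifiers (postcode + DOB + gender, browser fingerprint) are the combinatorial leak: each attribute looks innocuous, but combined they single people out. One widely cited estimate (Sweeney, 2000) is that 5-digit ZIP code, gender and full date of birth uniquely identify about 87% of the US population.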
Sensitive attributes (health, religion, sexuality, political views, salary) are the harm-class data: harm is large even when the identity is uncertain.

Mix up the classes and you'll either over-restrict harmless columns or quietly ship a dataset that re-identifies your customers in three joins.
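One way to test whether a set of columns behaves as a quasi-identifier is to measure how many rows are unique on that combination before publishing. The sketch below is a simple uniqueness and k-anonymity-style check on made-up rows; the helper names `unique_fraction` and `smallest_group` are hypothetical, and it assumes the data fits in memory as a list of dicts.

```python
from collections import Counter


def _group_counts(rows, quasi_columns):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[c] for c in quasi_columns) for row in rows)


def unique_fraction(rows, quasi_columns):
    """Share of rows whose quasi-identifier combination appears exactly once."""
    if not rows:
        return 0.0
    counts = _group_counts(rows, quasi_columns)
    singletons = sum(1 for size in counts.values() if size == 1)
    return singletons / len(rows)


def smallest_group(rows, quasi_columns):
    """Size of the smallest group, i.e. the k the data currently achieves."""
    counts = _group_counts(rows, quasi_columns)
    return min(counts.values()) if counts else 0


# Made-up records for illustration only.
rows = [
    {"postcode": "SW1A 1AA", "dob": "1990-05-17", "gender": "F"},
    {"postcode": "SW1A 1AA", "dob": "1990-05-17", "gender": "F"},
    {"postcode": "SW1A 2AB", "dob": "1984-11-02", "gender": "M"},
    {"postcode": "SW1B 3CD", "dob": "1990-01-30", "gender": "F"},
]

combo = ["postcode", "dob", "gender"]
print(unique_fraction(rows, combo))   # half of these rows are unique on the combination
print(smallest_group(rows, combo))    # smallest group has a single member
```

Running the same check on generalised values (year of birth, postcode prefix) shows how much generalisation buys you before you commit to a release.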

Make it stick

Use the prompts below to anchor PII classification to something you actually own.

›Pick a recently shipped dataset. Can you state the PII class of every column without opening a doc? If not, that gap is your real exposure.
›Which 'safe' columns in your warehouse become re-identifying when joined with another internal table? Map at least two such combinations.
›What's the smallest catalog change you could ship next sprint to make PII class a first-class column attribute?
