PII Classification

Direct / quasi / sensitive — different classes need different controls.

Why it matters

A name alone isn't usually a re-identification risk; a name + DOB + postcode is. Quasi-identifiers are why anonymisation is harder than it looks.

Going deeper

A practical PII classification policy attaches a class label and a control set to every column:

| Class | Examples | Default controls |
|---|---|---|
| Direct PII | email, phone, SSN, passport | Encrypt at rest; tokenise for analytics; access logged + reviewed |
| Quasi-identifier | postcode, DOB, gender, IP, device-id | Generalise (DOB → year, postcode → first 3 chars) before joining wide |
| Sensitive | health, biometrics, religion, sexuality, finances | Smallest-possible audience, purpose-bound, explicit lawful basis |
| Non-PII metadata | product SKU, server hostname | Default access |
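
The generalisation step for quasi-identifiers can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function names, the `birth_year` and `postcode_prefix` output keys, and the sample row are all assumptions made up for the example.

```python
from datetime import date


def generalise_dob(dob: date) -> int:
    """Reduce a full date of birth to the birth year only."""
    return dob.year


def generalise_postcode(postcode: str, keep: int = 3) -> str:
    """Keep only the first `keep` characters of a postcode, ignoring spaces and case."""
    compact = postcode.replace(" ", "").upper()
    return compact[:keep]


def generalise_row(row: dict) -> dict:
    """Return a copy of `row` with the precise quasi-identifiers replaced by coarse ones."""
    # Drop the precise values so they cannot leak into a wide join.
    out = {k: v for k, v in row.items() if k not in ("dob", "postcode")}
    out["birth_year"] = generalise_dob(row["dob"])
    out["postcode_prefix"] = generalise_postcode(row["postcode"])
    return out


# Hypothetical sample row, for illustration only.
row = {"customer_id": 17, "dob": date(1984, 3, 9), "postcode": "sw1a 1aa"}
safe_row = generalise_row(row)
print(safe_row)
# {'customer_id': 17, 'birth_year': 1984, 'postcode_prefix': 'SW1'}
```

Applying the coarsening before the join, rather than after, means the precise values never reach the wide table in the first place.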

The classification is metadata that lives next to the schema (in the catalog, in dbt tags, in column-level lineage), so that DSAR (data subject access request) pipelines, masking rules, and access reviews can all be driven from one source of truth instead of N team conventions.
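
As a rough sketch of what "one source of truth" could look like, the snippet below keeps the class label per column in a single structure and derives both a masking plan and a DSAR column list from it. Everything here is hypothetical: the `CATALOG` layout, the table and column names, and the one-word control labels (which only paraphrase the table above) are assumptions for illustration, not the API of any real catalog or of dbt.

```python
from enum import Enum


class PiiClass(Enum):
    DIRECT = "direct"
    QUASI = "quasi"
    SENSITIVE = "sensitive"
    NON_PII = "non_pii"


# One default control per class (hypothetical short labels for the table's controls).
DEFAULT_CONTROL = {
    PiiClass.DIRECT: "tokenise",
    PiiClass.QUASI: "generalise",
    PiiClass.SENSITIVE: "restrict",
    PiiClass.NON_PII: "allow",
}

# Hypothetical catalog: table name -> column name -> PII class.
CATALOG = {
    "customers": {
        "email": PiiClass.DIRECT,
        "postcode": PiiClass.QUASI,
        "diagnosis_code": PiiClass.SENSITIVE,
        "plan_sku": PiiClass.NON_PII,
    }
}


def masking_plan(table: str) -> dict[str, str]:
    """Derive the control to apply to each column from its class label."""
    return {col: DEFAULT_CONTROL[cls] for col, cls in CATALOG[table].items()}


def dsar_columns(table: str) -> list[str]:
    """Columns a DSAR export has to cover: everything not labelled NON_PII."""
    return [col for col, cls in CATALOG[table].items() if cls is not PiiClass.NON_PII]


print(masking_plan("customers"))
print(dsar_columns("customers"))
```

Because both functions read the same labels, reclassifying a column changes the masking rule and the DSAR scope together, which is the point of keeping the class next to the schema.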

Analogy

PII classification is threat modelling for personal data.

In security threat modelling, you don't treat every asset the same — a public marketing page and the production credentials store have wildly different blast radii, so they get wildly different controls. PII works the same way:

  • Direct identifiers (email, SSN, passport number) are the production credentials: leak one and the subject is uniquely identified in one shot.
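  • Quasi-identifiers (postcode + DOB + gender, browser fingerprint) are the combinatorial leak: each looks innocuous on its own, but combined they can single out an individual. An often-cited estimate (Sweeney, using 1990 US census data) is that 5-digit ZIP code, gender and full date of birth together uniquely identify about 87% of the US population.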
  • Sensitive attributes (health, religion, sexuality, political views, salary) are the harm-class data: harm is large even when the identity is uncertain.

Mix up the classes and you'll either over-restrict harmless columns or quietly ship a dataset that re-identifies your customers in three joins.
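
The combinatorial effect of quasi-identifiers is easy to see on a toy dataset. The sketch below uses six made-up records and two illustrative helpers (`unique_fraction` and `k_anonymity` are names chosen for this example): no single column isolates anyone, but the three columns together single out most rows.

```python
from collections import Counter

# Made-up records, for illustration only.
records = [
    {"postcode": "SW1", "birth_year": 1984, "gender": "F"},
    {"postcode": "SW1", "birth_year": 1984, "gender": "M"},
    {"postcode": "SW1", "birth_year": 1990, "gender": "F"},
    {"postcode": "E2", "birth_year": 1984, "gender": "F"},
    {"postcode": "E2", "birth_year": 1990, "gender": "M"},
    {"postcode": "E2", "birth_year": 1990, "gender": "M"},
]


def unique_fraction(rows, columns):
    """Fraction of rows whose combination of `columns` appears exactly once."""
    keys = [tuple(r[c] for c in columns) for r in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


def k_anonymity(rows, columns):
    """Size of the smallest group of rows sharing the same values for `columns`."""
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    return min(counts.values())


quasi = ["postcode", "birth_year", "gender"]

for col in quasi:
    # Each column alone: nobody is unique, smallest group has 3 rows.
    print(col, unique_fraction(records, [col]), k_anonymity(records, [col]))

# All three together: 4 of 6 rows are unique, so k drops to 1.
print(quasi, unique_fraction(records, quasi), k_anonymity(records, quasi))
```

The same check can be run against a real table before release: any column combination with k equal to 1 contains at least one row that is unique on those columns.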

Make it stick

Use the prompts below to anchor PII classification to something you actually own.

  • Pick a recently-shipped dataset. Can you state the PII class of every column without opening a doc? If not, that gap is your real exposure.
  • Which 'safe' columns in your warehouse become re-identifying when joined with another internal table? Map at least two such combinations.
  • What's the smallest catalog change you could ship next sprint to make PII class a first-class column attribute?
