PII & Data Minimisation

RDF joins are too easy — minimise on purpose.

Theory

RDF was designed to make joins across sources trivial. That is a feature for knowledge integration and a serious risk for PII. Three defences:

  1. Minimise: don't store what you don't need.
  2. Pseudonymise: use opaque IRIs (<urn:user:9c3f…>) instead of identifiers that embed an email address or other contact detail.
  3. Compartmentalise: keep PII in a separate named graph with its own ACL and retention policy.
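One way to mint the opaque IRIs from defence 2 is sketched below. It is a minimal illustration under stated assumptions, not a prescribed method: the helper name opaque_user_iri, the SECRET_KEY value, and the 16-character truncation are all hypothetical. It uses a keyed HMAC rather than a plain hash, because an unkeyed hash of an email address can be reversed by hashing a list of candidate addresses.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real one would live in a secrets manager.
SECRET_KEY = b"replace-with-a-real-secret"


def opaque_user_iri(email: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable, opaque IRI from an email address using HMAC-SHA256."""
    # Normalise so that trivially different spellings map to the same IRI.
    normalised = email.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    # Keeping 16 hex characters (64 bits) is an arbitrary choice for readability.
    return f"urn:user:{digest[:16]}"


iri = opaque_user_iri("alice@example.com")
print(iri)
```

The same input and key always yield the same IRI, so internal joins keep working. Anyone without the key cannot recompute the mapping from a list of known addresses.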

Analogy

A knowledge graph is a magnet for re-identification. Two innocuous public datasets joined on a postcode + birthday can become a privacy incident.

The classic Sweeney result (87% of US residents uniquely identified by {5-digit ZIP, date of birth, sex}) is the exact failure mode RDF amplifies: every shared IRI is a free join key, and there's no DBA to gatekeep what gets merged.
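The linkage risk can be made concrete with a small sketch. Both datasets below are invented for illustration, and join_on is a hypothetical helper. Neither dataset links a name to a diagnosis on its own, but joining on postcode and date of birth does.

```python
# Invented records: a "de-identified" health extract and a public register.
health_records = [
    {"postcode": "AB1 2CD", "dob": "1984-03-07", "diagnosis": "asthma"},
    {"postcode": "EF3 4GH", "dob": "1990-11-21", "diagnosis": "diabetes"},
]
public_register = [
    {"postcode": "AB1 2CD", "dob": "1984-03-07", "name": "Alice Example"},
    {"postcode": "ZZ9 9ZZ", "dob": "1975-01-01", "name": "Bob Example"},
]


def join_on(left, right, keys):
    """Inner-join two lists of dicts on the given quasi-identifier keys."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(tuple(l_row[k] for k in keys), [])
    ]


linked = join_on(health_records, public_register, ["postcode", "dob"])
for row in linked:
    print(row["name"], "->", row["diagnosis"])
```

In a triple store the same join needs no helper at all: two graphs that share an IRI or a literal value are already linked once they are loaded together.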

Worked example — pseudonymise + compartmentalise

Worked example — pseudonymisation, before and after.

BEFORE (PII-laden, joinable across datasets, risky):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix : <http://example.org/> .   # placeholder namespace, assumed so the ':' names resolve

:alice_at_example_com a foaf:Person ;
  foaf:name "Alice" ;
  foaf:mbox <mailto:alice@example.com> ;
  :department "Engineering" .

The IRI itself leaks the email. Anyone joining this graph with a public contact list re-identifies Alice.

AFTER (opaque identifier, PII split into a separate graph):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix : <http://example.org/> .   # placeholder namespace, assumed so the ':' names resolve

# The 'public' graph keeps only the non-identifying facts:
<urn:user:9c3fb1d8> a foaf:Person ;
  :department "Engineering" .

# A SEPARATE named graph holds the PII, behind a stricter ACL.
# (GRAPH blocks are TriG syntax, not plain Turtle; uncomment when loading as TriG.)
# GRAPH :pii_graph {
#   <urn:user:9c3fb1d8> foaf:name "Alice" ;
#     foaf:mbox <mailto:alice@example.com> .
# }

Two changes did the work:

  1. The subject became an opaque <urn:user:...> IRI — no email, no name, no hint of who it points to.
  2. The PII triples moved into a different named graph with its own retention and access policy.

For the playground below, the minimal pass is the opaque <urn:user:...> subject; wrapping in GRAPH :pii_graph { ... } is the bonus compartmentalisation step.
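The compartmentalisation step can also be sketched in code. This is an illustration under assumptions, not a store-specific recipe: triples are modelled as plain tuples, the set PII_PREDICATES is a hypothetical policy, and http://example.org/department stands in for the lesson's :department property.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed policy: predicates whose triples must live in the PII graph.
PII_PREDICATES = {FOAF + "name", FOAF + "mbox"}


def compartmentalise(triples, pii_predicates=PII_PREDICATES):
    """Split (subject, predicate, object) tuples into public and PII lists."""
    public, pii = [], []
    for triple in triples:
        _, predicate, _ = triple
        (pii if predicate in pii_predicates else public).append(triple)
    return public, pii


user = "urn:user:9c3fb1d8"
triples = [
    (user, RDF_TYPE, FOAF + "Person"),
    (user, "http://example.org/department", "Engineering"),  # placeholder IRI
    (user, FOAF + "name", "Alice"),
    (user, FOAF + "mbox", "mailto:alice@example.com"),
]

public_graph, pii_graph = compartmentalise(triples)
print(len(public_graph), "public triples,", len(pii_graph), "PII triples")
```

Keeping the policy as an explicit set of predicates makes the split reviewable. Adding a new PII property then means a one-line change rather than an audit of every loader.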

Reflect

Look at the data you're modelling for your current project. Which triples are actually required? Which exist only because they were easy to add?

Useful frame: for each PII triple, write the sentence "we need this because…". If the sentence ends in "…it was already in the source", you've found a deletion candidate. Minimisation is rarely a technical problem; it's an explicit-justification problem.

  • Pick one PII triple you could remove with no business impact.
  • Where would you put it instead, if you can't delete it?