Which is the strongest pseudonymisation choice for a user IRI?

PII & Data Minimisation

RDF joins are too easy — minimise on purpose.


Theory

RDF was designed to make joins across sources trivial. That's a feature for knowledge integration and a serious risk for PII. Three defences:

Minimise: don't store what you don't need.
Pseudonymise: use opaque IRIs (<urn:user:9c3f…>) instead of mailable identifiers.
Compartmentalise: keep PII in a separate named graph with its own ACL and retention policy.
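
One way to mint an opaque IRI like the one above is a keyed hash of the direct identifier. The sketch below is an illustration under stated assumptions, not a prescribed method: the function name pseudonymous_iri, the 16-character truncation, and the demo key are all hypothetical. It uses HMAC-SHA256 rather than a plain hash because an unkeyed hash of an email can be reversed by hashing candidate addresses and comparing.

```python
import hashlib
import hmac


def pseudonymous_iri(identifier: str, secret_key: bytes, length: int = 16) -> str:
    """Derive an opaque user IRI from a direct identifier with a keyed hash."""
    # Normalise so " Alice@Example.com " and "alice@example.com" share one pseudonym.
    normalised = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
    # Truncation keeps the IRI short; a longer prefix lowers collision risk.
    return f"urn:user:{digest[:length]}"


# Hypothetical key for illustration only; a real key belongs in a secrets store.
DEMO_KEY = b"demo-key-not-for-production"

iri = pseudonymous_iri("alice@example.com", DEMO_KEY)
```

With this approach the key plays the role of the urn-to-person mapping: whoever holds it can recompute the pseudonym for any known email, so it needs the same protection as a lookup table. A random identifier with a stored mapping is the alternative when no recomputation should be possible.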

Analogy

A knowledge graph is a magnet for re-identification. Two innocuous public datasets joined on a postcode + birthday can become a privacy incident.

The classic Sweeney result — 87% of US residents uniquely identified by {ZIP, DOB, sex} — is the exact failure mode RDF amplifies: every shared IRI is a free join key, and there's no DBA to gatekeep what gets merged.
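
The linkage described above can be sketched in a few lines. Everything here is made up for illustration: the two tiny datasets, the field names, and the link helper are assumptions, and the join runs on plain dictionaries rather than on an RDF store.

```python
from collections import defaultdict

# Made-up records; neither dataset is real.
health = [
    {"zip": "02138", "dob": "1970-03-12", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1980-01-15", "sex": "M", "diagnosis": "diabetes"},
]
voters = [
    {"zip": "02138", "dob": "1970-03-12", "sex": "F", "name": "Alice Example"},
    {"zip": "02139", "dob": "1980-01-15", "sex": "M", "name": "Bob Example"},
    {"zip": "02139", "dob": "1980-01-15", "sex": "M", "name": "Carl Example"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def link(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where the quasi-identifier join is unique."""
    index = defaultdict(list)
    for row in identified:
        index[tuple(row[k] for k in keys)].append(row["name"])
    matches = []
    for row in deidentified:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate means re-identification
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(health, voters)
```

The first health record matches exactly one voter and is re-identified; the second matches two voters and stays ambiguous, which is the intuition behind k-anonymity.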

Theory

Going deeper — pseudonymisation is not anonymisation

Swapping an email for <urn:user:9c3f...> pseudonymises the data: under GDPR it is still personal data, because a mapping back to the person exists somewhere. True anonymisation (no re-identification by anyone) is a much higher bar — and RDF makes it harder, because every shared value is a join key.

Practical defences that survive contact with auditors:

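Right-to-erasure becomes a graph drop. Keep each subject's PII in its own named graph (or behind a reversible key vault) so 'delete Alice' is a single DROP GRAPH <http://example.org/pii/alice> (graph IRI illustrative; an unescaped '/' is not valid in a prefixed name such as :pii/alice), not a hunt across a tangled web of triples.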
Quasi-identifiers re-identify even without direct IDs. {ZIP, DOB, sex} is enough for most people; generalise (year not date, region not ZIP) or aim for k-anonymity before publishing.
The mapping is the crown jewel. The urn-to-person table deserves the strictest ACL, separate storage, and its own retention clock; leak it and every pseudonym unravels at once.
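
The generalisation advice above (year not date, region not ZIP) can be checked mechanically. This is a minimal sketch under assumptions: the records are invented, a 3-digit ZIP prefix stands in for "region", and the helper names generalise and k_anonymity are hypothetical.

```python
from collections import Counter


def generalise(record):
    """Coarsen quasi-identifiers: year instead of date, ZIP prefix instead of ZIP."""
    return {
        "region": record["zip"][:3],      # 3-digit prefix as an assumed region proxy
        "birth_year": record["dob"][:4],  # expects ISO dates, YYYY-MM-DD
        "sex": record["sex"],
    }


def k_anonymity(records, keys):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[k] for k in keys) for r in records)
    return min(groups.values()) if groups else 0


people = [
    {"zip": "02138", "dob": "1970-03-12", "sex": "F"},
    {"zip": "02139", "dob": "1970-11-02", "sex": "F"},
    {"zip": "02141", "dob": "1985-06-30", "sex": "M"},
    {"zip": "02142", "dob": "1985-01-09", "sex": "M"},
]

k_before = k_anonymity(people, ("zip", "dob", "sex"))
k_after = k_anonymity(
    [generalise(p) for p in people], ("region", "birth_year", "sex")
)
```

Here every raw record is unique (k = 1), while the generalised records fall into groups of two (k = 2). Real datasets usually need a larger k and a check that the sensitive values within each group are not all identical.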

Rule of thumb: pseudonymise to reduce exposure, compartmentalise to contain it, and never call pseudonymous data 'anonymous' in a privacy review.

Worked example — pseudonymise + compartmentalise

Worked example — pseudonymisation, before and after.

BEFORE (PII-laden, joinable across datasets, risky):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix : <http://example.org/> .  # example namespace (assumed)

:alice_at_example_com a foaf:Person ;
  foaf:name "Alice" ;
  foaf:mbox <mailto:alice@example.com> ;
  :department "Engineering" .

The IRI itself leaks the email. Anyone joining this graph with a public contact list re-identifies Alice.

AFTER (opaque identifier, PII split into a separate graph):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix : <http://example.org/> .  # example namespace (assumed)

# The 'public' graph keeps only the non-identifying facts:
<urn:user:9c3fb1d8> a foaf:Person ;
  :department "Engineering" .

# A SEPARATE named graph holds the PII, behind a stricter ACL
# (commented out here because GRAPH blocks are TriG, not plain Turtle):
# GRAPH :pii_graph {
#   <urn:user:9c3fb1d8> foaf:name "Alice" ;
#     foaf:mbox <mailto:alice@example.com> .
# }

Two changes did the work:

The subject became an opaque <urn:user:...> IRI — no email, no name, no hint of who it points to.
The PII triples moved into a different named graph with its own retention and access policy.

For the playground below, the minimal pass is the opaque <urn:user:...> subject; wrapping in GRAPH :pii_graph { ... } is the bonus compartmentalisation step.
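
The two changes can also be expressed as a small routing step. The sketch below models a dataset as a plain dictionary from graph IRI to a set of triples, purely for illustration; the namespace http://example.org/, the graph names, the helper split_by_sensitivity, and the choice of which predicates count as PII are all assumptions rather than part of the example above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"  # hypothetical namespace standing in for ':'

# Assumption: these two predicates are the only PII in the record.
PII_PREDICATES = {FOAF + "name", FOAF + "mbox"}


def split_by_sensitivity(triples, opaque_iri, public_graph, pii_graph):
    """Re-key every triple to the opaque IRI and route it to one of two graphs."""
    dataset = {public_graph: set(), pii_graph: set()}
    for _old_subject, predicate, obj in triples:
        graph = pii_graph if predicate in PII_PREDICATES else public_graph
        dataset[graph].add((opaque_iri, predicate, obj))
    return dataset


before = [
    (EX + "alice_at_example_com", RDF_TYPE, FOAF + "Person"),
    (EX + "alice_at_example_com", FOAF + "name", "Alice"),
    (EX + "alice_at_example_com", FOAF + "mbox", "mailto:alice@example.com"),
    (EX + "alice_at_example_com", EX + "department", "Engineering"),
]

PUBLIC = EX + "public_graph"
PII = EX + "pii_graph"
dataset = split_by_sensitivity(before, "urn:user:9c3fb1d8", PUBLIC, PII)

# Erasure then touches one graph only, mirroring a SPARQL DROP GRAPH.
erased = {name: triples for name, triples in dataset.items() if name != PII}
```

After the split, nothing in the public graph mentions Alice, and removing her PII is a single-key operation that leaves the public facts untouched.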

Reflect

Look at the data you're modelling for your current project. Which triples are actually required? Which exist only because they were easy to add?

Useful frame: for each PII triple, write the sentence 'we need this because…'. If the sentence ends in '…it was already in the source', you've found a deletion candidate. Minimisation is rarely a technical problem; it's an explicit-justification problem.

Pick one PII triple you could remove with no business impact.
Where would you put it instead, if you can't delete it?
