GDPR Pseudonymization vs Anonymization: Technical Implementation Guide (2026)

The EDPB's 2025 Guidelines on Anonymization introduced a stricter three-part identifiability test that most organizations are failing, either unknowingly or because they are processing pseudonymized data as if it were outside the regulation's scope. The gap between what companies call anonymous and what the EDPB treats as anonymous has generated enforcement actions across healthcare, analytics, and advertising. Getting this distinction wrong doesn't just create legal exposure, it invalidates the compliance rationale for entire data programs.

Your legal team calls a hashed dataset "anonymous." Your engineers implemented SHA-256 without a salt or secret key. Your DPO approved it as outside GDPR scope. None of this will hold up to a regulator's singling-out test.

This guide resolves the confusion: legal definitions, the EDPB's three-part identifiability test, practical implementation techniques for each approach, and a decision framework for choosing between them across the most common use cases. If you need a privacy governance platform that tracks data subject rights obligations, documents legal bases, and maps processing activities across your entire data program, Secure Privacy's Privacy & AI Governance Platform is built for exactly this.

Key takeaways

Pseudonymized data is personal data. GDPR Article 4(5) defines pseudonymization as processing that removes direct identifiers but retains a re-identification key. The data remains in scope for all GDPR obligations: legal basis, data subject rights, retention limits, security requirements, breach notification. The benefit is reduced risk, not reduced obligations.
Truly anonymized data is outside GDPR scope (Recital 26), but achieving genuine anonymization is harder than it appears. The data must pass all three parts of the EDPB's test (singling out, linkability, and inference). Aggregated data from which an individual can still be singled out is not anonymous. IP addresses truncated to three octets are not anonymous if they can be re-linked to behavioral data.
The EDPB published Guidelines 01/2025 on Pseudonymization in January 2025 and Guidelines 04/2025 on Anonymization, providing updated standards on both. The guidelines reflect the EU General Court's 2023 ruling in SRB v EDPS (Case T-557/20), which adopted a context-sensitive approach to identifiability: the risk assessment is not purely theoretical but considers who actually has access to re-identification keys.
Pseudonymization reduces GDPR risk in specific ways: it can satisfy the "appropriate technical measures" requirement under Article 32, support data minimization compliance, reduce DPIA scope for high-risk processing, and enable more permissive secondary research use under Article 89.
Anonymization, when genuine, removes GDPR obligations entirely, enabling data sharing, indefinite retention, and unrestricted secondary use. The compliance value is high, but the technical bar is demanding and the risk of false confidence in anonymization is common.

Article 4(5), pseudonymization definition: "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."

The key phrase is "without the use of additional information." Pseudonymized data is still linkable to a real person; the linkage is just controlled. Recital 26 makes the classification explicit: personal data that has undergone pseudonymization, and that could be attributed to a natural person by the use of additional information, should be considered information on an identifiable natural person.

Recital 26, when data is outside GDPR scope: "The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable."

The standard is "no longer identifiable", not "unlikely to be re-identified" or "difficult to re-identify." The practical test is whether any party, with access to any reasonably available means, could re-identify the data subject from the processed data.

The EDPB's three-part identifiability test for anonymization

The EDPB's guidelines on anonymization require that anonymized data must pass all three of the following tests for any party with access to the data:

1. Singling out: Is it possible to isolate a record in the dataset that corresponds to a specific individual? Even without knowing the identity, if you can point to one person's record, the data is not anonymous.

Common failure: A health dataset where only one patient in a ZIP code has a particular rare condition. Even without their name, the record singles them out.

2. Linkability: Can records relating to the same individual be linked across datasets, within the published dataset itself (across two tables) or with an external dataset? If an analytics dataset contains a device fingerprint or a fixed session ID that can be linked back to an identity in another dataset, it fails.

Common failure: Aggregated advertising frequency data that can be linked back to user identity via device fingerprint when combined with a third-party data broker's dataset.

3. Inference: Can the value of one attribute for an individual be inferred from other values in the dataset, even without a direct identifier? Inference attacks use statistical relationships: knowing someone is a senior female manager in a 10-person department may allow inference of salary range.

Common failure: HR analytics dataset with department, seniority, gender, and salary range where certain intersections identify individuals through inference.

Data must pass all three tests to be considered anonymous under GDPR: singling out, linkability, and inference must all be infeasible. Failing any one test means it remains personal data, regardless of how the organization classifies it internally.
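The first two tests can be approximated with simple checks before any release. The sketch below is illustrative only: the dataset, the field names (device_fp, zip_code, condition), and the broker lookup are hypothetical, and a real assessment would also consider external data the organization does not hold.

```python
from collections import Counter

# Hypothetical released dataset with no names or emails.
released = [
    {"device_fp": "fp_a1", "zip_code": "10115", "condition": "asthma"},
    {"device_fp": "fp_b2", "zip_code": "10115", "condition": "asthma"},
    {"device_fp": "fp_c3", "zip_code": "10117", "condition": "rare_condition_x"},
]

# Hypothetical external dataset mapping device fingerprints to identities.
broker = {"fp_c3": "person_123"}


def singled_out(rows, quasi_identifiers):
    """Return rows whose quasi-identifier combination is unique in the dataset."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return [r for r in rows if counts[tuple(r[q] for q in quasi_identifiers)] == 1]


def linkable(rows, external, key):
    """Return (identity, row) pairs that can be joined to an external dataset."""
    return [(external[r[key]], r) for r in rows if r[key] in external]


unique_rows = singled_out(released, ("zip_code", "condition"))
linked_rows = linkable(released, broker, "device_fp")
```

Here one record is unique on (zip_code, condition) and the same record joins to an identity through the fingerprint, so this dataset would fail both singling out and linkability.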

Pseudonymization techniques

Use pseudonymization when you need to retain the ability to re-identify or link records (for fraud investigation, medical follow-up, audit trails, regulatory requirements) while reducing the risk of unauthorized re-identification.

Tokenization

Replaces a direct identifier (name, email, national ID number) with a randomly generated token. The mapping between token and original value is stored in a secure key store, separate from the tokenized dataset.

How it works: alice@example.com → tok_7x3k9qm. The token is meaningless without the key store. The same identifier always maps to the same token within a system, allowing cross-dataset linkage under controlled conditions.
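A minimal sketch of this mapping, assuming an in-memory dictionary as a stand-in for the key store. The class name TokenVault and the tok_ prefix are illustrative; a production vault would be a separately hosted, access-controlled service with audit logging.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for a separately hosted key store."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same identifier always maps to the same token.
        existing = self._token_by_value.get(value)
        if existing is not None:
            return existing
        # The token is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_urlsafe(8)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Re-identification is only possible with access to the vault.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("alice@example.com")
```

Only the token is written to the analytics dataset. The two dictionaries are the re-identification key and are what must be stored and governed separately.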

GDPR Article 32 compliance: The key store must be held separately from the tokenized data and access-controlled, and because it enables re-identification it should itself be treated as a high-risk processing asset. A breach of the tokenized dataset without the key store carries materially lower risk, which is why Article 32 lists pseudonymization as an example of an appropriate security measure.

Use cases: Payment card data (tokenization is also a common way to reduce PCI DSS scope), customer records in analytics pipelines, clinical trial data, any dataset shared with third parties where re-identification must remain under organizational control.

Keyed hashing (HMAC)

Applies a cryptographic hash function with a secret key to a direct identifier. The same input with the same key always produces the same output (deterministic), allowing linking. The hash is one-way: recovering the original value from the hash is computationally infeasible without the key.

How it works: HMAC-SHA256(alice@example.com, secret_key) → fixed-length hex string. Two datasets hashed with the same key can be joined on the hash value without either dataset containing the original identifier.
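A short sketch using Python's standard hmac module. The function name pseudonymize, the normalization step, and the placeholder key are assumptions for illustration; in practice the key would come from a key management service rather than source code.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed, deterministic pseudonym for an identifier."""
    # Normalize first so trivially different spellings map to the same pseudonym.
    normalized = identifier.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration only; never hard-code a real key.
secret_key = b"example-key-do-not-use"

crm_id = pseudonymize("alice@example.com", secret_key)
analytics_id = pseudonymize(" Alice@Example.com ", secret_key)
```

Both calls produce the same 64-character hex string, so the two datasets can be joined on it. A different key yields an unrelated pseudonym, which is why key custody determines who can link or re-identify.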

Limitations: If the secret key is compromised, all pseudonymized records become re-identifiable simultaneously. Key rotation requires re-hashing all data. The EDPB Guidelines 01/2025 note that HMAC is appropriate when key management is strong; a compromised key destroys the pseudonymization.

Avoid: Unkeyed SHA-256 of email addresses. Email addresses have low entropy (predictable structure), so dictionary and rainbow table attacks can reverse common values. Always use a secret key.

Format-preserving encryption

Encrypts an identifier while preserving its format: a 16-digit card number encrypts to a different 16-digit number. Useful in systems that cannot accommodate changed data formats (legacy databases with column type constraints).

Use cases: When pseudonymization must pass format validation in downstream systems. Adds overhead compared to tokenization for new systems; preferred mainly when retrofitting existing systems.

Encryption with separate key management

Encrypts the entire record or specific fields using a symmetric key held separately from the data. Decryption requires the key. Conceptually similar to tokenization but uses encryption rather than a lookup table.

GDPR consideration: The EDPB treats encrypted personal data as pseudonymized, not anonymous, unless the encryption key has been irretrievably destroyed and the data cannot be decrypted by any other means. Encryption alone does not achieve anonymization.

Anonymization techniques

Use anonymization when you genuinely do not need to re-identify individuals and want to remove GDPR obligations entirely. The test is strict: the data must pass the EDPB's three-part identifiability test even for adversarial parties with access to external datasets.

Data aggregation and suppression

Replace individual-level data with group-level statistics. Instead of storing each user's purchase value, store the average purchase value for users in a geographic cohort.

Minimum group size: The EDPB's 2025 guidelines indicate that groups of fewer than 5–10 individuals are generally insufficient to prevent singling-out, even if no direct identifier is present. Some regulatory contexts (health data, financial services) require higher thresholds. A group of 3 salary records still allows inference.

Suppression: Remove records or cells where the group size falls below the minimum threshold. A cell showing "salary range: €80k–€90k" for a group of 2 people must be suppressed.
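Aggregation with small-cell suppression can be sketched as follows. The threshold of 5, the function name, and the sample data are assumptions; the appropriate minimum group size depends on the data and the regulatory context.

```python
from collections import defaultdict
from statistics import mean

MIN_GROUP_SIZE = 5  # assumed threshold; some contexts require higher values


def aggregate_with_suppression(rows, group_key, value_key, min_group_size=MIN_GROUP_SIZE):
    """Return the mean per group, with None for groups below the minimum size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])

    report = {}
    for group, values in groups.items():
        if len(values) < min_group_size:
            report[group] = None  # suppressed: too few individuals in the cell
        else:
            report[group] = round(mean(values), 2)
    return report


purchases = (
    [{"region": "North", "amount": amount} for amount in (20, 30, 40, 50, 60)]
    + [{"region": "South", "amount": amount} for amount in (100, 300)]
)
report = aggregate_with_suppression(purchases, "region", "amount")
```

The North cohort (five records) is reported as an average, while the South cohort (two records) is suppressed. If totals across all groups are also published, a suppressed cell can sometimes be recovered by subtraction, so totals need the same review.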

k-anonymity

Ensures that every record in a published dataset is indistinguishable from at least k-1 other records on the quasi-identifier attributes. If k=5, every combination of (age range, gender, ZIP code, job category) appears at least 5 times in the dataset.

How it works: Generalize attributes (exact age → age range; exact ZIP code → first 3 digits only; specific job title → job category) until no combination of quasi-identifiers singles out fewer than k individuals.
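The generalization step and the k check can be sketched like this. The field names, the decade-wide age ranges, and the sample records are illustrative assumptions, not a prescribed scheme.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("age_range", "zip_prefix", "job_category")


def generalize(record):
    """Coarsen exact age to a decade range and ZIP code to its first 3 digits."""
    decade = (record["age"] // 10) * 10
    return {
        "age_range": f"{decade}-{decade + 9}",
        "zip_prefix": record["zip_code"][:3],
        "job_category": record["job_category"],
    }


def smallest_class_size(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Size of the smallest group sharing one quasi-identifier combination."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values(), default=0)


def is_k_anonymous(records, k, quasi_identifiers=QUASI_IDENTIFIERS):
    return smallest_class_size(records, quasi_identifiers) >= k


raw = [
    {"age": 31, "zip_code": "10115", "job_category": "engineering"},
    {"age": 34, "zip_code": "10117", "job_category": "engineering"},
    {"age": 38, "zip_code": "10119", "job_category": "engineering"},
    {"age": 52, "zip_code": "20095", "job_category": "finance"},
]
generalized = [generalize(r) for r in raw]
```

After generalization the three engineering records become indistinguishable from one another, but the finance record is still alone in its class. It would need further generalization or suppression before the dataset reaches even k=2.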

Limitation: k-anonymity alone does not prevent inference attacks. The extension l-diversity also requires that within each k-anonymous group, sensitive attributes take at least l distinct values. t-closeness adds distributional similarity requirements. For most enterprise use cases, k-anonymity with suppression of rare combinations is sufficient.
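The inference gap can be made concrete with a small l-diversity check. This is a sketch under assumed field names (age_range, zip_prefix, diagnosis) with made-up records.

```python
from collections import defaultdict


def min_distinct_sensitive_values(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values within any equivalence class."""
    values_by_class = defaultdict(set)
    for r in records:
        values_by_class[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min((len(v) for v in values_by_class.values()), default=0)


records = [
    {"age_range": "30-39", "zip_prefix": "101", "diagnosis": "asthma"},
    {"age_range": "30-39", "zip_prefix": "101", "diagnosis": "asthma"},
    {"age_range": "30-39", "zip_prefix": "101", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip_prefix": "101", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip_prefix": "101", "diagnosis": "diabetes"},
    {"age_range": "40-49", "zip_prefix": "101", "diagnosis": "migraine"},
]
l_value = min_distinct_sensitive_values(records, ("age_range", "zip_prefix"), "diagnosis")
```

Both classes contain three records, so the dataset is 3-anonymous, yet l_value is 1: anyone known to be in the 30-39 group can be inferred to have asthma.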

Where it applies: Publishing aggregate HR statistics, sharing customer cohort data with a research partner, releasing app usage data for academic study.

Differential privacy

Adds carefully calibrated mathematical noise to query results or to the dataset itself, such that the presence or absence of any single individual's data cannot be reliably detected from the output.

How it works: Rather than releasing the exact count (83 users in cohort A), the system releases the count plus noise drawn from a calibrated distribution (for example, 81 or 85 users). The privacy budget (ε) controls the tradeoff between accuracy and privacy: lower ε means stronger privacy but noisier results.
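One common way to calibrate the noise for a counting query is the Laplace mechanism, with noise scale equal to sensitivity divided by ε. The sketch below assumes that mechanism and uses only the standard library. The function name noisy_count is illustrative, and floating-point sampling like this is not hardened for production use, where a vetted library is the safer choice.

```python
import random


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    # The difference of two exponential draws with rate epsilon / sensitivity
    # follows a Laplace distribution with scale sensitivity / epsilon.
    rate = epsilon / sensitivity
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


# Adding or removing one user changes a count by at most 1, so sensitivity is 1.
released = noisy_count(83, epsilon=0.5)
```

With ε = 0.5 the noise scale is 2, so released values typically land within a few units of 83. Each additional query on the same data spends more of the privacy budget, which is why budgets have to be tracked across releases.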

Enterprise use: Google has used differential privacy for Chrome usage statistics, and Apple uses it for iOS feature usage telemetry. For most organizations, implementing differential privacy directly is complex and requires specialist expertise. In practice, it is most accessible through open-source libraries (such as Google's differential privacy library or OpenDP) or platforms that implement it as a feature.

Where it applies: Releasing statistics about user behavior for research or advertising analytics, where aggregate accuracy matters but individual re-identification must be provably prevented.

Synthetic data generation

Creates a new dataset that statistically mimics the original data's distributions and relationships, but contains no real records. No individual in the synthetic dataset corresponds to a real person.

How it works: Train a statistical model (or a generative model like a VAE or GAN) on the original data, then generate new records from the model. The synthetic records have similar demographics, co-occurrence patterns, and attribute correlations to the original, but are not derived from any real individual.

GDPR status: Synthetic data generated from personal data is typically treated as personal data during the generation phase (the training data is in scope). The output, if genuinely synthetic and not memorizing individual records, may be outside GDPR scope, but this depends on the model's memorization risk. Models that have memorized rare combinations from training data can reproduce near-exact records.

Use cases: AI model training (replacing real patient data with synthetic equivalents), testing environments, analyst access to production-equivalent data without exposing real records.

Decision framework: choosing between techniques

Use case	Recommended approach	Rationale
Analytics pipeline with re-linkage needed	Tokenization or HMAC	Retains linkability under controlled conditions; satisfies Article 32
Sharing data with a research partner	k-anonymity + suppression	Group-level data removes individual identifiability; retains statistical value
AI model training on sensitive data	Synthetic data generation	No real records in the training corpus; reduces risk of memorization-based leakage
Publishing aggregate statistics publicly	Aggregation + suppression + differential privacy	Public release requires strong privacy guarantees
HR analytics for internal use	Pseudonymization (HMAC) with access controls	Internal re-identification controls may satisfy Article 32 without full anonymization
Archiving after retention period expires	Anonymization via aggregation + suppression	Data that cannot be deleted (legal hold, research) can be retained if genuinely anonymized
Fraud investigation trails	Tokenization with audited key access	Need for re-identification under exceptional circumstances; key access is the audit log

Common mistakes that create false anonymization

Truncating IP addresses: Removing the last octet of an IP address (e.g., 192.168.1.0/24) reduces precision but does not achieve anonymization. ISP allocation ranges, geolocation data, and device fingerprinting still allow re-identification in practice. The CJEU confirmed in Breyer v. Germany that dynamic IP addresses are personal data where a provider can obtain the identity from the ISP. Truncated IPs are pseudonymized, not anonymous.
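To see how little truncation removes, the sketch below zeroes the last octet with Python's standard ipaddress module and counts the remaining candidates. The function name truncate_ipv4 and the sample address are illustrative.

```python
import ipaddress


def truncate_ipv4(address: str, prefix_length: int = 24) -> str:
    """Zero out the host bits of an IPv4 address, keeping the network prefix."""
    network = ipaddress.ip_network(f"{address}/{prefix_length}", strict=False)
    return str(network.network_address)


truncated = truncate_ipv4("192.168.1.57")

# Number of original addresses that collapse to the same truncated value.
candidates = ipaddress.ip_network("192.168.1.0/24").num_addresses
```

The truncated value still narrows the user down to one of 256 addresses, and any timestamp, user agent, or behavioral attribute stored alongside it can narrow that further.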

Publishing sparse datasets without group size checks: A dataset with 500,000 rows feels large. But if 3,000 of those rows represent demographic combinations where only 1–3 individuals in the real population match, those rows single out individuals, regardless of whether names are removed. Run singling-out analysis before any release.

Hashing without a secret key: An unkeyed SHA-256 hash of an email address can be reversed by a dictionary or rainbow table attack for common email formats. An attacker with a list of candidate emails can hash them all and match against the published hashes. Always use HMAC with a secret key, or use tokenization instead.
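The matching attack takes only a few lines, which is the point. This sketch uses made-up addresses and a hypothetical candidate list (for example, one assembled from a leaked mailing list).

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hashes as they might appear in a published "anonymous" dataset.
published_hashes = {sha256_hex("alice@example.com"), sha256_hex("carol@example.org")}

# Hypothetical candidate list held by the attacker.
candidate_emails = ["alice@example.com", "bob@example.com", "carol@example.org"]

# Hash every candidate and keep the ones that match a published hash.
recovered = sorted(e for e in candidate_emails if sha256_hex(e) in published_hashes)
```

Both published hashes are recovered without any key, because the attacker can compute exactly the same function as the publisher.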

Treating aggregation as automatic anonymization: An HR report with average salary by team, gender, and seniority level may expose individuals in small teams. Suppression rules must be applied: any cell derived from fewer than 5–10 records should be replaced with a range or suppressed entirely.

Calling synthetic data anonymous without memorization testing: Generative models can memorize rare examples from training data. A synthetic healthcare dataset that memorizes a rare condition + age + geography combination for a patient with a unique profile is not anonymous; it may reproduce near-exact personal data. Test models for memorization before treating their output as outside GDPR scope.
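A basic memorization check compares synthetic output against rare records in the training data. The sketch below flags exact matches only, under assumed field names and made-up records; near-duplicates would need a distance-based comparison on top.

```python
from collections import Counter


def memorized_rare_records(original, synthetic, fields):
    """Return synthetic records that exactly match a record unique in the original."""
    counts = Counter(tuple(r[f] for f in fields) for r in original)
    rare = {combo for combo, n in counts.items() if n == 1}
    return [r for r in synthetic if tuple(r[f] for f in fields) in rare]


FIELDS = ("condition", "age_range", "region")

original = [
    {"condition": "asthma", "age_range": "30-39", "region": "Berlin"},
    {"condition": "asthma", "age_range": "30-39", "region": "Berlin"},
    {"condition": "rare_condition_x", "age_range": "80-89", "region": "Rural-North"},
]
synthetic = [
    {"condition": "asthma", "age_range": "30-39", "region": "Berlin"},
    {"condition": "rare_condition_x", "age_range": "80-89", "region": "Rural-North"},
    {"condition": "diabetes", "age_range": "50-59", "region": "Hamburg"},
]
flagged = memorized_rare_records(original, synthetic, FIELDS)
```

The common asthma profile matching is expected and harmless, but the rare profile reappearing verbatim is flagged. Any non-empty result is a reason to retrain or filter before treating the output as anonymous.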

EDPB Guidelines 01/2025 on Pseudonymization: key changes

The EDPB's Guidelines 01/2025 (adopted January 2025) updated the pseudonymization framework following the EU General Court's 2023 SRB v EDPS ruling, which held that identifiability must be assessed against the realistic means available to any party who might access the data, not just the data controller. Key implications:

Context-sensitive risk assessment: Whether pseudonymized data is personal data depends on who has the re-identification key and whether they realistically could use it. Data pseudonymized with a key held only by a single organization in a locked HSM is lower risk than data where the key is accessible to multiple processors.
Pseudonymization reduces risk but retains obligations: The guidelines explicitly state that pseudonymization does not reduce GDPR obligations; it reduces the risks that Article 32 security measures must address. Organizations cannot rely on pseudonymization as a substitute for legal basis or data subject rights mechanisms.
Cross-context risk: If pseudonymized data is shared with a recipient who can re-identify via their own datasets (a data broker, an ad network with identity graphs), the data must be treated as personal data for that recipient even if pseudonymized for others.
Data minimization remains mandatory: Pseudonymization does not substitute for collecting less data. A pseudonymized dataset with unnecessary attributes still violates data minimization principles.

Frequently asked questions

If we use tokenization and hold the key, is our analytics dataset outside GDPR scope?

No. Pseudonymization, by definition, keeps data within GDPR scope. You hold the re-identification key, which means the data is attributable to specific individuals by your organization. GDPR obligations continue to apply: legal basis, retention limits, data subject rights responses, and breach notification if the tokenized dataset is compromised in a way that increases re-identification risk.

We replaced all names and emails with random IDs. Is our dataset now anonymous?

Only if it also passes the singling-out, linkability, and inference tests. Removing direct identifiers is the first step, not the last. If your dataset contains rare demographic combinations, device fingerprints, precise location data, or attributes that can be inferred from each other, it may still fail one or more tests. Run a singling-out analysis and check whether the remaining quasi-identifiers create identifiable subgroups.

Can we use pseudonymized data for secondary purposes (AI training, analytics) without a new legal basis?

Pseudonymization is one factor in the Article 6(4) compatibility assessment for secondary purposes. It reduces the risk to individuals, making secondary use more likely to be compatible with the original purpose, but it does not automatically create a new legal basis. The full Article 6(4) assessment still applies: nature of the link between purposes, context, nature of data, possible consequences, and safeguards including pseudonymization.

Our data retention policy says we delete data after 7 years. Can we anonymize instead of deleting?

Yes, if the anonymization is genuine. Truly anonymous data is outside GDPR scope and can be retained indefinitely. However, the anonymization must pass the three-part EDPB test. Organizations sometimes use "anonymization" to defer deletion of datasets they still need, but if those datasets contain enough attributes to re-identify individuals, they are pseudonymized at best and the retention obligation still applies.

Does GDPR require pseudonymization or is it optional?

It is optional. Article 25 (Data Protection by Design) and Article 32 (Security of Processing) both mention pseudonymization as an example of appropriate technical measures, but neither requires it. Pseudonymization is one tool among many: encryption, access controls, data minimization, and purpose limitation are equally valid measures. Whether pseudonymization is appropriate depends on the risk profile of the processing: high-risk processing (special category data, large-scale profiling, systematic monitoring) benefits most.

What is the difference between anonymization and encryption?

Encryption transforms data so it cannot be read without the decryption key, but the original data remains recoverable. Encrypted personal data is pseudonymized: it is still personal data, still in GDPR scope, and still requires a legal basis. Anonymization transforms data so the original individual cannot be re-identified at all, even with additional information. The difference is reversibility: encryption is reversible with the key; genuine anonymization is irreversible.

Getting pseudonymization and anonymization right is a technical decision with legal consequences. If your privacy program needs a platform that maps processing activities by legal basis, manages DSAR responses, and documents your Article 32 technical measures across the full data lifecycle, Secure Privacy's Privacy & AI Governance Platform gives DPOs and legal teams the structure, visibility, and control to do it well.