PII masking: a compliance checklist for financial-sector ML projects

Introduction

Financial organizations handle massive volumes of personally identifiable information (PII). When building ML models, from credit scoring to customer-request analysis, you have to comply with personal-data regulations.

This article is a practical checklist for staying compliant without blocking your ML roadmap.

What counts as PII

Typical categories of personal data:

Category	Examples	Risk
Full name	John A. Doe	Medium
ID document	Passport series, number	High
Address	Registered, physical	Medium
Phone, email	+1 (555) 123-4567	Medium
Tax / SSN	Tax ID, social-security number	High
Banking data	Card number, account	High
Biometrics	Photo, voice, fingerprints	High

Important: Special categories (such as race, ethnicity, health information, and biometrics) typically require explicit consent from the data subject, in writing in some jurisdictions, along with heightened protection measures.

Masking strategies

1. Substitution

Replace real values with generated ones that are still valid in format. Preserves data structure for testing.

# Before
John A. Doe, passport 4515 123456

# After
Peter P. Smith, passport 4515 999999
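A minimal sketch of format-preserving substitution using only the standard library. The `FAKE_NAMES` pool, the `substitute_record` helper, and the "4-digit series, 6-digit number" passport layout are assumptions taken from the example above; in practice a generator such as Faker would usually supply the replacement values.

```python
import random
import re
import string

# Hypothetical pool of replacement names; a real project would generate these.
FAKE_NAMES = ["Peter P. Smith", "Mary K. Jones", "Alan R. Brown"]


def substitute_record(record: dict, rng: random.Random) -> dict:
    """Return a copy of the record with the name and passport number replaced.

    The passport series is kept and only the 6-digit number is regenerated,
    so the value still matches the original format.
    """
    series, _number = record["passport"].split()
    fake_number = "".join(rng.choice(string.digits) for _ in range(6))
    return {
        "full_name": rng.choice(FAKE_NAMES),
        "passport": f"{series} {fake_number}",
    }


rng = random.Random(42)  # seeded only so the sketch is reproducible
original = {"full_name": "John A. Doe", "passport": "4515 123456"}
substituted = substitute_record(original, rng)
print(substituted)
```

Because the replacement is random, the same input maps to different outputs across runs. If joins between tables must keep working, tokenization is usually the better fit.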

2. Partial masking

Hide part of the value. Standard for cards and phone numbers.

# Before
+1 (555) 123-4567
4276 5500 1234 5678

# After
+1 (555) ***-**67
4276 55** **** 5678
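The masking above can be sketched with one helper that keeps a fixed number of leading and trailing digits and leaves separators untouched. The `mask_digits` name and the keep counts (6 and 4 for cards, 4 and 2 for the phone example) are assumptions chosen to reproduce the sample output, not a regulatory rule.

```python
def mask_digits(value: str, keep_first: int, keep_last: int, mask_char: str = "*") -> str:
    """Mask every digit except the first `keep_first` and last `keep_last`.

    Non-digit characters (spaces, brackets, hyphens, "+") are preserved,
    so the masked value keeps the original layout.
    """
    total_digits = sum(ch.isdigit() for ch in value)
    masked = []
    seen = 0
    for ch in value:
        if ch.isdigit():
            seen += 1
            keep = seen <= keep_first or seen > total_digits - keep_last
            masked.append(ch if keep else mask_char)
        else:
            masked.append(ch)
    return "".join(masked)


masked_card = mask_digits("4276 5500 1234 5678", keep_first=6, keep_last=4)
masked_phone = mask_digits("+1 (555) 123-4567", keep_first=4, keep_last=2)
print(masked_card)   # 4276 55** **** 5678
print(masked_phone)  # +1 (555) ***-**67
```

Note that a value with no more digits than `keep_first + keep_last` is returned unmasked, so short inputs need a separate rule.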

3. Tokenization

Replace values with unique tokens that can be reversed via a secure mapping store.

# Before
Tax ID: 7707083893

# After
Tax ID: [TOKEN_a7f3d2]
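A minimal sketch of reversible tokenization. The `TokenVault` class is a hypothetical in-memory stand-in for the secure mapping store; a production setup would keep the mapping in a hardened, access-controlled service rather than in process memory.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secure mapping store (illustration only)."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return the token for a value, creating one on first use."""
        existing = self._value_to_token.get(value)
        if existing is not None:
            return existing
        token = f"[TOKEN_{secrets.token_hex(3)}]"
        while token in self._token_to_value:  # retry on the rare collision
            token = f"[TOKEN_{secrets.token_hex(3)}]"
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Reverse lookup; raises KeyError for unknown tokens."""
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("7707083893")
print(f"Tax ID: {token}")
```

The same value always maps to the same token, so tokenized columns can still be joined or grouped. The token itself is random and carries no information about the original value.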

4. Deletion

Drop the field entirely. Use when the data is not needed by the ML model.
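One way to implement deletion is an allowlist: keep only the fields the model is known to need and drop everything else. The field names in `MODEL_FEATURES` and the sample record are hypothetical.

```python
# Hypothetical list of features the model actually consumes.
MODEL_FEATURES = frozenset({"age", "income", "loan_amount"})


def drop_unneeded_fields(record: dict, allowed: frozenset = MODEL_FEATURES) -> dict:
    """Keep only allowlisted fields; anything else is removed."""
    return {key: value for key, value in record.items() if key in allowed}


raw_record = {
    "full_name": "John A. Doe",
    "passport": "4515 123456",
    "age": 41,
    "income": 52000,
    "loan_amount": 15000,
}
clean_record = drop_unneeded_fields(raw_record)
print(clean_record)
```

An allowlist is used here instead of a blocklist so that a newly added PII column is dropped by default rather than passed through unnoticed.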

ML-specific requirements

Training data

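Training data must be anonymized or de-identified. Properly anonymized data is generally not treated as personal data, whereas pseudonymized or otherwise re-identifiable data usually still is, so confirm which standard applies in your jurisdiction.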

Model logs

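Logs of ML-model requests must not contain PII. Replace user identifiers with a keyed hash (for example HMAC-SHA-256) rather than a plain hash, because low-entropy identifiers can be recovered from unkeyed hashes by brute force.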
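A minimal sketch of identifier hashing for logs, assuming a keyed hash (HMAC-SHA-256) from the standard library. The `pseudonymize_user_id` helper and the demo key are hypothetical; a real key would be loaded from a secret manager and never hard-coded.

```python
import hashlib
import hmac


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Return a stable, keyed hash of the identifier for use in logs."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


DEMO_KEY = b"demo-key-do-not-use"  # assumption: placeholder for a managed secret
log_id = pseudonymize_user_id("user-12345", DEMO_KEY)
print(f"inference request user={log_id}")
```

The same user always yields the same log identifier, so requests can still be correlated, but without the key the identifier cannot be recomputed from a list of candidate user IDs.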

Models with memory

LLMs and some embedding models can "memorize" training data. Audit for PII leakage with training-data extraction, membership-inference, and model-inversion tests.
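One simple piece of such an audit is checking whether known training-set values appear verbatim in model outputs. The sketch below assumes you already have a list of generated outputs and a list of sensitive values from the training data; `find_leaked_values` is a hypothetical helper, and it only catches exact matches after stripping punctuation and case, not paraphrased leaks.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop non-word characters so formatting differences are ignored."""
    return re.sub(r"\W+", "", text).lower()


def find_leaked_values(model_outputs: list, known_pii: list) -> list:
    """Return (output_index, value) pairs where a known value appears in an output."""
    leaks = []
    for index, output in enumerate(model_outputs):
        flat_output = normalize(output)
        for value in known_pii:
            flat_value = normalize(value)
            if flat_value and flat_value in flat_output:
                leaks.append((index, value))
    return leaks


outputs = [
    "The customer can be reached at the usual number.",
    "Card on file: 4276-5500-1234-5678, holder John A. Doe.",
]
training_values = ["4276 5500 1234 5678", "John A. Doe", "7707083893"]
leaks = find_leaked_values(outputs, training_values)
print(leaks)
```

Short values can produce false positives with this kind of substring matching, so a real audit would likely set a minimum length or review hits manually.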

Compliance checklist

Legal requirements

Personal-data processing policy is published and up to date
Processing consent includes ML purposes
A Data Protection Officer (DPO) is appointed
Data-protection impact assessment completed (including for ML models)

Technical controls

PII is detected automatically (NER, regular expressions)
Masking applied before writing to storage
Encryption at rest (AES-256) and in transit (TLS 1.3)
Access segregation: only the staff who need it
Access logging without storing the data itself
Automatic deletion after the retention period
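The regular-expression side of automatic detection can be sketched as follows. The patterns and the `detect_pii` helper are illustrative assumptions, not a complete detector. Card candidates are filtered with the Luhn checksum to cut false positives from arbitrary 16-digit strings.

```python
import re

# Illustrative patterns only; real deployments need broader, tested coverage.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum over the digits of a candidate card number."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    checksum = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return checksum % 10 == 0


def detect_pii(text: str) -> list:
    """Return (entity_type, matched_text) pairs found in the text."""
    findings = []
    for match in CARD_RE.finditer(text):
        if luhn_valid(match.group()):
            findings.append(("CARD", match.group()))
    for match in EMAIL_RE.finditer(text):
        findings.append(("EMAIL", match.group()))
    return findings


sample = "Paid with 4111 1111 1111 1111, receipt sent to jane@example.com"
findings = detect_pii(sample)
print(findings)
```

Regular expressions work for structured values such as card numbers and emails. Names and addresses generally need an NER model on top.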

ML-specific

Training datasets verified for raw PII
Models tested for training-data leakage
Feature store contains no unmasked identifiers
Inference logs do not contain input PII
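For the inference-log item, one option is a `logging.Filter` that redacts messages before they reach any handler output. This is a sketch under assumptions: the `RedactPIIFilter` class, the two patterns, and the logger name are hypothetical, and pattern-based redaction is a safety net rather than a substitute for not logging inputs at all.

```python
import io
import logging
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
LONG_DIGITS_RE = re.compile(r"\b\d(?:[ -]?\d){9,}\b")  # runs of 10 or more digits


class RedactPIIFilter(logging.Filter):
    """Rewrite each log record so emails and long digit runs are replaced."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with arguments already merged
        message = EMAIL_RE.sub("[EMAIL]", message)
        message = LONG_DIGITS_RE.sub("[NUMBER]", message)
        record.msg = message
        record.args = ()
        return True  # keep the record, now redacted


stream = io.StringIO()  # stands in for a real log destination
handler = logging.StreamHandler(stream)
handler.addFilter(RedactPIIFilter())

logger = logging.getLogger("inference_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("scoring request for %s, card %s", "jane@example.com", "4276 5500 1234 5678")
print(stream.getvalue(), end="")
```

Attaching the filter to the handler rather than to a single logger means records from child loggers that reach this handler are redacted as well.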

Tools

Tool	Purpose	Notes
Presidio (Microsoft)	PII detection and masking	Customizable recognizers, multi-language
spaCy + NER	Entity extraction	Trainable models for specific domains
Faker	Synthetic data generation	Locale-aware, valid formats
HashiCorp Vault	Tokenization and secret management	Enterprise-grade, access auditing

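# Example: masking with Presidio
# Requires the presidio-analyzer and presidio-anonymizer packages
# and a spaCy English model such as en_core_web_lg.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Called John Doe, phone +1 (555) 123-4567"

# Analyze: returns detected entities with type, position, and score
results = analyzer.analyze(text=text, language="en")

# Mask: by default each detected entity is replaced with <ENTITY_TYPE>
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# Typical result: "Called <PERSON>, phone <PHONE_NUMBER>"
# (exact output depends on the installed model and recognizers)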

Liability

Penalties for legal entities that violate personal-data rules vary by jurisdiction. Violations that commonly draw fines include processing without consent, breaching data-subject rights, and failing to implement security controls. PII leaks, including leaks through ML models, can also trigger civil suits.

Conclusion

Personal-data compliance is not an obstacle to ML; it is a quality requirement for the architecture. Apply masking at data ingestion, automate checks, and document the processes. That will save time during audits.
