PII masking: a checklist for the financial sector
A practical checklist for compliance with personal-data regulations when running ML projects.
Introduction
Financial organizations handle massive volumes of personal data (PII). When building ML models — from credit scoring to customer-request analysis — you have to comply with personal-data regulations.
This article is a practical checklist for staying compliant without blocking your ML roadmap.
What counts as PII
Typical categories of personal data:
| Category | Examples | Risk |
|---|---|---|
| Full name | John A. Doe | Medium |
| ID document | Passport series, number | High |
| Address | Registered, physical | Medium |
| Phone, email | +1 (555) 123-4567 | Medium |
| Tax / SSN | Tax ID, social-security number | High |
| Banking data | Card number, account | High |
| Biometrics | Photo, voice, fingerprints | High |
Masking strategies
1. Substitution
Replace real values with generated ones that are still valid in format. Preserves data structure for testing.
```
# Before
John A. Doe, passport 4515 123456

# After
Peter P. Smith, passport 4515 999999
```
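A minimal sketch of format-preserving substitution for numeric identifiers, using only the standard library. The function name `substitute_digits` and its parameters are assumptions made for illustration. In practice a generator such as Faker (listed under Tools) would supply locale-aware names and documents.

```python
import random
from typing import Optional


def substitute_digits(value: str, keep_prefix: int = 0, seed: Optional[int] = None) -> str:
    """Replace digits with random digits, keeping separators and layout.

    The first `keep_prefix` digits are left unchanged (for example a
    passport series), so the result still looks valid to format checks.
    """
    rng = random.Random(seed)
    out = []
    digits_seen = 0
    for ch in value:
        if ch.isdigit():
            digits_seen += 1
            if digits_seen <= keep_prefix:
                out.append(ch)  # keep the leading digits as-is
            else:
                out.append(str(rng.randint(0, 9)))  # swap in a random digit
        else:
            out.append(ch)  # spaces, dashes, etc. are preserved
    return "".join(out)


# Keep the 4-digit series, replace the 6-digit number.
print(substitute_digits("4515 123456", keep_prefix=4))
```

A random digit can coincide with the original one, so this sketch suits test data generation rather than guaranteed anonymization.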
2. Partial masking
Hide part of the value. Standard for cards and phone numbers.
```
# Before
+1 (555) 123-4567
4276 5500 1234 5678

# After
+1 (555) ***-**67
4276 55** **** 5678
```
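The masking shown above can be sketched as a single helper that keeps a fixed number of leading and trailing digits and leaves separators untouched. The name `mask_digits` and its parameters are hypothetical, and the keep counts are chosen only to reproduce the example.

```python
def mask_digits(value: str, keep_first: int, keep_last: int, mask_char: str = "*") -> str:
    """Mask the middle digits of a value, preserving non-digit characters."""
    total = sum(ch.isdigit() for ch in value)
    out = []
    index = 0  # position among digits only
    for ch in value:
        if ch.isdigit():
            visible = index < keep_first or index >= total - keep_last
            out.append(ch if visible else mask_char)
            index += 1
        else:
            out.append(ch)  # keep spaces, brackets, dashes, plus signs
    return "".join(out)


print(mask_digits("4276 5500 1234 5678", keep_first=6, keep_last=4))  # 4276 55** **** 5678
print(mask_digits("+1 (555) 123-4567", keep_first=4, keep_last=2))    # +1 (555) ***-**67
```

If `keep_first + keep_last` is at least the number of digits, nothing is masked, so callers should validate the input length first.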
3. Tokenization
Replace values with unique tokens that can be reversed via a secure mapping store.
```
# Before
Tax ID: 7707083893

# After
Tax ID: [TOKEN_a7f3d2]
```
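A minimal sketch of reversible tokenization. The `TokenVault` class is a hypothetical in-memory stand-in for the secure mapping store. A real deployment would keep the mapping in an access-controlled, audited service rather than in process memory.

```python
import secrets
from typing import Dict


class TokenVault:
    """In-memory stand-in for a secure token mapping store (illustrative only)."""

    def __init__(self) -> None:
        self._token_by_value: Dict[str, str] = {}
        self._value_by_token: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = f"[TOKEN_{secrets.token_hex(3)}]"
        while token in self._value_by_token:  # retry on a rare collision
            token = f"[TOKEN_{secrets.token_hex(3)}]"
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Reverse a token; raises KeyError for unknown tokens."""
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("7707083893")
print(f"Tax ID: {token}")
print(vault.detokenize(token))
```

Because tokens are random rather than derived from the value, they reveal nothing without access to the vault. The 6-hex-character length only mirrors the example and is too short for large datasets.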
4. Deletion
Drop the field entirely. Use when the data is not needed by the ML model.
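One way to implement deletion is an allowlist: keep only the fields the model is known to need, so a newly added PII column is dropped by default. The field names and the `keep_allowed_fields` helper below are hypothetical.

```python
from typing import Any, Dict, Iterable


def keep_allowed_fields(record: Dict[str, Any], allowed: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of the record containing only allowlisted fields."""
    allowed_set = set(allowed)
    return {key: value for key, value in record.items() if key in allowed_set}


# Hypothetical feature list for a credit-scoring model.
MODEL_FEATURES = {"income", "loan_amount", "term_months"}

raw_record = {
    "full_name": "John A. Doe",
    "passport": "4515 123456",
    "income": 5200,
    "loan_amount": 15000,
    "term_months": 36,
}

print(keep_allowed_fields(raw_record, MODEL_FEATURES))
```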
ML-specific requirements
Training data
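Training data must be anonymized or de-identified before it reaches the model. Properly anonymized data is generally not treated as personal data; pseudonymized data (for example, tokenized records whose mapping is retained) usually still is, as under GDPR.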
Model logs
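Logs of ML-model requests must not contain PII. Hash user identifiers with a keyed hash such as HMAC rather than a plain hash, which can be brute-forced for short identifiers.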
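One way to hash identifiers before they reach the logs is HMAC-SHA256 with a secret key. This is a minimal sketch. The function name `pseudonymize_id` and the inline key are assumptions, and a real key would come from a secret store rather than source code.

```python
import hashlib
import hmac


def pseudonymize_id(user_id: str, key: bytes) -> str:
    """Return a stable keyed hash of a user identifier for use in logs."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; still 64 bits


# Placeholder key for illustration only; load the real one from a secret manager.
LOG_KEY = b"replace-with-secret-key"

log_line = f"score_request user={pseudonymize_id('client-10042', LOG_KEY)} status=ok"
print(log_line)
```

The same user always maps to the same value, so requests can still be correlated in the logs, while rotating the key breaks linkage with older records.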
Models with memory
LLMs and some embedding models can "memorize" training data and reproduce it verbatim. Audit for PII leakage with training-data extraction, membership-inference, and model-inversion tests.
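A simple starting point for such an audit is a prefix-completion probe: prompt the model with the beginning of a known training record and check whether the sensitive remainder comes back. This is a basic training-data extraction check, not a full model-inversion attack. The `generate` callable is a hypothetical stand-in for your model's inference call, and `leaky_model` is a stub used only to make the sketch runnable.

```python
from typing import Callable, Iterable, List, Tuple


def find_memorized_records(
    generate: Callable[[str], str],
    canaries: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (prompt, secret) pairs whose secret appears in the completion."""
    leaks = []
    for prompt, secret in canaries:
        completion = generate(prompt)
        if secret in completion:
            leaks.append((prompt, secret))
    return leaks


def leaky_model(prompt: str) -> str:
    """Stub model that has 'memorized' one record (illustration only)."""
    memorized = {"Tax ID of John A. Doe:": " 7707083893"}
    return memorized.get(prompt, " [no data]")


canaries = [
    ("Tax ID of John A. Doe:", "7707083893"),
    ("Passport of John A. Doe:", "4515 123456"),
]

for prompt, secret in find_memorized_records(leaky_model, canaries):
    print(f"LEAK: prompt {prompt!r} reproduced a secret value")
```

An empty result does not prove the absence of memorization; it only shows that these specific prompts did not trigger it.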
Compliance checklist
Legal requirements
Technical controls
ML-specific
Tools
| Tool | Purpose | Notes |
|---|---|---|
| Presidio (Microsoft) | PII detection and masking | Customizable recognizers, multi-language |
| spaCy + NER | Entity extraction | Trainable models for specific domains |
| Faker | Synthetic data generation | Locale-aware, valid formats |
| HashiCorp Vault | Tokenization and secret management | Enterprise-grade, access auditing |
```python
# Example: masking with Presidio
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Called John Doe, phone +1 (555) 123-4567"

# Analyze: detect PII entities in the text
results = analyzer.analyze(text=text, language="en")

# Mask: replace each detected entity with a placeholder
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# With the default operator, detected entities become placeholders
# such as <PERSON> and <PHONE_NUMBER>
```
Liability
Penalties for legal entities for personal-data violations vary by jurisdiction. Typical grounds for fines include processing without consent, breaches of data-subject rights, and security-control failures. PII leaks, including leaks through ML models, can also trigger civil suits.
Conclusion
Personal-data compliance is not an obstacle to ML — it's a quality requirement for the architecture. Apply masking at data ingestion, automate checks, and document the processes — that will save time during audits.