8 min read · December 2024 · Boris, MLOps Engineer

PII masking: a checklist for the financial sector

A practical checklist for compliance with personal-data regulations when running ML projects.

Introduction

Financial organizations handle massive volumes of personal data (PII). When building ML models — from credit scoring to customer-request analysis — you have to comply with personal-data regulations.

This article is a practical checklist for staying compliant without blocking your ML roadmap.

What counts as PII

Typical categories of personal data:

Category | Examples | Risk
--- | --- | ---
Full name | John A. Doe | Medium
ID document | Passport series, number | High
Address | Registered, physical | Medium
Phone, email | +1 (555) 123-4567 | Medium
Tax / SSN | Tax ID, social-security number | High
Banking data | Card number, account | High
Biometrics | Photo, voice, fingerprints | High

Important: Special categories (race, ethnicity, health information, biometrics) typically require explicit consent from the data subject (written consent in some jurisdictions) and heightened protection measures.

Masking strategies

1. Substitution

Replace real values with generated ones that are still valid in format. Preserves data structure for testing.

# Before
John A. Doe, passport 4515 123456

# After
Peter P. Smith, passport 4515 999999
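As a rough illustration of substitution, the sketch below swaps the name for one from a small pool and replaces the passport number digits with random ones while keeping the series and the layout. The name pool, the helper names, and the choice to keep the first four digits are assumptions made for this example. A real pipeline would more likely use a generator such as Faker.

```python
import random

# Hypothetical pool of replacement names, used only for this illustration.
FAKE_NAMES = ["Peter P. Smith", "Mary K. Jones", "Alan R. Brown"]


def substitute_digits(value: str, rng: random.Random, keep_prefix: int = 0) -> str:
    """Replace each digit with a random digit, keeping spaces and punctuation.

    The first `keep_prefix` digits stay unchanged (for example, a passport series).
    """
    out = []
    seen = 0
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen <= keep_prefix else str(rng.randrange(10)))
        else:
            out.append(ch)
    return "".join(out)


def substitute_record(record: dict, rng: random.Random) -> dict:
    """Return a copy of the record with the name and passport number substituted."""
    return {
        "name": rng.choice(FAKE_NAMES),
        "passport": substitute_digits(record["passport"], rng, keep_prefix=4),
    }


rng = random.Random(42)  # fixed seed only so that the example is repeatable
original = {"name": "John A. Doe", "passport": "4515 123456"}
masked = substitute_record(original, rng)
print(masked)
```

Because the digits are drawn at random, a substituted value can occasionally coincide with a real one. If that matters, check generated values against the source data before use.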

2. Partial masking

Hide part of the value. Standard for cards and phone numbers.

# Before
+1 (555) 123-4567
4276 5500 1234 5678

# After
+1 (555) ***-**67
4276 55** **** 5678
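One possible way to implement partial masking is a single helper that hides every digit except a visible prefix and suffix. The function name `mask_digits` and its parameters are assumptions made for this sketch, not part of any library.

```python
def mask_digits(value: str, keep_first: int, keep_last: int, mask_char: str = "*") -> str:
    """Mask every digit except the first `keep_first` and the last `keep_last` digits.

    Non-digit characters (spaces, brackets, dashes, "+") are kept as they are.
    """
    total = sum(ch.isdigit() for ch in value)
    out = []
    index = 0
    for ch in value:
        if ch.isdigit():
            index += 1
            visible = index <= keep_first or index > total - keep_last
            out.append(ch if visible else mask_char)
        else:
            out.append(ch)
    return "".join(out)


masked_phone = mask_digits("+1 (555) 123-4567", keep_first=4, keep_last=2)
masked_card = mask_digits("4276 5500 1234 5678", keep_first=6, keep_last=4)
print(masked_phone)  # +1 (555) ***-**67
print(masked_card)   # 4276 55** **** 5678
```

How many digits may stay visible depends on the applicable rules and internal policy, so the `keep_first` and `keep_last` values here are only illustrative.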

3. Tokenization

Replace values with unique tokens that can be reversed via a secure mapping store.

# Before
Tax ID: 7707083893

# After
Tax ID: [TOKEN_a7f3d2]
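The idea can be sketched with an in-memory mapping that stands in for the secure store. The `TokenVault` class below is hypothetical. In production the mapping would live in a dedicated, access-controlled service, and tokens would usually be longer than the six hex characters used here to match the example above.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secure mapping store (illustration only)."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = f"[TOKEN_{secrets.token_hex(3)}]"
        while token in self._value_by_token:  # retry if the random token is already taken
            token = f"[TOKEN_{secrets.token_hex(3)}]"
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Raises KeyError for unknown tokens.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("7707083893")
print(f"Tax ID: {token}")
print(vault.detokenize(token))
```

The token is random rather than derived from the value, so it reveals nothing without access to the mapping.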

4. Deletion

Drop the field entirely. Use when the data is not needed by the ML model.
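A minimal sketch of deletion, assuming records arrive as dictionaries. It uses an allowlist of features rather than a list of fields to drop, so a new PII column added upstream is excluded by default. The field names are hypothetical.

```python
# Hypothetical set of features the model actually needs.
ALLOWED_FEATURES = {"age", "income", "loan_amount"}


def keep_allowed_fields(record: dict, allowed: set = ALLOWED_FEATURES) -> dict:
    """Return a copy of the record that contains only allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}


raw = {
    "full_name": "John A. Doe",
    "passport": "4515 123456",
    "age": 41,
    "income": 5200,
    "loan_amount": 15000,
}
features = keep_allowed_fields(raw)
print(features)  # {'age': 41, 'income': 5200, 'loan_amount': 15000}
```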

ML-specific requirements

Training data

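Training data should be anonymized or, at a minimum, pseudonymized. Properly anonymized data is generally not treated as personal data, whereas pseudonymized data (for example, values tokenized with a reversible mapping) usually still is, because it can be linked back to a person.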

Model logs

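Logs of ML-model requests must not contain PII. Hash user identifiers, preferably with a keyed hash such as HMAC: plain hashes of short or predictable identifiers can be reversed by brute force.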
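One way to hash identifiers before logging is a keyed hash from the Python standard library. The sketch below assumes the key is loaded from a secret manager. The function name, the literal key, and the log fields are placeholders made up for this example.

```python
import hashlib
import hmac


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of the identifier, safe to write to logs."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for the example only; load the real key from a secret store.
SECRET_KEY = b"example-key-do-not-use"

hashed_id = pseudonymize_user_id("user-12345", SECRET_KEY)
log_line = f"request_id=42 user={hashed_id} model=scoring-v1 status=ok"
print(log_line)
```

The same identifier always produces the same digest under one key, so requests can still be correlated per user. Rotating the key breaks that link for older logs.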

Models with memory

LLMs and some embedding models can "memorize" training data. Audit for PII leakage, for example by testing the model against training-data extraction and model-inversion attacks.
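A full extraction or inversion attack is beyond a short example, but a simpler first step is to scan a sample of model outputs for PII. The sketch below is that simpler output scan, not an attack implementation. It assumes you have already collected generated texts and a list of PII values known to be in the training set. The function name and the card-number pattern are illustrative.

```python
import re

# Rough pattern for 16-digit card-like numbers, with optional spaces or dashes.
CARD_LIKE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def find_leaks(outputs: list, known_pii: list) -> list:
    """Return (output_index, kind, value) tuples for suspected PII in model outputs."""
    leaks = []
    for index, text in enumerate(outputs):
        # Exact matches against values known to be in the training data.
        for value in known_pii:
            if value in text:
                leaks.append((index, "known_value", value))
        # Pattern matches for values that merely look like card numbers.
        for match in CARD_LIKE.finditer(text):
            leaks.append((index, "card_like", match.group()))
    return leaks


sample_outputs = [
    "The weather is fine today.",
    "Sure, John A. Doe's card is 4276 5500 1234 5678.",
]
leaks = find_leaks(sample_outputs, known_pii=["John A. Doe"])
print(leaks)
```

An empty result from a scan like this does not prove that the model has not memorized anything. It only means the sampled outputs did not expose the values that were checked.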

Compliance checklist

Legal requirements

Technical controls

ML-specific

Tools

Tool | Purpose | Notes
--- | --- | ---
Presidio (Microsoft) | PII detection and masking | Customizable recognizers, multi-language
spaCy + NER | Entity extraction | Trainable models for specific domains
Faker | Synthetic data generation | Locale-aware, valid formats
HashiCorp Vault | Tokenization and secret management | Enterprise-grade, access auditing

# Example: masking with Presidio
# Requires presidio-analyzer, presidio-anonymizer, and a spaCy English model.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Called John Doe, phone +1 (555) 123-4567"

# Analyze: detect PII entities in the text
results = analyzer.analyze(text=text, language="en")

# Mask: by default each detected entity is replaced with <ENTITY_TYPE>
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# Typical result: "Called <PERSON>, phone <PHONE_NUMBER>"
# (what gets detected depends on the loaded NLP model and recognizers)

Liability

Penalties for legal entities that violate personal-data rules vary by jurisdiction. Fines are commonly imposed for processing without consent, breaching data-subject rights, and failing to implement security controls. PII leaks, including leaks through ML models, can additionally trigger civil suits.

Conclusion

Personal-data compliance is not an obstacle to ML — it's a quality requirement for the architecture. Apply masking at data ingestion, automate checks, and document the processes — that will save time during audits.

Need help with implementation? Get in touch — we have experience with financial-sector and compliance projects.