
OpenAI Privacy Filter: Open-Source PII Detection Under Apache 2.0 – A Game-Changer for European Enterprises

Tobias Jonas | 6 min read

OpenAI released a compact, fine-tunable model for detecting personally identifiable information – under Apache 2.0. Why this is a strategic building block for European enterprises that want to leverage Frontier LLMs without exposing personal data.


What Is the OpenAI Privacy Filter?

The OpenAI Privacy Filter is a bidirectional token classification model that detects and masks personally identifiable information (PII) in text. Unlike generative models, it doesn’t work autoregressively token by token – instead, it classifies an entire input sequence in a single forward pass.

Technical Specifications

  • Total Parameters: 1.5 billion
  • Active Parameters: 50 million (Sparse Mixture-of-Experts)
  • Context Window: 128,000 tokens
  • Architecture: Bidirectional token classifier with banded attention
  • License: Apache 2.0
  • Base: Autoregressive pretraining (gpt-oss architecture), then post-trained as a classifier

The model uses a Sparse Mixture-of-Experts architecture with 128 experts (top-4 routing per token), Grouped-Query Attention with Rotary Positional Embeddings, and a final classification head over 33 label classes. Decoding uses a constrained Viterbi procedure that enforces coherent BIOES spans (Begin, Inside, Outside, End, Single).
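The constrained decoding step can be illustrated with a minimal sketch. Assume a single category with the tags O, B, I, E, S; the transition rules below follow the general BIOES convention, and the scores and implementation are illustrative, not OpenAI's actual code:

```python
import numpy as np

# Illustrative BIOES-constrained Viterbi decoding for one PII category.
TAGS = ["O", "B", "I", "E", "S"]

def allowed(prev: str, cur: str) -> bool:
    """BIOES transition rules: an open span (after B or I) must be
    continued with I or E; otherwise only O, B, or S may follow."""
    if prev in ("B", "I"):
        return cur in ("I", "E")
    return cur in ("O", "B", "S")

def viterbi(emissions: np.ndarray) -> list[str]:
    """emissions: (seq_len, n_tags) log-scores. Returns the best tag
    sequence that never violates the BIOES span constraints."""
    n, k = emissions.shape
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    # a sequence may only start with O, B, or S
    for j, t in enumerate(TAGS):
        if t in ("O", "B", "S"):
            score[0, j] = emissions[0, j]
    for i in range(1, n):
        for j, cur in enumerate(TAGS):
            for p, prev in enumerate(TAGS):
                cand = score[i - 1, p] + emissions[i, j]
                if allowed(prev, cur) and cand > score[i, j]:
                    score[i, j] = cand
                    back[i, j] = p
    # a sequence may only end with O, E, or S (no dangling B/I)
    last = max((j for j, t in enumerate(TAGS) if t in ("O", "E", "S")),
               key=lambda j: score[n - 1, j])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [TAGS[j] for j in reversed(path)]
```

A plain per-token argmax could emit an "I" with no preceding "B"; the constrained search cannot, which is exactly what makes the masked spans coherent.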

What Does the Model Detect?

The Privacy Filter classifies eight categories of personal data:

  1. account_number – Account numbers, IBANs, credit card numbers
  2. private_address – Residential addresses
  3. private_email – Email addresses
  4. private_person – Personal names
  5. private_phone – Phone numbers
  6. private_url – Personal URLs
  7. private_date – Birth dates and other personal date references
  8. secret – API keys, passwords, credentials
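These eight categories also explain the 33 label classes mentioned above: each category gets a Begin/Inside/End/Single tag, plus one shared Outside class. A quick sketch (the exact label-string format is an assumption; the arithmetic is not):

```python
CATEGORIES = [
    "account_number", "private_address", "private_email", "private_person",
    "private_phone", "private_url", "private_date", "secret",
]

# BIOES tagging: four positional tags per category plus one shared "O" class
LABELS = ["O"] + [f"{tag}-{cat}" for cat in CATEGORIES for tag in "BIES"]

print(len(LABELS))  # 8 categories x 4 span tags + "O" = 33
```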

Why Apache 2.0 Makes All the Difference

The Apache 2.0 licensing is the true strategic core of this release:

  • Commercial use without restrictions
  • Modification and redistribution permitted
  • No copyleft – no obligation to disclose your own modifications
  • Patent grant – explicit protection against patent claims from the licensor

For enterprises, this means: take the model, fine-tune it on your own data distributions, integrate it into your own infrastructure, and run it in production – without legal gray areas, without dependency on OpenAI’s API, without vendor lock-in.

Comparison to Previous Alternatives

Previous open-source PII detection tools like Microsoft Presidio or spaCy-based NER pipelines often work rule-based or with significantly smaller models. The Privacy Filter is the first context-aware model from a leading AI company to enter the open-source space – with an architecture that scales to 128k token context and is adaptable to domain-specific requirements through fine-tuning.


The European Use Case: Frontier LLMs Without Privacy Risk

This is where it gets particularly interesting for European enterprises. The situation is well known:

The Dilemma

The most capable Frontier LLMs – whether GPT-5, Claude Opus, or Gemini Ultra – offer capabilities that are transformative for many business processes. At the same time, European companies face concrete hurdles:

  • GDPR compliance: Personal data may not be transmitted to third-party providers without a legal basis
  • Schrems II implications: Data transfers to the US remain legally complex
  • Sector regulation: Industries like healthcare, finance, and public administration have additional restrictions
  • EU AI Act: Transparency obligations when processing personal data through AI systems

The Solution: Privacy Filter as a Preprocessing Layer

The OpenAI Privacy Filter enables an architecture pattern that resolves this dilemma:

[Original Data] → [Privacy Filter (on-premises)] → [Masked Data] → [Frontier LLM API]

In practice:

  1. Input data passes through the Privacy Filter on your own infrastructure
  2. PII is detected and masked – names become [PERSON], emails become [EMAIL], etc.
  3. The masked data goes to the Frontier LLM for processing
  4. The response is mapped back – masked placeholders are replaced with the original data

The result: the full capability of a Frontier model, without personal data ever leaving your own infrastructure.
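The four steps above can be sketched as a small wrapper. Here `detect_pii` is a regex stub standing in for the Privacy Filter, and the placeholder format and simulated LLM response are illustrative assumptions:

```python
import re

def detect_pii(text: str) -> list[tuple[int, int, str]]:
    """Stub for the Privacy Filter: returns (start, end, category) spans.
    In production this would be the model's BIOES span output."""
    return [(m.start(), m.end(), "EMAIL")
            for m in re.finditer(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text)]

def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected spans with numbered placeholders and remember
    the mapping so the LLM response can be un-masked afterwards."""
    mapping: dict[str, str] = {}
    out, last = [], 0
    for i, (start, end, cat) in enumerate(detect_pii(text)):
        placeholder = f"[{cat}_{i}]"
        mapping[placeholder] = text[start:end]
        out += [text[last:start], placeholder]
        last = end
    out.append(text[last:])
    return "".join(out), mapping

def unmask(response: str, mapping: dict[str, str]) -> str:
    """Map placeholders in the LLM response back to the original values."""
    for placeholder, original in mapping.items():
        response = response.replace(placeholder, original)
    return response

masked, mapping = mask("Please reply to alice@example.org about the invoice.")
# masked: "Please reply to [EMAIL_0] about the invoice."
# ... send `masked` to the Frontier LLM API here ...
reply = f"Draft sent to {list(mapping)[0]}."   # simulated LLM response
print(unmask(reply, mapping))                  # restores the real address
```

Only the mapping table, which never leaves your infrastructure, can reconnect the placeholders to real identities.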

Why Now?

Several Frontier models are currently unavailable or only partially available in Europe. Some providers deliberately skip an EU launch because regulatory requirements are too complex. The Privacy Filter opens a pragmatic middle ground: instead of waiting for an EU launch or forgoing the models entirely, companies can use the API endpoints – but with a self-operated privacy filter in front.


Practical Usage

Installation and Getting Started

# inside a local checkout of the repository
pip install -e .

This provides the opf CLI tool:

# One-shot redaction
opf "John Doe lives at 42 Example Street, London."

# Process a file
opf -f /path/to/file.txt

# CPU mode (no GPU required)
opf --device cpu "Alice was born on 1990-01-02."

# Interactive mode
opf

Fine-Tuning on Your Own Data

A crucial advantage: the model can be fine-tuned on your own data distribution:

opf train /path/to/training-data.jsonl --output-dir /path/to/finetuned-model

Typical fine-tuning scenarios:

  • Industry-specific PII: Medical record numbers, insurance numbers, internal employee IDs
  • Language adaptation: Optimization for German texts, Swiss address formats, Austrian social security numbers
  • Policy adaptation: What counts as PII is context-dependent – a company name can be public in one context and confidential in another
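A training record for such a fine-tune might look like the following. The JSONL field names (`text`, `spans`) are assumptions for illustration; check the repository's documented schema before training:

```python
import json

# Hypothetical training record: a German medical note with an internal
# case number mapped onto the model's existing label set.
record = {
    "text": "Patient Max Mustermann, Fallnummer 2024-00123, geb. 01.02.1990.",
    "spans": [
        {"start": 8,  "end": 22, "label": "private_person"},
        {"start": 35, "end": 45, "label": "account_number"},  # internal case ID
        {"start": 52, "end": 62, "label": "private_date"},
    ],
}

with open("training-data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```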

Runtime Precision/Recall Control

The Viterbi decoding parameters allow runtime behavior tuning:

  • High recall: Prefer masking too much – for high-risk data protection scenarios
  • High precision: Only mask at high confidence – for scenarios where context preservation matters
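One common way such a knob works is a bias on the non-PII class before decoding; this is a generic sketch, not necessarily the exact parameterization the opf tool exposes:

```python
import numpy as np

def bias_outside(emissions: np.ndarray, o_bias: float,
                 o_index: int = 0) -> np.ndarray:
    """Shift the log-score of the 'O' (non-PII) class before decoding.
    o_bias < 0: 'O' is penalized, more tokens get masked (high recall);
    o_bias > 0: 'O' is favored, fewer tokens get masked (high precision)."""
    biased = emissions.copy()
    biased[:, o_index] += o_bias
    return biased

# Borderline token: the model is almost undecided between "O" and a PII tag.
scores = np.array([[0.1, 0.0]])        # columns: [O, S-private_person]
recall_mode = bias_outside(scores, o_bias=-0.5)
precision_mode = bias_outside(scores, o_bias=+0.5)
print(recall_mode.argmax(axis=1))      # the token is masked
print(precision_mode.argmax(axis=1))   # the token stays
```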

Limitations – An Honest Assessment

The Privacy Filter is not a silver bullet. OpenAI itself documents the limitations transparently:

  • Primarily trained on English: Performance on German texts, non-Latin scripts, or regional naming conventions may be limited
  • Static label policy: The eight categories are fixed – what doesn’t fit won’t be detected
  • Not an anonymization guarantee: The model is a data minimization aid, not a complete anonymization solution
  • False positives: Public entities (company names, place names) may be incorrectly masked
  • False negatives: Unusual names, regional naming conventions, or novel credential formats may slip through

Our assessment: For productive use in the German-speaking market, fine-tuning on German data is practically mandatory. The model provides the architecture and fundamental capability – the domain-specific adaptation is up to each organization.


Strategic Implications

For Enterprises

The Privacy Filter significantly lowers the barrier to legally compliant use of Frontier LLMs. Instead of waiting for a European LLM champion or working with significantly weaker local models, companies can:

  1. Run the Privacy Filter on their own infrastructure
  2. Fine-tune it for their domain
  3. Use Frontier LLMs via API – with masked data
  4. Unlock the full capability of the best available models

For AI Strategy

This release fits a larger pattern: the AI landscape is moving toward modular architectures. Not one model does everything – instead, specialized components (PII filters, guardrails, routing, evaluation) are orchestrated into an overall system. The Privacy Filter is one building block in this architecture.

For CompanyGPT Customers

For our CompanyGPT customers, we are already evaluating integration of the Privacy Filter as an additional data protection layer. The combination of a self-hosted AI platform and an upstream PII filter can further enhance data security – especially for customers connecting external Frontier models via API.


Conclusion

OpenAI’s Privacy Filter is not a revolutionary research result – it is a pragmatic, well-documented tool that fills a real gap. The Apache 2.0 license makes it usable for enterprises. The compact architecture makes it operable on standard hardware. Fine-tuning makes it adaptable.

For European companies caught between data protection requirements and the desire for Frontier LLM capabilities, this is a concrete, viable path forward.


Want to deploy the OpenAI Privacy Filter in your organization or integrate it into your existing AI infrastructure? Talk to us – we advise on evaluation, fine-tuning, and architecture.



Written by

Tobias Jonas

Co-CEO, M.Sc.

Tobias Jonas, M.Sc. is co-founder and Co-CEO of innFactory AI Consulting GmbH. He is a leading innovator in artificial intelligence and cloud computing. As co-founder of innFactory GmbH, he has successfully led hundreds of AI and cloud projects and established the company as a key player in the German IT sector. Tobias always has his finger on the pulse: he recognized the potential of AI agents early on and hosted one of the first meetups on the topic in Germany. He also drew attention to the MCP protocol within the first month of its release and informed his followers about the Agentic AI Foundation on the day it was founded. Alongside his executive roles, Tobias Jonas is active in various professional and business associations, including the KI Bundesverband and the digital committee of the IHK München und Oberbayern, and leads hands-on AI and cloud projects at the Technische Hochschule Rosenheim. As a keynote speaker, he shares his expertise on AI and makes complex technological concepts accessible.
