Presidio PII Redaction with Lakera Prompt-Injection Defense

What

A two-layer outbound safety net for any agent that ingests user text: Microsoft Presidio labels and redacts PII before the text reaches the LLM, and Lakera Guard screens prompts for jailbreak patterns with a configurable confidence threshold.

Project Context

In full_project_context_updated.txt -> app/config.py, the security settings block exposes:

pii_redaction_enabled: bool = True
lakera_guard_api_key: Optional[str] = None
security_jailbreak_threshold: float = 0.8
security_enabled: bool = True (master kill-switch)

In Dockerfile, the third build stage downloads the spaCy en_core_web_lg wheel from the explosion/spacy-models GitHub releases. presidio-analyzer loads that model when the AnalyzerEngine is first constructed, so it must be baked into the image rather than fetched at runtime.

How

Settings block with master kill-switch and bounds-checked threshold


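from typing import Optional

from pydantic import Field
from pydantic_settings import BaseSettings  # assumes Pydantic v2 + pydantic-settings


class Settings(BaseSettings):
    security_enabled: bool = Field(default=True, description="Enable security guardrails")
    pii_redaction_enabled: bool = Field(default=True, description="Redact PII from logs and DB")
    # ge/le make startup fail fast on a threshold outside [0.0, 1.0]
    security_jailbreak_threshold: float = Field(default=0.8, ge=0.0, le=1.0)

    lakera_guard_api_key: Optional[str] = Field(
        default=None,
        description="API Key for Lakera Guard (Advanced Prompt Injection Defense)",
    )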

The master security_enabled switch short-circuits BOTH checks in one place when devs want raw prompts.
ge=0.0, le=1.0 enforces threshold bounds — Lakera returns confidence scores in the [0.0, 1.0] range, so any value outside is a misconfiguration.

Presidio analyzer with the Spacy en_core_web_lg engine


from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
 
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
 
def redact_pii(prompt: str) -> str:
    if not settings.pii_redaction_enabled:
        return prompt
    results = analyzer.analyze(text=prompt, language="en")
    anonymized = anonymizer.anonymize(text=prompt, analyzer_results=results)
    return anonymized.text

The AnalyzerEngine() constructor loads the spaCy model pipeline; because the wheel is baked into the image by the Dockerfile, construction never has to download the model and the first request does not pay for it.
anonymizer.anonymize(...) with the default replace operator swaps each detected entity for a placeholder of the form <ENTITY_TYPE>, e.g. <EMAIL_ADDRESS> or <PHONE_NUMBER>. This preserves enough structure for downstream agents to reason about entity positions without keeping the raw values.
Both engines are built once at module load and reused; only analyze(...) and anonymize(...) run per prompt. Constructing AnalyzerEngine inside the request path would reload the spaCy model on every call.
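A stdlib-only sketch of what the default replace operator does with the analyzer's spans makes the placeholder format concrete. `Span` and `replace_with_placeholders` are hypothetical stand-ins for Presidio's RecognizerResult and AnonymizerEngine, and the overlap rule (keep the higher-scoring span) only approximates Presidio's own conflict handling.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for presidio_analyzer.RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def replace_with_placeholders(text: str, spans: List[Span]) -> str:
    """Replace each span with <ENTITY_TYPE>, mimicking the default 'replace' operator."""
    # Resolve overlaps: prefer higher score, then longer span (an assumed rule).
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start))):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    # Splice from right to left so earlier offsets stay valid.
    out = text
    for span in sorted(kept, key=lambda s: s.start, reverse=True):
        out = out[: span.start] + f"<{span.entity_type}>" + out[span.end :]
    return out


text = "Email jane@bank.co.uk or call 020 7946 0000"
spans = [
    Span("EMAIL_ADDRESS", 6, 21, 1.0),
    Span("PHONE_NUMBER", 30, 43, 0.75),
    Span("URL", 11, 21, 0.5),  # overlaps the e-mail; dropped by the lower score
]
print(replace_with_placeholders(text, spans))
# Email <EMAIL_ADDRESS> or call <PHONE_NUMBER>
```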

Lakera Guard jailbreak check


import httpx
 
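async def check_prompt_safety(prompt: str) -> bool:
    if not settings.lakera_guard_api_key:
        return True  # fail-open when no key configured
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "https://api.lakera.ai/v1/prompt_injection",
            json={"input": prompt},
            headers={"Authorization": f"Bearer {settings.lakera_guard_api_key}"},
            timeout=settings.groq_timeout,
        )
        r.raise_for_status()  # never treat an error body as a clean verdict
        data = r.json()
    # Assumed v1 response shape (verify against the current Lakera API reference):
    # {"results": [{"category_scores": {"prompt_injection": 0.97, ...}, "flagged": true}]}
    # Under that shape data.get("score", 0.0) always returns 0.0 and passes every prompt.
    score = data["results"][0]["category_scores"]["prompt_injection"]
    return score < settings.security_jailbreak_threshold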

Fail-open when the key is missing is the correct dev default; production deployments must add an explicit alarm when the API key is absent.
The threshold is read from config, so it can be tuned or A/B-tested by changing a setting rather than the code.
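One way to implement that alarm is a startup check. This is a minimal sketch: `SecuritySettings`, its `environment` field and `check_lakera_configuration` are hypothetical (the real settings block has no environment field), and logging at ERROR level assumes log-based alerting is wired up.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("security")


@dataclass
class SecuritySettings:
    """Hypothetical subset of app/config.py used by this check."""
    environment: str = "dev"  # assumed field; not in the original settings block
    security_enabled: bool = True
    lakera_guard_api_key: Optional[str] = None


def check_lakera_configuration(settings: SecuritySettings) -> bool:
    """Return True if Lakera will actually run; alarm if production would fail open."""
    if not settings.security_enabled:
        logger.warning("security_enabled is False: all guardrails are bypassed")
        return False
    if settings.lakera_guard_api_key:
        return True
    if settings.environment == "production":
        # Picked up by log-based alerting; swap in a metric or pager hook as needed.
        logger.error("Lakera API key missing in production: jailbreak check fails open")
    else:
        logger.info("No Lakera key configured; using the fail-open dev default")
    return False
```

Calling this once at application startup keeps the fail-open behaviour in `check_prompt_safety` unchanged while making a misconfigured production deployment impossible to miss.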

Common Pitfalls

Running Presidio without the en_core_web_lg wheel installed (for example, an image that only ships en_core_web_sm) raises OSError: [E050] Can't find model 'en_core_web_lg' when the model is loaded. Pin the wheel URL in the Dockerfile so the image is self-sufficient.

Setting the threshold too high (above 0.95) makes Lakera permissive and lets jailbreaks slip through; setting it too low (below 0.5) blocks legitimate queries. Benchmark before deployment.

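Calling Presidio per token or per chunk inside the agent loop costs O(n²) in total when each call rescans the accumulated text, and analyzing isolated fragments loses the context the NER model needs. Invoke it once on the final user message before the LLM call, not on every chunk of streaming output.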

Real-World Interview Prep

Q1: How accurate is Presidio on banking-domain PII?

A: Presidio achieves 90-95% recall on standard entities (PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD) when the spaCy model is en_core_web_lg (as in this stack). The 5-10% miss rate is concentrated in (a) non-Western names the spaCy model does not recognise, (b) domain-specific entities not covered by the predefined recognisers (e.g., UK National Insurance numbers, sort codes; IBANs are already covered by the built-in IBAN_CODE recogniser), where the inline PatternRecognizer for UK_NINO adds a regex recogniser to close the gap, and (c) heavily obfuscated formats (john[at]bank[dot]com). For high-stakes compliance, follow Presidio with a custom regex pass for the categories that regexes match reliably. Precision remains high (very few false positives) because Presidio’s scoring weights spaCy + recogniser agreement.
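The obfuscated-email gap in (c) is the kind of thing a follow-up regex pass can close. The pattern below is a hypothetical sketch, not part of Presidio; it reuses Presidio's <EMAIL_ADDRESS> placeholder so downstream code sees a single format.

```python
import re

# Hypothetical post-Presidio pass for obfuscated e-mail addresses such as
# "john[at]bank[dot]com" or "john (at) bank (dot) co (dot) uk".
_AT = r"\s*(?:@|[\[(]\s*at\s*[\])])\s*"
_DOT = r"\s*(?:\.|[\[(]\s*dot\s*[\])])\s*"
OBFUSCATED_EMAIL = re.compile(
    rf"\b[\w.+-]+{_AT}[A-Za-z0-9-]+(?:{_DOT}[A-Za-z0-9-]+)+\b",
    re.IGNORECASE,
)


def redact_obfuscated_emails(text: str) -> str:
    """Replace obfuscated (and plain) e-mail addresses with Presidio's placeholder."""
    return OBFUSCATED_EMAIL.sub("<EMAIL_ADDRESS>", text)


print(redact_obfuscated_emails("Contact john[at]bank[dot]com today"))
# Contact <EMAIL_ADDRESS> today
```

Plain "at" without brackets is deliberately not matched, so ordinary sentences such as "meet me at the bank" pass through untouched.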

Q2: How does Lakera Guard decide if a prompt is a jailbreak?

A: Lakera Guard runs a classifier network (a fine-tuned BERT-like model) over the prompt and returns a score in [0.0, 1.0] (or flagged: bool per the v2 API). It’s been trained on a corpus of jailbreak patterns: persona-shift (“you are now X”), prompt-extraction (“repeat your system prompt”), base64-encoded attacks, etc. Compared with regex heuristics, Lakera catches patterns that heuristics cannot see (multi-language jailbreaks, purely semantic attacks) and is far more robust against obfuscation like leet-speak or Unicode tricks. The trade-off is cost: Lakera is a paid API billed per call, so for low-latency hot paths pre-screen with a fast regex pass and only call Lakera on the suspicious ~5%.
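The pre-screen routing can be sketched as follows, assuming the Lakera call is wrapped in an async function that returns its score. `SUSPICIOUS`, `screen_prompt` and the fake client are hypothetical. Note the cost of this design: prompts the regex misses never reach Lakera, which gives up part of its advantage on semantic attacks.

```python
import asyncio
import re
from typing import Awaitable, Callable

# Hypothetical cheap pre-screen; the patterns are illustrative, not a vetted list.
SUSPICIOUS = re.compile(
    r"ignore (?:all|previous|above)|system prompt|you are now|base64|jailbreak",
    re.IGNORECASE,
)


async def screen_prompt(
    prompt: str,
    lakera_score: Callable[[str], Awaitable[float]],
    threshold: float = 0.8,
) -> bool:
    """Return True if the prompt should be blocked.

    Only prompts that trip the cheap regex are sent to the paid classifier.
    """
    if not SUSPICIOUS.search(prompt):
        return False  # fast path: no API call, no cost
    score = await lakera_score(prompt)
    return score >= threshold


async def _fake_lakera(prompt: str) -> float:
    return 0.93  # stand-in for a real Lakera call


print(asyncio.run(screen_prompt("What is my balance?", _fake_lakera)))  # False
```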

Q3: When should a security layer “fail open” vs “fail closed”?

A: Fail closed (block on error) when (a) the user-facing impact of a false negative (a leaked PII / a successful jailbreak) is severe, (b) you have regulatory obligation (PCI-DSS, FCA), (c) the user can re-attempt safely. Fail open (allow on error) when (a) the dependency is non-critical (e.g., Lakera is down but Presidio still runs), (b) a false positive (blocked legit user) is worse than a false negative, (c) the user cannot retry (e.g., in-chat escalation). The FCA stack defaults to fail-open with explicit logger.error(...) so outages are visible; in production a stronger default is fail-closed on PII but fail-open on jailbreak (lower user friction, higher compliance fidelity).

Top-to-Bottom Code Walkthrough (`app/services/security_service.py` — `sanitize_input`, `check_prompt_injection`, `redact_pii`)

This is the outbound defense-in-depth for an FCA-grade chatbot. Three layers, each catching what the others miss.

Layer 1 — Presidio (local PII detection)


from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
 
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
 
entities = analyzer.analyze(text=text, language="en")
anonymized = anonymizer.anonymize(text=text, analyzer_results=entities)

Built-in entity types include PERSON, EMAIL_ADDRESS, PHONE_NUMBER, IBAN_CODE, CREDIT_CARD, IP_ADDRESS, URL and LOCATION. UK_NINO is not among them, so UK-specific identifiers need a custom regex recogniser:


from presidio_analyzer import Pattern, PatternRecognizer
 
uk_nino_pattern = PatternRecognizer(
    supported_entity="UK_NINO",
    patterns=[Pattern(name="nino", regex=r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b", score=0.85)],
)
analyzer.registry.add_recognizer(uk_nino_pattern)
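The regex above accepts some strings that are never issued as NINOs (a second letter of O, prefixes such as GB or BG) and misses the common spaced form. A stricter, hypothetical pattern, checked here with plain `re`, could replace the `regex=` argument in the Pattern(...) call above. The exclusion rules follow the commonly documented NINO format; verify them against your own compliance sources before relying on them.

```python
import re
from typing import List

# Hypothetical stricter NINO pattern:
#  - first letter excludes D, F, I, Q, U, V; the second also excludes O
#  - the prefixes BG, GB, KN, NK, NT, TN, ZZ are never issued
#  - accepts the spaced form "AB 12 34 56 C" as well as "AB123456C"
NINO_REGEX = (
    r"\b(?!BG|GB|KN|NK|NT|TN|ZZ)"
    r"[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z]"
    r"\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"
)
NINO = re.compile(NINO_REGEX)


def find_ninos(text: str) -> List[str]:
    """Return every NINO-shaped substring in text."""
    return [m.group(0) for m in NINO.finditer(text)]


print(find_ninos("NI number AB 12 34 56 C, not GB123456A"))
# ['AB 12 34 56 C']
```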

Layer 2 — Lakera Guard (cloud prompt-injection detection)


import httpx
 
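async def _lakera_check(self, user_text: str) -> bool:
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "https://api.lakera.ai/v1/prompt_injection",
            headers={"Authorization": f"Bearer {settings.lakera_guard_api_key}"},
            json={"input": user_text},
            timeout=2.0,
        )
        r.raise_for_status()  # an error body must not be parsed as a verdict
        data = r.json()
    # Assumed v1 response shape (verify against the Lakera API reference):
    # the score lives at results[0].category_scores.prompt_injection
    score = data["results"][0]["category_scores"]["prompt_injection"]
    # >= so the boundary matches check_prompt_safety's "score < threshold is safe"
    return score >= settings.security_jailbreak_threshold  # default 0.8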

timeout=2.0 — Lakera is called inline; a slow response would block the user message. 2 seconds is the right budget; if Lakera is unhealthy, fall back to regex + heuristic.
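The fallback can be sketched with `asyncio.wait_for`, which bounds the whole Lakera coroutine (including any retries) rather than a single HTTP request. `injection_verdict`, `regex_says_injection` and the pattern subset are hypothetical; `lakera_check` is assumed to be an async function returning a bool, like the `_lakera_check` coroutine that sanitize_input calls.

```python
import asyncio
import logging
import re
from typing import Awaitable, Callable, Tuple

logger = logging.getLogger("security")

# Hypothetical subset of the Layer 3 JAILBREAK_REGEXES, precompiled once.
FALLBACK_PATTERNS = [
    re.compile(r"ignore (?:all|previous|above) instructions", re.IGNORECASE),
    re.compile(r"you are now (?:DAN|jailbroken)", re.IGNORECASE),
]


def regex_says_injection(text: str) -> bool:
    return any(p.search(text) for p in FALLBACK_PATTERNS)


async def injection_verdict(
    text: str,
    lakera_check: Callable[[str], Awaitable[bool]],
    budget_s: float = 2.0,
) -> Tuple[bool, str]:
    """Return (is_injection, source); degrade to the local regex if Lakera is slow or failing."""
    try:
        verdict = await asyncio.wait_for(lakera_check(text), timeout=budget_s)
        return verdict, "lakera"
    except Exception as exc:  # timeout, HTTP error, malformed JSON, ...
        logger.warning("Lakera unavailable (%r), using regex fallback", exc)
        return regex_says_injection(text), "regex"
```

Returning the verdict source alongside the boolean makes it easy to count how often production is running on the degraded check.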

Layer 3 — Local regex (always-on fallback)


import re
 
JAILBREAK_REGEXES = [
    r"(?i)ignore (?:all|previous|above) instructions",
    r"(?i)you are now (?:DAN|jailbroken)",
    r"(?i)system prompt",
    r"(?i)prompt injection",
]
 
def local_jailbreak_check(text):
    for pattern in JAILBREAK_REGEXES:
        if re.search(pattern, text):
            return True
    return False

Why three layers

Presidio alone: misses novel injection patterns. Strong on PII, weak on jailbreaks.
Lakera alone: misses PII. Strong on jailbreaks. Requires API key + connectivity.
Local regex alone: limited vocabulary; too many false negatives.

Together: Presidio scrubs PII in roughly 50-300 ms. Lakera flags jailbreaks in roughly 100-500 ms, capped by the 2 s timeout. Regex catches what Lakera misses (e.g., during Lakera downtime).

The combined pipeline in `sanitize_input`


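async def sanitize_input(self, text: str) -> dict:
    # Layer 1: PII. Runs first so Lakera only ever sees redacted text.
    entities = self.analyzer.analyze(text=text, language="en")
    anonymized = self.anonymizer.anonymize(text=text, analyzer_results=entities)

    # Layer 3: local regex always runs; it never leaves the process, so raw text is fine.
    is_injection = local_jailbreak_check(text)

    # Layer 2: Lakera on the redacted text when a key is configured (sequential, not parallel).
    if settings.lakera_guard_api_key and not is_injection:
        try:
            is_injection = await self._lakera_check(anonymized.text)
        except (httpx.HTTPError, KeyError, IndexError, ValueError) as exc:
            # Fail open on Lakera errors: keep the regex verdict and make the outage visible.
            logger.error("Lakera check failed, using regex verdict only: %s", exc)

    return {
        "safe_text": anonymized.text,
        "is_injection": is_injection,
        "redacted_entities": [e.to_dict() for e in entities],
    }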

Critical sequence: PII redaction runs BEFORE the jailbreak check. Why? Because Lakera should never see raw PII (data sovereignty).

`redact_pii(text)` — log-safe variant


def redact_pii(self, text: str) -> str:
    entities = self.analyzer.analyze(text=text, language="en")
    return self.anonymizer.anonymize(text=text, analyzer_results=entities).text

Used by logger.info(f"User said: {redact_pii(user_input)}"). Never log the raw prompt.
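Calling redact_pii at every log site relies on discipline. A complementary safety net, sketched here with the standard logging module, is a filter on the handler that redacts each formatted message. `toy_redact` is a regex stand-in for the Presidio-backed redact_pii, and `PIIRedactingFilter` is a hypothetical name. Attach it to handlers rather than loggers: logger-level filters do not see records propagated from child loggers.

```python
import logging
import re
from typing import Callable

# Toy redactor standing in for the Presidio-backed redact_pii (an assumption).
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def toy_redact(text: str) -> str:
    return _EMAIL.sub("<EMAIL_ADDRESS>", text)


class PIIRedactingFilter(logging.Filter):
    """Redact every record's final message, so no call site can forget."""

    def __init__(self, redact: Callable[[str], str]) -> None:
        super().__init__()
        self._redact = redact

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first (msg % args), then redact the complete message.
        record.msg = self._redact(record.getMessage())
        record.args = None  # already formatted; prevents a second % substitution
        return True
```

In the real stack this would be attached as `handler.addFilter(PIIRedactingFilter(redact_pii))`, with the service's redact_pii in place of the toy function.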

Kill switches


if not settings.security_enabled:
    # Same keys as the normal path so callers never hit a KeyError.
    return {"safe_text": text, "is_injection": False, "redacted_entities": []}

Two booleans:

security_enabled: master switch for ALL sanitisation (set to false ONLY in dev).
pii_redaction_enabled: disables Presidio only; the jailbreak checks still run, which means Lakera then receives unredacted text.

Performance characteristics

Presidio NLP: 50-300ms (spaCy model overhead). Cached instance at startup.
Lakera: 100-500ms. Async httpx.
Regex: <1ms.

Throughput: ~3-5 user messages/sec per worker. Sufficient for hundreds of concurrent users with a few workers.
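The "hundreds of concurrent users" figure follows from a back-of-envelope bound. Assume W workers each handle r messages per second and each active user sends one message every T seconds on average; W = 4 and T = 30 s are illustrative assumptions, and r = 3 msg/s is the low end of the range above. The aggregate message rate N/T must stay at or below W·r:

$$
\frac{N}{T} \le W\,r \quad\Longrightarrow\quad N_{\max} = W\,r\,T = 4 \times 3\ \text{msg/s} \times 30\ \text{s} = 360 \text{ users}
$$

Bursty traffic lowers the practical figure, so treat this as a ceiling to validate with load tests.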

Common Pitfalls

Sending raw PII to Lakera: Presidio must redact first. Lakera has its own SOC 2 attestation, but the data still leaves your VPC.

Not running Presidio on the LLM responses: agent output might echo back PII the user entered. Run the anonymiser on both input and output.
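For streamed responses, one option is to buffer to a natural boundary and redact each complete unit once, which also avoids the per-chunk rescanning cost noted in the earlier pitfalls. This is a sketch under the assumption that line breaks are acceptable flush points (entities rarely span them); `redact_streamed_output` is hypothetical, and `redact` would be the Presidio-backed redact_pii.

```python
from typing import Callable, Iterable, Iterator


def redact_streamed_output(
    chunks: Iterable[str],
    redact: Callable[[str], str],
    flush_on: str = "\n",
) -> Iterator[str]:
    """Buffer streamed LLM output and redact each complete line exactly once.

    Redacting chunk by chunk would miss entities split across chunk boundaries.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while flush_on in buffer:
            line, buffer = buffer.split(flush_on, 1)
            yield redact(line) + flush_on
    if buffer:  # trailing text without a final newline
        yield redact(buffer)


# The e-mail arrives split across three chunks; a per-chunk redactor would miss it.
chunks = ["Your email is ja", "ne@bank.co", ".uk\nAnything else?"]
toy = lambda t: t.replace("jane@bank.co.uk", "<EMAIL_ADDRESS>")  # stand-in redactor
print("".join(redact_streamed_output(chunks, toy)))
```

The trade-off is latency: the user sees output a line at a time instead of token by token.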

Caching Lakera results without TTL — attack patterns evolve; a 30-day cache could re-allow a now-known-bad prompt. TTL 5 min.
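A minimal in-process TTL cache for Lakera verdicts might look like this. `TTLVerdictCache` is hypothetical; it keys on a SHA-256 hash so prompt text is never stored, and the injectable clock exists only to make expiry testable. Expired entries are dropped on read only, so a production version would need size-bounded eviction or an external store with per-key expiry.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLVerdictCache:
    """Cache Lakera verdicts per prompt for a short time (default 5 minutes)."""

    def __init__(self, ttl_s: float = 300.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bool]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Hash so the cache never holds prompt text, even redacted.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[bool]:
        """Return the cached verdict, or None if absent or expired."""
        key = self._key(prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, verdict = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: force a fresh Lakera call
            return None
        return verdict

    def put(self, prompt: str, verdict: bool) -> None:
        self._entries[self._key(prompt)] = (self._clock() + self._ttl, verdict)
```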

Real-World Interview Prep

Q1: Why is `UK_NINO` not a built-in Presidio entity?

A: Presidio’s predefined recognisers are largely US-centric (US_SSN, US_PASSPORT), with limited UK coverage (UK_NHS). Other UK-specific identifiers such as the National Insurance number need custom regex recognisers, registered via PatternRecognizer and added to the analyzer’s registry.

Q2: When would you fail-open vs fail-closed on Lakera downtime?

A: Fail-open (let the request through with a warning) for inbound prompt-injection checks: blocking every request because Lakera is down is worse than running with degraded checks. Fail-closed for outbound PII redaction if Presidio is down: sending unredacted PII to the LLM provider is a regulatory violation.

Q3: How do you measure the false-positive rate of the jailbreak detector?

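A: Build a labelled test corpus of 1000 legitimate + 500 attack prompts and run each through the detector. The false-positive rate is FP / (FP + TN), i.e. the fraction of legitimate prompts that get blocked; report it alongside precision, recall and F1. Production tuning targets F1 ≈ 0.95, but track the false-positive rate separately, because false positives are worse than misses here: they block legitimate users.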
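The counting can be sketched in a few lines. `detector_metrics` is hypothetical, and the usage below runs it on made-up counts sized like the corpus above (20 of 1000 legitimate prompts blocked, 475 of 500 attacks caught), purely to show the arithmetic.

```python
from typing import Dict, Sequence


def detector_metrics(labels: Sequence[bool], predictions: Sequence[bool]) -> Dict[str, float]:
    """labels/predictions: True = attack. Returns the FPR alongside precision/recall/F1."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must be the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0  # share of legitimate prompts blocked
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical counts: 1000 legitimate prompts (20 wrongly blocked), 500 attacks (475 caught).
labels = [False] * 1000 + [True] * 500
predictions = [True] * 20 + [False] * 980 + [True] * 475 + [False] * 25
m = detector_metrics(labels, predictions)
print({k: round(v, 3) for k, v in m.items()})
# {'precision': 0.96, 'recall': 0.95, 'f1': 0.955, 'fpr': 0.02}
```

Here F1 clears the 0.95 target while 2% of legitimate users are still blocked, which is why the false-positive rate deserves its own line in the report.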
