LLM Application Security: Practical Defense Patterns for Production
End-to-end guide to securing LLM applications in production. Covers the OWASP Top 10 for LLMs 2025, prompt injection defense strategies, PII protection with Microsoft Presidio, guardrails with NeMo and Lakera, output validation, and defense-in-depth architecture.
Security in LLM applications differs fundamentally from traditional software security. The attack surface expands beyond network and application layers to include natural language inputs that can manipulate model behavior, outputs that may leak sensitive information, and tool integrations that could be exploited for unauthorized actions. OWASP has recognized this shift by ranking prompt injection as the number one AI security risk in its 2025 Top 10 for LLMs, a position it has held since the list was first compiled.
This guide provides a comprehensive overview of practical security patterns for production LLM applications. We focus on defense-in-depth—multiple layers of protection that work together. No single technique is sufficient, but combined, they create a robust security posture that can withstand sophisticated attacks.
The 2025 Threat Landscape
Understanding what you're defending against is the foundation of effective security. The OWASP Top 10 for LLM Applications 2025 provides a comprehensive categorization of the most critical vulnerabilities.
OWASP Top 10 for LLM Applications 2025
LLM01: Prompt Injection remains the most significant threat. Attackers manipulate LLMs via crafted inputs that can lead to unauthorized access, data breaches, and compromised decision-making. Unlike traditional exploits such as SQL injection where malicious inputs are clearly distinguishable, prompt injection presents an unbounded attack surface with infinite variations, making static filtering largely ineffective.
LLM02: Sensitive Information Disclosure moved up from position six to become the second most critical risk. LLMs can inadvertently reveal confidential data from their training sets, user conversations, or connected data sources. This includes PII, proprietary business information, and system configuration details.
LLM03: Supply Chain Vulnerabilities arise from malicious or vulnerable models, training data, or third-party components. Models downloaded from public repositories like Hugging Face, or training data sourced externally, can be prone to poisoning attacks that compromise the entire application.
LLM04: Data and Model Poisoning encompasses attacks where data involved in pre-training, fine-tuning, or retrieval augmentation is manipulated to introduce vulnerabilities or backdoors. This can cause the model to produce biased, incorrect, or malicious outputs.
LLM05: Improper Output Handling occurs when developers neglect to validate LLM outputs before using them downstream. This can lead to code execution, cross-site scripting, SQL injection, and other classic web vulnerabilities triggered through model-generated content.
LLM06: Excessive Agency manifests when LLMs have more functionality, permissions, or autonomy than necessary. A plugin that allows file reading but also enables writing or deletion exemplifies this vulnerability. The principle of least privilege applies to AI agents just as it does to human users.
LLM07: System Prompt Leakage happens when system prompts containing sensitive configuration, secrets, or security-critical instructions are exposed to users. Beyond information disclosure, leaked system prompts help attackers understand and circumvent security controls.
LLM08: Vector and Embedding Weaknesses addresses vulnerabilities in RAG systems and embedding-based methods. Poisoned documents in vector databases can manipulate retrieval results, and embedding spaces can be exploited to surface malicious content.
LLM09: Misinformation encompasses the dangers of trusting LLM outputs without verification. Models can propagate false information confidently, and overreliance on unverified outputs can lead to real-world harm in critical applications.
LLM10: Unbounded Consumption occurs when applications allow excessive or uncontrolled resource usage. This enables denial of service attacks, financial exploitation through API cost inflation, and model theft through excessive querying.
Why Traditional Defenses Fall Short
Traditional perimeter defenses fail against prompt injection because the attack vector operates at the semantic layer, not the network or application layer. A WAF cannot distinguish between a legitimate question and a carefully crafted manipulation because both appear as valid natural language text.
The key insight from recent research is that static example attacks—single string prompts designed to bypass systems—are almost useless for evaluating defenses. A team including representatives from OpenAI, Anthropic, and Google DeepMind subjected 12 published defenses to adaptive attacks and found that attackers with sufficient computational resources can eventually bypass most current safety measures through power-law scaling behavior. This suggests robust defense requires fundamental architectural approaches rather than incremental improvements to post-training safety.
Defense Layer 1: Input Validation and Sanitization
The first line of defense filters and analyzes inputs before they reach the LLM. While no input filter can block all attacks, this layer catches obvious manipulation attempts and provides valuable telemetry about attack patterns.
Pattern-Based Detection
Input sanitization involves scanning user inputs for suspicious patterns that commonly appear in injection attempts. These include phrases like "ignore previous instructions," role hijacking attempts such as "you are now" or "pretend to be," and special tokens that might manipulate model behavior. However, the goal is not necessarily to block all flagged content—sophisticated attacks can rephrase these patterns—but to detect and log potential threats for monitoring while continuing to process the request with additional scrutiny.
Unicode normalization is equally important. Attackers may use zero-width characters, Unicode confusables (Cyrillic characters that look like Latin letters), or various encoding tricks to hide malicious content. Normalizing inputs to a canonical form before analysis ensures these obfuscation techniques don't bypass detection.
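The two steps above can be sketched as follows. The pattern list and the zero-width character set are illustrative, not exhaustive; note that NFKC normalization folds compatibility forms such as fullwidth letters, but mapping Cyrillic confusables to Latin requires a dedicated confusables table, which this sketch omits. Matches are returned for logging rather than hard blocking.

```python
import re
import unicodedata

# Illustrative pattern list; production systems maintain a much larger,
# regularly updated set.
SUSPICIOUS_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\b",
    r"pretend\s+to\s+be\b",
]

# Zero-width space/joiners and BOM, mapped to None for removal.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])

def normalize_input(text: str) -> str:
    """Canonicalize Unicode (NFKC folds compatibility forms) and strip
    zero-width characters so obfuscation can't hide from the scanner."""
    return unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)

def flag_suspicious(text: str) -> list[str]:
    """Return matched patterns; callers log these rather than hard-block."""
    normalized = normalize_input(text).lower()
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, normalized)]
```

The scanner runs on the normalized form, so an injection split with zero-width characters matches the same pattern as its plain-text variant.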
Encoding Attack Prevention
A common evasion technique involves encoding malicious prompts in base64, URL encoding, or hexadecimal. The actual attack payload becomes invisible to simple pattern matching but gets decoded when the LLM processes it. Defense requires checking for suspicious encoded content, optionally decoding it for analysis, and flagging inputs that contain encoded strings resembling injection attempts.
The challenge is balancing security with legitimate use cases—developers discussing base64 encoding or users sharing encoded data should not be blocked. The solution is tiered response: log and monitor suspicious patterns, require additional verification for high-risk detections, and only block the most clearly malicious inputs.
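A minimal decode-and-rescan illustration for base64 payloads. The 24-character threshold and the idea of passing decoded fragments back through an existing pattern scanner (`flag_fn` here stands for any callable returning a list of flags) are assumptions to tune per application:

```python
import base64
import binascii
import re

# Long runs of base64 alphabet characters, optionally padded.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def decoded_fragments(text: str) -> list[str]:
    """Best-effort decode of base64-looking runs so they can be re-scanned."""
    out = []
    for run in B64_RUN.findall(text):
        try:
            out.append(base64.b64decode(run, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 or not text: ignore
    return out

def scan_encoded(text: str, flag_fn) -> list[str]:
    """Apply an existing pattern scanner to any decodable fragments."""
    flags = []
    for fragment in decoded_fragments(text):
        flags.extend(flag_fn(fragment))
    return flags
```

Because decoding is best-effort and flags feed a tiered response rather than an outright block, legitimate encoded data passes through while hidden injection attempts surface in monitoring.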
Input Complexity Limits
Resource exhaustion attacks exploit the computational cost of LLM inference. Token bombing involves sending extremely long inputs that consume expensive processing resources. Repetition attacks fill inputs with repeated patterns that models struggle to handle efficiently. Defense includes setting hard limits on character counts, token counts, line counts, and detecting excessive repetition through n-gram analysis.
A well-designed limit system calculates the repetition ratio by comparing unique n-grams to total n-grams. If a significant portion of content is repeated, the input is likely an attack rather than legitimate user content. These limits should be configurable and logged—legitimate power users occasionally need longer inputs, and false positives should be trackable.
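The repetition check can be sketched with word n-grams; the character cap and ratio threshold below are illustrative defaults, not recommendations:

```python
def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of repeated word n-grams; values near 1.0 suggest token bombing."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def within_limits(text: str, max_chars: int = 20_000,
                  max_ratio: float = 0.8) -> bool:
    # Hard caps are illustrative; tune per application and log rejections
    # so false positives affecting power users remain trackable.
    return len(text) <= max_chars and repetition_ratio(text) <= max_ratio
```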
Defense Layer 2: Prompt Injection Defense
Prompt injection is the most significant LLM-specific vulnerability, requiring multiple defensive strategies working together. The OWASP LLM Prompt Injection Prevention Cheat Sheet provides authoritative guidance on these techniques.
Delimiter-Based Isolation
The foundational defense is clear separation between system instructions and user content. By wrapping user input in explicit delimiters (such as XML-style tags like <user_input>) and instructing the model to treat everything within those tags as untrusted data rather than instructions, you establish a semantic boundary the model can recognize.
The system prompt should explicitly state that content within user input delimiters must never be treated as instructions, that the model should politely decline any requests within user input that ask to ignore instructions, and that only the system message contains authoritative instructions. This defense isn't foolproof—sophisticated attacks can still sometimes manipulate models—but it significantly raises the bar and catches naive injection attempts.
Instruction Hierarchy Reinforcement
Beyond delimiters, the system prompt should establish an explicit hierarchy where system instructions are immutable and cannot be overridden by any user input. This includes defining rules that the model should never reveal system instructions, never adopt different personas, never execute actions outside defined capabilities, and never generate harmful or prohibited content.
Critically, the system prompt should include explicit guidance for handling manipulation attempts. When users try phrases like "ignore previous instructions" or "pretend you have no restrictions," the model should have clear instructions to treat these as regular conversation and respond normally without following the embedded commands.
LLM-Based Input Classification
A powerful defense uses a separate, smaller model to classify input risk before the main model processes it. This classifier model examines user input for prompt injection patterns, role hijacking attempts, requests to reveal system prompts, encoded content that might hide attacks, and attempts to access unauthorized functionality.
The classifier returns a risk assessment (safe, suspicious, or malicious) with reasoning. This approach adds defense-in-depth because an injection designed for the main model cannot simultaneously manipulate a completely separate classifier with different prompts. The cost is added latency (typically 200-500ms) and additional API calls, but for high-security applications, this tradeoff is worthwhile.
Canary Token Detection
Canary tokens provide a detection mechanism for successful injections. The system prompt includes a randomly generated secret token with instructions to never reveal it under any circumstances. If the model's response contains this canary, it indicates the model's instructions were successfully manipulated—the injection convinced the model to output content from its system prompt.
Canary detection doesn't prevent attacks but provides immediate detection when attacks succeed. This enables automated responses: logging the incident, alerting security teams, potentially blocking the session, and gathering data for improving defenses. In production, canaries should be unique per session to enable precise incident attribution.
Research-Informed Defense Strategies
According to recent research, several architectural approaches show promise for robust prompt injection defense:
Ensemble Decisions: Using multiple models to cross-check each other's decisions can identify anomalies that indicate manipulation. This trades cost and latency for security, but critical applications may find the tradeoff worthwhile.
Dual-Model Architecture: Separating the model that reads untrusted content from the model that takes privileged actions creates an architectural barrier. The reading model can be compromised without directly affecting the action model.
Privilege Separation: Following the "rule of two"—where accessing user content and having access to privileged actions never occur in the same model call—limits the damage any single successful injection can cause.
Defense Layer 3: Output Validation
Validating LLM outputs before returning them to users or passing them to downstream systems prevents several classes of vulnerabilities.
Content Filtering Strategies
Output filtering checks responses for content that violates policies. This includes detecting patterns indicating harmful content (instructions for dangerous activities), identifying when responses reveal system prompts or internal information, and catching hallucinated credentials, URLs, or other fabricated sensitive data.
A 2025 comparative study by Palo Alto Networks found that when malicious prompts bypass input filters and models generate harmful content, output filters sometimes fail to intercept these responses. This underscores the importance of defense-in-depth—output filtering should not be the only defense, but it catches failures in earlier layers.
The recommended approach is focusing on output moderation rather than just input filtering. Ultimately, we care about what the model outputs. Input-based filters can be sidestepped by clever rephrasings, but by evaluating the model's answer, moderators catch policy violations even if the query appeared innocuous.
Structured Output Validation
When using structured outputs (JSON, function calls), validation should go beyond schema conformance to include security checks. URLs should be validated to ensure they use allowed schemes (http/https only) and don't point to internal addresses (localhost, private IP ranges). Generated code should be scanned for dangerous patterns like shell execution, eval statements, or network operations. File paths should be checked for directory traversal attempts.
Pydantic validators in Python or similar validation frameworks in other languages enable declarative security rules that apply automatically whenever structured outputs are parsed. This catches issues regardless of which part of the application generates the output.
Response Consistency Checking
Beyond content filtering, responses should be checked for behavioral consistency. Indicators that a model's behavior has been manipulated include phrases like "I am now," "ignoring previous instructions," references to jailbreak personas, or sudden changes in communication style.
Topic relevance checking ensures responses stay within expected bounds. A customer service bot should not provide detailed technical instructions outside its domain, even if prompted. Drift detection identifies when responses deviate significantly from the expected topic or tone.
LLMs as Content Judges
An emerging pattern uses LLMs themselves for content moderation. Experiments show that models like GPT-4o Mini and Claude 3.5 Haiku achieved 80% accuracy in detecting harmful content. LLM judges recognize subtle manipulations and understand context, catching harmful content that evades pattern-based detection.
The FLAME system demonstrates this approach can significantly improve security—it reduced jailbreak attack success on GPT-4 by 9x with negligible performance impact. The key is using a separate LLM call with specific moderation instructions to evaluate outputs before delivery.
Defense Layer 4: PII Detection and Protection
Protecting personally identifiable information (PII) in both inputs and outputs is critical for compliance (GDPR, CCPA) and user trust. LLMs create unique PII risks because they learn from vast datasets and may inadvertently memorize and leak sensitive information.
Microsoft Presidio for PII Detection
Microsoft Presidio is the leading open-source framework for detecting and anonymizing PII. The framework provides two core components: presidio-analyzer for detecting PII entities using NLP and pattern matching, and presidio-anonymizer for redacting, replacing, hashing, or encrypting detected entities.
Presidio detects a comprehensive set of entity types including names, email addresses, phone numbers, credit card numbers, social security numbers, IP addresses, and locations. It uses spaCy for named entity recognition, with the en_core_web_lg model recommended for high accuracy in English.
Beyond standard PII, custom patterns can detect application-specific sensitive data: API keys (patterns like sk- or api_key_), AWS access keys, GitHub tokens, Slack tokens, and other credentials that appear in developer-facing applications.
PII Protection Strategies
Input Protection: Scan user inputs for PII before sending to the LLM. Detected PII can be logged for monitoring (without storing the actual values), redacted with placeholders like [EMAIL] or [PHONE], or used to trigger user warnings about sharing sensitive information.
Output Protection: Scan LLM responses before returning to users. This catches cases where the model generates PII from its training data, hallucinates realistic-looking personal information, or surfaces PII from RAG-retrieved documents. Output PII should generally be redacted automatically.
RAG Document Protection: For retrieval-augmented systems, apply PII detection to retrieved documents before adding them to the context window. This prevents the model from seeing raw PII from internal documents, reducing the risk of it appearing in responses.
Fine-tuning Data Protection: Before using custom datasets for fine-tuning, perform comprehensive PII scanning and anonymization. This prevents the model from baking sensitive information into its weights, which would cause persistent leakage that's extremely difficult to remediate.
Anonymization Methods
Presidio supports multiple anonymization strategies depending on use case:
Redaction replaces PII with type labels like [EMAIL_REDACTED], completely removing the sensitive value while preserving context about what was there.
Masking partially obscures values, useful when some information should remain visible—for example, showing the last four digits of a phone number.
Hashing applies SHA-256 or other algorithms, enabling consistent replacement (the same value always produces the same hash) while preventing recovery of the original.
Encryption protects PII with AES or similar algorithms, allowing authorized systems to decrypt and access the original values when legitimately needed.
LLM-Enhanced PII Detection
Recent developments explore using LLMs to enhance PII detection. LLMs excel at identifying contextual PII—information that's sensitive because of context rather than format. "He recently got divorced" contains no traditional PII patterns but reveals sensitive personal information. LLM-based detection can catch these cases that pattern matching misses.
Defense Layer 5: Guardrails Frameworks
Production LLM applications benefit from dedicated guardrails frameworks that provide pre-built protections with configurable policies. Two leading solutions dominate this space.
NVIDIA NeMo Guardrails
NeMo Guardrails is an open-source toolkit for adding programmable guardrails to LLM applications. It provides a declarative way to control LLM behavior including content moderation, topic restrictions, response style enforcement, and structured data extraction.
NeMo supports five main guardrail types:
- Input rails validate and filter user inputs before processing
- Output rails check and filter model responses before delivery
- Dialog rails influence how the LLM is prompted
- Retrieval rails filter chunks in RAG scenarios before context assembly
- Execution rails control tool and function execution
The framework includes built-in guardrails for LLM self-checking (input/output moderation, fact-checking, hallucination detection), NVIDIA safety models (content safety, topic safety), and jailbreak detection. Recent benchmarks show that orchestrating up to five GPU-accelerated guardrails in parallel increases detection rate by 1.4x while adding only ~0.5 seconds of latency.
2025 Updates: NeMo Guardrails now integrates with Cisco AI Defense and Trend Micro AI Guard, expanding third-party security integrations. The library also adopted stricter type annotations and dropped Python 3.9 support ahead of its EOL.
Lakera Guard
Lakera Guard functions as an AI firewall specifically focused on security, protecting LLMs from adversarial prompts, data leakage, and dangerous content. Unlike NeMo's broader focus on controllability, Lakera specializes in security-specific protections.
Lakera's API classifies prompts as benign or malicious using ML models trained specifically for LLM attack detection. The service has been recognized as a Representative Vendor in Gartner's AI Trust, Risk and Security Management (AI TRiSM) report.
2025 Focus: Lakera has expanded focus to agentic AI threats, addressing how tool overreach and uncontrolled browsing create new exploit paths. Their "Agentic AI Threats" series documents emerging attack vectors specific to AI agents with tool access.
Guardrails vs. AI Firewalls
Understanding the distinction helps choose the right tool:
Guardrails (like NeMo) focus on ethical and responsible use—ensuring outputs align with guidelines, maintaining topic relevance, enforcing response formats, and preventing unintended behaviors. They're about control and policy enforcement.
AI Firewalls (like Lakera Guard) focus specifically on security—detecting attacks, preventing data leakage, blocking jailbreaks. They're about threat protection.
Most production applications benefit from both: guardrails for behavioral control and compliance, plus security-focused filtering for threat protection. These tools are complementary rather than competitive.
Azure AI Content Safety
Azure AI Content Safety provides enterprise-grade content filtering with configurable severity thresholds across hate, violence, self-harm, and sexual content categories. Additional features include:
- Custom blocklists for organization-specific prohibited terms
- PII detection with predefined and custom regex patterns
- Grounding checks to filter hallucinations and ungrounded content
- Jailbreak attack detection
- Configurable severity levels allowing nuanced policy enforcement
For organizations already using Azure, Content Safety integrates seamlessly with Azure OpenAI Service and provides comprehensive logging for compliance requirements.
Defense Layer 6: Rate Limiting and Abuse Prevention
Protecting against resource exhaustion, cost attacks, and automated abuse requires sophisticated rate limiting beyond simple request counts.
Token-Based Rate Limiting
Traditional rate limiting counts requests per time window. For LLM applications, token-based rate limiting is more appropriate because request cost varies dramatically—a 10-token prompt costs far less than a 10,000-token prompt with multiple tool calls.
Effective rate limiting tracks multiple dimensions:
- Requests per minute (prevents rapid-fire attacks)
- Tokens per minute (prevents cost attacks)
- Tokens per day (prevents sustained abuse)
- Concurrent requests (prevents resource monopolization)
Rate limits should be configurable per user tier, with higher limits for authenticated premium users and restrictive limits for anonymous or trial users. When limits are exceeded, return clear error messages with retry-after headers so legitimate users know when to retry.
Abuse Pattern Detection
Beyond rate limits, analyze request patterns to detect abuse:
Rapid request patterns: Many requests in short windows suggest automated attacks even if individual windows stay under limits.
Repeated content: The same or very similar prompts sent repeatedly may indicate automated probing for injection vulnerabilities.
Token bombing: Sudden spikes in token usage compared to a user's baseline suggest attempts to inflate costs or exhaust resources.
Error rate anomalies: High rates of blocked requests or errors indicate someone testing system boundaries.
Each abuse signal has a severity weight. Combining signals produces a risk score that can trigger responses from increased monitoring to temporary suspension. The goal is identifying malicious automation while allowing legitimate burst usage.
Best Practices for Rate Limiting
Target attack detection within 15 minutes and enable automated containment within 5 minutes. These benchmarks from enterprise security guidance balance responsiveness against false positives.
Implement tiered responses: Don't immediately block suspicious activity. Progress through increased monitoring, warning messages, slower response times (adding artificial latency), temporary restrictions, and finally blocking. This graduated response protects legitimate users while still defending against attacks.
Log everything: Rate limit events provide valuable data for adjusting limits, identifying attack patterns, and demonstrating compliance. Include user identifiers, request metadata (but not full content for privacy), and the specific limit triggered.
Defense Layer 7: Tool and Agent Security
When LLMs can execute actions through tools or function calls, the attack surface expands dramatically. A successful injection could trigger unauthorized actions in connected systems.
Permission Systems
Tools should operate under explicit permission systems that enforce:
Operation restrictions: A read tool shouldn't also allow writes. Separate read, write, create, and delete permissions and grant only what's needed.
Resource restrictions: Pattern-based rules limit which resources each tool can access. A customer service agent should only access that customer's data, not arbitrary records.
Role-based access: Different user roles get different tool permissions. Admin tools should not be available to standard users even if an injection tries to invoke them.
Confirmation requirements: High-risk actions (deleting data, sending emails, making purchases) should require explicit user confirmation that cannot be bypassed by the LLM.
Privilege Separation Architecture
Following security best practices, the lowest level of privilege across all entities that have contributed to the LLM prompt should be applied to each subsequent service call. This means:
- Tool templates should be parameterized wherever possible
- External service calls must be strictly parameterized (no string concatenation of user input into queries)
- API keys and credentials for tools should have minimal scope
- Each tool call should be authorized independently, not assumed valid because the conversation started legitimately
Sandboxed Code Execution
If your application allows LLM-generated code execution (for data analysis, calculations, etc.), sandboxing is essential:
Resource limits: Cap memory usage, CPU time, and file system access. A runaway computation shouldn't affect other users.
Network isolation: Generated code should not be able to make network requests unless explicitly required and allowed.
File system restrictions: Limit access to temporary directories, blocking sensitive system paths.
Dangerous pattern blocking: Block known dangerous operations (shell execution, eval, import of dangerous modules) at the code analysis level before execution.
Even with sandboxing, treat code execution as inherently risky and limit which users and use cases can access it.
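A static pre-check with Python's ast module can implement the pattern-blocking step. The blocklists below are illustrative and deliberately incomplete, and static analysis complements rather than replaces OS-level isolation:

```python
import ast

# Illustrative blocklists; real sandboxes pair this with resource limits
# and network/filesystem isolation.
BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}
BANNED_MODULES = {"os", "subprocess", "socket", "ctypes"}

def code_is_safe(source: str) -> bool:
    """Static pre-check of LLM-generated Python before sandboxed execution."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BANNED_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BANNED_MODULES:
                return False
        elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            return False
    return True
```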
Human-in-the-Loop (HITL)
For high-risk operations, implement human approval workflows. When the system detects a high-risk request—based on action type, resource sensitivity, or anomaly detection—queue it for human review rather than automatic execution.
The HITL pattern is especially important for:
- Actions that are irreversible (deleting data, sending communications)
- Actions with financial impact (purchases, transfers)
- Actions affecting multiple users (bulk operations)
- First-time actions for a user (establishing baseline trust)
Defense Layer 8: Audit Logging and Monitoring
Comprehensive logging enables security monitoring, incident response, and compliance demonstration.
Security Event Logging
Log security-relevant events including:
Input events: Sanitization applied, inputs blocked, injection patterns detected, PII in inputs
Output events: Content filtered, canary leaks detected, harmful content caught
Access events: Rate limits triggered, tool access denied or granted, tool executions
Abuse events: Abuse patterns detected, users blocked or restricted
Each event should include timestamp, user identifier, session identifier, request identifier, severity level, and relevant details. Input hashes (not full content) enable correlation without storing sensitive data.
Request Tracing
Distributed tracing through the security pipeline enables understanding of how requests flow through each defensive layer. A trace captures:
- Time spent in each layer (input validation, classification, LLM call, output filtering)
- Security events triggered at each layer
- Attributes like token counts, PII entities detected, filter actions taken
This data supports both security monitoring and performance optimization. If latency spikes, traces identify which layer is the bottleneck. If attacks succeed, traces show which layers failed to detect them.
Alerting and Response
Configure alerts for critical security events:
- Canary token leaks (immediate investigation needed)
- Sustained high abuse scores from a single source
- Novel attack patterns not matching known signatures
- Unusual spikes in any security event category
Integrate with existing SIEM/SOAR platforms for centralized security monitoring. LLM application security events should flow into the same systems that monitor traditional application security.
Putting It Together: Defense-in-Depth Architecture
Effective LLM security combines all these layers into a cohesive pipeline. Here's how a request flows through a properly secured system:
1. Rate Limiting: Check if the user is within rate limits before any processing. Reject early to minimize cost of attacks.
2. Abuse Detection: Analyze request patterns for abuse signals. Flag suspicious requests for enhanced monitoring.
3. Input Sanitization: Normalize Unicode, check length limits, flag suspicious patterns. Log findings but don't necessarily block.
4. Input Classification: Use a separate model to classify injection risk. Block clearly malicious inputs; flag suspicious ones.
5. PII Detection: Scan for sensitive information. Redact before sending to the LLM if configured to do so.
6. Prompt Construction: Build the prompt with clear delimiters, instruction hierarchy, and canary tokens.
7. LLM Call: Send to the model with appropriate parameters.
8. Output Filtering: Check the response for harmful content, PII, and policy violations. Redact or block as configured.
9. Response Consistency: Verify the response maintains expected behavior patterns and hasn't been manipulated.
10. Logging and Tracing: Record the full trace including all security events for monitoring and compliance.
Each layer provides value independently, but the combination is far more robust than any single defense. An attack that bypasses input classification might be caught by output filtering. An injection that evades pattern matching might trigger canary detection. Defense-in-depth means multiple layers must fail simultaneously for an attack to succeed.
Security Testing and Continuous Improvement
Security is not a one-time implementation but an ongoing process.
Regular Testing
Test your defenses regularly with:
- Known attack patterns: Maintain a library of prompt injection examples and verify they're detected
- Canary leakage tests: Attempt to extract canary tokens to verify detection works
- PII injection tests: Verify PII detection catches expected entity types
- Rate limit verification: Confirm limits trigger at expected thresholds
- Tool permission tests: Attempt unauthorized tool access to verify permissions
Red Team Exercises
Periodic red team exercises with security professionals who attempt to bypass your defenses provide invaluable insights. External perspectives identify blind spots internal teams miss.
Monitoring and Adaptation
Monitor security metrics continuously:
- Detection rates for each attack category
- False positive rates affecting legitimate users
- Novel attack patterns not in your test library
- Performance impact of security layers
Use this data to tune detection thresholds, add new patterns, and balance security against user experience.
Sources
- OWASP Top 10 for LLM Applications 2025
- OWASP LLM Prompt Injection Prevention Cheat Sheet
- Microsoft Presidio Documentation
- NVIDIA NeMo Guardrails
- Lakera AI Security Blog
- Palo Alto Networks: Comparing LLM Guardrails Across GenAI Platforms
- Simon Willison: New Prompt Injection Papers
- Ensuring LLM Outputs Adhere to Content Guidelines - Literature Review
- LLM Security in 2025: Risks, Examples, and Best Practices