
LLM Debugging & Troubleshooting: A Practical Guide for AI Engineers

Comprehensive guide to debugging LLM applications. Covers common failure patterns, systematic diagnosis approaches, prompt debugging, retrieval troubleshooting, observability practices, and production debugging workflows.


LLM Debugging & Troubleshooting

LLM applications fail differently than traditional software. There's no stack trace pointing to a broken line of code: a model can return grammatically correct, confident text that's factually hallucinated, contextually irrelevant, or completely off-task. The failures are nuanced and often silent.

This guide provides practical approaches to debugging LLM applications: identifying common failure patterns, systematic diagnosis workflows, tooling for observability, and strategies for fixing issues once found.


The Nature of LLM Failures

Why LLM Debugging Is Different

Traditional debugging follows deterministic logic: given the same inputs, you get the same outputs and errors. LLM debugging faces unique challenges:

Probabilistic outputs: The same input can produce different outputs across runs due to temperature settings and model behavior.

Silent failures: An LLM application can return a grammatically correct, confident response that is factually hallucinated, contextually irrelevant, or unsafe. No error is thrown—the failure is semantic.

Distributed causality: Failures may result from prompt issues, retrieval problems, model limitations, or their interaction. Isolating the cause requires checking multiple components.

Non-reproducibility: Due to randomness and model updates, reproducing exact failure conditions can be difficult.

Common Failure Categories

LLM failures cluster into recognizable patterns:

Hallucinations: The model generates content not grounded in source material or factual reality. In RAG systems, this often means the model failed to adhere to retrieved context.

Retrieval failures: In RAG systems, the model is only as good as its context. Either relevant documents weren't retrieved (missed retrieval) or irrelevant documents were included (noise injection).

Prompt sensitivity: Small changes in prompt wording, structure, or context produce vastly different outputs. Poorly engineered prompts cause misinterpretation.

Context window issues: The model "forgets" instructions in the middle of long prompts—the "lost in the middle" phenomenon.

Format/structural errors: The application expects structured output (JSON, SQL, code) but receives unstructured text or malformed content.

Tool/function calling errors: The model calls the wrong function, passes invalid arguments, or fails to call functions when it should.


Systematic Diagnosis

The Debugging Workflow

When an LLM application fails, follow a systematic approach:

  1. Reproduce: Can you reliably reproduce the failure?
  2. Isolate: Which component is failing (prompt, retrieval, model, post-processing)?
  3. Analyze: What specifically went wrong in that component?
  4. Fix: Address the root cause
  5. Verify: Confirm the fix works and doesn't introduce regressions

Reproduction Strategies

Non-reproducibility is the bane of LLM debugging. To reproduce failures:

Fix randomness: Set temperature to 0 during investigation to isolate logic errors from creativity variance. If the provider supports seeding, use a fixed seed.
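
A minimal sketch of a reproduction harness, assuming the OpenAI Python SDK (the seed parameter is best-effort and only honored by some models and providers):

# Pin down randomness while investigating a failure.
from openai import OpenAI

client = OpenAI()

def reproduce(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # remove creativity variance
        seed=42,        # best-effort determinism where supported
    )
    return response.choices[0].message.content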

Log everything: Ensure you have complete logs of the failing interaction—input, context, prompt, raw model output, processed output.

Snapshot state: Capture the exact state at failure time—which model version, which retrieval results, which prompt version.

Create test cases: Convert reproduction steps into automated tests for regression prevention.

Component Isolation

LLM applications have multiple potential failure points. Isolate by testing components independently:

Input processing: Is the input being parsed and preprocessed correctly?

Retrieval (RAG): Are the right documents being retrieved? Check retrieval results before they reach the model.

Prompt construction: Is the full prompt assembled correctly? Inspect the actual prompt sent to the model.

Model inference: Given the prompt, is the model response reasonable? Test the prompt directly in a playground.

Output parsing: Is the model output being parsed correctly? Check raw output before post-processing.

Post-processing: Are downstream steps handling the parsed output correctly?

Test each component with known good inputs to identify where failures occur.


Debugging by Failure Type

Hallucination Debugging

When the model generates unfounded content:

Check context: Did the model have the information needed to answer correctly? Inspect retrieved documents or provided context.

Check faithfulness: Compare the model's claims against the provided context. Are claims grounded in the context?

Check prompt instructions: Does the prompt clearly instruct the model to stay grounded? Add explicit instructions like "Only use information from the provided context."

Check temperature: High temperature increases creativity but also hallucination risk. Lower temperature for factual tasks.

Implement verification: Use a separate call to verify claims against sources, or implement retrieval-augmented verification.
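
One way to implement such a verification step is a second, deterministic call that checks whether the answer is supported by the retrieved context. A minimal sketch, with an illustrative prompt and model name:

# Post-hoc faithfulness check: ask a second model call whether the answer
# is supported by the provided context.
from openai import OpenAI

client = OpenAI()

VERIFY_PROMPT = """Context:
{context}

Answer to check:
{answer}

Does the answer contain any claim that is NOT supported by the context?
Reply with exactly one word: GROUNDED or UNGROUNDED."""

def is_grounded(context: str, answer: str) -> bool:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": VERIFY_PROMPT.format(context=context, answer=answer)}],
    )
    return result.choices[0].message.content.strip().upper().startswith("GROUNDED")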

Retrieval Debugging

In RAG systems, failures often trace to retrieval:

Missed retrieval: Relevant documents exist but weren't retrieved.

  • Check embedding quality: Is the query embedding similar to relevant document embeddings?
  • Check chunking: Were documents chunked appropriately? Is relevant content split across chunks?
  • Check k value: Are you retrieving enough documents?
  • Check metadata filtering: Are filters inadvertently excluding relevant documents?

Noise injection: Irrelevant documents retrieved, distracting the model.

  • Check similarity thresholds: Are you accepting low-similarity results?
  • Check reranking: Would a reranker help filter noise?
  • Check query transformation: Could query rewriting improve precision?

Debug retrieval separately: Query your retrieval system directly with test queries. Verify results before involving the LLM.
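
A small probe script for this, where vector_store.search is a stand-in for whatever client your RAG stack actually uses (Chroma, pgvector, Pinecone, and so on):

# Probe retrieval in isolation, before any LLM is involved.
TEST_QUERIES = [
    "What is the refund window?",
    "Q4 report deadline",
]

def inspect_retrieval(vector_store, k: int = 8) -> None:
    for query in TEST_QUERIES:
        hits = vector_store.search(query, k=k)  # expected: [(doc_id, score, text), ...]
        print(f"\nQuery: {query}")
        for doc_id, score, text in hits:
            print(f"  {score:.3f}  {doc_id}  {text[:80]!r}")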

Prompt Debugging

Poorly engineered prompts cause misinterpretation or incomplete responses:

Ambiguity: The prompt can be interpreted multiple ways. Make instructions explicit and unambiguous.

Missing context: The prompt assumes knowledge the model doesn't have. Add necessary background.

Conflicting instructions: Different parts of the prompt give conflicting guidance. Ensure consistency.

Poor structure: Instructions are buried or disorganized. Use clear sections and formatting.

Wrong persona/tone: The system prompt sets an inappropriate persona. Adjust tone and role.

Debug by simplification: Strip the prompt to its minimal form. Add complexity back incrementally to identify which addition causes problems.

Format/Structure Debugging

When the model fails to produce valid structured output:

Check schema clarity: Is the expected format clearly specified in the prompt?

Check examples: Do few-shot examples demonstrate the exact format expected?

Check complexity: Is the required structure too complex? Simplify if possible.

Use structured outputs: Enable strict mode or constrained decoding if available.

Implement validation: Parse and validate output, providing error feedback for retry.
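
A minimal validation-and-retry loop might look like the following, where call_model stands in for your LLM client and the Pydantic schema is just an example shape:

import json
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    customer: str
    total: float

def extract_invoice(call_model, document: str, max_retries: int = 2) -> Invoice:
    messages = [{"role": "user", "content": f"Extract customer and total as raw JSON:\n{document}"}]
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return Invoice.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the validation error back so the model can correct itself.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"That failed validation: {err}. Return corrected raw JSON only."})
    raise ValueError("Model never produced valid structured output")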

Tool Calling Debugging

When function calling goes wrong:

Wrong tool selected: Tool descriptions may be ambiguous. Improve descriptions to clarify when each tool applies.

Invalid arguments: The model may not understand parameter requirements. Add parameter descriptions and constraints.

Missing tool calls: The model didn't recognize a function-calling opportunity. Check if the prompt encourages tool use.

Excessive tool calls: The model calls tools unnecessarily. Add guidance on when NOT to use tools.
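
To make tool scope unambiguous, put it directly in the tool schema. A sketch in the OpenAI-style function-calling format (names and wording are illustrative):

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": (
                "Search internal company data: customer records, orders, products, "
                "inventory, internal metrics. Use for anything about our own data."
            ),
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "Search terms"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": (
                "Search the public internet. Use only for general knowledge or current "
                "events NOT available in internal systems. Do not use for customer or order data."
            ),
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "Search terms"}},
                "required": ["query"],
            },
        },
    },
]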


Observability Infrastructure

The Case for Tracing

You cannot debug what you cannot see. Traditional logging is insufficient for complex AI applications. Distributed tracing provides the visibility needed.

Traces: Represent the full lifecycle of a request—from user input through retrieval, model calls, tool execution, and response.

Spans: Individual units of work within a trace (a retrieval query, an LLM call, a tool invocation). Each span captures timing, inputs, outputs, and metadata.
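
In production you would use a tracing platform (several are listed below), but the data model is simple enough to sketch: each span records a name, inputs and outputs, metadata, and timing.

# Framework-agnostic sketch of span capture; real systems use Langfuse,
# LangSmith, or OpenTelemetry instead of an in-memory list.
import time
import uuid
from contextlib import contextmanager

TRACE: list[dict] = []

@contextmanager
def span(name: str, **metadata):
    record = {"id": str(uuid.uuid4()), "name": name, "metadata": metadata, "start": time.time()}
    try:
        yield record  # callers attach inputs/outputs to the record
    finally:
        record["duration_s"] = time.time() - record["start"]
        TRACE.append(record)

# Usage inside a request handler:
# with span("retrieval", query=user_query) as s:
#     docs = retriever.search(user_query)
#     s["output"] = [d.id for d in docs]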

What to Log

For effective debugging, log:

Inputs:

  • Original user input
  • Processed/cleaned input
  • Query transformations

Retrieval (for RAG):

  • Query embedding
  • Retrieved document IDs and scores
  • Retrieval latency

Prompt:

  • Full assembled prompt
  • Prompt version/template
  • Token count

Model call:

  • Model name and version
  • Parameters (temperature, max tokens)
  • Raw response
  • Token usage
  • Latency

Output processing:

  • Parsing results
  • Validation results
  • Final output

Errors:

  • Error type and message
  • Stack trace
  • Context at failure

Tracing Tools

Langfuse: Open-source observability for LLM applications. Traces, prompt management, evaluation integration.

LangSmith: LangChain's observability platform. Deep integration with LangChain, tracing, evaluation.

Weights & Biases: ML experiment tracking extended to LLM applications.

Arize Phoenix: Open-source tracing with embedding visualization for debugging retrieval.

Datadog/New Relic: Traditional APM tools with LLM-specific extensions.

Building Debug Views

Create dashboards that answer common debugging questions:

  • What were the retrieval results for this query?
  • What was the full prompt sent to the model?
  • How did the model's response differ from expected?
  • What was the latency breakdown across components?
  • What errors occurred and at what rate?

Error Analysis

To improve your LLM app, you must understand how it fails. Aggregate metrics don't tell you if your system retrieves wrong documents or if the model's tone alienates users.

Building an Error Taxonomy

Categorize failures systematically:

  • Factual errors: Wrong information
  • Relevance errors: Right information, wrong context
  • Format errors: Wrong structure or style
  • Omission errors: Missing important information
  • Instruction violations: Didn't follow explicit instructions
  • Safety violations: Generated inappropriate content

Root Cause Analysis

If you have traces with multiple errors, focus on the first failure. A single upstream error, like incorrect document retrieval, often causes multiple downstream issues. Fixing the root cause resolves them all.

For each failure:

  1. Identify the error category
  2. Trace back to the earliest point where something went wrong
  3. Determine why that component failed
  4. Identify the fix

Golden Dataset Building

Add failure cases to a "golden dataset" for evaluation. This ensures future versions don't repeat specific errors.

Label failures: Annotate why each response was bad (hallucination, missed constraint, wrong tool).

Create regression tests: Turn failure cases into tests that must pass.

Version the dataset: Track dataset changes alongside application changes.
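
A sketch of what such regression tests can look like, assuming a JSONL golden dataset and a run_app entry point (both illustrative):

import json
import pytest

from my_app import run_app  # hypothetical application entry point

with open("golden_dataset.jsonl") as f:
    CASES = [json.loads(line) for line in f]

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_golden_case(case):
    output = run_app(case["input"])
    for phrase in case.get("must_contain", []):
        assert phrase.lower() in output.lower()
    for phrase in case.get("must_not_contain", []):  # e.g. a known hallucination
        assert phrase.lower() not in output.lower()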


Common Fixes

Prompt Fixes

Fix context issues: Smart chunking (400-800 tokens with overlap), retrieve only top 5-8 chunks, use conversation summarization for long chats.

Add explicit constraints: If the model violates implicit rules, make them explicit. "Do not mention competitors" beats assuming the model knows.

Use structured prompting: Clear sections for instructions, context, and examples reduce ambiguity.

Add examples: Few-shot examples demonstrate expected behavior more effectively than descriptions.

Retrieval Fixes

Improve embeddings: Try different embedding models. Domain-specific fine-tuning can help.

Adjust chunking: Align chunk boundaries with semantic units. Experiment with chunk size.

Add hybrid search: Combine vector search with keyword search for better recall.

Implement reranking: Use cross-encoders to reorder retrieval results before passing to the model.
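
A minimal reranking sketch using sentence-transformers (the model name is one common choice, not a requirement):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, document) pair, then keep the top_n highest.
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]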

Query transformation: Rewrite or expand queries to improve retrieval.

Model-Level Fixes

Temperature adjustment: Lower temperature for factual tasks, higher for creative tasks.

Model selection: Different models have different strengths. Evaluate alternatives.

Instruction fine-tuning: If prompting doesn't work, consider fine-tuning for your specific use case.

System-Level Fixes

Validation and retry: Validate outputs and retry with feedback when validation fails.

Fallback strategies: If the primary approach fails, fall back to simpler alternatives.

Human escalation: Route low-confidence or high-stakes cases to human review.


Production Debugging

Live Issue Triage

When issues occur in production:

  1. Assess severity: How many users affected? What's the business impact?
  2. Identify pattern: Is this a new issue or recurring? Affecting specific query types?
  3. Check recent changes: Any recent deployments, prompt updates, or model changes?
  4. Gather samples: Collect examples of affected interactions.
  5. Implement quick mitigation: Can you route affected queries differently while investigating?

Debugging Tools

Centralized prompt management tools help organize, version, and deploy prompts. A/B testing and rollbacks allow experimenting with variants and quickly reverting if regressions are detected.

Prompt playgrounds: Test prompts interactively against live models.

Trace explorers: Drill into specific interactions to see full context.

Comparison views: Compare failing traces against successful ones.

Replay capability: Re-run historical inputs through updated systems.

Continuous Improvement

Monitor error rates: Track failure rates by category over time.

Review samples: Regularly sample and review production interactions.

Collect feedback: User feedback (explicit and implicit) indicates issues.

Iterate systematically: Use error analysis to prioritize improvements. Fix the most impactful issues first.


Concrete Failure Examples

Understanding failures in the abstract is different from recognizing them in production. This section walks through real-world failure patterns with specific examples and resolution approaches.

Example 1: The Confident Hallucinator

Symptom: A customer support RAG bot confidently tells users about a "30-day money-back guarantee" when the company only offers 14 days.

Investigation:

  • Checked retrieval: The refund policy document was retrieved correctly
  • Checked the retrieved content: It clearly states "14-day refund window"
  • Checked the prompt: No explicit grounding instruction

Root cause: The model was "pattern completing" based on common refund policies in its training data. Without explicit grounding instructions, it defaulted to what "sounds right" rather than what the context says.

Fix: Added explicit instruction: "Answer ONLY based on the provided context. If the exact information isn't in the context, say you don't have that specific information rather than guessing." Additionally, added a claim verification step that cross-references extracted facts against retrieved chunks.

Example 2: The Missing Retrieval

Symptom: Users ask "What's the deadline for the Q4 report?" and get "I don't have information about Q4 report deadlines" despite the deadline being documented.

Investigation:

  • Tested retrieval directly: Query "Q4 report deadline" returned 0 relevant results
  • Checked embeddings: The document uses "Q4 2024 Quarterly Financial Report Submission Timeline"—no exact match for "deadline"
  • Checked the document content: The word "deadline" never appears; it says "due date"

Root cause: Semantic gap between user vocabulary and document vocabulary. Users say "deadline," documents say "due date." Embedding similarity wasn't bridging this gap.

Fix: Implemented query expansion. Before retrieval, generate synonyms and related terms: "Q4 report deadline" → ["Q4 report deadline", "Q4 report due date", "quarterly report submission date", "Q4 filing timeline"]. Run retrieval on expanded queries and merge results.
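
A sketch of that expansion step, with an illustrative prompt and retriever.search standing in for the actual retrieval client:

from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 3) -> list[str]:
    # Ask the model for alternative phrasings of the user's query.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": f"Give {n} alternative phrasings of this search query, one per line, no numbering:\n{query}"}],
    )
    variants = [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]
    return [query] + variants[:n]

def retrieve_expanded(retriever, query: str, k: int = 8) -> list:
    # Run retrieval for every variant and merge results, deduplicating by id.
    seen, merged = set(), []
    for q in expand_query(query):
        for doc in retriever.search(q, k=k):
            if doc.id not in seen:
                seen.add(doc.id)
                merged.append(doc)
    return merged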

Example 3: The Format Breaker

Symptom: An extraction pipeline expecting JSON occasionally returns partial JSON or JSON wrapped in markdown code blocks.

Investigation:

  • Collected failing samples: ~8% of responses were malformed
  • Analyzed patterns: Failures correlated with longer input documents
  • Tested in playground: Same prompt with short input → valid JSON; long input → wrapped in ```json blocks

Root cause: As context lengthened, the model started mimicking chat/documentation format (code blocks) rather than raw JSON output. The system message's format instructions were being "forgotten" in the middle of long contexts.

Fix: Three-pronged approach: (1) Moved format instructions to end of prompt, closer to output; (2) Added explicit "Output raw JSON only. No markdown formatting, no code blocks, no explanation"; (3) Implemented output sanitization that strips markdown code block wrappers before JSON parsing.
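
The sanitization step from (3) is only a few lines; the regex below strips an optional markdown fence before parsing:

import json
import re

FENCE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)

def parse_model_json(raw: str) -> dict:
    # Strip a ```json ... ``` wrapper if the model added one, then parse.
    text = raw.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)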

Example 4: The Tool Misuse

Symptom: An agent with access to "search_database" and "search_web" tools consistently searches the web for questions that should query the internal database.

Investigation:

  • Logged tool selection: 73% of database-appropriate queries went to web search
  • Compared tool descriptions: "search_database: Search the company database" vs "search_web: Search the internet for information"
  • Tested with example queries: Model couldn't distinguish which tool applied

Root cause: Tool descriptions were too vague. The model defaulted to web search because it's more general-purpose.

Fix: Rewrote tool descriptions with explicit scope and examples:

  • "search_database: Search internal company data including customer records, orders, products, and employee information. Use for questions about specific customers, order status, product inventory, or internal metrics."
  • "search_web: Search the public internet. Use for general knowledge, current events, or information NOT available in our internal systems."

Example 5: The Context Window Overflow

Symptom: A document Q&A system works perfectly for small documents but returns irrelevant or incomplete answers for large documents.

Investigation:

  • Logged token counts: Large document queries were hitting ~95% of context limit
  • Analyzed response quality vs. context usage: Clear degradation above 80% context utilization
  • Tested information position: Questions about content in the middle of long contexts failed most often

Root cause: Classic "lost in the middle" combined with aggressive context stuffing. The system was retrieving maximum chunks to "give the model more context" but actually degrading quality.

Fix: Implemented adaptive retrieval: start with fewer, higher-relevance chunks; only expand if initial answer confidence is low. Also restructured prompts to place query-relevant context at beginning and end, with less relevant context in the middle.
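
A sketch of the adaptive loop, where retriever, answer_with_context, and confidence are stand-ins for your own components:

def adaptive_answer(retriever, answer_with_context, confidence, query: str,
                    k_initial: int = 4, k_max: int = 12, threshold: float = 0.7):
    # Start narrow; only widen retrieval when the answer looks uncertain.
    k = k_initial
    while True:
        chunks = retriever.search(query, k=k)
        answer = answer_with_context(query, chunks)
        if confidence(query, answer, chunks) >= threshold or k >= k_max:
            return answer
        k = min(k * 2, k_max)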


Debugging Workflow: A Step-by-Step Process

When an LLM application fails, follow this structured workflow to efficiently identify and resolve the issue.

Step 1: Capture the Failure

Before anything else, capture everything about the failing interaction:

Collect:

  • Exact user input
  • Timestamp
  • Session/conversation context if multi-turn
  • Full system prompt and user prompt
  • Retrieved documents (for RAG)
  • Raw model response
  • Processed/final response
  • Any error messages

Key question: Can you reproduce this failure with the captured information?

Step 2: Classify the Failure

Categorize what type of failure occurred:

  • Hallucination: Output contains claims not in the context or factually wrong
  • Retrieval miss: Relevant information exists but wasn't retrieved
  • Retrieval noise: Irrelevant content retrieved, confusing the model
  • Instruction violation: Model didn't follow explicit prompt instructions
  • Format error: Output structure doesn't match requirements
  • Tool error: Wrong tool called or invalid arguments
  • Context overflow: Long context causing degraded performance
  • Latent knowledge leak: Model using training data instead of provided context

Step 3: Isolate the Component

Test each component independently to identify where failure occurred:

Test retrieval independently:

  • Run the same query against your retrieval system directly
  • Inspect retrieved documents: Are relevant ones present? Are irrelevant ones contaminating?
  • Check similarity scores: Are relevant documents scoring high enough?

Test prompt independently:

  • Take the exact assembled prompt and test in a playground
  • Does the model produce correct output with the same prompt and context?
  • Try with perfect/curated context—does that fix it?

Test output parsing independently:

  • If raw model output is correct but final output is wrong, the parser is the issue
  • Check edge cases in parsing logic

Step 4: Form a Hypothesis

Based on isolation testing, form a specific hypothesis:

Example hypotheses:

  • "The retrieval query doesn't capture the user's intent; need query transformation"
  • "The context is correct but instructions are ambiguous; need clearer prompting"
  • "The model is using training knowledge instead of context; need stronger grounding"
  • "Output format degrades with long outputs; need length limits or chunked generation"

Step 5: Test the Hypothesis

Design a minimal experiment to test your hypothesis:

Controlled experiment:

  • Change ONE variable at a time
  • Test on the failing case AND similar cases
  • Test on cases that were working (regression check)

Example: If the hypothesis is "need stronger grounding instructions"

  • Add grounding instruction to prompt
  • Test on the failing case: Does it fix it?
  • Test on similar cases: Does it help there too?
  • Test on working cases: Does it break anything?

Step 6: Implement and Verify

Once the hypothesis is validated:

Implement the fix:

  • Make the change in your system
  • Document what was changed and why

Verify comprehensively:

  • Confirm the original failure is fixed
  • Run your evaluation suite for regression testing
  • If possible, A/B test in production before full rollout

Step 7: Prevent Recurrence

Turn the failure into future protection:

Add to golden dataset: Include the failing case in your evaluation set

Add monitoring: Set up alerts for similar patterns

Document the failure: Record in a knowledge base: symptom, root cause, fix, prevention

The Workflow Visualized

User Report / Alert
        ↓
┌─────────────────────┐
│  CAPTURE FAILURE    │ ← Full logs, inputs, outputs
└─────────────────────┘
        ↓
┌─────────────────────┐
│  CLASSIFY TYPE      │ ← Hallucination? Retrieval? Format?
└─────────────────────┘
        ↓
┌─────────────────────┐
│  ISOLATE COMPONENT  │ ← Test retrieval, prompt, parsing separately
└─────────────────────┘
        ↓
┌─────────────────────┐
│  FORM HYPOTHESIS    │ ← Specific, testable theory
└─────────────────────┘
        ↓
┌─────────────────────┐
│  TEST HYPOTHESIS    │ ← Controlled experiment
└─────────────────────┘
        ↓
    Pass? ─── No ──→ Return to ISOLATE/HYPOTHESIS
        │
       Yes
        ↓
┌─────────────────────┐
│  IMPLEMENT FIX      │ ← Change + document
└─────────────────────┘
        ↓
┌─────────────────────┐
│  VERIFY + PREVENT   │ ← Regression test + add to golden dataset
└─────────────────────┘

Advanced Debugging Techniques

Diff Debugging

When outputs are inconsistent between runs or environments:

  1. Capture two runs: One working, one failing
  2. Diff the inputs: Full prompt, context, parameters—what's different?
  3. Find the minimal diff: What's the smallest change that causes the failure?
  4. Root cause from the diff: The difference reveals what's sensitive

Ablation Testing

Systematically remove components to identify what's necessary:

  1. Start with full system: Current (failing) configuration
  2. Remove one component: Remove reranking → does behavior change?
  3. Continue ablating: Remove query expansion, reduce retrieved chunks, simplify prompt
  4. Identify critical factors: Which removals impact failure vs. not?

Counterfactual Prompting

Test how prompt changes affect behavior:

Counterfactual 1: What if instructions were stronger?

  • Add "CRITICAL: You must ONLY use provided context"
  • Does output change?

Counterfactual 2: What if context were perfect?

  • Manually provide ideal context
  • Does output become correct?

Counterfactual 3: What if output format were simpler?

  • Ask for plain text instead of JSON
  • Does quality improve?

Attention Analysis (When Available)

Some debugging scenarios benefit from analyzing model attention:

  • Which parts of the prompt is the model attending to?
  • Is relevant context being attended to or ignored?
  • Are there attention patterns that correlate with failures?

Tools like BertViz or provider-specific interpretability tools can help, though this is more relevant for fine-tuned models than API-based applications.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
