Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
Beyond Single-Shot Retrieval
Traditional RAG has a fundamental limitation: it's reactive. User asks a question, system retrieves documents, LLM generates an answer. One shot. If the retrieval misses relevant information or the question requires synthesizing insights across multiple searches, the system fails silently.
Why single-shot RAG fails on complex questions: Consider: "How did our Q3 marketing campaign affect customer acquisition compared to last year?" This requires finding: (1) Q3 campaign details, (2) current customer acquisition metrics, (3) last year's metrics, (4) correlation between campaigns and acquisition. No single query retrieves all this. Single-shot RAG returns whatever documents match best—often missing crucial pieces. The LLM then generates a confident-sounding answer from incomplete information.
The iteration imperative: Humans don't research complex topics in one search. We search, read results, refine our understanding, search again with better queries, follow references, cross-check facts. Agentic RAG gives LLMs the same capability: observe results, reason about what's missing, decide what to search next. This closes the gap between "retrieval that finds something" and "retrieval that finds everything needed."
Agentic RAG changes this paradigm. Instead of a single retrieve-then-generate pipeline, an agentic system can:
- Plan a retrieval strategy before searching
- Decompose complex questions into sub-queries
- Iterate on retrieval based on intermediate results
- Verify information across multiple sources
- Synthesize insights that require multi-hop reasoning
At Goji AI, we've seen agentic RAG systems achieve 40% higher answer accuracy on complex questions compared to traditional single-shot RAG. The difference is particularly stark for questions requiring synthesis across multiple documents or topics.
The Agentic RAG Architecture
Core Components
An agentic RAG system extends the traditional pipeline with reasoning and control loops:
User Query
↓
[Query Analyzer] → Classify complexity, identify sub-questions
↓
[Strategy Planner] → Determine retrieval approach
↓
[Retrieval Agent] ←→ [Vector Store / Search APIs]
↓ ↑
[Result Evaluator] ────────┘ (iterate if insufficient)
↓
[Synthesis Agent] → Combine findings, resolve conflicts
↓
[Response Generator] → Format final answer with citations
↓
[Self-Critic] → Verify answer quality, trigger refinement
Query Analysis and Planning
Not all questions need agentic capabilities. Simple factual queries ("What is the capital of France?") work fine with single-shot RAG. Complex queries benefit from planning:
Query Classification:
| Query Type | Example | Strategy |
|---|---|---|
| Simple factual | "What's our refund policy?" | Single retrieval |
| Multi-faceted | "Compare our pricing with competitors" | Parallel retrieval |
| Multi-hop | "Which team lead approved the Q3 budget?" | Sequential retrieval |
| Exploratory | "What are emerging trends in our market?" | Iterative expansion |
| Verification | "Is it true that we launched in 2019?" | Multi-source confirmation |
Query Decomposition: For complex queries, break them into atomic sub-questions:
Original: "How did our customer satisfaction change after we implemented the new support system, and what were the main drivers?"
Decomposed:
- "What was customer satisfaction before the new support system?"
- "When was the new support system implemented?"
- "What is current customer satisfaction?"
- "What factors correlate with satisfaction changes?"
Each sub-question can be retrieved independently, then synthesized.
Implementation: This query analyzer is the first critical component of agentic RAG. It determines whether to route to simple single-shot RAG or activate the full agentic pipeline. The classification is based on linguistic patterns (keywords like "compare", "analyze", "why"), question complexity (presence of conjunctions indicating multiple sub-questions), and learned patterns from historical data.
The decomposition uses an LLM with chain-of-thought reasoning to break complex questions into dependency-ordered sub-questions. Notice how the dependencies are tracked—some sub-questions must be answered before others (e.g., "When was X implemented?" must be answered before "What changed after X?").
from openai import OpenAI
from typing import List, Dict, Tuple
from enum import Enum
import re
class QueryComplexity(Enum):
SIMPLE = "simple" # Single-shot RAG
MULTI_FACETED = "multi_faceted" # Parallel retrieval
MULTI_HOP = "multi_hop" # Sequential retrieval
EXPLORATORY = "exploratory" # Iterative expansion
VERIFICATION = "verification" # Multi-source check
class QueryAnalyzer:
"""
Analyze query complexity and decompose into sub-questions.
This is the routing layer that decides whether to use simple RAG
or activate the full agentic system.
"""
def __init__(self):
self.client = OpenAI()
# Keywords that indicate different query types
self.comparison_keywords = ['compare', 'vs', 'versus', 'difference between', 'better']
self.analysis_keywords = ['analyze', 'why', 'how did', 'what caused', 'drivers', 'factors']
self.multi_hop_keywords = ['who manages', 'which team', 'related to', 'leads to']
        self.exploratory_keywords = ['trends', 'insights', 'overview', 'landscape', 'emerging']
        # Representative phrases for verification-type queries (used by _classify_complexity)
        self.verification_keywords = ['is it true', 'verify', 'confirm', 'did we really', 'fact-check']
def analyze(self, query: str) -> Dict:
"""
Analyze query to determine complexity and strategy.
Returns:
Dict with:
- complexity: QueryComplexity enum
- sub_questions: List of decomposed questions (if complex)
- strategy: Recommended retrieval strategy
- reasoning: Why this classification
"""
# Classify complexity
complexity = self._classify_complexity(query)
# Decompose if needed
sub_questions = []
dependencies = []
if complexity != QueryComplexity.SIMPLE:
sub_questions, dependencies = self._decompose_query(query)
# Determine strategy
strategy = self._determine_strategy(complexity, sub_questions)
return {
'complexity': complexity,
'sub_questions': sub_questions,
'dependencies': dependencies,
'strategy': strategy,
'use_agentic': complexity != QueryComplexity.SIMPLE
}
def _classify_complexity(self, query: str) -> QueryComplexity:
"""
Classify query complexity using rule-based + LLM approach.
Rule-based catches common patterns quickly.
LLM handles edge cases and nuanced queries.
"""
query_lower = query.lower()
# Rule-based classification (fast path)
if any(kw in query_lower for kw in self.verification_keywords):
return QueryComplexity.VERIFICATION
if any(kw in query_lower for kw in self.multi_hop_keywords):
return QueryComplexity.MULTI_HOP
if any(kw in query_lower for kw in self.exploratory_keywords):
return QueryComplexity.EXPLORATORY
if any(kw in query_lower for kw in self.comparison_keywords):
return QueryComplexity.MULTI_FACETED
if any(kw in query_lower for kw in self.analysis_keywords):
return QueryComplexity.MULTI_FACETED
# Check for compound questions (multiple "and"/"or")
if query_lower.count(' and ') >= 2 or query_lower.count(' or ') >= 2:
return QueryComplexity.MULTI_FACETED
# Check for question chaining ("... and what...", "... and why...")
if re.search(r'and (what|why|how|when|where|who)', query_lower):
return QueryComplexity.MULTI_HOP
# LLM-based classification for ambiguous cases
return self._llm_classify(query)
def _llm_classify(self, query: str) -> QueryComplexity:
"""Use LLM to classify query complexity."""
prompt = f"""Classify this query's complexity:
Query: {query}
Classifications:
- SIMPLE: Single factual question, one retrieval likely sufficient
- MULTI_FACETED: Multiple independent aspects, needs parallel searches
- MULTI_HOP: Requires chaining searches (answer to Q1 informs Q2)
- EXPLORATORY: Open-ended, requires iterative expansion
- VERIFICATION: Needs cross-checking across multiple sources
Return only the classification name.
"""
response = self.client.chat.completions.create(
model="gpt-4-turbo-preview",
temperature=0,
messages=[{"role": "user", "content": prompt}]
)
classification = response.choices[0].message.content.strip().upper()
try:
return QueryComplexity[classification]
except KeyError:
# Default to SIMPLE if unclear
return QueryComplexity.SIMPLE
def _decompose_query(self, query: str) -> Tuple[List[Dict], List[Tuple]]:
"""
Decompose complex query into sub-questions with dependencies.
Returns:
Tuple of (sub_questions, dependencies)
sub_questions: List of dicts with 'id', 'question', 'type'
dependencies: List of (question_id, depends_on_id) tuples
"""
prompt = f"""Decompose this complex query into atomic sub-questions.
Query: {query}
Requirements:
1. Each sub-question should be answerable independently
2. Identify dependencies (questions that must be answered first)
3. Order questions by dependency chain
4. Mark the type of each question (factual, comparison, temporal, causal)
Return JSON format:
{{
"sub_questions": [
{{
"id": "q1",
"question": "...",
"type": "factual|comparison|temporal|causal",
"depends_on": [] // IDs of questions that must be answered first
}},
...
],
"reasoning": "Why this decomposition?"
}}
JSON Response:"""
response = self.client.chat.completions.create(
model="gpt-4-turbo-preview",
temperature=0,
response_format={"type": "json_object"},
messages=[{"role": "user", "content": prompt}]
)
import json
result = json.loads(response.choices[0].message.content)
sub_questions = result['sub_questions']
# Build dependency list
dependencies = []
for sq in sub_questions:
for dep_id in sq.get('depends_on', []):
dependencies.append((sq['id'], dep_id))
return sub_questions, dependencies
def _determine_strategy(
self,
complexity: QueryComplexity,
sub_questions: List[Dict]
) -> Dict:
"""Determine the optimal retrieval strategy."""
if complexity == QueryComplexity.SIMPLE:
return {
'type': 'single_shot',
'parallel': False,
'max_iterations': 1
}
elif complexity == QueryComplexity.MULTI_FACETED:
return {
'type': 'parallel_retrieval',
'parallel': True,
'max_iterations': 1,
'merge_strategy': 'interleave'
}
elif complexity == QueryComplexity.MULTI_HOP:
return {
'type': 'sequential_retrieval',
'parallel': False,
'max_iterations': len(sub_questions),
'confidence_threshold': 0.7 # Move to next hop if confident enough
}
elif complexity == QueryComplexity.EXPLORATORY:
return {
'type': 'iterative_expansion',
'parallel': False,
'max_iterations': 5,
'expansion_strategy': 'follow_references'
}
elif complexity == QueryComplexity.VERIFICATION:
return {
'type': 'multi_source_verification',
'parallel': True,
'min_sources': 3,
'consistency_threshold': 0.8
}
return {'type': 'single_shot'}
# Usage
analyzer = QueryAnalyzer()
# Simple query
result1 = analyzer.analyze("What is our refund policy?")
print(f"Complexity: {result1['complexity']}")
print(f"Use agentic: {result1['use_agentic']}")
# Complex query
result2 = analyzer.analyze(
"How did our customer satisfaction change after we implemented "
"the new support system, and what were the main drivers?"
)
print(f"\nComplexity: {result2['complexity']}")
print(f"Sub-questions: {len(result2['sub_questions'])}")
for sq in result2['sub_questions']:
print(f" - [{sq['id']}] {sq['question']}")
if sq['depends_on']:
print(f" Depends on: {sq['depends_on']}")
print(f"Strategy: {result2['strategy']}")
The Retrieval Agent
The retrieval agent is an LLM with access to search tools. Unlike traditional RAG where retrieval is a function call, here retrieval is a decision made by the agent:
Tools available to the retrieval agent:
- vector_search(query, filters, top_k) — Semantic search
- keyword_search(query, filters) — BM25/keyword matching
- hybrid_search(query, filters, alpha) — Combined search
- get_document(doc_id) — Fetch full document
- get_related(doc_id) — Find similar documents
- search_web(query) — External web search (if enabled)
Agent reasoning loop:
1. Consider the current sub-question
2. Decide which search tool(s) to use
3. Execute search, receive results
4. Evaluate: Do I have enough information?
- If yes: Move to next sub-question or synthesis
- If no: Reformulate query or try different tool
5. Repeat until satisfied or max iterations reached
Iterative Refinement
The key differentiator of agentic RAG is iteration. When initial retrieval is insufficient:
Query reformulation:
- Original: "revenue growth drivers" → Reformulated: "factors contributing to revenue increase 2024"
- Original: "customer churn" → Expanded: "customer churn OR customer attrition OR subscription cancellation"
Result-driven expansion: If retrieved documents mention a concept the agent doesn't have enough context for, it can search for that concept:
- Retrieved doc mentions "the Thompson acquisition"
- Agent decides: "I need more context on the Thompson acquisition"
- New search: "Thompson acquisition details terms"
Source triangulation: For high-stakes answers, retrieve from multiple independent sources and check consistency. Inconsistencies trigger deeper investigation.
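Below is a minimal sketch of LLM-driven query reformulation in the same OpenAI client style used throughout this post; the prompt wording, the results_summary argument, and max_variants are illustrative choices, not part of the system described above.

from typing import List
from openai import OpenAI

def reformulate_query(original_query: str, results_summary: str, max_variants: int = 3) -> List[str]:
    """Ask an LLM for alternative phrasings when the first search comes back thin.

    results_summary is a short note on what the first search did (or didn't)
    return, so the model can steer away from the failed phrasing.
    """
    client = OpenAI()
    prompt = (
        f"The search query '{original_query}' returned insufficient results.\n"
        f"What was found: {results_summary}\n\n"
        f"Propose {max_variants} alternative search queries using synonyms, expanded "
        "terminology, or a narrower/broader scope. Return one query per line, nothing else."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0.3,  # a little diversity avoids repeating the failed phrasing
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().split("\n")
    return [line.strip("- ").strip() for line in lines if line.strip()][:max_variants]

# e.g. reformulate_query("customer churn", "one billing doc, nothing on cancellations")
# might return variants like "customer attrition rate" or "subscription cancellation reasons"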
Implementation Patterns
Pattern 1: ReAct-Style Agentic RAG
The ReAct (Reasoning + Acting) pattern interleaves thinking and tool use:
Question: "What was the revenue impact of our European expansion?"
Thought: I need to find information about European expansion and its revenue impact. Let me start with a search for European expansion timeline.
Action: vector_search("European expansion timeline launch")
Observation: [Retrieved 3 documents about EU market entry in Q2 2023...]
Thought: Good, I found when we expanded. Now I need revenue data before and after Q2 2023.
Action: vector_search("revenue Europe Q1 Q2 Q3 2023 comparison")
Observation: [Retrieved financial reports showing revenue trends...]
Thought: I have timeline and revenue data. Let me verify with another source.
Action: keyword_search("European revenue 2023 quarterly")
Observation: [Retrieved quarterly earnings call transcript...]
Thought: Multiple sources confirm ~23% revenue increase attributed to EU. I have enough to answer.
Answer: The European expansion in Q2 2023 contributed to a 23% increase in revenue...
Implementation: This ReAct agent is the core of agentic RAG. The key design decision is making retrieval an agent action rather than a fixed pipeline step. The agent sees its tools (vector_search, keyword_search, etc.) and decides which to use based on the reasoning trace.
The thought-action-observation loop is implemented through careful prompt engineering. The system prompt defines the agent's behavior and available tools. The loop continues until the agent outputs a final answer or hits the iteration limit.
Critical implementation details: (1) Parse action calls from LLM output using regex, (2) Execute tools and append observations to context, (3) Track confidence to enable early stopping, (4) Limit iterations to prevent infinite loops.
from openai import OpenAI
from typing import List, Dict, Callable, Optional
import re
import json
class ReActAgent:
"""
ReAct-style agentic RAG: Reasoning + Acting in iterative loop.
The agent:
1. Observes the question and context
2. Thinks about what to do next
3. Acts using available tools
4. Observes the results
5. Repeats until confident in answer
This implements the core agentic RAG pattern.
"""
def __init__(
self,
retriever, # HybridRetriever from previous code
model: str = "gpt-4-turbo-preview",
max_iterations: int = 5
):
self.client = OpenAI()
self.retriever = retriever
self.model = model
self.max_iterations = max_iterations
# Define available tools
self.tools = {
'vector_search': self._vector_search,
'keyword_search': self._keyword_search,
'hybrid_search': self._hybrid_search,
'get_document': self._get_document,
}
def run(self, query: str) -> Dict:
"""
Run the ReAct loop to answer the query.
Returns:
Dict with:
- answer: Final answer
- reasoning_trace: List of thought-action-observation steps
- total_iterations: Number of loops
- confidence: Estimated confidence
"""
# Initialize reasoning trace
trace = []
messages = [
{"role": "system", "content": self._get_system_prompt()},
{"role": "user", "content": f"Question: {query}"}
]
iteration = 0
answer = None
while iteration < self.max_iterations and not answer:
iteration += 1
# Get agent's next thought + action
response = self.client.chat.completions.create(
model=self.model,
temperature=0,
messages=messages
)
agent_message = response.choices[0].message.content
# Parse the response
thought, action, action_input, final_answer = self._parse_agent_response(
agent_message
)
# Log to trace
step = {
'iteration': iteration,
'thought': thought,
'action': action,
'action_input': action_input
}
# Check if agent provided final answer
if final_answer:
answer = final_answer
step['final_answer'] = answer
trace.append(step)
break
# Execute action
if action and action in self.tools:
try:
observation = self.tools[action](action_input)
step['observation'] = observation
# Add observation to conversation
messages.append({"role": "assistant", "content": agent_message})
messages.append({
"role": "user",
"content": f"Observation: {observation}"
})
except Exception as e:
step['observation'] = f"Error: {str(e)}"
messages.append({"role": "assistant", "content": agent_message})
messages.append({
"role": "user",
"content": f"Observation: Error executing {action}: {str(e)}"
})
else:
step['observation'] = f"Unknown action: {action}"
messages.append({"role": "assistant", "content": agent_message})
messages.append({
"role": "user",
"content": f"Observation: Unknown action '{action}'. Available actions: {list(self.tools.keys())}"
})
trace.append(step)
# If no answer after max iterations, force generation
if not answer:
messages.append({
"role": "user",
"content": "You've reached the iteration limit. Based on what you've gathered, provide your final answer."
})
response = self.client.chat.completions.create(
model=self.model,
temperature=0,
messages=messages
)
answer = response.choices[0].message.content
# Estimate confidence based on retrieval quality
confidence = self._estimate_confidence(trace)
return {
'answer': answer,
'reasoning_trace': trace,
'total_iterations': iteration,
'confidence': confidence
}
def _get_system_prompt(self) -> str:
"""Define agent behavior and available tools."""
return """You are a research agent that answers questions using a ReAct (Reasoning + Acting) approach.
Available tools:
- vector_search(query): Semantic search for relevant documents
- keyword_search(query): Keyword-based search (good for exact terms, acronyms)
- hybrid_search(query): Combined semantic + keyword search
- get_document(doc_id): Retrieve full document by ID
Format your response as:
Thought: [Your reasoning about what to do next]
Action: [tool_name]
Action Input: [input to the tool]
Or, when you have enough information:
Thought: [Your reasoning about why you can now answer]
Final Answer: [Your complete answer with citations]
Instructions:
1. Think step-by-step about what information you need
2. Use tools iteratively to gather information
3. If initial results are insufficient, reformulate and search again
4. Verify important facts across multiple sources when possible
5. Provide a Final Answer only when confident
6. Cite sources in your answer using [doc_id] notation
Begin!"""
def _parse_agent_response(self, response: str) -> tuple:
"""
Parse agent response to extract thought, action, and answer.
Returns:
(thought, action, action_input, final_answer)
"""
# Extract thought
thought_match = re.search(r'Thought:\s*(.*?)(?=\n|Action:|Final Answer:|$)', response, re.DOTALL)
thought = thought_match.group(1).strip() if thought_match else None
# Check for final answer
final_answer_match = re.search(r'Final Answer:\s*(.*?)$', response, re.DOTALL)
if final_answer_match:
return thought, None, None, final_answer_match.group(1).strip()
# Extract action
action_match = re.search(r'Action:\s*(\w+)', response)
action = action_match.group(1) if action_match else None
# Extract action input
action_input_match = re.search(r'Action Input:\s*(.*?)(?=\n|$)', response)
action_input = action_input_match.group(1).strip() if action_input_match else None
return thought, action, action_input, None
# Tool implementations
def _vector_search(self, query: str) -> str:
"""Execute vector search and format results."""
results = self.retriever.retrieve(query)
formatted = []
for i, result in enumerate(results[:5], 1): # Top 5
formatted.append(
f"[doc_{i}] (score: {result['final_score']:.2f})\n"
f"{result['text'][:300]}..."
)
return "\n\n".join(formatted) if formatted else "No results found."
def _keyword_search(self, query: str) -> str:
"""Execute keyword search (BM25)."""
# Use BM25 component of hybrid retriever
results = self.retriever.bm25_retriever.get_relevant_documents(query)
formatted = []
for i, result in enumerate(results[:5], 1):
formatted.append(
f"[doc_{i}]\n{result.page_content[:300]}..."
)
return "\n\n".join(formatted) if formatted else "No results found."
def _hybrid_search(self, query: str) -> str:
"""Execute hybrid search."""
return self._vector_search(query) # Our retriever is already hybrid
def _get_document(self, doc_id: str) -> str:
"""Retrieve full document by ID."""
# Implementation depends on your document store
return f"Full content of {doc_id}: [document content here]"
def _estimate_confidence(self, trace: List[Dict]) -> float:
"""
Estimate confidence based on reasoning trace.
Factors:
- Number of sources consulted
- Retrieval scores
- Whether verification was done
"""
if not trace:
return 0.0
# More iterations with good results = higher confidence
successful_retrievals = sum(
1 for step in trace
if step.get('observation') and 'No results found' not in step.get('observation', '')
)
# Base confidence on successful retrievals
confidence = min(successful_retrievals / 3, 1.0) # Max out at 3 good retrievals
# Bonus for verification (multiple searches on same topic)
actions = [step.get('action') for step in trace]
if len(set(actions)) > 2: # Used multiple different tools
confidence *= 1.1
return min(confidence, 1.0)
# Usage Example
from building_production_rag import HybridRetriever # From previous post
# Initialize retriever (from previous code)
retriever = HybridRetriever(...)
# Initialize ReAct agent
agent = ReActAgent(retriever=retriever, max_iterations=5)
# Run on complex query
result = agent.run(
"What was the revenue impact of our European expansion and how does it compare to our Asian markets?"
)
print("Answer:", result['answer'])
print(f"\nIterations: {result['total_iterations']}")
print(f"Confidence: {result['confidence']:.2%}")
print("\nReasoning Trace:")
for step in result['reasoning_trace']:
print(f"\n--- Iteration {step['iteration']} ---")
if step.get('thought'):
print(f"Thought: {step['thought']}")
if step.get('action'):
print(f"Action: {step['action']}({step.get('action_input', '')})")
if step.get('observation'):
print(f"Observation: {step['observation'][:200]}...")
Pattern 2: Plan-and-Execute
For complex queries, generate a full plan upfront, then execute:
Planning phase:
Query: "Analyze the competitive landscape for our new product launch"
Plan:
1. Identify our new product and its key features
2. Find direct competitors in this product category
3. Compare pricing across competitors
4. Analyze competitor strengths and weaknesses
5. Identify market gaps and opportunities
6. Synthesize competitive positioning recommendation
Execution phase: Execute each step, potentially parallelizing independent steps. Revise plan if new information suggests a different approach.
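The sketch below shows one way to wire this pattern, reusing the OpenAI client style from earlier; execute_step is a hypothetical callback standing in for whatever retrieval-plus-summary call each step needs.

import json
from typing import Callable, Dict, List
from openai import OpenAI

def plan_and_execute(query: str, execute_step: Callable[[str, List[str]], str]) -> Dict:
    """Generate a research plan up front, then run each step in order.

    execute_step(step_description, prior_results) performs one step (for example,
    a retrieval plus summary) and returns its finding as text.
    """
    client = OpenAI()
    plan_response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                f"Create a step-by-step research plan for: {query}\n"
                'Return JSON: {"steps": ["step 1", "step 2", ...]}'
            ),
        }],
    )
    steps = json.loads(plan_response.choices[0].message.content)["steps"]

    results: List[str] = []
    for step in steps:
        # Each step sees earlier findings so it can build on them
        results.append(execute_step(step, results))

    return {"plan": steps, "step_results": results}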
Pattern 3: Self-RAG with Reflection
After generating a response, the agent critiques its own answer:
[Generate initial answer]
↓
[Self-critique]
- Is every claim supported by retrieved evidence?
- Are there logical gaps in the reasoning?
- Did I miss any relevant aspect of the question?
- Is my confidence level appropriate?
↓
[If issues found]
- Identify what additional information is needed
- Retrieve more context
- Regenerate improved answer
↓
[Final answer with confidence score]
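A minimal critique pass might look like the sketch below; it assumes the draft answer and retrieved evidence are plain strings, and the JSON critique schema is illustrative rather than prescribed.

import json
from typing import Dict, List
from openai import OpenAI

def critique_answer(question: str, answer: str, evidence: List[str]) -> Dict:
    """One self-critique pass: check the draft answer against retrieved evidence.

    Returns a verdict plus any follow-up searches the critic thinks are needed;
    the caller decides whether to retrieve more context and regenerate.
    """
    client = OpenAI()
    evidence_block = "\n\n".join(f"[{i}] {e}" for i, e in enumerate(evidence, 1))
    prompt = f"""Question: {question}

Draft answer: {answer}

Retrieved evidence:
{evidence_block}

Critique the draft:
1. Is every claim supported by the evidence above?
2. Are there logical gaps or missing aspects of the question?
Return JSON: {{"supported": true/false, "gaps": ["..."], "followup_searches": ["..."]}}"""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# If critique["supported"] is False, run critique["followup_searches"],
# append the new evidence, and regenerate the answer.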
Pattern 4: Multi-Agent RAG
Different agents specialize in different aspects:
- Retriever Agent: Optimizes for recall, finds all potentially relevant documents
- Analyst Agent: Deep-dives into specific documents, extracts detailed insights
- Synthesizer Agent: Combines findings across sources, resolves conflicts
- Critic Agent: Evaluates answer quality, identifies gaps
Agents communicate through structured messages, building toward a comprehensive answer.
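The sketch below shows the shape of that message passing; the AgentMessage dataclass and the run(messages) interface are assumptions for illustration, not a prescribed API.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentMessage:
    """Structured message passed between specialized agents."""
    sender: str
    content: str
    metadata: Dict = field(default_factory=dict)

def run_multi_agent_pipeline(query: str, retriever_agent, analyst_agent,
                             synthesizer_agent, critic_agent, max_rounds: int = 3) -> str:
    """Pass findings from agent to agent, looping back through retrieval when the
    critic flags gaps. Each agent is assumed to expose run(messages) -> AgentMessage."""
    messages: List[AgentMessage] = [AgentMessage(sender="user", content=query)]
    draft = AgentMessage(sender="synthesizer", content="")

    for _ in range(max_rounds):
        messages.append(retriever_agent.run(messages))   # optimize for recall
        messages.append(analyst_agent.run(messages))     # per-document deep dive
        draft = synthesizer_agent.run(messages)           # merge findings, resolve conflicts
        messages.append(draft)

        verdict = critic_agent.run(messages)              # evaluate quality, identify gaps
        if verdict.metadata.get("approved", False):
            return draft.content
        messages.append(verdict)  # the critic's gaps steer the next retrieval round

    return draft.content  # best effort after hitting the round cap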
Handling Multi-Hop Questions
Multi-hop questions require chaining retrievals where the output of one search informs the next:
Example: "Who manages the team that built our highest-revenue product?"
Hop 1: Find highest-revenue product → Result: "Enterprise Analytics Platform"
Hop 2: Find team that built Enterprise Analytics Platform → Result: "Platform Engineering Team"
Hop 3: Find manager of Platform Engineering Team → Result: "Sarah Chen, VP of Engineering"
Final answer: "Sarah Chen manages the Platform Engineering Team, which built the Enterprise Analytics Platform—our highest-revenue product."
Implementation Considerations
Hop limits: Set a maximum number of hops to prevent infinite loops; 3-5 hops handle most queries.
Caching intermediate results: Store intermediate findings to avoid re-retrieval if the agent needs to backtrack.
Confidence propagation: Track confidence at each hop. Low confidence early should trigger broader search rather than proceeding with potentially wrong information.
Parallel vs. sequential: Some hops can be parallelized if they're independent. Analyze the dependency graph.
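The sketch below chains hops sequentially while applying these considerations; answer_hop is a hypothetical helper that performs one retrieval-plus-extraction step, and the 0.5 confidence floor is illustrative.

from typing import Callable, Dict, List, Tuple

def run_multi_hop(
    hops: List[str],
    answer_hop: Callable[[str, Dict[str, str]], Tuple[str, float]],
    max_hops: int = 5,
    confidence_floor: float = 0.5,
) -> Dict:
    """Chain retrievals sequentially; later hop templates can reference earlier
    findings (e.g. "Who manages {hop_1}?"). answer_hop(question, findings) does one
    retrieval-plus-extraction step and returns (answer, confidence)."""
    findings: Dict[str, str] = {}
    cache: Dict[str, Tuple[str, float]] = {}
    confidences: List[float] = []

    for i, template in enumerate(hops[:max_hops]):  # hop limit guards against runaway chains
        question = template.format(**findings)      # substitute earlier answers into the question

        if question in cache:                       # reuse cached findings if the agent backtracks
            answer, confidence = cache[question]
        else:
            answer, confidence = answer_hop(question, findings)
            cache[question] = (answer, confidence)

        if confidence < confidence_floor:
            # A shaky early hop poisons everything downstream: stop and report
            # rather than chaining on a likely-wrong intermediate answer.
            return {"status": "low_confidence", "failed_hop": i, "findings": findings}

        findings[f"hop_{i}"] = answer
        confidences.append(confidence)

    return {"status": "complete", "findings": findings,
            "confidence": min(confidences) if confidences else 0.0}

# hops = ["What is our highest-revenue product?",
#         "Which team built {hop_0}?",
#         "Who manages {hop_1}?"]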
Tool Integration
Agentic RAG becomes more powerful when the agent can use tools beyond search:
Calculator/Code Execution
For questions involving computation: "What percentage of our revenue comes from the top 10 customers?" → Retrieve customer revenue data → Calculate percentages → Format answer
Structured Data Queries
For precise data needs: "How many support tickets did we close last month?" → Generate SQL query → Execute against database → Interpret results
Web Search
For questions requiring external context: "How does our pricing compare to the market average?" → Retrieve internal pricing → Search web for competitor pricing → Compare
Knowledge Graph Traversal
For relationship-heavy queries: "Find all projects connected to the AI initiative" → Query knowledge graph → Expand relationships → Retrieve related documents
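As one concrete example of extending the tool set, a calculator can be registered alongside the search tools of the ReActAgent from Pattern 1; the safe-evaluation helper below is an illustrative sketch (SQL, web-search, and graph tools would plug in the same way).

# Illustrative extension of the ReActAgent's tool registry. The calculator
# evaluates arithmetic the model hands it, e.g. percentage-of-revenue math.
import ast
import operator

_ALLOWED_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Pow: operator.pow,
}

def _safe_eval(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Expression):
        return _safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
        return _ALLOWED_OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_safe_eval(node.operand)
    raise ValueError("Unsupported expression")

def calculator(expression: str) -> str:
    """Tool body: arithmetic on retrieved numbers, e.g. '(1.2e6 / 4.8e6) * 100'."""
    try:
        return str(_safe_eval(ast.parse(expression, mode="eval")))
    except Exception as e:
        return f"Calculator error: {e}"

# Register it and describe it in the system prompt so the agent knows it exists:
# agent.tools['calculator'] = calculator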
Evaluation and Optimization
Metrics for Agentic RAG
Beyond traditional RAG metrics, measure:
| Metric | What It Measures | Target |
|---|---|---|
| Answer completeness | Did it address all aspects? | > 90% |
| Retrieval efficiency | Searches per successful answer | < 5 |
| Multi-hop accuracy | Correct chains of reasoning | > 85% |
| Self-correction rate | Issues caught by self-critique | > 70% |
| Iteration value | Quality gain from iteration | > 10% |
Ablation Testing
Measure the value of each agentic component:
- Single-shot RAG (baseline)
- + Query decomposition
- + Iterative retrieval
- + Self-critique
- Full agentic system
This identifies which components provide value for your specific use case.
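A sketch of how such an ablation run might be wired; build_pipeline and score are hypothetical stand-ins for your own pipeline factory and evaluation judge.

from typing import Callable, Dict, List

# Cumulative configurations matching the list above: each row enables one more
# agentic component on top of the previous one.
ABLATION_CONFIGS = [
    {"name": "single_shot_baseline", "decompose": False, "iterate": False, "self_critique": False},
    {"name": "+decomposition",       "decompose": True,  "iterate": False, "self_critique": False},
    {"name": "+iterative_retrieval", "decompose": True,  "iterate": True,  "self_critique": False},
    {"name": "+self_critique",       "decompose": True,  "iterate": True,  "self_critique": True},
]

def run_ablation(
    build_pipeline: Callable[..., object],
    score: Callable[[object, List[Dict]], float],
    eval_set: List[Dict],
) -> Dict[str, float]:
    """Score each configuration on the same evaluation set so the marginal
    value of each component is visible."""
    results = {}
    for config in ABLATION_CONFIGS:
        flags = {k: v for k, v in config.items() if k != "name"}
        pipeline = build_pipeline(**flags)
        results[config["name"]] = score(pipeline, eval_set)
    return results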
Cost Management
Agentic systems use more tokens than single-shot RAG. Optimize:
- Early stopping: If the first retrieval is high-confidence, skip iteration
- Query routing: Use simple RAG for simple queries, agentic for complex ones
- Caching: Cache intermediate reasoning for similar queries
- Model tiering: Use smaller models for planning, larger models for final synthesis
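A sketch combining query routing with early stopping, reusing the QueryAnalyzer and ReActAgent from earlier; agent_factory, simple_rag, and the 0.8 confidence cutoff are assumptions for illustration.

from typing import Callable, Dict

def answer_with_cost_controls(
    query: str,
    analyzer,                                # QueryAnalyzer from earlier
    agent_factory: Callable[[int], object],  # hypothetical: returns a ReActAgent with the given max_iterations
    simple_rag: Callable[[str], Dict],       # hypothetical single-shot pipeline
    early_stop_confidence: float = 0.8,
) -> Dict:
    """Query routing plus early stopping.

    Simple queries never touch the agent. Complex queries get a cheap
    low-iteration pass first; only low-confidence results pay for a full run.
    """
    analysis = analyzer.analyze(query)
    if not analysis["use_agentic"]:
        return simple_rag(query)

    # Cheap pass: two iterations often suffice when the first retrieval lands
    result = agent_factory(2).run(query)
    if result["confidence"] >= early_stop_confidence:
        return result

    # Escalate only when the cheap pass wasn't confident enough
    budget = analysis["strategy"].get("max_iterations", 5)
    return agent_factory(budget).run(query)

# agent_factory could be: lambda n: ReActAgent(retriever=retriever, max_iterations=n)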
Production Considerations
Latency Budgeting
Agentic systems are slower. Budget accordingly:
| Component | Typical Latency |
|---|---|
| Query analysis | 200-500ms |
| Each retrieval | 100-300ms |
| Each reasoning step | 500-1500ms |
| Final synthesis | 500-1000ms |
| Self-critique | 300-800ms |
A 3-hop query with iteration might take 5-10 seconds. Set user expectations or show progress indicators.
Observability
Log everything for debugging:
- Query classification decisions
- Generated plans
- Each retrieval query and results
- Reasoning traces (thoughts)
- Iteration decisions
- Self-critique findings
- Final confidence scores
Build dashboards showing:
- Average hops per query type
- Iteration frequency and success rate
- Self-critique catch rate
- Component-level latencies
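A sketch of structured logging for these fields using the standard library; the record keys mirror the trace dict returned by the ReActAgent and are otherwise illustrative.

import json
import logging
import time
import uuid

logger = logging.getLogger("agentic_rag")

def log_agent_run(query: str, analysis: dict, result: dict, latency_s: float) -> None:
    """Emit one structured record per query so dashboards can aggregate
    iterations, actions, confidence, and latency by query type."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "complexity": getattr(analysis.get("complexity"), "value", None),
        "strategy": analysis.get("strategy", {}).get("type"),
        "iterations": result.get("total_iterations"),
        "actions": [step.get("action") for step in result.get("reasoning_trace", [])],
        "confidence": result.get("confidence"),
        "latency_seconds": round(latency_s, 3),
    }
    logger.info(json.dumps(record))

# Typical use:
# start = time.monotonic()
# result = agent.run(query)
# log_agent_run(query, analyzer.analyze(query), result, time.monotonic() - start)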
Failure Modes
Retrieval loops: Agent keeps reformulating without progress → Solution: Max iteration limits, diversity in reformulations
Over-retrieval: Agent retrieves far more than necessary → Solution: Confidence thresholds for "enough information"
Under-retrieval: Agent stops too early with incomplete information → Solution: Self-critique to catch gaps
Hallucination in reasoning: Agent makes up facts in intermediate steps → Solution: Ground every claim in retrieved documents
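One cheap guard against the retrieval-loop failure mode is to compare each new search query against those already issued; the sketch below uses word-overlap similarity, and the 0.8 threshold is arbitrary.

from typing import List

def is_repeat_query(new_query: str, previous_queries: List[str], threshold: float = 0.8) -> bool:
    """Flag a reformulation that is nearly identical to an earlier search,
    using Jaccard overlap on lowercased word sets as a cheap similarity check."""
    new_tokens = set(new_query.lower().split())
    for prev in previous_queries:
        prev_tokens = set(prev.lower().split())
        if not new_tokens or not prev_tokens:
            continue
        jaccard = len(new_tokens & prev_tokens) / len(new_tokens | prev_tokens)
        if jaccard >= threshold:
            return True
    return False

# Inside the agent loop, before executing a search action:
# if is_repeat_query(action_input, issued_queries):
#     observation = ("You already searched for nearly the same thing. "
#                    "Try different terms or a different tool.")
# else:
#     issued_queries.append(action_input)
#     observation = self.tools[action](action_input)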
Case Study: Legal Research Assistant
We built an agentic RAG system for legal research that needed to:
- Find relevant case law across jurisdictions
- Trace precedent chains (which cases cite which)
- Identify conflicting rulings
- Synthesize legal analysis
Traditional RAG performance: 52% of complex research questions answered satisfactorily
Agentic RAG performance: 84% satisfactory answers
Key improvements:
- Multi-hop precedent tracing (case A cites B, which overruled C)
- Automatic jurisdiction filtering based on query context
- Conflict detection across retrieved cases
- Iterative refinement when initial results were too broad
Conclusion
Agentic RAG represents the next evolution of retrieval-augmented systems. By adding reasoning, planning, and iteration, these systems handle the complex information needs that defeat single-shot RAG.
The tradeoff is complexity and latency. Not every application needs agentic capabilities—simple factual retrieval works fine with traditional RAG. But for research, analysis, and synthesis tasks, agentic RAG delivers substantially better results.
Start with traditional RAG. Add agentic capabilities when you identify query types that consistently fail. Measure the improvement to justify the added complexity.