Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
Beyond Single-Shot Retrieval
Traditional RAG has a fundamental limitation: it's reactive. User asks a question, system retrieves documents, LLM generates an answer. One shot. If the retrieval misses relevant information or the question requires synthesizing insights across multiple searches, the system fails silently.
Why single-shot RAG fails on complex questions: Consider: "How did our Q3 marketing campaign affect customer acquisition compared to last year?" This requires finding: (1) Q3 campaign details, (2) current customer acquisition metrics, (3) last year's metrics, (4) correlation between campaigns and acquisition. No single query retrieves all this. Single-shot RAG returns whatever documents match best—often missing crucial pieces. The LLM then generates a confident-sounding answer from incomplete information.
The iteration imperative: Humans don't research complex topics in one search. We search, read results, refine our understanding, search again with better queries, follow references, cross-check facts. Agentic RAG gives LLMs the same capability: observe results, reason about what's missing, decide what to search next. This closes the gap between "retrieval that finds something" and "retrieval that finds everything needed."
Agentic RAG changes this paradigm. Instead of a single retrieve-then-generate pipeline, an agentic system can:
- Plan a retrieval strategy before searching
- Decompose complex questions into sub-queries
- Iterate on retrieval based on intermediate results
- Verify information across multiple sources
- Synthesize insights that require multi-hop reasoning
At Goji AI, we've seen agentic RAG systems achieve 40% higher answer accuracy on complex questions compared to traditional single-shot RAG. The difference is particularly stark for questions requiring synthesis across multiple documents or topics.
The Agentic RAG Architecture
Core Components
An agentic RAG system extends the traditional pipeline with reasoning and control loops:
User Query
↓
[Query Analyzer] → Classify complexity, identify sub-questions
↓
[Strategy Planner] → Determine retrieval approach
↓
[Retrieval Agent] ←→ [Vector Store / Search APIs]
↓ ↑
[Result Evaluator] ────────┘ (iterate if insufficient)
↓
[Synthesis Agent] → Combine findings, resolve conflicts
↓
[Response Generator] → Format final answer with citations
↓
[Self-Critic] → Verify answer quality, trigger refinement
Query Analysis and Planning
Not all questions need agentic capabilities. Simple factual queries ("What is the capital of France?") work fine with single-shot RAG. Complex queries benefit from planning:
Query Classification:
| Query Type | Example | Strategy |
|---|---|---|
| Simple factual | "What's our refund policy?" | Single retrieval |
| Multi-faceted | "Compare our pricing with competitors" | Parallel retrieval |
| Multi-hop | "Which team lead approved the Q3 budget?" | Sequential retrieval |
| Exploratory | "What are emerging trends in our market?" | Iterative expansion |
| Verification | "Is it true that we launched in 2019?" | Multi-source confirmation |
Query Decomposition: For complex queries, break them into atomic sub-questions:
Original: "How did our customer satisfaction change after we implemented the new support system, and what were the main drivers?"
Decomposed:
- "What was customer satisfaction before the new support system?"
- "When was the new support system implemented?"
- "What is current customer satisfaction?"
- "What factors correlate with satisfaction changes?"
Each sub-question can be retrieved independently, then synthesized.
Implementation: This query analyzer is the first critical component of agentic RAG. It determines whether to route to simple single-shot RAG or activate the full agentic pipeline. The classification is based on linguistic patterns (keywords like "compare", "analyze", "why"), question complexity (presence of conjunctions indicating multiple sub-questions), and learned patterns from historical data.
The decomposition uses an LLM with chain-of-thought reasoning to break complex questions into dependency-ordered sub-questions. Notice how the dependencies are tracked—some sub-questions must be answered before others (e.g., "When was X implemented?" must be answered before "What changed after X?").
from openai import OpenAI
from typing import List, Dict, Tuple
from enum import Enum
import re
class QueryComplexity(Enum):
SIMPLE = "simple" # Single-shot RAG
MULTI_FACETED = "multi_faceted" # Parallel retrieval
MULTI_HOP = "multi_hop" # Sequential retrieval
EXPLORATORY = "exploratory" # Iterative expansion
VERIFICATION = "verification" # Multi-source check
class QueryAnalyzer:
"""
Analyze query complexity and decompose into sub-questions.
This is the routing layer that decides whether to use simple RAG
or activate the full agentic system.
"""
def __init__(self):
self.client = OpenAI()
# Keywords that indicate different query types
self.comparison_keywords = ['compare', 'vs', 'versus', 'difference between', 'better']
self.analysis_keywords = ['analyze', 'why', 'how did', 'what caused', 'drivers', 'factors']
self.multi_hop_keywords = ['who manages', 'which team', 'related to', 'leads to']
        self.exploratory_keywords = ['trends', 'insights', 'overview', 'landscape', 'emerging']
        # Representative phrases for verification-type queries (used by _classify_complexity)
        self.verification_keywords = ['is it true', 'verify', 'confirm', 'did we really', 'fact-check']
def analyze(self, query: str) -> Dict:
"""
Analyze query to determine complexity and strategy.
Returns:
Dict with:
- complexity: QueryComplexity enum
- sub_questions: List of decomposed questions (if complex)
- strategy: Recommended retrieval strategy
- reasoning: Why this classification
"""
# Classify complexity
complexity = self._classify_complexity(query)
# Decompose if needed
sub_questions = []
dependencies = []
if complexity != QueryComplexity.SIMPLE:
sub_questions, dependencies = self._decompose_query(query)
# Determine strategy
strategy = self._determine_strategy(complexity, sub_questions)
return {
'complexity': complexity,
'sub_questions': sub_questions,
'dependencies': dependencies,
'strategy': strategy,
'use_agentic': complexity != QueryComplexity.SIMPLE
}
def _classify_complexity(self, query: str) -> QueryComplexity:
"""
Classify query complexity using rule-based + LLM approach.
Rule-based catches common patterns quickly.
LLM handles edge cases and nuanced queries.
"""
query_lower = query.lower()
# Rule-based classification (fast path)
if any(kw in query_lower for kw in self.verification_keywords):
return QueryComplexity.VERIFICATION
if any(kw in query_lower for kw in self.multi_hop_keywords):
return QueryComplexity.MULTI_HOP
if any(kw in query_lower for kw in self.exploratory_keywords):
return QueryComplexity.EXPLORATORY
if any(kw in query_lower for kw in self.comparison_keywords):
return QueryComplexity.MULTI_FACETED
if any(kw in query_lower for kw in self.analysis_keywords):
return QueryComplexity.MULTI_FACETED
# Check for compound questions (multiple "and"/"or")
if query_lower.count(' and ') >= 2 or query_lower.count(' or ') >= 2:
return QueryComplexity.MULTI_FACETED
# Check for question chaining ("... and what...", "... and why...")
if re.search(r'and (what|why|how|when|where|who)', query_lower):
return QueryComplexity.MULTI_HOP
# LLM-based classification for ambiguous cases
return self._llm_classify(query)
def _llm_classify(self, query: str) -> QueryComplexity:
"""Use LLM to classify query complexity."""
prompt = f"""Classify this query's complexity:
Query: {query}
Classifications:
- SIMPLE: Single factual question, one retrieval likely sufficient
- MULTI_FACETED: Multiple independent aspects, needs parallel searches
- MULTI_HOP: Requires chaining searches (answer to Q1 informs Q2)
- EXPLORATORY: Open-ended, requires iterative expansion
- VERIFICATION: Needs cross-checking across multiple sources
Return only the classification name.
"""
response = self.client.chat.completions.create(
model="gpt-4-turbo-preview",
temperature=0,
messages=[{"role": "user", "content": prompt}]
)
classification = response.choices[0].message.content.strip().upper()
try:
return QueryComplexity[classification]
except KeyError:
# Default to SIMPLE if unclear
return QueryComplexity.SIMPLE
def _decompose_query(self, query: str) -> Tuple[List[Dict], List[Tuple]]:
"""
Decompose complex query into sub-questions with dependencies.
Returns:
Tuple of (sub_questions, dependencies)
sub_questions: List of dicts with 'id', 'question', 'type'
dependencies: List of (question_id, depends_on_id) tuples
"""
prompt = f"""Decompose this complex query into atomic sub-questions.
Query: {query}
Requirements:
1. Each sub-question should be answerable independently
2. Identify dependencies (questions that must be answered first)
3. Order questions by dependency chain
4. Mark the type of each question (factual, comparison, temporal, causal)
Return JSON format:
{{
"sub_questions": [
{{
"id": "q1",
"question": "...",
"type": "factual|comparison|temporal|causal",
"depends_on": [] // IDs of questions that must be answered first
}},
...
],
"reasoning": "Why this decomposition?"
}}
JSON Response:"""
response = self.client.chat.completions.create(
model="gpt-4-turbo-preview",
temperature=0,
response_format={"type": "json_object"},
messages=[{"role": "user", "content": prompt}]
)
import json
result = json.loads(response.choices[0].message.content)
sub_questions = result['sub_questions']
# Build dependency list
dependencies = []
for sq in sub_questions:
for dep_id in sq.get('depends_on', []):
dependencies.append((sq['id'], dep_id))
return sub_questions, dependencies
def _determine_strategy(
self,
complexity: QueryComplexity,
sub_questions: List[Dict]
) -> Dict:
"""Determine the optimal retrieval strategy."""
if complexity == QueryComplexity.SIMPLE:
return {
'type': 'single_shot',
'parallel': False,
'max_iterations': 1
}
elif complexity == QueryComplexity.MULTI_FACETED:
return {
'type': 'parallel_retrieval',
'parallel': True,
'max_iterations': 1,
'merge_strategy': 'interleave'
}
elif complexity == QueryComplexity.MULTI_HOP:
return {
'type': 'sequential_retrieval',
'parallel': False,
'max_iterations': len(sub_questions),
'confidence_threshold': 0.7 # Move to next hop if confident enough
}
elif complexity == QueryComplexity.EXPLORATORY:
return {
'type': 'iterative_expansion',
'parallel': False,
'max_iterations': 5,
'expansion_strategy': 'follow_references'
}
elif complexity == QueryComplexity.VERIFICATION:
return {
'type': 'multi_source_verification',
'parallel': True,
'min_sources': 3,
'consistency_threshold': 0.8
}
return {'type': 'single_shot'}
# Usage
analyzer = QueryAnalyzer()
# Simple query
result1 = analyzer.analyze("What is our refund policy?")
print(f"Complexity: {result1['complexity']}")
print(f"Use agentic: {result1['use_agentic']}")
# Complex query
result2 = analyzer.analyze(
"How did our customer satisfaction change after we implemented "
"the new support system, and what were the main drivers?"
)
print(f"\nComplexity: {result2['complexity']}")
print(f"Sub-questions: {len(result2['sub_questions'])}")
for sq in result2['sub_questions']:
print(f" - [{sq['id']}] {sq['question']}")
if sq['depends_on']:
print(f" Depends on: {sq['depends_on']}")
print(f"Strategy: {result2['strategy']}")
The Retrieval Agent
The retrieval agent is an LLM with access to search tools. Unlike traditional RAG where retrieval is a function call, here retrieval is a decision made by the agent:
Tools available to the retrieval agent:
- vector_search(query, filters, top_k) — Semantic search
- keyword_search(query, filters) — BM25/keyword matching
- hybrid_search(query, filters, alpha) — Combined search
- get_document(doc_id) — Fetch full document
- get_related(doc_id) — Find similar documents
- search_web(query) — External web search (if enabled)
Agent reasoning loop:
1. Consider the current sub-question
2. Decide which search tool(s) to use
3. Execute search, receive results
4. Evaluate: Do I have enough information?
- If yes: Move to next sub-question or synthesis
- If no: Reformulate query or try different tool
5. Repeat until satisfied or max iterations reached
Iterative Refinement
The key differentiator of agentic RAG is iteration. When initial retrieval is insufficient:
Query reformulation:
- Original: "revenue growth drivers" → Reformulated: "factors contributing to revenue increase 2024"
- Original: "customer churn" → Expanded: "customer churn OR customer attrition OR subscription cancellation"
Result-driven expansion: If retrieved documents mention a concept the agent doesn't have enough context for, it can search for that concept:
- Retrieved doc mentions "the Thompson acquisition"
- Agent decides: "I need more context on the Thompson acquisition"
- New search: "Thompson acquisition details terms"
Source triangulation: For high-stakes answers, retrieve from multiple independent sources and check consistency. Inconsistencies trigger deeper investigation.
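Below is a minimal sketch of LLM-driven query reformulation in the same OpenAI client style used throughout this post; the prompt wording, the results_summary argument, and max_variants are illustrative choices, not part of the system described above.

from typing import List
from openai import OpenAI

def reformulate_query(original_query: str, results_summary: str, max_variants: int = 3) -> List[str]:
    """Ask an LLM for alternative phrasings when the first search comes back thin.

    results_summary is a short note on what the first search did (or didn't)
    return, so the model can steer away from the failed phrasing.
    """
    client = OpenAI()
    prompt = (
        f"The search query '{original_query}' returned insufficient results.\n"
        f"What was found: {results_summary}\n\n"
        f"Propose {max_variants} alternative search queries using synonyms, expanded "
        "terminology, or a narrower/broader scope. Return one query per line, nothing else."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0.3,  # a little diversity avoids repeating the failed phrasing
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().split("\n")
    return [line.strip("- ").strip() for line in lines if line.strip()][:max_variants]

# e.g. reformulate_query("customer churn", "one billing doc, nothing on cancellations")
# might return variants like "customer attrition rate" or "subscription cancellation reasons"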
Implementation Patterns
Pattern 1: ReAct-Style Agentic RAG
The ReAct (Reasoning + Acting) pattern interleaves thinking and tool use:
Question: "What was the revenue impact of our European expansion?"
Thought: I need to find information about European expansion and its revenue impact. Let me start with a search for European expansion timeline.
Action: vector_search("European expansion timeline launch")
Observation: [Retrieved 3 documents about EU market entry in Q2 2023...]
Thought: Good, I found when we expanded. Now I need revenue data before and after Q2 2023.
Action: vector_search("revenue Europe Q1 Q2 Q3 2023 comparison")
Observation: [Retrieved financial reports showing revenue trends...]
Thought: I have timeline and revenue data. Let me verify with another source.
Action: keyword_search("European revenue 2023 quarterly")
Observation: [Retrieved quarterly earnings call transcript...]
Thought: Multiple sources confirm ~23% revenue increase attributed to EU. I have enough to answer.
Answer: The European expansion in Q2 2023 contributed to a 23% increase in revenue...
Implementation: This ReAct agent is the core of agentic RAG. The key design decision is making retrieval an agent action rather than a fixed pipeline step. The agent sees its tools (vector_search, keyword_search, etc.) and decides which to use based on the reasoning trace.
The thought-action-observation loop is implemented through careful prompt engineering. The system prompt defines the agent's behavior and available tools. The loop continues until the agent outputs a final answer or hits the iteration limit.
Critical implementation details: (1) Parse action calls from LLM output using regex, (2) Execute tools and append observations to context, (3) Track confidence to enable early stopping, (4) Limit iterations to prevent infinite loops.
from openai import OpenAI
from typing import List, Dict, Callable, Optional
import re
import json
class ReActAgent:
"""
ReAct-style agentic RAG: Reasoning + Acting in iterative loop.
The agent:
1. Observes the question and context
2. Thinks about what to do next
3. Acts using available tools
4. Observes the results
5. Repeats until confident in answer
This implements the core agentic RAG pattern.
"""
def __init__(
self,
retriever, # HybridRetriever from previous code
model: str = "gpt-4-turbo-preview",
max_iterations: int = 5
):
self.client = OpenAI()
self.retriever = retriever
self.model = model
self.max_iterations = max_iterations
# Define available tools
self.tools = {
'vector_search': self._vector_search,
'keyword_search': self._keyword_search,
'hybrid_search': self._hybrid_search,
'get_document': self._get_document,
}
def run(self, query: str) -> Dict:
"""
Run the ReAct loop to answer the query.
Returns:
Dict with:
- answer: Final answer
- reasoning_trace: List of thought-action-observation steps
- total_iterations: Number of loops
- confidence: Estimated confidence
"""
# Initialize reasoning trace
trace = []
messages = [
{"role": "system", "content": self._get_system_prompt()},
{"role": "user", "content": f"Question: {query}"}
]
iteration = 0
answer = None
while iteration < self.max_iterations and not answer:
iteration += 1
# Get agent's next thought + action
response = self.client.chat.completions.create(
model=self.model,
temperature=0,
messages=messages
)
agent_message = response.choices[0].message.content
# Parse the response
thought, action, action_input, final_answer = self._parse_agent_response(
agent_message
)
# Log to trace
step = {
'iteration': iteration,
'thought': thought,
'action': action,
'action_input': action_input
}
# Check if agent provided final answer
if final_answer:
answer = final_answer
step['final_answer'] = answer
trace.append(step)
break
# Execute action
if action and action in self.tools:
try:
observation = self.tools[action](action_input)
step['observation'] = observation
# Add observation to conversation
messages.append({"role": "assistant", "content": agent_message})
messages.append({
"role": "user",
"content": f"Observation: {observation}"
})
except Exception as e:
step['observation'] = f"Error: {str(e)}"
messages.append({"role": "assistant", "content": agent_message})
messages.append({
"role": "user",
"content": f"Observation: Error executing {action}: {str(e)}"
})
else:
step['observation'] = f"Unknown action: {action}"
messages.append({"role": "assistant", "content": agent_message})
messages.append({
"role": "user",
"content": f"Observation: Unknown action '{action}'. Available actions: {list(self.tools.keys())}"
})
trace.append(step)
# If no answer after max iterations, force generation
if not answer:
messages.append({
"role": "user",
"content": "You've reached the iteration limit. Based on what you've gathered, provide your final answer."
})
response = self.client.chat.completions.create(
model=self.model,
temperature=0,
messages=messages
)
answer = response.choices[0].message.content
# Estimate confidence based on retrieval quality
confidence = self._estimate_confidence(trace)
return {
'answer': answer,
'reasoning_trace': trace,
'total_iterations': iteration,
'confidence': confidence
}
def _get_system_prompt(self) -> str:
"""Define agent behavior and available tools."""
return """You are a research agent that answers questions using a ReAct (Reasoning + Acting) approach.
Available tools:
- vector_search(query): Semantic search for relevant documents
- keyword_search(query): Keyword-based search (good for exact terms, acronyms)
- hybrid_search(query): Combined semantic + keyword search
- get_document(doc_id): Retrieve full document by ID
Format your response as:
Thought: [Your reasoning about what to do next]
Action: [tool_name]
Action Input: [input to the tool]
Or, when you have enough information:
Thought: [Your reasoning about why you can now answer]
Final Answer: [Your complete answer with citations]
Instructions:
1. Think step-by-step about what information you need
2. Use tools iteratively to gather information
3. If initial results are insufficient, reformulate and search again
4. Verify important facts across multiple sources when possible
5. Provide a Final Answer only when confident
6. Cite sources in your answer using [doc_id] notation
Begin!"""
def _parse_agent_response(self, response: str) -> tuple:
"""
Parse agent response to extract thought, action, and answer.
Returns:
(thought, action, action_input, final_answer)
"""
# Extract thought
thought_match = re.search(r'Thought:\s*(.*?)(?=\n|Action:|Final Answer:|$)', response, re.DOTALL)
thought = thought_match.group(1).strip() if thought_match else None
# Check for final answer
final_answer_match = re.search(r'Final Answer:\s*(.*?)$', response, re.DOTALL)
if final_answer_match:
return thought, None, None, final_answer_match.group(1).strip()
# Extract action
action_match = re.search(r'Action:\s*(\w+)', response)
action = action_match.group(1) if action_match else None
# Extract action input
action_input_match = re.search(r'Action Input:\s*(.*?)(?=\n|$)', response)
action_input = action_input_match.group(1).strip() if action_input_match else None
return thought, action, action_input, None
# Tool implementations
def _vector_search(self, query: str) -> str:
"""Execute vector search and format results."""
results = self.retriever.retrieve(query)
formatted = []
for i, result in enumerate(results[:5], 1): # Top 5
formatted.append(
f"[doc_{i}] (score: {result['final_score']:.2f})\n"
f"{result['text'][:300]}..."
)
return "\n\n".join(formatted) if formatted else "No results found."
def _keyword_search(self, query: str) -> str:
"""Execute keyword search (BM25)."""
# Use BM25 component of hybrid retriever
results = self.retriever.bm25_retriever.get_relevant_documents(query)
formatted = []
for i, result in enumerate(results[:5], 1):
formatted.append(
f"[doc_{i}]\n{result.page_content[:300]}..."
)
return "\n\n".join(formatted) if formatted else "No results found."
def _hybrid_search(self, query: str) -> str:
"""Execute hybrid search."""
return self._vector_search(query) # Our retriever is already hybrid
def _get_document(self, doc_id: str) -> str:
"""Retrieve full document by ID."""
# Implementation depends on your document store
return f"Full content of {doc_id}: [document content here]"
def _estimate_confidence(self, trace: List[Dict]) -> float:
"""
Estimate confidence based on reasoning trace.
Factors:
- Number of sources consulted
- Retrieval scores
- Whether verification was done
"""
if not trace:
return 0.0
# More iterations with good results = higher confidence
successful_retrievals = sum(
1 for step in trace
if step.get('observation') and 'No results found' not in step.get('observation', '')
)
# Base confidence on successful retrievals
confidence = min(successful_retrievals / 3, 1.0) # Max out at 3 good retrievals
# Bonus for verification (multiple searches on same topic)
actions = [step.get('action') for step in trace]
if len(set(actions)) > 2: # Used multiple different tools
confidence *= 1.1
return min(confidence, 1.0)
# Usage Example
from building_production_rag import HybridRetriever # From previous post
# Initialize retriever (from previous code)
retriever = HybridRetriever(...)
# Initialize ReAct agent
agent = ReActAgent(retriever=retriever, max_iterations=5)
# Run on complex query
result = agent.run(
"What was the revenue impact of our European expansion and how does it compare to our Asian markets?"
)
print("Answer:", result['answer'])
print(f"\nIterations: {result['total_iterations']}")
print(f"Confidence: {result['confidence']:.2%}")
print("\nReasoning Trace:")
for step in result['reasoning_trace']:
print(f"\n--- Iteration {step['iteration']} ---")
if step.get('thought'):
print(f"Thought: {step['thought']}")
if step.get('action'):
print(f"Action: {step['action']}({step.get('action_input', '')})")
if step.get('observation'):
print(f"Observation: {step['observation'][:200]}...")
Pattern 2: Plan-and-Execute
For complex queries, generate a full plan upfront, then execute:
Planning phase:
Query: "Analyze the competitive landscape for our new product launch"
Plan:
1. Identify our new product and its key features
2. Find direct competitors in this product category
3. Compare pricing across competitors
4. Analyze competitor strengths and weaknesses
5. Identify market gaps and opportunities
6. Synthesize competitive positioning recommendation
Execution phase: Execute each step, potentially parallelizing independent steps. Revise plan if new information suggests a different approach.
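The sketch below shows one way to wire this pattern, reusing the OpenAI client style from earlier; execute_step is a hypothetical callback standing in for whatever retrieval-plus-summary call each step needs.

import json
from typing import Callable, Dict, List
from openai import OpenAI

def plan_and_execute(query: str, execute_step: Callable[[str, List[str]], str]) -> Dict:
    """Generate a research plan up front, then run each step in order.

    execute_step(step_description, prior_results) performs one step (for example,
    a retrieval plus summary) and returns its finding as text.
    """
    client = OpenAI()
    plan_response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                f"Create a step-by-step research plan for: {query}\n"
                'Return JSON: {"steps": ["step 1", "step 2", ...]}'
            ),
        }],
    )
    steps = json.loads(plan_response.choices[0].message.content)["steps"]

    results: List[str] = []
    for step in steps:
        # Each step sees earlier findings so it can build on them
        results.append(execute_step(step, results))

    return {"plan": steps, "step_results": results}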
Pattern 3: Self-RAG with Reflection
After generating a response, the agent critiques its own answer:
[Generate initial answer]
↓
[Self-critique]
- Is every claim supported by retrieved evidence?
- Are there logical gaps in the reasoning?
- Did I miss any relevant aspect of the question?
- Is my confidence level appropriate?
↓
[If issues found]
- Identify what additional information is needed
- Retrieve more context
- Regenerate improved answer
↓
[Final answer with confidence score]
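A minimal critique pass might look like the sketch below; it assumes the draft answer and retrieved evidence are plain strings, and the JSON critique schema is illustrative rather than prescribed.

import json
from typing import Dict, List
from openai import OpenAI

def critique_answer(question: str, answer: str, evidence: List[str]) -> Dict:
    """One self-critique pass: check the draft answer against retrieved evidence.

    Returns a verdict plus any follow-up searches the critic thinks are needed;
    the caller decides whether to retrieve more context and regenerate.
    """
    client = OpenAI()
    evidence_block = "\n\n".join(f"[{i}] {e}" for i, e in enumerate(evidence, 1))
    prompt = f"""Question: {question}

Draft answer: {answer}

Retrieved evidence:
{evidence_block}

Critique the draft:
1. Is every claim supported by the evidence above?
2. Are there logical gaps or missing aspects of the question?
Return JSON: {{"supported": true/false, "gaps": ["..."], "followup_searches": ["..."]}}"""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# If critique["supported"] is False, run critique["followup_searches"],
# append the new evidence, and regenerate the answer.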
Pattern 4: Multi-Agent RAG
Different agents specialize in different aspects:
- Retriever Agent: Optimizes for recall, finds all potentially relevant documents
- Analyst Agent: Deep-dives into specific documents, extracts detailed insights
- Synthesizer Agent: Combines findings across sources, resolves conflicts
- Critic Agent: Evaluates answer quality, identifies gaps
Agents communicate through structured messages, building toward a comprehensive answer.
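The sketch below shows the shape of that message passing; the AgentMessage dataclass and the run(messages) interface are assumptions for illustration, not a prescribed API.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentMessage:
    """Structured message passed between specialized agents."""
    sender: str
    content: str
    metadata: Dict = field(default_factory=dict)

def run_multi_agent_pipeline(query: str, retriever_agent, analyst_agent,
                             synthesizer_agent, critic_agent, max_rounds: int = 3) -> str:
    """Pass findings from agent to agent, looping back through retrieval when the
    critic flags gaps. Each agent is assumed to expose run(messages) -> AgentMessage."""
    messages: List[AgentMessage] = [AgentMessage(sender="user", content=query)]
    draft = AgentMessage(sender="synthesizer", content="")

    for _ in range(max_rounds):
        messages.append(retriever_agent.run(messages))   # optimize for recall
        messages.append(analyst_agent.run(messages))     # per-document deep dive
        draft = synthesizer_agent.run(messages)           # merge findings, resolve conflicts
        messages.append(draft)

        verdict = critic_agent.run(messages)              # evaluate quality, identify gaps
        if verdict.metadata.get("approved", False):
            return draft.content
        messages.append(verdict)  # the critic's gaps steer the next retrieval round

    return draft.content  # best effort after hitting the round cap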
Handling Multi-Hop Questions
Multi-hop questions require chaining retrievals where the output of one search informs the next:
Example: "Who manages the team that built our highest-revenue product?"
Hop 1: Find highest-revenue product → Result: "Enterprise Analytics Platform"
Hop 2: Find team that built Enterprise Analytics Platform → Result: "Platform Engineering Team"
Hop 3: Find manager of Platform Engineering Team → Result: "Sarah Chen, VP of Engineering"
Final answer: "Sarah Chen manages the Platform Engineering Team, which built the Enterprise Analytics Platform—our highest-revenue product."
Implementation Considerations
Hop limits: Set a maximum number of hops to prevent infinite loops; 3-5 hops handle most queries.
Caching intermediate results: Store intermediate findings to avoid re-retrieval if the agent needs to backtrack.
Confidence propagation: Track confidence at each hop. Low confidence early should trigger broader search rather than proceeding with potentially wrong information.
Parallel vs. sequential: Some hops can be parallelized if they're independent. Analyze the dependency graph.
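The sketch below chains hops sequentially while applying these considerations; answer_hop is a hypothetical helper that performs one retrieval-plus-extraction step, and the 0.5 confidence floor is illustrative.

from typing import Callable, Dict, List, Tuple

def run_multi_hop(
    hops: List[str],
    answer_hop: Callable[[str, Dict[str, str]], Tuple[str, float]],
    max_hops: int = 5,
    confidence_floor: float = 0.5,
) -> Dict:
    """Chain retrievals sequentially; later hop templates can reference earlier
    findings (e.g. "Who manages {hop_1}?"). answer_hop(question, findings) does one
    retrieval-plus-extraction step and returns (answer, confidence)."""
    findings: Dict[str, str] = {}
    cache: Dict[str, Tuple[str, float]] = {}
    confidences: List[float] = []

    for i, template in enumerate(hops[:max_hops]):  # hop limit guards against runaway chains
        question = template.format(**findings)      # substitute earlier answers into the question

        if question in cache:                       # reuse cached findings if the agent backtracks
            answer, confidence = cache[question]
        else:
            answer, confidence = answer_hop(question, findings)
            cache[question] = (answer, confidence)

        if confidence < confidence_floor:
            # A shaky early hop poisons everything downstream: stop and report
            # rather than chaining on a likely-wrong intermediate answer.
            return {"status": "low_confidence", "failed_hop": i, "findings": findings}

        findings[f"hop_{i}"] = answer
        confidences.append(confidence)

    return {"status": "complete", "findings": findings,
            "confidence": min(confidences) if confidences else 0.0}

# hops = ["What is our highest-revenue product?",
#         "Which team built {hop_0}?",
#         "Who manages {hop_1}?"]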
Tool Integration
Agentic RAG becomes more powerful when the agent can use tools beyond search:
Calculator/Code Execution
For questions involving computation: "What percentage of our revenue comes from the top 10 customers?" → Retrieve customer revenue data → Calculate percentages → Format answer
Structured Data Queries
For precise data needs: "How many support tickets did we close last month?" → Generate SQL query → Execute against database → Interpret results
Web Search
For questions requiring external context: "How does our pricing compare to the market average?" → Retrieve internal pricing → Search web for competitor pricing → Compare
Knowledge Graph Traversal
For relationship-heavy queries: "Find all projects connected to the AI initiative" → Query knowledge graph → Expand relationships → Retrieve related documents
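As one concrete example of extending the tool set, a calculator can be registered alongside the search tools of the ReActAgent from Pattern 1; the safe-evaluation helper below is an illustrative sketch (SQL, web-search, and graph tools would plug in the same way).

# Illustrative extension of the ReActAgent's tool registry. The calculator
# evaluates arithmetic the model hands it, e.g. percentage-of-revenue math.
import ast
import operator

_ALLOWED_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Pow: operator.pow,
}

def _safe_eval(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Expression):
        return _safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
        return _ALLOWED_OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_safe_eval(node.operand)
    raise ValueError("Unsupported expression")

def calculator(expression: str) -> str:
    """Tool body: arithmetic on retrieved numbers, e.g. '(1.2e6 / 4.8e6) * 100'."""
    try:
        return str(_safe_eval(ast.parse(expression, mode="eval")))
    except Exception as e:
        return f"Calculator error: {e}"

# Register it and describe it in the system prompt so the agent knows it exists:
# agent.tools['calculator'] = calculator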
Evaluation and Optimization
Metrics for Agentic RAG
Beyond traditional RAG metrics, measure:
| Metric | What It Measures | Target |
|---|---|---|
| Answer completeness | Did it address all aspects? | > 90% |
| Retrieval efficiency | Searches per successful answer | < 5 |
| Multi-hop accuracy | Correct chains of reasoning | > 85% |
| Self-correction rate | Issues caught by self-critique | > 70% |
| Iteration value | Quality gain from iteration | > 10% |
Ablation Testing
Measure the value of each agentic component:
- Single-shot RAG (baseline)
- + Query decomposition
- + Iterative retrieval
- + Self-critique
- Full agentic system
This identifies which components provide value for your specific use case.
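A sketch of how such an ablation run might be wired; build_pipeline and score are hypothetical stand-ins for your own pipeline factory and evaluation judge.

from typing import Callable, Dict, List

# Cumulative configurations matching the list above: each row enables one more
# agentic component on top of the previous one.
ABLATION_CONFIGS = [
    {"name": "single_shot_baseline", "decompose": False, "iterate": False, "self_critique": False},
    {"name": "+decomposition",       "decompose": True,  "iterate": False, "self_critique": False},
    {"name": "+iterative_retrieval", "decompose": True,  "iterate": True,  "self_critique": False},
    {"name": "+self_critique",       "decompose": True,  "iterate": True,  "self_critique": True},
]

def run_ablation(
    build_pipeline: Callable[..., object],
    score: Callable[[object, List[Dict]], float],
    eval_set: List[Dict],
) -> Dict[str, float]:
    """Score each configuration on the same evaluation set so the marginal
    value of each component is visible."""
    results = {}
    for config in ABLATION_CONFIGS:
        flags = {k: v for k, v in config.items() if k != "name"}
        pipeline = build_pipeline(**flags)
        results[config["name"]] = score(pipeline, eval_set)
    return results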
Cost Management
Agentic systems use more tokens than single-shot RAG. Optimize:
- Early stopping: If the first retrieval is high-confidence, skip iteration
- Query routing: Use simple RAG for simple queries, agentic for complex ones
- Caching: Cache intermediate reasoning for similar queries
- Model tiering: Use smaller models for planning, larger models for final synthesis
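A sketch combining query routing with early stopping, reusing the QueryAnalyzer and ReActAgent from earlier; agent_factory, simple_rag, and the 0.8 confidence cutoff are assumptions for illustration.

from typing import Callable, Dict

def answer_with_cost_controls(
    query: str,
    analyzer,                                # QueryAnalyzer from earlier
    agent_factory: Callable[[int], object],  # hypothetical: returns a ReActAgent with the given max_iterations
    simple_rag: Callable[[str], Dict],       # hypothetical single-shot pipeline
    early_stop_confidence: float = 0.8,
) -> Dict:
    """Query routing plus early stopping.

    Simple queries never touch the agent. Complex queries get a cheap
    low-iteration pass first; only low-confidence results pay for a full run.
    """
    analysis = analyzer.analyze(query)
    if not analysis["use_agentic"]:
        return simple_rag(query)

    # Cheap pass: two iterations often suffice when the first retrieval lands
    result = agent_factory(2).run(query)
    if result["confidence"] >= early_stop_confidence:
        return result

    # Escalate only when the cheap pass wasn't confident enough
    budget = analysis["strategy"].get("max_iterations", 5)
    return agent_factory(budget).run(query)

# agent_factory could be: lambda n: ReActAgent(retriever=retriever, max_iterations=n)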
Production Considerations
Latency Budgeting
Agentic systems are slower. Budget accordingly:
| Component | Typical Latency |
|---|---|
| Query analysis | 200-500ms |
| Each retrieval | 100-300ms |
| Each reasoning step | 500-1500ms |
| Final synthesis | 500-1000ms |
| Self-critique | 300-800ms |
A 3-hop query with iteration might take 5-10 seconds. Set user expectations or show progress indicators.
Observability
Log everything for debugging:
- Query classification decisions
- Generated plans
- Each retrieval query and results
- Reasoning traces (thoughts)
- Iteration decisions
- Self-critique findings
- Final confidence scores
Build dashboards showing:
- Average hops per query type
- Iteration frequency and success rate
- Self-critique catch rate
- Component-level latencies
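A sketch of structured logging for these fields using the standard library; the record keys mirror the trace dict returned by the ReActAgent and are otherwise illustrative.

import json
import logging
import time
import uuid

logger = logging.getLogger("agentic_rag")

def log_agent_run(query: str, analysis: dict, result: dict, latency_s: float) -> None:
    """Emit one structured record per query so dashboards can aggregate
    iterations, actions, confidence, and latency by query type."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "complexity": getattr(analysis.get("complexity"), "value", None),
        "strategy": analysis.get("strategy", {}).get("type"),
        "iterations": result.get("total_iterations"),
        "actions": [step.get("action") for step in result.get("reasoning_trace", [])],
        "confidence": result.get("confidence"),
        "latency_seconds": round(latency_s, 3),
    }
    logger.info(json.dumps(record))

# Typical use:
# start = time.monotonic()
# result = agent.run(query)
# log_agent_run(query, analyzer.analyze(query), result, time.monotonic() - start)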
Failure Modes
Retrieval loops: Agent keeps reformulating without progress → Solution: Max iteration limits, diversity in reformulations
Over-retrieval: Agent retrieves far more than necessary → Solution: Confidence thresholds for "enough information"
Under-retrieval: Agent stops too early with incomplete information → Solution: Self-critique to catch gaps
Hallucination in reasoning: Agent makes up facts in intermediate steps → Solution: Ground every claim in retrieved documents
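One cheap guard against the retrieval-loop failure mode is to compare each new search query against those already issued; the sketch below uses word-overlap similarity, and the 0.8 threshold is arbitrary.

from typing import List

def is_repeat_query(new_query: str, previous_queries: List[str], threshold: float = 0.8) -> bool:
    """Flag a reformulation that is nearly identical to an earlier search,
    using Jaccard overlap on lowercased word sets as a cheap similarity check."""
    new_tokens = set(new_query.lower().split())
    for prev in previous_queries:
        prev_tokens = set(prev.lower().split())
        if not new_tokens or not prev_tokens:
            continue
        jaccard = len(new_tokens & prev_tokens) / len(new_tokens | prev_tokens)
        if jaccard >= threshold:
            return True
    return False

# Inside the agent loop, before executing a search action:
# if is_repeat_query(action_input, issued_queries):
#     observation = ("You already searched for nearly the same thing. "
#                    "Try different terms or a different tool.")
# else:
#     issued_queries.append(action_input)
#     observation = self.tools[action](action_input)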
Case Study: Legal Research Assistant
We built an agentic RAG system for legal research that needed to:
- Find relevant case law across jurisdictions
- Trace precedent chains (which cases cite which)
- Identify conflicting rulings
- Synthesize legal analysis
Traditional RAG performance: 52% of complex research questions answered satisfactorily
Agentic RAG performance: 84% satisfactory answers
Key improvements:
- Multi-hop precedent tracing (case A cites B, which overruled C)
- Automatic jurisdiction filtering based on query context
- Conflict detection across retrieved cases
- Iterative refinement when initial results were too broad
Conclusion
Agentic RAG represents the next evolution of retrieval-augmented systems. By adding reasoning, planning, and iteration, these systems handle the complex information needs that defeat single-shot RAG.
The tradeoff is complexity and latency. Not every application needs agentic capabilities—simple factual retrieval works fine with traditional RAG. But for research, analysis, and synthesis tasks, agentic RAG delivers substantially better results.
Start with traditional RAG. Add agentic capabilities when you identify query types that consistently fail. Measure the improvement to justify the added complexity.