RAG vs CAG: When Cache-Augmented Generation Beats Retrieval

A comprehensive comparison of Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG). Learn when to use each approach, implementation patterns, and how to build hybrid systems.


The Rise of CAG

In December 2024, a paper titled "Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks" (Chan et al.) challenged the dominance of RAG as the default approach for augmenting LLMs with external knowledge. The paper was presented at the ACM Web Conference 2025.

The core insight: with modern LLMs supporting 100K-1M+ token context windows, you can often skip retrieval entirely by preloading all relevant knowledge into the context and caching it.

This guide covers when to use RAG vs CAG, how each works under the hood, and how to build hybrid systems that combine both approaches.

Quick Comparison

| Aspect | RAG | CAG |
| --- | --- | --- |
| Retrieval | Real-time search at inference | None (preloaded) |
| Latency | Higher (retrieval + generation) | Lower (generation only) |
| Knowledge size | Unlimited | Limited by context window |
| Freshness | Real-time updates possible | Requires cache rebuild |
| Complexity | Higher (vector DB, embeddings, retrieval) | Lower (just KV cache) |
| Best for | Large, dynamic knowledge bases | Static, bounded knowledge |

How RAG Works

RAG (Retrieval-Augmented Generation) fetches relevant documents at query time:

Code
User Query
    ↓
[1. Embed Query]
    ↓
[2. Search Vector DB] → Top-K Documents
    ↓
[3. Construct Prompt]
    Query + Retrieved Docs
    ↓
[4. LLM Generation]
    ↓
Response

RAG Implementation

This implementation shows a production-ready RAG pipeline with MMR (Maximum Marginal Relevance) retrieval. MMR balances relevance and diversity—it finds relevant documents but penalizes redundancy, ensuring you get varied information rather than five similar documents saying the same thing.

The key configuration choice: k=5 returns 5 final documents, but fetch_k=20 retrieves 20 candidates first, then applies MMR to select the most diverse 5. This gives better coverage than simple top-k retrieval.

Python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA

class RAGSystem:
    def __init__(self, index_name: str):
        self.llm = ChatOpenAI(model="gpt-5.2")
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
        self.vectorstore = PineconeVectorStore.from_existing_index(
            index_name=index_name,
            embedding=self.embeddings
        )
        self.retriever = self.vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs={"k": 5, "fetch_k": 20}
        )

    def query(self, question: str) -> str:
        """Query with real-time retrieval."""
        # 1. Retrieve relevant documents
        docs = self.retriever.invoke(question)

        # 2. Build context
        context = "\n\n".join([doc.page_content for doc in docs])

        # 3. Generate response
        prompt = f"""Answer the question based on the following context:

Context:
{context}

Question: {question}

Answer:"""

        response = self.llm.invoke(prompt)
        return response.content

RAG Latency Breakdown

Code
Total latency = Embedding (50ms) + Search (100ms) + Generation (500ms)
             = ~650ms per query

The retrieval step adds significant latency and several potential failure modes (a per-stage timing sketch follows this list):

  • Embedding service latency
  • Vector database query time
  • Network round-trips
  • Retrieval errors (wrong documents returned)
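
To see where your own pipeline spends that time, a lightweight per-stage timer around the RAGSystem defined above might look like this; timed and timed_query are illustrative helpers, not part of the class:

Python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    """Record wall-clock time (ms) for one pipeline stage."""
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000

def timed_query(rag: RAGSystem, question: str) -> tuple[str, dict]:
    """Run a RAG query and report per-stage latency."""
    timings: dict[str, float] = {}
    with timed("retrieval", timings):          # embed query + vector search
        docs = rag.retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    with timed("generation", timings):         # LLM generation
        answer = rag.llm.invoke(
            f"Answer based on the context:\n\n{context}\n\nQuestion: {question}\n\nAnswer:"
        ).content
    return answer, timings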

How CAG Works

CAG (Cache-Augmented Generation) preloads all knowledge and caches the model's internal state:

Code
[Offline: Knowledge Preloading]
    All Documents → LLM Context → KV Cache (saved to disk)

[Online: Query Time]
    User Query + Cached KV → LLM Generation → Response

The KV Cache Explained

When an LLM processes text, it computes Key-Value (KV) pairs for each token in the attention mechanism. These represent the model's "understanding" of the context.

Python
# Simplified attention mechanism
# For each layer, the model computes:
K = input @ W_k  # Keys
V = input @ W_v  # Values

# These are cached and reused for subsequent tokens
# Instead of recomputing for the entire context each time

CAG insight: If your knowledge base is static, compute the KV cache once and reuse it for every query.

CAG Implementation

CAG's power comes from precomputing the expensive part of LLM inference. When an LLM processes your 100K-token knowledge base, it computes Key-Value matrices at each attention layer. These matrices encode the model's "understanding" of the context.

The insight: these KV values only depend on the input, not the query. So compute them once, save to disk, and reload for every query. The query only needs to compute its own (tiny) KV values, then attention can reference the cached knowledge KVs.

The tradeoff: caches are large (tens of GB for big models) and model-specific. If you fine-tune the model, the cache is invalidated.

Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import pickle

class CAGSystem:
    def __init__(self, model_name: str = "meta-llama/Llama-3.1-70B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.kv_cache = None

    def preload_knowledge(self, documents: list[str], cache_path: str):
        """Preload all documents and cache KV states."""
        # Combine all documents into context
        knowledge_text = "\n\n---\n\n".join(documents)

        # Create the preloaded prompt
        preload_prompt = f"""You are a helpful assistant with access to the following knowledge base:

{knowledge_text}

Use this knowledge to answer questions accurately. If the answer is not in the knowledge base, say so.

"""
        # Tokenize
        inputs = self.tokenizer(
            preload_prompt,
            return_tensors="pt",
            truncation=True,
            max_length=self.model.config.max_position_embeddings - 1000  # Leave room for query
        ).to(self.model.device)

        # Forward pass to generate KV cache
        with torch.no_grad():
            outputs = self.model(
                **inputs,
                use_cache=True,
                return_dict=True
            )
            self.kv_cache = outputs.past_key_values

        # Save cache to disk
        with open(cache_path, 'wb') as f:
            pickle.dump(self.kv_cache, f)

        print(f"Cached {len(documents)} documents ({inputs['input_ids'].shape[1]} tokens)")

    def load_cache(self, cache_path: str):
        """Load precomputed KV cache from disk."""
        with open(cache_path, 'rb') as f:
            self.kv_cache = pickle.load(f)

    def query(self, question: str) -> str:
        """Query using cached knowledge (no retrieval)."""
        # Format the query
        query_prompt = f"\n\nQuestion: {question}\n\nAnswer:"

        inputs = self.tokenizer(
            query_prompt,
            return_tensors="pt"
        ).to(self.model.device)

        # Generate using the cached KV states
        # NOTE: generate() appends the new tokens to this cache; clone or truncate it
        # back to the knowledge-only length between queries, as the CAG paper does.
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                past_key_values=self.kv_cache,
                max_new_tokens=500,
                do_sample=True,
                temperature=0.7
            )

        response = self.tokenizer.decode(
            outputs[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True
        )
        return response

CAG Latency Breakdown

Code
Total latency = Generation only (500ms)
             = ~500ms per query

Speedup vs RAG: ~23% faster (no retrieval overhead)

On HotPotQA benchmark: CAG reduced generation time from 94.35s (RAG) to 2.33s — a 40x improvement.

Memory & Compute Requirements

Understanding the resource implications is critical for choosing between RAG and CAG.

The fundamental resource tradeoff: RAG trades runtime compute (retrieval + embedding per query) for reduced memory (only retrieved docs in context). CAG trades large upfront memory (full knowledge in KV cache) for minimal per-query compute (just decode the answer). Your infrastructure constraints often make this decision for you.

Why this matters in practice: If you have a single A100 (80GB), CAG with a 70B model at 100K tokens is impossible—the KV cache alone exceeds available memory. But if you have multiple GPUs with NVLink, CAG becomes viable. Conversely, RAG works on modest hardware but requires additional infrastructure (vector database, embedding service).

KV Cache Memory Formula

The KV cache size scales with model architecture and context length:

Code
KV Cache Size = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_param

Example for Llama 3.1 70B (80 layers, 8 KV heads, 128 head_dim, bfloat16):
= 2 × 80 × 8 × 128 × seq_len × 2 bytes
= 327,680 × seq_len bytes
= ~320 KB per token (~312 MB per 1K tokens)
= ~31 GB for 100K tokens
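
A quick sanity check of this arithmetic; the kv_cache_bytes helper is illustrative:

Python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_param: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head_dim x tokens x bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

# Llama 3.1 70B in bfloat16: 80 layers, 8 KV heads (GQA), head_dim 128
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=100_000)
print(f"{size / 2**30:.1f} GiB")  # ~30.5 GiB, i.e. the ~31 GB figure above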

Memory Requirements by Model

| Model | Context | KV Cache (100K tokens) | Total VRAM Needed |
| --- | --- | --- | --- |
| Llama 3.1 8B | 128K | ~4 GB | ~20 GB |
| Llama 3.1 70B | 128K | ~31 GB | ~170 GB |
| Llama 4 Scout | 10M | ~250 GB (1M tokens) | 500+ GB |
| Qwen 2.5 72B | 128K | ~35 GB | ~180 GB |

Prefill vs Decode Costs

CAG shifts compute from per-query to one-time prefill:

Code
RAG (per query):
  - Embed query: ~50ms
  - Vector search: ~100ms
  - Prefill retrieved docs (~5K tokens): ~200ms
  - Decode response: ~300ms
  Total: ~650ms

CAG (per query):
  - Load cached KV: ~50ms (from SSD) or ~5ms (from RAM)
  - Decode response: ~300ms
  Total: ~350ms

CAG prefill (one-time):
  - Process 100K tokens: ~30-60 seconds
  - Amortized over 10K queries: ~3-6ms per query
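
Using the illustrative figures above, a quick break-even calculation shows how many queries it takes for the one-time prefill to pay for itself relative to RAG's per-query retrieval overhead; the helper is a sketch, not a measured result:

Python
# Illustrative figures from the breakdown above (seconds)
RAG_RETRIEVAL_OVERHEAD = 0.05 + 0.10 + 0.20  # embed + search + prefill retrieved docs
CAG_CACHE_LOAD = 0.05                        # load cached KV from SSD
CAG_PREFILL_ONCE = 45.0                      # one-time 100K-token prefill (mid-range)

def breakeven_queries() -> int:
    """Queries after which CAG's one-time prefill costs less total time than RAG's retrieval."""
    saving_per_query = RAG_RETRIEVAL_OVERHEAD - CAG_CACHE_LOAD
    return int(CAG_PREFILL_ONCE / saving_per_query) + 1

print(breakeven_queries())  # ~151 queries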

Batch Inference Considerations

CAG with shared caches enables efficient batching:

Python
class BatchCAGSystem:
    """Efficient batch inference with shared KV cache."""

    def __init__(self, model, tokenizer, shared_cache):
        self.model = model
        self.tokenizer = tokenizer
        self.shared_cache = shared_cache  # Same knowledge KV cache for all queries

    def batch_query(self, questions: list[str]) -> list[str]:
        # All queries share the same preloaded context
        # Only the query tokens differ per batch item
        batch_inputs = self.tokenizer(
            [f"\n\nQuestion: {q}\n\nAnswer:" for q in questions],
            padding=True,
            return_tensors="pt"
        )

        # Replicate KV cache for batch dimension
        batch_cache = self._expand_cache_for_batch(
            self.shared_cache,
            batch_size=len(questions)
        )

        outputs = self.model.generate(
            **batch_inputs,
            past_key_values=batch_cache,
            max_new_tokens=200
        )

        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
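
The _expand_cache_for_batch helper is left undefined above. A minimal sketch, assuming the legacy tuple format of past_key_values (one (key, value) pair per layer, each shaped [batch, num_kv_heads, seq_len, head_dim]); newer transformers versions wrap the cache in Cache objects, which need their own handling:

Python
def expand_cache_for_batch(past_key_values, batch_size: int):
    """Repeat a batch-size-1 KV cache along the batch dimension so every
    query in the batch attends to the same preloaded knowledge."""
    return tuple(
        (k.repeat(batch_size, 1, 1, 1), v.repeat(batch_size, 1, 1, 1))
        for k, v in past_key_values
    )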

Cost Comparison: Infrastructure

| Component | RAG | CAG |
| --- | --- | --- |
| Vector DB | $200-2000/mo (managed) | Not needed |
| Embedding API | $0.0001/1K tokens | Not needed |
| GPU Memory | 24-48 GB sufficient | 80-180 GB+ needed |
| Storage | Vector index (~10 GB) | KV cache files (~30 GB) |
| Complexity | High (multiple services) | Low (single model) |

When to Use Each Approach

Use RAG When:

| Scenario | Why RAG |
| --- | --- |
| Large knowledge base | Exceeds context window (>1M tokens) |
| Frequently updated data | News, inventory, real-time analytics |
| Diverse query patterns | Different queries need different docs |
| Multi-tenant systems | Different users need different knowledge |
| Cost constraints | Smaller context = fewer tokens = lower cost |

Example RAG use cases:

  • Enterprise search across millions of documents
  • Customer support with ticket history
  • Real-time news Q&A
  • E-commerce product recommendations

Use CAG When:

| Scenario | Why CAG |
| --- | --- |
| Bounded knowledge | Fits in context window |
| Static content | Technical manuals, policies, FAQs |
| Low latency required | Real-time applications |
| Simplicity valued | No vector DB infrastructure |
| High query volume | Amortize preloading cost |

Example CAG use cases:

  • Product documentation chatbot
  • Company policy Q&A
  • Legal contract analysis
  • Medical reference lookup (fixed guidelines)

Decision Framework

Code
Is your knowledge base > 500K tokens?
    ├── YES → Use RAG
    └── NO → Continue...

Does your knowledge change frequently?
    ├── YES (daily/hourly) → Use RAG
    └── NO (weekly/monthly) → Continue...

Is latency critical (<200ms)?
    ├── YES → Use CAG
    └── NO → Either works, consider complexity

Do you already have vector DB infrastructure?
    ├── YES → RAG may be easier
    └── NO → CAG avoids new infrastructure
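
The same tree expressed as a small routing helper; choose_approach is illustrative, and the thresholds are the ones used above (tune them for your workload):

Python
def choose_approach(kb_tokens: int, updates_per_week: float,
                    latency_budget_ms: int, has_vector_db: bool) -> str:
    """Walk the decision tree above; returns 'rag' or 'cag'."""
    if kb_tokens > 500_000:
        return "rag"                  # knowledge base exceeds practical context size
    if updates_per_week > 7:          # daily or hourly updates
        return "rag"
    if latency_budget_ms < 200:
        return "cag"                  # skipping retrieval keeps latency low and consistent
    # Either works; let existing infrastructure break the tie
    return "rag" if has_vector_db else "cag"

print(choose_approach(kb_tokens=120_000, updates_per_week=1,
                      latency_budget_ms=150, has_vector_db=False))  # cag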

Hybrid Approaches

The best systems often combine both approaches. The key insight: CAG and RAG have complementary strengths. CAG excels at fast, consistent answers for common questions. RAG excels at handling the long tail of rare queries and fresh information.

A well-designed hybrid uses CAG for the 80% of queries that hit core knowledge (fast path), and falls back to RAG for the 20% that need retrieval (comprehensive path). This gives you both speed and coverage.

Pattern 1: CAG Core + RAG Edge Cases

This pattern preloads your most-asked knowledge into CAG, using RAG only when CAG signals uncertainty. The needs_retrieval function is the router—it detects "I don't know" responses or low-confidence answers and triggers RAG fallback.

Python
class HybridSystem:
    """Preload common knowledge, retrieve for edge cases."""

    def __init__(self):
        self.cag = CAGSystem()
        self.rag = RAGSystem(index_name="edge-cases")
        self.cag.load_cache("core_knowledge.cache")

    def query(self, question: str) -> str:
        # Try CAG first (fast path)
        cag_response = self.cag.query(question)

        # Check confidence / detect "I don't know"
        if self.needs_retrieval(cag_response):
            # Fall back to RAG for edge cases
            return self.rag.query(question)

        return cag_response

    def needs_retrieval(self, response: str) -> bool:
        """Detect when CAG doesn't have the answer."""
        uncertainty_phrases = [
            "i don't have information",
            "not in my knowledge",
            "i'm not sure",
            "cannot find"
        ]
        response_lower = response.lower()
        return any(phrase in response_lower for phrase in uncertainty_phrases)

Pattern 2: Tiered Caching

Python
class TieredCAGSystem:
    """Multiple cache tiers for different knowledge domains."""

    def __init__(self):
        self.caches = {
            "products": "product_docs.cache",
            "policies": "company_policies.cache",
            "technical": "tech_manual.cache"
        }
        self.active_cache = None
        self.classifier = self.load_query_classifier()

    def query(self, question: str) -> str:
        # Classify query to select appropriate cache
        domain = self.classifier.predict(question)

        # Load domain-specific cache
        if self.active_cache != domain:
            self.load_cache(self.caches[domain])
            self.active_cache = domain

        return self.generate(question)

Pattern 3: CAG + Real-Time RAG Augmentation

Python
class AugmentedCAGSystem:
    """CAG for static knowledge, RAG for real-time data."""

    def __init__(self):
        self.cag = CAGSystem()  # Product catalog, policies
        self.rag = RAGSystem(index_name="realtime-data")  # Inventory, prices, promotions (index name illustrative)

    def query(self, question: str) -> str:
        # Get static context from CAG
        static_context = self.cag.get_cached_context()

        # Get dynamic context from RAG
        dynamic_docs = self.rag.retrieve(question)
        dynamic_context = "\n".join([d.page_content for d in dynamic_docs])

        # Combine for generation
        prompt = f"""Static Knowledge:
{static_context}

Current Information:
{dynamic_context}

Question: {question}
Answer:"""

        return self.rag.llm.invoke(prompt).content  # reuse the RAG system's LLM for the combined prompt

Limitations & Challenges

Both approaches have significant limitations that impact real-world deployments.

CAG Limitations

1. Lost in the Middle Problem

LLMs struggle to recall information from the middle of long contexts. The "Lost in the Middle" research from Stanford/Berkeley (Liu et al.) shows recall accuracy drops significantly for information positioned in the middle 50% of the context:

Code
Position in Context → Recall Accuracy
Beginning (0-10%):    ~95%
Middle (40-60%):      ~60-70%
End (90-100%):        ~90%

Mitigation strategies:

  • Place critical information at the beginning and end
  • Use structured formatting with clear section headers
  • Implement attention sinks (repetition of key facts)
  • Consider document ordering by importance (a sketch follows this list)
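
A minimal sketch of the last point, assuming you can rank documents by importance up front; order_for_long_context is illustrative and simply interleaves documents so the highest-ranked ones sit at the edges of the context, where recall is strongest:

Python
def order_for_long_context(docs_by_importance: list[str]) -> list[str]:
    """Place the highest-ranked documents at the start and end of the context,
    pushing the lowest-ranked ones toward the weak middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_importance):  # index 0 = most important
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Ranked A > B > C > D > E  →  context order: A, C, E, D, B
print(order_for_long_context(["A", "B", "C", "D", "E"]))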

2. Attention Dilution

As context grows, attention becomes diluted across more tokens:

Python
# Simplified attention visualization
def attention_per_token(context_length: int, query_tokens: int = 100):
    """Average attention each context token receives."""
    # Softmax distributes attention across all tokens
    # More tokens = less attention per token
    return query_tokens / context_length

# 10K context: 1% attention per token
# 100K context: 0.1% attention per token
# 1M context: 0.01% attention per token

This manifests as:

  • Subtle details getting ignored
  • Conflicting information not being reconciled
  • Reduced reasoning over distant context

3. Cache Staleness

CAG caches become stale when knowledge changes:

| Update Frequency | CAG Viability | Recommendation |
| --- | --- | --- |
| Real-time | Poor | Use RAG |
| Hourly | Poor | Use RAG or hybrid |
| Daily | Moderate | Scheduled cache rebuilds |
| Weekly | Good | Batch updates |
| Static | Excellent | Pure CAG |

RAG Limitations

1. Retrieval Failures

RAG can fail silently when retrieval misses relevant documents:

| Failure Mode | Cause | Impact |
| --- | --- | --- |
| Semantic gap | Query/doc embedding mismatch | Wrong docs retrieved |
| Chunking artifacts | Answer split across chunks | Partial information |
| Sparse coverage | Rare topics under-represented | No relevant docs |
| Adversarial queries | Intentionally confusing queries | Hallucinated answers |

Python
# Example: Retrieval failure detection
def detect_retrieval_failure(query: str, docs: list, response: str) -> bool:
    """Heuristics for detecting poor retrieval."""
    signals = []

    # Low relevance scores
    if all(doc.score < 0.5 for doc in docs):
        signals.append("low_relevance")

    # Response contradicts retrieved docs
    # (contains_contradiction and contains_external_knowledge are placeholder
    #  helpers, e.g. NLI- or LLM-based checks, not defined here)
    if contains_contradiction(response, docs):
        signals.append("contradiction")

    # Response uses knowledge not in docs (hallucination)
    if contains_external_knowledge(response, docs):
        signals.append("potential_hallucination")

    return len(signals) > 0

2. Multi-Hop Reasoning Failures

RAG struggles with questions requiring information synthesis:

Code
Question: "What is the revenue of the company that acquired the
           startup founded by the person who invented X?"

Required reasoning:
1. Who invented X? → Person A
2. What startup did Person A found? → Startup B
3. Who acquired Startup B? → Company C
4. What is Company C's revenue? → $Y

RAG typically retrieves docs for step 1, missing steps 2-4.

3. Latency Variance

RAG latency is unpredictable due to retrieval variability:

Code
Latency distribution (p50/p95/p99):
- Embedding: 30ms / 80ms / 200ms
- Vector search: 50ms / 150ms / 500ms
- Network: 20ms / 100ms / 300ms
- Total retrieval: 100ms / 330ms / 1000ms

CAG has consistent latency (no retrieval variance)

Comparison: Failure Modes

| Failure Mode | RAG | CAG |
| --- | --- | --- |
| Missing information | Retrieval miss | Context too long |
| Stale information | Index lag | Cache staleness |
| Conflicting sources | Chunk disagreement | In-context conflicts |
| Hallucination | Confabulation from bad retrieval | Less common (full context) |
| Latency spikes | Network/DB issues | Rare (local inference) |

Performance Comparison

From the research paper and real-world benchmarks:

Accuracy (HotPotQA)

| Method | Exact Match | F1 Score |
| --- | --- | --- |
| RAG (Top-5) | 45.2% | 58.1% |
| CAG (Full Context) | 51.3% | 64.7% |
| CAG (Optimized) | 52.1% | 65.2% |

Why CAG wins on accuracy: The model sees ALL relevant context, not just retrieved chunks. This helps with multi-hop reasoning where the answer requires connecting information from multiple sources.

Latency

| Method | Latency (HotPotQA Large) |
| --- | --- |
| RAG | 94.35 seconds |
| CAG | 2.33 seconds |

Cost Analysis

For 10,000 queries/day on a 100K token knowledge base:

| Approach | Daily Cost | Notes |
| --- | --- | --- |
| RAG | ~$150 | Retrieval + 5K context per query |
| CAG | ~$200 | Full 100K context per query |
| CAG (cached) | ~$50 | KV cache reduces compute |

CAG with KV caching can be significantly cheaper because you only compute the knowledge encoding once.
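
A rough cost model behind these estimates; daily_cost is illustrative, and the per-million-token prices are placeholders (substitute your provider's actual rates and cached-input discount):

Python
def daily_cost(queries_per_day: int, context_tokens: int, output_tokens: int = 500,
               input_price: float = 1.00,         # placeholder $ per 1M fresh input tokens
               cached_input_price: float = 0.10,  # placeholder $ per 1M cached input tokens
               output_price: float = 4.00,        # placeholder $ per 1M output tokens
               cached_fraction: float = 0.0) -> float:
    """Estimate daily spend; cached_fraction is the share of input billed at the cached rate."""
    cached = context_tokens * cached_fraction
    fresh = context_tokens - cached
    per_query = (fresh * input_price + cached * cached_input_price
                 + output_tokens * output_price) / 1_000_000
    return per_query * queries_per_day

print(daily_cost(10_000, 5_000))                          # RAG: ~5K retrieved tokens per query
print(daily_cost(10_000, 100_000, cached_fraction=0.99))  # CAG: 100K context, mostly cache hits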

Implementation Best Practices

For RAG

  1. Chunk strategically: Use semantic chunking, not fixed-size
  2. Hybrid search: Combine vector + keyword (BM25); a fusion sketch follows this list
  3. Reranking: Use a cross-encoder to rerank top results
  4. Query expansion: Rewrite queries for better retrieval
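
A minimal sketch of point 2, fusing vector and BM25 rankings with reciprocal rank fusion; vector_search and bm25_search are assumed callables that return document IDs in ranked order:

Python
def hybrid_search(query: str, vector_search, bm25_search, k: int = 5, c: int = 60) -> list[str]:
    """Reciprocal rank fusion: each doc scores 1 / (c + rank) in every ranking it appears in."""
    scores: dict[str, float] = {}
    for ranking in (vector_search(query), bm25_search(query)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]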

For CAG

  1. Optimize context order: Put most important info first (primacy effect)
  2. Use structured formatting: Headers, sections, clear delineation
  3. Compress where possible: Summarize verbose documents
  4. Version your caches: Track which documents are in each cache

For Hybrid

  1. Monitor cache hits: Track when CAG succeeds vs needs RAG
  2. A/B test: Compare latency and accuracy in production
  3. Warm caches: Pre-compute during off-peak hours
  4. Graceful degradation: If cache fails, fall back to RAG

Production Considerations

Deploying RAG or CAG at scale requires careful attention to operational concerns.

Cache Management for CAG

Versioning Strategy

Python
import hashlib
from datetime import datetime
from dataclasses import dataclass

@dataclass
class CacheMetadata:
    version: str
    created_at: datetime
    document_hashes: dict[str, str]
    model_name: str
    token_count: int

class VersionedCAGCache:
    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir

    def create_cache(self, documents: dict[str, str], model) -> str:
        """Create versioned cache with metadata."""
        # Hash documents for change detection
        doc_hashes = {
            name: hashlib.sha256(content.encode()).hexdigest()[:16]
            for name, content in documents.items()
        }

        # Version based on content + model
        version_string = f"{sorted(doc_hashes.items())}-{model.name}"
        version = hashlib.sha256(version_string.encode()).hexdigest()[:12]

        # Build cache
        kv_cache = self._build_kv_cache(documents, model)

        # Save with metadata
        metadata = CacheMetadata(
            version=version,
            created_at=datetime.now(),
            document_hashes=doc_hashes,
            model_name=model.name,
            token_count=sum(len(d.split()) for d in documents.values())
        )

        self._save_cache(version, kv_cache, metadata)
        return version

    def needs_rebuild(self, current_docs: dict[str, str]) -> bool:
        """Check if cache needs rebuilding due to document changes."""
        latest = self._load_latest_metadata()
        if not latest:
            return True

        current_hashes = {
            name: hashlib.sha256(content.encode()).hexdigest()[:16]
            for name, content in current_docs.items()
        }

        return current_hashes != latest.document_hashes

Cache Invalidation Patterns

| Pattern | Use Case | Implementation |
| --- | --- | --- |
| Time-based | Predictable updates | Cron job rebuilds cache daily/weekly |
| Content-hash | Change detection | Rebuild when document hashes change |
| Manual trigger | Controlled releases | CI/CD pipeline triggers rebuild |
| Hybrid | Production systems | Hash check + maximum age |
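
A minimal sketch of the hybrid row, combining the content-hash check from VersionedCAGCache above with a maximum cache age; should_rebuild is illustrative, and the 7-day ceiling is just a default:

Python
from datetime import datetime, timedelta

def should_rebuild(metadata: CacheMetadata, current_hashes: dict[str, str],
                   max_age: timedelta = timedelta(days=7)) -> bool:
    """Rebuild when the documents changed OR the cache is older than max_age."""
    if current_hashes != metadata.document_hashes:
        return True
    return datetime.now() - metadata.created_at > max_age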

Monitoring & Observability

Track these metrics to ensure system health:

Python
from prometheus_client import Counter, Histogram, Gauge

# RAG-specific metrics
rag_retrieval_latency = Histogram(
    'rag_retrieval_latency_seconds',
    'Time spent on retrieval',
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)
rag_retrieval_count = Counter(
    'rag_retrieval_total',
    'Total retrieval operations',
    ['status']  # success, failure, timeout
)
rag_relevance_score = Histogram(
    'rag_top_doc_relevance',
    'Relevance score of top retrieved document',
    buckets=[0.1, 0.3, 0.5, 0.7, 0.9]
)

# CAG-specific metrics
cag_cache_load_latency = Histogram(
    'cag_cache_load_latency_seconds',
    'Time to load KV cache'
)
cag_cache_size_bytes = Gauge(
    'cag_cache_size_bytes',
    'Size of loaded KV cache'
)
cag_cache_hit_rate = Gauge(
    'cag_cache_hit_rate',
    'Percentage of queries served from cache'
)

# Shared metrics
generation_latency = Histogram(
    'llm_generation_latency_seconds',
    'Time spent on LLM generation',
    ['method']  # rag, cag, hybrid
)
response_quality_score = Histogram(
    'response_quality_score',
    'Quality score from evaluation model'
)

A/B Testing Framework

Python
import hashlib
import time
from typing import Literal

class RAGvsCAGExperiment:
    """A/B test RAG vs CAG in production."""

    def __init__(self, rag_system, cag_system, cag_traffic_pct: float = 0.1):
        self.rag = rag_system
        self.cag = cag_system
        self.cag_traffic_pct = cag_traffic_pct
        self.metrics = MetricsCollector()  # assumed helper that records latency and success per method

    def query(self, question: str, user_id: str) -> tuple[str, Literal["rag", "cag"]]:
        # Deterministic assignment based on user_id
        use_cag = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100 < self.cag_traffic_pct * 100

        method = "cag" if use_cag else "rag"
        start_time = time.time()

        try:
            if use_cag:
                response = self.cag.query(question)
            else:
                response = self.rag.query(question)

            latency = time.time() - start_time
            self.metrics.record(method, latency, success=True)

            return response, method

        except Exception as e:
            self.metrics.record(method, time.time() - start_time, success=False)
            # Fallback to RAG on CAG failure
            if use_cag:
                return self.rag.query(question), "rag"
            raise

    def get_results(self) -> dict:
        """Get A/B test results."""
        return {
            "rag": {
                "p50_latency": self.metrics.percentile("rag", 50),
                "p95_latency": self.metrics.percentile("rag", 95),
                "success_rate": self.metrics.success_rate("rag"),
                "sample_size": self.metrics.count("rag")
            },
            "cag": {
                "p50_latency": self.metrics.percentile("cag", 50),
                "p95_latency": self.metrics.percentile("cag", 95),
                "success_rate": self.metrics.success_rate("cag"),
                "sample_size": self.metrics.count("cag")
            }
        }

Multi-Tenant Isolation

For SaaS applications serving multiple customers:

Python
import os

class MultiTenantCAG:
    """Isolated CAG caches per tenant."""

    def __init__(self, model, cache_dir: str):
        self.model = model
        self.cache_dir = cache_dir
        self.loaded_tenant: str | None = None
        self.current_cache = None

    def query(self, tenant_id: str, question: str) -> str:
        # Load tenant-specific cache if needed
        if self.loaded_tenant != tenant_id:
            cache_path = f"{self.cache_dir}/{tenant_id}/knowledge.cache"

            if not os.path.exists(cache_path):
                raise TenantCacheNotFound(tenant_id)

            self.current_cache = self._load_cache(cache_path)
            self.loaded_tenant = tenant_id

        return self._generate(question, self.current_cache)

    def build_tenant_cache(self, tenant_id: str, documents: list[str]):
        """Build isolated cache for a tenant."""
        cache_path = f"{self.cache_dir}/{tenant_id}/knowledge.cache"
        os.makedirs(os.path.dirname(cache_path), exist_ok=True)

        # Tenant data never mixes with other tenants
        kv_cache = self._build_cache(documents)
        self._save_cache(cache_path, kv_cache)

Graceful Degradation

Python
import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientKnowledgeSystem:
    """Production system with fallbacks."""

    def __init__(self, cag: CAGSystem, rag: RAGSystem):
        self.cag = cag
        self.rag = rag
        # CircuitBreaker and CacheLoadError are assumed helpers from your resilience layer
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=60
        )

    async def query(self, question: str) -> str:
        # Try CAG first (fast path)
        if self.circuit_breaker.is_closed("cag"):
            try:
                return await asyncio.wait_for(
                    asyncio.to_thread(self.cag.query, question),  # CAGSystem.query is synchronous
                    timeout=5.0
                )
            except (asyncio.TimeoutError, CacheLoadError) as e:
                self.circuit_breaker.record_failure("cag")
                logger.warning(f"CAG failed, falling back to RAG: {e}")

        # Fallback to RAG
        try:
            return await asyncio.to_thread(self.rag.query, question)  # RAGSystem.query is synchronous
        except Exception as e:
            logger.error(f"Both CAG and RAG failed: {e}")
            return "I'm having trouble accessing my knowledge base. Please try again."

The Future: Longer Context Windows

CAG becomes more viable as context windows grow:

| Model | Context Window | CAG Viability |
| --- | --- | --- |
| GPT-4 (2023) | 128K | Moderate |
| Claude 3.5 Sonnet (2024) | 200K | Good |
| Gemini 1.5 Pro (2024) | 2M | Excellent |
| GPT-5.2 (Dec 2025) | 400K | Very Good |
| Claude Sonnet 4.5 (2025) | 200K-1M | Excellent |
| Claude Opus 4.5 (2025) | 200K | Good (Infinite Chat) |
| Gemini 3 Pro (Nov 2025) | 1M | Excellent |
| Llama 4 Scout (2025) | 10M | Most use cases |

With 10M token context windows, CAG can handle knowledge bases that would have required RAG infrastructure just two years ago.

Evaluation & Benchmarking

Before choosing RAG or CAG, benchmark both on your specific use case.

Evaluation Metrics

| Metric | Description | How to Measure |
| --- | --- | --- |
| Accuracy | Correctness of answers | Human evaluation or LLM-as-judge |
| Faithfulness | Grounded in provided context | Check if claims appear in sources |
| Relevance | Answer addresses the question | Semantic similarity scoring |
| Completeness | All aspects answered | Checklist against expected points |
| Latency | End-to-end response time | p50, p95, p99 percentiles |
| Cost | Total cost per query | API costs + infrastructure |

Benchmarking Framework

Python
import json
import time
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    method: str
    accuracy: float
    faithfulness: float
    avg_latency_ms: float
    p95_latency_ms: float
    cost_per_query: float
    total_queries: int

class RAGvsCAGBenchmark:
    """Comprehensive benchmark for comparing RAG and CAG."""

    def __init__(self, rag_system, cag_system, eval_model):
        self.rag = rag_system
        self.cag = cag_system
        self.eval_model = eval_model  # For LLM-as-judge

    def run_benchmark(self, test_cases: list[dict]) -> dict[str, BenchmarkResult]:
        """
        test_cases format:
        [
            {
                "question": "What is the return policy?",
                "expected_answer": "30-day money-back guarantee",
                "source_docs": ["policy.md"],  # For faithfulness check
                "difficulty": "easy"  # easy, medium, hard
            }
        ]
        """
        results = {"rag": [], "cag": []}

        for test in test_cases:
            # Run both systems
            rag_result = self._evaluate_single(self.rag, test, "rag")
            cag_result = self._evaluate_single(self.cag, test, "cag")

            results["rag"].append(rag_result)
            results["cag"].append(cag_result)

        return {
            "rag": self._aggregate_results(results["rag"], "rag"),
            "cag": self._aggregate_results(results["cag"], "cag")
        }

    def _evaluate_single(self, system, test: dict, method: str) -> dict:
        start = time.time()
        response = system.query(test["question"])
        latency = (time.time() - start) * 1000

        # LLM-as-judge evaluation
        eval_prompt = f"""Evaluate this response:

Question: {test["question"]}
Expected: {test["expected_answer"]}
Response: {response}

Rate on a scale of 1-5:
1. Accuracy (correctness):
2. Completeness (covers all aspects):
3. Faithfulness (no hallucinations):

Return JSON: {{"accuracy": X, "completeness": X, "faithfulness": X}}"""

        scores = json.loads(self.eval_model.invoke(eval_prompt))

        return {
            "latency_ms": latency,
            "accuracy": scores["accuracy"] / 5,
            "faithfulness": scores["faithfulness"] / 5,
            "response": response
        }

    def _aggregate_results(self, results: list, method: str) -> BenchmarkResult:
        latencies = [r["latency_ms"] for r in results]
        return BenchmarkResult(
            method=method,
            accuracy=sum(r["accuracy"] for r in results) / len(results),
            faithfulness=sum(r["faithfulness"] for r in results) / len(results),
            avg_latency_ms=sum(latencies) / len(latencies),
            p95_latency_ms=sorted(latencies)[int(len(latencies) * 0.95)],
            cost_per_query=self._calculate_cost(method),
            total_queries=len(results)
        )

Needle-in-Haystack Testing

Test how well each system retrieves specific information:

Python
def needle_in_haystack_test(system, knowledge_base: str, needles: list[dict]):
    """
    Test retrieval of specific facts buried in large context.

    needles format:
    [
        {
            "fact": "The secret code is ALPHA-7742",
            "position": 0.5,  # Middle of context
            "question": "What is the secret code?"
        }
    ]
    """
    results = []

    for needle in needles:
        # Insert needle at specified position
        insert_pos = int(len(knowledge_base) * needle["position"])
        test_context = (
            knowledge_base[:insert_pos] +
            f"\n{needle['fact']}\n" +
            knowledge_base[insert_pos:]
        )

        # Query the system
        response = system.query(needle["question"], context=test_context)

        # Check if needle was found
        found = needle["fact"].lower() in response.lower()

        results.append({
            "position": needle["position"],
            "found": found,
            "response": response
        })

    # Aggregate by position
    position_accuracy = {}
    for r in results:
        pos_bucket = round(r["position"], 1)
        if pos_bucket not in position_accuracy:
            position_accuracy[pos_bucket] = []
        position_accuracy[pos_bucket].append(r["found"])

    return {
        pos: sum(found) / len(found)
        for pos, found in position_accuracy.items()
    }
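
A usage sketch, probing the same fact at several positions; cag_system and knowledge_base are assumed to exist, and system.query is assumed to accept an explicit context, as in the signature above:

Python
needles = [
    {"fact": "The secret code is ALPHA-7742",
     "question": "What is the secret code?",
     "position": p}
    for p in (0.0, 0.25, 0.5, 0.75, 1.0)
]

accuracy_by_position = needle_in_haystack_test(cag_system, knowledge_base, needles)
print(accuracy_by_position)  # e.g. {0.0: 1.0, 0.2: 1.0, 0.5: 0.0, 0.8: 1.0, 1.0: 1.0}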

Decision Checklist

Use this checklist to evaluate which approach fits your use case:

Markdown
## RAG vs CAG Decision Checklist

### Knowledge Base Characteristics
- [ ] Size < 500K tokens? → Favor CAG
- [ ] Size > 1M tokens? → Favor RAG
- [ ] Updates daily or more? → Favor RAG
- [ ] Static or monthly updates? → Favor CAG

### Query Patterns
- [ ] Multi-hop reasoning required? → Favor CAG
- [ ] Queries span many topics? → Favor RAG
- [ ] Queries focused on specific domain? → Favor CAG

### Infrastructure
- [ ] Already have vector DB? → RAG easier
- [ ] Have high-memory GPUs? → CAG viable
- [ ] Serverless deployment? → RAG more flexible

### Requirements
- [ ] Latency < 500ms required? → Favor CAG
- [ ] Real-time data needed? → Favor RAG
- [ ] Multi-tenant isolation needed? → Both work, different tradeoffs

### Score
- More CAG checkmarks: Start with CAG
- More RAG checkmarks: Start with RAG
- Mixed: Consider hybrid approach

Conclusion

RAG and CAG are complementary, not competing approaches.

  • RAG excels for large, dynamic knowledge bases where you can't fit everything in context
  • CAG excels for bounded, static knowledge where latency and simplicity matter
  • Hybrid approaches combine the best of both worlds

The key insight: don't default to RAG just because it's popular. Evaluate whether your knowledge base fits in modern context windows. If it does, CAG offers significant latency and accuracy improvements with simpler infrastructure.

As context windows continue growing, expect CAG to become the default for an increasing number of use cases.

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
