RAG vs CAG: When Cache-Augmented Generation Beats Retrieval
A comprehensive comparison of Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG). Learn when to use each approach, implementation patterns, and how to build hybrid systems.
The Rise of CAG
In December 2024, a paper titled "Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks" (Chan et al.) challenged the dominance of RAG as the default approach for augmenting LLMs with external knowledge. The paper was presented at the ACM Web Conference (WWW) 2025.
The core insight: with modern LLMs supporting 100K-1M+ token context windows, you can often skip retrieval entirely by preloading all relevant knowledge into the context and caching it.
This guide covers when to use RAG vs CAG, how each works under the hood, and how to build hybrid systems that combine both approaches.
Quick Comparison
| Aspect | RAG | CAG |
|---|---|---|
| Retrieval | Real-time search at inference | None (preloaded) |
| Latency | Higher (retrieval + generation) | Lower (generation only) |
| Knowledge size | Unlimited | Limited by context window |
| Freshness | Real-time updates possible | Requires cache rebuild |
| Complexity | Higher (vector DB, embeddings, retrieval) | Lower (just KV cache) |
| Best for | Large, dynamic knowledge bases | Static, bounded knowledge |
How RAG Works
RAG (Retrieval-Augmented Generation) fetches relevant documents at query time:
User Query
↓
[1. Embed Query]
↓
[2. Search Vector DB] → Top-K Documents
↓
[3. Construct Prompt]
Query + Retrieved Docs
↓
[4. LLM Generation]
↓
Response
RAG Implementation
This implementation shows a production-ready RAG pipeline with MMR (Maximum Marginal Relevance) retrieval. MMR balances relevance and diversity—it finds relevant documents but penalizes redundancy, ensuring you get varied information rather than five similar documents saying the same thing.
The key configuration choice: k=5 returns 5 final documents, but fetch_k=20 retrieves 20 candidates first, then applies MMR to select the most diverse 5. This gives better coverage than simple top-k retrieval.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
class RAGSystem:
def __init__(self, index_name: str):
self.llm = ChatOpenAI(model="gpt-5.2")
self.embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
self.vectorstore = PineconeVectorStore.from_existing_index(
index_name=index_name,
embedding=self.embeddings
)
self.retriever = self.vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 5, "fetch_k": 20}
)
def query(self, question: str) -> str:
"""Query with real-time retrieval."""
# 1. Retrieve relevant documents
docs = self.retriever.invoke(question)
# 2. Build context
context = "\n\n".join([doc.page_content for doc in docs])
# 3. Generate response
prompt = f"""Answer the question based on the following context:
Context:
{context}
Question: {question}
Answer:"""
response = self.llm.invoke(prompt)
return response.content
RAG Latency Breakdown
Total latency = Embedding (50ms) + Search (100ms) + Generation (500ms)
= ~650ms per query
The retrieval step adds significant latency and potential failure modes:
- Embedding service latency
- Vector database query time
- Network round-trips
- Retrieval errors (wrong documents returned)
How CAG Works
CAG (Cache-Augmented Generation) preloads all knowledge and caches the model's internal state:
[Offline: Knowledge Preloading]
All Documents → LLM Context → KV Cache (saved to disk)
[Online: Query Time]
User Query + Cached KV → LLM Generation → Response
The KV Cache Explained
When an LLM processes text, it computes Key-Value (KV) pairs for each token in the attention mechanism. These represent the model's "understanding" of the context.
# Simplified attention mechanism
# For each layer, the model computes:
K = input @ W_k # Keys
V = input @ W_v # Values
# These are cached and reused for subsequent tokens
# Instead of recomputing for the entire context each time
CAG insight: If your knowledge base is static, compute the KV cache once and reuse it for every query.
CAG Implementation
CAG's power comes from precomputing the expensive part of LLM inference. When an LLM processes your 100K-token knowledge base, it computes Key-Value matrices at each attention layer. These matrices encode the model's "understanding" of the context.
The insight: these KV values depend only on the preloaded knowledge text, not on the query. So compute them once, save them to disk, and reload them for every query. Each query then only needs to compute its own (tiny) KV values, and attention can reference the cached knowledge KVs.
The tradeoff: caches are large (tens of GB for big models) and model-specific. If you fine-tune the model, the cache is invalidated.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import pickle
class CAGSystem:
def __init__(self, model_name: str = "meta-llama/Llama-3.1-70B-Instruct"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
self.kv_cache = None
def preload_knowledge(self, documents: list[str], cache_path: str):
"""Preload all documents and cache KV states."""
# Combine all documents into context
knowledge_text = "\n\n---\n\n".join(documents)
# Create the preloaded prompt
preload_prompt = f"""You are a helpful assistant with access to the following knowledge base:
{knowledge_text}
Use this knowledge to answer questions accurately. If the answer is not in the knowledge base, say so.
"""
# Tokenize
inputs = self.tokenizer(
preload_prompt,
return_tensors="pt",
truncation=True,
max_length=self.model.config.max_position_embeddings - 1000 # Leave room for query
).to(self.model.device)
# Forward pass to generate KV cache
with torch.no_grad():
outputs = self.model(
**inputs,
use_cache=True,
return_dict=True
)
self.kv_cache = outputs.past_key_values
        # Save cache to disk. Note: recent transformers versions return a Cache object
        # rather than plain tuples; if pickling fails, convert with to_legacy_cache()
        # or use torch.save instead.
        with open(cache_path, 'wb') as f:
            pickle.dump(self.kv_cache, f)
print(f"Cached {len(documents)} documents ({inputs['input_ids'].shape[1]} tokens)")
def load_cache(self, cache_path: str):
"""Load precomputed KV cache from disk."""
with open(cache_path, 'rb') as f:
self.kv_cache = pickle.load(f)
def query(self, question: str) -> str:
"""Query using cached knowledge (no retrieval)."""
# Format the query
query_prompt = f"\n\nQuestion: {question}\n\nAnswer:"
inputs = self.tokenizer(
query_prompt,
return_tensors="pt"
).to(self.model.device)
        # Generate using the cached KV states.
        # Caveat: generate() appends the new tokens' keys/values to this cache in place,
        # so for repeated queries you should deep-copy the cache first (or truncate it
        # back to the knowledge length afterwards) and pass an attention mask that
        # covers both the cached tokens and the new query tokens.
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                past_key_values=self.kv_cache,
                max_new_tokens=500,
                do_sample=True,
                temperature=0.7
            )
response = self.tokenizer.decode(
outputs[0][inputs['input_ids'].shape[1]:],
skip_special_tokens=True
)
return response
CAG Latency Breakdown
Total latency = Generation only (500ms)
= ~500ms per query
Speedup vs RAG: ~23% faster (no retrieval overhead)
On the HotPotQA benchmark, CAG reduced generation time from 94.35s (RAG) to 2.33s — a ~40x improvement.
Memory & Compute Requirements
Understanding the resource implications is critical for choosing between RAG and CAG.
The fundamental resource tradeoff: RAG trades runtime compute (retrieval + embedding per query) for reduced memory (only retrieved docs in context). CAG trades large upfront memory (full knowledge in KV cache) for minimal per-query compute (just decode the answer). Your infrastructure constraints often make this decision for you.
Why this matters in practice: If you have a single A100 (80GB), CAG with a 70B model at 100K tokens is impossible—the model weights plus the KV cache far exceed available memory. But if you have multiple GPUs with NVLink, CAG becomes viable. Conversely, RAG works on modest hardware but requires additional infrastructure (vector database, embedding service).
KV Cache Memory Formula
The KV cache size scales with model architecture and context length:
KV Cache Size = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_param
Example for Llama 3.1 70B (80 layers, 8 KV heads, 128 head_dim, bfloat16):
= 2 × 80 × 8 × 128 × seq_len × 2 bytes
= 327,680 × seq_len bytes
= ~320 KB per token (~312 MB per 1K tokens)
= ~31.2 GB for 100K tokens
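To size a deployment before committing, the same arithmetic can be scripted. A minimal sketch; the layer and head counts are the published Llama 3.1 70B figures, and weight memory is approximated as parameters × 2 bytes for bfloat16:
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_param: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head dim x tokens x bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

# Llama 3.1 70B: 80 layers, 8 KV heads, head_dim 128, bfloat16, 100K-token knowledge base
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=100_000)
weights = 70e9 * 2  # ~140 GB of bf16 weights
print(f"KV cache: {cache / 1e9:.1f} GB, weights: {weights / 1e9:.0f} GB, "
      f"total: {(cache + weights) / 1e9:.0f} GB")  # ~33 GB + 140 GB, roughly 173 GB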
Memory Requirements by Model
| Model | Context | KV Cache (100K tokens) | Total VRAM Needed |
|---|---|---|---|
| Llama 3.1 8B | 128K | ~13 GB | ~30 GB |
| Llama 3.1 70B | 128K | ~31 GB | ~170 GB |
| Llama 4 Scout | 10M | ~250 GB (1M tokens) | 500+ GB |
| Qwen 2.5 72B | 128K | ~35 GB | ~180 GB |
Prefill vs Decode Costs
CAG shifts compute from per-query to one-time prefill:
RAG (per query):
- Embed query: ~50ms
- Vector search: ~100ms
- Prefill retrieved docs (~5K tokens): ~200ms
- Decode response: ~300ms
Total: ~650ms
CAG (per query):
- Reuse resident KV cache: ~0ms (the multi-GB cache is loaded into GPU memory once at startup; pulling ~30 GB from SSD or RAM takes seconds, not milliseconds)
- Decode response: ~300ms
Total: ~300ms
CAG prefill (one-time):
- Process 100K tokens: ~30-60 seconds
- Amortized over 10K queries: ~3-6ms per query
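A quick sanity check of that amortization, plugging in the illustrative numbers above (these are not benchmark results):
# Back-of-the-envelope amortization of CAG's one-time prefill
prefill_s = 45.0    # one-time: prefill 100K tokens (midpoint of the 30-60s range above)
decode_ms = 300.0   # per-query decode
queries = 10_000

amortized_prefill_ms = prefill_s * 1000 / queries   # ~4.5 ms per query
cag_ms = decode_ms + amortized_prefill_ms           # ~305 ms per query
rag_ms = 50 + 100 + 200 + 300                       # embed + search + doc prefill + decode

print(f"CAG ~{cag_ms:.0f} ms/query vs RAG ~{rag_ms} ms/query")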
Batch Inference Considerations
CAG with shared caches enables efficient batching:
class BatchCAGSystem:
"""Efficient batch inference with shared KV cache."""
    def __init__(self, model, tokenizer, shared_cache):
        self.model = model
        self.tokenizer = tokenizer
        self.shared_cache = shared_cache  # Same knowledge for all queries
def batch_query(self, questions: list[str]) -> list[str]:
# All queries share the same preloaded context
# Only the query tokens differ per batch item
batch_inputs = self.tokenizer(
[f"\n\nQuestion: {q}\n\nAnswer:" for q in questions],
padding=True,
return_tensors="pt"
)
        # Replicate KV cache along the batch dimension (helper not shown:
        # it repeats each cached key/value tensor once per query in the batch)
batch_cache = self._expand_cache_for_batch(
self.shared_cache,
batch_size=len(questions)
)
outputs = self.model.generate(
**batch_inputs,
past_key_values=batch_cache,
max_new_tokens=200
)
return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
Cost Comparison: Infrastructure
| Component | RAG | CAG |
|---|---|---|
| Vector DB | $200-2000/mo (managed) | Not needed |
| Embedding API | $0.0001/1K tokens | Not needed |
| GPU Memory | 24-48 GB sufficient | 80-180 GB+ needed |
| Storage | Vector index (~10GB) | KV cache files (~30GB) |
| Complexity | High (multiple services) | Low (single model) |
When to Use Each Approach
Use RAG When:
| Scenario | Why RAG |
|---|---|
| Large knowledge base | Exceeds context window (>1M tokens) |
| Frequently updated data | News, inventory, real-time analytics |
| Diverse query patterns | Different queries need different docs |
| Multi-tenant systems | Different users need different knowledge |
| Cost constraints | Smaller context = fewer tokens = lower cost |
Example RAG use cases:
- Enterprise search across millions of documents
- Customer support with ticket history
- Real-time news Q&A
- E-commerce product recommendations
Use CAG When:
| Scenario | Why CAG |
|---|---|
| Bounded knowledge | Fits in context window |
| Static content | Technical manuals, policies, FAQs |
| Low latency required | Real-time applications |
| Simplicity valued | No vector DB infrastructure |
| High query volume | Amortize preloading cost |
Example CAG use cases:
- Product documentation chatbot
- Company policy Q&A
- Legal contract analysis
- Medical reference lookup (fixed guidelines)
Decision Framework
Is your knowledge base > 500K tokens?
├── YES → Use RAG
└── NO → Continue...
Does your knowledge change frequently?
├── YES (daily/hourly) → Use RAG
└── NO (weekly/monthly) → Continue...
Is latency critical (<200ms)?
├── YES → Use CAG
└── NO → Either works, consider complexity
Do you already have vector DB infrastructure?
├── YES → RAG may be easier
└── NO → CAG avoids new infrastructure
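The same tree can be expressed as a small routing helper if you want the decision to live in code. A sketch; the thresholds mirror the tree above and are starting points, not hard rules:
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    knowledge_tokens: int          # total size of the knowledge base
    update_interval_hours: float   # how often the knowledge changes
    latency_budget_ms: float       # end-to-end latency requirement
    has_vector_db: bool            # existing retrieval infrastructure

def recommend_approach(w: WorkloadProfile) -> str:
    """Mirror of the decision tree above; thresholds are illustrative."""
    if w.knowledge_tokens > 500_000:
        return "rag"
    if w.update_interval_hours < 24:   # daily or hourly updates
        return "rag"
    if w.latency_budget_ms < 200:      # latency-critical
        return "cag"
    return "rag" if w.has_vector_db else "cag"

print(recommend_approach(WorkloadProfile(200_000, 7 * 24, 150, False)))  # -> cag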
Hybrid Approaches
The best systems often combine both approaches. The key insight: CAG and RAG have complementary strengths. CAG excels at fast, consistent answers for common questions. RAG excels at handling the long tail of rare queries and fresh information.
A well-designed hybrid uses CAG for the 80% of queries that hit core knowledge (fast path), and falls back to RAG for the 20% that need retrieval (comprehensive path). This gives you both speed and coverage.
Pattern 1: CAG Core + RAG Edge Cases
This pattern preloads your most-asked knowledge into CAG, using RAG only when CAG signals uncertainty. The needs_retrieval function is the router—it detects "I don't know" responses or low-confidence answers and triggers RAG fallback.
class HybridSystem:
"""Preload common knowledge, retrieve for edge cases."""
def __init__(self):
self.cag = CAGSystem()
self.rag = RAGSystem(index_name="edge-cases")
self.cag.load_cache("core_knowledge.cache")
def query(self, question: str) -> str:
# Try CAG first (fast path)
cag_response = self.cag.query(question)
# Check confidence / detect "I don't know"
if self.needs_retrieval(cag_response):
# Fall back to RAG for edge cases
return self.rag.query(question)
return cag_response
def needs_retrieval(self, response: str) -> bool:
"""Detect when CAG doesn't have the answer."""
        uncertainty_phrases = [
            "i don't have information",
            "not in my knowledge",
            "i'm not sure",
            "cannot find"
        ]
        # Compare in lowercase so capitalization in the response doesn't matter
        return any(phrase in response.lower() for phrase in uncertainty_phrases)
Pattern 2: Tiered Caching
class TieredCAGSystem:
"""Multiple cache tiers for different knowledge domains."""
def __init__(self):
self.caches = {
"products": "product_docs.cache",
"policies": "company_policies.cache",
"technical": "tech_manual.cache"
}
self.active_cache = None
        self.classifier = self.load_query_classifier()  # e.g., a small intent classifier (helper not shown)
def query(self, question: str) -> str:
# Classify query to select appropriate cache
domain = self.classifier.predict(question)
# Load domain-specific cache
if self.active_cache != domain:
self.load_cache(self.caches[domain])
self.active_cache = domain
return self.generate(question)
Pattern 3: CAG + Real-Time RAG Augmentation
class AugmentedCAGSystem:
"""CAG for static knowledge, RAG for real-time data."""
    def __init__(self):
        self.cag = CAGSystem()  # Product catalog, policies
        self.rag = RAGSystem(index_name="realtime")  # Inventory, prices, promotions (index name is a placeholder)
        self.llm = ChatOpenAI(model="gpt-5.2")  # Generator that combines both contexts (mirrors the RAG example)
def query(self, question: str) -> str:
# Get static context from CAG
static_context = self.cag.get_cached_context()
# Get dynamic context from RAG
dynamic_docs = self.rag.retrieve(question)
dynamic_context = "\n".join([d.page_content for d in dynamic_docs])
# Combine for generation
prompt = f"""Static Knowledge:
{static_context}
Current Information:
{dynamic_context}
Question: {question}
Answer:"""
return self.llm.invoke(prompt)
Limitations & Challenges
Both approaches have significant limitations that impact real-world deployments.
CAG Limitations
1. Lost in the Middle Problem
LLMs struggle to recall information from the middle of long contexts. The "Lost in the Middle" study (Liu et al., 2023) showed that recall accuracy drops significantly for information positioned in the middle ~50% of the context:
Position in Context → Recall Accuracy
Beginning (0-10%): ~95%
Middle (40-60%): ~60-70%
End (90-100%): ~90%
Mitigation strategies:
- Place critical information at the beginning and end
- Use structured formatting with clear section headers
- Implement attention sinks (repetition of key facts)
- Consider document ordering by importance (see the sketch after this list)
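One way to act on the ordering advice: interleave documents so the most important material lands at the start and end of the context, pushing the least critical content toward the middle. A minimal sketch; the importance scores are assumed to come from your own ranking signal:
def order_for_long_context(docs: list[tuple[str, float]]) -> list[str]:
    """Place high-importance docs at the beginning and end of the context.

    docs: (text, importance) pairs; higher importance = more critical.
    """
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        # Alternate: most important first, second-most last, and so on,
        # so the least important material ends up in the middle.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]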
2. Attention Dilution
As context grows, attention becomes diluted across more tokens:
# Simplified attention visualization
def attention_per_token(context_length: int, query_tokens: int = 100):
"""Average attention each context token receives."""
# Softmax distributes attention across all tokens
# More tokens = less attention per token
return query_tokens / context_length
# 10K context: 1% attention per token
# 100K context: 0.1% attention per token
# 1M context: 0.01% attention per token
This manifests as:
- Subtle details getting ignored
- Conflicting information not being reconciled
- Reduced reasoning over distant context
3. Cache Staleness
CAG caches become stale when knowledge changes:
| Update Frequency | CAG Viability | Recommendation |
|---|---|---|
| Real-time | Poor | Use RAG |
| Hourly | Poor | Use RAG or hybrid |
| Daily | Moderate | Scheduled cache rebuilds |
| Weekly | Good | Batch updates |
| Static | Excellent | Pure CAG |
RAG Limitations
1. Retrieval Failures
RAG can fail silently when retrieval misses relevant documents:
| Failure Mode | Cause | Impact |
|---|---|---|
| Semantic gap | Query/doc embedding mismatch | Wrong docs retrieved |
| Chunking artifacts | Answer split across chunks | Partial information |
| Sparse coverage | Rare topics under-represented | No relevant docs |
| Adversarial queries | Intentionally confusing queries | Hallucinated answers |
# Example: Retrieval failure detection
def detect_retrieval_failure(query: str, docs: list, response: str) -> bool:
"""Heuristics for detecting poor retrieval."""
signals = []
# Low relevance scores
if all(doc.score < 0.5 for doc in docs):
signals.append("low_relevance")
# Response contradicts retrieved docs
if contains_contradiction(response, docs):
signals.append("contradiction")
# Response uses knowledge not in docs (hallucination)
if contains_external_knowledge(response, docs):
signals.append("potential_hallucination")
return len(signals) > 0
2. Multi-Hop Reasoning Failures
RAG struggles with questions requiring information synthesis:
Question: "What is the revenue of the company that acquired the
startup founded by the person who invented X?"
Required reasoning:
1. Who invented X? → Person A
2. What startup did Person A found? → Startup B
3. Who acquired Startup B? → Company C
4. What is Company C's revenue? → $Y
RAG typically retrieves docs for step 1, missing steps 2-4.
3. Latency Variance
RAG latency is unpredictable due to retrieval variability:
Latency distribution (p50/p95/p99):
- Embedding: 30ms / 80ms / 200ms
- Vector search: 50ms / 150ms / 500ms
- Network: 20ms / 100ms / 300ms
- Total retrieval: 100ms / 330ms / 1000ms
CAG has consistent latency (no retrieval variance)
Comparison: Failure Modes
| Failure Mode | RAG | CAG |
|---|---|---|
| Missing information | Retrieval miss | Context too long |
| Stale information | Index lag | Cache staleness |
| Conflicting sources | Chunk disagreement | In-context conflicts |
| Hallucination | Confabulation from bad retrieval | Less common (full context) |
| Latency spikes | Network/DB issues | Rare (local inference) |
Performance Comparison
From the research paper and real-world benchmarks:
Accuracy (HotPotQA)
| Method | Exact Match | F1 Score |
|---|---|---|
| RAG (Top-5) | 45.2% | 58.1% |
| CAG (Full Context) | 51.3% | 64.7% |
| CAG (Optimized) | 52.1% | 65.2% |
Why CAG wins on accuracy: The model sees ALL relevant context, not just retrieved chunks. This helps with multi-hop reasoning where the answer requires connecting information from multiple sources.
Latency
| Method | Latency (HotPotQA Large) |
|---|---|
| RAG | 94.35 seconds |
| CAG | 2.33 seconds |
Cost Analysis
For 10,000 queries/day on a 100K token knowledge base:
| Approach | Daily Cost | Notes |
|---|---|---|
| RAG | ~$150 | Retrieval + 5K context per query |
| CAG | ~$200 | Full 100K context per query |
| CAG (cached) | ~$50 | KV cache reduces compute |
CAG with KV caching can be significantly cheaper because you only compute the knowledge encoding once.
Implementation Best Practices
For RAG
- Chunk strategically: Use semantic chunking, not fixed-size
- Hybrid search: Combine vector + keyword (BM25)
Reranking: Use a cross-encoder to rerank top results (see the sketch after this list)
- Query expansion: Rewrite queries for better retrieval
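For the reranking item, a minimal sketch using a sentence-transformers cross-encoder; the model name is one common public checkpoint, not a requirement:
from sentence_transformers import CrossEncoder

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    """Re-score retrieved docs with a cross-encoder and keep the best top_n."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]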
For CAG
- Optimize context order: Put most important info first (primacy effect)
- Use structured formatting: Headers, sections, clear delineation
- Compress where possible: Summarize verbose documents
- Version your caches: Track which documents are in each cache
For Hybrid
- Monitor cache hits: Track when CAG succeeds vs needs RAG
- A/B test: Compare latency and accuracy in production
- Warm caches: Pre-compute during off-peak hours
- Graceful degradation: If cache fails, fall back to RAG
Production Considerations
Deploying RAG or CAG at scale requires careful attention to operational concerns.
Cache Management for CAG
Versioning Strategy
import hashlib
from datetime import datetime
from dataclasses import dataclass
@dataclass
class CacheMetadata:
version: str
created_at: datetime
document_hashes: dict[str, str]
model_name: str
token_count: int
class VersionedCAGCache:
def __init__(self, cache_dir: str):
self.cache_dir = cache_dir
def create_cache(self, documents: dict[str, str], model) -> str:
"""Create versioned cache with metadata."""
# Hash documents for change detection
doc_hashes = {
name: hashlib.sha256(content.encode()).hexdigest()[:16]
for name, content in documents.items()
}
# Version based on content + model
version_string = f"{sorted(doc_hashes.items())}-{model.name}"
version = hashlib.sha256(version_string.encode()).hexdigest()[:12]
# Build cache
kv_cache = self._build_kv_cache(documents, model)
# Save with metadata
metadata = CacheMetadata(
version=version,
created_at=datetime.now(),
document_hashes=doc_hashes,
model_name=model.name,
token_count=sum(len(d.split()) for d in documents.values())
)
self._save_cache(version, kv_cache, metadata)
return version
def needs_rebuild(self, current_docs: dict[str, str]) -> bool:
"""Check if cache needs rebuilding due to document changes."""
latest = self._load_latest_metadata()
if not latest:
return True
current_hashes = {
name: hashlib.sha256(content.encode()).hexdigest()[:16]
for name, content in current_docs.items()
}
return current_hashes != latest.document_hashes
Cache Invalidation Patterns
| Pattern | Use Case | Implementation |
|---|---|---|
| Time-based | Predictable updates | Cron job rebuilds cache daily/weekly |
| Content-hash | Change detection | Rebuild when document hashes change |
| Manual trigger | Controlled releases | CI/CD pipeline triggers rebuild |
| Hybrid | Production systems | Hash check + maximum age (see sketch below) |
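The hybrid row combines the two checks. A sketch built on the VersionedCAGCache above; max_age_hours is an assumed parameter, and _load_latest_metadata is the same helper the earlier example relies on:
from datetime import datetime, timedelta

def should_rebuild(cache: VersionedCAGCache, current_docs: dict[str, str],
                   max_age_hours: int = 7 * 24) -> bool:
    """Hybrid invalidation: rebuild when documents changed OR the cache exceeds its maximum age."""
    metadata = cache._load_latest_metadata()
    if metadata is None:
        return True
    too_old = datetime.now() - metadata.created_at > timedelta(hours=max_age_hours)
    return too_old or cache.needs_rebuild(current_docs)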
Monitoring & Observability
Track these metrics to ensure system health:
from prometheus_client import Counter, Histogram, Gauge
# RAG-specific metrics
rag_retrieval_latency = Histogram(
'rag_retrieval_latency_seconds',
'Time spent on retrieval',
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)
rag_retrieval_count = Counter(
'rag_retrieval_total',
'Total retrieval operations',
['status'] # success, failure, timeout
)
rag_relevance_score = Histogram(
'rag_top_doc_relevance',
'Relevance score of top retrieved document',
buckets=[0.1, 0.3, 0.5, 0.7, 0.9]
)
# CAG-specific metrics
cag_cache_load_latency = Histogram(
'cag_cache_load_latency_seconds',
'Time to load KV cache'
)
cag_cache_size_bytes = Gauge(
'cag_cache_size_bytes',
'Size of loaded KV cache'
)
cag_cache_hit_rate = Gauge(
'cag_cache_hit_rate',
'Percentage of queries served from cache'
)
# Shared metrics
generation_latency = Histogram(
'llm_generation_latency_seconds',
'Time spent on LLM generation',
['method'] # rag, cag, hybrid
)
response_quality_score = Histogram(
'response_quality_score',
'Quality score from evaluation model'
)
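A sketch of wiring these metrics into the RAG query path from earlier; rag_system is assumed to expose the retriever and llm attributes of the RAGSystem class above:
import time

def instrumented_rag_query(rag_system, question: str) -> str:
    """Record retrieval and generation metrics around a single RAG query."""
    start = time.time()
    try:
        docs = rag_system.retriever.invoke(question)
        rag_retrieval_count.labels(status="success").inc()
    except Exception:
        rag_retrieval_count.labels(status="failure").inc()
        raise
    finally:
        rag_retrieval_latency.observe(time.time() - start)

    gen_start = time.time()
    context = "\n\n".join(doc.page_content for doc in docs)
    response = rag_system.llm.invoke(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    generation_latency.labels(method="rag").observe(time.time() - gen_start)
    return response.content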
A/B Testing Framework
import hashlib
import time
from typing import Literal
class RAGvsCAGExperiment:
"""A/B test RAG vs CAG in production."""
def __init__(self, rag_system, cag_system, cag_traffic_pct: float = 0.1):
self.rag = rag_system
self.cag = cag_system
self.cag_traffic_pct = cag_traffic_pct
        self.metrics = MetricsCollector()  # your own latency/success tracker (not shown here)
def query(self, question: str, user_id: str) -> tuple[str, Literal["rag", "cag"]]:
# Deterministic assignment based on user_id
use_cag = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100 < self.cag_traffic_pct * 100
method = "cag" if use_cag else "rag"
start_time = time.time()
try:
if use_cag:
response = self.cag.query(question)
else:
response = self.rag.query(question)
latency = time.time() - start_time
self.metrics.record(method, latency, success=True)
return response, method
except Exception as e:
self.metrics.record(method, time.time() - start_time, success=False)
# Fallback to RAG on CAG failure
if use_cag:
return self.rag.query(question), "rag"
raise
def get_results(self) -> dict:
"""Get A/B test results."""
return {
"rag": {
"p50_latency": self.metrics.percentile("rag", 50),
"p95_latency": self.metrics.percentile("rag", 95),
"success_rate": self.metrics.success_rate("rag"),
"sample_size": self.metrics.count("rag")
},
"cag": {
"p50_latency": self.metrics.percentile("cag", 50),
"p95_latency": self.metrics.percentile("cag", 95),
"success_rate": self.metrics.success_rate("cag"),
"sample_size": self.metrics.count("cag")
}
}
Multi-Tenant Isolation
For SaaS applications serving multiple customers:
import os

class MultiTenantCAG:
    """Isolated CAG caches per tenant."""
    def __init__(self, model, cache_dir: str):
self.model = model
self.cache_dir = cache_dir
self.loaded_tenant: str | None = None
self.current_cache = None
def query(self, tenant_id: str, question: str) -> str:
# Load tenant-specific cache if needed
if self.loaded_tenant != tenant_id:
cache_path = f"{self.cache_dir}/{tenant_id}/knowledge.cache"
if not os.path.exists(cache_path):
                raise TenantCacheNotFound(tenant_id)  # your own "cache missing" exception
self.current_cache = self._load_cache(cache_path)
self.loaded_tenant = tenant_id
return self._generate(question, self.current_cache)
def build_tenant_cache(self, tenant_id: str, documents: list[str]):
"""Build isolated cache for a tenant."""
cache_path = f"{self.cache_dir}/{tenant_id}/knowledge.cache"
os.makedirs(os.path.dirname(cache_path), exist_ok=True)
# Tenant data never mixes with other tenants
kv_cache = self._build_cache(documents)
self._save_cache(cache_path, kv_cache)
Graceful Degradation
import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientKnowledgeSystem:
    """Production system with fallbacks."""
    def __init__(self, cag: CAGSystem, rag: RAGSystem):
        self.cag = cag
        self.rag = rag
        self.circuit_breaker = CircuitBreaker(  # sketched after this example
            failure_threshold=5,
            recovery_timeout=60
        )
async def query(self, question: str) -> str:
# Try CAG first (fast path)
if self.circuit_breaker.is_closed("cag"):
try:
                return await asyncio.wait_for(
                    asyncio.to_thread(self.cag.query, question),  # run the sync query off the event loop
                    timeout=5.0
                )
            except (asyncio.TimeoutError, CacheLoadError) as e:  # CacheLoadError: your own cache exception
self.circuit_breaker.record_failure("cag")
logger.warning(f"CAG failed, falling back to RAG: {e}")
# Fallback to RAG
try:
            return await asyncio.to_thread(self.rag.query, question)
except Exception as e:
logger.error(f"Both CAG and RAG failed: {e}")
return "I'm having trouble accessing my knowledge base. Please try again."
The Future: Longer Context Windows
CAG becomes more viable as context windows grow:
| Model | Context Window | CAG Viability |
|---|---|---|
| GPT-4 (2023) | 128K | Moderate |
| Claude 3.5 Sonnet (2024) | 200K | Good |
| Gemini 1.5 Pro (2024) | 2M | Excellent |
| GPT-5.2 (Dec 2025) | 400K | Very Good |
| Claude Sonnet 4.5 (2025) | 200K-1M | Excellent |
| Claude Opus 4.5 (2025) | 200K | Good (Infinite Chat) |
| Gemini 3 Pro (Nov 2025) | 1M | Excellent |
| Llama 4 Scout (2025) | 10M | Most use cases |
With 10M token context windows, CAG can handle knowledge bases that would have required RAG infrastructure just two years ago.
Evaluation & Benchmarking
Before choosing RAG or CAG, benchmark both on your specific use case.
Evaluation Metrics
| Metric | Description | How to Measure |
|---|---|---|
| Accuracy | Correctness of answers | Human evaluation or LLM-as-judge |
| Faithfulness | Grounded in provided context | Check if claims appear in sources |
| Relevance | Answer addresses the question | Semantic similarity scoring |
| Completeness | All aspects answered | Checklist against expected points |
| Latency | End-to-end response time | p50, p95, p99 percentiles |
| Cost | Total cost per query | API costs + infrastructure |
Benchmarking Framework
import json
import time
from dataclasses import dataclass
@dataclass
class BenchmarkResult:
method: str
accuracy: float
faithfulness: float
avg_latency_ms: float
p95_latency_ms: float
cost_per_query: float
total_queries: int
class RAGvsCAGBenchmark:
"""Comprehensive benchmark for comparing RAG and CAG."""
def __init__(self, rag_system, cag_system, eval_model):
self.rag = rag_system
self.cag = cag_system
self.eval_model = eval_model # For LLM-as-judge
def run_benchmark(self, test_cases: list[dict]) -> dict[str, BenchmarkResult]:
"""
test_cases format:
[
{
"question": "What is the return policy?",
"expected_answer": "30-day money-back guarantee",
"source_docs": ["policy.md"], # For faithfulness check
"difficulty": "easy" # easy, medium, hard
}
]
"""
results = {"rag": [], "cag": []}
for test in test_cases:
# Run both systems
rag_result = self._evaluate_single(self.rag, test, "rag")
cag_result = self._evaluate_single(self.cag, test, "cag")
results["rag"].append(rag_result)
results["cag"].append(cag_result)
return {
"rag": self._aggregate_results(results["rag"], "rag"),
"cag": self._aggregate_results(results["cag"], "cag")
}
def _evaluate_single(self, system, test: dict, method: str) -> dict:
start = time.time()
response = system.query(test["question"])
latency = (time.time() - start) * 1000
# LLM-as-judge evaluation
eval_prompt = f"""Evaluate this response:
Question: {test["question"]}
Expected: {test["expected_answer"]}
Response: {response}
Rate on a scale of 1-5:
1. Accuracy (correctness):
2. Completeness (covers all aspects):
3. Faithfulness (no hallucinations):
Return JSON: {{"accuracy": X, "completeness": X, "faithfulness": X}}"""
        scores = json.loads(self.eval_model.invoke(eval_prompt))  # assumes the judge returns a raw JSON string
return {
"latency_ms": latency,
"accuracy": scores["accuracy"] / 5,
"faithfulness": scores["faithfulness"] / 5,
"response": response
}
def _aggregate_results(self, results: list, method: str) -> BenchmarkResult:
latencies = [r["latency_ms"] for r in results]
return BenchmarkResult(
method=method,
accuracy=sum(r["accuracy"] for r in results) / len(results),
faithfulness=sum(r["faithfulness"] for r in results) / len(results),
avg_latency_ms=sum(latencies) / len(latencies),
p95_latency_ms=sorted(latencies)[int(len(latencies) * 0.95)],
cost_per_query=self._calculate_cost(method),
total_queries=len(results)
)
Needle-in-Haystack Testing
Test how well each system retrieves specific information:
def needle_in_haystack_test(system, knowledge_base: str, needles: list[dict]):
"""
Test retrieval of specific facts buried in large context.
needles format:
[
{
"fact": "The secret code is ALPHA-7742",
"position": 0.5, # Middle of context
"question": "What is the secret code?"
}
]
"""
results = []
for needle in needles:
# Insert needle at specified position
insert_pos = int(len(knowledge_base) * needle["position"])
test_context = (
knowledge_base[:insert_pos] +
f"\n{needle['fact']}\n" +
knowledge_base[insert_pos:]
)
# Query the system
response = system.query(needle["question"], context=test_context)
# Check if needle was found
found = needle["fact"].lower() in response.lower()
results.append({
"position": needle["position"],
"found": found,
"response": response
})
# Aggregate by position
position_accuracy = {}
for r in results:
pos_bucket = round(r["position"], 1)
if pos_bucket not in position_accuracy:
position_accuracy[pos_bucket] = []
position_accuracy[pos_bucket].append(r["found"])
return {
pos: sum(found) / len(found)
for pos, found in position_accuracy.items()
}
Decision Checklist
Use this checklist to evaluate which approach fits your use case:
## RAG vs CAG Decision Checklist
### Knowledge Base Characteristics
- [ ] Size < 500K tokens? → Favor CAG
- [ ] Size > 1M tokens? → Favor RAG
- [ ] Updates daily or more? → Favor RAG
- [ ] Static or monthly updates? → Favor CAG
### Query Patterns
- [ ] Multi-hop reasoning required? → Favor CAG
- [ ] Queries span many topics? → Favor RAG
- [ ] Queries focused on specific domain? → Favor CAG
### Infrastructure
- [ ] Already have vector DB? → RAG easier
- [ ] Have high-memory GPUs? → CAG viable
- [ ] Serverless deployment? → RAG more flexible
### Requirements
- [ ] Latency < 500ms required? → Favor CAG
- [ ] Real-time data needed? → Favor RAG
- [ ] Multi-tenant isolation needed? → Both work, different tradeoffs
### Score
- More CAG checkmarks: Start with CAG
- More RAG checkmarks: Start with RAG
- Mixed: Consider hybrid approach
Conclusion
RAG and CAG are complementary, not competing approaches.
- RAG excels for large, dynamic knowledge bases where you can't fit everything in context
- CAG excels for bounded, static knowledge where latency and simplicity matter
- Hybrid approaches combine the best of both worlds
The key insight: don't default to RAG just because it's popular. Evaluate whether your knowledge base fits in modern context windows. If it does, CAG offers significant latency and accuracy improvements with simpler infrastructure.
As context windows continue growing, expect CAG to become the default for an increasing number of use cases.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
Mastering LLM Context Windows: Strategies for Long-Context Applications
Practical techniques for managing context windows in production LLM applications—from compression to hierarchical processing to infinite context architectures.