RAG vs CAG: When Cache-Augmented Generation Beats Retrieval
A comprehensive comparison of Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG). Learn when to use each approach, implementation patterns, and how to build hybrid systems.
The Rise of CAG
In December 2024, a paper titled "Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks" (Chan et al.) challenged the dominance of RAG as the default approach for augmenting LLMs with external knowledge. The paper was presented at the ACM Web Conference (WWW) 2025.
The core insight: with modern LLMs supporting 100K-1M+ token context windows, you can often skip retrieval entirely by preloading all relevant knowledge into the context and caching it.
This guide covers when to use RAG vs CAG, how each works under the hood, and how to build hybrid systems that combine both approaches.
Quick Comparison
| Aspect | RAG | CAG |
|---|---|---|
| Retrieval | Real-time search at inference | None (preloaded) |
| Latency | Higher (retrieval + generation) | Lower (generation only) |
| Knowledge size | Unlimited | Limited by context window |
| Freshness | Real-time updates possible | Requires cache rebuild |
| Complexity | Higher (vector DB, embeddings, retrieval) | Lower (just KV cache) |
| Best for | Large, dynamic knowledge bases | Static, bounded knowledge |
How RAG Works
RAG (Retrieval-Augmented Generation) fetches relevant documents at query time:
User Query
↓
[1. Embed Query]
↓
[2. Search Vector DB] → Top-K Documents
↓
[3. Construct Prompt]
Query + Retrieved Docs
↓
[4. LLM Generation]
↓
Response
RAG Implementation
This implementation shows a production-ready RAG pipeline with MMR (Maximum Marginal Relevance) retrieval. MMR balances relevance and diversity—it finds relevant documents but penalizes redundancy, ensuring you get varied information rather than five similar documents saying the same thing.
The key configuration choice: k=5 returns 5 final documents, but fetch_k=20 retrieves 20 candidates first, then applies MMR to select the most diverse 5. This gives better coverage than simple top-k retrieval.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
class RAGSystem:
def __init__(self, index_name: str):
self.llm = ChatOpenAI(model="gpt-5.2")
self.embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
self.vectorstore = PineconeVectorStore.from_existing_index(
index_name=index_name,
embedding=self.embeddings
)
self.retriever = self.vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 5, "fetch_k": 20}
)
def query(self, question: str) -> str:
"""Query with real-time retrieval."""
# 1. Retrieve relevant documents
docs = self.retriever.invoke(question)
# 2. Build context
context = "\n\n".join([doc.page_content for doc in docs])
# 3. Generate response
prompt = f"""Answer the question based on the following context:
Context:
{context}
Question: {question}
Answer:"""
response = self.llm.invoke(prompt)
return response.content
RAG Latency Breakdown
Total latency = Embedding (50ms) + Search (100ms) + Generation (500ms)
= ~650ms per query
The retrieval step adds significant latency and potential failure modes:
- Embedding service latency
- Vector database query time
- Network round-trips
- Retrieval errors (wrong documents returned)
How CAG Works
CAG (Cache-Augmented Generation) preloads all knowledge and caches the model's internal state:
[Offline: Knowledge Preloading]
All Documents → LLM Context → KV Cache (saved to disk)
[Online: Query Time]
User Query + Cached KV → LLM Generation → Response
The KV Cache Explained
When an LLM processes text, it computes Key-Value (KV) pairs for each token in the attention mechanism. These represent the model's "understanding" of the context.
# Simplified attention mechanism
# For each layer, the model computes:
K = input @ W_k # Keys
V = input @ W_v # Values
# These are cached and reused for subsequent tokens
# Instead of recomputing for the entire context each time
CAG insight: If your knowledge base is static, compute the KV cache once and reuse it for every query.
CAG Implementation
CAG's power comes from precomputing the expensive part of LLM inference. When an LLM processes your 100K-token knowledge base, it computes Key-Value matrices at each attention layer. These matrices encode the model's "understanding" of the context.
The insight: these KV values depend only on the preloaded knowledge text, not on the query. So compute them once, save them to disk, and reload them for every query. Each query then only needs to compute its own (tiny) KV values, and attention can reference the cached knowledge KVs.
The tradeoff: caches are large (tens of GB for big models) and model-specific. If you fine-tune the model, the cache is invalidated.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import pickle
class CAGSystem:
def __init__(self, model_name: str = "meta-llama/Llama-3.1-70B-Instruct"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
self.kv_cache = None
def preload_knowledge(self, documents: list[str], cache_path: str):
"""Preload all documents and cache KV states."""
# Combine all documents into context
knowledge_text = "\n\n---\n\n".join(documents)
# Create the preloaded prompt
preload_prompt = f"""You are a helpful assistant with access to the following knowledge base:
{knowledge_text}
Use this knowledge to answer questions accurately. If the answer is not in the knowledge base, say so.
"""
# Tokenize
inputs = self.tokenizer(
preload_prompt,
return_tensors="pt",
truncation=True,
max_length=self.model.config.max_position_embeddings - 1000 # Leave room for query
).to(self.model.device)
# Forward pass to generate KV cache
with torch.no_grad():
outputs = self.model(
**inputs,
use_cache=True,
return_dict=True
)
self.kv_cache = outputs.past_key_values
        # Save cache to disk. Note: recent transformers versions return a Cache object
        # rather than plain tuples; if pickling fails, convert with to_legacy_cache()
        # or use torch.save instead.
        with open(cache_path, 'wb') as f:
            pickle.dump(self.kv_cache, f)
print(f"Cached {len(documents)} documents ({inputs['input_ids'].shape[1]} tokens)")
def load_cache(self, cache_path: str):
"""Load precomputed KV cache from disk."""
with open(cache_path, 'rb') as f:
self.kv_cache = pickle.load(f)
def query(self, question: str) -> str:
"""Query using cached knowledge (no retrieval)."""
# Format the query
query_prompt = f"\n\nQuestion: {question}\n\nAnswer:"
inputs = self.tokenizer(
query_prompt,
return_tensors="pt"
).to(self.model.device)
        # Generate using the cached KV states.
        # Caveat: generate() appends the new tokens' keys/values to this cache in place,
        # so for repeated queries you should deep-copy the cache first (or truncate it
        # back to the knowledge length afterwards) and pass an attention mask that
        # covers both the cached tokens and the new query tokens.
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                past_key_values=self.kv_cache,
                max_new_tokens=500,
                do_sample=True,
                temperature=0.7
            )
response = self.tokenizer.decode(
outputs[0][inputs['input_ids'].shape[1]:],
skip_special_tokens=True
)
return response
CAG Latency Breakdown
Total latency = Generation only (500ms)
= ~500ms per query
Speedup vs RAG: ~23% faster (no retrieval overhead)
On the HotPotQA benchmark, CAG reduced generation time from 94.35s (RAG) to 2.33s — a ~40x improvement.
Memory & Compute Requirements
Understanding the resource implications is critical for choosing between RAG and CAG.
The fundamental resource tradeoff: RAG trades runtime compute (retrieval + embedding per query) for reduced memory (only retrieved docs in context). CAG trades large upfront memory (full knowledge in KV cache) for minimal per-query compute (just decode the answer). Your infrastructure constraints often make this decision for you.
Why this matters in practice: If you have a single A100 (80GB), CAG with a 70B model at 100K tokens is impossible—the model weights plus the KV cache far exceed available memory. But if you have multiple GPUs with NVLink, CAG becomes viable. Conversely, RAG works on modest hardware but requires additional infrastructure (vector database, embedding service).
KV Cache Memory Formula
The KV cache size scales with model architecture and context length:
KV Cache Size = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_param
Example for Llama 3.1 70B (80 layers, 8 KV heads, 128 head_dim, bfloat16):
= 2 × 80 × 8 × 128 × seq_len × 2 bytes
= 327,680 × seq_len bytes
= ~320 KB per token (~312 MB per 1K tokens)
= ~31.2 GB for 100K tokens
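To size a deployment before committing, the same arithmetic can be scripted. A minimal sketch; the layer and head counts are the published Llama 3.1 70B figures, and weight memory is approximated as parameters × 2 bytes for bfloat16:
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_param: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head dim x tokens x bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

# Llama 3.1 70B: 80 layers, 8 KV heads, head_dim 128, bfloat16, 100K-token knowledge base
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=100_000)
weights = 70e9 * 2  # ~140 GB of bf16 weights
print(f"KV cache: {cache / 1e9:.1f} GB, weights: {weights / 1e9:.0f} GB, "
      f"total: {(cache + weights) / 1e9:.0f} GB")  # ~33 GB + 140 GB, roughly 173 GB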
Memory Requirements by Model
| Model | Context | KV Cache (100K tokens) | Total VRAM Needed |
|---|---|---|---|
| Llama 3.1 8B | 128K | ~13 GB | ~30 GB |
| Llama 3.1 70B | 128K | ~31 GB | ~170 GB |
| Llama 4 Scout | 10M | ~250 GB (1M tokens) | 500+ GB |
| Qwen 2.5 72B | 128K | ~35 GB | ~180 GB |
Prefill vs Decode Costs
CAG shifts compute from per-query to one-time prefill:
RAG (per query):
- Embed query: ~50ms
- Vector search: ~100ms
- Prefill retrieved docs (~5K tokens): ~200ms
- Decode response: ~300ms
Total: ~650ms
CAG (per query):
- Reuse resident KV cache: ~0ms (the multi-GB cache is loaded into GPU memory once at startup; pulling ~30 GB from SSD or RAM takes seconds, not milliseconds)
- Decode response: ~300ms
Total: ~300ms
CAG prefill (one-time):
- Process 100K tokens: ~30-60 seconds
- Amortized over 10K queries: ~3-6ms per query
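A quick sanity check of that amortization, plugging in the illustrative numbers above (these are not benchmark results):
# Back-of-the-envelope amortization of CAG's one-time prefill
prefill_s = 45.0    # one-time: prefill 100K tokens (midpoint of the 30-60s range above)
decode_ms = 300.0   # per-query decode
queries = 10_000

amortized_prefill_ms = prefill_s * 1000 / queries   # ~4.5 ms per query
cag_ms = decode_ms + amortized_prefill_ms           # ~305 ms per query
rag_ms = 50 + 100 + 200 + 300                       # embed + search + doc prefill + decode

print(f"CAG ~{cag_ms:.0f} ms/query vs RAG ~{rag_ms} ms/query")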
Batch Inference Considerations
CAG with shared caches enables efficient batching:
class BatchCAGSystem:
"""Efficient batch inference with shared KV cache."""
    def __init__(self, model, tokenizer, shared_cache):
        self.model = model
        self.tokenizer = tokenizer
        self.shared_cache = shared_cache  # Same knowledge for all queries
def batch_query(self, questions: list[str]) -> list[str]:
# All queries share the same preloaded context
# Only the query tokens differ per batch item
batch_inputs = self.tokenizer(
[f"\n\nQuestion: {q}\n\nAnswer:" for q in questions],
padding=True,
return_tensors="pt"
)
        # Replicate KV cache along the batch dimension (helper not shown:
        # it repeats each cached key/value tensor once per query in the batch)
batch_cache = self._expand_cache_for_batch(
self.shared_cache,
batch_size=len(questions)
)
outputs = self.model.generate(
**batch_inputs,
past_key_values=batch_cache,
max_new_tokens=200
)
return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
Cost Comparison: Infrastructure
| Component | RAG | CAG |
|---|---|---|
| Vector DB | $200-2000/mo (managed) | Not needed |
| Embedding API | $0.0001/1K tokens | Not needed |
| GPU Memory | 24-48 GB sufficient | 80-180 GB+ needed |
| Storage | Vector index (~10GB) | KV cache files (~30GB) |
| Complexity | High (multiple services) | Low (single model) |
When to Use Each Approach
Use RAG When:
| Scenario | Why RAG |
|---|---|
| Large knowledge base | Exceeds context window (>1M tokens) |
| Frequently updated data | News, inventory, real-time analytics |
| Diverse query patterns | Different queries need different docs |
| Multi-tenant systems | Different users need different knowledge |
| Cost constraints | Smaller context = fewer tokens = lower cost |
Example RAG use cases:
- Enterprise search across millions of documents
- Customer support with ticket history
- Real-time news Q&A
- E-commerce product recommendations
Use CAG When:
| Scenario | Why CAG |
|---|---|
| Bounded knowledge | Fits in context window |
| Static content | Technical manuals, policies, FAQs |
| Low latency required | Real-time applications |
| Simplicity valued | No vector DB infrastructure |
| High query volume | Amortize preloading cost |
Example CAG use cases:
- Product documentation chatbot
- Company policy Q&A
- Legal contract analysis
- Medical reference lookup (fixed guidelines)
Decision Framework
Is your knowledge base > 500K tokens?
├── YES → Use RAG
└── NO → Continue...
Does your knowledge change frequently?
├── YES (daily/hourly) → Use RAG
└── NO (weekly/monthly) → Continue...
Is latency critical (<200ms)?
├── YES → Use CAG
└── NO → Either works, consider complexity
Do you already have vector DB infrastructure?
├── YES → RAG may be easier
└── NO → CAG avoids new infrastructure
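The same tree can be expressed as a small routing helper if you want the decision to live in code. A sketch; the thresholds mirror the tree above and are starting points, not hard rules:
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    knowledge_tokens: int          # total size of the knowledge base
    update_interval_hours: float   # how often the knowledge changes
    latency_budget_ms: float       # end-to-end latency requirement
    has_vector_db: bool            # existing retrieval infrastructure

def recommend_approach(w: WorkloadProfile) -> str:
    """Mirror of the decision tree above; thresholds are illustrative."""
    if w.knowledge_tokens > 500_000:
        return "rag"
    if w.update_interval_hours < 24:   # daily or hourly updates
        return "rag"
    if w.latency_budget_ms < 200:      # latency-critical
        return "cag"
    return "rag" if w.has_vector_db else "cag"

print(recommend_approach(WorkloadProfile(200_000, 7 * 24, 150, False)))  # -> cag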
Hybrid Approaches
The best systems often combine both approaches. The key insight: CAG and RAG have complementary strengths. CAG excels at fast, consistent answers for common questions. RAG excels at handling the long tail of rare queries and fresh information.
A well-designed hybrid uses CAG for the 80% of queries that hit core knowledge (fast path), and falls back to RAG for the 20% that need retrieval (comprehensive path). This gives you both speed and coverage.
Pattern 1: CAG Core + RAG Edge Cases
This pattern preloads your most-asked knowledge into CAG, using RAG only when CAG signals uncertainty. The needs_retrieval function is the router—it detects "I don't know" responses or low-confidence answers and triggers RAG fallback.
class HybridSystem:
"""Preload common knowledge, retrieve for edge cases."""
def __init__(self):
self.cag = CAGSystem()
self.rag = RAGSystem(index_name="edge-cases")
self.cag.load_cache("core_knowledge.cache")
def query(self, question: str) -> str:
# Try CAG first (fast path)
cag_response = self.cag.query(question)
# Check confidence / detect "I don't know"
if self.needs_retrieval(cag_response):
# Fall back to RAG for edge cases
return self.rag.query(question)
return cag_response
def needs_retrieval(self, response: str) -> bool:
"""Detect when CAG doesn't have the answer."""
        uncertainty_phrases = [
            "i don't have information",
            "not in my knowledge",
            "i'm not sure",
            "cannot find"
        ]
        # Compare in lowercase so capitalization in the response doesn't matter
        return any(phrase in response.lower() for phrase in uncertainty_phrases)
Pattern 2: Tiered Caching
class TieredCAGSystem:
"""Multiple cache tiers for different knowledge domains."""
def __init__(self):
self.caches = {
"products": "product_docs.cache",
"policies": "company_policies.cache",
"technical": "tech_manual.cache"
}
self.active_cache = None
        self.classifier = self.load_query_classifier()  # e.g., a small intent classifier (helper not shown)
def query(self, question: str) -> str:
# Classify query to select appropriate cache
domain = self.classifier.predict(question)
# Load domain-specific cache
if self.active_cache != domain:
self.load_cache(self.caches[domain])
self.active_cache = domain
return self.generate(question)
Pattern 3: CAG + Real-Time RAG Augmentation
class AugmentedCAGSystem:
"""CAG for static knowledge, RAG for real-time data."""
    def __init__(self):
        self.cag = CAGSystem()  # Product catalog, policies
        self.rag = RAGSystem(index_name="realtime")  # Inventory, prices, promotions (index name is a placeholder)
        self.llm = ChatOpenAI(model="gpt-5.2")  # Generator that combines both contexts (mirrors the RAG example)
def query(self, question: str) -> str:
# Get static context from CAG
static_context = self.cag.get_cached_context()
# Get dynamic context from RAG
dynamic_docs = self.rag.retrieve(question)
dynamic_context = "\n".join([d.page_content for d in dynamic_docs])
# Combine for generation
prompt = f"""Static Knowledge:
{static_context}
Current Information:
{dynamic_context}
Question: {question}
Answer:"""
return self.llm.invoke(prompt)
Limitations & Challenges
Both approaches have significant limitations that impact real-world deployments.
CAG Limitations
1. Lost in the Middle Problem
LLMs struggle to recall information from the middle of long contexts. The "Lost in the Middle" study (Liu et al., 2023) showed that recall accuracy drops significantly for information positioned in the middle ~50% of the context:
Position in Context → Recall Accuracy
Beginning (0-10%): ~95%
Middle (40-60%): ~60-70%
End (90-100%): ~90%
Mitigation strategies:
- Place critical information at the beginning and end
- Use structured formatting with clear section headers
- Implement attention sinks (repetition of key facts)
- Consider document ordering by importance (see the sketch after this list)
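One way to act on the ordering advice: interleave documents so the most important material lands at the start and end of the context, pushing the least critical content toward the middle. A minimal sketch; the importance scores are assumed to come from your own ranking signal:
def order_for_long_context(docs: list[tuple[str, float]]) -> list[str]:
    """Place high-importance docs at the beginning and end of the context.

    docs: (text, importance) pairs; higher importance = more critical.
    """
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        # Alternate: most important first, second-most last, and so on,
        # so the least important material ends up in the middle.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]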
2. Attention Dilution
As context grows, attention becomes diluted across more tokens:
# Simplified attention visualization
def attention_per_token(context_length: int, query_tokens: int = 100):
"""Average attention each context token receives."""
# Softmax distributes attention across all tokens
# More tokens = less attention per token
return query_tokens / context_length
# 10K context: 1% attention per token
# 100K context: 0.1% attention per token
# 1M context: 0.01% attention per token
This manifests as:
- Subtle details getting ignored
- Conflicting information not being reconciled
- Reduced reasoning over distant context
3. Cache Staleness
CAG caches become stale when knowledge changes:
| Update Frequency | CAG Viability | Recommendation |
|---|---|---|
| Real-time | Poor | Use RAG |
| Hourly | Poor | Use RAG or hybrid |
| Daily | Moderate | Scheduled cache rebuilds |
| Weekly | Good | Batch updates |
| Static | Excellent | Pure CAG |
RAG Limitations
1. Retrieval Failures
RAG can fail silently when retrieval misses relevant documents:
| Failure Mode | Cause | Impact |
|---|---|---|
| Semantic gap | Query/doc embedding mismatch | Wrong docs retrieved |
| Chunking artifacts | Answer split across chunks | Partial information |
| Sparse coverage | Rare topics under-represented | No relevant docs |
| Adversarial queries | Intentionally confusing queries | Hallucinated answers |
# Example: Retrieval failure detection
def detect_retrieval_failure(query: str, docs: list, response: str) -> bool:
"""Heuristics for detecting poor retrieval."""
signals = []
# Low relevance scores
if all(doc.score < 0.5 for doc in docs):
signals.append("low_relevance")
# Response contradicts retrieved docs
if contains_contradiction(response, docs):
signals.append("contradiction")
# Response uses knowledge not in docs (hallucination)
if contains_external_knowledge(response, docs):
signals.append("potential_hallucination")
return len(signals) > 0
2. Multi-Hop Reasoning Failures
RAG struggles with questions requiring information synthesis:
Question: "What is the revenue of the company that acquired the
startup founded by the person who invented X?"
Required reasoning:
1. Who invented X? → Person A
2. What startup did Person A found? → Startup B
3. Who acquired Startup B? → Company C
4. What is Company C's revenue? → $Y
RAG typically retrieves docs for step 1, missing steps 2-4.
3. Latency Variance
RAG latency is unpredictable due to retrieval variability:
Latency distribution (p50/p95/p99):
- Embedding: 30ms / 80ms / 200ms
- Vector search: 50ms / 150ms / 500ms
- Network: 20ms / 100ms / 300ms
- Total retrieval: 100ms / 330ms / 1000ms
CAG has consistent latency (no retrieval variance)
Comparison: Failure Modes
| Failure Mode | RAG | CAG |
|---|---|---|
| Missing information | Retrieval miss | Context too long |
| Stale information | Index lag | Cache staleness |
| Conflicting sources | Chunk disagreement | In-context conflicts |
| Hallucination | Confabulation from bad retrieval | Less common (full context) |
| Latency spikes | Network/DB issues | Rare (local inference) |
Performance Comparison
From the research paper and real-world benchmarks:
Accuracy (HotPotQA)
| Method | Exact Match | F1 Score |
|---|---|---|
| RAG (Top-5) | 45.2% | 58.1% |
| CAG (Full Context) | 51.3% | 64.7% |
| CAG (Optimized) | 52.1% | 65.2% |
Why CAG wins on accuracy: The model sees ALL relevant context, not just retrieved chunks. This helps with multi-hop reasoning where the answer requires connecting information from multiple sources.
Latency
| Method | Latency (HotPotQA Large) |
|---|---|
| RAG | 94.35 seconds |
| CAG | 2.33 seconds |
Cost Analysis
For 10,000 queries/day on a 100K token knowledge base:
| Approach | Daily Cost | Notes |
|---|---|---|
| RAG | ~$150 | Retrieval + 5K context per query |
| CAG | ~$200 | Full 100K context per query |
| CAG (cached) | ~$50 | KV cache reduces compute |
CAG with KV caching can be significantly cheaper because you only compute the knowledge encoding once.
Implementation Best Practices
For RAG
- Chunk strategically: Use semantic chunking, not fixed-size
- Hybrid search: Combine vector + keyword (BM25)
Reranking: Use a cross-encoder to rerank top results (see the sketch after this list)
- Query expansion: Rewrite queries for better retrieval
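For the reranking item, a minimal sketch using a sentence-transformers cross-encoder; the model name is one common public checkpoint, not a requirement:
from sentence_transformers import CrossEncoder

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    """Re-score retrieved docs with a cross-encoder and keep the best top_n."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]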
For CAG
- Optimize context order: Put most important info first (primacy effect)
- Use structured formatting: Headers, sections, clear delineation
- Compress where possible: Summarize verbose documents
- Version your caches: Track which documents are in each cache
For Hybrid
- Monitor cache hits: Track when CAG succeeds vs needs RAG
- A/B test: Compare latency and accuracy in production
- Warm caches: Pre-compute during off-peak hours
- Graceful degradation: If cache fails, fall back to RAG
Production Considerations
Deploying RAG or CAG at scale requires careful attention to operational concerns.
Cache Management for CAG
Versioning Strategy
import hashlib
from datetime import datetime
from dataclasses import dataclass
@dataclass
class CacheMetadata:
version: str
created_at: datetime
document_hashes: dict[str, str]
model_name: str
token_count: int
class VersionedCAGCache:
def __init__(self, cache_dir: str):
self.cache_dir = cache_dir
def create_cache(self, documents: dict[str, str], model) -> str:
"""Create versioned cache with metadata."""
# Hash documents for change detection
doc_hashes = {
name: hashlib.sha256(content.encode()).hexdigest()[:16]
for name, content in documents.items()
}
# Version based on content + model
version_string = f"{sorted(doc_hashes.items())}-{model.name}"
version = hashlib.sha256(version_string.encode()).hexdigest()[:12]
# Build cache
kv_cache = self._build_kv_cache(documents, model)
# Save with metadata
metadata = CacheMetadata(
version=version,
created_at=datetime.now(),
document_hashes=doc_hashes,
model_name=model.name,
token_count=sum(len(d.split()) for d in documents.values())
)
self._save_cache(version, kv_cache, metadata)
return version
def needs_rebuild(self, current_docs: dict[str, str]) -> bool:
"""Check if cache needs rebuilding due to document changes."""
latest = self._load_latest_metadata()
if not latest:
return True
current_hashes = {
name: hashlib.sha256(content.encode()).hexdigest()[:16]
for name, content in current_docs.items()
}
return current_hashes != latest.document_hashes
Cache Invalidation Patterns
| Pattern | Use Case | Implementation |
|---|---|---|
| Time-based | Predictable updates | Cron job rebuilds cache daily/weekly |
| Content-hash | Change detection | Rebuild when document hashes change |
| Manual trigger | Controlled releases | CI/CD pipeline triggers rebuild |
| Hybrid | Production systems | Hash check + maximum age (see sketch below) |
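The hybrid row combines the two checks. A sketch built on the VersionedCAGCache above; max_age_hours is an assumed parameter, and _load_latest_metadata is the same helper the earlier example relies on:
from datetime import datetime, timedelta

def should_rebuild(cache: VersionedCAGCache, current_docs: dict[str, str],
                   max_age_hours: int = 7 * 24) -> bool:
    """Hybrid invalidation: rebuild when documents changed OR the cache exceeds its maximum age."""
    metadata = cache._load_latest_metadata()
    if metadata is None:
        return True
    too_old = datetime.now() - metadata.created_at > timedelta(hours=max_age_hours)
    return too_old or cache.needs_rebuild(current_docs)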
Monitoring & Observability
Track these metrics to ensure system health:
from prometheus_client import Counter, Histogram, Gauge
# RAG-specific metrics
rag_retrieval_latency = Histogram(
'rag_retrieval_latency_seconds',
'Time spent on retrieval',
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)
rag_retrieval_count = Counter(
'rag_retrieval_total',
'Total retrieval operations',
['status'] # success, failure, timeout
)
rag_relevance_score = Histogram(
'rag_top_doc_relevance',
'Relevance score of top retrieved document',
buckets=[0.1, 0.3, 0.5, 0.7, 0.9]
)
# CAG-specific metrics
cag_cache_load_latency = Histogram(
'cag_cache_load_latency_seconds',
'Time to load KV cache'
)
cag_cache_size_bytes = Gauge(
'cag_cache_size_bytes',
'Size of loaded KV cache'
)
cag_cache_hit_rate = Gauge(
'cag_cache_hit_rate',
'Percentage of queries served from cache'
)
# Shared metrics
generation_latency = Histogram(
'llm_generation_latency_seconds',
'Time spent on LLM generation',
['method'] # rag, cag, hybrid
)
response_quality_score = Histogram(
'response_quality_score',
'Quality score from evaluation model'
)
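A sketch of wiring these metrics into the RAG query path from earlier; rag_system is assumed to expose the retriever and llm attributes of the RAGSystem class above:
import time

def instrumented_rag_query(rag_system, question: str) -> str:
    """Record retrieval and generation metrics around a single RAG query."""
    start = time.time()
    try:
        docs = rag_system.retriever.invoke(question)
        rag_retrieval_count.labels(status="success").inc()
    except Exception:
        rag_retrieval_count.labels(status="failure").inc()
        raise
    finally:
        rag_retrieval_latency.observe(time.time() - start)

    gen_start = time.time()
    context = "\n\n".join(doc.page_content for doc in docs)
    response = rag_system.llm.invoke(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    generation_latency.labels(method="rag").observe(time.time() - gen_start)
    return response.content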
A/B Testing Framework
import hashlib
import time
from typing import Literal
class RAGvsCAGExperiment:
"""A/B test RAG vs CAG in production."""
def __init__(self, rag_system, cag_system, cag_traffic_pct: float = 0.1):
self.rag = rag_system
self.cag = cag_system
self.cag_traffic_pct = cag_traffic_pct
        self.metrics = MetricsCollector()  # your own latency/success tracker (not shown here)
def query(self, question: str, user_id: str) -> tuple[str, Literal["rag", "cag"]]:
# Deterministic assignment based on user_id
use_cag = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100 < self.cag_traffic_pct * 100
method = "cag" if use_cag else "rag"
start_time = time.time()
try:
if use_cag:
response = self.cag.query(question)
else:
response = self.rag.query(question)
latency = time.time() - start_time
self.metrics.record(method, latency, success=True)
return response, method
except Exception as e:
self.metrics.record(method, time.time() - start_time, success=False)
# Fallback to RAG on CAG failure
if use_cag:
return self.rag.query(question), "rag"
raise
def get_results(self) -> dict:
"""Get A/B test results."""
return {
"rag": {
"p50_latency": self.metrics.percentile("rag", 50),
"p95_latency": self.metrics.percentile("rag", 95),
"success_rate": self.metrics.success_rate("rag"),
"sample_size": self.metrics.count("rag")
},
"cag": {
"p50_latency": self.metrics.percentile("cag", 50),
"p95_latency": self.metrics.percentile("cag", 95),
"success_rate": self.metrics.success_rate("cag"),
"sample_size": self.metrics.count("cag")
}
}
Multi-Tenant Isolation
For SaaS applications serving multiple customers:
import os

class MultiTenantCAG:
    """Isolated CAG caches per tenant."""
    def __init__(self, model, cache_dir: str):
self.model = model
self.cache_dir = cache_dir
self.loaded_tenant: str | None = None
self.current_cache = None
def query(self, tenant_id: str, question: str) -> str:
# Load tenant-specific cache if needed
if self.loaded_tenant != tenant_id:
cache_path = f"{self.cache_dir}/{tenant_id}/knowledge.cache"
if not os.path.exists(cache_path):
                raise TenantCacheNotFound(tenant_id)  # your own "cache missing" exception
self.current_cache = self._load_cache(cache_path)
self.loaded_tenant = tenant_id
return self._generate(question, self.current_cache)
def build_tenant_cache(self, tenant_id: str, documents: list[str]):
"""Build isolated cache for a tenant."""
cache_path = f"{self.cache_dir}/{tenant_id}/knowledge.cache"
os.makedirs(os.path.dirname(cache_path), exist_ok=True)
# Tenant data never mixes with other tenants
kv_cache = self._build_cache(documents)
self._save_cache(cache_path, kv_cache)
Graceful Degradation
import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientKnowledgeSystem:
    """Production system with fallbacks."""
    def __init__(self, cag: CAGSystem, rag: RAGSystem):
        self.cag = cag
        self.rag = rag
        self.circuit_breaker = CircuitBreaker(  # sketched after this example
            failure_threshold=5,
            recovery_timeout=60
        )
async def query(self, question: str) -> str:
# Try CAG first (fast path)
if self.circuit_breaker.is_closed("cag"):
try:
                return await asyncio.wait_for(
                    asyncio.to_thread(self.cag.query, question),  # run the sync query off the event loop
                    timeout=5.0
                )
            except (asyncio.TimeoutError, CacheLoadError) as e:  # CacheLoadError: your own cache exception
self.circuit_breaker.record_failure("cag")
logger.warning(f"CAG failed, falling back to RAG: {e}")
# Fallback to RAG
try:
            return await asyncio.to_thread(self.rag.query, question)
except Exception as e:
logger.error(f"Both CAG and RAG failed: {e}")
return "I'm having trouble accessing my knowledge base. Please try again."
The Future: Longer Context Windows
CAG becomes more viable as context windows grow:
| Model | Context Window | CAG Viability |
|---|---|---|
| GPT-4 (2023) | 128K | Moderate |
| Claude 3.5 Sonnet (2024) | 200K | Good |
| Gemini 1.5 Pro (2024) | 2M | Excellent |
| GPT-5.2 (Dec 2025) | 400K | Very Good |
| Claude Sonnet 4.5 (2025) | 200K-1M | Excellent |
| Claude Opus 4.5 (2025) | 200K | Good (Infinite Chat) |
| Gemini 3 Pro (Nov 2025) | 1M | Excellent |
| Llama 4 Scout (2025) | 10M | Most use cases |
With 10M token context windows, CAG can handle knowledge bases that would have required RAG infrastructure just two years ago.
Evaluation & Benchmarking
Before choosing RAG or CAG, benchmark both on your specific use case.
Evaluation Metrics
| Metric | Description | How to Measure |
|---|---|---|
| Accuracy | Correctness of answers | Human evaluation or LLM-as-judge |
| Faithfulness | Grounded in provided context | Check if claims appear in sources |
| Relevance | Answer addresses the question | Semantic similarity scoring |
| Completeness | All aspects answered | Checklist against expected points |
| Latency | End-to-end response time | p50, p95, p99 percentiles |
| Cost | Total cost per query | API costs + infrastructure |
Benchmarking Framework
import json
import time
from dataclasses import dataclass
@dataclass
class BenchmarkResult:
method: str
accuracy: float
faithfulness: float
avg_latency_ms: float
p95_latency_ms: float
cost_per_query: float
total_queries: int
class RAGvsCAGBenchmark:
"""Comprehensive benchmark for comparing RAG and CAG."""
def __init__(self, rag_system, cag_system, eval_model):
self.rag = rag_system
self.cag = cag_system
self.eval_model = eval_model # For LLM-as-judge
def run_benchmark(self, test_cases: list[dict]) -> dict[str, BenchmarkResult]:
"""
test_cases format:
[
{
"question": "What is the return policy?",
"expected_answer": "30-day money-back guarantee",
"source_docs": ["policy.md"], # For faithfulness check
"difficulty": "easy" # easy, medium, hard
}
]
"""
results = {"rag": [], "cag": []}
for test in test_cases:
# Run both systems
rag_result = self._evaluate_single(self.rag, test, "rag")
cag_result = self._evaluate_single(self.cag, test, "cag")
results["rag"].append(rag_result)
results["cag"].append(cag_result)
return {
"rag": self._aggregate_results(results["rag"], "rag"),
"cag": self._aggregate_results(results["cag"], "cag")
}
def _evaluate_single(self, system, test: dict, method: str) -> dict:
start = time.time()
response = system.query(test["question"])
latency = (time.time() - start) * 1000
# LLM-as-judge evaluation
eval_prompt = f"""Evaluate this response:
Question: {test["question"]}
Expected: {test["expected_answer"]}
Response: {response}
Rate on a scale of 1-5:
1. Accuracy (correctness):
2. Completeness (covers all aspects):
3. Faithfulness (no hallucinations):
Return JSON: {{"accuracy": X, "completeness": X, "faithfulness": X}}"""
        scores = json.loads(self.eval_model.invoke(eval_prompt))  # assumes the judge returns a raw JSON string
return {
"latency_ms": latency,
"accuracy": scores["accuracy"] / 5,
"faithfulness": scores["faithfulness"] / 5,
"response": response
}
def _aggregate_results(self, results: list, method: str) -> BenchmarkResult:
latencies = [r["latency_ms"] for r in results]
return BenchmarkResult(
method=method,
accuracy=sum(r["accuracy"] for r in results) / len(results),
faithfulness=sum(r["faithfulness"] for r in results) / len(results),
avg_latency_ms=sum(latencies) / len(latencies),
p95_latency_ms=sorted(latencies)[int(len(latencies) * 0.95)],
cost_per_query=self._calculate_cost(method),
total_queries=len(results)
)
Needle-in-Haystack Testing
Test how well each system retrieves specific information:
def needle_in_haystack_test(system, knowledge_base: str, needles: list[dict]):
"""
Test retrieval of specific facts buried in large context.
needles format:
[
{
"fact": "The secret code is ALPHA-7742",
"position": 0.5, # Middle of context
"question": "What is the secret code?"
}
]
"""
results = []
for needle in needles:
# Insert needle at specified position
insert_pos = int(len(knowledge_base) * needle["position"])
test_context = (
knowledge_base[:insert_pos] +
f"\n{needle['fact']}\n" +
knowledge_base[insert_pos:]
)
# Query the system
response = system.query(needle["question"], context=test_context)
# Check if needle was found
found = needle["fact"].lower() in response.lower()
results.append({
"position": needle["position"],
"found": found,
"response": response
})
# Aggregate by position
position_accuracy = {}
for r in results:
pos_bucket = round(r["position"], 1)
if pos_bucket not in position_accuracy:
position_accuracy[pos_bucket] = []
position_accuracy[pos_bucket].append(r["found"])
return {
pos: sum(found) / len(found)
for pos, found in position_accuracy.items()
}
Decision Checklist
Use this checklist to evaluate which approach fits your use case:
## RAG vs CAG Decision Checklist
### Knowledge Base Characteristics
- [ ] Size < 500K tokens? → Favor CAG
- [ ] Size > 1M tokens? → Favor RAG
- [ ] Updates daily or more? → Favor RAG
- [ ] Static or monthly updates? → Favor CAG
### Query Patterns
- [ ] Multi-hop reasoning required? → Favor CAG
- [ ] Queries span many topics? → Favor RAG
- [ ] Queries focused on specific domain? → Favor CAG
### Infrastructure
- [ ] Already have vector DB? → RAG easier
- [ ] Have high-memory GPUs? → CAG viable
- [ ] Serverless deployment? → RAG more flexible
### Requirements
- [ ] Latency < 500ms required? → Favor CAG
- [ ] Real-time data needed? → Favor RAG
- [ ] Multi-tenant isolation needed? → Both work, different tradeoffs
### Score
- More CAG checkmarks: Start with CAG
- More RAG checkmarks: Start with RAG
- Mixed: Consider hybrid approach
Conclusion
RAG and CAG are complementary, not competing approaches.
- RAG excels for large, dynamic knowledge bases where you can't fit everything in context
- CAG excels for bounded, static knowledge where latency and simplicity matter
- Hybrid approaches combine the best of both worlds
The key insight: don't default to RAG just because it's popular. Evaluate whether your knowledge base fits in modern context windows. If it does, CAG offers significant latency and accuracy improvements with simpler infrastructure.
As context windows continue growing, expect CAG to become the default for an increasing number of use cases.
Related Articles
Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Agentic RAG: When Retrieval Meets Autonomous Reasoning
How to build RAG systems that don't just retrieve—they reason, plan, and iteratively refine their searches to solve complex information needs.
Mastering LLM Context Windows: Strategies for Long-Context Applications
Practical techniques for managing context windows in production LLM applications—from compression to hierarchical processing to infinite context architectures.