Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Introduction
Retrieval-Augmented Generation (RAG) has become the dominant architecture for building LLM applications that need access to specific, up-to-date, or proprietary information. The concept is deceptively simple: retrieve relevant context, feed it to an LLM, generate a grounded response. Yet the gap between a weekend RAG prototype and a production system handling millions of queries is enormous.
After building RAG systems at Goji AI that process over 2 million queries monthly across legal, financial, and technical domains, I've learned that success lies not in any single technique, but in the careful orchestration of dozens of interdependent components. This post shares the hard-won lessons from that journey.
The Anatomy of Production RAG
A production RAG system is far more complex than the typical tutorial suggests. Here's what a real architecture looks like:
Ingestion Pipeline:
- Document acquisition and normalization
- Content extraction (PDF, HTML, Office formats)
- Preprocessing and cleaning
- Chunking and overlap management
- Metadata extraction and enrichment
- Embedding generation
- Vector store indexing
Query Pipeline:
- Query understanding and classification
- Query expansion and reformulation
- Hybrid retrieval (dense + sparse)
- Re-ranking and filtering
- Context assembly and compression
- Prompt construction
- LLM generation
- Response post-processing
- Citation extraction
Supporting Infrastructure:
- Evaluation and monitoring
- Caching layers
- Rate limiting and load balancing
- Feedback collection
- Continuous improvement loops
Let's dive into the critical decisions at each stage.
Document Processing: The Foundation
The quality of your RAG system is bounded by the quality of your document processing. No amount of sophisticated retrieval can compensate for poorly chunked or corrupted content.
Why document processing is the most underinvested area in RAG: Teams spend weeks tuning embedding models and retrieval algorithms, but rush through document processing. This is backwards. If your extraction corrupts content, misses tables, or destroys document structure, you've created a ceiling that no amount of downstream optimization can break through. We've seen teams improve retrieval quality by 20%+ simply by fixing extraction issues, without touching any other component.
The garbage-in-garbage-out reality: Consider what happens when a PDF with two-column layout gets extracted as a single stream of text. Sentences from the left column interleave with sentences from the right column, creating nonsensical content. The embedding model faithfully embeds this garbage. The retrieval system dutifully returns it. The LLM generates confidently wrong answers from corrupted input. Every downstream component works perfectly—they just operate on corrupted data.
Content Extraction
PDF extraction alone is a minefield. Academic papers, scanned documents, tables, multi-column layouts—each requires different handling. We use a tiered approach:
- Native text extraction for born-digital PDFs using PyMuPDF
- OCR fallback with Tesseract for scanned documents, validated against confidence scores
- Layout analysis using LayoutLM or similar models for complex documents
- Table extraction with Camelot or Tabula, stored separately with structural metadata
Implementation: Here's our production document extraction pipeline that handles different PDF types intelligently. The key insight is that different PDFs require different extraction strategies—born-digital PDFs can use fast text extraction, while scanned documents need OCR. We also extract tables separately because they require different chunking strategies than prose text.
The confidence scoring is critical: OCR can produce gibberish on poor-quality scans. By checking the OCR confidence score, we can flag low-quality extractions for manual review rather than indexing garbage into our RAG system.
import fitz  # PyMuPDF
from PIL import Image
import pytesseract
from typing import Dict, List
import logging


class DocumentExtractor:
    """Production-grade document extraction with fallback strategies."""

    def __init__(self, ocr_confidence_threshold: float = 0.7):
        self.ocr_threshold = ocr_confidence_threshold
        self.logger = logging.getLogger(__name__)

    def extract_pdf(self, pdf_path: str) -> Dict:
        """
        Extract text from PDF with automatic strategy selection.

        Returns a dict with:
        - text: extracted content
        - method: extraction method used (text/ocr/hybrid)
        - confidence: quality score (0-1)
        - tables: list of extracted tables
        - metadata: PDF metadata
        """
        doc = fitz.open(pdf_path)

        # Try native text extraction first
        text_content = []
        low_text_pages = []

        for page_num, page in enumerate(doc):
            text = page.get_text()
            # If page has very little text, it's likely scanned
            if len(text.strip()) < 50:
                low_text_pages.append(page_num)
            else:
                text_content.append(text)

        native_page_count = len(text_content)

        # Use OCR for pages with little/no text
        if low_text_pages:
            self.logger.info(f"Using OCR for {len(low_text_pages)} pages")
            ocr_results = self._ocr_pages(doc, low_text_pages)
            text_content.extend(ocr_results['text'])
            confidence = ocr_results['avg_confidence']
            method = "hybrid" if native_page_count else "ocr"
        else:
            confidence = 1.0
            method = "text"

        # Extract tables separately
        tables = self._extract_tables(pdf_path)

        return {
            'text': '\n\n'.join(text_content),
            'method': method,
            'confidence': confidence,
            'tables': tables,
            'metadata': doc.metadata,
            'page_count': len(doc)
        }

    def _ocr_pages(self, doc, page_numbers: List[int]) -> Dict:
        """OCR specific pages and return text with confidence scores."""
        texts = []
        confidences = []

        for page_num in page_numbers:
            page = doc[page_num]
            # Render page to image at 300 DPI for good OCR quality
            pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

            # Get OCR data with confidence scores
            ocr_data = pytesseract.image_to_data(
                img,
                output_type=pytesseract.Output.DICT
            )

            # Calculate average confidence for this page (-1 marks non-text blocks)
            conf_scores = [float(c) for c in ocr_data['conf'] if float(c) >= 0]
            avg_conf = sum(conf_scores) / len(conf_scores) if conf_scores else 0

            # Only include text if confidence is acceptable
            if avg_conf >= self.ocr_threshold * 100:
                page_text = ' '.join(word for word in ocr_data['text'] if word.strip())
                texts.append(page_text)
                confidences.append(avg_conf / 100)
            else:
                self.logger.warning(
                    f"Page {page_num} OCR confidence too low: {avg_conf:.2f}%"
                )

        return {
            'text': texts,
            'avg_confidence': sum(confidences) / len(confidences) if confidences else 0
        }

    def _extract_tables(self, pdf_path: str) -> List[Dict]:
        """Extract tables with structure preserved."""
        try:
            import camelot
            tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
            return [{
                'page': table.page,
                'data': table.df.to_dict(),
                'shape': table.shape,
                'accuracy': table.accuracy
            } for table in tables]
        except Exception as e:
            self.logger.error(f"Table extraction failed: {e}")
            return []


# Usage example
extractor = DocumentExtractor(ocr_confidence_threshold=0.7)
result = extractor.extract_pdf("complex_document.pdf")

if result['confidence'] < 0.8:
    print(f"⚠️ Low confidence extraction ({result['confidence']:.2%})")
    print(f"Method used: {result['method']}")
    # Flag for manual review in production
For HTML content, we've found that naive extraction (e.g., dumping the whole page with BeautifulSoup's get_text()) loses important context. We use a custom extraction pipeline that preserves:
- Heading hierarchy (critical for chunking)
- List structures
- Table relationships
- Link context and anchor text
- Image alt text and captions
Implementation: This HTML extractor preserves document structure that's essential for high-quality chunking. The key difference from naive extraction is maintaining the heading hierarchy—this lets us chunk by section rather than by arbitrary character counts, which dramatically improves retrieval quality.
Notice how we preserve the heading level and position metadata. This allows our chunking strategy to keep related content together (everything under "Installation Instructions" stays together) rather than splitting mid-section.
from bs4 import BeautifulSoup
from typing import Dict, List


class HTMLExtractor:
    """Extract structured content from HTML preserving document structure."""

    def __init__(self):
        self.heading_tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

    def extract_with_structure(self, html_content: str) -> Dict:
        """
        Extract HTML with preserved structure for better chunking.

        Returns sections with heading hierarchy, allowing semantic chunking
        that respects document structure rather than arbitrary splits.
        """
        soup = BeautifulSoup(html_content, 'html.parser')

        # Remove script, style, and navigation elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):
            tag.decompose()

        sections = []
        current_section = {
            'heading': '',
            'level': 0,
            'content': [],
            'position': 0
        }

        for elem in soup.find_all(['p', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'table']):
            if elem.name in self.heading_tags:
                # Save previous section if it has content
                if current_section['content']:
                    sections.append(current_section)

                # Start new section
                level = int(elem.name[1])  # h1 -> 1, h2 -> 2, etc.
                current_section = {
                    'heading': elem.get_text(strip=True),
                    'level': level,
                    'content': [],
                    'position': len(sections)
                }
            elif elem.name == 'table':
                # Store tables separately with row/column structure
                current_section['content'].append({
                    'type': 'table',
                    'data': self._extract_table(elem)
                })
            elif elem.name in ['p', 'li']:
                text = elem.get_text(strip=True)
                if text:
                    # Preserve link context
                    links = elem.find_all('a')
                    link_context = [
                        {'text': a.get_text(), 'href': a.get('href')}
                        for a in links
                    ]
                    current_section['content'].append({
                        'type': 'text',
                        'text': text,
                        'links': link_context
                    })

        # Add final section
        if current_section['content']:
            sections.append(current_section)

        return {
            'sections': sections,
            'title': soup.find('title').get_text() if soup.find('title') else '',
            'metadata': self._extract_metadata(soup)
        }

    def _extract_table(self, table_elem) -> Dict:
        """Extract table with structure preserved."""
        rows = []
        for tr in table_elem.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all(['td', 'th'])]
            rows.append(cells)

        return {
            'headers': rows[0] if rows else [],
            'rows': rows[1:] if len(rows) > 1 else []
        }

    def _extract_metadata(self, soup) -> Dict:
        """Extract page metadata from meta tags."""
        metadata = {}
        # Common meta tags
        for meta in soup.find_all('meta'):
            name = meta.get('name') or meta.get('property')
            content = meta.get('content')
            if name and content:
                metadata[name] = content
        return metadata


# Usage
extractor = HTMLExtractor()
result = extractor.extract_with_structure(html_content)

# Now you can chunk by section, preserving semantic boundaries
for section in result['sections']:
    print(f"{'#' * section['level']} {section['heading']}")
    print(f"Position: {section['position']}")
    # This section stays together during chunking
Chunking Strategy: The Most Underrated Decision
Chunking strategy has more impact on retrieval quality than your choice of embedding model. After extensive experimentation, here's what we've learned.
Why chunking matters more than embedding models: A mediocre embedding model on well-chunked content will outperform an excellent embedding model on poorly-chunked content. Why? Embedding models are trained on coherent text—sentences that follow logically, paragraphs with complete thoughts. When you feed them a chunk that starts mid-sentence and ends mid-paragraph, the embedding doesn't capture the meaning because there is no coherent meaning to capture. The model can only embed what exists.
The chunk size tradeoff nobody talks about: Smaller chunks are more precise—they answer specific questions with less noise. Larger chunks provide more context—they help when the question requires understanding relationships across sentences. There's no universally optimal chunk size. We typically target 200-400 tokens as a baseline, but we adapt based on document type: short chunks for Q&A knowledge bases, longer chunks for narrative documents like reports.
Semantic Chunking Over Fixed-Size
Fixed token counts (e.g., 512 tokens) are convenient but destroy semantic coherence. A chunk that splits mid-sentence or mid-paragraph loses critical context. We chunk on semantic boundaries:
- Section breaks (identified by headings)
- Paragraph boundaries
- Sentence boundaries as a last resort
- Maximum chunk size as a hard limit, not a target
Implementation: This semantic chunker respects document structure instead of blindly splitting at token boundaries. The critical insight: a chunk should be a complete thought, not an arbitrary slice of text. By chunking at semantic boundaries (sections, paragraphs), each chunk is self-contained and embeds cleanly.
The overlap strategy is also semantic—we overlap by full sentences, not character counts. This ensures that if a key concept spans two chunks, both chunks get the complete sentences about that concept, not a truncated mid-sentence fragment.
from typing import List, Dict
import tiktoken
from nltk.tokenize import sent_tokenize
import nltk

# Download required NLTK data
nltk.download('punkt', quiet=True)


class SemanticChunker:
    """Chunk text at semantic boundaries, not arbitrary token counts."""

    def __init__(
        self,
        model_name: str = "gpt-4",
        target_chunk_size: int = 300,
        max_chunk_size: int = 500,
        overlap_sentences: int = 2
    ):
        self.tokenizer = tiktoken.encoding_for_model(model_name)
        self.target_chunk_size = target_chunk_size
        self.max_chunk_size = max_chunk_size
        self.overlap_sentences = overlap_sentences

    def chunk_document(self, sections: List[Dict]) -> List[Dict]:
        """
        Chunk document by semantic boundaries.

        Input: Structured sections from HTMLExtractor or similar
        Output: Chunks with metadata and overlap

        Each chunk maintains:
        - The heading hierarchy (crucial for context)
        - Position in document
        - Overlapping sentences with adjacent chunks
        """
        chunks = []

        for section in sections:
            section_heading = section['heading']
            section_level = section['level']

            # Extract all text content from section
            section_text = self._extract_section_text(section)

            # Split into paragraphs first (respects document structure)
            paragraphs = section_text.split('\n\n')

            current_chunk_paras = []
            current_tokens = 0

            for para in paragraphs:
                para = para.strip()
                if not para:
                    continue

                para_tokens = len(self.tokenizer.encode(para))

                # If single paragraph exceeds max, split by sentences
                if para_tokens > self.max_chunk_size:
                    # Finish current chunk if any
                    if current_chunk_paras:
                        chunk_text = '\n\n'.join(current_chunk_paras)
                        chunks.append(self._create_chunk(
                            chunk_text, section_heading, section_level, section['position']
                        ))
                        current_chunk_paras = []
                        current_tokens = 0

                    # Split long paragraph by sentences
                    chunks.extend(
                        self._chunk_long_paragraph(
                            para, section_heading, section_level, section['position']
                        )
                    )
                    continue

                # If adding this para exceeds target, finish current chunk
                if current_tokens + para_tokens > self.target_chunk_size and current_chunk_paras:
                    chunk_text = '\n\n'.join(current_chunk_paras)
                    chunks.append(self._create_chunk(
                        chunk_text, section_heading, section_level, section['position']
                    ))
                    current_chunk_paras = []
                    current_tokens = 0

                # Add paragraph to current chunk
                current_chunk_paras.append(para)
                current_tokens += para_tokens

            # Add final chunk if any content remains
            if current_chunk_paras:
                chunk_text = '\n\n'.join(current_chunk_paras)
                chunks.append(self._create_chunk(
                    chunk_text, section_heading, section_level, section['position']
                ))

        # Add overlaps between adjacent chunks
        chunks = self._add_overlaps(chunks)

        return chunks

    def _chunk_long_paragraph(
        self, paragraph: str, heading: str, level: int, position: int
    ) -> List[Dict]:
        """Split long paragraph by sentences."""
        sentences = sent_tokenize(paragraph)
        chunks = []
        current_sentences = []
        current_tokens = 0

        for sentence in sentences:
            sent_tokens = len(self.tokenizer.encode(sentence))

            if current_tokens + sent_tokens > self.target_chunk_size and current_sentences:
                chunk_text = ' '.join(current_sentences)
                chunks.append(self._create_chunk(chunk_text, heading, level, position))
                current_sentences = []
                current_tokens = 0

            current_sentences.append(sentence)
            current_tokens += sent_tokens

        if current_sentences:
            chunk_text = ' '.join(current_sentences)
            chunks.append(self._create_chunk(chunk_text, heading, level, position))

        return chunks

    def _create_chunk(self, text: str, heading: str, level: int, position: int) -> Dict:
        """Create chunk with metadata."""
        tokens = len(self.tokenizer.encode(text))
        return {
            'text': text,
            'heading': heading,
            'heading_level': level,
            'section_position': position,
            'token_count': tokens,
            'sentences': sent_tokenize(text)
        }

    def _add_overlaps(self, chunks: List[Dict]) -> List[Dict]:
        """Add sentence-level overlap between adjacent chunks."""
        for i in range(len(chunks) - 1):
            current_chunk = chunks[i]
            next_chunk = chunks[i + 1]

            # Take last N sentences from current chunk
            overlap_sentences = current_chunk['sentences'][-self.overlap_sentences:]

            # Add to next chunk's beginning
            next_chunk['overlap_prev'] = ' '.join(overlap_sentences)
            next_chunk['text'] = next_chunk['overlap_prev'] + '\n\n' + next_chunk['text']

        return chunks

    def _extract_section_text(self, section: Dict) -> str:
        """Extract plain text from structured section."""
        texts = []
        for item in section.get('content', []):
            if item['type'] == 'text':
                texts.append(item['text'])
            elif item['type'] == 'table':
                # Convert table to text representation
                table_text = self._table_to_text(item['data'])
                texts.append(table_text)
        return '\n\n'.join(texts)

    def _table_to_text(self, table_data: Dict) -> str:
        """Convert structured table to readable text."""
        lines = []
        # Headers
        if table_data.get('headers'):
            lines.append(' | '.join(table_data['headers']))
            lines.append('-' * 40)
        # Rows
        for row in table_data.get('rows', []):
            lines.append(' | '.join(str(cell) for cell in row))
        return '\n'.join(lines)


# Usage
chunker = SemanticChunker(
    model_name="gpt-4",
    target_chunk_size=300,
    max_chunk_size=500,
    overlap_sentences=2
)

# From our HTML extractor
html_extractor = HTMLExtractor()
structured_doc = html_extractor.extract_with_structure(html_content)

# Chunk by semantic boundaries
chunks = chunker.chunk_document(structured_doc['sections'])

for chunk in chunks:
    print(f"Heading: {chunk['heading']} (Level {chunk['heading_level']})")
    print(f"Tokens: {chunk['token_count']}")
    print(f"Text: {chunk['text'][:100]}...")
    print()
Hierarchical Chunking
We maintain multiple representations of each document. This is one of the most powerful techniques for production RAG, but also one of the most underutilized because it requires more complex indexing:
| Level | Content | Use Case |
|---|---|---|
| Document | Full text summary (LLM-generated) | High-level relevance |
| Section | Section with heading context | Topical queries |
| Paragraph | Individual paragraphs | Specific fact retrieval |
| Sentence | Key sentences with context | Precise answers |
Why hierarchical chunking transforms retrieval quality: Consider the query "What are the main findings of the research?" A paragraph-level chunk might contain one finding. But the question asks for "main findings" (plural)—it needs document or section level context. Without hierarchical chunks, you'd retrieve multiple paragraph chunks, each with partial information, and hope the LLM synthesizes them correctly. With hierarchical chunks, you retrieve the section summary that already aggregates the findings.
The implementation complexity is worth it: Hierarchical chunking means 4x the embeddings, 4x the storage, and more complex query logic (which level to query?). But the quality gains are substantial. We typically query multiple levels in parallel and let re-ranking sort out which level had the best match for this particular query.
This allows retrieval at the appropriate granularity. A question like "What is the company's revenue?" needs paragraph-level retrieval. "Summarize the key findings" benefits from section or document level.
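To make the parallel-levels idea concrete, here's a minimal sketch of fanning a query out across granularity levels and pooling the candidates for re-ranking. The vector_store.search call and the 'level' metadata field are illustrative assumptions, not a specific library's API; the shape of the approach is what matters.

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

LEVELS = ["document", "section", "paragraph", "sentence"]

def hierarchical_retrieve(vector_store, query_embedding, top_k: int = 10) -> List[Dict]:
    """Query every granularity level in parallel; let re-ranking pick the winners."""
    def search_level(level: str) -> List[Dict]:
        # Assumes the store supports metadata filtering on a 'level' field
        hits = vector_store.search(
            vector=query_embedding,
            filter={"level": level},
            top_k=top_k,
        )
        return [{"level": level, **hit} for hit in hits]

    with ThreadPoolExecutor(max_workers=len(LEVELS)) as pool:
        results_per_level = pool.map(search_level, LEVELS)

    # Flatten; the downstream cross-encoder re-ranker decides which level won for this query
    return [hit for level_hits in results_per_level for hit in level_hits]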
Overlap and Context Windows
We use 15-20% overlap between adjacent chunks, but more importantly, we store contextual metadata with each chunk:
- Parent section heading
- Document title
- Surrounding chunk summaries
- Position in document (early/middle/late)
This metadata is crucial for re-ranking and helps the LLM understand where information comes from.
Metadata Enrichment
Raw text isn't enough. We extract and store:
- Temporal metadata: Publication date, last updated, time references in content
- Entity metadata: Named entities (people, organizations, products)
- Structural metadata: Document type, section type, confidence scores
- Source metadata: URL, author, domain authority
- Custom metadata: Domain-specific tags, classification labels
This metadata enables powerful filtering. "Find information about GDPR compliance published after 2023 from official EU sources" becomes a simple filtered query.
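As a sketch of what such a filtered query can look like (the field names, embed helper, and query signature are illustrative assumptions, not a specific vector store's API):

# Hypothetical metadata-filtered retrieval; adapt to your vector store's filter syntax
results = vector_store.query(
    vector=embed("GDPR compliance requirements"),
    top_k=20,
    filter={
        "publication_date": {"$gte": "2023-01-01"},                 # temporal metadata
        "source_domain": {"$in": ["europa.eu", "edpb.europa.eu"]},  # source metadata
        "doc_type": "regulation",                                   # structural metadata
    },
)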
Embedding Strategy
Model Selection
The embedding model landscape has matured significantly. Our current recommendations:
For General Use:
- OpenAI text-embedding-3-large (excellent quality, reasonable cost)
- Cohere embed-v3 (strong multilingual support)
- Voyage AI voyage-large-2 (best-in-class for code and technical content)
For Cost-Sensitive Applications:
- OpenAI text-embedding-3-small (80% of large quality at 20% cost)
- Open-source alternatives: BGE-large, E5-large-v2
For Specialized Domains: Consider fine-tuning. We fine-tuned an embedding model on legal documents and saw retrieval precision improve from 72% to 89% on our benchmark.
Embedding Best Practices
- Embed chunks with context: Don't embed raw chunk text. Prepend the document title and section heading. This dramatically improves retrieval for ambiguous queries (see the sketch after this list).
- Separate query and document embeddings: Some models (like E5) are trained with different prefixes for queries vs. documents. Use them correctly.
- Normalize embeddings: Ensure your vector store uses normalized embeddings for consistent cosine similarity scores.
- Version your embeddings: When you change embedding models or strategies, you need to re-embed everything. Track embedding versions in metadata.
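A minimal sketch of the first, third, and fourth practices: prepend document and section context before embedding, normalize the vector (OpenAI embeddings are already near-unit-norm, but making it explicit keeps the invariant obvious), and stamp the embedding version into metadata. The composition format and version tag are illustrative choices.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_chunk(chunk: dict, doc_title: str, embedding_version: str = "v2-2024-05") -> dict:
    """Embed chunk text prefixed with its document/section context, then L2-normalize."""
    contextualized = f"{doc_title}\n{chunk['heading']}\n\n{chunk['text']}"
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=contextualized,
    )
    vector = np.array(response.data[0].embedding)
    vector = vector / np.linalg.norm(vector)  # cosine similarity == dot product
    return {
        "values": vector.tolist(),
        "metadata": {
            "heading": chunk["heading"],
            "doc_title": doc_title,
            "embedding_version": embedding_version,  # enables clean re-embedding migrations
        },
    }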
Retrieval: Where Most Systems Fail
Retrieval is the heart of RAG, and it's where most production systems struggle. Pure vector search is not enough.
The retrieval quality ceiling: No matter how good your LLM is, it can only work with the context you provide. If your retrieval returns irrelevant chunks, the LLM will either hallucinate (ignoring the useless context) or generate wrong answers (trusting the wrong context). We've seen systems where improving retrieval quality from 70% to 90% precision doubled the end-to-end answer accuracy—without changing anything else.
Why demos work but production fails: In demos, you control the queries. You know what's in your documents and craft questions that match. In production, users ask questions you never anticipated, in phrasing you didn't expect, about edge cases in your content. The gap between demo performance and production performance is almost entirely a retrieval problem.
Hybrid Search is Non-Negotiable
Dense retrieval (embeddings) captures semantic similarity but misses keyword matches. Sparse retrieval (BM25) captures exact matches but misses paraphrases. You need both.
The failure modes are complementary: Vector search fails when the query uses different vocabulary than the documents. User asks about "compensation" but documents say "salary"—vector search might miss this if the embedding doesn't capture the synonym relationship strongly enough. BM25 fails when the query is conceptually related but shares no words. User asks "how to handle angry customers" but the document says "de-escalation techniques for difficult client interactions." Hybrid search handles both.
Our hybrid approach:
- Dense retrieval: Top 50 candidates via ANN search
- Sparse retrieval: Top 50 candidates via BM25
- Reciprocal Rank Fusion: Combine results using RRF
- Re-ranking: Cross-encoder on top 20 combined results
The performance difference is substantial:
| Approach | Recall@10 | Precision@10 |
|---|---|---|
| Dense only | 76% | 58% |
| Sparse only | 69% | 62% |
| Hybrid (no rerank) | 84% | 65% |
| Hybrid + rerank | 91% | 78% |
Implementation: This hybrid retriever combines the best of both worlds: BM25's exact keyword matching and vector search's semantic understanding. The Reciprocal Rank Fusion algorithm is elegant—it combines rankings without needing to normalize scores across different retrieval systems (which is notoriously difficult).
The key insight in RRF: a document that appears high in both rankings is probably very relevant. RRF gives each document a score based on its rank position in each result set, automatically handling the score normalization problem.
After RRF combines the results, we use a cross-encoder for precise re-ranking. Cross-encoders are expensive (they must process each query-document pair), but by running them only on the top 20 candidates from RRF, we get their accuracy at acceptable latency.
from langchain.retrievers import BM25Retriever
from sentence_transformers import CrossEncoder
from typing import List, Dict


class HybridRetriever:
    """
    Production-grade hybrid retriever combining BM25 and vector search.

    Key design decisions:
    - BM25 for exact keyword matches (handles acronyms, proper nouns)
    - Vector search for semantic similarity (handles paraphrases, synonyms)
    - RRF for score-free fusion (no score normalization needed)
    - Cross-encoder for precise re-ranking (sees query + doc together)
    """

    def __init__(
        self,
        vector_store,
        documents: List[str],
        embeddings,
        cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
        bm25_weight: float = 0.4,
        vector_weight: float = 0.6,
        top_k_initial: int = 50,
        top_k_rerank: int = 20,
        top_k_final: int = 10
    ):
        """
        Args:
            vector_store: Vector store (Pinecone, Weaviate, etc.)
            documents: Raw documents for BM25 indexing
            embeddings: Embedding model for vector search
            cross_encoder_model: Model for re-ranking
            bm25_weight: Weight for BM25 scores (default 0.4)
            vector_weight: Weight for vector scores (default 0.6)
            top_k_initial: Candidates to retrieve from each method
            top_k_rerank: Candidates to re-rank with cross-encoder
            top_k_final: Final results to return
        """
        # Initialize BM25 retriever
        self.bm25_retriever = BM25Retriever.from_texts(
            documents,
            k=top_k_initial
        )

        # Vector store retriever
        self.vector_retriever = vector_store.as_retriever(
            search_kwargs={"k": top_k_initial}
        )

        # Cross-encoder for re-ranking
        self.cross_encoder = CrossEncoder(cross_encoder_model)

        # Parameters
        self.bm25_weight = bm25_weight
        self.vector_weight = vector_weight
        self.top_k_rerank = top_k_rerank
        self.top_k_final = top_k_final
        self.documents = documents

    def retrieve(self, query: str) -> List[Dict]:
        """
        Retrieve documents using hybrid search with re-ranking.

        Steps:
        1. BM25 retrieval (keyword-based)
        2. Vector retrieval (semantic)
        3. Reciprocal Rank Fusion
        4. Cross-encoder re-ranking
        5. Return top K

        Returns list of dicts with:
        - text: document text
        - score: final relevance score
        - rank: final ranking position
        - retrieval_method: how it was found (bm25/vector/both)
        """
        # Step 1: BM25 retrieval
        bm25_results = self.bm25_retriever.get_relevant_documents(query)

        # Step 2: Vector retrieval
        vector_results = self.vector_retriever.get_relevant_documents(query)

        # Step 3: Reciprocal Rank Fusion
        fused_results = self._reciprocal_rank_fusion(
            query, bm25_results, vector_results
        )

        # Step 4: Cross-encoder re-ranking on top candidates
        reranked_results = self._rerank_with_cross_encoder(
            query,
            fused_results[:self.top_k_rerank]
        )

        # Step 5: Return top K
        return reranked_results[:self.top_k_final]

    def _reciprocal_rank_fusion(
        self,
        query: str,
        bm25_results: List,
        vector_results: List,
        k: int = 60  # RRF constant, typically 60
    ) -> List[Dict]:
        """
        Combine rankings using Reciprocal Rank Fusion.

        RRF formula for each document d:
            RRF(d) = Σ(1 / (k + rank_i(d)))

        Where rank_i(d) is the rank of document d in retrieval method i.

        Why RRF works:
        - No score normalization needed (works with ranks only)
        - Documents appearing high in multiple rankings get boosted
        - Robust to score scale differences between methods
        """
        # Build rank dictionaries
        bm25_ranks = {doc.page_content: rank for rank, doc in enumerate(bm25_results)}
        vector_ranks = {doc.page_content: rank for rank, doc in enumerate(vector_results)}

        # All unique documents
        all_docs = set(bm25_ranks.keys()) | set(vector_ranks.keys())

        # Calculate RRF scores
        rrf_scores = {}
        for doc in all_docs:
            score = 0
            retrieval_methods = []

            if doc in bm25_ranks:
                score += self.bm25_weight / (k + bm25_ranks[doc])
                retrieval_methods.append('bm25')

            if doc in vector_ranks:
                score += self.vector_weight / (k + vector_ranks[doc])
                retrieval_methods.append('vector')

            rrf_scores[doc] = {
                'score': score,
                'retrieval_method': '+'.join(retrieval_methods),
                'bm25_rank': bm25_ranks.get(doc, None),
                'vector_rank': vector_ranks.get(doc, None)
            }

        # Sort by RRF score
        sorted_docs = sorted(
            rrf_scores.items(),
            key=lambda x: x[1]['score'],
            reverse=True
        )

        return [
            {
                'text': doc,
                'rrf_score': data['score'],
                'retrieval_method': data['retrieval_method'],
                'bm25_rank': data['bm25_rank'],
                'vector_rank': data['vector_rank']
            }
            for doc, data in sorted_docs
        ]

    def _rerank_with_cross_encoder(
        self,
        query: str,
        candidates: List[Dict]
    ) -> List[Dict]:
        """
        Re-rank candidates using cross-encoder.

        Cross-encoders see query and document together, enabling
        much more accurate relevance scoring than bi-encoders.

        Trade-off: Much slower (must run inference for each pair),
        so only use on top candidates from initial retrieval.
        """
        if not candidates:
            return []

        # Prepare query-document pairs
        pairs = [(query, cand['text']) for cand in candidates]

        # Get cross-encoder relevance scores, one per candidate
        # (raw model scores; higher = more relevant)
        ce_scores = self.cross_encoder.predict(pairs)

        # Add cross-encoder scores to candidates
        for cand, score in zip(candidates, ce_scores):
            cand['cross_encoder_score'] = float(score)
            # Weighted combination of RRF and cross-encoder
            cand['final_score'] = 0.3 * cand['rrf_score'] + 0.7 * float(score)

        # Sort by final score
        candidates.sort(key=lambda x: x['final_score'], reverse=True)

        # Add final ranks
        for rank, cand in enumerate(candidates):
            cand['rank'] = rank + 1

        return candidates


# Usage example
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
import pinecone

# Initialize vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
pinecone.init(api_key="your-key", environment="us-west1-gcp")
vector_store = Pinecone.from_documents(
    documents,
    embeddings,
    index_name="production-rag"
)

# Initialize hybrid retriever
retriever = HybridRetriever(
    vector_store=vector_store,
    documents=[doc.page_content for doc in documents],
    embeddings=embeddings,
    bm25_weight=0.4,
    vector_weight=0.6,
    top_k_final=10
)

# Retrieve
query = "What are the key findings about climate change?"
results = retriever.retrieve(query)

for result in results:
    print(f"Rank {result['rank']}: Score {result['final_score']:.3f}")
    print(f"  Method: {result['retrieval_method']}")
    print(f"  BM25 rank: {result['bm25_rank']}, Vector rank: {result['vector_rank']}")
    print(f"  Text: {result['text'][:150]}...")
    print()
Query Understanding
Real user queries are messy. They contain typos, ambiguity, implicit context, and mixed intents. We process queries through several stages:
Query Classification:
- Is this a factual question, a comparison, a summarization request?
- Does it reference previous conversation context?
- Is it in-domain or out-of-domain?
Query Expansion:
- Synonym expansion using a domain-specific thesaurus
- Acronym expansion
- LLM-based query reformulation for complex queries
Query Decomposition: For complex queries like "Compare the privacy policies of Apple and Google regarding data retention," we decompose into sub-queries:
- "Apple privacy policy data retention"
- "Google privacy policy data retention" Then merge and deduplicate results.
Re-ranking: The Secret Weapon
Initial retrieval optimizes for recall—casting a wide net. Re-ranking optimizes for precision. We use a cross-encoder model (ms-marco-MiniLM-L-12-v2 or similar) that sees both query and document together.
Why cross-encoders outperform bi-encoders for ranking: Bi-encoders (used in initial retrieval) embed query and document separately, then compare the embeddings. This is fast but loses information—the model can't see how specific words in the query relate to specific words in the document. Cross-encoders see query and document together, enabling much richer comparison. The model can learn that "capital" in the query matches "headquarters" in a document about companies, but not in a document about finance. This contextual understanding is impossible with separate embeddings.
The two-stage architecture is essential at scale: You can't run cross-encoder inference on 100,000 documents—it would take minutes. Instead, you use fast bi-encoder search to get top 50-100 candidates, then expensive cross-encoder re-ranking on just those candidates. This gives you the best of both worlds: broad coverage from bi-encoders, precise ranking from cross-encoders.
Re-ranking is computationally expensive, so we only apply it to top candidates. The latency/quality tradeoff:
| Candidates Re-ranked | Latency Added | Quality Gain |
|---|---|---|
| 10 | ~50ms | +8% precision |
| 20 | ~100ms | +12% precision |
| 50 | ~250ms | +14% precision |
We typically re-rank top 20 as the sweet spot.
Generation: Beyond Simple Prompting
With good retrieval, generation becomes more straightforward—but there are still critical decisions.
The generation step is often the easiest part: if your retrieval returns the right information, modern LLMs are remarkably good at synthesizing it into coherent answers. The hard work is upstream. That said, poor generation can still squander good retrieval. The most common failure modes: the LLM ignores the provided context and hallucinates, fails to cite sources (making answers unverifiable), or produces verbose responses when a concise answer was needed.
Context window management becomes critical at scale: GPT-4 can handle 128K tokens, but should you use all of it? More context means higher cost, higher latency, and—surprisingly—sometimes lower quality. LLMs can get "lost in the middle" when important information is surrounded by less relevant content. We've found that 4-8K tokens of highly relevant context often outperforms 32K tokens of moderately relevant context.
Context Assembly
Don't just concatenate retrieved chunks. Structure matters:
- Order by relevance: Most relevant first (LLMs have primacy bias)
- Include metadata: Source titles, dates help the LLM assess reliability
- Add structural markers: Clear delimiters between chunks
- Compress when needed: For very long contexts, use LLM-based summarization
Prompt Engineering for RAG
Our generation prompt includes:
- Clear instruction on task and format
- Explicit grounding requirements ("Only use information from the provided context")
- Citation instructions
- Handling for insufficient information ("If the context doesn't contain enough information to answer, say so")
- Output format specification
Implementation: The prompt is where you enforce grounding—the LLM's tendency to hallucinate must be actively suppressed through clear instructions. Notice how we explicitly tell the model what to do when information is missing (say so, don't make it up), how to cite sources (inline with [source_id]), and how to structure the answer.
The system message sets the behavior globally (you are a helpful assistant that strictly uses provided context). The user message provides the actual context and question. By separating these, we can easily swap system messages for different use cases (concise answers, detailed explanations, etc.) without changing the context assembly logic.
from openai import OpenAI
from typing import List, Dict
import json
import re


class RAGGenerator:
    """Generate responses from retrieved context with proper grounding."""

    def __init__(
        self,
        model: str = "gpt-4-turbo-preview",
        temperature: float = 0.0  # Low temp for factual responses
    ):
        self.client = OpenAI()
        self.model = model
        self.temperature = temperature

    def generate(
        self,
        query: str,
        retrieved_chunks: List[Dict],
        response_style: str = "concise"  # concise, detailed, technical
    ) -> Dict:
        """
        Generate answer from retrieved context.

        Args:
            query: User question
            retrieved_chunks: List of dicts with 'text', 'heading', 'score'
            response_style: Answer style (concise/detailed/technical)

        Returns:
            Dict with:
            - answer: Generated response
            - sources: List of source chunks used
            - confidence: Estimated confidence (based on retrieval scores)
        """
        # Assemble context with source IDs
        context = self._assemble_context(retrieved_chunks)

        # Select appropriate system message
        system_message = self._get_system_message(response_style)

        # Build user message
        user_message = self._build_user_message(query, context)

        # Generate response
        response = self.client.chat.completions.create(
            model=self.model,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ]
        )

        answer = response.choices[0].message.content

        # Extract citations from answer
        sources = self._extract_citations(answer, retrieved_chunks)

        # Estimate confidence based on retrieval scores
        confidence = self._estimate_confidence(retrieved_chunks)

        return {
            'answer': answer,
            'sources': sources,
            'confidence': confidence,
            'model': self.model,
            'tokens_used': response.usage.total_tokens
        }

    def _get_system_message(self, style: str) -> str:
        """Get system message based on response style."""
        base_instruction = """You are a helpful assistant that answers questions using ONLY the provided context.

CRITICAL RULES:
1. Base your answer EXCLUSIVELY on the provided context
2. If the context doesn't contain enough information, say "I don't have enough information to answer this question"
3. Cite sources inline using [Source X] notation
4. Do not make up information or use external knowledge
5. If multiple sources contradict each other, mention the contradiction"""

        style_instructions = {
            "concise": "\n6. Provide concise, direct answers without unnecessary elaboration",
            "detailed": "\n6. Provide comprehensive, detailed explanations with examples",
            "technical": "\n6. Use technical language and include relevant technical details"
        }

        return base_instruction + style_instructions.get(style, "")

    def _assemble_context(self, chunks: List[Dict]) -> str:
        """
        Assemble retrieved chunks into formatted context.

        Key decisions:
        - Order by relevance (highest first) - LLMs have primacy bias
        - Include metadata (heading, source) for context
        - Add source IDs for citation
        - Use clear delimiters between chunks
        """
        context_parts = []

        for i, chunk in enumerate(chunks, 1):
            # Format: [Source X] (from "Section Heading")
            #         Content here...
            source_id = f"[Source {i}]"
            heading = chunk.get('heading', 'Unknown Section')
            text = chunk['text']
            score = chunk.get('final_score', chunk.get('score', 0))

            context_parts.append(
                f"{source_id} (from \"{heading}\", relevance: {score:.2f})\n{text}"
            )

        return "\n\n---\n\n".join(context_parts)

    def _build_user_message(self, query: str, context: str) -> str:
        """Build user message with context and question."""
        return f"""Context:
{context}

---

Question: {query}

Instructions:
- Answer based ONLY on the context above
- Cite sources using [Source X] notation
- If information is insufficient, explicitly state what's missing
- If sources contradict, mention it

Answer:"""

    def _extract_citations(
        self,
        answer: str,
        chunks: List[Dict]
    ) -> List[Dict]:
        """
        Extract which sources were cited in the answer.

        This is important for:
        - Showing users where information came from
        - Validating that model actually used the context
        - Computing faithfulness metrics
        """
        # Find all [Source X] citations
        citations = re.findall(r'\[Source (\d+)\]', answer)
        cited_indices = set(int(c) - 1 for c in citations)  # Convert to 0-indexed

        sources = []
        for idx in cited_indices:
            if idx < len(chunks):
                chunk = chunks[idx]
                sources.append({
                    'source_id': idx + 1,
                    'text': chunk['text'],
                    'heading': chunk.get('heading', 'Unknown'),
                    'score': chunk.get('final_score', chunk.get('score', 0))
                })

        return sources

    def _estimate_confidence(self, chunks: List[Dict]) -> float:
        """
        Estimate answer confidence based on retrieval quality.

        Heuristics:
        - Higher retrieval scores → higher confidence
        - Multiple high-scoring chunks → higher confidence
        - Low diversity in scores → lower confidence (might be guessing)
        """
        if not chunks:
            return 0.0

        scores = [
            chunk.get('final_score', chunk.get('score', 0))
            for chunk in chunks
        ]

        # Average of top 3 scores (or all if fewer than 3)
        top_scores = sorted(scores, reverse=True)[:3]
        avg_score = sum(top_scores) / len(top_scores)

        # Confidence is roughly proportional to retrieval quality
        # Scale: 0.0-1.0
        confidence = min(avg_score, 1.0)

        return confidence


# Complete RAG Pipeline Example
class ProductionRAGSystem:
    """End-to-end RAG system combining all components."""

    def __init__(self):
        # Initialize components (assuming they're already set up)
        self.retriever = HybridRetriever(...)  # From previous code
        self.generator = RAGGenerator(model="gpt-4-turbo-preview")

    def query(
        self,
        user_query: str,
        response_style: str = "concise",
        min_confidence: float = 0.5
    ) -> Dict:
        """
        Process user query through complete RAG pipeline.

        Args:
            user_query: User's question
            response_style: Answer style (concise/detailed/technical)
            min_confidence: Minimum confidence threshold to return answer

        Returns:
            Dict with answer, sources, confidence, and metadata
        """
        # Step 1: Retrieve relevant chunks
        retrieved_chunks = self.retriever.retrieve(user_query)

        # Step 2: Generate answer
        result = self.generator.generate(
            query=user_query,
            retrieved_chunks=retrieved_chunks,
            response_style=response_style
        )

        # Step 3: Check confidence threshold
        if result['confidence'] < min_confidence:
            result['warning'] = (
                f"Low confidence ({result['confidence']:.2f}). "
                "Answer may be unreliable."
            )

        # Step 4: Log for monitoring
        self._log_query(user_query, result, retrieved_chunks)

        return result

    def _log_query(self, query: str, result: Dict, chunks: List[Dict]):
        """Log query for monitoring and improvement."""
        # In production, log to your observability system
        log_entry = {
            'query': query,
            'confidence': result['confidence'],
            'num_sources': len(result['sources']),
            'tokens_used': result['tokens_used'],
            'top_retrieval_score': chunks[0]['final_score'] if chunks else 0
        }
        # logger.info(json.dumps(log_entry))


# Usage
rag_system = ProductionRAGSystem()

result = rag_system.query(
    user_query="What are the main benefits of hybrid search in RAG systems?",
    response_style="detailed"
)

print("Answer:", result['answer'])
print("\nSources used:")
for source in result['sources']:
    print(f"  - Source {source['source_id']}: {source['heading']}")
print(f"\nConfidence: {result['confidence']:.2%}")

if 'warning' in result:
    print(f"⚠️ {result['warning']}")
Citation Generation
Users need to verify AI-generated answers. We extract citations by:
- Asking the LLM to cite sources inline
- Post-processing to map citations to actual retrieved chunks
- Validating that cited claims actually appear in cited sources (a heuristic sketch follows below)
- Generating confidence scores based on citation coverage
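The validation step can be approximated cheaply before falling back to an LLM judge. Here's a rough sketch using sentence-level fuzzy matching; the 0.6 threshold is an assumption to tune on your own data.

from difflib import SequenceMatcher
from nltk.tokenize import sent_tokenize

def citation_supported(claim: str, source_text: str, threshold: float = 0.6) -> bool:
    """Heuristic check: does any source sentence closely match the cited claim?"""
    claim = claim.lower()
    return any(
        SequenceMatcher(None, claim, sentence.lower()).ratio() >= threshold
        for sentence in sent_tokenize(source_text)
    )

# Claims whose best match falls below the threshold get routed to the
# LLM-as-judge faithfulness check (or flagged for human review).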
Evaluation: The Hardest Part
You can't improve what you don't measure, but measuring RAG quality is genuinely difficult.
Why RAG evaluation is uniquely challenging: Traditional ML evaluation compares predictions to ground truth labels. RAG has multiple components (retrieval, generation, citations) each with different failure modes. A wrong answer might be caused by bad retrieval (right answer wasn't in context), bad generation (right answer was in context but LLM missed it), or bad source data (retrieved content was wrong). You need component-level evaluation to know what to fix.
The evaluation data problem: Good evaluation requires representative queries with known answers. Where do these come from? You could manually create them (expensive, limited coverage), extract them from logs (selection bias toward easy queries), or generate them synthetically (may not match real usage patterns). We use all three: a core set of 200 hand-curated queries for regression testing, augmented with synthetic queries for coverage, and sampled production queries for ongoing monitoring.
Offline vs. online evaluation: Offline evaluation (on a held-out test set) tells you if changes improved quality on your benchmark. Online evaluation (in production) tells you if changes improved quality on real user queries. These can diverge. A change might improve offline metrics but hurt online metrics if your test set doesn't represent production distribution. We use offline evaluation for rapid iteration and online evaluation for final validation.
Retrieval Evaluation
| Metric | What It Measures | Target |
|---|---|---|
| Recall@K | % of relevant docs in top K | > 90% |
| Precision@K | % of top K that are relevant | > 70% |
| MRR | Mean reciprocal rank of first relevant | > 0.8 |
| NDCG | Ranking quality with graded relevance | > 0.85 |
Build a golden dataset of queries with labeled relevant documents. We have 500 queries across our domains, each with human-labeled relevant documents.
Implementation: This evaluation framework provides automated metrics for RAG quality. The key is having a golden dataset—queries with known relevant documents and ideal answers. You can create this by:
- Sampling real user queries from logs
- Having domain experts label which documents are relevant
- Optionally writing ideal reference answers
The framework measures both retrieval quality (are the right documents retrieved?) and generation quality (is the answer correct, grounded, and complete?). The faithfulness check is critical—it verifies that every claim in the answer appears in the context, preventing hallucinations.
from typing import List, Dict, Set
import numpy as np
from openai import OpenAI
import json


class RAGEvaluator:
    """Comprehensive evaluation framework for RAG systems."""

    def __init__(self):
        self.client = OpenAI()

    def evaluate_retrieval(
        self,
        query: str,
        retrieved_docs: List[Dict],
        relevant_doc_ids: Set[str],
        k: int = 10
    ) -> Dict:
        """
        Evaluate retrieval quality against golden dataset.

        Args:
            query: The search query
            retrieved_docs: List of retrieved documents with 'id' field
            relevant_doc_ids: Set of known relevant document IDs
            k: Number of top results to evaluate

        Returns:
            Dict with precision, recall, MRR, NDCG metrics
        """
        retrieved_ids = [doc['id'] for doc in retrieved_docs[:k]]

        # Precision@K: What fraction of retrieved docs are relevant?
        relevant_retrieved = set(retrieved_ids) & relevant_doc_ids
        precision_at_k = len(relevant_retrieved) / k if k > 0 else 0

        # Recall@K: What fraction of relevant docs were retrieved?
        recall_at_k = (
            len(relevant_retrieved) / len(relevant_doc_ids)
            if relevant_doc_ids else 0
        )

        # Mean Reciprocal Rank: Position of first relevant doc
        mrr = 0
        for i, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_doc_ids:
                mrr = 1 / i
                break

        # NDCG: Ranking quality with graded relevance
        # (simplified version with binary relevance)
        dcg = sum(
            (1 if doc_id in relevant_doc_ids else 0) / np.log2(i + 1)
            for i, doc_id in enumerate(retrieved_ids, 1)
        )
        # Ideal DCG (all relevant docs at top)
        ideal_ranking = [1] * min(len(relevant_doc_ids), k)
        idcg = sum(1 / np.log2(i + 1) for i in range(1, len(ideal_ranking) + 1))
        ndcg = dcg / idcg if idcg > 0 else 0

        return {
            'precision@k': precision_at_k,
            'recall@k': recall_at_k,
            'mrr': mrr,
            'ndcg': ndcg,
            'num_relevant_retrieved': len(relevant_retrieved),
            'num_relevant_total': len(relevant_doc_ids)
        }

    def evaluate_generation(
        self,
        query: str,
        generated_answer: str,
        context: str,
        reference_answer: str = None
    ) -> Dict:
        """
        Evaluate generation quality.

        Metrics:
        - Faithfulness: Are claims grounded in context?
        - Relevance: Does it answer the question?
        - Completeness: Does it cover key points?
        - Answer similarity: How close to reference? (if provided)

        Uses LLM-as-judge for automated evaluation.
        """
        results = {}

        # Faithfulness: Check if answer is grounded in context
        faithfulness = self._check_faithfulness(generated_answer, context)
        results['faithfulness'] = faithfulness
        # Flatten the numeric score so it aggregates cleanly downstream
        results['faithfulness_score'] = faithfulness.get('score', 0.0)

        # Relevance: Does answer address the query?
        relevance = self._check_relevance(query, generated_answer)
        results['relevance'] = relevance
        results['relevance_score'] = relevance.get('relevance_score', 0.0)

        # If reference answer provided, compute similarity
        if reference_answer:
            results['answer_similarity'] = self._compute_answer_similarity(
                generated_answer, reference_answer
            )

        return results

    def _check_faithfulness(self, answer: str, context: str) -> Dict:
        """
        Check if answer is faithful to context (no hallucinations).

        Returns:
        - is_faithful: bool
        - unfaithful_claims: List of claims not in context
        - score: 0-1 faithfulness score
        """
        prompt = f"""Given the following context and answer, determine if the answer is faithful to the context.
An answer is faithful if every factual claim in the answer is supported by the context.

Context:
{context}

Answer:
{answer}

Evaluate faithfulness:
1. List each factual claim in the answer
2. For each claim, verify if it's supported by the context
3. Return your evaluation in JSON format:
{{
    "is_faithful": true/false,
    "claims": [
        {{"claim": "...", "supported": true/false, "evidence": "..."}},
        ...
    ],
    "score": 0.0-1.0,
    "explanation": "..."
}}

JSON Response:"""

        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}]
        )

        result = json.loads(response.choices[0].message.content)
        return result

    def _check_relevance(self, query: str, answer: str) -> Dict:
        """Check if answer relevantly addresses the query."""
        prompt = f"""Evaluate if the following answer relevantly addresses the query.

Query: {query}

Answer: {answer}

Return JSON evaluation:
{{
    "is_relevant": true/false,
    "relevance_score": 0.0-1.0,
    "explanation": "Why is/isn't this relevant?",
    "missing_aspects": ["What key aspects of the query are not addressed?"]
}}

JSON Response:"""

        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}]
        )

        result = json.loads(response.choices[0].message.content)
        return result

    def _compute_answer_similarity(
        self, generated: str, reference: str
    ) -> float:
        """
        Compute semantic similarity between generated and reference answer.

        Uses embedding-based similarity.
        """
        # Get embeddings
        embeddings = self.client.embeddings.create(
            model="text-embedding-3-large",
            input=[generated, reference]
        )

        gen_emb = np.array(embeddings.data[0].embedding)
        ref_emb = np.array(embeddings.data[1].embedding)

        # Cosine similarity
        similarity = np.dot(gen_emb, ref_emb) / (
            np.linalg.norm(gen_emb) * np.linalg.norm(ref_emb)
        )
        return float(similarity)

    def evaluate_end_to_end(
        self,
        test_cases: List[Dict],
        rag_system
    ) -> Dict:
        """
        Run full evaluation on a test set.

        test_cases format:
        [
            {
                "query": "...",
                "relevant_doc_ids": [...],
                "reference_answer": "..."  # optional
            },
            ...
        ]

        Returns aggregated metrics across all test cases.
        """
        retrieval_results = []
        generation_results = []

        for case in test_cases:
            query = case['query']

            # Run RAG system
            result = rag_system.query(query)

            # Evaluate retrieval
            retrieval_metrics = self.evaluate_retrieval(
                query=query,
                retrieved_docs=result.get('retrieved_docs', []),
                relevant_doc_ids=set(case['relevant_doc_ids'])
            )
            retrieval_results.append(retrieval_metrics)

            # Evaluate generation
            generation_metrics = self.evaluate_generation(
                query=query,
                generated_answer=result['answer'],
                context=result.get('context', ''),
                reference_answer=case.get('reference_answer')
            )
            generation_results.append(generation_metrics)

        # Aggregate metrics
        return {
            'retrieval': self._aggregate_metrics(retrieval_results),
            'generation': self._aggregate_metrics(generation_results),
            'num_test_cases': len(test_cases)
        }

    def _aggregate_metrics(self, results: List[Dict]) -> Dict:
        """Compute mean and std for all numeric metrics."""
        aggregated = {}

        # Get all metric keys
        all_keys = set()
        for result in results:
            all_keys.update(result.keys())

        for key in all_keys:
            values = []
            for result in results:
                val = result.get(key)
                # Only aggregate numeric values
                if isinstance(val, (int, float)):
                    values.append(val)

            if values:
                aggregated[f'{key}_mean'] = np.mean(values)
                aggregated[f'{key}_std'] = np.std(values)

        return aggregated


# Usage Example
evaluator = RAGEvaluator()

# Golden dataset example
test_cases = [
    {
        "query": "What are the benefits of hybrid search?",
        "relevant_doc_ids": {"doc_42", "doc_103", "doc_87"},
        "reference_answer": "Hybrid search combines dense and sparse retrieval..."
    },
    # ... more test cases
]

# Evaluate your RAG system
results = evaluator.evaluate_end_to_end(
    test_cases=test_cases,
    rag_system=rag_system
)

print("Retrieval Metrics:")
print(f"  Precision@10: {results['retrieval']['precision@k_mean']:.2%} ± {results['retrieval']['precision@k_std']:.2%}")
print(f"  Recall@10: {results['retrieval']['recall@k_mean']:.2%}")
print(f"  MRR: {results['retrieval']['mrr_mean']:.3f}")
print(f"  NDCG: {results['retrieval']['ndcg_mean']:.3f}")

print("\nGeneration Metrics:")
print(f"  Faithfulness: {results['generation']['faithfulness_score_mean']:.2%}")
print(f"  Relevance: {results['generation']['relevance_score_mean']:.2%}")
Generation Evaluation
| Metric | What It Measures | How We Measure |
|---|---|---|
| Faithfulness | Are claims grounded in context? | LLM-as-judge + human audit |
| Relevance | Does it answer the question? | LLM-as-judge |
| Completeness | Does it cover all relevant info? | Human evaluation |
| Coherence | Is it well-written? | LLM-as-judge |
End-to-End Metrics
Ultimately, what matters is user outcomes:
- Task completion rate
- User satisfaction scores
- Time to answer
- Escalation rate (to human support)
Track correlation between your offline metrics and these online metrics. If they diverge, your evaluation is measuring the wrong things.
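One way to check that alignment: correlate each offline metric with a user-outcome signal over the same queries. A sketch, assuming you've joined per-query evaluation scores with thumbs-up/down feedback:

import numpy as np

def offline_online_correlation(offline_scores: list[float], thumbs_up: list[int]) -> float:
    """Pearson correlation between an offline metric and binary user feedback (1 = helpful)."""
    return float(np.corrcoef(offline_scores, thumbs_up)[0, 1])

# e.g. correlate per-query NDCG with feedback; a weak correlation (roughly < 0.3)
# suggests the offline benchmark no longer reflects production quality.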
Production Considerations
Caching
RAG systems have natural caching opportunities:
- Embedding cache: Don't re-embed identical queries
- Retrieval cache: Same query often returns same results
- Response cache: Exact query matches can return cached responses
We use a tiered caching strategy with Redis, achieving ~40% cache hit rate and 3x latency improvement for cached queries.
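A minimal sketch of the response-cache tier with Redis. The key scheme (normalized query hash plus index version) and TTL are assumptions; keying on the index version lets cache entries invalidate naturally when documents change.

import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def cached_query(rag_system, user_query: str, index_version: str = "v1", ttl_s: int = 3600):
    """Exact-match response cache keyed on normalized query + index version."""
    key = "rag:" + hashlib.sha256(
        f"{index_version}:{user_query.strip().lower()}".encode()
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = rag_system.query(user_query)
    cache.setex(key, ttl_s, json.dumps(result, default=str))
    return result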
Monitoring and Alerting
Monitor:
- Retrieval latency (p50, p95, p99)
- Generation latency
- Token usage and costs
- Empty retrieval rate (queries with no relevant results)
- User feedback signals
- Error rates by query type
Alert on (a sketch of these checks follows the list):
- Latency degradation (>2x normal p95)
- Spike in empty retrievals
- Error rate >1%
- Sudden cost increase
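Here's a rough sketch of the kind of checks we run over a rolling window of query logs. The thresholds mirror the list above; the log-record field names are illustrative assumptions.

import numpy as np

def check_alerts(window: list[dict], baseline_p95_ms: float) -> list[str]:
    """Evaluate alert conditions over a rolling window of per-query log records."""
    alerts = []

    latencies = [r["retrieval_ms"] + r["generation_ms"] for r in window]
    p95 = float(np.percentile(latencies, 95))
    if p95 > 2 * baseline_p95_ms:
        alerts.append(f"latency degradation: p95 {p95:.0f}ms vs baseline {baseline_p95_ms:.0f}ms")

    empty_rate = sum(1 for r in window if r["num_results"] == 0) / len(window)
    if empty_rate > 0.05:  # assumed threshold for an empty-retrieval spike
        alerts.append(f"empty retrieval spike: {empty_rate:.1%}")

    error_rate = sum(1 for r in window if r["error"]) / len(window)
    if error_rate > 0.01:
        alerts.append(f"error rate {error_rate:.1%} exceeds 1%")

    return alerts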
Continuous Improvement
The work is never done. We run weekly:
- Analysis of low-rated responses
- Review of empty retrieval queries
- Identification of new document types to add
- Re-evaluation on expanded test sets
Monthly:
- Embedding model comparison
- Chunking strategy experiments
- Full retrieval benchmark
Common Pitfalls and How to Avoid Them
- Skipping hybrid search: Pure vector search leaves 15-20% of precision on the table. Always combine dense and sparse.
- One-size-fits-all chunking: Different document types need different strategies. Legal contracts, technical docs, and blog posts shouldn't be chunked the same way.
- Ignoring context window limits: With 128K context windows, it's tempting to stuff everything in. But more context != better answers. Quality over quantity.
- No evaluation framework: If you're not measuring retrieval quality, you're flying blind. Build evaluation infrastructure before scaling.
- Treating RAG as a silver bullet: RAG works for factual retrieval from documents. It doesn't replace fine-tuning for behavior changes or reasoning improvements.
Conclusion
Building production RAG systems is an engineering discipline, not a demo hack. Success requires careful attention to every component: document processing, embedding strategy, retrieval architecture, generation prompting, and rigorous evaluation.
Start simple, measure everything, and iterate based on real user feedback. The techniques in this post have been battle-tested across millions of queries—but your domain will have its own challenges. Build the evaluation infrastructure to discover and solve them.