Building Production-Ready RAG Systems: Lessons from the Field
A comprehensive guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.
Introduction
Retrieval-Augmented Generation (RAG) has become the dominant architecture for building LLM applications that need access to specific, up-to-date, or proprietary information. The concept is deceptively simple: retrieve relevant context, feed it to an LLM, generate a grounded response. Yet the gap between a weekend RAG prototype and a production system handling millions of queries is enormous.
After building RAG systems at Goji AI that process over 2 million queries monthly across legal, financial, and technical domains, I've learned that success lies not in any single technique, but in the careful orchestration of dozens of interdependent components. This post shares the hard-won lessons from that journey.
The Anatomy of Production RAG
A production RAG system is far more complex than the typical tutorial suggests. Here's what a real architecture looks like:
Ingestion Pipeline:
- Document acquisition and normalization
- Content extraction (PDF, HTML, Office formats)
- Preprocessing and cleaning
- Chunking and overlap management
- Metadata extraction and enrichment
- Embedding generation
- Vector store indexing
Query Pipeline:
- Query understanding and classification
- Query expansion and reformulation
- Hybrid retrieval (dense + sparse)
- Re-ranking and filtering
- Context assembly and compression
- Prompt construction
- LLM generation
- Response post-processing
- Citation extraction
Supporting Infrastructure:
- Evaluation and monitoring
- Caching layers
- Rate limiting and load balancing
- Feedback collection
- Continuous improvement loops
Let's dive into the critical decisions at each stage.
Document Processing: The Foundation
The quality of your RAG system is bounded by the quality of your document processing. No amount of sophisticated retrieval can compensate for poorly chunked or corrupted content.
Why document processing is the most underinvested area in RAG: Teams spend weeks tuning embedding models and retrieval algorithms, but rush through document processing. This is backwards. If your extraction corrupts content, misses tables, or destroys document structure, you've created a ceiling that no amount of downstream optimization can break through. We've seen teams improve retrieval quality by 20%+ simply by fixing extraction issues, without touching any other component.
The garbage-in-garbage-out reality: Consider what happens when a PDF with two-column layout gets extracted as a single stream of text. Sentences from the left column interleave with sentences from the right column, creating nonsensical content. The embedding model faithfully embeds this garbage. The retrieval system dutifully returns it. The LLM generates confidently wrong answers from corrupted input. Every downstream component works perfectly—they just operate on corrupted data.
Content Extraction
PDF extraction alone is a minefield. Academic papers, scanned documents, tables, multi-column layouts—each requires different handling. We use a tiered approach:
- Native text extraction for born-digital PDFs using PyMuPDF
- OCR fallback with Tesseract for scanned documents, validated against confidence scores
- Layout analysis using LayoutLM or similar models for complex documents
- Table extraction with Camelot or Tabula, stored separately with structural metadata
Implementation: Here's our production document extraction pipeline that handles different PDF types intelligently. The key insight is that different PDFs require different extraction strategies—born-digital PDFs can use fast text extraction, while scanned documents need OCR. We also extract tables separately because they require different chunking strategies than prose text.
The confidence scoring is critical: OCR can produce gibberish on poor-quality scans. By checking the OCR confidence score, we can flag low-quality extractions for manual review rather than indexing garbage into our RAG system.
import fitz  # PyMuPDF
from PIL import Image
import pytesseract
from typing import Dict, List
import logging


class DocumentExtractor:
    """Production-grade document extraction with fallback strategies."""

    def __init__(self, ocr_confidence_threshold: float = 0.7):
        self.ocr_threshold = ocr_confidence_threshold
        self.logger = logging.getLogger(__name__)

    def extract_pdf(self, pdf_path: str) -> Dict:
        """
        Extract text from PDF with automatic strategy selection.

        Returns a dict with:
        - text: extracted content
        - method: extraction method used (text/ocr/hybrid)
        - confidence: quality score (0-1)
        - tables: list of extracted tables
        - metadata: PDF metadata
        """
        doc = fitz.open(pdf_path)

        # Try native text extraction first
        text_content = []
        low_text_pages = []

        for page_num, page in enumerate(doc):
            text = page.get_text()
            # If page has very little text, it's likely scanned
            if len(text.strip()) < 50:
                low_text_pages.append(page_num)
            else:
                text_content.append(text)

        native_page_count = len(text_content)

        # Use OCR for pages with little/no text
        if low_text_pages:
            self.logger.info(f"Using OCR for {len(low_text_pages)} pages")
            ocr_results = self._ocr_pages(doc, low_text_pages)
            text_content.extend(ocr_results['text'])
            confidence = ocr_results['avg_confidence']
            method = "hybrid" if native_page_count else "ocr"
        else:
            confidence = 1.0
            method = "text"

        # Extract tables separately
        tables = self._extract_tables(pdf_path)

        return {
            'text': '\n\n'.join(text_content),
            'method': method,
            'confidence': confidence,
            'tables': tables,
            'metadata': doc.metadata,
            'page_count': len(doc)
        }

    def _ocr_pages(self, doc, page_numbers: List[int]) -> Dict:
        """OCR specific pages and return text with confidence scores."""
        texts = []
        confidences = []

        for page_num in page_numbers:
            page = doc[page_num]
            # Render page to image at 300 DPI for good OCR quality
            pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

            # Get OCR data with confidence scores
            ocr_data = pytesseract.image_to_data(
                img,
                output_type=pytesseract.Output.DICT
            )

            # Calculate average confidence for this page (-1 marks non-text blocks)
            conf_scores = [float(c) for c in ocr_data['conf'] if float(c) >= 0]
            avg_conf = sum(conf_scores) / len(conf_scores) if conf_scores else 0

            # Only include text if confidence is acceptable
            if avg_conf >= self.ocr_threshold * 100:
                page_text = ' '.join(word for word in ocr_data['text'] if word.strip())
                texts.append(page_text)
                confidences.append(avg_conf / 100)
            else:
                self.logger.warning(
                    f"Page {page_num} OCR confidence too low: {avg_conf:.2f}%"
                )

        return {
            'text': texts,
            'avg_confidence': sum(confidences) / len(confidences) if confidences else 0
        }

    def _extract_tables(self, pdf_path: str) -> List[Dict]:
        """Extract tables with structure preserved."""
        try:
            import camelot
            tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
            return [{
                'page': table.page,
                'data': table.df.to_dict(),
                'shape': table.shape,
                'accuracy': table.accuracy
            } for table in tables]
        except Exception as e:
            self.logger.error(f"Table extraction failed: {e}")
            return []


# Usage example
extractor = DocumentExtractor(ocr_confidence_threshold=0.7)
result = extractor.extract_pdf("complex_document.pdf")

if result['confidence'] < 0.8:
    print(f"⚠️ Low confidence extraction ({result['confidence']:.2%})")
    print(f"Method used: {result['method']}")
    # Flag for manual review in production
For HTML content, we've found that naive extraction (e.g., dumping the whole page with BeautifulSoup's get_text()) loses important context. We use a custom extraction pipeline that preserves:
- Heading hierarchy (critical for chunking)
- List structures
- Table relationships
- Link context and anchor text
- Image alt text and captions
Implementation: This HTML extractor preserves document structure that's essential for high-quality chunking. The key difference from naive extraction is maintaining the heading hierarchy—this lets us chunk by section rather than by arbitrary character counts, which dramatically improves retrieval quality.
Notice how we preserve the heading level and position metadata. This allows our chunking strategy to keep related content together (everything under "Installation Instructions" stays together) rather than splitting mid-section.
from bs4 import BeautifulSoup
from typing import Dict, List


class HTMLExtractor:
    """Extract structured content from HTML preserving document structure."""

    def __init__(self):
        self.heading_tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

    def extract_with_structure(self, html_content: str) -> Dict:
        """
        Extract HTML with preserved structure for better chunking.

        Returns sections with heading hierarchy, allowing semantic chunking
        that respects document structure rather than arbitrary splits.
        """
        soup = BeautifulSoup(html_content, 'html.parser')

        # Remove script, style, and navigation elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'aside']):
            tag.decompose()

        sections = []
        current_section = {
            'heading': '',
            'level': 0,
            'content': [],
            'position': 0
        }

        for elem in soup.find_all(['p', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'table']):
            if elem.name in self.heading_tags:
                # Save previous section if it has content
                if current_section['content']:
                    sections.append(current_section)

                # Start new section
                level = int(elem.name[1])  # h1 -> 1, h2 -> 2, etc.
                current_section = {
                    'heading': elem.get_text(strip=True),
                    'level': level,
                    'content': [],
                    'position': len(sections)
                }
            elif elem.name == 'table':
                # Store tables separately with row/column structure
                current_section['content'].append({
                    'type': 'table',
                    'data': self._extract_table(elem)
                })
            elif elem.name in ['p', 'li']:
                text = elem.get_text(strip=True)
                if text:
                    # Preserve link context
                    links = elem.find_all('a')
                    link_context = [
                        {'text': a.get_text(), 'href': a.get('href')}
                        for a in links
                    ]
                    current_section['content'].append({
                        'type': 'text',
                        'text': text,
                        'links': link_context
                    })

        # Add final section
        if current_section['content']:
            sections.append(current_section)

        return {
            'sections': sections,
            'title': soup.find('title').get_text() if soup.find('title') else '',
            'metadata': self._extract_metadata(soup)
        }

    def _extract_table(self, table_elem) -> Dict:
        """Extract table with structure preserved."""
        rows = []
        for tr in table_elem.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all(['td', 'th'])]
            rows.append(cells)

        return {
            'headers': rows[0] if rows else [],
            'rows': rows[1:] if len(rows) > 1 else []
        }

    def _extract_metadata(self, soup) -> Dict:
        """Extract page metadata from meta tags."""
        metadata = {}
        # Common meta tags
        for meta in soup.find_all('meta'):
            name = meta.get('name') or meta.get('property')
            content = meta.get('content')
            if name and content:
                metadata[name] = content
        return metadata


# Usage
extractor = HTMLExtractor()
result = extractor.extract_with_structure(html_content)

# Now you can chunk by section, preserving semantic boundaries
for section in result['sections']:
    print(f"{'#' * section['level']} {section['heading']}")
    print(f"Position: {section['position']}")
    # This section stays together during chunking
Chunking Strategy: The Most Underrated Decision
Chunking strategy has more impact on retrieval quality than your choice of embedding model. After extensive experimentation, here's what we've learned.
Why chunking matters more than embedding models: A mediocre embedding model on well-chunked content will outperform an excellent embedding model on poorly-chunked content. Why? Embedding models are trained on coherent text—sentences that follow logically, paragraphs with complete thoughts. When you feed them a chunk that starts mid-sentence and ends mid-paragraph, the embedding doesn't capture the meaning because there is no coherent meaning to capture. The model can only embed what exists.
The chunk size tradeoff nobody talks about: Smaller chunks are more precise—they answer specific questions with less noise. Larger chunks provide more context—they help when the question requires understanding relationships across sentences. There's no universally optimal chunk size. We typically target 200-400 tokens as a baseline, but we adapt based on document type: short chunks for Q&A knowledge bases, longer chunks for narrative documents like reports.
Semantic Chunking Over Fixed-Size
Fixed token counts (e.g., 512 tokens) are convenient but destroy semantic coherence. A chunk that splits mid-sentence or mid-paragraph loses critical context. We chunk on semantic boundaries:
- Section breaks (identified by headings)
- Paragraph boundaries
- Sentence boundaries as a last resort
- Maximum chunk size as a hard limit, not a target
Implementation: This semantic chunker respects document structure instead of blindly splitting at token boundaries. The critical insight: a chunk should be a complete thought, not an arbitrary slice of text. By chunking at semantic boundaries (sections, paragraphs), each chunk is self-contained and embeds cleanly.
The overlap strategy is also semantic—we overlap by full sentences, not character counts. This ensures that if a key concept spans two chunks, both chunks get the complete sentences about that concept, not a truncated mid-sentence fragment.
from typing import List, Dict
import tiktoken
from nltk.tokenize import sent_tokenize
import nltk

# Download required NLTK data
nltk.download('punkt', quiet=True)


class SemanticChunker:
    """Chunk text at semantic boundaries, not arbitrary token counts."""

    def __init__(
        self,
        model_name: str = "gpt-4",
        target_chunk_size: int = 300,
        max_chunk_size: int = 500,
        overlap_sentences: int = 2
    ):
        self.tokenizer = tiktoken.encoding_for_model(model_name)
        self.target_chunk_size = target_chunk_size
        self.max_chunk_size = max_chunk_size
        self.overlap_sentences = overlap_sentences

    def chunk_document(self, sections: List[Dict]) -> List[Dict]:
        """
        Chunk document by semantic boundaries.

        Input: Structured sections from HTMLExtractor or similar
        Output: Chunks with metadata and overlap

        Each chunk maintains:
        - The heading hierarchy (crucial for context)
        - Position in document
        - Overlapping sentences with adjacent chunks
        """
        chunks = []

        for section in sections:
            section_heading = section['heading']
            section_level = section['level']

            # Extract all text content from section
            section_text = self._extract_section_text(section)

            # Split into paragraphs first (respects document structure)
            paragraphs = section_text.split('\n\n')

            current_chunk_paras = []
            current_tokens = 0

            for para in paragraphs:
                para = para.strip()
                if not para:
                    continue

                para_tokens = len(self.tokenizer.encode(para))

                # If single paragraph exceeds max, split by sentences
                if para_tokens > self.max_chunk_size:
                    # Finish current chunk if any
                    if current_chunk_paras:
                        chunk_text = '\n\n'.join(current_chunk_paras)
                        chunks.append(self._create_chunk(
                            chunk_text, section_heading, section_level, section['position']
                        ))
                        current_chunk_paras = []
                        current_tokens = 0

                    # Split long paragraph by sentences
                    chunks.extend(
                        self._chunk_long_paragraph(
                            para, section_heading, section_level, section['position']
                        )
                    )
                    continue

                # If adding this para exceeds target, finish current chunk
                if current_tokens + para_tokens > self.target_chunk_size and current_chunk_paras:
                    chunk_text = '\n\n'.join(current_chunk_paras)
                    chunks.append(self._create_chunk(
                        chunk_text, section_heading, section_level, section['position']
                    ))
                    current_chunk_paras = []
                    current_tokens = 0

                # Add paragraph to current chunk
                current_chunk_paras.append(para)
                current_tokens += para_tokens

            # Add final chunk if any content remains
            if current_chunk_paras:
                chunk_text = '\n\n'.join(current_chunk_paras)
                chunks.append(self._create_chunk(
                    chunk_text, section_heading, section_level, section['position']
                ))

        # Add overlaps between adjacent chunks
        chunks = self._add_overlaps(chunks)

        return chunks

    def _chunk_long_paragraph(
        self, paragraph: str, heading: str, level: int, position: int
    ) -> List[Dict]:
        """Split long paragraph by sentences."""
        sentences = sent_tokenize(paragraph)
        chunks = []
        current_sentences = []
        current_tokens = 0

        for sentence in sentences:
            sent_tokens = len(self.tokenizer.encode(sentence))

            if current_tokens + sent_tokens > self.target_chunk_size and current_sentences:
                chunk_text = ' '.join(current_sentences)
                chunks.append(self._create_chunk(chunk_text, heading, level, position))
                current_sentences = []
                current_tokens = 0

            current_sentences.append(sentence)
            current_tokens += sent_tokens

        if current_sentences:
            chunk_text = ' '.join(current_sentences)
            chunks.append(self._create_chunk(chunk_text, heading, level, position))

        return chunks

    def _create_chunk(self, text: str, heading: str, level: int, position: int) -> Dict:
        """Create chunk with metadata."""
        tokens = len(self.tokenizer.encode(text))
        return {
            'text': text,
            'heading': heading,
            'heading_level': level,
            'section_position': position,
            'token_count': tokens,
            'sentences': sent_tokenize(text)
        }

    def _add_overlaps(self, chunks: List[Dict]) -> List[Dict]:
        """Add sentence-level overlap between adjacent chunks."""
        for i in range(len(chunks) - 1):
            current_chunk = chunks[i]
            next_chunk = chunks[i + 1]

            # Take last N sentences from current chunk
            overlap_sentences = current_chunk['sentences'][-self.overlap_sentences:]

            # Add to next chunk's beginning
            next_chunk['overlap_prev'] = ' '.join(overlap_sentences)
            next_chunk['text'] = next_chunk['overlap_prev'] + '\n\n' + next_chunk['text']

        return chunks

    def _extract_section_text(self, section: Dict) -> str:
        """Extract plain text from structured section."""
        texts = []
        for item in section.get('content', []):
            if item['type'] == 'text':
                texts.append(item['text'])
            elif item['type'] == 'table':
                # Convert table to text representation
                table_text = self._table_to_text(item['data'])
                texts.append(table_text)
        return '\n\n'.join(texts)

    def _table_to_text(self, table_data: Dict) -> str:
        """Convert structured table to readable text."""
        lines = []
        # Headers
        if table_data.get('headers'):
            lines.append(' | '.join(table_data['headers']))
            lines.append('-' * 40)
        # Rows
        for row in table_data.get('rows', []):
            lines.append(' | '.join(str(cell) for cell in row))
        return '\n'.join(lines)


# Usage
chunker = SemanticChunker(
    model_name="gpt-4",
    target_chunk_size=300,
    max_chunk_size=500,
    overlap_sentences=2
)

# From our HTML extractor
html_extractor = HTMLExtractor()
structured_doc = html_extractor.extract_with_structure(html_content)

# Chunk by semantic boundaries
chunks = chunker.chunk_document(structured_doc['sections'])

for chunk in chunks:
    print(f"Heading: {chunk['heading']} (Level {chunk['heading_level']})")
    print(f"Tokens: {chunk['token_count']}")
    print(f"Text: {chunk['text'][:100]}...")
    print()
Hierarchical Chunking
We maintain multiple representations of each document. This is one of the most powerful techniques for production RAG, but also one of the most underutilized because it requires more complex indexing:
| Level | Content | Use Case |
|---|---|---|
| Document | Full text summary (LLM-generated) | High-level relevance |
| Section | Section with heading context | Topical queries |
| Paragraph | Individual paragraphs | Specific fact retrieval |
| Sentence | Key sentences with context | Precise answers |
Why hierarchical chunking transforms retrieval quality: Consider the query "What are the main findings of the research?" A paragraph-level chunk might contain one finding. But the question asks for "main findings" (plural)—it needs document or section level context. Without hierarchical chunks, you'd retrieve multiple paragraph chunks, each with partial information, and hope the LLM synthesizes them correctly. With hierarchical chunks, you retrieve the section summary that already aggregates the findings.
The implementation complexity is worth it: Hierarchical chunking means 4x the embeddings, 4x the storage, and more complex query logic (which level to query?). But the quality gains are substantial. We typically query multiple levels in parallel and let re-ranking sort out which level had the best match for this particular query.
This allows retrieval at the appropriate granularity. A question like "What is the company's revenue?" needs paragraph-level retrieval. "Summarize the key findings" benefits from section or document level.
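To make the parallel-levels idea concrete, here's a minimal sketch of fanning a query out across granularity levels and pooling the candidates for re-ranking. The vector_store.search call and the 'level' metadata field are illustrative assumptions, not a specific library's API; the shape of the approach is what matters.

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

LEVELS = ["document", "section", "paragraph", "sentence"]

def hierarchical_retrieve(vector_store, query_embedding, top_k: int = 10) -> List[Dict]:
    """Query every granularity level in parallel; let re-ranking pick the winners."""
    def search_level(level: str) -> List[Dict]:
        # Assumes the store supports metadata filtering on a 'level' field
        hits = vector_store.search(
            vector=query_embedding,
            filter={"level": level},
            top_k=top_k,
        )
        return [{"level": level, **hit} for hit in hits]

    with ThreadPoolExecutor(max_workers=len(LEVELS)) as pool:
        results_per_level = pool.map(search_level, LEVELS)

    # Flatten; the downstream cross-encoder re-ranker decides which level won for this query
    return [hit for level_hits in results_per_level for hit in level_hits]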
Overlap and Context Windows
We use 15-20% overlap between adjacent chunks, but more importantly, we store contextual metadata with each chunk:
- Parent section heading
- Document title
- Surrounding chunk summaries
- Position in document (early/middle/late)
This metadata is crucial for re-ranking and helps the LLM understand where information comes from.
Metadata Enrichment
Raw text isn't enough. We extract and store:
- Temporal metadata: Publication date, last updated, time references in content
- Entity metadata: Named entities (people, organizations, products)
- Structural metadata: Document type, section type, confidence scores
- Source metadata: URL, author, domain authority
- Custom metadata: Domain-specific tags, classification labels
This metadata enables powerful filtering. "Find information about GDPR compliance published after 2023 from official EU sources" becomes a simple filtered query.
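As a sketch of what such a filtered query can look like (the field names, embed helper, and query signature are illustrative assumptions, not a specific vector store's API):

# Hypothetical metadata-filtered retrieval; adapt to your vector store's filter syntax
results = vector_store.query(
    vector=embed("GDPR compliance requirements"),
    top_k=20,
    filter={
        "publication_date": {"$gte": "2023-01-01"},                 # temporal metadata
        "source_domain": {"$in": ["europa.eu", "edpb.europa.eu"]},  # source metadata
        "doc_type": "regulation",                                   # structural metadata
    },
)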
Embedding Strategy
Model Selection
The embedding model landscape has matured significantly. Our current recommendations:
For General Use:
- OpenAI text-embedding-3-large (excellent quality, reasonable cost)
- Cohere embed-v3 (strong multilingual support)
- Voyage AI voyage-large-2 (best-in-class for code and technical content)
For Cost-Sensitive Applications:
- OpenAI text-embedding-3-small (80% of large quality at 20% cost)
- Open-source alternatives: BGE-large, E5-large-v2
For Specialized Domains: Consider fine-tuning. We fine-tuned an embedding model on legal documents and saw retrieval precision improve from 72% to 89% on our benchmark.
Embedding Best Practices
- Embed chunks with context: Don't embed raw chunk text. Prepend the document title and section heading. This dramatically improves retrieval for ambiguous queries (see the sketch after this list).
- Separate query and document embeddings: Some models (like E5) are trained with different prefixes for queries vs. documents. Use them correctly.
- Normalize embeddings: Ensure your vector store uses normalized embeddings for consistent cosine similarity scores.
- Version your embeddings: When you change embedding models or strategies, you need to re-embed everything. Track embedding versions in metadata.
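A minimal sketch of the first, third, and fourth practices: prepend document and section context before embedding, normalize the vector (OpenAI embeddings are already near-unit-norm, but making it explicit keeps the invariant obvious), and stamp the embedding version into metadata. The composition format and version tag are illustrative choices.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_chunk(chunk: dict, doc_title: str, embedding_version: str = "v2-2024-05") -> dict:
    """Embed chunk text prefixed with its document/section context, then L2-normalize."""
    contextualized = f"{doc_title}\n{chunk['heading']}\n\n{chunk['text']}"
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=contextualized,
    )
    vector = np.array(response.data[0].embedding)
    vector = vector / np.linalg.norm(vector)  # cosine similarity == dot product
    return {
        "values": vector.tolist(),
        "metadata": {
            "heading": chunk["heading"],
            "doc_title": doc_title,
            "embedding_version": embedding_version,  # enables clean re-embedding migrations
        },
    }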
Retrieval: Where Most Systems Fail
Retrieval is the heart of RAG, and it's where most production systems struggle. Pure vector search is not enough.
The retrieval quality ceiling: No matter how good your LLM is, it can only work with the context you provide. If your retrieval returns irrelevant chunks, the LLM will either hallucinate (ignoring the useless context) or generate wrong answers (trusting the wrong context). We've seen systems where improving retrieval quality from 70% to 90% precision doubled the end-to-end answer accuracy—without changing anything else.
Why demos work but production fails: In demos, you control the queries. You know what's in your documents and craft questions that match. In production, users ask questions you never anticipated, in phrasing you didn't expect, about edge cases in your content. The gap between demo performance and production performance is almost entirely a retrieval problem.
Hybrid Search is Non-Negotiable
Dense retrieval (embeddings) captures semantic similarity but misses keyword matches. Sparse retrieval (BM25) captures exact matches but misses paraphrases. You need both.
The failure modes are complementary: Vector search fails when the query uses different vocabulary than the documents. User asks about "compensation" but documents say "salary"—vector search might miss this if the embedding doesn't capture the synonym relationship strongly enough. BM25 fails when the query is conceptually related but shares no words. User asks "how to handle angry customers" but the document says "de-escalation techniques for difficult client interactions." Hybrid search handles both.
Our hybrid approach:
- Dense retrieval: Top 50 candidates via ANN search
- Sparse retrieval: Top 50 candidates via BM25
- Reciprocal Rank Fusion: Combine results using RRF
- Re-ranking: Cross-encoder on top 20 combined results
The performance difference is substantial:
| Approach | Recall@10 | Precision@10 |
|---|---|---|
| Dense only | 76% | 58% |
| Sparse only | 69% | 62% |
| Hybrid (no rerank) | 84% | 65% |
| Hybrid + rerank | 91% | 78% |
Implementation: This hybrid retriever combines the best of both worlds: BM25's exact keyword matching and vector search's semantic understanding. The Reciprocal Rank Fusion algorithm is elegant—it combines rankings without needing to normalize scores across different retrieval systems (which is notoriously difficult).
The key insight in RRF: a document that appears high in both rankings is probably very relevant. RRF gives each document a score based on its rank position in each result set, automatically handling the score normalization problem.
After RRF combines the results, we use a cross-encoder for precise re-ranking. Cross-encoders are expensive (they must process each query-document pair), but by running them only on the top 20 candidates from RRF, we get their accuracy at acceptable latency.
from langchain.retrievers import BM25Retriever
from sentence_transformers import CrossEncoder
from typing import List, Dict


class HybridRetriever:
    """
    Production-grade hybrid retriever combining BM25 and vector search.

    Key design decisions:
    - BM25 for exact keyword matches (handles acronyms, proper nouns)
    - Vector search for semantic similarity (handles paraphrases, synonyms)
    - RRF for score-free fusion (no score normalization needed)
    - Cross-encoder for precise re-ranking (sees query + doc together)
    """

    def __init__(
        self,
        vector_store,
        documents: List[str],
        embeddings,
        cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
        bm25_weight: float = 0.4,
        vector_weight: float = 0.6,
        top_k_initial: int = 50,
        top_k_rerank: int = 20,
        top_k_final: int = 10
    ):
        """
        Args:
            vector_store: Vector store (Pinecone, Weaviate, etc.)
            documents: Raw documents for BM25 indexing
            embeddings: Embedding model for vector search
            cross_encoder_model: Model for re-ranking
            bm25_weight: Weight for BM25 scores (default 0.4)
            vector_weight: Weight for vector scores (default 0.6)
            top_k_initial: Candidates to retrieve from each method
            top_k_rerank: Candidates to re-rank with cross-encoder
            top_k_final: Final results to return
        """
        # Initialize BM25 retriever
        self.bm25_retriever = BM25Retriever.from_texts(
            documents,
            k=top_k_initial
        )

        # Vector store retriever
        self.vector_retriever = vector_store.as_retriever(
            search_kwargs={"k": top_k_initial}
        )

        # Cross-encoder for re-ranking
        self.cross_encoder = CrossEncoder(cross_encoder_model)

        # Parameters
        self.bm25_weight = bm25_weight
        self.vector_weight = vector_weight
        self.top_k_rerank = top_k_rerank
        self.top_k_final = top_k_final
        self.documents = documents

    def retrieve(self, query: str) -> List[Dict]:
        """
        Retrieve documents using hybrid search with re-ranking.

        Steps:
        1. BM25 retrieval (keyword-based)
        2. Vector retrieval (semantic)
        3. Reciprocal Rank Fusion
        4. Cross-encoder re-ranking
        5. Return top K

        Returns list of dicts with:
        - text: document text
        - score: final relevance score
        - rank: final ranking position
        - retrieval_method: how it was found (bm25/vector/both)
        """
        # Step 1: BM25 retrieval
        bm25_results = self.bm25_retriever.get_relevant_documents(query)

        # Step 2: Vector retrieval
        vector_results = self.vector_retriever.get_relevant_documents(query)

        # Step 3: Reciprocal Rank Fusion
        fused_results = self._reciprocal_rank_fusion(
            query, bm25_results, vector_results
        )

        # Step 4: Cross-encoder re-ranking on top candidates
        reranked_results = self._rerank_with_cross_encoder(
            query,
            fused_results[:self.top_k_rerank]
        )

        # Step 5: Return top K
        return reranked_results[:self.top_k_final]

    def _reciprocal_rank_fusion(
        self,
        query: str,
        bm25_results: List,
        vector_results: List,
        k: int = 60  # RRF constant, typically 60
    ) -> List[Dict]:
        """
        Combine rankings using Reciprocal Rank Fusion.

        RRF formula for each document d:
            RRF(d) = Σ(1 / (k + rank_i(d)))

        Where rank_i(d) is the rank of document d in retrieval method i.

        Why RRF works:
        - No score normalization needed (works with ranks only)
        - Documents appearing high in multiple rankings get boosted
        - Robust to score scale differences between methods
        """
        # Build rank dictionaries
        bm25_ranks = {doc.page_content: rank for rank, doc in enumerate(bm25_results)}
        vector_ranks = {doc.page_content: rank for rank, doc in enumerate(vector_results)}

        # All unique documents
        all_docs = set(bm25_ranks.keys()) | set(vector_ranks.keys())

        # Calculate RRF scores
        rrf_scores = {}
        for doc in all_docs:
            score = 0
            retrieval_methods = []

            if doc in bm25_ranks:
                score += self.bm25_weight / (k + bm25_ranks[doc])
                retrieval_methods.append('bm25')

            if doc in vector_ranks:
                score += self.vector_weight / (k + vector_ranks[doc])
                retrieval_methods.append('vector')

            rrf_scores[doc] = {
                'score': score,
                'retrieval_method': '+'.join(retrieval_methods),
                'bm25_rank': bm25_ranks.get(doc, None),
                'vector_rank': vector_ranks.get(doc, None)
            }

        # Sort by RRF score
        sorted_docs = sorted(
            rrf_scores.items(),
            key=lambda x: x[1]['score'],
            reverse=True
        )

        return [
            {
                'text': doc,
                'rrf_score': data['score'],
                'retrieval_method': data['retrieval_method'],
                'bm25_rank': data['bm25_rank'],
                'vector_rank': data['vector_rank']
            }
            for doc, data in sorted_docs
        ]

    def _rerank_with_cross_encoder(
        self,
        query: str,
        candidates: List[Dict]
    ) -> List[Dict]:
        """
        Re-rank candidates using cross-encoder.

        Cross-encoders see query and document together, enabling
        much more accurate relevance scoring than bi-encoders.

        Trade-off: Much slower (must run inference for each pair),
        so only use on top candidates from initial retrieval.
        """
        if not candidates:
            return []

        # Prepare query-document pairs
        pairs = [(query, cand['text']) for cand in candidates]

        # Get cross-encoder relevance scores, one per candidate
        # (raw model scores; higher = more relevant)
        ce_scores = self.cross_encoder.predict(pairs)

        # Add cross-encoder scores to candidates
        for cand, score in zip(candidates, ce_scores):
            cand['cross_encoder_score'] = float(score)
            # Weighted combination of RRF and cross-encoder
            cand['final_score'] = 0.3 * cand['rrf_score'] + 0.7 * float(score)

        # Sort by final score
        candidates.sort(key=lambda x: x['final_score'], reverse=True)

        # Add final ranks
        for rank, cand in enumerate(candidates):
            cand['rank'] = rank + 1

        return candidates


# Usage example
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
import pinecone

# Initialize vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
pinecone.init(api_key="your-key", environment="us-west1-gcp")
vector_store = Pinecone.from_documents(
    documents,
    embeddings,
    index_name="production-rag"
)

# Initialize hybrid retriever
retriever = HybridRetriever(
    vector_store=vector_store,
    documents=[doc.page_content for doc in documents],
    embeddings=embeddings,
    bm25_weight=0.4,
    vector_weight=0.6,
    top_k_final=10
)

# Retrieve
query = "What are the key findings about climate change?"
results = retriever.retrieve(query)

for result in results:
    print(f"Rank {result['rank']}: Score {result['final_score']:.3f}")
    print(f"  Method: {result['retrieval_method']}")
    print(f"  BM25 rank: {result['bm25_rank']}, Vector rank: {result['vector_rank']}")
    print(f"  Text: {result['text'][:150]}...")
    print()
Query Understanding
Real user queries are messy. They contain typos, ambiguity, implicit context, and mixed intents. We process queries through several stages:
Query Classification:
- Is this a factual question, a comparison, a summarization request?
- Does it reference previous conversation context?
- Is it in-domain or out-of-domain?
Query Expansion:
- Synonym expansion using a domain-specific thesaurus
- Acronym expansion
- LLM-based query reformulation for complex queries
Query Decomposition: For complex queries like "Compare the privacy policies of Apple and Google regarding data retention," we decompose into sub-queries:
- "Apple privacy policy data retention"
- "Google privacy policy data retention" Then merge and deduplicate results.
Re-ranking: The Secret Weapon
Initial retrieval optimizes for recall—casting a wide net. Re-ranking optimizes for precision. We use a cross-encoder model (ms-marco-MiniLM-L-12-v2 or similar) that sees both query and document together.
Why cross-encoders outperform bi-encoders for ranking: Bi-encoders (used in initial retrieval) embed query and document separately, then compare the embeddings. This is fast but loses information—the model can't see how specific words in the query relate to specific words in the document. Cross-encoders see query and document together, enabling much richer comparison. The model can learn that "capital" in the query matches "headquarters" in a document about companies, but not in a document about finance. This contextual understanding is impossible with separate embeddings.
The two-stage architecture is essential at scale: You can't run cross-encoder inference on 100,000 documents—it would take minutes. Instead, you use fast bi-encoder search to get top 50-100 candidates, then expensive cross-encoder re-ranking on just those candidates. This gives you the best of both worlds: broad coverage from bi-encoders, precise ranking from cross-encoders.
Re-ranking is computationally expensive, so we only apply it to top candidates. The latency/quality tradeoff:
| Candidates Re-ranked | Latency Added | Quality Gain |
|---|---|---|
| 10 | ~50ms | +8% precision |
| 20 | ~100ms | +12% precision |
| 50 | ~250ms | +14% precision |
We typically re-rank top 20 as the sweet spot.
Generation: Beyond Simple Prompting
With good retrieval, generation becomes more straightforward—but there are still critical decisions.
The generation step is often the easiest part: if your retrieval returns the right information, modern LLMs are remarkably good at synthesizing it into coherent answers. The hard work is upstream. That said, poor generation can still squander good retrieval. The most common failure modes: the LLM ignores the provided context and hallucinates, fails to cite sources (making answers unverifiable), or produces verbose responses when a concise answer was needed.
Context window management becomes critical at scale: GPT-4 can handle 128K tokens, but should you use all of it? More context means higher cost, higher latency, and—surprisingly—sometimes lower quality. LLMs can get "lost in the middle" when important information is surrounded by less relevant content. We've found that 4-8K tokens of highly relevant context often outperforms 32K tokens of moderately relevant context.
Context Assembly
Don't just concatenate retrieved chunks. Structure matters:
- Order by relevance: Most relevant first (LLMs have primacy bias)
- Include metadata: Source titles, dates help the LLM assess reliability
- Add structural markers: Clear delimiters between chunks
- Compress when needed: For very long contexts, use LLM-based summarization
Prompt Engineering for RAG
Our generation prompt includes:
- Clear instruction on task and format
- Explicit grounding requirements ("Only use information from the provided context")
- Citation instructions
- Handling for insufficient information ("If the context doesn't contain enough information to answer, say so")
- Output format specification
Implementation: The prompt is where you enforce grounding—the LLM's tendency to hallucinate must be actively suppressed through clear instructions. Notice how we explicitly tell the model what to do when information is missing (say so, don't make it up), how to cite sources (inline with [source_id]), and how to structure the answer.
The system message sets the behavior globally (you are a helpful assistant that strictly uses provided context). The user message provides the actual context and question. By separating these, we can easily swap system messages for different use cases (concise answers, detailed explanations, etc.) without changing the context assembly logic.
from openai import OpenAI
from typing import List, Dict
import json
import re


class RAGGenerator:
    """Generate responses from retrieved context with proper grounding."""

    def __init__(
        self,
        model: str = "gpt-4-turbo-preview",
        temperature: float = 0.0  # Low temp for factual responses
    ):
        self.client = OpenAI()
        self.model = model
        self.temperature = temperature

    def generate(
        self,
        query: str,
        retrieved_chunks: List[Dict],
        response_style: str = "concise"  # concise, detailed, technical
    ) -> Dict:
        """
        Generate answer from retrieved context.

        Args:
            query: User question
            retrieved_chunks: List of dicts with 'text', 'heading', 'score'
            response_style: Answer style (concise/detailed/technical)

        Returns:
            Dict with:
            - answer: Generated response
            - sources: List of source chunks used
            - confidence: Estimated confidence (based on retrieval scores)
        """
        # Assemble context with source IDs
        context = self._assemble_context(retrieved_chunks)

        # Select appropriate system message
        system_message = self._get_system_message(response_style)

        # Build user message
        user_message = self._build_user_message(query, context)

        # Generate response
        response = self.client.chat.completions.create(
            model=self.model,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ]
        )

        answer = response.choices[0].message.content

        # Extract citations from answer
        sources = self._extract_citations(answer, retrieved_chunks)

        # Estimate confidence based on retrieval scores
        confidence = self._estimate_confidence(retrieved_chunks)

        return {
            'answer': answer,
            'sources': sources,
            'confidence': confidence,
            'model': self.model,
            'tokens_used': response.usage.total_tokens
        }

    def _get_system_message(self, style: str) -> str:
        """Get system message based on response style."""
        base_instruction = """You are a helpful assistant that answers questions using ONLY the provided context.

CRITICAL RULES:
1. Base your answer EXCLUSIVELY on the provided context
2. If the context doesn't contain enough information, say "I don't have enough information to answer this question"
3. Cite sources inline using [Source X] notation
4. Do not make up information or use external knowledge
5. If multiple sources contradict each other, mention the contradiction"""

        style_instructions = {
            "concise": "\n6. Provide concise, direct answers without unnecessary elaboration",
            "detailed": "\n6. Provide comprehensive, detailed explanations with examples",
            "technical": "\n6. Use technical language and include relevant technical details"
        }

        return base_instruction + style_instructions.get(style, "")

    def _assemble_context(self, chunks: List[Dict]) -> str:
        """
        Assemble retrieved chunks into formatted context.

        Key decisions:
        - Order by relevance (highest first) - LLMs have primacy bias
        - Include metadata (heading, source) for context
        - Add source IDs for citation
        - Use clear delimiters between chunks
        """
        context_parts = []

        for i, chunk in enumerate(chunks, 1):
            # Format: [Source X] (from "Section Heading")
            #         Content here...
            source_id = f"[Source {i}]"
            heading = chunk.get('heading', 'Unknown Section')
            text = chunk['text']
            score = chunk.get('final_score', chunk.get('score', 0))

            context_parts.append(
                f"{source_id} (from \"{heading}\", relevance: {score:.2f})\n{text}"
            )

        return "\n\n---\n\n".join(context_parts)

    def _build_user_message(self, query: str, context: str) -> str:
        """Build user message with context and question."""
        return f"""Context:
{context}

---

Question: {query}

Instructions:
- Answer based ONLY on the context above
- Cite sources using [Source X] notation
- If information is insufficient, explicitly state what's missing
- If sources contradict, mention it

Answer:"""

    def _extract_citations(
        self,
        answer: str,
        chunks: List[Dict]
    ) -> List[Dict]:
        """
        Extract which sources were cited in the answer.

        This is important for:
        - Showing users where information came from
        - Validating that model actually used the context
        - Computing faithfulness metrics
        """
        # Find all [Source X] citations
        citations = re.findall(r'\[Source (\d+)\]', answer)
        cited_indices = set(int(c) - 1 for c in citations)  # Convert to 0-indexed

        sources = []
        for idx in cited_indices:
            if idx < len(chunks):
                chunk = chunks[idx]
                sources.append({
                    'source_id': idx + 1,
                    'text': chunk['text'],
                    'heading': chunk.get('heading', 'Unknown'),
                    'score': chunk.get('final_score', chunk.get('score', 0))
                })

        return sources

    def _estimate_confidence(self, chunks: List[Dict]) -> float:
        """
        Estimate answer confidence based on retrieval quality.

        Heuristics:
        - Higher retrieval scores → higher confidence
        - Multiple high-scoring chunks → higher confidence
        - Low diversity in scores → lower confidence (might be guessing)
        """
        if not chunks:
            return 0.0

        scores = [
            chunk.get('final_score', chunk.get('score', 0))
            for chunk in chunks
        ]

        # Average of top 3 scores (or all if fewer than 3)
        top_scores = sorted(scores, reverse=True)[:3]
        avg_score = sum(top_scores) / len(top_scores)

        # Confidence is roughly proportional to retrieval quality
        # Scale: 0.0-1.0
        confidence = min(avg_score, 1.0)

        return confidence


# Complete RAG Pipeline Example
class ProductionRAGSystem:
    """End-to-end RAG system combining all components."""

    def __init__(self):
        # Initialize components (assuming they're already set up)
        self.retriever = HybridRetriever(...)  # From previous code
        self.generator = RAGGenerator(model="gpt-4-turbo-preview")

    def query(
        self,
        user_query: str,
        response_style: str = "concise",
        min_confidence: float = 0.5
    ) -> Dict:
        """
        Process user query through complete RAG pipeline.

        Args:
            user_query: User's question
            response_style: Answer style (concise/detailed/technical)
            min_confidence: Minimum confidence threshold to return answer

        Returns:
            Dict with answer, sources, confidence, and metadata
        """
        # Step 1: Retrieve relevant chunks
        retrieved_chunks = self.retriever.retrieve(user_query)

        # Step 2: Generate answer
        result = self.generator.generate(
            query=user_query,
            retrieved_chunks=retrieved_chunks,
            response_style=response_style
        )

        # Step 3: Check confidence threshold
        if result['confidence'] < min_confidence:
            result['warning'] = (
                f"Low confidence ({result['confidence']:.2f}). "
                "Answer may be unreliable."
            )

        # Step 4: Log for monitoring
        self._log_query(user_query, result, retrieved_chunks)

        return result

    def _log_query(self, query: str, result: Dict, chunks: List[Dict]):
        """Log query for monitoring and improvement."""
        # In production, log to your observability system
        log_entry = {
            'query': query,
            'confidence': result['confidence'],
            'num_sources': len(result['sources']),
            'tokens_used': result['tokens_used'],
            'top_retrieval_score': chunks[0]['final_score'] if chunks else 0
        }
        # logger.info(json.dumps(log_entry))


# Usage
rag_system = ProductionRAGSystem()

result = rag_system.query(
    user_query="What are the main benefits of hybrid search in RAG systems?",
    response_style="detailed"
)

print("Answer:", result['answer'])
print("\nSources used:")
for source in result['sources']:
    print(f"  - Source {source['source_id']}: {source['heading']}")
print(f"\nConfidence: {result['confidence']:.2%}")

if 'warning' in result:
    print(f"⚠️ {result['warning']}")
Citation Generation
Users need to verify AI-generated answers. We extract citations by:
- Asking the LLM to cite sources inline
- Post-processing to map citations to actual retrieved chunks
- Validating that cited claims actually appear in cited sources (a heuristic sketch follows below)
- Generating confidence scores based on citation coverage
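The validation step can be approximated cheaply before falling back to an LLM judge. Here's a rough sketch using sentence-level fuzzy matching; the 0.6 threshold is an assumption to tune on your own data.

from difflib import SequenceMatcher
from nltk.tokenize import sent_tokenize

def citation_supported(claim: str, source_text: str, threshold: float = 0.6) -> bool:
    """Heuristic check: does any source sentence closely match the cited claim?"""
    claim = claim.lower()
    return any(
        SequenceMatcher(None, claim, sentence.lower()).ratio() >= threshold
        for sentence in sent_tokenize(source_text)
    )

# Claims whose best match falls below the threshold get routed to the
# LLM-as-judge faithfulness check (or flagged for human review).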
Evaluation: The Hardest Part
You can't improve what you don't measure, but measuring RAG quality is genuinely difficult.
Why RAG evaluation is uniquely challenging: Traditional ML evaluation compares predictions to ground truth labels. RAG has multiple components (retrieval, generation, citations) each with different failure modes. A wrong answer might be caused by bad retrieval (right answer wasn't in context), bad generation (right answer was in context but LLM missed it), or bad source data (retrieved content was wrong). You need component-level evaluation to know what to fix.
The evaluation data problem: Good evaluation requires representative queries with known answers. Where do these come from? You could manually create them (expensive, limited coverage), extract them from logs (selection bias toward easy queries), or generate them synthetically (may not match real usage patterns). We use all three: a core set of 200 hand-curated queries for regression testing, augmented with synthetic queries for coverage, and sampled production queries for ongoing monitoring.
Offline vs. online evaluation: Offline evaluation (on a held-out test set) tells you if changes improved quality on your benchmark. Online evaluation (in production) tells you if changes improved quality on real user queries. These can diverge. A change might improve offline metrics but hurt online metrics if your test set doesn't represent production distribution. We use offline evaluation for rapid iteration and online evaluation for final validation.
Retrieval Evaluation
| Metric | What It Measures | Target |
|---|---|---|
| Recall@K | % of relevant docs in top K | > 90% |
| Precision@K | % of top K that are relevant | > 70% |
| MRR | Mean reciprocal rank of first relevant | > 0.8 |
| NDCG | Ranking quality with graded relevance | > 0.85 |
Build a golden dataset of queries with labeled relevant documents. We have 500 queries across our domains, each with human-labeled relevant documents.
Implementation: This evaluation framework provides automated metrics for RAG quality. The key is having a golden dataset—queries with known relevant documents and ideal answers. You can create this by:
- Sampling real user queries from logs
- Having domain experts label which documents are relevant
- Optionally writing ideal reference answers
The framework measures both retrieval quality (are the right documents retrieved?) and generation quality (is the answer correct, grounded, and complete?). The faithfulness check is critical—it verifies that every claim in the answer appears in the context, preventing hallucinations.
from typing import List, Dict, Set
import numpy as np
from openai import OpenAI
import json


class RAGEvaluator:
    """Comprehensive evaluation framework for RAG systems."""

    def __init__(self):
        self.client = OpenAI()

    def evaluate_retrieval(
        self,
        query: str,
        retrieved_docs: List[Dict],
        relevant_doc_ids: Set[str],
        k: int = 10
    ) -> Dict:
        """
        Evaluate retrieval quality against golden dataset.

        Args:
            query: The search query
            retrieved_docs: List of retrieved documents with 'id' field
            relevant_doc_ids: Set of known relevant document IDs
            k: Number of top results to evaluate

        Returns:
            Dict with precision, recall, MRR, NDCG metrics
        """
        retrieved_ids = [doc['id'] for doc in retrieved_docs[:k]]

        # Precision@K: What fraction of retrieved docs are relevant?
        relevant_retrieved = set(retrieved_ids) & relevant_doc_ids
        precision_at_k = len(relevant_retrieved) / k if k > 0 else 0

        # Recall@K: What fraction of relevant docs were retrieved?
        recall_at_k = (
            len(relevant_retrieved) / len(relevant_doc_ids)
            if relevant_doc_ids else 0
        )

        # Mean Reciprocal Rank: Position of first relevant doc
        mrr = 0
        for i, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_doc_ids:
                mrr = 1 / i
                break

        # NDCG: Ranking quality with graded relevance
        # (simplified version with binary relevance)
        dcg = sum(
            (1 if doc_id in relevant_doc_ids else 0) / np.log2(i + 1)
            for i, doc_id in enumerate(retrieved_ids, 1)
        )
        # Ideal DCG (all relevant docs at top)
        ideal_ranking = [1] * min(len(relevant_doc_ids), k)
        idcg = sum(1 / np.log2(i + 1) for i in range(1, len(ideal_ranking) + 1))
        ndcg = dcg / idcg if idcg > 0 else 0

        return {
            'precision@k': precision_at_k,
            'recall@k': recall_at_k,
            'mrr': mrr,
            'ndcg': ndcg,
            'num_relevant_retrieved': len(relevant_retrieved),
            'num_relevant_total': len(relevant_doc_ids)
        }

    def evaluate_generation(
        self,
        query: str,
        generated_answer: str,
        context: str,
        reference_answer: str = None
    ) -> Dict:
        """
        Evaluate generation quality.

        Metrics:
        - Faithfulness: Are claims grounded in context?
        - Relevance: Does it answer the question?
        - Completeness: Does it cover key points?
        - Answer similarity: How close to reference? (if provided)

        Uses LLM-as-judge for automated evaluation.
        """
        results = {}

        # Faithfulness: Check if answer is grounded in context
        faithfulness = self._check_faithfulness(generated_answer, context)
        results['faithfulness'] = faithfulness
        # Flatten the numeric score so it aggregates cleanly downstream
        results['faithfulness_score'] = faithfulness.get('score', 0.0)

        # Relevance: Does answer address the query?
        relevance = self._check_relevance(query, generated_answer)
        results['relevance'] = relevance
        results['relevance_score'] = relevance.get('relevance_score', 0.0)

        # If reference answer provided, compute similarity
        if reference_answer:
            results['answer_similarity'] = self._compute_answer_similarity(
                generated_answer, reference_answer
            )

        return results

    def _check_faithfulness(self, answer: str, context: str) -> Dict:
        """
        Check if answer is faithful to context (no hallucinations).

        Returns:
        - is_faithful: bool
        - unfaithful_claims: List of claims not in context
        - score: 0-1 faithfulness score
        """
        prompt = f"""Given the following context and answer, determine if the answer is faithful to the context.
An answer is faithful if every factual claim in the answer is supported by the context.

Context:
{context}

Answer:
{answer}

Evaluate faithfulness:
1. List each factual claim in the answer
2. For each claim, verify if it's supported by the context
3. Return your evaluation in JSON format:
{{
    "is_faithful": true/false,
    "claims": [
        {{"claim": "...", "supported": true/false, "evidence": "..."}},
        ...
    ],
    "score": 0.0-1.0,
    "explanation": "..."
}}

JSON Response:"""

        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}]
        )

        result = json.loads(response.choices[0].message.content)
        return result

    def _check_relevance(self, query: str, answer: str) -> Dict:
        """Check if answer relevantly addresses the query."""
        prompt = f"""Evaluate if the following answer relevantly addresses the query.

Query: {query}

Answer: {answer}

Return JSON evaluation:
{{
    "is_relevant": true/false,
    "relevance_score": 0.0-1.0,
    "explanation": "Why is/isn't this relevant?",
    "missing_aspects": ["What key aspects of the query are not addressed?"]
}}

JSON Response:"""

        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}]
        )

        result = json.loads(response.choices[0].message.content)
        return result

    def _compute_answer_similarity(
        self, generated: str, reference: str
    ) -> float:
        """
        Compute semantic similarity between generated and reference answer.

        Uses embedding-based similarity.
        """
        # Get embeddings
        embeddings = self.client.embeddings.create(
            model="text-embedding-3-large",
            input=[generated, reference]
        )

        gen_emb = np.array(embeddings.data[0].embedding)
        ref_emb = np.array(embeddings.data[1].embedding)

        # Cosine similarity
        similarity = np.dot(gen_emb, ref_emb) / (
            np.linalg.norm(gen_emb) * np.linalg.norm(ref_emb)
        )
        return float(similarity)

    def evaluate_end_to_end(
        self,
        test_cases: List[Dict],
        rag_system
    ) -> Dict:
        """
        Run full evaluation on a test set.

        test_cases format:
        [
            {
                "query": "...",
                "relevant_doc_ids": [...],
                "reference_answer": "..."  # optional
            },
            ...
        ]

        Returns aggregated metrics across all test cases.
        """
        retrieval_results = []
        generation_results = []

        for case in test_cases:
            query = case['query']

            # Run RAG system
            result = rag_system.query(query)

            # Evaluate retrieval
            retrieval_metrics = self.evaluate_retrieval(
                query=query,
                retrieved_docs=result.get('retrieved_docs', []),
                relevant_doc_ids=set(case['relevant_doc_ids'])
            )
            retrieval_results.append(retrieval_metrics)

            # Evaluate generation
            generation_metrics = self.evaluate_generation(
                query=query,
                generated_answer=result['answer'],
                context=result.get('context', ''),
                reference_answer=case.get('reference_answer')
            )
            generation_results.append(generation_metrics)

        # Aggregate metrics
        return {
            'retrieval': self._aggregate_metrics(retrieval_results),
            'generation': self._aggregate_metrics(generation_results),
            'num_test_cases': len(test_cases)
        }

    def _aggregate_metrics(self, results: List[Dict]) -> Dict:
        """Compute mean and std for all numeric metrics."""
        aggregated = {}

        # Get all metric keys
        all_keys = set()
        for result in results:
            all_keys.update(result.keys())

        for key in all_keys:
            values = []
            for result in results:
                val = result.get(key)
                # Only aggregate numeric values
                if isinstance(val, (int, float)):
                    values.append(val)

            if values:
                aggregated[f'{key}_mean'] = np.mean(values)
                aggregated[f'{key}_std'] = np.std(values)

        return aggregated


# Usage Example
evaluator = RAGEvaluator()

# Golden dataset example
test_cases = [
    {
        "query": "What are the benefits of hybrid search?",
        "relevant_doc_ids": {"doc_42", "doc_103", "doc_87"},
        "reference_answer": "Hybrid search combines dense and sparse retrieval..."
    },
    # ... more test cases
]

# Evaluate your RAG system
results = evaluator.evaluate_end_to_end(
    test_cases=test_cases,
    rag_system=rag_system
)

print("Retrieval Metrics:")
print(f"  Precision@10: {results['retrieval']['precision@k_mean']:.2%} ± {results['retrieval']['precision@k_std']:.2%}")
print(f"  Recall@10: {results['retrieval']['recall@k_mean']:.2%}")
print(f"  MRR: {results['retrieval']['mrr_mean']:.3f}")
print(f"  NDCG: {results['retrieval']['ndcg_mean']:.3f}")

print("\nGeneration Metrics:")
print(f"  Faithfulness: {results['generation']['faithfulness_score_mean']:.2%}")
print(f"  Relevance: {results['generation']['relevance_score_mean']:.2%}")
Generation Evaluation
| Metric | What It Measures | How We Measure |
|---|---|---|
| Faithfulness | Are claims grounded in context? | LLM-as-judge + human audit |
| Relevance | Does it answer the question? | LLM-as-judge |
| Completeness | Does it cover all relevant info? | Human evaluation |
| Coherence | Is it well-written? | LLM-as-judge |
End-to-End Metrics
Ultimately, what matters is user outcomes:
- Task completion rate
- User satisfaction scores
- Time to answer
- Escalation rate (to human support)
Track correlation between your offline metrics and these online metrics. If they diverge, your evaluation is measuring the wrong things.
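One way to check that alignment: correlate each offline metric with a user-outcome signal over the same queries. A sketch, assuming you've joined per-query evaluation scores with thumbs-up/down feedback:

import numpy as np

def offline_online_correlation(offline_scores: list[float], thumbs_up: list[int]) -> float:
    """Pearson correlation between an offline metric and binary user feedback (1 = helpful)."""
    return float(np.corrcoef(offline_scores, thumbs_up)[0, 1])

# e.g. correlate per-query NDCG with feedback; a weak correlation (roughly < 0.3)
# suggests the offline benchmark no longer reflects production quality.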
Production Considerations
Caching
RAG systems have natural caching opportunities:
- Embedding cache: Don't re-embed identical queries
- Retrieval cache: Same query often returns same results
- Response cache: Exact query matches can return cached responses
We use a tiered caching strategy with Redis, achieving ~40% cache hit rate and 3x latency improvement for cached queries.
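A minimal sketch of the response-cache tier with Redis. The key scheme (normalized query hash plus index version) and TTL are assumptions; keying on the index version lets cache entries invalidate naturally when documents change.

import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def cached_query(rag_system, user_query: str, index_version: str = "v1", ttl_s: int = 3600):
    """Exact-match response cache keyed on normalized query + index version."""
    key = "rag:" + hashlib.sha256(
        f"{index_version}:{user_query.strip().lower()}".encode()
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = rag_system.query(user_query)
    cache.setex(key, ttl_s, json.dumps(result, default=str))
    return result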
Monitoring and Alerting
Monitor:
- Retrieval latency (p50, p95, p99)
- Generation latency
- Token usage and costs
- Empty retrieval rate (queries with no relevant results)
- User feedback signals
- Error rates by query type
Alert on (a sketch of these checks follows the list):
- Latency degradation (>2x normal p95)
- Spike in empty retrievals
- Error rate >1%
- Sudden cost increase
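Here's a rough sketch of the kind of checks we run over a rolling window of query logs. The thresholds mirror the list above; the log-record field names are illustrative assumptions.

import numpy as np

def check_alerts(window: list[dict], baseline_p95_ms: float) -> list[str]:
    """Evaluate alert conditions over a rolling window of per-query log records."""
    alerts = []

    latencies = [r["retrieval_ms"] + r["generation_ms"] for r in window]
    p95 = float(np.percentile(latencies, 95))
    if p95 > 2 * baseline_p95_ms:
        alerts.append(f"latency degradation: p95 {p95:.0f}ms vs baseline {baseline_p95_ms:.0f}ms")

    empty_rate = sum(1 for r in window if r["num_results"] == 0) / len(window)
    if empty_rate > 0.05:  # assumed threshold for an empty-retrieval spike
        alerts.append(f"empty retrieval spike: {empty_rate:.1%}")

    error_rate = sum(1 for r in window if r["error"]) / len(window)
    if error_rate > 0.01:
        alerts.append(f"error rate {error_rate:.1%} exceeds 1%")

    return alerts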
Continuous Improvement
The work is never done. We run weekly:
- Analysis of low-rated responses
- Review of empty retrieval queries
- Identification of new document types to add
- Re-evaluation on expanded test sets
Monthly:
- Embedding model comparison
- Chunking strategy experiments
- Full retrieval benchmark
Common Pitfalls and How to Avoid Them
- Skipping hybrid search: Pure vector search leaves 15-20% of precision on the table. Always combine dense and sparse.
- One-size-fits-all chunking: Different document types need different strategies. Legal contracts, technical docs, and blog posts shouldn't be chunked the same way.
- Ignoring context window limits: With 128K context windows, it's tempting to stuff everything in. But more context != better answers. Quality over quantity.
- No evaluation framework: If you're not measuring retrieval quality, you're flying blind. Build evaluation infrastructure before scaling.
- Treating RAG as a silver bullet: RAG works for factual retrieval from documents. It doesn't replace fine-tuning for behavior changes or reasoning improvements.
Conclusion
Building production RAG systems is an engineering discipline, not a demo hack. Success requires careful attention to every component: document processing, embedding strategy, retrieval architecture, generation prompting, and rigorous evaluation.
Start simple, measure everything, and iterate based on real user feedback. The techniques in this post have been battle-tested across millions of queries—but your domain will have its own challenges. Build the evaluation infrastructure to discover and solve them.