
ML System Design: A Complete Framework for Production Systems

A comprehensive framework for designing machine learning systems at scale. From problem framing to production monitoring—everything you need to build ML systems that actually work.


Why ML System Design Matters

Building a machine learning model is the easy part. Getting it to work reliably in production, at scale, serving millions of users—that's where most teams struggle.

The gap between "model works in notebook" and "model works in production" is enormous. Models that achieve 95% accuracy offline mysteriously underperform when deployed. Systems that work perfectly in testing collapse under real-world load. Features that take milliseconds to compute locally take seconds when serving live traffic.

This isn't a failure of machine learning—it's a failure of system design. The best ML engineers aren't necessarily the ones who can build the most sophisticated models. They're the ones who can design systems that reliably transform raw data into business value at scale.

This post provides a complete framework for ML system design. It's structured as eight sequential steps, each building on the previous. Whether you're designing a recommendation system, a fraud detection pipeline, or a search ranking model, the same principles apply.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ML SYSTEM DESIGN FRAMEWORK                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. PROBLEM FRAMING                                                      │
│     └──→ What business problem are we solving?                          │
│                                                                          │
│  2. SCALE & LATENCY REQUIREMENTS                                         │
│     └──→ How big and how fast?                                          │
│                                                                          │
│  3. METRICS                                                              │
│     └──→ How do we know if it's working?                                │
│                                                                          │
│  4. ARCHITECTURE                                                         │
│     └──→ How do the components fit together?                            │
│                                                                          │
│  5. DATA COLLECTION & PREPARATION                                        │
│     └──→ Where does the data come from?                                 │
│                                                                          │
│  6. OFFLINE MODEL DEVELOPMENT                                            │
│     └──→ Build and evaluate the model                                   │
│                                                                          │
│  7. ONLINE EXECUTION & TESTING                                           │
│     └──→ Deploy and validate in production                              │
│                                                                          │
│  8. MONITORING & CONTINUAL LEARNING                                      │
│     └──→ Keep it working over time                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  This framework is:                                                      │
│  • Sequential: Each step builds on the previous                         │
│  • Iterative: You'll revisit earlier steps as you learn more           │
│  • Universal: Applies to any ML system, not just deep learning         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Step 1: Problem Framing

The most common mistake in ML system design is starting with "let's build a model" instead of "what business problem are we solving?" This section ensures you're solving the right problem before writing any code.

Define the Business Objective

Every ML system exists to achieve a business goal. If you can't articulate that goal clearly, you shouldn't be building an ML system yet.

The business objective should be:

  • Measurable: You can track progress with numbers
  • Time-bound: There's a deadline or milestone
  • Impactful: Success meaningfully affects the business
  • Achievable: ML can plausibly help (not everything needs ML)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROBLEM FRAMING EXAMPLES                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE 1: E-COMMERCE RECOMMENDATIONS                                   │
│  ─────────────────────────────────────                                   │
│  Problem: Users aren't discovering products they'd like                 │
│  Business Goal: Increase revenue per user by 20% in 6 months           │
│  ML Task: Personalized product recommendation (ranking)                 │
│  Success: Revenue per user increases, return visit rate up             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE 2: FRAUD DETECTION                                              │
│  ─────────────────────────────                                           │
│  Problem: Fraudulent transactions costing $10M/year                     │
│  Business Goal: Reduce fraud losses by 50% without hurting UX          │
│  ML Task: Binary classification (fraud vs legitimate)                   │
│  Success: Fraud losses drop, false positive rate stays <1%             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE 3: CONTENT MODERATION                                           │
│  ─────────────────────────────────                                       │
│  Problem: Harmful content reaching users, regulatory risk               │
│  Business Goal: Remove 99% of harmful content within 1 hour            │
│  ML Task: Multi-label classification (spam, hate, violence, etc.)      │
│  Success: Harmful content exposure drops, user reports decrease        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE 4: SEARCH RANKING                                               │
│  ───────────────────────────                                             │
│  Problem: Users not finding what they're looking for                    │
│  Business Goal: Improve search success rate from 60% to 80%            │
│  ML Task: Learning to rank (relevance scoring)                         │
│  Success: Users find items faster, search abandonment decreases        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Translate Business Goals to ML Tasks

Once you understand the business problem, translate it into a well-defined ML task. This requires answering several questions:

What type of ML problem is this?

  • Classification: Assign items to categories (spam/not spam, fraud/legitimate)
  • Regression: Predict a continuous value (price, time, probability)
  • Ranking: Order items by relevance or preference
  • Clustering: Group similar items together
  • Generation: Create new content (text, images, recommendations)

What is the input? Define exactly what data the model will receive at inference time. Be specific:

  • User features (demographics, history, context)
  • Item features (attributes, metadata, embeddings)
  • Context features (time, location, device, session)
  • Real-time signals (current page, recent actions)

What is the output? Define the exact format of the model's prediction:

  • A single score (probability, relevance)
  • A ranked list (top-N recommendations)
  • A category (class label)
  • Multiple labels (multi-label classification)
  • Generated content (text, embeddings)

What is the ground truth? How will you know if the model's prediction was correct? This is crucial because it determines how you'll train and evaluate the model:

  • Explicit labels: Human annotations, user ratings
  • Implicit labels: Clicks, purchases, time spent
  • Delayed labels: Fraud confirmed days later, subscription churn after months

Understand Failure Modes

Before building the system, understand how it can fail and what the consequences are. Different failure modes have different costs:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FAILURE MODE ANALYSIS                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FRAUD DETECTION EXAMPLE:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  False Negative (Missed Fraud):                                         │
│  • Cost: Direct financial loss                                          │
│  • Impact: $100-$10,000 per incident                                   │
│  • Tolerance: Minimize aggressively                                     │
│                                                                          │
│  False Positive (Blocked Legitimate):                                   │
│  • Cost: Lost sale + customer frustration                              │
│  • Impact: $50 lost sale + potential churn                             │
│  • Tolerance: Keep below 1% of transactions                            │
│                                                                          │
│  System Failure (Model Unavailable):                                    │
│  • Cost: Must have fallback (rule-based or allow all)                  │
│  • Impact: Either missed fraud or blocked sales                        │
│  • Tolerance: <0.01% of transactions affected                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDATION SYSTEM EXAMPLE:                                          │
│  ───────────────────────────────                                         │
│                                                                          │
│  Poor Recommendations:                                                   │
│  • Cost: User doesn't engage, opportunity cost                         │
│  • Impact: Lower CTR, reduced revenue                                  │
│  • Tolerance: Acceptable if better than random                         │
│                                                                          │
│  Offensive/Inappropriate Recommendations:                               │
│  • Cost: User trust damaged, potential PR crisis                       │
│  • Impact: User churn, brand damage                                    │
│  • Tolerance: Near zero—needs guardrails                               │
│                                                                          │
│  Filter Bubble (Too Similar):                                           │
│  • Cost: User misses content they'd enjoy                              │
│  • Impact: Long-term engagement decline                                │
│  • Tolerance: Monitor diversity metrics                                │
│                                                                          │
│  Cold Start (New User/Item):                                            │
│  • Cost: Can't personalize for new users/items                        │
│  • Impact: Poor initial experience                                     │
│  • Tolerance: Need explicit fallback strategy                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Problem Framing Checklist

Before moving to the next step, verify:

  • Business objective is clearly defined and measurable
  • ML task type identified (classification, ranking, etc.)
  • Input data specified (what the model receives)
  • Output format defined (what the model produces)
  • Ground truth source identified (how you'll get labels)
  • Success criteria established (what "good" looks like)
  • Failure modes understood (what can go wrong and the cost)
  • Stakeholder alignment confirmed (everyone agrees on goals)

Common Pitfall: Jumping to model architecture before understanding the problem. Spend more time here than you think you need. A well-framed problem is half-solved.


Step 2: Scale and Latency Requirements

This step prevents two equally bad outcomes: over-engineering a system for scale you don't have, or under-engineering for scale you'll hit in three months. Get the numbers right before designing the architecture.

Estimate Request Volume

Understand how many predictions your system needs to make:

Queries Per Second (QPS):

  • Average QPS: Total daily predictions / 86,400 seconds
  • Peak QPS: Usually 2-10x average (depends on traffic patterns)
  • Burst QPS: Short spikes, often 20-50x average

Example calculation for an e-commerce recommendation system:

Code
Daily active users: 10 million
Sessions per user per day: 2
Page views per session: 10
Recommendations per page: 1 call

Daily predictions: 10M × 2 × 10 = 200 million
Average QPS: 200M / 86,400 ≈ 2,300 QPS
Peak QPS (3x): ~7,000 QPS
Flash sale burst (10x): ~23,000 QPS
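
The same back-of-the-envelope estimate as a small sketch you can rerun with your own numbers. The peak and burst multipliers are assumptions to tune per product, not fixed constants:

Python
SECONDS_PER_DAY = 86_400

def estimate_qps(daily_active_users: int,
                 sessions_per_user: float,
                 page_views_per_session: float,
                 calls_per_page: float = 1.0,
                 peak_multiplier: float = 3.0,
                 burst_multiplier: float = 10.0) -> dict:
    """Rough QPS estimate; the multipliers are assumptions, not measurements."""
    daily_predictions = (daily_active_users * sessions_per_user
                         * page_views_per_session * calls_per_page)
    average_qps = daily_predictions / SECONDS_PER_DAY
    return {
        "daily_predictions": daily_predictions,
        "average_qps": average_qps,
        "peak_qps": average_qps * peak_multiplier,
        "burst_qps": average_qps * burst_multiplier,
    }

# The example above: 10M DAU, 2 sessions/day, 10 page views/session, 1 call/page
print(estimate_qps(10_000_000, 2, 10))
# average ≈ 2,300 QPS, peak ≈ 7,000 QPS, burst ≈ 23,000 QPS
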
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SCALE REFERENCE POINTS                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SCALE TIERS:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Tier 1 - Small (Startup/MVP):                                          │
│  • 1-100 QPS                                                            │
│  • Single server can handle                                             │
│  • Simple deployment, minimal infrastructure                           │
│  • Example: Early-stage product recommendations                        │
│                                                                          │
│  Tier 2 - Medium (Growing Product):                                     │
│  • 100-10,000 QPS                                                       │
│  • Multiple servers, load balancing required                           │
│  • Need caching, careful resource planning                             │
│  • Example: Mid-size e-commerce search                                 │
│                                                                          │
│  Tier 3 - Large (Mature Product):                                       │
│  • 10,000-1,000,000 QPS                                                 │
│  • Distributed systems, complex orchestration                          │
│  • Dedicated infrastructure team                                        │
│  • Example: Major social media feed ranking                            │
│                                                                          │
│  Tier 4 - Hyperscale (Big Tech):                                        │
│  • 1M+ QPS                                                              │
│  • Custom hardware, global distribution                                │
│  • Massive engineering investment                                       │
│  • Example: Google Search, Facebook News Feed                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE OF THUMB:                                                          │
│  Design for 10x your current scale, but don't implement it yet.        │
│  Document what changes at 10x and 100x.                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Determine Latency Requirements

Latency requirements depend on where the ML system fits in the user experience:

User-Facing (Synchronous): The model blocks the user's request. Every millisecond matters.

  • Search ranking: <100ms (users expect instant results)
  • Recommendations: <200ms (part of page load)
  • Fraud detection: <100ms (during checkout)
  • Autocomplete: <50ms (must feel instantaneous)

Background (Asynchronous): The model runs separately from the user's request.

  • Email classification: <1 second (user doesn't wait)
  • Content moderation: <1 minute (before content goes live)
  • Batch recommendations: Hours (precomputed)
  • Model retraining: Hours to days

Latency Percentiles: Don't just measure average latency. Measure percentiles:

  • p50: Median latency (half of requests are faster)
  • p95: 95th percentile (95% of requests are faster)
  • p99: 99th percentile (only 1% are slower)
  • p99.9: 99.9th percentile (tail latency)
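
A minimal sketch of computing these percentiles from raw request latencies with NumPy. The latency samples here are synthetic placeholders; in practice you would pull them from access logs or tracing:

Python
import numpy as np

# Synthetic request latencies in milliseconds (placeholder for real log data)
rng = np.random.default_rng(42)
latencies_ms = rng.lognormal(mean=3.5, sigma=0.5, size=100_000)

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")

# The mean hides the tail: a healthy average can coexist with a painful p99
print(f"mean: {latencies_ms.mean():.1f} ms")
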
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LATENCY BUDGET BREAKDOWN                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: PRODUCT RECOMMENDATION (200ms BUDGET)                          │
│  ───────────────────────────────────────────────                         │
│                                                                          │
│  Component                          Time (ms)    % of Budget            │
│  ─────────────────────────────────────────────────────────────          │
│  Network round-trip (user→server)      20           10%                 │
│  API gateway + auth                    10            5%                 │
│  Feature fetching (user features)      30           15%                 │
│  Feature fetching (item features)      40           20%                 │
│  Model inference                       50           25%                 │
│  Post-processing + filtering           20           10%                 │
│  Response serialization                10            5%                 │
│  Network round-trip (server→user)      20           10%                 │
│  ─────────────────────────────────────────────────────────────          │
│  Total                                200          100%                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHTS:                                                           │
│                                                                          │
│  • Model inference is only 25% of total latency                        │
│  • Feature fetching often dominates (35% here)                         │
│  • Network latency is fixed cost—can't optimize                        │
│  • Every component needs its own latency budget                        │
│                                                                          │
│  OPTIMIZATION PRIORITY:                                                  │
│  1. Feature fetching (cache, pre-compute)                              │
│  2. Model inference (smaller model, batching, hardware)                │
│  3. Post-processing (simplify business rules)                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Calculate Data and Storage Requirements

Understand the data volumes involved:

Training Data:

  • How much historical data do you need?
  • How often do you retrain?
  • What's the storage format and compression?

Feature Storage:

  • How many features per entity?
  • How many entities (users, items)?
  • What's the update frequency?

Model Artifacts:

  • Model weights (can be GBs for deep learning)
  • Preprocessing artifacts (tokenizers, scalers)
  • Configuration and metadata

Example storage calculation:

Code
Users: 10 million
Items: 1 million
User features: 200 floats × 4 bytes = 800 bytes/user
Item features: 500 floats × 4 bytes = 2,000 bytes/item

User feature store: 10M × 800B = 8 GB
Item feature store: 1M × 2KB = 2 GB
Training data (1 year): 500 GB
Model artifacts: 2 GB

Total: ~512 GB (manageable on single machine)
At 100M users: ~80 GB user features (needs distributed store)
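
The same storage arithmetic as a sketch you can rerun as entity counts grow. It assumes 4 bytes per float and ignores indexes, replication, and compression:

Python
BYTES_PER_FLOAT = 4
GB = 1e9  # decimal gigabytes, to match the estimate above

def feature_store_size_gb(num_entities: int, floats_per_entity: int) -> float:
    """Raw size of a dense float feature store, before replication or overhead."""
    return num_entities * floats_per_entity * BYTES_PER_FLOAT / GB

print(feature_store_size_gb(10_000_000, 200))    # 8.0 GB of user features
print(feature_store_size_gb(1_000_000, 500))     # 2.0 GB of item features
print(feature_store_size_gb(100_000_000, 200))   # 80.0 GB at 100M users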

Scale and Latency Checklist

Before moving forward, verify:

  • Peak QPS estimated (with seasonal/event variations)
  • Latency budget defined (p50, p95, p99)
  • Latency breakdown by component
  • Training data volume calculated
  • Feature storage requirements estimated
  • Growth projections documented (6 months, 1 year, 3 years)
  • Cost estimate prepared (compute, storage, network)

Common Pitfall: Designing for current scale, not future scale. Always plan for 10x growth, but don't implement it until you need it. Document what changes at each scale tier.


Step 3: Metrics (Offline and Online Evaluation)

Metrics are how you know if your ML system is working. The wrong metrics lead to optimizing the wrong thing—a common and costly mistake. This section establishes the measurements that will guide development and deployment.

The Metrics Hierarchy

ML systems need multiple types of metrics at different levels:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    METRICS HIERARCHY                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LEVEL 1: BUSINESS METRICS (What the company cares about)               │
│  ─────────────────────────────────────────────────────────               │
│  • Revenue, profit, cost savings                                        │
│  • User retention, engagement, satisfaction                             │
│  • Operational efficiency                                               │
│  Example: Revenue per user, monthly active users, NPS score            │
│                                                                          │
│           ↑ Correlated but not identical ↓                              │
│                                                                          │
│  LEVEL 2: ONLINE METRICS (What you A/B test)                            │
│  ────────────────────────────────────────────                            │
│  • User behavior directly influenced by the model                       │
│  • Measurable in real-time during experiments                          │
│  • Leading indicators of business metrics                               │
│  Example: Click-through rate, conversion rate, time on site            │
│                                                                          │
│           ↑ Should correlate ↓                                          │
│                                                                          │
│  LEVEL 3: OFFLINE METRICS (What you optimize during training)           │
│  ─────────────────────────────────────────────────────────               │
│  • Model quality on held-out test data                                  │
│  • Computed during development, before deployment                       │
│  • Should predict online performance (but often don't!)                │
│  Example: AUC, NDCG, precision, recall, F1                             │
│                                                                          │
│           ↑ Must map to ↓                                               │
│                                                                          │
│  LEVEL 4: MODEL METRICS (Technical health)                              │
│  ──────────────────────────────────────────                              │
│  • Training loss, validation loss                                       │
│  • Gradient norms, learning curves                                      │
│  • Inference latency, throughput                                        │
│  Example: Cross-entropy loss, p99 latency, GPU utilization             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRITICAL INSIGHT:                                                       │
│  Optimizing offline metrics that don't correlate with online/business  │
│  metrics is the #1 cause of ML systems that "work" but don't deliver   │
│  value. Always validate the correlation.                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Choosing Offline Metrics

Offline metrics are computed on held-out test data during model development. Choose metrics appropriate to your ML task:

Classification Metrics:

Metric          When to Use                                    Limitation
────────────────────────────────────────────────────────────────────────────────────────────
Accuracy        Balanced classes, equal error costs            Misleading with imbalanced data
Precision       False positives are costly (spam filter)       Ignores false negatives
Recall          False negatives are costly (fraud, disease)    Ignores false positives
F1 Score        Balance precision and recall                   Assumes equal cost
AUC-ROC         Ranking quality, threshold-independent         Can be misleading with imbalance
AUC-PR          Imbalanced data, precision-recall trade-off    Less intuitive than ROC
Log Loss        Calibrated probabilities matter                Sensitive to confident wrong predictions

Ranking Metrics:

Metric          When to Use                                      Limitation
────────────────────────────────────────────────────────────────────────────────────────────
NDCG@K          Graded relevance (very/somewhat/not relevant)    Requires relevance labels
MAP             Binary relevance, multiple relevant items        All relevant items weighted equally
MRR             Single correct answer (Q&A, search)              Only considers first relevant result
Precision@K     Top-K results matter most                        Ignores ordering within K
Recall@K        Finding all relevant items matters               Ignores ranking quality

Regression Metrics:

Metric          When to Use                                    Limitation
────────────────────────────────────────────────────────────────────────────────────────────
MSE/RMSE        Penalize large errors heavily                  Sensitive to outliers
MAE             Robust to outliers                             Doesn't penalize large errors
MAPE            Relative error matters                         Undefined when actual is zero
R²              Explains variance in target                    Can be negative, misleading
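
A minimal sketch of computing a few of these offline metrics with scikit-learn on a held-out test set. The arrays are toy placeholders standing in for real labels and model outputs:

Python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, log_loss, ndcg_score,
                             mean_squared_error, mean_absolute_error)

# Toy classification data: true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("auc-roc:  ", roc_auc_score(y_true, y_prob))
print("log loss: ", log_loss(y_true, y_prob))

# Toy ranking data: graded relevance vs. model scores for a single query
relevance = np.array([[3, 2, 0, 1]])
scores = np.array([[0.9, 0.7, 0.4, 0.2]])
print("ndcg@3:   ", ndcg_score(relevance, scores, k=3))

# Toy regression data
y_reg_true = np.array([10.0, 12.5, 9.0])
y_reg_pred = np.array([11.0, 12.0, 8.5])
print("rmse:", mean_squared_error(y_reg_true, y_reg_pred) ** 0.5)
print("mae: ", mean_absolute_error(y_reg_true, y_reg_pred))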

Choosing Online Metrics

Online metrics are measured during A/B tests in production. They should be:

  • Sensitive: Detectable changes with reasonable sample sizes
  • Actionable: You can influence them with your model
  • Aligned: Correlated with business outcomes
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ONLINE METRICS BY USE CASE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RECOMMENDATIONS:                                                        │
│  ────────────────                                                        │
│  Primary: Click-through rate (CTR), Conversion rate                    │
│  Secondary: Revenue per session, Items viewed                          │
│  Guardrails: Latency, Diversity of recommendations                     │
│                                                                          │
│  SEARCH RANKING:                                                         │
│  ───────────────                                                         │
│  Primary: Click-through rate, Success rate (found what they wanted)    │
│  Secondary: Time to first click, Queries per session                   │
│  Guardrails: Zero-result rate, Latency                                 │
│                                                                          │
│  FRAUD DETECTION:                                                        │
│  ────────────────                                                        │
│  Primary: Fraud loss rate, Fraud detection rate                        │
│  Secondary: False positive rate, Manual review volume                  │
│  Guardrails: Customer friction (blocked legitimate), Latency           │
│                                                                          │
│  CONTENT MODERATION:                                                     │
│  ───────────────────                                                     │
│  Primary: Harmful content exposure rate, Removal accuracy              │
│  Secondary: Time to removal, Appeal rate                               │
│  Guardrails: False positive rate (wrongly removed), User reports       │
│                                                                          │
│  AD TARGETING:                                                           │
│  ─────────────                                                           │
│  Primary: Click-through rate, Cost per acquisition                     │
│  Secondary: Return on ad spend (ROAS), Impression to conversion        │
│  Guardrails: Ad fatigue, User complaints                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
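
Whether a metric is sensitive enough is ultimately a statistics question. As a sketch, here is a two-proportion z-test comparing click-through rate between a control and a treatment group; the click and impression counts are invented:

Python
from math import sqrt
from scipy.stats import norm

def ctr_z_test(clicks_a: int, impressions_a: int,
               clicks_b: int, impressions_b: int) -> tuple[float, float]:
    """Two-proportion z-test on CTR; returns (z, two-sided p-value)."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - norm.cdf(abs(z)))

# Hypothetical experiment: control vs. new ranking model
z, p = ctr_z_test(clicks_a=4_800, impressions_a=100_000,
                  clicks_b=5_150, impressions_b=100_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # small p: the CTR lift is unlikely to be noise
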

Guardrail Metrics

Guardrail metrics are things that should NOT get worse when you launch a new model. They prevent unintended consequences:

System Guardrails:

  • Latency (p50, p95, p99) should not regress
  • Error rate should not increase
  • Resource utilization should not spike

User Experience Guardrails:

  • User complaints should not increase
  • Negative feedback rate should stay stable
  • Session length should not decrease (unless intentional)

Business Guardrails:

  • Revenue should not decrease (even if engagement increases)
  • Customer support tickets should not spike
  • Churn should not increase

Fairness Guardrails:

  • Performance should not degrade for any user segment
  • Disparate impact across demographic groups should be monitored
  • Equal opportunity metrics should not regress across groups

Metrics Checklist

Before moving forward, verify:

  • Primary offline metric chosen (what you optimize during training)
  • Primary online metric chosen (what you A/B test)
  • Guardrail metrics listed (what shouldn't regress)
  • Offline-online correlation validated (or plan to validate)
  • Metric definitions documented precisely
  • Statistical significance requirements defined
  • Dashboards planned for all metric levels

Common Pitfall: Optimizing offline metrics that don't correlate with online business metrics. A model with higher AUC doesn't always produce higher CTR. Validate the correlation early and often.


Step 4: Architecting for Scale

This is where you design the system that will run your ML model in production. The goal is a system that meets your latency requirements, handles your scale, and fails gracefully when things go wrong.

High-Level Architecture Patterns

Most ML systems follow one of a few common patterns:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 1: ONLINE INFERENCE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  User Request → API Gateway → Model Server → Response                   │
│                                     ↑                                    │
│                              Feature Store                              │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Real-time predictions (synchronous)                                  │
│  • Low latency required (<200ms typical)                               │
│  • Features computed on-demand or pre-computed                         │
│                                                                          │
│  USE CASES:                                                              │
│  • Search ranking                                                       │
│  • Real-time recommendations                                           │
│  • Fraud detection at checkout                                         │
│                                                                          │
│  CHALLENGES:                                                             │
│  • Latency optimization critical                                       │
│  • Feature freshness vs computation cost                               │
│  • Handling traffic spikes                                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 2: BATCH INFERENCE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Scheduled Job → Data Warehouse → Model → Prediction Store → Serve     │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Pre-computed predictions (asynchronous)                              │
│  • Higher latency acceptable (minutes to hours)                        │
│  • Predictions cached and served from store                            │
│                                                                          │
│  USE CASES:                                                              │
│  • Email recommendations (daily digest)                                │
│  • Risk scoring for all customers                                      │
│  • Content pre-ranking for feeds                                       │
│                                                                          │
│  CHALLENGES:                                                             │
│  • Predictions may be stale                                            │
│  • Can't personalize to real-time context                             │
│  • Storage costs for all predictions                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 3: HYBRID (MOST COMMON)                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                    ┌───────────────┐                                    │
│                    │ Batch Layer   │ Pre-compute candidates             │
│                    │ (Candidate    │ (hourly/daily)                     │
│                    │  Generation)  │                                    │
│                    └───────┬───────┘                                    │
│                            ↓                                             │
│  User Request → ┌─────────────────────┐ → Response                     │
│                 │ Online Layer        │                                  │
│                 │ (Real-time Ranking) │                                  │
│                 └─────────────────────┘                                  │
│                            ↑                                             │
│                    Feature Store                                        │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Batch layer generates candidates (1000s of items)                   │
│  • Online layer ranks/filters for real-time context                    │
│  • Best of both worlds: coverage + freshness                           │
│                                                                          │
│  USE CASES:                                                              │
│  • Large-scale recommendations (Netflix, YouTube)                      │
│  • Search with pre-built index + real-time ranking                    │
│  • Ad serving with pre-filtered inventory                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 4: STREAMING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Event Stream → Stream Processor → Model → Action/Store                │
│       ↓                                                                  │
│  Feature Updates                                                        │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Process events as they arrive                                       │
│  • Near real-time (seconds to minutes)                                 │
│  • Features updated continuously                                       │
│                                                                          │
│  USE CASES:                                                              │
│  • Real-time fraud detection on transactions                           │
│  • Anomaly detection in logs/metrics                                   │
│  • Dynamic pricing updates                                             │
│                                                                          │
│  CHALLENGES:                                                             │
│  • Complex infrastructure (Kafka, Flink)                               │
│  • Exactly-once processing guarantees                                  │
│  • Debugging streaming pipelines                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Detailed System Architecture

Let's walk through a detailed architecture for a recommendation system—one of the most common ML systems:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    RECOMMENDATION SYSTEM ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER REQUEST                                                            │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  LOAD BALANCER / API GATEWAY                                    │   │
│  │  • Rate limiting, authentication                                │   │
│  │  • Route to appropriate service                                 │   │
│  │  • Latency: ~5ms                                                │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  CANDIDATE GENERATION                                           │   │
│  │                                                                  │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │   │
│  │  │ Retrieval 1 │  │ Retrieval 2 │  │ Retrieval 3 │             │   │
│  │  │ (Popular)   │  │ (Similar)   │  │ (Personal)  │             │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘             │   │
│  │         │               │               │                       │   │
│  │         └───────────────┼───────────────┘                       │   │
│  │                         ▼                                        │   │
│  │                  Merge & Dedupe                                  │   │
│  │                  (~500-1000 candidates)                          │   │
│  │                                                                  │   │
│  │  Latency: ~30ms (parallel retrieval)                            │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  FEATURE FETCHING                                               │   │
│  │                                                                  │   │
│  │  ┌─────────────────┐     ┌─────────────────┐                   │   │
│  │  │ User Features   │     │ Item Features   │                   │   │
│  │  │ (from cache)    │     │ (batch lookup)  │                   │   │
│  │  └─────────────────┘     └─────────────────┘                   │   │
│  │           │                      │                              │   │
│  │           └──────────┬───────────┘                              │   │
│  │                      ▼                                           │   │
│  │              Feature Assembly                                    │   │
│  │              (user × item pairs)                                │   │
│  │                                                                  │   │
│  │  Latency: ~40ms (often the bottleneck!)                         │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  RANKING MODEL                                                  │   │
│  │                                                                  │   │
│  │  • Score all candidates (500-1000 items)                       │   │
│  │  • Batch inference for efficiency                               │   │
│  │  • GPU/TPU for complex models                                   │   │
│  │                                                                  │   │
│  │  Latency: ~50ms (depends on model complexity)                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  POST-PROCESSING / BUSINESS RULES                               │   │
│  │                                                                  │   │
│  │  • Filter: Remove already seen, out of stock, blocked          │   │
│  │  • Diversify: Don't show all items from same category          │   │
│  │  • Boost: Promote sponsored items, new releases                │   │
│  │  • Explain: Generate "Because you watched X" text              │   │
│  │                                                                  │   │
│  │  Latency: ~15ms                                                 │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  RESPONSE (Top 10-50 recommendations)                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOTAL LATENCY BUDGET: ~150ms                                           │
│                                                                          │
│  SUPPORTING SYSTEMS:                                                     │
│  • Feature Store: Redis/DynamoDB for real-time features               │
│  • Model Registry: MLflow/Vertex AI for model versioning              │
│  • Logging: Kafka → Data Warehouse for training data                  │
│  • Monitoring: Prometheus/DataDog for metrics and alerts              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
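
The candidate-generation and ranking stages above boil down to a merge-and-dedupe across retrieval sources followed by one batched scoring call. A minimal sketch of that control flow, with the retrievers and the scoring function left as placeholders:

Python
from typing import Callable, Iterable

def generate_candidates(retrievers: Iterable[Callable[[str], list[str]]],
                        user_id: str, limit: int = 1000) -> list[str]:
    """Run each retrieval source, then merge and dedupe while preserving order."""
    seen: set[str] = set()
    merged: list[str] = []
    for retrieve in retrievers:                 # e.g. popular, similar-items, personalized
        for item_id in retrieve(user_id):
            if item_id not in seen:
                seen.add(item_id)
                merged.append(item_id)
    return merged[:limit]

def rank(user_id: str, candidates: list[str],
         score_batch: Callable[[str, list[str]], list[float]],
         top_k: int = 50) -> list[str]:
    """Score all candidates in one batched model call, then keep the top-K."""
    scores = score_batch(user_id, candidates)   # one inference call, not one per item
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:top_k]]
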

The Feature Store: Heart of ML Infrastructure

Feature stores are critical infrastructure that solves several hard problems:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE STORE ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│  • Features computed differently for training vs serving (skew!)       │
│  • Same features recomputed across multiple models (waste!)            │
│  • Feature logic scattered across notebooks and services (chaos!)      │
│                                                                          │
│  THE SOLUTION: FEATURE STORE                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    FEATURE STORE                                │   │
│  │                                                                  │   │
│  │  ┌───────────────────────────────────────────────────────────┐  │   │
│  │  │  FEATURE DEFINITIONS (code/config)                        │  │   │
│  │  │  • user_total_purchases (SUM of purchases, 30 days)      │  │   │
│  │  │  • item_avg_rating (AVG of ratings)                      │  │   │
│  │  │  • user_item_affinity (embedding similarity)             │  │   │
│  │  └───────────────────────────────────────────────────────────┘  │   │
│  │                          │                                       │   │
│  │           ┌──────────────┼──────────────┐                       │   │
│  │           ▼              ▼              ▼                       │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐               │   │
│  │  │ Offline     │ │ Online      │ │ Streaming   │               │   │
│  │  │ Store       │ │ Store       │ │ Updates     │               │   │
│  │  │ (training)  │ │ (serving)   │ │ (real-time) │               │   │
│  │  │             │ │             │ │             │               │   │
│  │  │ Data Lake   │ │ Redis/      │ │ Kafka →     │               │   │
│  │  │ Parquet     │ │ DynamoDB    │ │ Flink       │               │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘               │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  BENEFITS:                                                               │
│  • Single source of truth for feature definitions                      │
│  • Training-serving consistency guaranteed                             │
│  • Feature reuse across models                                         │
│  • Point-in-time correct training data                                 │
│  • Monitoring for feature drift                                        │
│                                                                          │
│  POPULAR OPTIONS:                                                        │
│  • Feast (open source)                                                 │
│  • Tecton (managed)                                                    │
│  • Vertex AI Feature Store (GCP)                                       │
│  • Amazon SageMaker Feature Store (AWS)                                │
│  • Databricks Feature Store                                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Defining Features: The Foundation of Consistency

A feature definition should be a single source of truth that both training and serving pipelines reference. The code below shows how to define features in a way that prevents training-serving skew—the feature computation logic lives in one place and is used everywhere.

Python
from feast import Entity, Feature, FeatureView, FileSource, ValueType
from datetime import timedelta

# Define the entity (what we're computing features for)
user = Entity(
    name="user_id",
    value_type=ValueType.STRING,
    description="Unique user identifier"
)

# Define the data source
user_activity_source = FileSource(
    path="s3://features/user_activity.parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created_timestamp"
)

# Define the feature view - THIS IS THE SINGLE SOURCE OF TRUTH
user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),  # How long features are valid
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_purchase_value_30d", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
        Feature(name="favorite_category", dtype=ValueType.STRING),
    ],
    online=True,   # Serve from online store (Redis)
    offline=True,  # Available for training (data warehouse)
    source=user_activity_source,
)

This feature definition using Feast (a popular open-source feature store) illustrates several critical concepts. The FeatureView defines exactly what features exist and how they're computed. Both the training pipeline and serving pipeline reference this same definition—there's no separate "training features" and "serving features" that could drift apart.

The ttl (time-to-live) parameter specifies how long computed features remain valid. For user purchase behavior, one day makes sense—preferences don't change hourly. For real-time features like "items currently in cart," you'd use a shorter TTL.

The online=True flag enables serving from a low-latency store (typically Redis or DynamoDB), while offline=True makes the same features available in the data warehouse for training. When you fetch features for training, the feature store handles point-in-time correctness—it retrieves the feature values as they existed at the time of each training example, preventing future data leakage.

The power of this approach is enforcement. When a data scientist wants to add a new feature, they define it in the feature store. When the serving team needs that feature at inference time, they fetch it from the same store. There's no opportunity for the training and serving computations to diverge.
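
To make the single-source-of-truth point concrete, here is a sketch of how both pipelines would fetch those features with Feast: the training path via get_historical_features (point-in-time correct joins) and the serving path via get_online_features. The repo path, user IDs, and timestamps are placeholders, and the exact call signatures vary slightly across Feast versions:

Python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # placeholder: wherever your feature repo lives

FEATURES = [
    "user_features:purchase_count_30d",
    "user_features:avg_purchase_value_30d",
    "user_features:days_since_last_purchase",
]

# Training path: point-in-time correct join against labeled examples
entity_df = pd.DataFrame({
    "user_id": ["u_001", "u_002"],                                    # hypothetical users
    "event_timestamp": pd.to_datetime(["2024-01-10", "2024-01-11"]),  # label timestamps
})
training_df = store.get_historical_features(entity_df=entity_df, features=FEATURES).to_df()

# Serving path: the same feature definitions, read from the online store at request time
online_features = store.get_online_features(
    features=FEATURES,
    entity_rows=[{"user_id": "u_001"}],
).to_dict()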

Caching Strategies

Caching is essential for meeting latency requirements at scale:

What to Cache:

  • User features: Change infrequently, cache for hours
  • Item features: Change infrequently, cache for hours
  • Candidate lists: Pre-computed, cache for minutes
  • Model predictions: If inputs are cacheable, cache results

Cache Invalidation:

  • TTL-based: Expire after fixed time (simple but may be stale)
  • Event-based: Invalidate when underlying data changes (fresh but complex)
  • Hybrid: TTL with event-based refresh for critical changes

Cache Hierarchy:

Code
Request → L1 Cache (in-process, microseconds)
       → L2 Cache (Redis, milliseconds)
       → L3 Cache (distributed, tens of milliseconds)
       → Origin (database/compute, hundreds of milliseconds)
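
A minimal sketch of that read-through hierarchy, with the L1 layer as a tiny in-process TTL cache and the lower layers stubbed out as callables (a real system would back them with Redis or a distributed cache):

Python
import time
from typing import Any, Callable, Optional

class TTLCache:
    """Tiny in-process (L1) cache with TTL-based expiry."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.time(), value)

def read_through(key: str, l1: TTLCache,
                 l2_get: Callable[[str], Optional[Any]],
                 origin: Callable[[str], Any]) -> Any:
    """Check L1, then L2 (e.g. Redis), then fall back to the origin and backfill L1."""
    value = l1.get(key)
    if value is not None:
        return value
    value = l2_get(key)      # placeholder for a Redis/DynamoDB lookup
    if value is None:
        value = origin(key)  # expensive path: database query or feature computation
    l1.set(key, value)
    return value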

Failure Modes and Fallbacks

Design for failure. Every component will fail eventually:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FAILURE MODES AND FALLBACKS                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  COMPONENT              FAILURE MODE              FALLBACK               │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Model Server           Timeout/Crash            Return popular items   │
│                                                   (pre-computed)         │
│                                                                          │
│  Feature Store          Slow/Unavailable         Use cached features    │
│                                                   (may be stale)         │
│                                                                          │
│  Candidate Generation   No results               Expand to global       │
│                                                   popular items          │
│                                                                          │
│  Real-time Features     Missing                  Use default values     │
│                                                   (document in training) │
│                                                                          │
│  Ranking Model          Wrong predictions        Circuit breaker →      │
│                                                   fallback model         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GRACEFUL DEGRADATION HIERARCHY:                                        │
│                                                                          │
│  1. Full personalization (everything works)                            │
│  2. Partial personalization (some features missing)                    │
│  3. Segment-based (user cohort defaults)                               │
│  4. Global popular (same for everyone)                                 │
│  5. Static fallback (hardcoded list)                                   │
│                                                                          │
│  Each level should be tested and monitored.                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
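
The degradation hierarchy can be written as an ordered chain of strategies, each tried in turn and each logged so you can monitor how often you are degrading. A sketch, with the individual strategies left as placeholders:

Python
import logging
from typing import Callable, Optional

logger = logging.getLogger("recommender.fallbacks")

def recommend_with_fallbacks(
    user_id: str,
    strategies: list[tuple[str, Callable[[str], Optional[list[str]]]]],
    static_fallback: list[str],
) -> list[str]:
    """Try each strategy in order: full personalization, partial, segment, popular."""
    for name, strategy in strategies:
        try:
            results = strategy(user_id)
            if results:
                logger.info("served_by=%s user=%s", name, user_id)  # track degradation rate
                return results
        except Exception:
            logger.exception("strategy_failed=%s user=%s", name, user_id)
    return static_fallback  # level 5: hardcoded list, should almost never be hit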

Architecture Checklist

Before moving forward, verify:

  • High-level architecture diagram drawn
  • Critical path identified (what blocks the response)
  • Bottlenecks analyzed (CPU, memory, network, disk)
  • Caching strategy defined (what's cached, TTL, invalidation)
  • Feature store design completed (or decision to not use one)
  • Failure modes identified with fallback strategies
  • Capacity planning done (servers, memory, cost)
  • Deployment strategy defined (blue-green, canary, etc.)

Common Pitfall: Over-engineering for scale you don't have yet, or under-engineering for scale you'll hit in 3 months. Document what changes at each scale tier and implement incrementally.


Step 5: Data Collection and Preparation

Data is the foundation of any ML system. This step ensures you have the right data, properly prepared, flowing reliably through the system.

Identify Data Sources

Start by mapping all data sources relevant to your ML problem:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA SOURCE INVENTORY                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SOURCE TYPE        EXAMPLES                    CONSIDERATIONS          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  User Behavior      Clicks, views, purchases    High volume, real-time │
│  Logs               Time on page, scroll depth  Privacy considerations │
│                                                                          │
│  User Profile       Demographics, preferences   PII handling required  │
│  Data               Account settings            May be incomplete      │
│                                                                          │
│  Item Catalog       Product attributes,         Regular updates needed │
│                     descriptions, images        Quality varies         │
│                                                                          │
│  Transaction        Purchases, refunds,         Delayed feedback       │
│  Data               subscriptions               (churn after months)   │
│                                                                          │
│  External Data      Weather, events,            API costs, reliability │
│                     market prices               Terms of use           │
│                                                                          │
│  User Feedback      Ratings, reviews,           Sparse, biased sample │
│                     surveys                      Selection effects      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FOR EACH SOURCE, DOCUMENT:                                             │
│  • Owner: Who controls this data?                                      │
│  • Access: How do we get it? (API, database, stream)                  │
│  • Freshness: How often is it updated?                                 │
│  • Quality: What's missing or incorrect?                               │
│  • Volume: How much data? Growth rate?                                 │
│  • Sensitivity: PII? Regulatory constraints?                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Define Your Labeling Strategy

Labels are the ground truth your model learns from. The labeling strategy is crucial:

Explicit Labels (Human-Generated):

  • Human annotators label data
  • High quality but expensive and slow
  • Subject to inter-annotator disagreement
  • Examples: Content moderation labels, search relevance judgments

Implicit Labels (Behavior-Derived):

  • Derived from user behavior
  • Abundant and free but noisy
  • May not reflect true preferences
  • Examples: Clicks (positive), impressions without clicks (negative?)

Programmatic Labels (Rule-Based):

  • Generated by rules or heuristics
  • Scalable but limited to rule coverage
  • Good for bootstrapping
  • Examples: Regex-based spam detection, threshold-based anomalies
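
As a small illustration of the programmatic approach, here is a sketch of rule-based weak labeling for spam; the patterns and thresholds are invented for the example:

Python
import re

SPAM_PATTERNS = [
    re.compile(r"free money", re.IGNORECASE),
    re.compile(r"click here now", re.IGNORECASE),
]

def label_message(text, sender_account_age_days):
    """Return a weak label: 1 = spam, 0 = not spam, None = abstain."""
    if any(p.search(text) for p in SPAM_PATTERNS):
        return 1
    if len(re.findall(r"https?://", text)) >= 3:  # link-heavy messages
        return 1
    if sender_account_age_days > 365 and len(text) > 50:
        return 0  # established senders with substantive messages: likely ham
    return None  # rules don't cover this case; leave unlabeled

Labels like these are noisy, but they can bootstrap a first model or be combined with other weak sources before investing in human annotation.
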
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LABELING STRATEGIES BY USE CASE                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RECOMMENDATION SYSTEMS:                                                 │
│  ───────────────────────                                                 │
│  Positive labels: Clicks, purchases, long engagement                   │
│  Negative labels: Impressions without interaction (tricky!)            │
│  Challenge: Exposure bias (only see what was shown)                    │
│  Solution: Use propensity scoring, randomized exploration              │
│                                                                          │
│  FRAUD DETECTION:                                                        │
│  ────────────────                                                        │
│  Positive labels: Confirmed fraud (chargebacks, investigations)        │
│  Negative labels: Transactions not reported as fraud                   │
│  Challenge: Label delay (fraud discovered weeks later)                 │
│  Solution: Retrain with mature labels, use intermediate signals        │
│                                                                          │
│  CONTENT MODERATION:                                                     │
│  ───────────────────                                                     │
│  Positive labels: Human-reviewed violations                            │
│  Negative labels: Content not flagged as violating                     │
│  Challenge: Policy changes, subjective judgments                       │
│  Solution: Regular recalibration, multiple annotators                  │
│                                                                          │
│  SEARCH RANKING:                                                         │
│  ───────────────                                                         │
│  Positive labels: Clicks, successful sessions                          │
│  Negative labels: Skipped results, reformulated queries                │
│  Challenge: Position bias (top results get more clicks)                │
│  Solution: Position debiasing, randomized experiments                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Deep Dive: Why These Labeling Challenges Matter

The diagram above summarizes labeling strategies, but understanding why each challenge exists—and how to solve it—is crucial for building systems that actually work.

Exposure Bias in Recommendations:

When you train a recommendation model, your training data consists of items that were shown to users and whether they clicked. But here's the problem: you only have labels for items that were shown. Items that were never shown have no labels—not because users wouldn't like them, but because your previous model never recommended them.

This creates a feedback loop. Your model learns to recommend items similar to what it already recommends, because those are the only items with positive labels. Items that might be excellent but were never shown remain hidden. Over time, this narrows the diversity of recommendations and can trap users in "filter bubbles."

The solution involves two strategies. First, inverse propensity scoring weights each training example by the inverse of the probability that the item would have been shown. If an item had only a 1% chance of being shown but was clicked, that click is a much stronger signal than a click on an item that was shown to everyone, so it gets a larger weight. Second, randomized exploration deliberately shows some items at random (not based on model scores) to collect unbiased feedback. This sacrifices a little short-term engagement for long-term model improvement.

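Here is a minimal sketch of the first strategy, assuming your logs provide an estimated propensity (probability of exposure) per example; the column names and the use of logistic regression are illustrative:

Python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_with_ipw(df: pd.DataFrame, feature_cols, min_propensity=0.01):
    """
    Train a click model where each example is weighted by 1 / P(item was shown).
    df must contain a 'clicked' label and a 'propensity' column estimated from logs.
    """
    # Clip tiny propensities so a few rare exposures don't dominate training
    propensity = df["propensity"].clip(lower=min_propensity)
    weights = 1.0 / propensity

    model = LogisticRegression(max_iter=1000)
    model.fit(df[feature_cols], df["clicked"], sample_weight=weights)
    return model
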
Label Delay in Fraud Detection:

Fraud labels arrive late—sometimes weeks or months after the transaction. A customer disputes a charge, the bank investigates, and eventually confirms fraud. During this delay, your model is training on incomplete data. Recent transactions are labeled "not fraud" simply because there hasn't been time for fraud to be reported.

This creates a systematic bias toward under-detecting fraud. Your model sees recent "not fraud" labels that are actually "fraud not yet discovered" and learns that certain patterns are safe when they're not.

The solution requires training on "mature" labels—data old enough that most fraud would have been discovered. If fraud typically surfaces within 30 days, train on data at least 45 days old. For real-time scoring, use intermediate signals (account flags, suspicious pattern matches) that indicate elevated risk before fraud is confirmed.

Position Bias in Search:

Users are far more likely to click results at the top of the page, regardless of relevance. A mediocre result in position 1 gets more clicks than an excellent result in position 5. If you train naively on click data, your model learns that "being shown first" predicts clicks—which is true but useless, since you're trying to learn which results deserve to be first.

Position debiasing corrects for this. During training, you weight clicks by the inverse of position bias—a click on position 5 counts more than a click on position 1, because it happened despite the position disadvantage. Some teams run randomized experiments where result order is shuffled, collecting unbiased click data at the cost of degraded user experience during the experiment.

Training-Serving Consistency

One of the most common and insidious bugs in ML systems is training-serving skew: features computed differently during training versus serving.

Sources of Skew:

  1. Code skew: Different implementations in training (Python/Spark) vs serving (Java/C++)
  2. Data skew: Different data sources or freshness
  3. Time skew: Using future information during training (leakage)
  4. Processing skew: Different preprocessing (normalization, encoding)

Solutions:

  • Shared feature definitions: Single source of truth (feature store)
  • Logged features: Log features at serving time, use for training
  • Feature validation: Compare training and serving feature distributions
  • Integration tests: End-to-end tests that catch skew
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TRAINING-SERVING CONSISTENCY                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ANTI-PATTERN (Causes Skew):                                            │
│  ───────────────────────────                                             │
│                                                                          │
│  Training Pipeline:                                                      │
│  Raw Data → Spark Job → Feature Engineering → Training                 │
│                         (Python code A)                                  │
│                                                                          │
│  Serving Pipeline:                                                       │
│  Request → API Server → Feature Engineering → Model → Response         │
│                         (Java code B)                                    │
│                                                                          │
│  Problem: Code A and Code B may compute features differently!          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GOOD PATTERN (Consistent):                                              │
│  ──────────────────────────                                              │
│                                                                          │
│  Both Pipelines:                                                         │
│  Data → Feature Store → Same Features → Training/Serving               │
│           (single definition)                                            │
│                                                                          │
│  OR:                                                                     │
│                                                                          │
│  Serving:                                                                │
│  Request → Features → Model → Response                                  │
│                │                                                         │
│                └──→ Log Features                                        │
│                          │                                               │
│  Training:               ▼                                               │
│                 Use Logged Features                                      │
│                                                                          │
│  This guarantees training uses exactly what serving computes.          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Concrete Example: How Training-Serving Skew Destroys Model Performance

Let's walk through a real scenario where training-serving skew caused a recommendation model to fail in production, despite excellent offline metrics.

The Setup:

An e-commerce team builds a recommendation model. One important feature is user_avg_purchase_price—the average price of items a user has purchased. During training, they compute this feature using a Spark job that processes the full purchase history.

The Skew:

In their training pipeline (Spark/Python):

Python
# Training: computes average over ALL purchases (full lifetime history)
from pyspark.sql.functions import avg

avg_price = user_purchases.groupBy("user_id").agg(avg("price"))

In their serving pipeline (Java microservice):

Java
// Serving: computes average over LAST 30 DAYS only (for performance)
double avgPrice = recentPurchases.stream()
    .mapToDouble(Purchase::getPrice)
    .average()
    .orElse(0.0);  // average() returns OptionalDouble; default when no recent purchases

The training code computes the lifetime average. The serving code, for latency reasons, only uses the last 30 days. Nobody documented this difference. The feature has the same name in both places.

The Impact:

For a user who bought cheap items years ago but now buys expensive items, the training feature might be $50 (lifetime average), while the serving feature is $200 (recent average). The model learned patterns based on lifetime averages but receives recent averages at inference time.

The model's offline AUC was 0.82. In production, it performed barely better than random. The team spent weeks debugging before discovering the skew.

How to Detect This:

  1. Feature distribution monitoring: Log feature values at serving time. Compare serving distributions to training distributions daily. The user_avg_purchase_price feature would show a different distribution (higher variance, shifted mean).

  2. Shadow scoring: Run the training pipeline on recent data and compare features to what was logged during serving. Any significant differences indicate skew.

  3. Feature contracts: Document exactly how each feature is computed, including time windows, null handling, and edge cases. Code review changes against this contract.

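Here is a minimal sketch of the distribution comparison from point 1, using the population stability index; the thresholds in the docstring are common conventions rather than hard rules:

Python
import numpy as np

def population_stability_index(train_values, serving_values, bins=10):
    """
    Compare a feature's training vs serving distribution.
    Common reading: PSI < 0.1 stable, 0.1-0.25 investigate, > 0.25 likely skew/drift.
    """
    train_values = np.asarray(train_values, dtype=float)
    serving_values = np.asarray(serving_values, dtype=float)

    # Bin edges from the training distribution; drop duplicate edges for skewed features
    edges = np.unique(np.percentile(train_values, np.linspace(0, 100, bins + 1)))

    # Clip serving values into the training range so outliers fall in the end bins
    serving_clipped = np.clip(serving_values, edges[0], edges[-1])

    train_frac = np.histogram(train_values, bins=edges)[0] / len(train_values)
    serve_frac = np.histogram(serving_clipped, bins=edges)[0] / len(serving_values)

    # Small floor avoids log(0) and division by zero for empty bins
    train_frac = np.clip(train_frac, 1e-6, None)
    serve_frac = np.clip(serve_frac, 1e-6, None)

    return float(np.sum((serve_frac - train_frac) * np.log(serve_frac / train_frac)))

Run this per feature, comparing the last training snapshot against a day of logged serving values, and alert when the score crosses your chosen threshold.
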
The Fix:

Either change training to use 30-day windows (matching serving), or change serving to compute lifetime averages (matching training). The specific choice depends on which definition is more predictive—but they must match.

This example illustrates why training-serving consistency is non-negotiable. A feature store that enforces single definitions prevents this entire class of bugs.

Data Quality Assessment

Before training, assess data quality systematically:

Completeness:

  • What percentage of records have each feature?
  • Are missing values random or systematic?
  • How do you handle missing values?

Accuracy:

  • Are labels correct? (Sample and verify)
  • Are feature values plausible? (Range checks)
  • Are there systematic errors? (Biased collection)

Consistency:

  • Are definitions consistent over time?
  • Are there duplicate records?
  • Do related fields agree?

Timeliness:

  • How fresh is the data?
  • Is there label delay?
  • Are features point-in-time correct?

Implementing Data Quality Checks

Data quality assessment shouldn't be manual inspection—it should be automated validation that runs on every data pipeline execution. The code below shows a simplified data validation approach that catches common issues before they corrupt your training data.

Python
def validate_training_data(df, config):
    """
    Validate training data quality before model training.
    Returns validation report with pass/fail status.
    """
    issues = []

    # Check for required columns
    missing_cols = set(config.required_columns) - set(df.columns)
    if missing_cols:
        issues.append(f"Missing columns: {missing_cols}")

    # Check null rates against thresholds
    for col, max_null_rate in config.null_thresholds.items():
        actual_null_rate = df[col].isnull().mean()
        if actual_null_rate > max_null_rate:
            issues.append(
                f"{col}: null rate {actual_null_rate:.2%} exceeds threshold {max_null_rate:.2%}"
            )

    # Check value ranges for numeric features
    for col, (min_val, max_val) in config.value_ranges.items():
        out_of_range = ((df[col] < min_val) | (df[col] > max_val)).mean()
        if out_of_range > 0.01:  # More than 1% out of range
            issues.append(
                f"{col}: {out_of_range:.2%} of values outside [{min_val}, {max_val}]"
            )

    # Check label distribution hasn't shifted dramatically
    label_dist = df[config.label_column].value_counts(normalize=True)
    for label, expected_rate in config.expected_label_rates.items():
        actual_rate = label_dist.get(label, 0)
        if abs(actual_rate - expected_rate) > config.label_drift_threshold:
            issues.append(
                f"Label '{label}': rate {actual_rate:.2%} vs expected {expected_rate:.2%}"
            )

    return {"passed": len(issues) == 0, "issues": issues}

This validation function embodies several important principles. First, it checks for structural issues like missing columns—these indicate upstream pipeline failures and should block training entirely. Second, it monitors null rates per column, because a sudden spike in nulls often indicates a data source problem (API changed, logging broke, upstream job failed). The thresholds should be set based on historical baselines, not arbitrary numbers.

Third, the value range checks catch data corruption. If user ages suddenly include values of 500 or -10, something is wrong upstream. Fourth, label distribution monitoring catches concept drift—if your fraud rate suddenly doubles, either fraud patterns changed or your labeling pipeline broke. Either way, you need to investigate before training.

The key is making these checks automatic and blocking. If validation fails, the training pipeline should stop and alert, not proceed with corrupted data. Bad data produces bad models, and catching issues at validation time is far cheaper than debugging model failures in production.

Data Pipeline Architecture

Design robust pipelines for data flow:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA PIPELINE ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DATA SOURCES                                                            │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  INGESTION LAYER                                                │   │
│  │  • Kafka for streaming data                                     │   │
│  │  • Batch imports for static data                                │   │
│  │  • Schema validation at entry                                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  PROCESSING LAYER                                               │   │
│  │  • Spark/Flink for transformation                               │   │
│  │  • Feature computation                                          │   │
│  │  • Data quality checks                                          │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  STORAGE LAYER                                                  │   │
│  │  • Data Lake (S3/GCS) for raw and processed data               │   │
│  │  • Feature Store for ML features                                │   │
│  │  • Data Warehouse for analytics                                 │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  SERVING LAYER                                                  │   │
│  │  • Training data for model development                          │   │
│  │  • Online features for inference                                │   │
│  │  • Batch features for offline scoring                           │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  KEY PRINCIPLES:                                                         │
│  • Idempotent processing (re-runnable without side effects)            │
│  • Schema evolution (handle schema changes gracefully)                 │
│  • Data lineage (track where data came from)                           │
│  • Monitoring and alerting (catch issues early)                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Collection Checklist

Before moving forward, verify:

  • All data sources identified and accessible
  • Labeling strategy defined (explicit, implicit, or programmatic)
  • Training-serving consistency ensured
  • Data quality assessed (completeness, accuracy, consistency)
  • Data pipeline architecture designed
  • Privacy and compliance requirements addressed
  • Data retention and deletion policies defined
  • Documentation of all feature definitions

Common Pitfall: Spending 80% of time on modeling, 20% on data. Should be reversed. Data quality has more impact on model performance than model architecture for most problems.


Step 6: Offline Model Development and Evaluation

This step covers the actual machine learning: selecting algorithms, training models, and evaluating performance before deployment.

Establish a Baseline

Before building complex models, establish a baseline. This serves multiple purposes:

  • Proves the problem is solvable
  • Provides a benchmark for improvement
  • May be sufficient for the business need
  • Helps debug more complex models

Good Baselines:

  • Random: What's random performance? (important for imbalanced classes)
  • Heuristic: Simple rules based on domain knowledge
  • Simple ML: Logistic regression, decision tree, k-NN
  • Previous system: What's the current production model?
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BASELINE EXAMPLES                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RECOMMENDATIONS:                                                        │
│  ────────────────                                                        │
│  Random: NDCG@10 = 0.05                                                │
│  Popular items: NDCG@10 = 0.25                                         │
│  User-based CF: NDCG@10 = 0.35                                         │
│  Target: NDCG@10 > 0.45                                                │
│                                                                          │
│  FRAUD DETECTION:                                                        │
│  ────────────────                                                        │
│  Random: Precision@1% = 0.01 (1% of transactions are fraud)           │
│  Rule-based: Precision@1% = 0.30                                       │
│  Logistic regression: Precision@1% = 0.50                             │
│  Target: Precision@1% > 0.70                                           │
│                                                                          │
│  SEARCH RANKING:                                                         │
│  ───────────────                                                         │
│  BM25 (keyword): NDCG@10 = 0.45                                        │
│  + Click features: NDCG@10 = 0.55                                      │
│  Target: NDCG@10 > 0.65                                                │
│                                                                          │
│  ALWAYS REPORT:                                                          │
│  • Baseline performance (what you're improving on)                     │
│  • Model performance (what you achieved)                               │
│  • Improvement (absolute and relative)                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Model Selection

Choose model architecture based on your requirements:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MODEL SELECTION GUIDE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WHEN TO USE SIMPLER MODELS (Linear, Trees, GBMs):                     │
│  ────────────────────────────────────────────────                        │
│  • Limited training data (<100K examples)                              │
│  • Need interpretability (regulated domains)                           │
│  • Strict latency requirements (<10ms)                                 │
│  • Limited engineering resources                                        │
│  • Features are already well-engineered                                │
│                                                                          │
│  WHEN TO USE DEEP LEARNING:                                             │
│  ──────────────────────────                                              │
│  • Abundant training data (>1M examples)                               │
│  • Raw/unstructured data (text, images, sequences)                     │
│  • Complex feature interactions to learn                               │
│  • Have GPU/TPU infrastructure                                          │
│  • State-of-the-art performance required                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON ARCHITECTURES BY TASK:                                          │
│  ──────────────────────────────                                          │
│                                                                          │
│  Classification/Regression (tabular):                                   │
│  • XGBoost, LightGBM, CatBoost (gradient boosting)                    │
│  • Wide & Deep (combines memorization + generalization)                │
│  • TabNet (attention for tabular)                                      │
│                                                                          │
│  Ranking:                                                                │
│  • LambdaMART (gradient boosting for ranking)                          │
│  • Two-tower models (separate user/item encoders)                      │
│  • Cross-attention models (BERT-style)                                 │
│                                                                          │
│  Sequence Modeling:                                                      │
│  • Transformers (attention-based)                                      │
│  • LSTMs/GRUs (recurrent)                                              │
│  • Temporal Convolutional Networks                                     │
│                                                                          │
│  Embeddings:                                                             │
│  • Word2Vec, FastText (word embeddings)                                │
│  • Item2Vec (item embeddings from co-occurrence)                       │
│  • Graph Neural Networks (relationship-aware)                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Understanding the Trade-offs: When Simple Beats Complex

The diagram above provides guidelines, but understanding the underlying trade-offs helps you make better decisions for your specific situation.

The Data Efficiency Question:

Deep learning models are hungry for data. A neural network with millions of parameters needs millions of examples to avoid overfitting. Gradient boosted trees, with their inherent regularization and feature selection, can achieve strong performance with orders of magnitude less data.

The crossover point varies by problem complexity. For simple tabular prediction (like click-through rate with well-engineered features), XGBoost often matches or beats neural networks even with 10M+ examples. For complex problems with raw inputs (understanding sentence meaning, detecting objects in images), neural networks pull ahead much sooner.

A practical test: train both approaches on 10% of your data. If XGBoost is competitive, it will likely remain competitive at full scale. If the neural network is already significantly better, the gap will widen with more data.

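A rough sketch of that 10% experiment using scikit-learn stand-ins for the two families (gradient boosting vs a small neural network); X and y are assumed to be a tabular feature matrix and binary labels, and in practice you would swap in your actual candidates:

Python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def compare_on_subsample(X, y, fraction=0.10, random_state=0):
    """Train a GBM and a small neural net on a fraction of the data; compare AUC."""
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=fraction, stratify=y, random_state=random_state
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_sub, y_sub, test_size=0.2, stratify=y_sub, random_state=random_state
    )

    results = {}
    for name, model in [
        ("gbm", GradientBoostingClassifier()),
        # In practice, scale features for the MLP (e.g., StandardScaler); omitted for brevity
        ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
    ]:
        model.fit(X_train, y_train)
        scores = model.predict_proba(X_val)[:, 1]
        results[name] = roc_auc_score(y_val, scores)
    return results
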
The Feature Engineering Trade-off:

Gradient boosted models excel when features are well-engineered. They can find complex interactions and thresholds, but they operate on the features you provide. If your features capture the signal in your data, GBMs are hard to beat.

Neural networks learn features from raw data. This is powerful when you don't know the right features (what makes a sentence positive sentiment?) or when the right features are too complex to engineer manually (what patterns in an image indicate a cat?). But this power comes at a cost: more data required, more compute needed, and harder to interpret.

For tabular data with domain expertise available, investing in feature engineering plus XGBoost often beats investing in neural architecture search. The engineering time pays off in faster training, easier debugging, and more interpretable models.

The Latency Trade-off:

Inference latency varies dramatically by model type. A small XGBoost model can score in microseconds. A BERT model requires milliseconds on GPU, tens of milliseconds on CPU. For applications requiring <10ms latency (autocomplete, real-time bidding), this constraint often eliminates deep learning options—or requires significant optimization investment (distillation, quantization, specialized hardware).

Consider total system latency, not just model latency. If feature fetching takes 50ms and you have a 100ms budget, your model has 50ms regardless of architecture. In that case, you might afford a larger model than you initially thought.

The Interpretability Trade-off:

In regulated domains (credit decisions, medical diagnosis), you may need to explain why the model made a prediction. Tree-based models provide natural explanations: "This loan was rejected because income < $50K AND debt-to-income > 40%." Neural networks are black boxes that require additional techniques (SHAP, attention visualization) for explanation, and these explanations are approximations.

Don't overestimate interpretability requirements. Many "we need interpretability" situations actually need auditability (can we check the model isn't discriminating?) or debuggability (can we understand when it fails?). These are different from per-prediction explanations and can be achieved with black-box models.

Training Pipeline Design

Design a reproducible, automated training pipeline:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TRAINING PIPELINE COMPONENTS                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. DATA LOADING                                                         │
│     • Load training, validation, test splits                           │
│     • Apply consistent preprocessing                                    │
│     • Handle data versioning (what data was used?)                     │
│                                                                          │
│  2. FEATURE ENGINEERING                                                  │
│     • Transform raw features to model inputs                           │
│     • Handle missing values, encoding                                   │
│     • Feature selection if needed                                       │
│                                                                          │
│  3. MODEL TRAINING                                                       │
│     • Initialize model with hyperparameters                            │
│     • Train with early stopping on validation                          │
│     • Log metrics, losses, gradients                                   │
│                                                                          │
│  4. HYPERPARAMETER TUNING                                                │
│     • Grid search, random search, or Bayesian optimization            │
│     • Cross-validation for robust estimates                            │
│     • Track all experiments                                            │
│                                                                          │
│  5. EVALUATION                                                           │
│     • Compute metrics on held-out test set                             │
│     • Error analysis (where does model fail?)                          │
│     • Fairness analysis (performance across groups)                    │
│                                                                          │
│  6. ARTIFACT STORAGE                                                     │
│     • Save model weights, config, preprocessing                        │
│     • Version everything (reproducibility)                             │
│     • Store evaluation results                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOOLS:                                                                  │
│  • Experiment tracking: MLflow, Weights & Biases, Neptune             │
│  • Pipeline orchestration: Airflow, Kubeflow, Prefect                 │
│  • Model registry: MLflow, Vertex AI, SageMaker                       │
│  • Hyperparameter tuning: Optuna, Ray Tune, Hyperopt                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

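To make "track all experiments" and "version everything" concrete, here is a minimal sketch using MLflow; the run name, parameters, and data_version tag are placeholders:

Python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def train_and_log(X_train, y_train, X_val, y_val, params, data_version):
    """Train one candidate model and record params, metrics, and artifacts."""
    with mlflow.start_run(run_name="gbm_candidate"):
        mlflow.log_params(params)
        mlflow.log_param("data_version", data_version)  # reproducibility: what data was used

        model = GradientBoostingClassifier(**params)
        model.fit(X_train, y_train)

        val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_metric("val_auc", val_auc)

        # Save the trained model as a versioned artifact
        mlflow.sklearn.log_model(model, "model")
        return model, val_auc
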
Evaluation Strategy

Proper evaluation prevents deploying models that look good but perform poorly:

Data Splits:

  • Training set (70-80%): Used to train the model
  • Validation set (10-15%): Used for hyperparameter tuning and early stopping
  • Test set (10-15%): Used only for final evaluation (never look until the end!)

Time-Based Splits (for temporal data):

Code
├─────────── Training ───────────┼── Val ──┼── Test ──┤
│     Jan 1 - Sep 30             │ Oct     │ Nov      │

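A minimal sketch of that time-based split, assuming a pandas DataFrame with an event_date column; the cutoff dates mirror the diagram and the year is arbitrary:

Python
import pandas as pd

def time_based_split(df: pd.DataFrame, date_col: str = "event_date"):
    """Split temporally: train on Jan-Sep, validate on Oct, test on Nov."""
    dates = pd.to_datetime(df[date_col])
    train = df[dates < "2024-10-01"]
    val = df[(dates >= "2024-10-01") & (dates < "2024-11-01")]
    test = df[(dates >= "2024-11-01") & (dates < "2024-12-01")]
    return train, val, test
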
Stratified Splits: Maintain class distribution in all splits (important for imbalanced data).

Cross-Validation: When data is limited, use k-fold cross-validation for more robust estimates.

Error Analysis

Understanding where your model fails is as important as knowing its overall accuracy:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ERROR ANALYSIS FRAMEWORK                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. CONFUSION MATRIX ANALYSIS                                            │
│     • False positives: What did we wrongly predict as positive?        │
│     • False negatives: What did we miss?                               │
│     • Are errors concentrated in certain classes?                      │
│                                                                          │
│  2. SLICE ANALYSIS                                                       │
│     • Performance by user segment (new vs returning)                   │
│     • Performance by item category                                     │
│     • Performance by time period                                       │
│     • Performance by feature values                                    │
│                                                                          │
│  3. HARD EXAMPLE MINING                                                  │
│     • Which examples have highest loss?                                │
│     • What patterns do they share?                                     │
│     • Are they mislabeled or genuinely hard?                          │
│                                                                          │
│  4. FEATURE IMPORTANCE                                                   │
│     • Which features drive predictions?                                │
│     • Are important features sensible?                                 │
│     • Are there leakage features?                                      │
│                                                                          │
│  5. CALIBRATION ANALYSIS                                                 │
│     • Do predicted probabilities match actual rates?                   │
│     • Is the model overconfident or underconfident?                   │
│     • Calibration by prediction range                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ACTIONABLE INSIGHTS:                                                    │
│  • Error patterns suggest features to add                              │
│  • Systematic errors suggest labeling issues                           │
│  • Slice performance gaps suggest fairness issues                      │
│  • Calibration issues affect threshold selection                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

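As a small example of the slice analysis idea, compute the same metric per segment and flag large gaps; the column names and the choice of AUC are illustrative:

Python
import pandas as pd
from sklearn.metrics import roc_auc_score

def slice_analysis(df: pd.DataFrame, segment_col: str, label_col="label", score_col="score"):
    """Compute AUC per segment and the gap vs the overall metric."""
    overall_auc = roc_auc_score(df[label_col], df[score_col])
    rows = []
    for segment, group in df.groupby(segment_col):
        if group[label_col].nunique() < 2:
            continue  # AUC is undefined if a slice contains only one class
        auc = roc_auc_score(group[label_col], group[score_col])
        rows.append({"segment": segment, "n": len(group), "auc": auc, "gap": auc - overall_auc})
    return pd.DataFrame(rows).sort_values("gap")
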
Offline Development Checklist

Before moving to deployment, verify:

  • Baseline model established and documented
  • Model architecture chosen with justification
  • Training pipeline automated and reproducible
  • Hyperparameter tuning completed
  • Offline metrics computed on held-out test set
  • Error analysis completed (understand failure modes)
  • Model artifacts saved and versioned
  • Training reproducible (same data + config = same model)

Common Pitfall: Overfitting to the validation set by iterating too many times. Always have a final holdout test set that you only evaluate once at the end.


Step 7: Online Execution, Testing and Evaluation

This step covers deploying the model to production and validating that it works in the real world. The gap between offline and online performance is often significant.

Deployment Strategies

How you deploy determines your risk exposure:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT STRATEGIES                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. SHADOW MODE (Lowest Risk)                                           │
│  ────────────────────────────                                            │
│  • New model runs alongside production                                 │
│  • Predictions logged but not served to users                          │
│  • Compare predictions offline                                          │
│  • Duration: 1-2 weeks                                                 │
│                                                                          │
│  Use for: Major model changes, new ML systems                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. CANARY DEPLOYMENT (Low Risk)                                        │
│  ────────────────────────────────                                        │
│  • Route small percentage to new model                                 │
│  • Monitor for errors, latency, metrics                                │
│  • Gradually increase: 1% → 5% → 10% → 25% → 50% → 100%               │
│  • Automated rollback on anomalies                                     │
│                                                                          │
│  Use for: Most deployments, standard practice                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. A/B TEST (For Business Impact)                                      │
│  ─────────────────────────────────                                       │
│  • Random 50/50 split between control and treatment                    │
│  • Run for statistical significance (usually 1-4 weeks)               │
│  • Measure impact on business metrics                                  │
│  • Make ship/no-ship decision based on results                        │
│                                                                          │
│  Use for: Validating business impact, major changes                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  4. BLUE-GREEN (Quick Rollback)                                         │
│  ───────────────────────────────                                         │
│  • Two identical environments (blue and green)                         │
│  • Deploy new model to inactive environment                            │
│  • Switch traffic instantly                                            │
│  • Instant rollback by switching back                                  │
│                                                                          │
│  Use for: When fast rollback is critical                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

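A minimal sketch of deterministic traffic splitting, which underlies both canary ramp-ups and A/B assignment; the hashing scheme shown is a common convention, not a prescribed one:

Python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, new_fraction: float = 0.05) -> str:
    """Deterministically route a user to 'new' or 'current' via a hash bucket.

    The salt keeps assignments independent across experiments; the same user
    always lands in the same bucket for a given experiment (no flip-flopping).
    """
    digest = hashlib.md5(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "new" if bucket < new_fraction * 10_000 else "current"

# Canary ramp: raise new_fraction from 0.01 -> 0.05 -> 0.10 -> ... via config,
# not code deploys, so rollback is just setting it back to 0.
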
A/B Testing Best Practices

A/B testing is how you validate that offline improvements translate to online gains:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    A/B TEST DESIGN                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BEFORE THE TEST:                                                        │
│  ─────────────────                                                       │
│  1. Define hypothesis: "New model will increase CTR by 5%"             │
│  2. Choose primary metric (one!)                                        │
│  3. List guardrail metrics (shouldn't regress)                        │
│  4. Calculate sample size for statistical power                        │
│  5. Define success criteria before starting                            │
│                                                                          │
│  DURING THE TEST:                                                        │
│  ────────────────                                                        │
│  1. Ensure random assignment (no selection bias)                       │
│  2. Monitor for technical issues (errors, latency)                    │
│  3. Don't peek at results repeatedly (multiple testing problem)       │
│  4. Run for full planned duration                                      │
│                                                                          │
│  AFTER THE TEST:                                                         │
│  ───────────────                                                         │
│  1. Analyze statistical significance                                   │
│  2. Check guardrail metrics                                            │
│  3. Look for heterogeneous effects (different by segment)             │
│  4. Document findings and decision                                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON MISTAKES:                                                        │
│                                                                          │
│  • Stopping early when results look good (inflates false positives)   │
│  • Ignoring guardrail regressions                                      │
│  • Not accounting for novelty effect                                   │
│  • Simpson's paradox (aggregate hides segment differences)            │
│  • Network effects (treatment affects control)                         │
│  • Too many metrics (multiple testing)                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Sample Size Calculation

Running tests with insufficient sample size leads to inconclusive or misleading results:

Key Parameters:

  • Baseline rate: Current performance (e.g., 2% CTR)
  • Minimum detectable effect (MDE): Smallest improvement worth detecting (e.g., 5% relative)
  • Statistical significance (α): Usually 0.05 (5% false positive rate)
  • Statistical power (1-β): Usually 0.80 (80% chance of detecting true effect)

Rule of Thumb: For small effects (1-5% relative change), you typically need:

  • 10K-50K users per variant for conversion metrics
  • 1K-10K users per variant for engagement metrics
  • More users for rare events (fraud, churn)

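Behind those rules of thumb is the standard normal-approximation formula for a two-proportion test. Here is a sketch, with the usual caveat that exact tests or continuity corrections shift the numbers slightly:

Python
from scipy.stats import norm

def samples_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)

    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for power = 0.80

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(round(n))

# Example: 20% baseline conversion rate, 5% relative MDE
# samples_per_variant(0.20, 0.05)  -> roughly 26K users per variant
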
Understanding the Statistics: Why These Numbers Matter

The parameters above encode important trade-offs. Understanding them helps you make better decisions about test design.

Statistical Significance (α = 0.05):

Statistical significance answers: "If the new model is actually no better than the old one, what's the probability we'd see results this extreme by chance?"

Setting α = 0.05 means we accept a 5% chance of a false positive—declaring the new model better when it's actually not. This seems low, but consider: if you run 20 A/B tests per year, you'd expect one false positive annually purely by chance.

Lower α (0.01) reduces false positives but requires larger sample sizes. Higher α (0.10) gets results faster but increases false positives. For high-stakes decisions (major product changes), use α = 0.01. For exploratory tests, α = 0.10 may be acceptable.

Statistical Power (1-β = 0.80):

Power answers a different question: "If the new model really is better by the amount we care about, what's the probability we'll detect it?"

With power = 0.80, there's a 20% chance of missing a real improvement—a false negative. We'd conclude "no significant difference" when the new model actually is better.

Higher power (0.90 or 0.95) reduces missed improvements but requires larger samples. The trade-off is test duration. With limited traffic, running at 0.80 power might take 2 weeks; 0.95 power might take 6 weeks. During those extra 4 weeks, if the new model is better, you're missing out on those gains.

Minimum Detectable Effect (MDE):

MDE is the smallest improvement worth detecting. If your baseline CTR is 2% and MDE is 5% relative, you're designing the test to detect changes from 2.0% to 2.1% or larger.

Smaller MDE requires more samples. A test designed to detect 1% relative change needs roughly 25x more users than one designed to detect 5% relative change. Be realistic about what improvements are meaningful. A 1% relative improvement in CTR might be worth millions in revenue for Google, but irrelevant for a startup. Don't design expensive tests for effects that don't matter to your business.

Why You Shouldn't Peek:

The math above assumes you check results once, at the planned end of the experiment. If you peek at results daily and stop when they look significant, you inflate false positive rates dramatically—potentially to 30% or higher.

This happens because random variation means results fluctuate. Even with no real effect, you'll see "significant" results at some point during the test due to chance. Peeking and stopping at that point locks in the false positive.

If you must monitor ongoing tests, use sequential testing methods (such as always-valid inference) that account for multiple looks. Or, define stopping rules in advance: "We'll check at 1 week. If p < 0.001, we can stop early. Otherwise, continue to 2 weeks."

Simpson's Paradox: The Hidden Danger:

Simpson's paradox occurs when aggregate results hide segment-level effects. Classic example:

  • Aggregate: New model has 2.5% CTR vs 2.4% CTR for control. Ship it!
  • Mobile users (60% of traffic): New model 1.8% vs control 2.0%. Worse!
  • Desktop users (40% of traffic): New model 3.6% vs control 3.0%. Better!

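Working through the weighted averages makes the trap concrete (numbers taken from the bullets above):

Python
# Weighted aggregate CTR from the segment numbers above (60% mobile, 40% desktop)
traffic_share = {"mobile": 0.60, "desktop": 0.40}
ctr_treatment = {"mobile": 0.018, "desktop": 0.036}
ctr_control = {"mobile": 0.020, "desktop": 0.030}

agg_treatment = sum(traffic_share[s] * ctr_treatment[s] for s in traffic_share)  # ~0.0252
agg_control = sum(traffic_share[s] * ctr_control[s] for s in traffic_share)      # 0.0240

print(f"treatment aggregate CTR: {agg_treatment:.4f}")  # looks "better" overall
print(f"control aggregate CTR:   {agg_control:.4f}")
# ...yet the mobile majority (60% of users) is worse off under the treatment.
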
As the arithmetic above shows, the aggregate win is carried entirely by the desktop minority: the mobile segment, which covers the majority of users, is actually worse off under the new model. If traffic composition also shifts between arms or over the test period, the aggregate becomes even more misleading.

Always check segment-level results. If the treatment effect varies by segment, investigate before shipping. You might discover that "improvement" only helps a minority while hurting the majority.

Monitoring During Deployment

What to watch during and after deployment:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT MONITORING                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SYSTEM HEALTH (Real-time):                                             │
│  ──────────────────────────                                              │
│  • Request rate (QPS)                                                  │
│  • Error rate (5xx, timeouts)                                          │
│  • Latency (p50, p95, p99)                                             │
│  • Resource utilization (CPU, memory, GPU)                             │
│                                                                          │
│  MODEL HEALTH (Near real-time):                                         │
│  ───────────────────────────────                                         │
│  • Prediction distribution (scores, classes)                           │
│  • Feature value distributions                                         │
│  • Null/missing feature rates                                          │
│  • Model version being served                                          │
│                                                                          │
│  BUSINESS METRICS (Hourly/Daily):                                       │
│  ─────────────────────────────────                                       │
│  • Primary metric (CTR, conversion, etc.)                             │
│  • Guardrail metrics                                                   │
│  • Segment-level performance                                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ALERTING RULES:                                                         │
│                                                                          │
│  Immediate (PagerDuty):                                                 │
│  • Error rate > 1%                                                     │
│  • p99 latency > 500ms                                                 │
│  • Prediction rate drops > 50%                                         │
│                                                                          │
│  Urgent (Slack):                                                        │
│  • Primary metric drops > 10%                                          │
│  • Feature null rate > 5%                                              │
│  • Prediction distribution shift                                       │
│                                                                          │
│  Daily Review:                                                          │
│  • All metric trends                                                   │
│  • A/B test progress                                                   │
│  • Error logs                                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Rollback Planning

Always have a rollback plan before deploying:

When to Rollback:

  • System health degradation (errors, latency)
  • Severe business metric regression
  • Unexpected behavior (even if metrics look okay)
  • User complaints spike

How to Rollback:

  • Instant: Switch traffic back to previous model
  • Gradual: Reduce new model percentage while monitoring
  • Feature flag: Disable new model via configuration

Rollback Checklist:

  • Previous model still deployed and healthy
  • Traffic routing configurable without code deploy
  • Rollback decision criteria documented
  • On-call knows rollback procedure
  • Tested rollback works before deployment

Online Testing Checklist

Before declaring success, verify:

  • Shadow mode completed (if applicable)
  • Canary deployment successful
  • A/B test designed with proper sample size
  • Statistical significance reached
  • Primary metric improved
  • Guardrail metrics stable
  • Rollback plan tested
  • Documentation updated

Common Pitfall: Declaring victory too early. Wait for statistical significance. Watch for novelty effects that fade. Monitor for delayed negative effects (churn shows up weeks later).


Step 8: Monitoring, Scaling and Continual Learning

The work doesn't end at deployment. ML systems require ongoing monitoring, maintenance, and improvement to remain effective.

Comprehensive Monitoring

Monitor at multiple levels:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MONITORING HIERARCHY                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LEVEL 1: INFRASTRUCTURE MONITORING                                     │
│  ─────────────────────────────────────                                   │
│  • Server health (CPU, memory, disk, network)                          │
│  • Service availability (uptime, error rates)                          │
│  • Request metrics (QPS, latency, throughput)                          │
│  • Dependencies (database, cache, feature store)                       │
│                                                                          │
│  Tools: Prometheus, Grafana, DataDog, CloudWatch                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LEVEL 2: MODEL MONITORING                                              │
│  ──────────────────────────                                              │
│  • Prediction distributions (score histograms, class ratios)          │
│  • Feature distributions (input drift detection)                       │
│  • Model version and configuration                                     │
│  • Inference latency breakdown                                         │
│                                                                          │
│  Tools: Evidently, Arize, Fiddler, custom logging                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LEVEL 3: DATA QUALITY MONITORING                                       │
│  ─────────────────────────────────                                       │
│  • Feature completeness (null rates)                                   │
│  • Feature freshness (staleness)                                       │
│  • Label availability and quality                                      │
│  • Training data pipeline health                                       │
│                                                                          │
│  Tools: Great Expectations, dbt tests, custom checks                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LEVEL 4: BUSINESS MONITORING                                           │
│  ─────────────────────────────                                           │
│  • Business KPIs (revenue, engagement, conversion)                     │
│  • Model-attributed metrics (recommendations CTR)                      │
│  • User feedback and complaints                                        │
│  • Segment-level performance                                           │
│                                                                          │
│  Tools: Amplitude, Mixpanel, Tableau, custom dashboards               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Implementing Drift Detection

Drift detection shouldn't wait for business metrics to drop—by then, the damage is done. The code below shows how to proactively detect when your model's input distributions shift, alerting you before performance degrades.

Python
from scipy import stats
import numpy as np

def detect_feature_drift(baseline_data, current_data, features, threshold=0.05):
    """
    Detect drift between baseline (training) and current (production) data.
    Uses Kolmogorov-Smirnov test for continuous features.
    Returns features that have drifted significantly.
    """
    drifted_features = []

    for feature in features:
        baseline_values = baseline_data[feature].dropna()
        current_values = current_data[feature].dropna()

        # KS test: are these distributions different?
        statistic, p_value = stats.ks_2samp(baseline_values, current_values)

        if p_value < threshold:
            drifted_features.append({
                "feature": feature,
                "p_value": p_value,
                "ks_statistic": statistic,
                "baseline_mean": baseline_values.mean(),
                "current_mean": current_values.mean(),
                "baseline_std": baseline_values.std(),
                "current_std": current_values.std(),
            })

    return drifted_features

def check_prediction_drift(baseline_predictions, current_predictions, threshold=0.1):
    """
    Check if model prediction distribution has shifted.
    A shift here suggests either input drift or concept drift.
    """
    baseline_mean = np.mean(baseline_predictions)
    current_mean = np.mean(current_predictions)

    # Guard against a zero baseline mean to avoid division by zero
    relative_change = abs(current_mean - baseline_mean) / max(abs(baseline_mean), 1e-12)

    return {
        "drifted": relative_change > threshold,
        "baseline_mean": baseline_mean,
        "current_mean": current_mean,
        "relative_change": relative_change
    }

This drift detection approach uses statistical tests to compare distributions. The Kolmogorov-Smirnov test asks: "What's the probability these two samples came from the same underlying distribution?" A low p-value (below your threshold) indicates the distributions are likely different—your feature has drifted.

The key insight is monitoring both input drift (feature distributions) and output drift (prediction distributions). Input drift tells you the world has changed—maybe your user demographics shifted or a data source changed format. Output drift tells you your model is behaving differently, even if you can't pinpoint why.

When drift is detected, you have options. Minor drift might just warrant closer monitoring. Significant drift on important features should trigger investigation—is this real-world change or a data pipeline bug? Severe drift might require immediate retraining or rollback to a fallback model.

Run these checks daily on a sample of production traffic compared to your training baseline. Store the results in a time series to spot gradual drift trends, not just sudden shifts. A feature that drifts 1% per week might not trigger alerts but will cause serious problems after a few months.
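
To catch that kind of slow creep, persist each run's drift statistics and look at the trend rather than a single comparison. Here is a minimal sketch under the assumption that you keep one KS statistic per feature per day; how you store the history is up to you.

Python
import numpy as np

def drift_trend_alert(daily_ks_statistics, window=30, slope_threshold=0.002):
    """Flag a feature whose KS statistic is creeping upward over the last `window` days.

    daily_ks_statistics: list of floats, oldest first, one value per day.
    """
    recent = np.asarray(daily_ks_statistics[-window:])
    if len(recent) < window:
        return False  # not enough history yet

    # Fit a line to the recent values; a persistent positive slope means gradual drift
    days = np.arange(len(recent))
    slope = np.polyfit(days, recent, 1)[0]
    return slope > slope_threshold

# Example: a feature drifting slowly, about 0.003 per day
history = list(np.linspace(0.02, 0.11, 30))
print(drift_trend_alert(history))  # True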

Detecting and Handling Drift

Model performance degrades over time as the world changes. Detect drift early:

Types of Drift:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TYPES OF DRIFT                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DATA DRIFT (Covariate Shift):                                          │
│  ──────────────────────────────                                          │
│  • Input feature distributions change                                  │
│  • Example: User demographics shift, new product categories           │
│  • Detection: Compare feature distributions over time                  │
│  • Response: Retrain on recent data                                   │
│                                                                          │
│  CONCEPT DRIFT:                                                          │
│  ───────────────                                                         │
│  • Relationship between features and target changes                    │
│  • Example: User preferences change, fraud patterns evolve            │
│  • Detection: Monitor prediction-outcome correlation                   │
│  • Response: Retrain with new labels                                  │
│                                                                          │
│  LABEL DRIFT:                                                            │
│  ─────────────                                                           │
│  • Target distribution changes                                         │
│  • Example: Fraud rate increases, conversion rate drops               │
│  • Detection: Monitor label/outcome distributions                      │
│  • Response: Adjust thresholds, retrain                               │
│                                                                          │
│  UPSTREAM DATA DRIFT:                                                    │
│  ─────────────────────                                                   │
│  • Data sources or schemas change                                      │
│  • Example: Logging format changes, new data provider                 │
│  • Detection: Schema validation, data quality checks                   │
│  • Response: Update pipelines, may need model changes                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DRIFT DETECTION METHODS:                                                │
│  • Statistical tests (KS test, chi-squared, PSI)                      │
│  • Distribution comparison (histograms, quantiles)                     │
│  • Model-based detection (classifier to distinguish old vs new)       │
│  • Performance degradation (monitor actual outcomes)                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
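
Of the detection methods listed, the PSI is simple enough to implement directly. The sketch below is one minimal version for continuous features, binned on the baseline distribution; the 0.1 / 0.25 thresholds in the docstring are a common rule of thumb, not a universal standard.

Python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """Population Stability Index between a baseline sample and a production sample.

    Rough convention: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift.
    """
    # Bin edges come from the baseline so both samples are bucketed identically
    edges = np.histogram_bin_edges(baseline, bins=bins)
    baseline_counts, _ = np.histogram(baseline, bins=edges)
    current_counts, _ = np.histogram(current, bins=edges)

    # Convert to proportions; epsilon avoids log(0) for empty buckets
    eps = 1e-6
    baseline_pct = baseline_counts / baseline_counts.sum() + eps
    current_pct = current_counts / current_counts.sum() + eps

    return float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))

# Example: production values shifted up by half a standard deviation
rng = np.random.default_rng(0)
print(population_stability_index(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000)))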

Retraining Strategy

Define when and how to retrain:

Retraining Triggers (a trigger sketch follows this list):

  • Scheduled: Fixed cadence (daily, weekly, monthly)
  • Performance-based: When metrics drop below threshold
  • Drift-based: When drift detected above threshold
  • Event-based: After major changes (new features, policy changes)

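In practice these triggers are combined in one small scheduler job. Below is a minimal sketch; the seven-day cadence, the metric floor, and the three-feature drift threshold are illustrative values, and the inputs are whatever your monitoring already produces.

Python
from datetime import datetime, timedelta

def should_retrain(last_trained, metric_value, metric_floor,
                   drifted_feature_count, max_cadence_days=7):
    """Combine scheduled, performance-based, and drift-based retraining triggers."""
    reasons = []

    # Scheduled: never go longer than the fixed cadence without retraining
    if datetime.now() - last_trained > timedelta(days=max_cadence_days):
        reasons.append("scheduled cadence exceeded")

    # Performance-based: a monitored metric fell below its agreed floor
    if metric_value < metric_floor:
        reasons.append("metric below threshold")

    # Drift-based: too many features flagged by the drift checks
    if drifted_feature_count >= 3:
        reasons.append("feature drift detected")

    return bool(reasons), reasons

print(should_retrain(datetime.now() - timedelta(days=10),
                     metric_value=0.71, metric_floor=0.75,
                     drifted_feature_count=1))
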
Retraining Pipeline:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AUTOMATED RETRAINING PIPELINE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRIGGER                                                                 │
│     │ (schedule, drift alert, or manual)                                │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  DATA PREPARATION                                               │   │
│  │  • Fetch recent training data                                   │   │
│  │  • Apply data quality checks                                    │   │
│  │  • Generate train/validation splits                             │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │                                                                    │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  MODEL TRAINING                                                 │   │
│  │  • Train with same hyperparameters (or re-tune)                │   │
│  │  • Track experiment in MLflow/W&B                              │   │
│  │  • Save model artifacts                                        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │                                                                    │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  VALIDATION                                                     │   │
│  │  • Compute offline metrics on holdout                          │   │
│  │  • Compare to current production model                         │   │
│  │  • Check for regressions on slices                             │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │                                                                    │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  AUTOMATED GATES                                                │   │
│  │  • Pass: Metrics improved or stable                            │   │
│  │  • Fail: Metrics regressed (alert, don't deploy)              │   │
│  │  • Flag: Unusual patterns (human review)                       │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │ (if passed)                                                        │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  DEPLOYMENT                                                     │   │
│  │  • Register in model registry                                   │   │
│  │  • Deploy via canary                                            │   │
│  │  • Monitor online metrics                                       │   │
│  │  • Auto-rollback if issues                                     │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  FREQUENCY GUIDELINES:                                                   │
│  • High-change domains (news, trending): Daily                        │
│  • Medium-change domains (e-commerce): Weekly                         │
│  • Low-change domains (fraud patterns): Monthly                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
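
The "automated gates" stage is usually just a comparison between the candidate's validation metrics and the production model's, with a flag path for anything unusual. A minimal sketch follows; the metric names, tolerances, and dictionary layout are assumptions for illustration.

Python
def retrain_gate(candidate_metrics, production_metrics,
                 primary="auc", tolerance=0.002, max_slice_regression=0.02):
    """Return 'pass', 'fail', or 'flag' for a freshly retrained candidate model.

    Metrics dicts look like {"auc": 0.81, "slices": {"new_users": 0.74, ...}}.
    """
    delta = candidate_metrics[primary] - production_metrics[primary]

    # Fail outright if the primary metric regressed beyond tolerance
    if delta < -tolerance:
        return "fail"

    # Flag for human review if any slice regressed noticeably
    for slice_name, prod_value in production_metrics.get("slices", {}).items():
        cand_value = candidate_metrics.get("slices", {}).get(slice_name, prod_value)
        if prod_value - cand_value > max_slice_regression:
            return "flag"

    return "pass"

print(retrain_gate(
    {"auc": 0.812, "slices": {"new_users": 0.70, "power_users": 0.86}},
    {"auc": 0.810, "slices": {"new_users": 0.74, "power_users": 0.85}},
))  # "flag": the new_users slice regressed by 0.04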

Feedback Loops

Close the loop between predictions and outcomes:

Collecting Feedback (a minimal join sketch follows this list):

  • Log predictions with unique IDs
  • Track outcomes when they occur
  • Join predictions to outcomes
  • Use for evaluation and retraining

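The sketch below shows that logging-and-join step with two pandas DataFrames keyed by a shared prediction_id; in production these would be log streams or warehouse tables, and the column names are illustrative.

Python
import pandas as pd

# Predictions logged at serving time, each with a unique ID
predictions = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3"],
    "score": [0.91, 0.12, 0.55],
    "model_version": ["v7", "v7", "v7"],
})

# Outcomes arrive later (clicks, purchases, fraud labels, ...)
outcomes = pd.DataFrame({
    "prediction_id": ["a1", "a3"],
    "outcome": [1, 0],
})

# Left join keeps predictions whose outcome hasn't arrived yet (outcome is NaN)
training_feedback = predictions.merge(outcomes, on="prediction_id", how="left")
print(training_feedback)
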
Feedback Types:

  • Immediate: Click/no-click, add-to-cart
  • Delayed: Purchase, subscription, churn
  • Labeled: Human review, customer support escalation
  • Implicit: Return visits, time spent, complaints

Avoiding Feedback Loops (see the exploration sketch after this list):

  • Model predictions influence user behavior
  • Changed behavior becomes training data
  • Model reinforces its own biases
  • Solution: Randomized exploration, propensity scoring
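
One common mitigation is epsilon-greedy exploration with propensity logging: serve a small fraction of requests with a random item and record the probability each shown item had of being served, so training can later reweight by it (inverse propensity scoring). The sketch below is illustrative; the 5% exploration rate and function name are assumptions.

Python
import random

def select_item(ranked_items, epsilon=0.05):
    """Mostly serve the model's top item, occasionally explore a random one.

    Returns (item, propensity), where propensity is the probability this item
    had of being shown under the policy; log it alongside the prediction.
    """
    n = len(ranked_items)
    if random.random() < epsilon:
        item = random.choice(ranked_items)
    else:
        item = ranked_items[0]

    # Top item: shown when exploiting, plus its share of exploration traffic
    if item == ranked_items[0]:
        propensity = (1 - epsilon) + epsilon / n
    else:
        propensity = epsilon / n
    return item, propensity

# Training examples can later be weighted by 1 / propensity
print(select_item(["item_42", "item_7", "item_99"]))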

Incident Response

When things go wrong (and they will):

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    INCIDENT RESPONSE RUNBOOK                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SEVERITY LEVELS:                                                        │
│  ────────────────                                                        │
│  P0 (Critical): Model completely down, major business impact           │
│  P1 (High): Significant degradation, metrics dropping fast             │
│  P2 (Medium): Minor degradation, localized issues                      │
│  P3 (Low): Cosmetic issues, no immediate impact                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMMEDIATE RESPONSE (P0/P1):                                            │
│  ────────────────────────────                                            │
│  1. Acknowledge alert (5 min)                                          │
│  2. Assess impact scope (10 min)                                       │
│  3. Decide: rollback, hotfix, or investigate (15 min)                 │
│  4. Execute decision (varies)                                          │
│  5. Communicate status to stakeholders                                 │
│                                                                          │
│  INVESTIGATION:                                                          │
│  ──────────────                                                          │
│  • Check recent deployments                                            │
│  • Check data pipeline health                                          │
│  • Check feature store health                                          │
│  • Check upstream dependencies                                         │
│  • Review error logs                                                   │
│                                                                          │
│  POST-INCIDENT:                                                          │
│  ──────────────                                                          │
│  • Write post-mortem (within 48 hours)                                │
│  • Identify root cause                                                 │
│  • Document what went wrong                                            │
│  • Create action items to prevent recurrence                           │
│  • Share learnings with team                                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON ISSUES AND SOLUTIONS:                                           │
│                                                                          │
│  Model serving slow:                                                    │
│  → Check GPU utilization, batch sizes, feature fetch latency          │
│                                                                          │
│  Predictions look wrong:                                                │
│  → Check feature values, model version, recent deployments            │
│                                                                          │
│  Metrics dropping:                                                      │
│  → Check for data drift, upstream changes, A/B test issues            │
│                                                                          │
│  Model not serving:                                                     │
│  → Check model server health, dependencies, configuration             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Monitoring and Continual Learning Checklist

Verify ongoing operations:

  • Monitoring dashboards for all levels (infra, model, data, business)
  • Alerting rules configured with appropriate thresholds
  • On-call rotation established
  • Drift detection automated
  • Retraining pipeline automated
  • Incident runbooks documented
  • Feedback loops closed (predictions → outcomes → training)
  • Regular model reviews scheduled

Common Pitfall: "Set and forget" mentality. Models degrade over time. Monitoring and retraining are non-negotiable. Build these into the system from day one.


Complete System Design Checklist

Use this comprehensive checklist to ensure thoroughness:

Problem Framing

  • Business objective defined and measurable
  • ML task type identified
  • Input and output specified
  • Ground truth source identified
  • Success criteria established
  • Failure modes understood
  • Stakeholders aligned

Scale and Latency

  • Peak QPS estimated
  • Latency budget defined (p50, p95, p99)
  • Data volume calculated
  • Storage requirements estimated
  • Growth projections documented
  • Cost estimate prepared

Metrics

  • Primary offline metric chosen
  • Primary online metric chosen
  • Guardrail metrics listed
  • Offline-online correlation validated
  • Statistical significance requirements defined

Architecture

  • High-level architecture diagram drawn
  • Critical path identified
  • Caching strategy defined
  • Failure modes with fallbacks designed
  • Capacity planning completed
  • Deployment strategy defined

Data

  • Data sources identified
  • Labeling strategy defined
  • Training-serving consistency ensured
  • Data quality assessed
  • Privacy requirements addressed

Offline Development

  • Baseline established
  • Model architecture chosen
  • Training pipeline automated
  • Offline metrics computed
  • Error analysis completed
  • Artifacts versioned

Online Execution

  • Shadow/canary deployment completed
  • A/B test designed and run
  • Statistical significance reached
  • Guardrails verified
  • Rollback plan tested

Monitoring and Operations

  • Monitoring dashboards created
  • Alerting rules configured
  • Drift detection automated
  • Retraining pipeline automated
  • Incident runbooks documented
  • Feedback loops closed

Summary

ML system design is more than building models—it's building reliable systems that deliver business value at scale. The framework in this post covers the complete lifecycle:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    KEY TAKEAWAYS                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. START WITH THE BUSINESS PROBLEM                                     │
│     Don't build a model without understanding why.                      │
│     Define success before writing code.                                 │
│                                                                          │
│  2. DESIGN FOR SCALE (BUT DON'T OVER-ENGINEER)                         │
│     Know your numbers: QPS, latency, data volume.                      │
│     Plan for 10x, implement for 1x.                                    │
│                                                                          │
│  3. METRICS DRIVE EVERYTHING                                            │
│     Offline metrics must correlate with online business impact.        │
│     Guardrails prevent unintended consequences.                        │
│                                                                          │
│  4. DATA QUALITY > MODEL COMPLEXITY                                     │
│     Most improvements come from better data, not fancier models.       │
│     Training-serving consistency is critical.                          │
│                                                                          │
│  5. VALIDATE IN PRODUCTION                                              │
│     Offline success doesn't guarantee online success.                  │
│     A/B test everything significant.                                   │
│                                                                          │
│  6. PLAN FOR FAILURE                                                    │
│     Everything will fail eventually.                                    │
│     Graceful degradation, fast rollback.                               │
│                                                                          │
│  7. MONITOR AND ITERATE                                                 │
│     Models degrade over time.                                           │
│     Continuous monitoring and retraining are mandatory.                │
│                                                                          │
│  THE BEST ML ENGINEERS:                                                 │
│  • Think in systems, not just models                                   │
│  • Obsess over data quality                                            │
│  • Measure everything that matters                                     │
│  • Design for failure and recovery                                     │
│  • Balance complexity with pragmatism                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

This framework is iterative. You'll revisit earlier steps as you learn more. Don't be afraid to adapt it to your specific needs—the principles matter more than the exact steps.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
