
ML System Design: A Complete Framework for Production Systems

A comprehensive framework for designing machine learning systems at scale. From problem framing to production monitoring—everything you need to build ML systems that actually work.


Why ML System Design Matters

Building a machine learning model is the easy part. Getting it to work reliably in production, at scale, serving millions of users—that's where most teams struggle.

The gap between "model works in notebook" and "model works in production" is enormous. Models that achieve 95% accuracy offline mysteriously underperform when deployed. Systems that work perfectly in testing collapse under real-world load. Features that take milliseconds to compute locally take seconds when serving live traffic.

This isn't a failure of machine learning—it's a failure of system design. The best ML engineers aren't necessarily the ones who can build the most sophisticated models. They're the ones who can design systems that reliably transform raw data into business value at scale.

This post provides a complete framework for ML system design. It's structured as eight sequential steps, each building on the previous. Whether you're designing a recommendation system, a fraud detection pipeline, or a search ranking model, the same principles apply.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ML SYSTEM DESIGN FRAMEWORK                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. PROBLEM FRAMING                                                      │
│     └──→ What business problem are we solving?                          │
│                                                                          │
│  2. SCALE & LATENCY REQUIREMENTS                                         │
│     └──→ How big and how fast?                                          │
│                                                                          │
│  3. METRICS                                                              │
│     └──→ How do we know if it's working?                                │
│                                                                          │
│  4. ARCHITECTURE                                                         │
│     └──→ How do the components fit together?                            │
│                                                                          │
│  5. DATA COLLECTION & PREPARATION                                        │
│     └──→ Where does the data come from?                                 │
│                                                                          │
│  6. OFFLINE MODEL DEVELOPMENT                                            │
│     └──→ Build and evaluate the model                                   │
│                                                                          │
│  7. ONLINE EXECUTION & TESTING                                           │
│     └──→ Deploy and validate in production                              │
│                                                                          │
│  8. MONITORING & CONTINUAL LEARNING                                      │
│     └──→ Keep it working over time                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  This framework is:                                                      │
│  • Sequential: Each step builds on the previous                         │
│  • Iterative: You'll revisit earlier steps as you learn more           │
│  • Universal: Applies to any ML system, not just deep learning         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Step 1: Problem Framing

The most common mistake in ML system design is starting with "let's build a model" instead of "what business problem are we solving?" This section ensures you're solving the right problem before writing any code.

Define the Business Objective

Every ML system exists to achieve a business goal. If you can't articulate that goal clearly, you shouldn't be building an ML system yet.

The business objective should be:

  • Measurable: You can track progress with numbers
  • Time-bound: There's a deadline or milestone
  • Impactful: Success meaningfully affects the business
  • Achievable: ML can plausibly help (not everything needs ML)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PROBLEM FRAMING EXAMPLES                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE 1: E-COMMERCE RECOMMENDATIONS                                   │
│  ─────────────────────────────────────                                   │
│  Problem: Users aren't discovering products they'd like                 │
│  Business Goal: Increase revenue per user by 20% in 6 months           │
│  ML Task: Personalized product recommendation (ranking)                 │
│  Success: Revenue per user increases, return visit rate up             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE 2: FRAUD DETECTION                                              │
│  ─────────────────────────────                                           │
│  Problem: Fraudulent transactions costing $10M/year                     │
│  Business Goal: Reduce fraud losses by 50% without hurting UX          │
│  ML Task: Binary classification (fraud vs legitimate)                   │
│  Success: Fraud losses drop, false positive rate stays <1%             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE 3: CONTENT MODERATION                                           │
│  ─────────────────────────────────                                       │
│  Problem: Harmful content reaching users, regulatory risk               │
│  Business Goal: Remove 99% of harmful content within 1 hour            │
│  ML Task: Multi-label classification (spam, hate, violence, etc.)      │
│  Success: Harmful content exposure drops, user reports decrease        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE 4: SEARCH RANKING                                               │
│  ───────────────────────────                                             │
│  Problem: Users not finding what they're looking for                    │
│  Business Goal: Improve search success rate from 60% to 80%            │
│  ML Task: Learning to rank (relevance scoring)                         │
│  Success: Users find items faster, search abandonment decreases        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Translate Business Goals to ML Tasks

Once you understand the business problem, translate it into a well-defined ML task. This requires answering several questions:

What type of ML problem is this?

  • Classification: Assign items to categories (spam/not spam, fraud/legitimate)
  • Regression: Predict a continuous value (price, time, probability)
  • Ranking: Order items by relevance or preference
  • Clustering: Group similar items together
  • Generation: Create new content (text, images, recommendations)

What is the input? Define exactly what data the model will receive at inference time. Be specific:

  • User features (demographics, history, context)
  • Item features (attributes, metadata, embeddings)
  • Context features (time, location, device, session)
  • Real-time signals (current page, recent actions)

What is the output? Define the exact format of the model's prediction:

  • A single score (probability, relevance)
  • A ranked list (top-N recommendations)
  • A category (class label)
  • Multiple labels (multi-label classification)
  • Generated content (text, embeddings)

What is the ground truth? How will you know if the model's prediction was correct? This is crucial because it determines how you'll train and evaluate the model:

  • Explicit labels: Human annotations, user ratings
  • Implicit labels: Clicks, purchases, time spent
  • Delayed labels: Fraud confirmed days later, subscription churn after months

Understand Failure Modes

Before building the system, understand how it can fail and what the consequences are. Different failure modes have different costs:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FAILURE MODE ANALYSIS                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FRAUD DETECTION EXAMPLE:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  False Negative (Missed Fraud):                                         │
│  • Cost: Direct financial loss                                          │
│  • Impact: $100-$10,000 per incident                                   │
│  • Tolerance: Minimize aggressively                                     │
│                                                                          │
│  False Positive (Blocked Legitimate):                                   │
│  • Cost: Lost sale + customer frustration                              │
│  • Impact: $50 lost sale + potential churn                             │
│  • Tolerance: Keep below 1% of transactions                            │
│                                                                          │
│  System Failure (Model Unavailable):                                    │
│  • Cost: Must have fallback (rule-based or allow all)                  │
│  • Impact: Either missed fraud or blocked sales                        │
│  • Tolerance: <0.01% of transactions affected                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDATION SYSTEM EXAMPLE:                                          │
│  ───────────────────────────────                                         │
│                                                                          │
│  Poor Recommendations:                                                   │
│  • Cost: User doesn't engage, opportunity cost                         │
│  • Impact: Lower CTR, reduced revenue                                  │
│  • Tolerance: Acceptable if better than random                         │
│                                                                          │
│  Offensive/Inappropriate Recommendations:                               │
│  • Cost: User trust damaged, potential PR crisis                       │
│  • Impact: User churn, brand damage                                    │
│  • Tolerance: Near zero—needs guardrails                               │
│                                                                          │
│  Filter Bubble (Too Similar):                                           │
│  • Cost: User misses content they'd enjoy                              │
│  • Impact: Long-term engagement decline                                │
│  • Tolerance: Monitor diversity metrics                                │
│                                                                          │
│  Cold Start (New User/Item):                                            │
│  • Cost: Can't personalize for new users/items                        │
│  • Impact: Poor initial experience                                     │
│  • Tolerance: Need explicit fallback strategy                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Problem Framing Checklist

Before moving to the next step, verify:

  • Business objective is clearly defined and measurable
  • ML task type identified (classification, ranking, etc.)
  • Input data specified (what the model receives)
  • Output format defined (what the model produces)
  • Ground truth source identified (how you'll get labels)
  • Success criteria established (what "good" looks like)
  • Failure modes understood (what can go wrong and the cost)
  • Stakeholder alignment confirmed (everyone agrees on goals)

Common Pitfall: Jumping to model architecture before understanding the problem. Spend more time here than you think you need. A well-framed problem is half-solved.


Step 2: Scale and Latency Requirements

This step prevents two equally bad outcomes: over-engineering a system for scale you don't have, or under-engineering for scale you'll hit in three months. Get the numbers right before designing the architecture.

Estimate Request Volume

Understand how many predictions your system needs to make:

Queries Per Second (QPS):

  • Average QPS: Total daily predictions / 86,400 seconds
  • Peak QPS: Usually 2-10x average (depends on traffic patterns)
  • Burst QPS: Short spikes, often 20-50x average

Example calculation for an e-commerce recommendation system:

Code
Daily active users: 10 million
Sessions per user per day: 2
Page views per session: 10
Recommendations per page: 1 call

Daily predictions: 10M × 2 × 10 = 200 million
Average QPS: 200M / 86,400 ≈ 2,300 QPS
Peak QPS (3x): ~7,000 QPS
Flash sale burst (10x): ~23,000 QPS
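
The same back-of-the-envelope estimate as a small sketch you can rerun with your own numbers. The peak and burst multipliers are assumptions to tune per product, not fixed constants:

Python
SECONDS_PER_DAY = 86_400

def estimate_qps(daily_active_users: int,
                 sessions_per_user: float,
                 page_views_per_session: float,
                 calls_per_page: float = 1.0,
                 peak_multiplier: float = 3.0,
                 burst_multiplier: float = 10.0) -> dict:
    """Rough QPS estimate; the multipliers are assumptions, not measurements."""
    daily_predictions = (daily_active_users * sessions_per_user
                         * page_views_per_session * calls_per_page)
    average_qps = daily_predictions / SECONDS_PER_DAY
    return {
        "daily_predictions": daily_predictions,
        "average_qps": average_qps,
        "peak_qps": average_qps * peak_multiplier,
        "burst_qps": average_qps * burst_multiplier,
    }

# The example above: 10M DAU, 2 sessions/day, 10 page views/session, 1 call/page
print(estimate_qps(10_000_000, 2, 10))
# average ≈ 2,300 QPS, peak ≈ 7,000 QPS, burst ≈ 23,000 QPS
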
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SCALE REFERENCE POINTS                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SCALE TIERS:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Tier 1 - Small (Startup/MVP):                                          │
│  • 1-100 QPS                                                            │
│  • Single server can handle                                             │
│  • Simple deployment, minimal infrastructure                           │
│  • Example: Early-stage product recommendations                        │
│                                                                          │
│  Tier 2 - Medium (Growing Product):                                     │
│  • 100-10,000 QPS                                                       │
│  • Multiple servers, load balancing required                           │
│  • Need caching, careful resource planning                             │
│  • Example: Mid-size e-commerce search                                 │
│                                                                          │
│  Tier 3 - Large (Mature Product):                                       │
│  • 10,000-1,000,000 QPS                                                 │
│  • Distributed systems, complex orchestration                          │
│  • Dedicated infrastructure team                                        │
│  • Example: Major social media feed ranking                            │
│                                                                          │
│  Tier 4 - Hyperscale (Big Tech):                                        │
│  • 1M+ QPS                                                              │
│  • Custom hardware, global distribution                                │
│  • Massive engineering investment                                       │
│  • Example: Google Search, Facebook News Feed                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RULE OF THUMB:                                                          │
│  Design for 10x your current scale, but don't implement it yet.        │
│  Document what changes at 10x and 100x.                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Determine Latency Requirements

Latency requirements depend on where the ML system fits in the user experience:

User-Facing (Synchronous): The model blocks the user's request. Every millisecond matters.

  • Search ranking: <100ms (users expect instant results)
  • Recommendations: <200ms (part of page load)
  • Fraud detection: <100ms (during checkout)
  • Autocomplete: <50ms (must feel instantaneous)

Background (Asynchronous): The model runs separately from the user's request.

  • Email classification: <1 second (user doesn't wait)
  • Content moderation: <1 minute (before content goes live)
  • Batch recommendations: Hours (precomputed)
  • Model retraining: Hours to days

Latency Percentiles: Don't just measure average latency. Measure percentiles:

  • p50: Median latency (half of requests are faster)
  • p95: 95th percentile (95% of requests are faster)
  • p99: 99th percentile (only 1% are slower)
  • p99.9: 99.9th percentile (tail latency)
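
A minimal sketch of computing these percentiles from raw request latencies with NumPy. The latency samples here are synthetic placeholders; in practice you would pull them from access logs or tracing:

Python
import numpy as np

# Synthetic request latencies in milliseconds (placeholder for real log data)
rng = np.random.default_rng(42)
latencies_ms = rng.lognormal(mean=3.5, sigma=0.5, size=100_000)

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")

# The mean hides the tail: a healthy average can coexist with a painful p99
print(f"mean: {latencies_ms.mean():.1f} ms")
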
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LATENCY BUDGET BREAKDOWN                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXAMPLE: PRODUCT RECOMMENDATION (200ms BUDGET)                          │
│  ───────────────────────────────────────────────                         │
│                                                                          │
│  Component                          Time (ms)    % of Budget            │
│  ─────────────────────────────────────────────────────────────          │
│  Network round-trip (user→server)      20           10%                 │
│  API gateway + auth                    10            5%                 │
│  Feature fetching (user features)      30           15%                 │
│  Feature fetching (item features)      40           20%                 │
│  Model inference                       50           25%                 │
│  Post-processing + filtering           20           10%                 │
│  Response serialization                10            5%                 │
│  Network round-trip (server→user)      20           10%                 │
│  ─────────────────────────────────────────────────────────────          │
│  Total                                200          100%                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHTS:                                                           │
│                                                                          │
│  • Model inference is only 25% of total latency                        │
│  • Feature fetching often dominates (35% here)                         │
│  • Network latency is fixed cost—can't optimize                        │
│  • Every component needs its own latency budget                        │
│                                                                          │
│  OPTIMIZATION PRIORITY:                                                  │
│  1. Feature fetching (cache, pre-compute)                              │
│  2. Model inference (smaller model, batching, hardware)                │
│  3. Post-processing (simplify business rules)                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Calculate Data and Storage Requirements

Understand the data volumes involved:

Training Data:

  • How much historical data do you need?
  • How often do you retrain?
  • What's the storage format and compression?

Feature Storage:

  • How many features per entity?
  • How many entities (users, items)?
  • What's the update frequency?

Model Artifacts:

  • Model weights (can be GBs for deep learning)
  • Preprocessing artifacts (tokenizers, scalers)
  • Configuration and metadata

Example storage calculation:

Code
Users: 10 million
Items: 1 million
User features: 200 floats × 4 bytes = 800 bytes/user
Item features: 500 floats × 4 bytes = 2,000 bytes/item

User feature store: 10M × 800B = 8 GB
Item feature store: 1M × 2KB = 2 GB
Training data (1 year): 500 GB
Model artifacts: 2 GB

Total: ~512 GB (manageable on single machine)
At 100M users: ~80 GB user features (needs distributed store)
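
The same storage arithmetic as a sketch you can rerun as entity counts grow. It assumes 4 bytes per float and ignores indexes, replication, and compression:

Python
BYTES_PER_FLOAT = 4
GB = 1e9  # decimal gigabytes, to match the estimate above

def feature_store_size_gb(num_entities: int, floats_per_entity: int) -> float:
    """Raw size of a dense float feature store, before replication or overhead."""
    return num_entities * floats_per_entity * BYTES_PER_FLOAT / GB

print(feature_store_size_gb(10_000_000, 200))    # 8.0 GB of user features
print(feature_store_size_gb(1_000_000, 500))     # 2.0 GB of item features
print(feature_store_size_gb(100_000_000, 200))   # 80.0 GB at 100M users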

Scale and Latency Checklist

Before moving forward, verify:

  • Peak QPS estimated (with seasonal/event variations)
  • Latency budget defined (p50, p95, p99)
  • Latency breakdown by component
  • Training data volume calculated
  • Feature storage requirements estimated
  • Growth projections documented (6 months, 1 year, 3 years)
  • Cost estimate prepared (compute, storage, network)

Common Pitfall: Designing for current scale, not future scale. Always plan for 10x growth, but don't implement it until you need it. Document what changes at each scale tier.


Step 3: Metrics (Offline and Online Evaluation)

Metrics are how you know if your ML system is working. The wrong metrics lead to optimizing the wrong thing—a common and costly mistake. This section establishes the measurements that will guide development and deployment.

The Metrics Hierarchy

ML systems need multiple types of metrics at different levels:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    METRICS HIERARCHY                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LEVEL 1: BUSINESS METRICS (What the company cares about)               │
│  ─────────────────────────────────────────────────────────               │
│  • Revenue, profit, cost savings                                        │
│  • User retention, engagement, satisfaction                             │
│  • Operational efficiency                                               │
│  Example: Revenue per user, monthly active users, NPS score            │
│                                                                          │
│           ↑ Correlated but not identical ↓                              │
│                                                                          │
│  LEVEL 2: ONLINE METRICS (What you A/B test)                            │
│  ────────────────────────────────────────────                            │
│  • User behavior directly influenced by the model                       │
│  • Measurable in real-time during experiments                          │
│  • Leading indicators of business metrics                               │
│  Example: Click-through rate, conversion rate, time on site            │
│                                                                          │
│           ↑ Should correlate ↓                                          │
│                                                                          │
│  LEVEL 3: OFFLINE METRICS (What you optimize during training)           │
│  ─────────────────────────────────────────────────────────               │
│  • Model quality on held-out test data                                  │
│  • Computed during development, before deployment                       │
│  • Should predict online performance (but often don't!)                │
│  Example: AUC, NDCG, precision, recall, F1                             │
│                                                                          │
│           ↑ Must map to ↓                                               │
│                                                                          │
│  LEVEL 4: MODEL METRICS (Technical health)                              │
│  ──────────────────────────────────────────                              │
│  • Training loss, validation loss                                       │
│  • Gradient norms, learning curves                                      │
│  • Inference latency, throughput                                        │
│  Example: Cross-entropy loss, p99 latency, GPU utilization             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CRITICAL INSIGHT:                                                       │
│  Optimizing offline metrics that don't correlate with online/business  │
│  metrics is the #1 cause of ML systems that "work" but don't deliver   │
│  value. Always validate the correlation.                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Choosing Offline Metrics

Offline metrics are computed on held-out test data during model development. Choose metrics appropriate to your ML task:

Classification Metrics:

Metric          When to Use                                    Limitation
────────────────────────────────────────────────────────────────────────────────────────────
Accuracy        Balanced classes, equal error costs            Misleading with imbalanced data
Precision       False positives are costly (spam filter)       Ignores false negatives
Recall          False negatives are costly (fraud, disease)    Ignores false positives
F1 Score        Balance precision and recall                   Assumes equal cost
AUC-ROC         Ranking quality, threshold-independent         Can be misleading with imbalance
AUC-PR          Imbalanced data, precision-recall trade-off    Less intuitive than ROC
Log Loss        Calibrated probabilities matter                Sensitive to confident wrong predictions

Ranking Metrics:

Metric          When to Use                                      Limitation
────────────────────────────────────────────────────────────────────────────────────────────
NDCG@K          Graded relevance (very/somewhat/not relevant)    Requires relevance labels
MAP             Binary relevance, multiple relevant items        All relevant items weighted equally
MRR             Single correct answer (Q&A, search)              Only considers first relevant result
Precision@K     Top-K results matter most                        Ignores ordering within K
Recall@K        Finding all relevant items matters               Ignores ranking quality

Regression Metrics:

Metric          When to Use                                    Limitation
────────────────────────────────────────────────────────────────────────────────────────────
MSE/RMSE        Penalize large errors heavily                  Sensitive to outliers
MAE             Robust to outliers                             Doesn't penalize large errors
MAPE            Relative error matters                         Undefined when actual is zero
R²              Explains variance in target                    Can be negative, misleading
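
A minimal sketch of computing a few of these offline metrics with scikit-learn on a held-out test set. The arrays are toy placeholders standing in for real labels and model outputs:

Python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, log_loss, ndcg_score,
                             mean_squared_error, mean_absolute_error)

# Toy classification data: true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("auc-roc:  ", roc_auc_score(y_true, y_prob))
print("log loss: ", log_loss(y_true, y_prob))

# Toy ranking data: graded relevance vs. model scores for a single query
relevance = np.array([[3, 2, 0, 1]])
scores = np.array([[0.9, 0.7, 0.4, 0.2]])
print("ndcg@3:   ", ndcg_score(relevance, scores, k=3))

# Toy regression data
y_reg_true = np.array([10.0, 12.5, 9.0])
y_reg_pred = np.array([11.0, 12.0, 8.5])
print("rmse:", mean_squared_error(y_reg_true, y_reg_pred) ** 0.5)
print("mae: ", mean_absolute_error(y_reg_true, y_reg_pred))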

Choosing Online Metrics

Online metrics are measured during A/B tests in production. They should be:

  • Sensitive: Detectable changes with reasonable sample sizes
  • Actionable: You can influence them with your model
  • Aligned: Correlated with business outcomes
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ONLINE METRICS BY USE CASE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RECOMMENDATIONS:                                                        │
│  ────────────────                                                        │
│  Primary: Click-through rate (CTR), Conversion rate                    │
│  Secondary: Revenue per session, Items viewed                          │
│  Guardrails: Latency, Diversity of recommendations                     │
│                                                                          │
│  SEARCH RANKING:                                                         │
│  ───────────────                                                         │
│  Primary: Click-through rate, Success rate (found what they wanted)    │
│  Secondary: Time to first click, Queries per session                   │
│  Guardrails: Zero-result rate, Latency                                 │
│                                                                          │
│  FRAUD DETECTION:                                                        │
│  ────────────────                                                        │
│  Primary: Fraud loss rate, Fraud detection rate                        │
│  Secondary: False positive rate, Manual review volume                  │
│  Guardrails: Customer friction (blocked legitimate), Latency           │
│                                                                          │
│  CONTENT MODERATION:                                                     │
│  ───────────────────                                                     │
│  Primary: Harmful content exposure rate, Removal accuracy              │
│  Secondary: Time to removal, Appeal rate                               │
│  Guardrails: False positive rate (wrongly removed), User reports       │
│                                                                          │
│  AD TARGETING:                                                           │
│  ─────────────                                                           │
│  Primary: Click-through rate, Cost per acquisition                     │
│  Secondary: Return on ad spend (ROAS), Impression to conversion        │
│  Guardrails: Ad fatigue, User complaints                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
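
Whether a metric is sensitive enough is ultimately a statistics question. As a sketch, here is a two-proportion z-test comparing click-through rate between a control and a treatment group; the click and impression counts are invented:

Python
from math import sqrt
from scipy.stats import norm

def ctr_z_test(clicks_a: int, impressions_a: int,
               clicks_b: int, impressions_b: int) -> tuple[float, float]:
    """Two-proportion z-test on CTR; returns (z, two-sided p-value)."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - norm.cdf(abs(z)))

# Hypothetical experiment: control vs. new ranking model
z, p = ctr_z_test(clicks_a=4_800, impressions_a=100_000,
                  clicks_b=5_150, impressions_b=100_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # small p: the CTR lift is unlikely to be noise
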

Guardrail Metrics

Guardrail metrics are things that should NOT get worse when you launch a new model. They prevent unintended consequences:

System Guardrails:

  • Latency (p50, p95, p99) should not regress
  • Error rate should not increase
  • Resource utilization should not spike

User Experience Guardrails:

  • User complaints should not increase
  • Negative feedback rate should stay stable
  • Session length should not decrease (unless intentional)

Business Guardrails:

  • Revenue should not decrease (even if engagement increases)
  • Customer support tickets should not spike
  • Churn should not increase

Fairness Guardrails:

  • Performance should not degrade for any user segment
  • Disparate impact across demographic groups should be monitored
  • Equal opportunity metrics should not regress across groups

Metrics Checklist

Before moving forward, verify:

  • Primary offline metric chosen (what you optimize during training)
  • Primary online metric chosen (what you A/B test)
  • Guardrail metrics listed (what shouldn't regress)
  • Offline-online correlation validated (or plan to validate)
  • Metric definitions documented precisely
  • Statistical significance requirements defined
  • Dashboards planned for all metric levels

Common Pitfall: Optimizing offline metrics that don't correlate with online business metrics. A model with higher AUC doesn't always produce higher CTR. Validate the correlation early and often.


Step 4: Architecting for Scale

This is where you design the system that will run your ML model in production. The goal is a system that meets your latency requirements, handles your scale, and fails gracefully when things go wrong.

High-Level Architecture Patterns

Most ML systems follow one of a few common patterns:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 1: ONLINE INFERENCE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  User Request → API Gateway → Model Server → Response                   │
│                                     ↑                                    │
│                              Feature Store                              │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Real-time predictions (synchronous)                                  │
│  • Low latency required (<200ms typical)                               │
│  • Features computed on-demand or pre-computed                         │
│                                                                          │
│  USE CASES:                                                              │
│  • Search ranking                                                       │
│  • Real-time recommendations                                           │
│  • Fraud detection at checkout                                         │
│                                                                          │
│  CHALLENGES:                                                             │
│  • Latency optimization critical                                       │
│  • Feature freshness vs computation cost                               │
│  • Handling traffic spikes                                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 2: BATCH INFERENCE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Scheduled Job → Data Warehouse → Model → Prediction Store → Serve     │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Pre-computed predictions (asynchronous)                              │
│  • Higher latency acceptable (minutes to hours)                        │
│  • Predictions cached and served from store                            │
│                                                                          │
│  USE CASES:                                                              │
│  • Email recommendations (daily digest)                                │
│  • Risk scoring for all customers                                      │
│  • Content pre-ranking for feeds                                       │
│                                                                          │
│  CHALLENGES:                                                             │
│  • Predictions may be stale                                            │
│  • Can't personalize to real-time context                             │
│  • Storage costs for all predictions                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 3: HYBRID (MOST COMMON)                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                    ┌───────────────┐                                    │
│                    │ Batch Layer   │ Pre-compute candidates             │
│                    │ (Candidate    │ (hourly/daily)                     │
│                    │  Generation)  │                                    │
│                    └───────┬───────┘                                    │
│                            ↓                                             │
│  User Request → ┌─────────────────────┐ → Response                     │
│                 │ Online Layer        │                                  │
│                 │ (Real-time Ranking) │                                  │
│                 └─────────────────────┘                                  │
│                            ↑                                             │
│                    Feature Store                                        │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Batch layer generates candidates (1000s of items)                   │
│  • Online layer ranks/filters for real-time context                    │
│  • Best of both worlds: coverage + freshness                           │
│                                                                          │
│  USE CASES:                                                              │
│  • Large-scale recommendations (Netflix, YouTube)                      │
│  • Search with pre-built index + real-time ranking                    │
│  • Ad serving with pre-filtered inventory                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                    PATTERN 4: STREAMING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Event Stream → Stream Processor → Model → Action/Store                │
│       ↓                                                                  │
│  Feature Updates                                                        │
│                                                                          │
│  CHARACTERISTICS:                                                        │
│  • Process events as they arrive                                       │
│  • Near real-time (seconds to minutes)                                 │
│  • Features updated continuously                                       │
│                                                                          │
│  USE CASES:                                                              │
│  • Real-time fraud detection on transactions                           │
│  • Anomaly detection in logs/metrics                                   │
│  • Dynamic pricing updates                                             │
│                                                                          │
│  CHALLENGES:                                                             │
│  • Complex infrastructure (Kafka, Flink)                               │
│  • Exactly-once processing guarantees                                  │
│  • Debugging streaming pipelines                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Detailed System Architecture

Let's walk through a detailed architecture for a recommendation system—one of the most common ML systems:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    RECOMMENDATION SYSTEM ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USER REQUEST                                                            │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  LOAD BALANCER / API GATEWAY                                    │   │
│  │  • Rate limiting, authentication                                │   │
│  │  • Route to appropriate service                                 │   │
│  │  • Latency: ~5ms                                                │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  CANDIDATE GENERATION                                           │   │
│  │                                                                  │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │   │
│  │  │ Retrieval 1 │  │ Retrieval 2 │  │ Retrieval 3 │             │   │
│  │  │ (Popular)   │  │ (Similar)   │  │ (Personal)  │             │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘             │   │
│  │         │               │               │                       │   │
│  │         └───────────────┼───────────────┘                       │   │
│  │                         ▼                                        │   │
│  │                  Merge & Dedupe                                  │   │
│  │                  (~500-1000 candidates)                          │   │
│  │                                                                  │   │
│  │  Latency: ~30ms (parallel retrieval)                            │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  FEATURE FETCHING                                               │   │
│  │                                                                  │   │
│  │  ┌─────────────────┐     ┌─────────────────┐                   │   │
│  │  │ User Features   │     │ Item Features   │                   │   │
│  │  │ (from cache)    │     │ (batch lookup)  │                   │   │
│  │  └─────────────────┘     └─────────────────┘                   │   │
│  │           │                      │                              │   │
│  │           └──────────┬───────────┘                              │   │
│  │                      ▼                                           │   │
│  │              Feature Assembly                                    │   │
│  │              (user × item pairs)                                │   │
│  │                                                                  │   │
│  │  Latency: ~40ms (often the bottleneck!)                         │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  RANKING MODEL                                                  │   │
│  │                                                                  │   │
│  │  • Score all candidates (500-1000 items)                       │   │
│  │  • Batch inference for efficiency                               │   │
│  │  • GPU/TPU for complex models                                   │   │
│  │                                                                  │   │
│  │  Latency: ~50ms (depends on model complexity)                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  POST-PROCESSING / BUSINESS RULES                               │   │
│  │                                                                  │   │
│  │  • Filter: Remove already seen, out of stock, blocked          │   │
│  │  • Diversify: Don't show all items from same category          │   │
│  │  • Boost: Promote sponsored items, new releases                │   │
│  │  • Explain: Generate "Because you watched X" text              │   │
│  │                                                                  │   │
│  │  Latency: ~15ms                                                 │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  RESPONSE (Top 10-50 recommendations)                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOTAL LATENCY BUDGET: ~150ms                                           │
│                                                                          │
│  SUPPORTING SYSTEMS:                                                     │
│  • Feature Store: Redis/DynamoDB for real-time features               │
│  • Model Registry: MLflow/Vertex AI for model versioning              │
│  • Logging: Kafka → Data Warehouse for training data                  │
│  • Monitoring: Prometheus/DataDog for metrics and alerts              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
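
The candidate-generation and ranking stages above boil down to a merge-and-dedupe across retrieval sources followed by one batched scoring call. A minimal sketch of that control flow, with the retrievers and the scoring function left as placeholders:

Python
from typing import Callable, Iterable

def generate_candidates(retrievers: Iterable[Callable[[str], list[str]]],
                        user_id: str, limit: int = 1000) -> list[str]:
    """Run each retrieval source, then merge and dedupe while preserving order."""
    seen: set[str] = set()
    merged: list[str] = []
    for retrieve in retrievers:                 # e.g. popular, similar-items, personalized
        for item_id in retrieve(user_id):
            if item_id not in seen:
                seen.add(item_id)
                merged.append(item_id)
    return merged[:limit]

def rank(user_id: str, candidates: list[str],
         score_batch: Callable[[str, list[str]], list[float]],
         top_k: int = 50) -> list[str]:
    """Score all candidates in one batched model call, then keep the top-K."""
    scores = score_batch(user_id, candidates)   # one inference call, not one per item
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:top_k]]
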

The Feature Store: Heart of ML Infrastructure

Feature stores are critical infrastructure that solves several hard problems:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE STORE ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│  • Features computed differently for training vs serving (skew!)       │
│  • Same features recomputed across multiple models (waste!)            │
│  • Feature logic scattered across notebooks and services (chaos!)      │
│                                                                          │
│  THE SOLUTION: FEATURE STORE                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    FEATURE STORE                                │   │
│  │                                                                  │   │
│  │  ┌───────────────────────────────────────────────────────────┐  │   │
│  │  │  FEATURE DEFINITIONS (code/config)                        │  │   │
│  │  │  • user_total_purchases (SUM of purchases, 30 days)      │  │   │
│  │  │  • item_avg_rating (AVG of ratings)                      │  │   │
│  │  │  • user_item_affinity (embedding similarity)             │  │   │
│  │  └───────────────────────────────────────────────────────────┘  │   │
│  │                          │                                       │   │
│  │           ┌──────────────┼──────────────┐                       │   │
│  │           ▼              ▼              ▼                       │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐               │   │
│  │  │ Offline     │ │ Online      │ │ Streaming   │               │   │
│  │  │ Store       │ │ Store       │ │ Updates     │               │   │
│  │  │ (training)  │ │ (serving)   │ │ (real-time) │               │   │
│  │  │             │ │             │ │             │               │   │
│  │  │ Data Lake   │ │ Redis/      │ │ Kafka →     │               │   │
│  │  │ Parquet     │ │ DynamoDB    │ │ Flink       │               │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘               │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  BENEFITS:                                                               │
│  • Single source of truth for feature definitions                      │
│  • Training-serving consistency guaranteed                             │
│  • Feature reuse across models                                         │
│  • Point-in-time correct training data                                 │
│  • Monitoring for feature drift                                        │
│                                                                          │
│  POPULAR OPTIONS:                                                        │
│  • Feast (open source)                                                 │
│  • Tecton (managed)                                                    │
│  • Vertex AI Feature Store (GCP)                                       │
│  • Amazon SageMaker Feature Store (AWS)                                │
│  • Databricks Feature Store                                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Defining Features: The Foundation of Consistency

A feature definition should be a single source of truth that both training and serving pipelines reference. The code below shows how to define features in a way that prevents training-serving skew—the feature computation logic lives in one place and is used everywhere.

Python
from feast import Entity, Feature, FeatureView, FileSource, ValueType
from datetime import timedelta

# Define the entity (what we're computing features for)
user = Entity(
    name="user_id",
    value_type=ValueType.STRING,
    description="Unique user identifier"
)

# Define the data source
user_activity_source = FileSource(
    path="s3://features/user_activity.parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created_timestamp"
)

# Define the feature view - THIS IS THE SINGLE SOURCE OF TRUTH
user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),  # How long features are valid
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_purchase_value_30d", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
        Feature(name="favorite_category", dtype=ValueType.STRING),
    ],
    online=True,   # Serve from online store (Redis)
    offline=True,  # Available for training (data warehouse)
    source=user_activity_source,
)

This feature definition using Feast (a popular open-source feature store) illustrates several critical concepts. The FeatureView defines exactly what features exist and how they're computed. Both the training pipeline and serving pipeline reference this same definition—there's no separate "training features" and "serving features" that could drift apart.

The ttl (time-to-live) parameter specifies how long computed features remain valid. For user purchase behavior, one day makes sense—preferences don't change hourly. For real-time features like "items currently in cart," you'd use a shorter TTL.

The online=True flag enables serving from a low-latency store (typically Redis or DynamoDB), while offline=True makes the same features available in the data warehouse for training. When you fetch features for training, the feature store handles point-in-time correctness—it retrieves the feature values as they existed at the time of each training example, preventing future data leakage.

The power of this approach is enforcement. When a data scientist wants to add a new feature, they define it in the feature store. When the serving team needs that feature at inference time, they fetch it from the same store. There's no opportunity for the training and serving computations to diverge.
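
To make the single-source-of-truth point concrete, here is a sketch of how both pipelines would fetch those features with Feast: the training path via get_historical_features (point-in-time correct joins) and the serving path via get_online_features. The repo path, user IDs, and timestamps are placeholders, and the exact call signatures vary slightly across Feast versions:

Python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # placeholder: wherever your feature repo lives

FEATURES = [
    "user_features:purchase_count_30d",
    "user_features:avg_purchase_value_30d",
    "user_features:days_since_last_purchase",
]

# Training path: point-in-time correct join against labeled examples
entity_df = pd.DataFrame({
    "user_id": ["u_001", "u_002"],                                    # hypothetical users
    "event_timestamp": pd.to_datetime(["2024-01-10", "2024-01-11"]),  # label timestamps
})
training_df = store.get_historical_features(entity_df=entity_df, features=FEATURES).to_df()

# Serving path: the same feature definitions, read from the online store at request time
online_features = store.get_online_features(
    features=FEATURES,
    entity_rows=[{"user_id": "u_001"}],
).to_dict()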

Caching Strategies

Caching is essential for meeting latency requirements at scale:

What to Cache:

  • User features: Change infrequently, cache for hours
  • Item features: Change infrequently, cache for hours
  • Candidate lists: Pre-computed, cache for minutes
  • Model predictions: If inputs are cacheable, cache results

Cache Invalidation:

  • TTL-based: Expire after fixed time (simple but may be stale)
  • Event-based: Invalidate when underlying data changes (fresh but complex)
  • Hybrid: TTL with event-based refresh for critical changes

Cache Hierarchy:

Code
Request → L1 Cache (in-process, microseconds)
       → L2 Cache (Redis, milliseconds)
       → L3 Cache (distributed, tens of milliseconds)
       → Origin (database/compute, hundreds of milliseconds)
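
A minimal sketch of that read-through hierarchy, with the L1 layer as a tiny in-process TTL cache and the lower layers stubbed out as callables (a real system would back them with Redis or a distributed cache):

Python
import time
from typing import Any, Callable, Optional

class TTLCache:
    """Tiny in-process (L1) cache with TTL-based expiry."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.time(), value)

def read_through(key: str, l1: TTLCache,
                 l2_get: Callable[[str], Optional[Any]],
                 origin: Callable[[str], Any]) -> Any:
    """Check L1, then L2 (e.g. Redis), then fall back to the origin and backfill L1."""
    value = l1.get(key)
    if value is not None:
        return value
    value = l2_get(key)      # placeholder for a Redis/DynamoDB lookup
    if value is None:
        value = origin(key)  # expensive path: database query or feature computation
    l1.set(key, value)
    return value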

Failure Modes and Fallbacks

Design for failure. Every component will fail eventually:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FAILURE MODES AND FALLBACKS                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  COMPONENT              FAILURE MODE              FALLBACK               │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Model Server           Timeout/Crash            Return popular items   │
│                                                   (pre-computed)         │
│                                                                          │
│  Feature Store          Slow/Unavailable         Use cached features    │
│                                                   (may be stale)         │
│                                                                          │
│  Candidate Generation   No results               Expand to global       │
│                                                   popular items          │
│                                                                          │
│  Real-time Features     Missing                  Use default values     │
│                                                   (document in training) │
│                                                                          │
│  Ranking Model          Wrong predictions        Circuit breaker →      │
│                                                   fallback model         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GRACEFUL DEGRADATION HIERARCHY:                                        │
│                                                                          │
│  1. Full personalization (everything works)                            │
│  2. Partial personalization (some features missing)                    │
│  3. Segment-based (user cohort defaults)                               │
│  4. Global popular (same for everyone)                                 │
│  5. Static fallback (hardcoded list)                                   │
│                                                                          │
│  Each level should be tested and monitored.                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
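
The degradation hierarchy can be written as an ordered chain of strategies, each tried in turn and each logged so you can monitor how often you are degrading. A sketch, with the individual strategies left as placeholders:

Python
import logging
from typing import Callable, Optional

logger = logging.getLogger("recommender.fallbacks")

def recommend_with_fallbacks(
    user_id: str,
    strategies: list[tuple[str, Callable[[str], Optional[list[str]]]]],
    static_fallback: list[str],
) -> list[str]:
    """Try each strategy in order: full personalization, partial, segment, popular."""
    for name, strategy in strategies:
        try:
            results = strategy(user_id)
            if results:
                logger.info("served_by=%s user=%s", name, user_id)  # track degradation rate
                return results
        except Exception:
            logger.exception("strategy_failed=%s user=%s", name, user_id)
    return static_fallback  # level 5: hardcoded list, should almost never be hit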

Architecture Checklist

Before moving forward, verify:

  • High-level architecture diagram drawn
  • Critical path identified (what blocks the response)
  • Bottlenecks analyzed (CPU, memory, network, disk)
  • Caching strategy defined (what's cached, TTL, invalidation)
  • Feature store design completed (or decision to not use one)
  • Failure modes identified with fallback strategies
  • Capacity planning done (servers, memory, cost)
  • Deployment strategy defined (blue-green, canary, etc.)

Common Pitfall: Over-engineering for scale you don't have yet, or under-engineering for scale you'll hit in 3 months. Document what changes at each scale tier and implement incrementally.


Step 5: Data Collection and Preparation

Data is the foundation of any ML system. This step ensures you have the right data, properly prepared, flowing reliably through the system.

Identify Data Sources

Start by mapping all data sources relevant to your ML problem:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA SOURCE INVENTORY                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SOURCE TYPE        EXAMPLES                    CONSIDERATIONS          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  User Behavior      Clicks, views, purchases    High volume, real-time │
│  Logs               Time on page, scroll depth  Privacy considerations │
│                                                                          │
│  User Profile       Demographics, preferences   PII handling required  │
│  Data               Account settings            May be incomplete      │
│                                                                          │
│  Item Catalog       Product attributes,         Regular updates needed │
│                     descriptions, images        Quality varies         │
│                                                                          │
│  Transaction        Purchases, refunds,         Delayed feedback       │
│  Data               subscriptions               (churn after months)   │
│                                                                          │
│  External Data      Weather, events,            API costs, reliability │
│                     market prices               Terms of use           │
│                                                                          │
│  User Feedback      Ratings, reviews,           Sparse, biased sample │
│                     surveys                      Selection effects      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FOR EACH SOURCE, DOCUMENT:                                             │
│  • Owner: Who controls this data?                                      │
│  • Access: How do we get it? (API, database, stream)                  │
│  • Freshness: How often is it updated?                                 │
│  • Quality: What's missing or incorrect?                               │
│  • Volume: How much data? Growth rate?                                 │
│  • Sensitivity: PII? Regulatory constraints?                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Define Your Labeling Strategy

Labels are the ground truth your model learns from. The labeling strategy is crucial:

Explicit Labels (Human-Generated):

  • Human annotators label data
  • High quality but expensive and slow
  • Subject to inter-annotator disagreement
  • Examples: Content moderation labels, search relevance judgments

Implicit Labels (Behavior-Derived):

  • Derived from user behavior
  • Abundant and free but noisy
  • May not reflect true preferences
  • Examples: Clicks (positive), impressions without clicks (negative?)

Programmatic Labels (Rule-Based):

  • Generated by rules or heuristics
  • Scalable but limited to rule coverage
  • Good for bootstrapping
  • Examples: Regex-based spam detection, threshold-based anomalies
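
As a small illustration of the programmatic approach, here is a sketch of rule-based weak labeling for spam; the patterns and thresholds are invented for the example:

Python
import re

SPAM_PATTERNS = [
    re.compile(r"free money", re.IGNORECASE),
    re.compile(r"click here now", re.IGNORECASE),
]

def label_message(text, sender_account_age_days):
    """Return a weak label: 1 = spam, 0 = not spam, None = abstain."""
    if any(p.search(text) for p in SPAM_PATTERNS):
        return 1
    if len(re.findall(r"https?://", text)) >= 3:  # link-heavy messages
        return 1
    if sender_account_age_days > 365 and len(text) > 50:
        return 0  # established senders with substantive messages: likely ham
    return None  # rules don't cover this case; leave unlabeled

Labels like these are noisy, but they can bootstrap a first model or be combined with other weak sources before investing in human annotation.
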
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LABELING STRATEGIES BY USE CASE                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RECOMMENDATION SYSTEMS:                                                 │
│  ───────────────────────                                                 │
│  Positive labels: Clicks, purchases, long engagement                   │
│  Negative labels: Impressions without interaction (tricky!)            │
│  Challenge: Exposure bias (only see what was shown)                    │
│  Solution: Use propensity scoring, randomized exploration              │
│                                                                          │
│  FRAUD DETECTION:                                                        │
│  ────────────────                                                        │
│  Positive labels: Confirmed fraud (chargebacks, investigations)        │
│  Negative labels: Transactions not reported as fraud                   │
│  Challenge: Label delay (fraud discovered weeks later)                 │
│  Solution: Retrain with mature labels, use intermediate signals        │
│                                                                          │
│  CONTENT MODERATION:                                                     │
│  ───────────────────                                                     │
│  Positive labels: Human-reviewed violations                            │
│  Negative labels: Content not flagged as violating                     │
│  Challenge: Policy changes, subjective judgments                       │
│  Solution: Regular recalibration, multiple annotators                  │
│                                                                          │
│  SEARCH RANKING:                                                         │
│  ───────────────                                                         │
│  Positive labels: Clicks, successful sessions                          │
│  Negative labels: Skipped results, reformulated queries                │
│  Challenge: Position bias (top results get more clicks)                │
│  Solution: Position debiasing, randomized experiments                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Deep Dive: Why These Labeling Challenges Matter

The diagram above summarizes labeling strategies, but understanding why each challenge exists—and how to solve it—is crucial for building systems that actually work.

Exposure Bias in Recommendations:

When you train a recommendation model, your training data consists of items that were shown to users and whether they clicked. But here's the problem: you only have labels for items that were shown. Items that were never shown have no labels—not because users wouldn't like them, but because your previous model never recommended them.

This creates a feedback loop. Your model learns to recommend items similar to what it already recommends, because those are the only items with positive labels. Items that might be excellent but were never shown remain hidden. Over time, this narrows the diversity of recommendations and can trap users in "filter bubbles."

The solution involves two strategies. First, inverse propensity scoring weights each training example by the inverse of the probability that the item would have been shown. If an item had only a 1% chance of being shown but was clicked, that click is a much stronger signal than a click on an item that was shown to everyone, so it gets a larger weight. Second, randomized exploration deliberately shows some items at random (not based on model scores) to collect unbiased feedback. This sacrifices a little short-term engagement for long-term model improvement.

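Here is a minimal sketch of the first strategy, assuming your logs provide an estimated propensity (probability of exposure) per example; the column names and the use of logistic regression are illustrative:

Python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_with_ipw(df: pd.DataFrame, feature_cols, min_propensity=0.01):
    """
    Train a click model where each example is weighted by 1 / P(item was shown).
    df must contain a 'clicked' label and a 'propensity' column estimated from logs.
    """
    # Clip tiny propensities so a few rare exposures don't dominate training
    propensity = df["propensity"].clip(lower=min_propensity)
    weights = 1.0 / propensity

    model = LogisticRegression(max_iter=1000)
    model.fit(df[feature_cols], df["clicked"], sample_weight=weights)
    return model
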
Label Delay in Fraud Detection:

Fraud labels arrive late—sometimes weeks or months after the transaction. A customer disputes a charge, the bank investigates, and eventually confirms fraud. During this delay, your model is training on incomplete data. Recent transactions are labeled "not fraud" simply because there hasn't been time for fraud to be reported.

This creates a systematic bias toward under-detecting fraud. Your model sees recent "not fraud" labels that are actually "fraud not yet discovered" and learns that certain patterns are safe when they're not.

The solution requires training on "mature" labels—data old enough that most fraud would have been discovered. If fraud typically surfaces within 30 days, train on data at least 45 days old. For real-time scoring, use intermediate signals (account flags, suspicious pattern matches) that indicate elevated risk before fraud is confirmed.

Position Bias in Search:

Users are far more likely to click results at the top of the page, regardless of relevance. A mediocre result in position 1 gets more clicks than an excellent result in position 5. If you train naively on click data, your model learns that "being shown first" predicts clicks—which is true but useless, since you're trying to learn which results deserve to be first.

Position debiasing corrects for this. During training, you weight clicks by the inverse of position bias—a click on position 5 counts more than a click on position 1, because it happened despite the position disadvantage. Some teams run randomized experiments where result order is shuffled, collecting unbiased click data at the cost of degraded user experience during the experiment.

Training-Serving Consistency

One of the most common and insidious bugs in ML systems is training-serving skew: features computed differently during training versus serving.

Sources of Skew:

  1. Code skew: Different implementations in training (Python/Spark) vs serving (Java/C++)
  2. Data skew: Different data sources or freshness
  3. Time skew: Using future information during training (leakage)
  4. Processing skew: Different preprocessing (normalization, encoding)

Solutions:

  • Shared feature definitions: Single source of truth (feature store)
  • Logged features: Log features at serving time, use for training
  • Feature validation: Compare training and serving feature distributions
  • Integration tests: End-to-end tests that catch skew
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TRAINING-SERVING CONSISTENCY                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ANTI-PATTERN (Causes Skew):                                            │
│  ───────────────────────────                                             │
│                                                                          │
│  Training Pipeline:                                                      │
│  Raw Data → Spark Job → Feature Engineering → Training                 │
│                         (Python code A)                                  │
│                                                                          │
│  Serving Pipeline:                                                       │
│  Request → API Server → Feature Engineering → Model → Response         │
│                         (Java code B)                                    │
│                                                                          │
│  Problem: Code A and Code B may compute features differently!          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GOOD PATTERN (Consistent):                                              │
│  ──────────────────────────                                              │
│                                                                          │
│  Both Pipelines:                                                         │
│  Data → Feature Store → Same Features → Training/Serving               │
│           (single definition)                                            │
│                                                                          │
│  OR:                                                                     │
│                                                                          │
│  Serving:                                                                │
│  Request → Features → Model → Response                                  │
│                │                                                         │
│                └──→ Log Features                                        │
│                          │                                               │
│  Training:               ▼                                               │
│                 Use Logged Features                                      │
│                                                                          │
│  This guarantees training uses exactly what serving computes.          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Concrete Example: How Training-Serving Skew Destroys Model Performance

Let's walk through a real scenario where training-serving skew caused a recommendation model to fail in production, despite excellent offline metrics.

The Setup:

An e-commerce team builds a recommendation model. One important feature is user_avg_purchase_price—the average price of items a user has purchased. During training, they compute this feature using a Spark job that processes the full purchase history.

The Skew:

In their training pipeline (Spark/Python):

Python
# Training: computes average over ALL purchases (full lifetime history)
from pyspark.sql.functions import avg

avg_price = user_purchases.groupBy("user_id").agg(avg("price"))

In their serving pipeline (Java microservice):

Java
// Serving: computes average over LAST 30 DAYS only (for performance)
double avgPrice = recentPurchases.stream()
    .mapToDouble(Purchase::getPrice)
    .average()
    .orElse(0.0);  // average() returns OptionalDouble; default when no recent purchases

The training code computes the lifetime average. The serving code, for latency reasons, only uses the last 30 days. Nobody documented this difference. The feature has the same name in both places.

The Impact:

For a user who bought cheap items years ago but now buys expensive items, the training feature might be $50 (lifetime average), while the serving feature is $200 (recent average). The model learned patterns based on lifetime averages but receives recent averages at inference time.

The model's offline AUC was 0.82. In production, it performed barely better than random. The team spent weeks debugging before discovering the skew.

How to Detect This:

  1. Feature distribution monitoring: Log feature values at serving time. Compare serving distributions to training distributions daily. The user_avg_purchase_price feature would show a different distribution (higher variance, shifted mean).

  2. Shadow scoring: Run the training pipeline on recent data and compare features to what was logged during serving. Any significant differences indicate skew.

  3. Feature contracts: Document exactly how each feature is computed, including time windows, null handling, and edge cases. Code review changes against this contract.

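Here is a minimal sketch of the distribution comparison from point 1, using the population stability index; the thresholds in the docstring are common conventions rather than hard rules:

Python
import numpy as np

def population_stability_index(train_values, serving_values, bins=10):
    """
    Compare a feature's training vs serving distribution.
    Common reading: PSI < 0.1 stable, 0.1-0.25 investigate, > 0.25 likely skew/drift.
    """
    train_values = np.asarray(train_values, dtype=float)
    serving_values = np.asarray(serving_values, dtype=float)

    # Bin edges from the training distribution; drop duplicate edges for skewed features
    edges = np.unique(np.percentile(train_values, np.linspace(0, 100, bins + 1)))

    # Clip serving values into the training range so outliers fall in the end bins
    serving_clipped = np.clip(serving_values, edges[0], edges[-1])

    train_frac = np.histogram(train_values, bins=edges)[0] / len(train_values)
    serve_frac = np.histogram(serving_clipped, bins=edges)[0] / len(serving_values)

    # Small floor avoids log(0) and division by zero for empty bins
    train_frac = np.clip(train_frac, 1e-6, None)
    serve_frac = np.clip(serve_frac, 1e-6, None)

    return float(np.sum((serve_frac - train_frac) * np.log(serve_frac / train_frac)))

Run this per feature, comparing the last training snapshot against a day of logged serving values, and alert when the score crosses your chosen threshold.
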
The Fix:

Either change training to use 30-day windows (matching serving), or change serving to compute lifetime averages (matching training). The specific choice depends on which definition is more predictive—but they must match.

This example illustrates why training-serving consistency is non-negotiable. A feature store that enforces single definitions prevents this entire class of bugs.

Data Quality Assessment

Before training, assess data quality systematically:

Completeness:

  • What percentage of records have each feature?
  • Are missing values random or systematic?
  • How do you handle missing values?

Accuracy:

  • Are labels correct? (Sample and verify)
  • Are feature values plausible? (Range checks)
  • Are there systematic errors? (Biased collection)

Consistency:

  • Are definitions consistent over time?
  • Are there duplicate records?
  • Do related fields agree?

Timeliness:

  • How fresh is the data?
  • Is there label delay?
  • Are features point-in-time correct?

Implementing Data Quality Checks

Data quality assessment shouldn't be manual inspection—it should be automated validation that runs on every data pipeline execution. The code below shows a simplified data validation approach that catches common issues before they corrupt your training data.

Python
def validate_training_data(df, config):
    """
    Validate training data quality before model training.
    Returns validation report with pass/fail status.
    """
    issues = []

    # Check for required columns
    missing_cols = set(config.required_columns) - set(df.columns)
    if missing_cols:
        issues.append(f"Missing columns: {missing_cols}")

    # Check null rates against thresholds
    for col, max_null_rate in config.null_thresholds.items():
        actual_null_rate = df[col].isnull().mean()
        if actual_null_rate > max_null_rate:
            issues.append(
                f"{col}: null rate {actual_null_rate:.2%} exceeds threshold {max_null_rate:.2%}"
            )

    # Check value ranges for numeric features
    for col, (min_val, max_val) in config.value_ranges.items():
        out_of_range = ((df[col] < min_val) | (df[col] > max_val)).mean()
        if out_of_range > 0.01:  # More than 1% out of range
            issues.append(
                f"{col}: {out_of_range:.2%} of values outside [{min_val}, {max_val}]"
            )

    # Check label distribution hasn't shifted dramatically
    label_dist = df[config.label_column].value_counts(normalize=True)
    for label, expected_rate in config.expected_label_rates.items():
        actual_rate = label_dist.get(label, 0)
        if abs(actual_rate - expected_rate) > config.label_drift_threshold:
            issues.append(
                f"Label '{label}': rate {actual_rate:.2%} vs expected {expected_rate:.2%}"
            )

    return {"passed": len(issues) == 0, "issues": issues}

This validation function embodies several important principles. First, it checks for structural issues like missing columns—these indicate upstream pipeline failures and should block training entirely. Second, it monitors null rates per column, because a sudden spike in nulls often indicates a data source problem (API changed, logging broke, upstream job failed). The thresholds should be set based on historical baselines, not arbitrary numbers.

Third, the value range checks catch data corruption. If user ages suddenly include values of 500 or -10, something is wrong upstream. Fourth, label distribution monitoring catches concept drift—if your fraud rate suddenly doubles, either fraud patterns changed or your labeling pipeline broke. Either way, you need to investigate before training.

The key is making these checks automatic and blocking. If validation fails, the training pipeline should stop and alert, not proceed with corrupted data. Bad data produces bad models, and catching issues at validation time is far cheaper than debugging model failures in production.

Data Pipeline Architecture

Design robust pipelines for data flow:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA PIPELINE ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DATA SOURCES                                                            │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  INGESTION LAYER                                                │   │
│  │  • Kafka for streaming data                                     │   │
│  │  • Batch imports for static data                                │   │
│  │  • Schema validation at entry                                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  PROCESSING LAYER                                               │   │
│  │  • Spark/Flink for transformation                               │   │
│  │  • Feature computation                                          │   │
│  │  • Data quality checks                                          │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  STORAGE LAYER                                                  │   │
│  │  • Data Lake (S3/GCS) for raw and processed data               │   │
│  │  • Feature Store for ML features                                │   │
│  │  • Data Warehouse for analytics                                 │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  SERVING LAYER                                                  │   │
│  │  • Training data for model development                          │   │
│  │  • Online features for inference                                │   │
│  │  • Batch features for offline scoring                           │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  KEY PRINCIPLES:                                                         │
│  • Idempotent processing (re-runnable without side effects)            │
│  • Schema evolution (handle schema changes gracefully)                 │
│  • Data lineage (track where data came from)                           │
│  • Monitoring and alerting (catch issues early)                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Collection Checklist

Before moving forward, verify:

  • All data sources identified and accessible
  • Labeling strategy defined (explicit, implicit, or programmatic)
  • Training-serving consistency ensured
  • Data quality assessed (completeness, accuracy, consistency)
  • Data pipeline architecture designed
  • Privacy and compliance requirements addressed
  • Data retention and deletion policies defined
  • Documentation of all feature definitions

Common Pitfall: Spending 80% of time on modeling, 20% on data. Should be reversed. Data quality has more impact on model performance than model architecture for most problems.


Step 6: Offline Model Development and Evaluation

This step covers the actual machine learning: selecting algorithms, training models, and evaluating performance before deployment.

Establish a Baseline

Before building complex models, establish a baseline. This serves multiple purposes:

  • Proves the problem is solvable
  • Provides a benchmark for improvement
  • May be sufficient for the business need
  • Helps debug more complex models

Good Baselines:

  • Random: What's random performance? (important for imbalanced classes)
  • Heuristic: Simple rules based on domain knowledge
  • Simple ML: Logistic regression, decision tree, k-NN
  • Previous system: What's the current production model?
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BASELINE EXAMPLES                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RECOMMENDATIONS:                                                        │
│  ────────────────                                                        │
│  Random: NDCG@10 = 0.05                                                │
│  Popular items: NDCG@10 = 0.25                                         │
│  User-based CF: NDCG@10 = 0.35                                         │
│  Target: NDCG@10 > 0.45                                                │
│                                                                          │
│  FRAUD DETECTION:                                                        │
│  ────────────────                                                        │
│  Random: Precision@1% = 0.01 (1% of transactions are fraud)           │
│  Rule-based: Precision@1% = 0.30                                       │
│  Logistic regression: Precision@1% = 0.50                             │
│  Target: Precision@1% > 0.70                                           │
│                                                                          │
│  SEARCH RANKING:                                                         │
│  ───────────────                                                         │
│  BM25 (keyword): NDCG@10 = 0.45                                        │
│  + Click features: NDCG@10 = 0.55                                      │
│  Target: NDCG@10 > 0.65                                                │
│                                                                          │
│  ALWAYS REPORT:                                                          │
│  • Baseline performance (what you're improving on)                     │
│  • Model performance (what you achieved)                               │
│  • Improvement (absolute and relative)                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Model Selection

Choose model architecture based on your requirements:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MODEL SELECTION GUIDE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WHEN TO USE SIMPLER MODELS (Linear, Trees, GBMs):                     │
│  ────────────────────────────────────────────────                        │
│  • Limited training data (<100K examples)                              │
│  • Need interpretability (regulated domains)                           │
│  • Strict latency requirements (<10ms)                                 │
│  • Limited engineering resources                                        │
│  • Features are already well-engineered                                │
│                                                                          │
│  WHEN TO USE DEEP LEARNING:                                             │
│  ──────────────────────────                                              │
│  • Abundant training data (>1M examples)                               │
│  • Raw/unstructured data (text, images, sequences)                     │
│  • Complex feature interactions to learn                               │
│  • Have GPU/TPU infrastructure                                          │
│  • State-of-the-art performance required                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON ARCHITECTURES BY TASK:                                          │
│  ──────────────────────────────                                          │
│                                                                          │
│  Classification/Regression (tabular):                                   │
│  • XGBoost, LightGBM, CatBoost (gradient boosting)                    │
│  • Wide & Deep (combines memorization + generalization)                │
│  • TabNet (attention for tabular)                                      │
│                                                                          │
│  Ranking:                                                                │
│  • LambdaMART (gradient boosting for ranking)                          │
│  • Two-tower models (separate user/item encoders)                      │
│  • Cross-attention models (BERT-style)                                 │
│                                                                          │
│  Sequence Modeling:                                                      │
│  • Transformers (attention-based)                                      │
│  • LSTMs/GRUs (recurrent)                                              │
│  • Temporal Convolutional Networks                                     │
│                                                                          │
│  Embeddings:                                                             │
│  • Word2Vec, FastText (word embeddings)                                │
│  • Item2Vec (item embeddings from co-occurrence)                       │
│  • Graph Neural Networks (relationship-aware)                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Understanding the Trade-offs: When Simple Beats Complex

The diagram above provides guidelines, but understanding the underlying trade-offs helps you make better decisions for your specific situation.

The Data Efficiency Question:

Deep learning models are hungry for data. A neural network with millions of parameters needs millions of examples to avoid overfitting. Gradient boosted trees, with their inherent regularization and feature selection, can achieve strong performance with orders of magnitude less data.

The crossover point varies by problem complexity. For simple tabular prediction (like click-through rate with well-engineered features), XGBoost often matches or beats neural networks even with 10M+ examples. For complex problems with raw inputs (understanding sentence meaning, detecting objects in images), neural networks pull ahead much sooner.

A practical test: train both approaches on 10% of your data. If XGBoost is competitive, it will likely remain competitive at full scale. If the neural network is already significantly better, the gap will widen with more data.

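A rough sketch of that 10% experiment using scikit-learn stand-ins for the two families (gradient boosting vs a small neural network); X and y are assumed to be a tabular feature matrix and binary labels, and in practice you would swap in your actual candidates:

Python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def compare_on_subsample(X, y, fraction=0.10, random_state=0):
    """Train a GBM and a small neural net on a fraction of the data; compare AUC."""
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=fraction, stratify=y, random_state=random_state
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_sub, y_sub, test_size=0.2, stratify=y_sub, random_state=random_state
    )

    results = {}
    for name, model in [
        ("gbm", GradientBoostingClassifier()),
        # In practice, scale features for the MLP (e.g., StandardScaler); omitted for brevity
        ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
    ]:
        model.fit(X_train, y_train)
        scores = model.predict_proba(X_val)[:, 1]
        results[name] = roc_auc_score(y_val, scores)
    return results
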
The Feature Engineering Trade-off:

Gradient boosted models excel when features are well-engineered. They can find complex interactions and thresholds, but they operate on the features you provide. If your features capture the signal in your data, GBMs are hard to beat.

Neural networks learn features from raw data. This is powerful when you don't know the right features (what makes a sentence positive sentiment?) or when the right features are too complex to engineer manually (what patterns in an image indicate a cat?). But this power comes at a cost: more data required, more compute needed, and harder to interpret.

For tabular data with domain expertise available, investing in feature engineering plus XGBoost often beats investing in neural architecture search. The engineering time pays off in faster training, easier debugging, and more interpretable models.

The Latency Trade-off:

Inference latency varies dramatically by model type. A small XGBoost model can score in microseconds. A BERT model requires milliseconds on GPU, tens of milliseconds on CPU. For applications requiring <10ms latency (autocomplete, real-time bidding), this constraint often eliminates deep learning options—or requires significant optimization investment (distillation, quantization, specialized hardware).

Consider total system latency, not just model latency. If feature fetching takes 50ms and you have a 100ms budget, your model has 50ms regardless of architecture. In that case, you might afford a larger model than you initially thought.

The Interpretability Trade-off:

In regulated domains (credit decisions, medical diagnosis), you may need to explain why the model made a prediction. Tree-based models provide natural explanations: "This loan was rejected because income < $50K AND debt-to-income > 40%." Neural networks are black boxes that require additional techniques (SHAP, attention visualization) for explanation, and these explanations are approximations.

Don't overestimate interpretability requirements. Many "we need interpretability" situations actually need auditability (can we check the model isn't discriminating?) or debuggability (can we understand when it fails?). These are different from per-prediction explanations and can be achieved with black-box models.

Training Pipeline Design

Design a reproducible, automated training pipeline:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TRAINING PIPELINE COMPONENTS                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. DATA LOADING                                                         │
│     • Load training, validation, test splits                           │
│     • Apply consistent preprocessing                                    │
│     • Handle data versioning (what data was used?)                     │
│                                                                          │
│  2. FEATURE ENGINEERING                                                  │
│     • Transform raw features to model inputs                           │
│     • Handle missing values, encoding                                   │
│     • Feature selection if needed                                       │
│                                                                          │
│  3. MODEL TRAINING                                                       │
│     • Initialize model with hyperparameters                            │
│     • Train with early stopping on validation                          │
│     • Log metrics, losses, gradients                                   │
│                                                                          │
│  4. HYPERPARAMETER TUNING                                                │
│     • Grid search, random search, or Bayesian optimization            │
│     • Cross-validation for robust estimates                            │
│     • Track all experiments                                            │
│                                                                          │
│  5. EVALUATION                                                           │
│     • Compute metrics on held-out test set                             │
│     • Error analysis (where does model fail?)                          │
│     • Fairness analysis (performance across groups)                    │
│                                                                          │
│  6. ARTIFACT STORAGE                                                     │
│     • Save model weights, config, preprocessing                        │
│     • Version everything (reproducibility)                             │
│     • Store evaluation results                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOOLS:                                                                  │
│  • Experiment tracking: MLflow, Weights & Biases, Neptune             │
│  • Pipeline orchestration: Airflow, Kubeflow, Prefect                 │
│  • Model registry: MLflow, Vertex AI, SageMaker                       │
│  • Hyperparameter tuning: Optuna, Ray Tune, Hyperopt                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

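To make "track all experiments" and "version everything" concrete, here is a minimal sketch using MLflow; the run name, parameters, and data_version tag are placeholders:

Python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def train_and_log(X_train, y_train, X_val, y_val, params, data_version):
    """Train one candidate model and record params, metrics, and artifacts."""
    with mlflow.start_run(run_name="gbm_candidate"):
        mlflow.log_params(params)
        mlflow.log_param("data_version", data_version)  # reproducibility: what data was used

        model = GradientBoostingClassifier(**params)
        model.fit(X_train, y_train)

        val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_metric("val_auc", val_auc)

        # Save the trained model as a versioned artifact
        mlflow.sklearn.log_model(model, "model")
        return model, val_auc
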
Evaluation Strategy

Proper evaluation prevents deploying models that look good but perform poorly:

Data Splits:

  • Training set (70-80%): Used to train the model
  • Validation set (10-15%): Used for hyperparameter tuning and early stopping
  • Test set (10-15%): Used only for final evaluation (never look until the end!)

Time-Based Splits (for temporal data):

Code
├─────────── Training ───────────┼── Val ──┼── Test ──┤
│     Jan 1 - Sep 30             │ Oct     │ Nov      │

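A minimal sketch of that time-based split, assuming a pandas DataFrame with an event_date column; the cutoff dates mirror the diagram and the year is arbitrary:

Python
import pandas as pd

def time_based_split(df: pd.DataFrame, date_col: str = "event_date"):
    """Split temporally: train on Jan-Sep, validate on Oct, test on Nov."""
    dates = pd.to_datetime(df[date_col])
    train = df[dates < "2024-10-01"]
    val = df[(dates >= "2024-10-01") & (dates < "2024-11-01")]
    test = df[(dates >= "2024-11-01") & (dates < "2024-12-01")]
    return train, val, test
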
Stratified Splits: Maintain class distribution in all splits (important for imbalanced data).

Cross-Validation: When data is limited, use k-fold cross-validation for more robust estimates.

Error Analysis

Understanding where your model fails is as important as knowing its overall accuracy:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    ERROR ANALYSIS FRAMEWORK                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. CONFUSION MATRIX ANALYSIS                                            │
│     • False positives: What did we wrongly predict as positive?        │
│     • False negatives: What did we miss?                               │
│     • Are errors concentrated in certain classes?                      │
│                                                                          │
│  2. SLICE ANALYSIS                                                       │
│     • Performance by user segment (new vs returning)                   │
│     • Performance by item category                                     │
│     • Performance by time period                                       │
│     • Performance by feature values                                    │
│                                                                          │
│  3. HARD EXAMPLE MINING                                                  │
│     • Which examples have highest loss?                                │
│     • What patterns do they share?                                     │
│     • Are they mislabeled or genuinely hard?                          │
│                                                                          │
│  4. FEATURE IMPORTANCE                                                   │
│     • Which features drive predictions?                                │
│     • Are important features sensible?                                 │
│     • Are there leakage features?                                      │
│                                                                          │
│  5. CALIBRATION ANALYSIS                                                 │
│     • Do predicted probabilities match actual rates?                   │
│     • Is the model overconfident or underconfident?                   │
│     • Calibration by prediction range                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ACTIONABLE INSIGHTS:                                                    │
│  • Error patterns suggest features to add                              │
│  • Systematic errors suggest labeling issues                           │
│  • Slice performance gaps suggest fairness issues                      │
│  • Calibration issues affect threshold selection                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

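As a small example of the slice analysis idea, compute the same metric per segment and flag large gaps; the column names and the choice of AUC are illustrative:

Python
import pandas as pd
from sklearn.metrics import roc_auc_score

def slice_analysis(df: pd.DataFrame, segment_col: str, label_col="label", score_col="score"):
    """Compute AUC per segment and the gap vs the overall metric."""
    overall_auc = roc_auc_score(df[label_col], df[score_col])
    rows = []
    for segment, group in df.groupby(segment_col):
        if group[label_col].nunique() < 2:
            continue  # AUC is undefined if a slice contains only one class
        auc = roc_auc_score(group[label_col], group[score_col])
        rows.append({"segment": segment, "n": len(group), "auc": auc, "gap": auc - overall_auc})
    return pd.DataFrame(rows).sort_values("gap")
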
Offline Development Checklist

Before moving to deployment, verify:

  • Baseline model established and documented
  • Model architecture chosen with justification
  • Training pipeline automated and reproducible
  • Hyperparameter tuning completed
  • Offline metrics computed on held-out test set
  • Error analysis completed (understand failure modes)
  • Model artifacts saved and versioned
  • Training reproducible (same data + config = same model)

Common Pitfall: Overfitting to the validation set by iterating too many times. Always have a final holdout test set that you only evaluate once at the end.


Step 7: Online Execution, Testing and Evaluation

This step covers deploying the model to production and validating that it works in the real world. The gap between offline and online performance is often significant.

Deployment Strategies

How you deploy determines your risk exposure:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT STRATEGIES                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. SHADOW MODE (Lowest Risk)                                           │
│  ────────────────────────────                                            │
│  • New model runs alongside production                                 │
│  • Predictions logged but not served to users                          │
│  • Compare predictions offline                                          │
│  • Duration: 1-2 weeks                                                 │
│                                                                          │
│  Use for: Major model changes, new ML systems                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. CANARY DEPLOYMENT (Low Risk)                                        │
│  ────────────────────────────────                                        │
│  • Route small percentage to new model                                 │
│  • Monitor for errors, latency, metrics                                │
│  • Gradually increase: 1% → 5% → 10% → 25% → 50% → 100%               │
│  • Automated rollback on anomalies                                     │
│                                                                          │
│  Use for: Most deployments, standard practice                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. A/B TEST (For Business Impact)                                      │
│  ─────────────────────────────────                                       │
│  • Random 50/50 split between control and treatment                    │
│  • Run for statistical significance (usually 1-4 weeks)               │
│  • Measure impact on business metrics                                  │
│  • Make ship/no-ship decision based on results                        │
│                                                                          │
│  Use for: Validating business impact, major changes                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  4. BLUE-GREEN (Quick Rollback)                                         │
│  ───────────────────────────────                                         │
│  • Two identical environments (blue and green)                         │
│  • Deploy new model to inactive environment                            │
│  • Switch traffic instantly                                            │
│  • Instant rollback by switching back                                  │
│                                                                          │
│  Use for: When fast rollback is critical                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

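A minimal sketch of deterministic traffic splitting, which underlies both canary ramp-ups and A/B assignment; the hashing scheme shown is a common convention, not a prescribed one:

Python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, new_fraction: float = 0.05) -> str:
    """Deterministically route a user to 'new' or 'current' via a hash bucket.

    The salt keeps assignments independent across experiments; the same user
    always lands in the same bucket for a given experiment (no flip-flopping).
    """
    digest = hashlib.md5(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "new" if bucket < new_fraction * 10_000 else "current"

# Canary ramp: raise new_fraction from 0.01 -> 0.05 -> 0.10 -> ... via config,
# not code deploys, so rollback is just setting it back to 0.
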
A/B Testing Best Practices

A/B testing is how you validate that offline improvements translate to online gains:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    A/B TEST DESIGN                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BEFORE THE TEST:                                                        │
│  ─────────────────                                                       │
│  1. Define hypothesis: "New model will increase CTR by 5%"             │
│  2. Choose primary metric (one!)                                        │
│  3. List guardrail metrics (shouldn't regress)                        │
│  4. Calculate sample size for statistical power                        │
│  5. Define success criteria before starting                            │
│                                                                          │
│  DURING THE TEST:                                                        │
│  ────────────────                                                        │
│  1. Ensure random assignment (no selection bias)                       │
│  2. Monitor for technical issues (errors, latency)                    │
│  3. Don't peek at results repeatedly (multiple testing problem)       │
│  4. Run for full planned duration                                      │
│                                                                          │
│  AFTER THE TEST:                                                         │
│  ───────────────                                                         │
│  1. Analyze statistical significance                                   │
│  2. Check guardrail metrics                                            │
│  3. Look for heterogeneous effects (different by segment)             │
│  4. Document findings and decision                                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON MISTAKES:                                                        │
│                                                                          │
│  • Stopping early when results look good (inflates false positives)   │
│  • Ignoring guardrail regressions                                      │
│  • Not accounting for novelty effect                                   │
│  • Simpson's paradox (aggregate hides segment differences)            │
│  • Network effects (treatment affects control)                         │
│  • Too many metrics (multiple testing)                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Sample Size Calculation

Running tests with insufficient sample size leads to inconclusive or misleading results:

Key Parameters:

  • Baseline rate: Current performance (e.g., 2% CTR)
  • Minimum detectable effect (MDE): Smallest improvement worth detecting (e.g., 5% relative)
  • Statistical significance (α): Usually 0.05 (5% false positive rate)
  • Statistical power (1-β): Usually 0.80 (80% chance of detecting true effect)

Rule of Thumb: For small effects (1-5% relative change), you typically need:

  • 10K-50K users per variant for conversion metrics
  • 1K-10K users per variant for engagement metrics
  • More users for rare events (fraud, churn)

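Behind those rules of thumb is the standard normal-approximation formula for a two-proportion test. Here is a sketch, with the usual caveat that exact tests or continuity corrections shift the numbers slightly:

Python
from scipy.stats import norm

def samples_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)

    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for power = 0.80

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(round(n))

# Example: 20% baseline conversion rate, 5% relative MDE
# samples_per_variant(0.20, 0.05)  -> roughly 26K users per variant
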
Understanding the Statistics: Why These Numbers Matter

The parameters above encode important trade-offs. Understanding them helps you make better decisions about test design.

Statistical Significance (α = 0.05):

Statistical significance answers: "If the new model is actually no better than the old one, what's the probability we'd see results this extreme by chance?"

Setting α = 0.05 means we accept a 5% chance of a false positive—declaring the new model better when it's actually not. This seems low, but consider: if you run 20 A/B tests per year, you'd expect one false positive annually purely by chance.

Lower α (0.01) reduces false positives but requires larger sample sizes. Higher α (0.10) gets results faster but increases false positives. For high-stakes decisions (major product changes), use α = 0.01. For exploratory tests, α = 0.10 may be acceptable.

Statistical Power (1-β = 0.80):

Power answers a different question: "If the new model really is better by the amount we care about, what's the probability we'll detect it?"

With power = 0.80, there's a 20% chance of missing a real improvement—a false negative. We'd conclude "no significant difference" when the new model actually is better.

Higher power (0.90 or 0.95) reduces missed improvements but requires larger samples. The trade-off is test duration. With limited traffic, running at 0.80 power might take 2 weeks; 0.95 power might take 6 weeks. During those extra 4 weeks, if the new model is better, you're missing out on those gains.

Minimum Detectable Effect (MDE):

MDE is the smallest improvement worth detecting. If your baseline CTR is 2% and MDE is 5% relative, you're designing the test to detect changes from 2.0% to 2.1% or larger.

Smaller MDE requires more samples. A test designed to detect 1% relative change needs roughly 25x more users than one designed to detect 5% relative change. Be realistic about what improvements are meaningful. A 1% relative improvement in CTR might be worth millions in revenue for Google, but irrelevant for a startup. Don't design expensive tests for effects that don't matter to your business.

Why You Shouldn't Peek:

The math above assumes you check results once, at the planned end of the experiment. If you peek at results daily and stop when they look significant, you inflate false positive rates dramatically—potentially to 30% or higher.

This happens because random variation means results fluctuate. Even with no real effect, you'll see "significant" results at some point during the test due to chance. Peeking and stopping at that point locks in the false positive.

If you must monitor ongoing tests, use sequential testing methods (such as always-valid inference) that account for multiple looks. Or, define stopping rules in advance: "We'll check at 1 week. If p < 0.001, we can stop early. Otherwise, continue to 2 weeks."

Simpson's Paradox: The Hidden Danger:

Simpson's paradox occurs when aggregate results hide segment-level effects. Classic example:

  • Aggregate: New model has 2.5% CTR vs 2.4% CTR for control. Ship it!
  • Mobile users (60% of traffic): New model 1.8% vs control 2.0%. Worse!
  • Desktop users (40% of traffic): New model 3.6% vs control 3.0%. Better!

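Working through the weighted averages makes the trap concrete (numbers taken from the bullets above):

Python
# Weighted aggregate CTR from the segment numbers above (60% mobile, 40% desktop)
traffic_share = {"mobile": 0.60, "desktop": 0.40}
ctr_treatment = {"mobile": 0.018, "desktop": 0.036}
ctr_control = {"mobile": 0.020, "desktop": 0.030}

agg_treatment = sum(traffic_share[s] * ctr_treatment[s] for s in traffic_share)  # ~0.0252
agg_control = sum(traffic_share[s] * ctr_control[s] for s in traffic_share)      # 0.0240

print(f"treatment aggregate CTR: {agg_treatment:.4f}")  # looks "better" overall
print(f"control aggregate CTR:   {agg_control:.4f}")
# ...yet the mobile majority (60% of users) is worse off under the treatment.
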
As the arithmetic above shows, the aggregate win is carried entirely by the desktop minority: the mobile segment, which covers the majority of users, is actually worse off under the new model. If traffic composition also shifts between arms or over the test period, the aggregate becomes even more misleading.

Always check segment-level results. If the treatment effect varies by segment, investigate before shipping. You might discover that "improvement" only helps a minority while hurting the majority.

Monitoring During Deployment

What to watch during and after deployment:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT MONITORING                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SYSTEM HEALTH (Real-time):                                             │
│  ──────────────────────────                                              │
│  • Request rate (QPS)                                                  │
│  • Error rate (5xx, timeouts)                                          │
│  • Latency (p50, p95, p99)                                             │
│  • Resource utilization (CPU, memory, GPU)                             │
│                                                                          │
│  MODEL HEALTH (Near real-time):                                         │
│  ───────────────────────────────                                         │
│  • Prediction distribution (scores, classes)                           │
│  • Feature value distributions                                         │
│  • Null/missing feature rates                                          │
│  • Model version being served                                          │
│                                                                          │
│  BUSINESS METRICS (Hourly/Daily):                                       │
│  ─────────────────────────────────                                       │
│  • Primary metric (CTR, conversion, etc.)                             │
│  • Guardrail metrics                                                   │
│  • Segment-level performance                                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ALERTING RULES:                                                         │
│                                                                          │
│  Immediate (PagerDuty):                                                 │
│  • Error rate > 1%                                                     │
│  • p99 latency > 500ms                                                 │
│  • Prediction rate drops > 50%                                         │
│                                                                          │
│  Urgent (Slack):                                                        │
│  • Primary metric drops > 10%                                          │
│  • Feature null rate > 5%                                              │
│  • Prediction distribution shift                                       │
│                                                                          │
│  Daily Review:                                                          │
│  • All metric trends                                                   │
│  • A/B test progress                                                   │
│  • Error logs                                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Rollback Planning

Always have a rollback plan before deploying:

When to Rollback:

  • System health degradation (errors, latency)
  • Severe business metric regression
  • Unexpected behavior (even if metrics look okay)
  • User complaints spike

How to Rollback:

  • Instant: Switch traffic back to previous model
  • Gradual: Reduce new model percentage while monitoring
  • Feature flag: Disable new model via configuration

Rollback Checklist:

  • Previous model still deployed and healthy
  • Traffic routing configurable without code deploy
  • Rollback decision criteria documented
  • On-call knows rollback procedure
  • Tested rollback works before deployment

Online Testing Checklist

Before declaring success, verify:

  • Shadow mode completed (if applicable)
  • Canary deployment successful
  • A/B test designed with proper sample size
  • Statistical significance reached
  • Primary metric improved
  • Guardrail metrics stable
  • Rollback plan tested
  • Documentation updated

Common Pitfall: Declaring victory too early. Wait for statistical significance. Watch for novelty effects that fade. Monitor for delayed negative effects (churn shows up weeks later).


Step 8: Monitoring, Scaling and Continual Learning

The work doesn't end at deployment. ML systems require ongoing monitoring, maintenance, and improvement to remain effective.

Comprehensive Monitoring

Monitor at multiple levels:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MONITORING HIERARCHY                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LEVEL 1: INFRASTRUCTURE MONITORING                                     │
│  ─────────────────────────────────────                                   │
│  • Server health (CPU, memory, disk, network)                          │
│  • Service availability (uptime, error rates)                          │
│  • Request metrics (QPS, latency, throughput)                          │
│  • Dependencies (database, cache, feature store)                       │
│                                                                          │
│  Tools: Prometheus, Grafana, DataDog, CloudWatch                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LEVEL 2: MODEL MONITORING                                              │
│  ──────────────────────────                                              │
│  • Prediction distributions (score histograms, class ratios)          │
│  • Feature distributions (input drift detection)                       │
│  • Model version and configuration                                     │
│  • Inference latency breakdown                                         │
│                                                                          │
│  Tools: Evidently, Arize, Fiddler, custom logging                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LEVEL 3: DATA QUALITY MONITORING                                       │
│  ─────────────────────────────────                                       │
│  • Feature completeness (null rates)                                   │
│  • Feature freshness (staleness)                                       │
│  • Label availability and quality                                      │
│  • Training data pipeline health                                       │
│                                                                          │
│  Tools: Great Expectations, dbt tests, custom checks                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LEVEL 4: BUSINESS MONITORING                                           │
│  ─────────────────────────────                                           │
│  • Business KPIs (revenue, engagement, conversion)                     │
│  • Model-attributed metrics (recommendations CTR)                      │
│  • User feedback and complaints                                        │
│  • Segment-level performance                                           │
│                                                                          │
│  Tools: Amplitude, Mixpanel, Tableau, custom dashboards               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Implementing Drift Detection

Drift detection shouldn't wait for business metrics to drop—by then, the damage is done. The code below shows how to proactively detect when your model's input distributions shift, alerting you before performance degrades.

Python
from scipy import stats
import numpy as np

def detect_feature_drift(baseline_data, current_data, features, threshold=0.05):
    """
    Detect drift between baseline (training) and current (production) data.
    Uses Kolmogorov-Smirnov test for continuous features.
    Returns features that have drifted significantly.
    """
    drifted_features = []

    for feature in features:
        baseline_values = baseline_data[feature].dropna()
        current_values = current_data[feature].dropna()

        # KS test: are these distributions different?
        statistic, p_value = stats.ks_2samp(baseline_values, current_values)

        if p_value < threshold:
            drifted_features.append({
                "feature": feature,
                "p_value": p_value,
                "ks_statistic": statistic,
                "baseline_mean": baseline_values.mean(),
                "current_mean": current_values.mean(),
                "baseline_std": baseline_values.std(),
                "current_std": current_values.std(),
            })

    return drifted_features

def check_prediction_drift(baseline_predictions, current_predictions, threshold=0.1):
    """
    Check if model prediction distribution has shifted.
    A shift here suggests either input drift or concept drift.
    """
    baseline_mean = np.mean(baseline_predictions)
    current_mean = np.mean(current_predictions)

    # Guard against a zero baseline mean to avoid division by zero
    relative_change = abs(current_mean - baseline_mean) / max(abs(baseline_mean), 1e-12)

    return {
        "drifted": relative_change > threshold,
        "baseline_mean": baseline_mean,
        "current_mean": current_mean,
        "relative_change": relative_change
    }

This drift detection approach uses statistical tests to compare distributions. The Kolmogorov-Smirnov test asks: "What's the probability these two samples came from the same underlying distribution?" A low p-value (below your threshold) indicates the distributions are likely different—your feature has drifted.

The key insight is monitoring both input drift (feature distributions) and output drift (prediction distributions). Input drift tells you the world has changed—maybe your user demographics shifted or a data source changed format. Output drift tells you your model is behaving differently, even if you can't pinpoint why.

When drift is detected, you have options. Minor drift might just warrant closer monitoring. Significant drift on important features should trigger investigation—is this real-world change or a data pipeline bug? Severe drift might require immediate retraining or rollback to a fallback model.

Run these checks daily on a sample of production traffic compared to your training baseline. Store the results in a time series to spot gradual drift trends, not just sudden shifts. A feature that drifts 1% per week might not trigger alerts but will cause serious problems after a few months.
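
To catch that kind of slow creep, persist each run's drift statistics and look at the trend rather than a single comparison. Here is a minimal sketch under the assumption that you keep one KS statistic per feature per day; how you store the history is up to you.

Python
import numpy as np

def drift_trend_alert(daily_ks_statistics, window=30, slope_threshold=0.002):
    """Flag a feature whose KS statistic is creeping upward over the last `window` days.

    daily_ks_statistics: list of floats, oldest first, one value per day.
    """
    recent = np.asarray(daily_ks_statistics[-window:])
    if len(recent) < window:
        return False  # not enough history yet

    # Fit a line to the recent values; a persistent positive slope means gradual drift
    days = np.arange(len(recent))
    slope = np.polyfit(days, recent, 1)[0]
    return slope > slope_threshold

# Example: a feature drifting slowly, about 0.003 per day
history = list(np.linspace(0.02, 0.11, 30))
print(drift_trend_alert(history))  # True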

Detecting and Handling Drift

Model performance degrades over time as the world changes. Detect drift early:

Types of Drift:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TYPES OF DRIFT                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DATA DRIFT (Covariate Shift):                                          │
│  ──────────────────────────────                                          │
│  • Input feature distributions change                                  │
│  • Example: User demographics shift, new product categories           │
│  • Detection: Compare feature distributions over time                  │
│  • Response: Retrain on recent data                                   │
│                                                                          │
│  CONCEPT DRIFT:                                                          │
│  ───────────────                                                         │
│  • Relationship between features and target changes                    │
│  • Example: User preferences change, fraud patterns evolve            │
│  • Detection: Monitor prediction-outcome correlation                   │
│  • Response: Retrain with new labels                                  │
│                                                                          │
│  LABEL DRIFT:                                                            │
│  ─────────────                                                           │
│  • Target distribution changes                                         │
│  • Example: Fraud rate increases, conversion rate drops               │
│  • Detection: Monitor label/outcome distributions                      │
│  • Response: Adjust thresholds, retrain                               │
│                                                                          │
│  UPSTREAM DATA DRIFT:                                                    │
│  ─────────────────────                                                   │
│  • Data sources or schemas change                                      │
│  • Example: Logging format changes, new data provider                 │
│  • Detection: Schema validation, data quality checks                   │
│  • Response: Update pipelines, may need model changes                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DRIFT DETECTION METHODS:                                                │
│  • Statistical tests (KS test, chi-squared, PSI)                      │
│  • Distribution comparison (histograms, quantiles)                     │
│  • Model-based detection (classifier to distinguish old vs new)       │
│  • Performance degradation (monitor actual outcomes)                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
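
Of the detection methods listed, the PSI is simple enough to implement directly. The sketch below is one minimal version for continuous features, binned on the baseline distribution; the 0.1 / 0.25 thresholds in the docstring are a common rule of thumb, not a universal standard.

Python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """Population Stability Index between a baseline sample and a production sample.

    Rough convention: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift.
    """
    # Bin edges come from the baseline so both samples are bucketed identically
    edges = np.histogram_bin_edges(baseline, bins=bins)
    baseline_counts, _ = np.histogram(baseline, bins=edges)
    current_counts, _ = np.histogram(current, bins=edges)

    # Convert to proportions; epsilon avoids log(0) for empty buckets
    eps = 1e-6
    baseline_pct = baseline_counts / baseline_counts.sum() + eps
    current_pct = current_counts / current_counts.sum() + eps

    return float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))

# Example: production values shifted up by half a standard deviation
rng = np.random.default_rng(0)
print(population_stability_index(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000)))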

Retraining Strategy

Define when and how to retrain:

Retraining Triggers (a trigger sketch follows this list):

  • Scheduled: Fixed cadence (daily, weekly, monthly)
  • Performance-based: When metrics drop below threshold
  • Drift-based: When drift detected above threshold
  • Event-based: After major changes (new features, policy changes)

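In practice these triggers are combined in one small scheduler job. Below is a minimal sketch; the seven-day cadence, the metric floor, and the three-feature drift threshold are illustrative values, and the inputs are whatever your monitoring already produces.

Python
from datetime import datetime, timedelta

def should_retrain(last_trained, metric_value, metric_floor,
                   drifted_feature_count, max_cadence_days=7):
    """Combine scheduled, performance-based, and drift-based retraining triggers."""
    reasons = []

    # Scheduled: never go longer than the fixed cadence without retraining
    if datetime.now() - last_trained > timedelta(days=max_cadence_days):
        reasons.append("scheduled cadence exceeded")

    # Performance-based: a monitored metric fell below its agreed floor
    if metric_value < metric_floor:
        reasons.append("metric below threshold")

    # Drift-based: too many features flagged by the drift checks
    if drifted_feature_count >= 3:
        reasons.append("feature drift detected")

    return bool(reasons), reasons

print(should_retrain(datetime.now() - timedelta(days=10),
                     metric_value=0.71, metric_floor=0.75,
                     drifted_feature_count=1))
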
Retraining Pipeline:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    AUTOMATED RETRAINING PIPELINE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRIGGER                                                                 │
│     │ (schedule, drift alert, or manual)                                │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  DATA PREPARATION                                               │   │
│  │  • Fetch recent training data                                   │   │
│  │  • Apply data quality checks                                    │   │
│  │  • Generate train/validation splits                             │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │                                                                    │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  MODEL TRAINING                                                 │   │
│  │  • Train with same hyperparameters (or re-tune)                │   │
│  │  • Track experiment in MLflow/W&B                              │   │
│  │  • Save model artifacts                                        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │                                                                    │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  VALIDATION                                                     │   │
│  │  • Compute offline metrics on holdout                          │   │
│  │  • Compare to current production model                         │   │
│  │  • Check for regressions on slices                             │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │                                                                    │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  AUTOMATED GATES                                                │   │
│  │  • Pass: Metrics improved or stable                            │   │
│  │  • Fail: Metrics regressed (alert, don't deploy)              │   │
│  │  • Flag: Unusual patterns (human review)                       │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│     │ (if passed)                                                        │
│     ▼                                                                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  DEPLOYMENT                                                     │   │
│  │  • Register in model registry                                   │   │
│  │  • Deploy via canary                                            │   │
│  │  • Monitor online metrics                                       │   │
│  │  • Auto-rollback if issues                                     │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  FREQUENCY GUIDELINES:                                                   │
│  • High-change domains (news, trending): Daily                        │
│  • Medium-change domains (e-commerce): Weekly                         │
│  • Low-change domains (fraud patterns): Monthly                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
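
The "automated gates" stage is usually just a comparison between the candidate's validation metrics and the production model's, with a flag path for anything unusual. A minimal sketch follows; the metric names, tolerances, and dictionary layout are assumptions for illustration.

Python
def retrain_gate(candidate_metrics, production_metrics,
                 primary="auc", tolerance=0.002, max_slice_regression=0.02):
    """Return 'pass', 'fail', or 'flag' for a freshly retrained candidate model.

    Metrics dicts look like {"auc": 0.81, "slices": {"new_users": 0.74, ...}}.
    """
    delta = candidate_metrics[primary] - production_metrics[primary]

    # Fail outright if the primary metric regressed beyond tolerance
    if delta < -tolerance:
        return "fail"

    # Flag for human review if any slice regressed noticeably
    for slice_name, prod_value in production_metrics.get("slices", {}).items():
        cand_value = candidate_metrics.get("slices", {}).get(slice_name, prod_value)
        if prod_value - cand_value > max_slice_regression:
            return "flag"

    return "pass"

print(retrain_gate(
    {"auc": 0.812, "slices": {"new_users": 0.70, "power_users": 0.86}},
    {"auc": 0.810, "slices": {"new_users": 0.74, "power_users": 0.85}},
))  # "flag": the new_users slice regressed by 0.04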

Feedback Loops

Close the loop between predictions and outcomes:

Collecting Feedback (a minimal join sketch follows this list):

  • Log predictions with unique IDs
  • Track outcomes when they occur
  • Join predictions to outcomes
  • Use for evaluation and retraining

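The sketch below shows that logging-and-join step with two pandas DataFrames keyed by a shared prediction_id; in production these would be log streams or warehouse tables, and the column names are illustrative.

Python
import pandas as pd

# Predictions logged at serving time, each with a unique ID
predictions = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3"],
    "score": [0.91, 0.12, 0.55],
    "model_version": ["v7", "v7", "v7"],
})

# Outcomes arrive later (clicks, purchases, fraud labels, ...)
outcomes = pd.DataFrame({
    "prediction_id": ["a1", "a3"],
    "outcome": [1, 0],
})

# Left join keeps predictions whose outcome hasn't arrived yet (outcome is NaN)
training_feedback = predictions.merge(outcomes, on="prediction_id", how="left")
print(training_feedback)
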
Feedback Types:

  • Immediate: Click/no-click, add-to-cart
  • Delayed: Purchase, subscription, churn
  • Labeled: Human review, customer support escalation
  • Implicit: Return visits, time spent, complaints

Avoiding Feedback Loops (see the exploration sketch after this list):

  • Model predictions influence user behavior
  • Changed behavior becomes training data
  • Model reinforces its own biases
  • Solution: Randomized exploration, propensity scoring
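
One common mitigation is epsilon-greedy exploration with propensity logging: serve a small fraction of requests with a random item and record the probability each shown item had of being served, so training can later reweight by it (inverse propensity scoring). The sketch below is illustrative; the 5% exploration rate and function name are assumptions.

Python
import random

def select_item(ranked_items, epsilon=0.05):
    """Mostly serve the model's top item, occasionally explore a random one.

    Returns (item, propensity), where propensity is the probability this item
    had of being shown under the policy; log it alongside the prediction.
    """
    n = len(ranked_items)
    if random.random() < epsilon:
        item = random.choice(ranked_items)
    else:
        item = ranked_items[0]

    # Top item: shown when exploiting, plus its share of exploration traffic
    if item == ranked_items[0]:
        propensity = (1 - epsilon) + epsilon / n
    else:
        propensity = epsilon / n
    return item, propensity

# Training examples can later be weighted by 1 / propensity
print(select_item(["item_42", "item_7", "item_99"]))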

Incident Response

When things go wrong (and they will):

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    INCIDENT RESPONSE RUNBOOK                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SEVERITY LEVELS:                                                        │
│  ────────────────                                                        │
│  P0 (Critical): Model completely down, major business impact           │
│  P1 (High): Significant degradation, metrics dropping fast             │
│  P2 (Medium): Minor degradation, localized issues                      │
│  P3 (Low): Cosmetic issues, no immediate impact                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMMEDIATE RESPONSE (P0/P1):                                            │
│  ────────────────────────────                                            │
│  1. Acknowledge alert (5 min)                                          │
│  2. Assess impact scope (10 min)                                       │
│  3. Decide: rollback, hotfix, or investigate (15 min)                 │
│  4. Execute decision (varies)                                          │
│  5. Communicate status to stakeholders                                 │
│                                                                          │
│  INVESTIGATION:                                                          │
│  ──────────────                                                          │
│  • Check recent deployments                                            │
│  • Check data pipeline health                                          │
│  • Check feature store health                                          │
│  • Check upstream dependencies                                         │
│  • Review error logs                                                   │
│                                                                          │
│  POST-INCIDENT:                                                          │
│  ──────────────                                                          │
│  • Write post-mortem (within 48 hours)                                │
│  • Identify root cause                                                 │
│  • Document what went wrong                                            │
│  • Create action items to prevent recurrence                           │
│  • Share learnings with team                                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON ISSUES AND SOLUTIONS:                                           │
│                                                                          │
│  Model serving slow:                                                    │
│  → Check GPU utilization, batch sizes, feature fetch latency          │
│                                                                          │
│  Predictions look wrong:                                                │
│  → Check feature values, model version, recent deployments            │
│                                                                          │
│  Metrics dropping:                                                      │
│  → Check for data drift, upstream changes, A/B test issues            │
│                                                                          │
│  Model not serving:                                                     │
│  → Check model server health, dependencies, configuration             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Monitoring and Continual Learning Checklist

Verify ongoing operations:

  • Monitoring dashboards for all levels (infra, model, data, business)
  • Alerting rules configured with appropriate thresholds
  • On-call rotation established
  • Drift detection automated
  • Retraining pipeline automated
  • Incident runbooks documented
  • Feedback loops closed (predictions → outcomes → training)
  • Regular model reviews scheduled

Common Pitfall: "Set and forget" mentality. Models degrade over time. Monitoring and retraining are non-negotiable. Build these into the system from day one.


Complete System Design Checklist

Use this comprehensive checklist to ensure thoroughness:

Problem Framing

  • Business objective defined and measurable
  • ML task type identified
  • Input and output specified
  • Ground truth source identified
  • Success criteria established
  • Failure modes understood
  • Stakeholders aligned

Scale and Latency

  • Peak QPS estimated
  • Latency budget defined (p50, p95, p99)
  • Data volume calculated
  • Storage requirements estimated
  • Growth projections documented
  • Cost estimate prepared

Metrics

  • Primary offline metric chosen
  • Primary online metric chosen
  • Guardrail metrics listed
  • Offline-online correlation validated
  • Statistical significance requirements defined

Architecture

  • High-level architecture diagram drawn
  • Critical path identified
  • Caching strategy defined
  • Failure modes with fallbacks designed
  • Capacity planning completed
  • Deployment strategy defined

Data

  • Data sources identified
  • Labeling strategy defined
  • Training-serving consistency ensured
  • Data quality assessed
  • Privacy requirements addressed

Offline Development

  • Baseline established
  • Model architecture chosen
  • Training pipeline automated
  • Offline metrics computed
  • Error analysis completed
  • Artifacts versioned

Online Execution

  • Shadow/canary deployment completed
  • A/B test designed and run
  • Statistical significance reached
  • Guardrails verified
  • Rollback plan tested

Monitoring and Operations

  • Monitoring dashboards created
  • Alerting rules configured
  • Drift detection automated
  • Retraining pipeline automated
  • Incident runbooks documented
  • Feedback loops closed

Summary

ML system design is more than building models—it's building reliable systems that deliver business value at scale. The framework in this post covers the complete lifecycle:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    KEY TAKEAWAYS                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. START WITH THE BUSINESS PROBLEM                                     │
│     Don't build a model without understanding why.                      │
│     Define success before writing code.                                 │
│                                                                          │
│  2. DESIGN FOR SCALE (BUT DON'T OVER-ENGINEER)                         │
│     Know your numbers: QPS, latency, data volume.                      │
│     Plan for 10x, implement for 1x.                                    │
│                                                                          │
│  3. METRICS DRIVE EVERYTHING                                            │
│     Offline metrics must correlate with online business impact.        │
│     Guardrails prevent unintended consequences.                        │
│                                                                          │
│  4. DATA QUALITY > MODEL COMPLEXITY                                     │
│     Most improvements come from better data, not fancier models.       │
│     Training-serving consistency is critical.                          │
│                                                                          │
│  5. VALIDATE IN PRODUCTION                                              │
│     Offline success doesn't guarantee online success.                  │
│     A/B test everything significant.                                   │
│                                                                          │
│  6. PLAN FOR FAILURE                                                    │
│     Everything will fail eventually.                                    │
│     Graceful degradation, fast rollback.                               │
│                                                                          │
│  7. MONITOR AND ITERATE                                                 │
│     Models degrade over time.                                           │
│     Continuous monitoring and retraining are mandatory.                │
│                                                                          │
│  THE BEST ML ENGINEERS:                                                 │
│  • Think in systems, not just models                                   │
│  • Obsess over data quality                                            │
│  • Measure everything that matters                                     │
│  • Design for failure and recovery                                     │
│  • Balance complexity with pragmatism                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

This framework is iterative. You'll revisit earlier steps as you learn more. Don't be afraid to adapt it to your specific needs—the principles matter more than the exact steps.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
