When NOT to Use LLMs: A Practical Guide to Choosing the Right Tool
A contrarian but practical guide to when large language models are the wrong choice. Understanding when traditional ML, simple heuristics, or no ML at all will outperform LLMs on cost, latency, reliability, and accuracy.
The LLM Hammer Problem
When you have a shiny new hammer, everything looks like a nail. In 2025, that hammer is the large language model.
The hype is understandable. LLMs can write code, answer questions, analyze documents, generate content, and handle tasks that seemed impossible just a few years ago. Venture capital flows to "AI-native" startups. Job postings demand "LLM experience." Every product roadmap includes an AI feature.
But here's the uncomfortable truth: for many problems, LLMs are the wrong tool. They're slower, more expensive, less reliable, and less accurate than simpler alternatives. Using an LLM where a regex would suffice isn't innovation—it's overengineering.
This guide covers when NOT to use LLMs, what to use instead, and how to make the right tool choice for your problem.
The Case Against LLM-Everything
Cost
LLMs are expensive at scale:
API costs: GPT-4o costs roughly $2.50 per million input tokens and $10 per million output tokens. A high-traffic application processing millions of requests daily can spend tens of thousands of dollars monthly on API calls alone.
Infrastructure costs: Self-hosted models require expensive GPUs. An 8xH100 server costs $10-30/hour to rent.
Hidden costs: Token optimization, prompt engineering, guardrails, evaluation—LLM projects have substantial ongoing operational overhead.
Compare to traditional ML: A logistic regression model costs essentially nothing to run. A gradient boosted tree handles thousands of predictions per second on a $50/month server.
Latency
LLMs are slow:
Time to first token: 100-500ms typically
Full response: 1-10+ seconds for substantial outputs
Worst case: Complex prompts with long outputs can take 30+ seconds
For real-time applications—fraud detection, ad bidding, game AI, trading systems—this latency is unacceptable. Traditional ML models provide predictions in single-digit milliseconds.
Reliability
LLMs are stochastic and unpredictable:
Non-determinism: The same input can produce different outputs (even with temperature=0, there's variance across API calls and model versions).
Hallucinations: LLMs confidently generate false information. For applications requiring factual accuracy, this is a fundamental limitation.
Mode collapse: Models can get stuck in patterns, producing repetitive or degraded outputs.
Silent failures: Unlike traditional code that throws errors, LLMs produce plausible-looking wrong answers.
Traditional ML models are deterministic. The same input always produces the same output. Failure modes are understood and predictable.
Accuracy
For many tasks, LLMs are less accurate than specialized solutions:
Structured prediction: For classification, regression, ranking, and other structured tasks with well-defined outputs, task-specific models trained on domain data typically outperform general LLMs.
Mathematical computation: LLMs struggle with arithmetic. A calculator is infinitely more accurate for math.
Factual lookup: A database query returns the correct answer; an LLM might hallucinate.
Pattern matching: Regex, exact matching, and rule-based systems are 100% accurate for well-specified patterns.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM vs. TRADITIONAL ML COMPARISON │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DIMENSION │ LLMs │ TRADITIONAL ML │
│ ──────────────────────────────────────────────────────────────────────── │
│ │
│ Latency │ 100ms - 10s+ │ 1-10ms │
│ │
│ Cost per 1M ops │ $1 - $100+ │ $0.01 - $1 │
│ │
│ Determinism │ Non-deterministic │ Deterministic │
│ │
│ Interpretability │ Black box │ Often interpretable │
│ │
│ Training data │ Pre-trained, │ Requires labeled domain │
│ │ can few-shot │ data │
│ │
│ Deployment │ GPU required or │ CPU usually sufficient │
│ │ API dependency │ │
│ │
│ Flexibility │ Handles novel tasks │ Limited to trained tasks │
│ │
│ Failure modes │ Subtle, hard to │ Clear, diagnosable │
│ │ debug │ │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ THE BOTTOM LINE: │
│ │
│ LLMs excel at: │
│ • Unstructured text generation │
│ • Novel tasks without training data │
│ • Complex reasoning over text │
│ • Flexible, conversational interfaces │
│ │
│ Traditional ML excels at: │
│ • Structured prediction (classification, regression) │
│ • High-throughput, low-latency requirements │
│ • Well-defined tasks with training data │
│ • When interpretability matters │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
When Traditional ML Beats LLMs
Tabular Data and Structured Prediction
For classification and regression on tabular data, traditional ML consistently outperforms LLMs:
The data: Customer features, transaction records, sensor readings, log events—structured data with clear features.
The task: Predict churn, detect fraud, forecast demand, classify tickets.
Why traditional ML wins:
- Gradient boosted trees (XGBoost, LightGBM, CatBoost) are state-of-the-art for tabular data
- Orders of magnitude faster and cheaper
- Better accuracy with proper feature engineering
- Interpretable feature importance
The evidence: Despite years of deep learning advances, gradient boosted trees still win Kaggle competitions on tabular data. LLMs processing tabular data as text lose the structural information that makes trees effective.
Recommendation: For tabular prediction tasks, start with XGBoost or LightGBM. Only consider LLMs if the task inherently requires natural language understanding (e.g., incorporating text fields).
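As a concrete starting point, here is a minimal sketch of that recommendation in Python, using XGBoost on synthetic data; the features, labels, and hyperparameters are placeholders for your own dataset and tuning.

```python
# Minimal sketch: gradient boosted trees for a tabular, churn-style problem.
# Synthetic data stands in for real customer features.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Synthetic tabular data: 20 numeric features, binary label (e.g., churn yes/no)
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

# Millisecond-scale CPU inference, plus interpretable feature importances
proba = model.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, proba), 3))
print("Most important features:", model.feature_importances_.argsort()[::-1][:5])
```

On a commodity CPU this kind of model serves thousands of predictions per second, which is the cost and latency profile the comparison above assumes.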
Real-Time Systems
Systems requiring sub-100ms latency shouldn't use LLMs in the critical path:
Fraud detection: Transactions must be approved/declined in milliseconds. LLMs are too slow.
Ad bidding: Real-time bidding happens in 10-100ms. No room for LLM inference.
Game AI: NPCs need to respond instantly. Frame-rate-dependent decisions can't wait for LLM calls.
Recommendation systems: Users expect immediate results. LLMs can enhance recommendations but shouldn't block the primary recommendation engine.
Trading systems: Microseconds matter. LLMs are roughly a million times too slow.
For these applications, use traditional ML models designed for low-latency inference. If you need LLM capabilities, use them asynchronously (pre-compute embeddings, batch process, etc.).
High-Volume, Low-Margin Operations
When you're processing millions of items with thin margins per item, LLM costs become prohibitive:
Email classification at scale: Processing 100M emails/day through an LLM API at $0.01 per email would cost $1M/day. A trained classifier costs nearly nothing.
Log analysis: Billions of log lines per day. LLMs are economically impossible; rule-based systems and traditional anomaly detection work.
Content moderation at scale: Social media volumes require cheap, fast classifiers. LLMs can handle edge cases escalated from primary classifiers.
Recommendation systems: Computing recommendations for millions of users on every page load requires sub-millisecond models.
Rule of thumb: If your per-item profit margin is less than the LLM API cost, you can't afford to use LLMs for every item.
When Interpretability Matters
Regulated industries and high-stakes decisions often require explainable models:
Credit decisions: Regulations require explaining why an application was denied. "The LLM said so" isn't acceptable.
Medical diagnosis support: Clinicians need to understand why a model flagged something.
Legal applications: Decisions must be justifiable and auditable.
HR/hiring: Explaining hiring decisions is legally required in many jurisdictions.
Traditional ML models (linear models, decision trees, rule-based systems) provide clear explanations. LLMs are black boxes—we can ask them to explain, but their explanations may not reflect their actual decision process.
Well-Defined Tasks with Training Data
If your task is well-specified and you have training data, a task-specific model usually wins:
Sentiment analysis: Fine-tuned BERT or even simpler classifiers outperform prompted LLMs while being much faster and cheaper.
Named entity recognition: SpaCy or fine-tuned NER models are better than prompting GPT-4.
Text classification: A fine-tuned classifier beats prompting for most classification tasks where you have labeled data.
Translation: Dedicated translation models (NLLB, OPUS) often beat general LLMs on specific language pairs.
The pattern: LLMs are generalists; specialists beat generalists on specific tasks.
Case Studies: Wrong Tool, Right Tool
Let's examine real scenarios where choosing the right tool makes orders-of-magnitude difference.
Case Study 1: Email Routing
The task: Route incoming support emails to the right department (billing, technical, sales, general).
The LLM approach: Send each email to GPT-4 with a prompt asking it to classify. Cost: ~$0.01 per email. Latency: 500ms-2s. Accuracy: ~95%.
The right approach: Fine-tune a DistilBERT classifier on 5,000 labeled historical emails. Cost: ~$0.0001 per email (100x cheaper). Latency: 10ms (50-200x faster). Accuracy: 97% (actually better with domain-specific training).
The math at scale:
- 100,000 emails/day
- LLM: $1,000/day, 500ms latency
- Classifier: $10/day, 10ms latency
Over a year, the classifier saves $361,000 and provides better user experience.
When LLM makes sense: During initial development (before you have training data), or for the 2% of emails the classifier is uncertain about.
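To make the classifier path concrete, here is a minimal sketch that uses a TF-IDF plus logistic regression baseline as a stand-in for the fine-tuned DistilBERT; the departments and example emails are hypothetical, and the confidence check is where escalation to an LLM or a human would hook in.

```python
# Minimal sketch of the cheap routing path; a fine-tuned DistilBERT would
# slot in the same way, just with better accuracy on nuanced emails.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "My card was charged twice this month",        # billing
    "The API returns a 500 error on every call",   # technical
    "Can I get a quote for the enterprise plan?",  # sales
]
labels = ["billing", "technical", "sales"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(emails, labels)  # in practice: thousands of labeled historical emails

# Route with confidence; low-confidence emails escalate to an LLM or a human
probs = clf.predict_proba(["I was billed after cancelling my subscription"])[0]
label, confidence = clf.classes_[probs.argmax()], probs.max()
print(label if confidence >= 0.5 else "escalate", round(confidence, 2))
```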
Case Study 2: Log Anomaly Detection
The task: Detect anomalous patterns in application logs (100 million log lines per day).
The LLM approach: Economically impossible. Even at $0.001 per log line, 100M lines/day works out to $100,000/day. And the latency would mean anomalies are detected hours late.
The right approach:
- Parse logs with regex into structured data
- Statistical anomaly detection (isolation forests, z-scores)
- Rule-based alerts for known patterns
- Aggregate dashboards for human review
Cost: Effectively zero (runs on existing infrastructure). Latency: Real-time. Accuracy: Tuned to your actual anomaly patterns.
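A minimal sketch of that statistical pipeline, assuming a toy log format: parse lines with a regex, aggregate per time window, and flag outlier windows with an isolation forest. The regex, fields, and contamination rate are illustrative.

```python
# Minimal sketch: regex parsing + per-window features + IsolationForest.
import re
import numpy as np
from sklearn.ensemble import IsolationForest

LINE = re.compile(r"(?P<level>INFO|WARN|ERROR) (?P<latency_ms>\d+)ms")

def featurize(window_lines):
    """Turn one time window of raw log lines into a small feature vector."""
    errors, latencies = 0, []
    for line in window_lines:
        m = LINE.search(line)
        if not m:
            continue
        if m.group("level") == "ERROR":
            errors += 1
        latencies.append(int(m.group("latency_ms")))
    latencies = latencies or [0]
    return [errors, float(np.mean(latencies)), float(np.percentile(latencies, 95))]

# One row per time window; in production these stream from the log pipeline
normal = [featurize([f"INFO {np.random.randint(5, 50)}ms"] * 100) for _ in range(500)]
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

suspect = featurize(["ERROR 900ms"] * 40 + ["INFO 20ms"] * 60)
print("anomaly" if model.predict([suspect])[0] == -1 else "normal")
```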
When LLM makes sense: Root cause analysis after an anomaly is detected. "Here are 50 suspicious log lines—what might have caused this?"
Case Study 3: Product Search
The task: Search a catalog of 10 million products.
The LLM approach: Generate product descriptions with LLM, use embeddings for semantic search. Reasonable for the search itself, but...
The pitfall: Using an LLM at query time to rerank or filter results. At 1 million searches/day, even $0.001 per query adds $1,000/day just for search.
The right approach:
- Pre-compute embeddings for all products (one-time cost)
- Use vector database for fast similarity search (Pinecone, Qdrant, etc.)
- Combine with traditional filters (price, category, availability)
- Use lightweight reranker (cross-encoder) for top 100 results
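In the sketch below, the expensive work (embedding the catalog) happens offline, and query time reduces to one embedding call plus a dot product. The embed() function is a random-vector placeholder for whatever embedding model you actually precompute with.

```python
# Minimal sketch: precomputed catalog embeddings, cheap similarity at query time.
import numpy as np

def embed(texts):
    # Placeholder: swap in your real embedding model (run offline for the catalog)
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 384))

# Offline, once: embed every product and store the vectors (vector DB or matrix)
product_titles = ["blue linen summer dress", "black leather boots", "floral maxi dress"]
product_vecs = embed(product_titles)
product_vecs /= np.linalg.norm(product_vecs, axis=1, keepdims=True)

# Online, per query: one embedding plus a dot product; no LLM in the hot path
query_vec = embed(["blue dress for summer wedding"])[0]
query_vec /= np.linalg.norm(query_vec)
scores = product_vecs @ query_vec
top_k = np.argsort(-scores)[:2]          # feed these into filters and the reranker
print([product_titles[i] for i in top_k])
```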
When LLM makes sense: Query understanding ("blue dress for summer wedding" → structured query), or conversational commerce where natural interaction adds value.
Case Study 4: Data Extraction from Invoices
The task: Extract vendor name, amount, date, line items from invoices.
The LLM approach: Send invoice image to GPT-4V. Works well! But costs $0.01-0.10 per invoice.
The right approach (for standardized invoices):
- Template matching for known vendors
- OCR + regex for structured fields
- Traditional ML for field detection in new formats
- LLM fallback for unrecognized formats
Hybrid result: 80% of invoices are handled by templates and rules for fractions of a cent; the remaining 20% fall back to the LLM at ~$0.05. Average cost: ~$0.01 per invoice (5-10x cheaper than LLM-only).
Key insight: Most real-world data has structure. Exploit that structure when it exists.
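As an illustration, here is a minimal sketch of that hybrid pipeline over OCR text: regex for the common fields, with a stubbed LLM fallback for unrecognized formats. The patterns and the extract_with_llm() hook are assumptions, not a production parser.

```python
# Minimal sketch: rules handle the common case, LLM fallback handles the rest.
import re

AMOUNT = re.compile(r"(?:Total|Amount Due)[:\s]*\$?([\d,]+\.\d{2})", re.I)
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b")

def extract(ocr_text: str) -> dict:
    amount = AMOUNT.search(ocr_text)
    date = DATE.search(ocr_text)
    if amount and date:
        return {"amount": amount.group(1), "date": date.group(1), "source": "rules"}
    # Unrecognized format: escalate the minority of invoices to the expensive path
    return extract_with_llm(ocr_text)

def extract_with_llm(ocr_text: str) -> dict:
    # Placeholder for a vision/LLM call on the ~20% of invoices rules can't parse
    return {"amount": None, "date": None, "source": "llm_fallback"}

print(extract("Invoice 2024-03-01  ...  Total: $1,284.50"))
```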
┌─────────────────────────────────────────────────────────────────────────────┐
│ CASE STUDY COST COMPARISON │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TASK │ VOLUME │ LLM COST │ RIGHT TOOL │ SAVINGS │
│ ──────────────────────────────────────────────────────────────────────── │
│ │
│ Email classification │ 100K/day │ $1,000/day │ $10/day │ 99% │
│ │
│ Log analysis │ 100M/day │ Impossible │ ~$0/day │ ∞ │
│ │
│ Product search │ 1M/day │ $1,000/day │ $50/day │ 95% │
│ │
│ Invoice extraction │ 10K/day │ $500/day │ $100/day │ 80% │
│ │
│ Sentiment analysis │ 1M/day │ $5,000/day │ $20/day │ 99.6% │
│ │
│ Language detection │ 10M/day │ $10,000/day │ $1/day │ 99.99% │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ ANNUAL SAVINGS BY USING THE RIGHT TOOL: │
│ │
│ These six examples alone: ~$6M in API costs avoided per year │
│ Plus: Lower latency, higher reliability, better accuracy │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
When Simple Heuristics Beat LLMs
Sometimes you don't need ML at all:
Pattern Matching
Email parsing: Extract sender, subject, date with regex—100% accurate, instant, free.
Log parsing: Structured logs have known formats. Regex or parsing libraries are perfect.
Data validation: Phone numbers, emails, URLs, dates—regex validates perfectly.
Document structure: Extracting sections from documents with known templates doesn't need ML.
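A minimal sketch of what this looks like in practice; the patterns are deliberately simplified examples and would be tightened to match your own rules.

```python
# Minimal sketch: deterministic validation, instant and free.
import re

VALIDATORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def validate(kind: str, value: str) -> bool:
    return bool(VALIDATORS[kind].fullmatch(value))

print(validate("email", "ada@example.com"))  # True
print(validate("iso_date", "2025-13-40"))    # True for this pattern; add range checks if needed
```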
Lookup Tables
Currency conversion: Multiply by exchange rate. No LLM needed.
Unit conversion: Simple arithmetic. Perfect accuracy.
Reference data: Looking up product info, user profiles, historical data—database queries, not LLMs.
Mapping codes: ICD codes to descriptions, country codes to names—lookup tables.
Rule-Based Systems
Business logic: Tax calculations, pricing rules, eligibility determination—rules are deterministic and auditable.
Workflow routing: If the message is from a VIP customer, route it to the priority queue. Simple conditions.
Validation rules: Age must be > 0 and < 150. No ML required.
Alert thresholds: CPU > 90% for 5 minutes → alert. Thresholds, not neural networks.
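The CPU alert above, for example, is a few lines of deterministic code; the metric name, threshold, and window size in this sketch are illustrative.

```python
# Minimal sketch: a threshold held over a time window, no model required.
from collections import deque

class CpuAlert:
    """Fire when CPU stays above a threshold for a full window of samples."""
    def __init__(self, threshold=90.0, window=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent: float) -> bool:
        self.samples.append(cpu_percent)
        return (len(self.samples) == self.samples.maxlen
                and all(s > self.threshold for s in self.samples))

alert = CpuAlert(threshold=90.0, window=5)
for reading in [85, 92, 95, 93, 96, 97]:  # one sample per minute
    if alert.observe(reading):
        print("ALERT: CPU > 90% for 5 consecutive samples")
```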
┌─────────────────────────────────────────────────────────────────────────────┐
│ DECISION FRAMEWORK: WHAT TO USE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ START HERE: What is your task? │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Is the task well-defined with clear rules? │
│ │ │
│ ├── YES → Can you write explicit rules? │
│ │ │ │
│ │ ├── YES → Use RULES / REGEX / LOOKUP │
│ │ │ (Deterministic, fast, free, debuggable) │
│ │ │ │
│ │ └── NO → Do you have training data? │
│ │ │ │
│ │ ├── YES → Use TRADITIONAL ML │
│ │ │ (Classifiers, regression, trees) │
│ │ │ │
│ │ └── NO → Consider LLM with few-shot learning │
│ │ │
│ └── NO (ambiguous, open-ended) → Continue below │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Does the task require natural language generation? │
│ │ │
│ ├── YES → LLM is appropriate │
│ │ (Writing, summarization, conversation, explanation) │
│ │ │
│ └── NO → Does it require complex reasoning over text? │
│ │ │
│ ├── YES → LLM is appropriate │
│ │ (Analysis, comparison, synthesis) │
│ │ │
│ └── NO → Does it require flexibility for novel inputs? │
│ │ │
│ ├── YES → LLM may be appropriate │
│ │ │
│ └── NO → Reconsider simpler approaches │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ CONSTRAINT CHECK (even if LLM seems appropriate): │
│ │
│ □ Latency requirement < 100ms? → Consider traditional ML or hybrid │
│ □ Volume > 1M/day with thin margins? → Consider traditional ML │
│ □ Must be deterministic? → Avoid LLMs or add verification layer │
│ □ Must be interpretable? → Use traditional ML or hybrid │
│ □ Factual accuracy critical? → Add retrieval, verification, guardrails │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
When LLMs ARE the Right Choice
To be clear, LLMs are transformative for the right problems:
Open-Ended Text Generation
Writing, summarization, explanation, creative content—tasks where the output is natural language and the space of correct answers is large. This is what LLMs are designed for.
Tasks Without Training Data
When you can't collect labeled data (new task, cold start, rare events), LLMs can work with just a description and examples. Few-shot and zero-shot learning are genuine capabilities.
Complex Reasoning Over Text
Analyzing documents, synthesizing information, comparing options, understanding nuance—tasks requiring comprehension and reasoning that rule-based systems can't handle.
Flexible Interfaces
Chatbots, assistants, natural language interfaces—anywhere users expect to communicate in natural language and receive natural language back.
Rapid Prototyping
When you need something working today to validate an idea, LLMs let you skip the data collection and model training phases. Validate first, optimize later.
Hybrid Approaches
The best systems often combine approaches:
LLM for Edge Cases, Traditional ML for Volume
Pattern: Traditional ML handles 90%+ of traffic cheaply and quickly. LLM handles the long tail of difficult cases.
Example: Content moderation. Fast classifier catches obvious violations. Borderline cases escalate to LLM (or human) review.
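A minimal sketch of that escalation logic; classifier and call_llm are placeholders for your own fast model and LLM client, and the threshold is something you would tune on a validation set.

```python
# Minimal sketch: cheap classifier for the bulk, LLM only for low-confidence cases.
CONFIDENCE_THRESHOLD = 0.9  # tune on a validation set

def moderate(text: str, classifier, call_llm) -> str:
    label, confidence = classifier(text)          # milliseconds, fractions of a cent
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    return call_llm(                              # seconds, cents -- but rare
        f"Classify this content as 'allowed' or 'violation':\n{text}"
    )

# Example with stubbed components
decision = moderate(
    "borderline sarcastic post",
    classifier=lambda t: ("allowed", 0.62),
    call_llm=lambda prompt: "violation",
)
print(decision)
```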
LLM for Understanding, Traditional Systems for Action
Pattern: LLM interprets intent; deterministic systems execute.
Example: "Transfer $500 to John" → LLM extracts intent (transfer, amount, recipient) → banking system executes with proper validation.
LLM for Generation, Verification for Accuracy
Pattern: LLM generates; another system verifies.
Example: LLM generates SQL query → query parser validates syntax → execution engine verifies safety → database runs query.
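A minimal sketch of such a verification layer using only the standard library; a real system would use a proper SQL parser, and the allowed-table list here is hypothetical.

```python
# Minimal sketch: generated SQL must be a single read-only statement over
# whitelisted tables before it ever reaches the database.
import re

ALLOWED_TABLES = {"orders", "customers"}
FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|grant)\b", re.I)

def is_safe(sql: str) -> bool:
    statements = [s for s in sql.strip().split(";") if s.strip()]
    if len(statements) != 1:
        return False                       # no stacked statements
    stmt = statements[0].strip()
    if not stmt.lower().startswith("select"):
        return False                       # read-only
    if FORBIDDEN.search(stmt):
        return False
    pairs = re.findall(r"\bfrom\s+(\w+)|\bjoin\s+(\w+)", stmt, re.I)
    tables = {t for pair in pairs for t in pair if t}
    return tables <= ALLOWED_TABLES        # only whitelisted tables

print(is_safe("SELECT count(*) FROM orders WHERE total > 100"))  # True
print(is_safe("SELECT * FROM users; DROP TABLE users"))          # False
```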
Retrieval + LLM (RAG)
Pattern: Traditional retrieval finds relevant information; LLM synthesizes.
This is RAG—combining the factual grounding of retrieval with the synthesis capabilities of LLMs.
The Decision Framework
When evaluating whether to use an LLM, ask:
1. What's the simplest solution that could work?
Start simple. Regex, rules, lookup tables, existing libraries. Only add complexity if simple solutions fail.
2. Do I have training data?
If yes, and the task is well-defined, a fine-tuned task-specific model likely beats prompting a general LLM.
3. What are my latency requirements?
Sub-100ms: avoid LLMs in the critical path. 100ms-1s: LLMs are possible, but weigh the user-experience impact. 1s+: LLMs are fine latency-wise.
4. What's my cost budget per operation?
If processing millions of items at thin margins, LLM costs may be prohibitive.
5. How important is determinism?
If the same input must always produce the same output, LLMs are problematic.
6. How important is interpretability?
If you need to explain decisions, prefer interpretable models.
7. What's the cost of errors?
High-stakes: Add verification, guardrails, human review regardless of approach. Low-stakes: LLM errors may be acceptable.
8. Is this a text generation task?
If the output is natural language, LLMs are likely appropriate. If the output is structured (numbers, categories, actions), consider alternatives.
Technical Deep Dive: Why Specialized Models Win
Understanding why specialized models outperform LLMs on specific tasks helps make better decisions.
Information Density and Tokenization
LLMs process text through tokenization, which introduces overhead for structured data:
Example: Processing a transaction record
Structured data: {user_id: 12345, amount: 99.99, timestamp: 1704067200}
As LLM input: "The user with ID 12345 made a purchase of $99.99 at timestamp 1704067200" → ~30 tokens
A gradient boosted tree sees: [12345, 99.99, 1704067200] → 3 features
The LLM must:
- Parse the text structure
- Extract numerical values
- Learn that these are features
- Learn their relationships
The tree directly operates on features, with explicit splits learned from data. It's fundamentally more efficient for tabular prediction.
Latency Breakdown
LLM inference has multiple latency components:
Time to first token (TTFT): 100-500ms
- Tokenization
- KV cache computation (if long context)
- First forward pass
Time per output token (TPOT): 10-50ms each
- Autoregressive generation
- KV cache updates
- Sampling
Total latency for 100-token response: 1-5 seconds
Compare to traditional ML:
Inference latency: 0.1-10ms total
- No tokenization (direct feature vectors)
- Single forward pass
- No autoregressive generation
This 100-1000x difference is fundamental to the architectures, not just current implementations.
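The gap is easy to sanity-check with arithmetic; this sketch turns the rough numbers above into a back-of-the-envelope latency model (the constants are the article's ranges, not benchmarks of any particular model).

```python
# Back-of-the-envelope latency model using the article's rough numbers.
def llm_latency_ms(output_tokens: int, ttft_ms: float = 300, per_token_ms: float = 30) -> float:
    """TTFT plus autoregressive generation: one forward pass per output token."""
    return ttft_ms + output_tokens * per_token_ms

def traditional_ml_latency_ms() -> float:
    return 1.0  # single forward pass over a feature vector, on CPU

for tokens in (10, 100, 500):
    print(f"{tokens:>4} output tokens: LLM ~{llm_latency_ms(tokens) / 1000:.1f}s "
          f"vs traditional ML ~{traditional_ml_latency_ms():.0f}ms")
```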
Parameter Efficiency
LLMs use parameters inefficiently for narrow tasks:
GPT-4: ~1.7 trillion parameters (estimated)
- Knows about: everything (history, science, code, languages, etc.)
- Knows about your task: a tiny fraction
Task-specific classifier: 10-100 million parameters
- Knows about: your specific task
- Parameter efficiency: 10,000x better for your task
Those extra LLM parameters aren't free—they add latency, cost, and potential for irrelevant knowledge to interfere.
Calibration and Uncertainty
LLMs are notoriously poorly calibrated—their confidence doesn't match their accuracy.
Traditional ML advantages:
- Probabilistic outputs (0.73 confidence means ~73% accuracy)
- Well-understood uncertainty quantification
- Can be calibrated post-hoc
LLM challenges:
- Confidence often doesn't correlate with correctness
- "Hallucinations" occur with high apparent confidence
- Temperature controls randomness, not true uncertainty
For decisions requiring reliable confidence estimates (medical triage, fraud risk scoring), traditional ML is far more trustworthy.
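As an illustration of post-hoc calibration on the traditional-ML side, here is a minimal sketch with scikit-learn; the synthetic data and the gradient-boosting base model are stand-ins for your own task.

```python
# Minimal sketch: after calibration, a 0.7 score should mean roughly 70% of
# such predictions are correct -- the property LLM confidences don't reliably give.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

base = GradientBoostingClassifier(random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5).fit(X_train, y_train)

# Compare predicted probabilities to observed accuracy, bin by bin
prob_true, prob_pred = calibration_curve(
    y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10
)
for predicted, observed in zip(prob_pred, prob_true):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```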
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY SPECIALIZED MODELS WIN: TECHNICAL SUMMARY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DIMENSION │ LLM │ SPECIALIZED MODEL │
│ ──────────────────────────────────────────────────────────────────────── │
│ │
│ Input processing │ Text tokenization │ Direct feature vectors │
│ │ 30+ tokens/record │ N features (raw) │
│ │
│ Parameters used │ 1T+ (vast majority │ 10-100M (all focused │
│ │ irrelevant to task) │ on task) │
│ │
│ Inference steps │ Autoregressive │ Single forward pass │
│ │ (N tokens × forward) │ │
│ │
│ Latency floor │ ~100ms (TTFT alone) │ ~0.1ms │
│ │
│ Calibration │ Poor (overconfident) │ Good (can calibrate) │
│ │
│ Failure mode │ Plausible wrong answers │ Obvious errors │
│ │
│ Training signal │ Generic next-token │ Task-specific loss │
│ │
│ Hardware needed │ GPU (often multiple) │ CPU often sufficient │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Anti-Patterns to Avoid
The "AI-Native" Trap
Building LLM-first without considering simpler alternatives. Often driven by marketing rather than engineering.
Sign: "We're using GPT-4 for everything."
Fix: Evaluate each component independently. Use LLMs where they add value.
Prompt Engineering When You Should Train
Spending weeks engineering prompts for a task where a simple fine-tuned classifier would work better.
Sign: Prompt is 2000 tokens of examples and instructions for a classification task.
Fix: If you have labeled data and a well-defined task, train a classifier.
LLM as Database
Using LLMs to retrieve facts they might hallucinate.
Sign: "The model knows the product catalog."
Fix: Use retrieval. LLMs synthesize; databases store.
Ignoring Failure Modes
Deploying LLMs without considering hallucinations, prompt injection, or edge cases.
Sign: No guardrails, no evaluation, no monitoring.
Fix: Implement proper evaluation, guardrails, and observability.
Premature Optimization (in either direction)
Either over-engineering with LLMs from day one, or refusing to use them when they're the right tool.
Fix: Match tool to problem. Be willing to change as requirements become clearer.
The Pragmatic Path Forward
The best engineers use the right tool for the job. In 2025, that means:
Defaulting to simplicity: Start with the simplest solution. Add complexity only when needed.
Knowing your options: Understand traditional ML, rule-based systems, AND LLMs well enough to choose appropriately.
Measuring what matters: Define success metrics before choosing an approach. Evaluate options against those metrics.
Staying flexible: Be willing to change approaches as you learn more about the problem.
Combining strengths: The best systems often combine multiple approaches—LLMs for flexibility, traditional ML for efficiency, rules for reliability.
LLMs are powerful tools. But so are hammers, and not everything is a nail.