
When NOT to Use LLMs: A Practical Guide to Choosing the Right Tool

A contrarian but practical guide to when large language models are the wrong choice. Understanding when traditional ML, simple heuristics, or no ML at all will outperform LLMs on cost, latency, reliability, and accuracy.

12 min read

The LLM Hammer Problem

When you have a shiny new hammer, everything looks like a nail. In 2025, that hammer is the large language model.

The hype is understandable. LLMs can write code, answer questions, analyze documents, generate content, and handle tasks that seemed impossible just a few years ago. Venture capital flows to "AI-native" startups. Job postings demand "LLM experience." Every product roadmap includes an AI feature.

But here's the uncomfortable truth: for many problems, LLMs are the wrong tool. They're slower, more expensive, less reliable, and less accurate than simpler alternatives. Using an LLM where a regex would suffice isn't innovation—it's overengineering.

This guide covers when NOT to use LLMs, what to use instead, and how to make the right tool choice for your problem.

The Case Against LLM-Everything

Cost

LLMs are expensive at scale:

API costs: GPT-4o costs roughly $2.50-$10 per million tokens (input vs. output). A high-traffic application processing millions of requests daily can spend tens of thousands of dollars monthly on API calls alone.

Infrastructure costs: Self-hosted models require expensive GPUs. An 8xH100 server costs $200,000+ to purchase or $10-30/hour to rent.

Hidden costs: Token optimization, prompt engineering, guardrails, evaluation—LLM projects have substantial ongoing operational overhead.

Compare to traditional ML: A logistic regression model costs essentially nothing to run. A gradient boosted tree handles thousands of predictions per second on a $50/month server.

Latency

LLMs are slow:

Time to first token: typically 100-500ms.

Full response: 1-10+ seconds for substantial outputs.

Worst case: Complex prompts with long outputs can take 30+ seconds.

For real-time applications—fraud detection, ad bidding, game AI, trading systems—this latency is unacceptable. Traditional ML models provide predictions in single-digit milliseconds.

Reliability

LLMs are stochastic and unpredictable:

Non-determinism: The same input can produce different outputs (even with temperature=0, there's variance across API calls and model versions).

Hallucinations: LLMs confidently generate false information. For applications requiring factual accuracy, this is a fundamental limitation.

Mode collapse: Models can get stuck in patterns, producing repetitive or degraded outputs.

Silent failures: Unlike traditional code that throws errors, LLMs produce plausible-looking wrong answers.

Traditional ML models are deterministic. The same input always produces the same output. Failure modes are understood and predictable.

Accuracy

For many tasks, LLMs are less accurate than specialized solutions:

Structured prediction: For classification, regression, ranking, and other structured tasks with well-defined outputs, task-specific models trained on domain data typically outperform general LLMs.

Mathematical computation: LLMs struggle with arithmetic. A calculator is infinitely more accurate for math.

Factual lookup: A database query returns the correct answer; an LLM might hallucinate.

Pattern matching: Regex, exact matching, and rule-based systems are 100% accurate for well-specified patterns.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LLM vs. TRADITIONAL ML COMPARISON                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  DIMENSION          │ LLMs                  │ TRADITIONAL ML                │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                             │
│  Latency            │ 100ms - 10s+          │ 1-10ms                        │
│                                                                             │
│  Cost per 1M ops    │ $1 - $100+            │ $0.01 - $1                    │
│                                                                             │
│  Determinism        │ Non-deterministic     │ Deterministic                 │
│                                                                             │
│  Interpretability   │ Black box             │ Often interpretable           │
│                                                                             │
│  Training data      │ Pre-trained,          │ Requires labeled domain       │
│                     │ can few-shot          │ data                          │
│                                                                             │
│  Deployment         │ GPU required or       │ CPU usually sufficient        │
│                     │ API dependency        │                               │
│                                                                             │
│  Flexibility        │ Handles novel tasks   │ Limited to trained tasks      │
│                                                                             │
│  Failure modes      │ Subtle, hard to       │ Clear, diagnosable            │
│                     │ debug                 │                               │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  THE BOTTOM LINE:                                                           │
│                                                                             │
│  LLMs excel at:                                                             │
│  • Unstructured text generation                                             │
│  • Novel tasks without training data                                        │
│  • Complex reasoning over text                                              │
│  • Flexible, conversational interfaces                                      │
│                                                                             │
│  Traditional ML excels at:                                                  │
│  • Structured prediction (classification, regression)                       │
│  • High-throughput, low-latency requirements                                │
│  • Well-defined tasks with training data                                    │
│  • When interpretability matters                                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

When Traditional ML Beats LLMs

Tabular Data and Structured Prediction

For classification and regression on tabular data, traditional ML consistently outperforms LLMs:

The data: Customer features, transaction records, sensor readings, log events—structured data with clear features.

The task: Predict churn, detect fraud, forecast demand, classify tickets.

Why traditional ML wins:

  • Gradient boosted trees (XGBoost, LightGBM, CatBoost) are state-of-the-art for tabular data
  • Orders of magnitude faster and cheaper
  • Better accuracy with proper feature engineering
  • Interpretable feature importance

The evidence: Despite years of deep learning advances, gradient boosted trees still win Kaggle competitions on tabular data. LLMs processing tabular data as text lose the structural information that makes trees effective.

Recommendation: For tabular prediction tasks, start with XGBoost or LightGBM. Only consider LLMs if the task inherently requires natural language understanding (e.g., incorporating text fields).
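A minimal sketch of that recommendation, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM. The features and labels here are synthetic, invented purely for illustration:

```python
# Minimal sketch: churn prediction on tabular data with gradient boosted trees.
# GradientBoostingClassifier stands in for XGBoost/LightGBM; the features
# (tenure, spend, tickets) and the label rule are invented for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(1, 60, n),      # tenure_months
    rng.uniform(10, 200, n),     # monthly_spend
    rng.poisson(2, n),           # support_tickets
])
# Synthetic label: short tenure plus many tickets => churn
y = ((X[:, 0] < 12) & (X[:, 2] > 2)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
# Inference is a single pass over the trees: microseconds per row,
# no GPU, no API call, no tokens.
```

Swapping in XGBoost or LightGBM is a one-line change; the workflow (features in, single forward pass out) is identical.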

Real-Time Systems

Systems requiring sub-100ms latency shouldn't use LLMs in the critical path:

Fraud detection: Transactions must be approved/declined in milliseconds. LLMs are too slow.

Ad bidding: Real-time bidding happens in 10-100ms. No room for LLM inference.

Game AI: NPCs need to respond instantly. Frame-rate-dependent decisions can't wait for LLM calls.

Recommendation systems: Users expect immediate results. LLMs can enhance recommendations but shouldn't block the primary recommendation engine.

Trading systems: Microseconds matter. LLMs are roughly a million times too slow.

For these applications, use traditional ML models designed for low-latency inference. If you need LLM capabilities, use them asynchronously (pre-compute embeddings, batch process, etc.).

High-Volume, Low-Margin Operations

When you're processing millions of items with thin margins per item, LLM costs become prohibitive:

Email classification at scale: Processing 100M emails/day at $0.01/email = $1M/day. A trained classifier costs nearly nothing.

Log analysis: Billions of log lines per day. LLMs are economically impossible; rule-based systems and traditional anomaly detection work.

Content moderation at scale: Social media volumes require cheap, fast classifiers. LLMs can handle edge cases escalated from primary classifiers.

Recommendation systems: Computing recommendations for millions of users on every page load requires sub-millisecond models.

Rule of thumb: If your per-item profit margin is less than the LLM API cost, you can't afford to use LLMs for every item.

When Interpretability Matters

Regulated industries and high-stakes decisions often require explainable models:

Credit decisions: Regulations require explaining why an application was denied. "The LLM said so" isn't acceptable.

Medical diagnosis support: Clinicians need to understand why a model flagged something.

Legal applications: Decisions must be justifiable and auditable.

HR/hiring: Explaining hiring decisions is legally required in many jurisdictions.

Traditional ML models (linear models, decision trees, rule-based systems) provide clear explanations. LLMs are black boxes—we can ask them to explain, but their explanations may not reflect their actual decision process.

Well-Defined Tasks with Training Data

If your task is well-specified and you have training data, a task-specific model usually wins:

Sentiment analysis: Fine-tuned BERT or even simpler classifiers outperform prompted LLMs while being much faster and cheaper.

Named entity recognition: SpaCy or fine-tuned NER models are better than prompting GPT-4.

Text classification: A fine-tuned classifier beats prompting for most classification tasks where you have labeled data.

Translation: Dedicated translation models (NLLB, OPUS) often beat general LLMs on specific language pairs.

The pattern: LLMs are generalists; specialists beat generalists on specific tasks.

Case Studies: Wrong Tool, Right Tool

Let's examine real scenarios where choosing the right tool makes an orders-of-magnitude difference.

Case Study 1: Email Routing

The task: Route incoming support emails to the right department (billing, technical, sales, general).

The LLM approach: Send each email to GPT-4 with a prompt asking it to classify. Cost: ~$0.01 per email. Latency: 500ms-2s. Accuracy: ~95%.

The right approach: Fine-tune a DistilBERT classifier on 5,000 labeled historical emails. Cost: ~$0.0001 per email (100x cheaper). Latency: 10ms (50-200x faster). Accuracy: 97% (actually better with domain-specific training).

The math at scale:

  • 100,000 emails/day
  • LLM: $1,000/day, 500ms latency
  • Classifier: $10/day, 10ms latency

Over a year, the classifier saves roughly $361,000 and provides a better user experience.

When LLM makes sense: During initial development (before you have training data), or for the 2% of emails the classifier is uncertain about.
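The classifier side of this case study can be sketched with an even cheaper pipeline: TF-IDF plus logistic regression standing in for the fine-tuned DistilBERT. The training emails and department labels below are invented:

```python
# Sketch of the "right approach": a small trained classifier for email routing.
# TF-IDF + logistic regression is a stand-in for the fine-tuned DistilBERT in
# the case study; the emails and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "My invoice shows the wrong amount",
    "I was charged twice this month",
    "The app crashes when I log in",
    "Getting a 500 error from your API",
    "I'd like a quote for the enterprise plan",
    "Can we schedule a demo for my team?",
]
labels = ["billing", "billing", "technical", "technical", "sales", "sales"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(emails, labels)

prediction = router.predict(["I need help, your API keeps timing out"])[0]
print(prediction)
```

A real deployment would train on thousands of historical emails and escalate low-confidence predictions (via predict_proba) to an LLM or a human, as noted above.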

Case Study 2: Log Anomaly Detection

The task: Detect anomalous patterns in application logs (100 million log lines per day).

The LLM approach: Impossible. Even at $0.001 per log line, that's $100,000/day. And the latency would mean anomalies detected hours late.

The right approach:

  1. Parse logs with regex into structured data
  2. Statistical anomaly detection (isolation forests, z-scores)
  3. Rule-based alerts for known patterns
  4. Aggregate dashboards for human review

Cost: Effectively zero (runs on existing infrastructure). Latency: Real-time. Accuracy: Tuned to your actual anomaly patterns.

When LLM makes sense: Root cause analysis after an anomaly is detected. "Here are 50 suspicious log lines—what might have caused this?"
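Steps 1 and 2 above fit in a few lines. The log format, regex, and feature choice here are hypothetical:

```python
# Sketch of steps 1-2: parse log lines with regex into structured features,
# then flag outliers with an isolation forest. The log format and the
# status/latency fields are invented for illustration.
import re
import numpy as np
from sklearn.ensemble import IsolationForest

LOG_RE = re.compile(r"status=(\d+) latency_ms=(\d+)")

def parse(line):
    m = LOG_RE.search(line)
    return [int(m.group(1)), int(m.group(2))] if m else None

lines = [f"GET /api status=200 latency_ms={50 + i % 20}" for i in range(200)]
lines.append("GET /api status=500 latency_ms=9000")  # the anomaly

X = np.array([f for f in map(parse, lines) if f])
detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)  # -1 = anomaly, 1 = normal

print("anomalous rows:", np.where(flags == -1)[0])
```

This runs in milliseconds over the whole batch; per-line cost is effectively zero, which is what makes billions of lines per day tractable.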

Case Study 3: Product Search

The task: Search a catalog of 10 million products.

The LLM approach: Generate product descriptions with LLM, use embeddings for semantic search. Reasonable for the search itself, but...

The pitfall: Using an LLM at query time to rerank or filter results. At 1 million searches/day, even $0.001/search = $1,000/day just for search.

The right approach:

  1. Pre-compute embeddings for all products (one-time cost)
  2. Use vector database for fast similarity search (Pinecone, Qdrant, etc.)
  3. Combine with traditional filters (price, category, availability)
  4. Use lightweight reranker (cross-encoder) for top 100 results

When LLM makes sense: Query understanding ("blue dress for summer wedding" → structured query), or conversational commerce where natural interaction adds value.
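A toy sketch of steps 1-3: pre-computed embeddings, brute-force cosine similarity (which a vector database does efficiently at scale), and a traditional price filter. The catalog and its 3-dimensional "embeddings" are invented; real embeddings come from an embedding model:

```python
# Sketch of steps 1-3: pre-computed product embeddings, cosine similarity,
# and a traditional filter. No LLM call happens at query time.
import numpy as np

catalog = [
    {"name": "blue summer dress", "price": 79,  "vec": np.array([0.9, 0.1, 0.0])},
    {"name": "red winter coat",   "price": 199, "vec": np.array([0.1, 0.9, 0.0])},
    {"name": "blue silk gown",    "price": 450, "vec": np.array([0.8, 0.0, 0.2])},
]

def search(query_vec, max_price, top_k=2):
    # Traditional filter first, then vector similarity over the survivors.
    candidates = [p for p in catalog if p["price"] <= max_price]
    def cosine(p):
        return query_vec @ p["vec"] / (
            np.linalg.norm(query_vec) * np.linalg.norm(p["vec"]))
    return sorted(candidates, key=cosine, reverse=True)[:top_k]

results = search(np.array([1.0, 0.0, 0.1]), max_price=100)
print([p["name"] for p in results])
```

The cross-encoder reranker from step 4 would slot in after `search`, scoring only the short candidate list rather than the full catalog.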

Case Study 4: Data Extraction from Invoices

The task: Extract vendor name, amount, date, line items from invoices.

The LLM approach: Send invoice image to GPT-4V. Works well! But costs $0.01-0.10 per invoice.

The right approach (for standardized invoices):

  1. Template matching for known vendors
  2. OCR + regex for structured fields
  3. Traditional ML for field detection in new formats
  4. LLM fallback for unrecognized formats

Hybrid result: 80% of invoices processed for ~$0.001 each (known templates); 20% processed by the LLM at ~$0.05. Average cost: ~$0.01 (5-10x cheaper than LLM-only).

Key insight: Most real-world data has structure. Exploit that structure when it exists.
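Steps 1 and 2 for a known vendor might look like this. The invoice layout, vendor name, and field patterns are hypothetical; anything `extract` returns None for would escalate to the LLM fallback in step 4:

```python
# Sketch of steps 1-2: template matching + regex for a known vendor's invoices.
# The vendor name, layout, and field patterns are invented for illustration.
import re

TEMPLATES = {
    "ACME Corp": re.compile(
        r"Invoice Date:\s*(?P<date>\d{4}-\d{2}-\d{2}).*?"
        r"Total Due:\s*\$(?P<amount>[\d,]+\.\d{2})",
        re.DOTALL,
    ),
}

def extract(text):
    for vendor, pattern in TEMPLATES.items():
        if vendor in text:
            m = pattern.search(text)
            if m:
                return {"vendor": vendor, **m.groupdict()}
    return None  # unrecognized format -> escalate to the LLM fallback

invoice = """ACME Corp
Invoice Date: 2025-01-15
Items: 3 widgets
Total Due: $1,234.50"""

print(extract(invoice))
```

Each known template handled this way costs effectively nothing, which is where the 5-10x blended savings comes from.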

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CASE STUDY COST COMPARISON                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TASK                 │ VOLUME      │ LLM COST    │ RIGHT TOOL  │ SAVINGS   │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                             │
│  Email classification │ 100K/day    │ $1,000/day  │ $10/day     │ 99%       │
│                                                                             │
│  Log analysis         │ 100M/day    │ Impossible  │ ~$0/day     │ ∞         │
│                                                                             │
│  Product search       │ 1M/day      │ $1,000/day  │ $50/day     │ 95%       │
│                                                                             │
│  Invoice extraction   │ 10K/day     │ $500/day    │ $100/day    │ 80%       │
│                                                                             │
│  Sentiment analysis   │ 1M/day      │ $5,000/day  │ $20/day     │ 99.6%     │
│                                                                             │
│  Language detection   │ 10M/day     │ $10,000/day │ $1/day      │ 99.99%    │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  ANNUAL SAVINGS BY USING THE RIGHT TOOL:                                    │
│                                                                             │
│  These six examples alone: ~$6M in API costs avoided per year               │
│  Plus: Lower latency, higher reliability, better accuracy                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

When Simple Heuristics Beat LLMs

Sometimes you don't need ML at all:

Pattern Matching

Email parsing: Extract sender, subject, date with regex—100% accurate, instant, free.

Log parsing: Structured logs have known formats. Regex or parsing libraries are perfect.

Data validation: Phone numbers, emails, URLs, dates—regex validates perfectly.

Document structure: Extracting sections from documents with known templates doesn't need ML.
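A sketch of this kind of validation. The patterns below are deliberately simplified for illustration; production email or phone validation would use stricter rules or an existing library:

```python
# Sketch: deterministic validation with regex -- instant, free, and exact for
# well-specified formats. These patterns are simplified for illustration.
import re

VALIDATORS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def is_valid(kind, value):
    return bool(VALIDATORS[kind].fullmatch(value))

print(is_valid("email", "user@example.com"))  # format check only
print(is_valid("iso_date", "2025-13-99"))     # True: format, not range!
print(is_valid("us_phone", "555-123-4567"))
```

Note the second example: regex validates format, not semantics. That limitation is visible, documented, and testable, which is exactly what an LLM-based validator would not give you.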

Lookup Tables

Currency conversion: Multiply by exchange rate. No LLM needed.

Unit conversion: Simple arithmetic. Perfect accuracy.

Reference data: Looking up product info, user profiles, historical data—database queries, not LLMs.

Mapping codes: ICD codes to descriptions, country codes to names—lookup tables.

Rule-Based Systems

Business logic: Tax calculations, pricing rules, eligibility determination—rules are deterministic and auditable.

Workflow routing: If the sender is a VIP customer, route to the priority queue. Simple conditions.

Validation rules: Age must be > 0 and < 150. No ML required.

Alert thresholds: CPU > 90% for 5 minutes → alert. Thresholds, not neural networks.
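The CPU threshold rule above fits in a few lines; the class name and one-sample-per-minute window are illustrative choices:

```python
# Sketch of the threshold rule: alert when CPU > 90% for 5 consecutive
# minutes. A deque of recent samples is all the "model" this needs.
from collections import deque

class CpuAlert:
    def __init__(self, threshold=90.0, window_minutes=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window_minutes)  # one sample per minute

    def observe(self, cpu_percent):
        """Record a sample; return True if the alert should fire."""
        self.samples.append(cpu_percent)
        return (len(self.samples) == self.samples.maxlen
                and all(s > self.threshold for s in self.samples))

alert = CpuAlert()
readings = [85, 95, 96, 97, 98, 99]
print([alert.observe(r) for r in readings])  # fires only on the last reading
```

Deterministic, auditable, and trivially unit-testable: the properties that matter for alerting, and exactly what neural approaches trade away.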

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DECISION FRAMEWORK: WHAT TO USE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  START HERE: What is your task?                                             │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  Is the task well-defined with clear rules?                                 │
│      │                                                                      │
│      ├── YES → Can you write explicit rules?                                │
│      │         │                                                            │
│      │         ├── YES → Use RULES / REGEX / LOOKUP                         │
│      │         │         (Deterministic, fast, free, debuggable)            │
│      │         │                                                            │
│      │         └── NO → Do you have training data?                          │
│      │                  │                                                   │
│      │                  ├── YES → Use TRADITIONAL ML                        │
│      │                  │         (Classifiers, regression, trees)          │
│      │                  │                                                   │
│      │                  └── NO → Consider LLM with few-shot learning        │
│      │                                                                      │
│      └── NO (ambiguous, open-ended) → Continue below                        │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  Does the task require natural language generation?                         │
│      │                                                                      │
│      ├── YES → LLM is appropriate                                           │
│      │         (Writing, summarization, conversation, explanation)          │
│      │                                                                      │
│      └── NO → Does it require complex reasoning over text?                  │
│               │                                                             │
│               ├── YES → LLM is appropriate                                  │
│               │         (Analysis, comparison, synthesis)                   │
│               │                                                             │
│               └── NO → Does it require flexibility for novel inputs?        │
│                        │                                                    │
│                        ├── YES → LLM may be appropriate                     │
│                        │                                                    │
│                        └── NO → Reconsider simpler approaches               │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  CONSTRAINT CHECK (even if LLM seems appropriate):                          │
│                                                                             │
│  □ Latency requirement < 100ms? → Consider traditional ML or hybrid        │
│  □ Volume > 1M/day with thin margins? → Consider traditional ML            │
│  □ Must be deterministic? → Avoid LLMs or add verification layer           │
│  □ Must be interpretable? → Use traditional ML or hybrid                   │
│  □ Factual accuracy critical? → Add retrieval, verification, guardrails    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

When LLMs ARE the Right Choice

To be clear, LLMs are transformative for the right problems:

Open-Ended Text Generation

Writing, summarization, explanation, creative content—tasks where the output is natural language and the space of correct answers is large. This is what LLMs are designed for.

Tasks Without Training Data

When you can't collect labeled data (new task, cold start, rare events), LLMs can work with just a description and examples. Few-shot and zero-shot learning are genuine capabilities.

Complex Reasoning Over Text

Analyzing documents, synthesizing information, comparing options, understanding nuance—tasks requiring comprehension and reasoning that rule-based systems can't handle.

Flexible Interfaces

Chatbots, assistants, natural language interfaces—anywhere users expect to communicate in natural language and receive natural language back.

Rapid Prototyping

When you need something working today to validate an idea, LLMs let you skip the data collection and model training phases. Validate first, optimize later.

Hybrid Approaches

The best systems often combine approaches:

LLM for Edge Cases, Traditional ML for Volume

Pattern: Traditional ML handles 90%+ of traffic cheaply and quickly. LLM handles the long tail of difficult cases.

Example: Content moderation. Fast classifier catches obvious violations. Borderline cases escalate to LLM (or human) review.
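The escalation pattern can be sketched with stubs; `cheap_classifier` and `llm_review` below are placeholders for a real trained model (returning predict_proba-style confidences) and a real LLM or human review:

```python
# Sketch of the escalation pattern: a cheap classifier handles confident
# cases; low-confidence items go to a slower, costlier reviewer.
# Both functions are stubs standing in for real models.
def cheap_classifier(text):
    # Stand-in returning (label, confidence); a real system would use a
    # trained model's probability output here.
    if "refund scam" in text:
        return "violation", 0.98
    return "ok", 0.55  # uncertain

def llm_review(text):
    # Stand-in for an expensive LLM (or human) review of the hard cases.
    return "ok"

def moderate(text, confidence_threshold=0.9):
    label, confidence = cheap_classifier(text)
    if confidence >= confidence_threshold:
        return label, "classifier"
    return llm_review(text), "escalated"

print(moderate("give me a refund scam"))   # handled by the classifier
print(moderate("some ambiguous message"))  # escalated
```

Tuning `confidence_threshold` directly trades cost against quality: raising it sends more traffic to the expensive path.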

LLM for Understanding, Traditional Systems for Action

Pattern: LLM interprets intent; deterministic systems execute.

Example: "Transfer $500 to John" → LLM extracts intent (transfer, amount, recipient) → banking system executes with proper validation.

LLM for Generation, Verification for Accuracy

Pattern: LLM generates; another system verifies.

Example: LLM generates SQL query → query parser validates syntax → execution engine verifies safety → database runs query.

Retrieval + LLM (RAG)

Pattern: Traditional retrieval finds relevant information; LLM synthesizes.

This is RAG—combining the factual grounding of retrieval with the synthesis capabilities of LLMs.

The Decision Framework

When evaluating whether to use an LLM, ask:

1. What's the simplest solution that could work?

Start simple. Regex, rules, lookup tables, existing libraries. Only add complexity if simple solutions fail.

2. Do I have training data?

If yes, and the task is well-defined, a fine-tuned task-specific model likely beats prompting a general LLM.

3. What are my latency requirements?

Sub-100ms: Avoid LLMs in the critical path.

100ms-1s: LLMs are possible, but consider the impact.

1s+: LLMs are fine latency-wise.

4. What's my cost budget per operation?

If processing millions of items at thin margins, LLM costs may be prohibitive.

5. How important is determinism?

If the same input must always produce the same output, LLMs are problematic.

6. How important is interpretability?

If you need to explain decisions, prefer interpretable models.

7. What's the cost of errors?

High-stakes: Add verification, guardrails, and human review regardless of approach.

Low-stakes: LLM errors may be acceptable.

8. Is this a text generation task?

If the output is natural language, LLMs are likely appropriate. If the output is structured (numbers, categories, actions), consider alternatives.

Technical Deep Dive: Why Specialized Models Win

Understanding why specialized models outperform LLMs on specific tasks helps make better decisions.

Information Density and Tokenization

LLMs process text through tokenization, which introduces overhead for structured data:

Example: Processing a transaction record

Structured data: {user_id: 12345, amount: 99.99, timestamp: 1704067200}

As LLM input: "The user with ID 12345 made a purchase of $99.99 at timestamp 1704067200" → ~30 tokens

A gradient boosted tree sees: [12345, 99.99, 1704067200] → 3 features

The LLM must:

  • Parse the text structure
  • Extract numerical values
  • Learn that these are features
  • Learn their relationships

The tree directly operates on features, with explicit splits learned from data. It's fundamentally more efficient for tabular prediction.

Latency Breakdown

LLM inference has multiple latency components:

Time to first token (TTFT): 100-500ms

  • Tokenization
  • KV cache computation (if long context)
  • First forward pass

Time per output token (TPOT): 10-50ms each

  • Autoregressive generation
  • KV cache updates
  • Sampling

Total latency for 100-token response: 1-5 seconds

Compare to traditional ML:

Inference latency: 0.1-10ms total

  • No tokenization (direct feature vectors)
  • Single forward pass
  • No autoregressive generation

This 100-1000x difference is fundamental to the architectures, not just current implementations.
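The arithmetic behind these numbers as a back-of-the-envelope helper, using midpoints of the ranges quoted above:

```python
# Back-of-the-envelope latency: total LLM latency is time-to-first-token
# plus one time-per-output-token per generated token, versus a single
# forward pass for a traditional model.
def llm_latency_ms(ttft_ms, tpot_ms, output_tokens):
    return ttft_ms + tpot_ms * output_tokens

# 100-token response at midpoint values from the ranges above
total = llm_latency_ms(ttft_ms=300, tpot_ms=30, output_tokens=100)
print(f"LLM: {total} ms")       # 3300 ms, squarely in the quoted 1-5s range
print("Traditional ML: ~1 ms")  # single forward pass, no generation loop
```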

Parameter Efficiency

LLMs use parameters inefficiently for narrow tasks:

GPT-4: ~1.7 trillion parameters (estimated)

  • Knows about: everything (history, science, code, languages, etc.)
  • Knows about your task: a tiny fraction

Task-specific classifier: 10-100 million parameters

  • Knows about: your specific task
  • Parameter efficiency: 10,000x better for your task

Those extra LLM parameters aren't free—they add latency, cost, and potential for irrelevant knowledge to interfere.

Calibration and Uncertainty

LLMs are notoriously poorly calibrated—their confidence doesn't match their accuracy.

Traditional ML advantages:

  • Probabilistic outputs (0.73 confidence means ~73% accuracy)
  • Well-understood uncertainty quantification
  • Can be calibrated post-hoc

LLM challenges:

  • Confidence often doesn't correlate with correctness
  • "Hallucinations" occur with high apparent confidence
  • Temperature controls randomness, not true uncertainty

For decisions requiring reliable confidence estimates (medical triage, fraud risk scoring), traditional ML is far more trustworthy.
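A sketch of post-hoc calibration with scikit-learn. The data is synthetic, and the bucket check at the end is only a rough sanity read, not a proper calibration curve:

```python
# Sketch: post-hoc calibration of a traditional classifier so its
# probabilities can be trusted for risk scoring. Data is synthetic.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(), method="isotonic", cv=3)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)[:, 1]
# Calibrated means a 0.8 should be right about 80% of the time.
bucket = (proba > 0.7) & (proba < 0.9)
if bucket.any():
    print(f"empirical accuracy in 0.7-0.9 bucket: {y_test[bucket].mean():.2f}")
```

Nothing comparable exists for LLM token probabilities out of the box; getting trustworthy confidence from an LLM is an open research problem.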

┌─────────────────────────────────────────────────────────────────────────────┐
│                    WHY SPECIALIZED MODELS WIN: TECHNICAL SUMMARY            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  DIMENSION           │ LLM                      │ SPECIALIZED MODEL         │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                             │
│  Input processing    │ Text tokenization        │ Direct feature vectors    │
│                      │ 30+ tokens/record        │ N features (raw)          │
│                                                                             │
│  Parameters used     │ 1T+ (vast majority       │ 10-100M (all focused      │
│                      │ irrelevant to task)      │ on task)                  │
│                                                                             │
│  Inference steps     │ Autoregressive           │ Single forward pass       │
│                      │ (N tokens × forward)     │                           │
│                                                                             │
│  Latency floor       │ ~100ms (TTFT alone)      │ ~0.1ms                    │
│                                                                             │
│  Calibration         │ Poor (overconfident)     │ Good (can calibrate)      │
│                                                                             │
│  Failure mode        │ Plausible wrong answers  │ Obvious errors            │
│                                                                             │
│  Training signal     │ Generic next-token       │ Task-specific loss        │
│                                                                             │
│  Hardware needed     │ GPU (often multiple)     │ CPU often sufficient      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Anti-Patterns to Avoid

The "AI-Native" Trap

Building LLM-first without considering simpler alternatives. Often driven by marketing rather than engineering.

Sign: "We're using GPT-4 for everything."

Fix: Evaluate each component independently. Use LLMs where they add value.

Prompt Engineering When You Should Train

Spending weeks engineering prompts for a task where a simple fine-tuned classifier would work better.

Sign: Prompt is 2000 tokens of examples and instructions for a classification task.

Fix: If you have labeled data and a well-defined task, train a classifier.

LLM as Database

Using LLMs to retrieve facts they might hallucinate.

Sign: "The model knows the product catalog."

Fix: Use retrieval. LLMs synthesize; databases store.

Ignoring Failure Modes

Deploying LLMs without considering hallucinations, prompt injection, or edge cases.

Sign: No guardrails, no evaluation, no monitoring.

Fix: Implement proper evaluation, guardrails, and observability.

Premature Optimization (in either direction)

Either over-engineering with LLMs from day one, or refusing to use them when they're the right tool.

Fix: Match tool to problem. Be willing to change as requirements become clearer.

The Pragmatic Path Forward

The best engineers use the right tool for the job. In 2025, that means:

Defaulting to simplicity: Start with the simplest solution. Add complexity only when needed.

Knowing your options: Understand traditional ML, rule-based systems, AND LLMs well enough to choose appropriately.

Measuring what matters: Define success metrics before choosing an approach. Evaluate options against those metrics.

Staying flexible: Be willing to change approaches as you learn more about the problem.

Combining strengths: The best systems often combine multiple approaches—LLMs for flexibility, traditional ML for efficiency, rules for reliability.

LLMs are powerful tools. But so are hammers, and not everything is a nail.



Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
