
Time Series Forecasting with Foundation Models: From ARIMA to Chronos

A comprehensive guide to modern time series forecasting—from classical statistical methods to transformer-based architectures, state space models, diffusion models, and zero-shot foundation models like Chronos, TimesFM, Moirai, and TiRex.

19 min read

The Revolution in Time Series Forecasting

Time series forecasting has undergone a transformation as dramatic as the one that reshaped natural language processing. For decades, statistical methods like ARIMA dominated. Then deep learning approaches—LSTMs, GRUs, and neural basis expansion—showed promise for capturing complex temporal patterns. Transformers brought attention mechanisms, state space models offered linear complexity alternatives, and diffusion models introduced probabilistic generation. Now, we're witnessing foundation models for time series: large pretrained models that forecast any time series with zero-shot learning.

This shift matters because time series forecasting is everywhere: demand planning, financial markets, energy load prediction, traffic optimization, healthcare monitoring, climate modeling. The new generation of models delivers better accuracy, often without any training on your data.

This guide covers the complete landscape: classical methods, deep learning architectures (N-BEATS, LSTMs), transformers (Informer, Autoformer, iTransformer, TimeXer), state space models (Mamba, xLSTM/TiRex), diffusion models (TimeGrad, CSDI), novel architectures (KAN, Neural ODEs, Graph NNs), self-supervised learning (TS2Vec), and foundation models (Chronos, TimesFM, Moirai, TiRex, Toto).

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              EVOLUTION OF TIME SERIES FORECASTING                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1970s-2010s: STATISTICAL METHODS                                        │
│  ARIMA, Exponential Smoothing, Prophet                                   │
│  ✓ Interpretable    ✓ Fast    ✗ Limited patterns                        │
│                                                                          │
│  2015-2019: EARLY DEEP LEARNING                                          │
│  LSTM, GRU, TCN, DeepAR                                                  │
│  ✓ Complex patterns    ✓ Multivariate    ✗ Needs lots of data           │
│                                                                          │
│  2019-2021: NEURAL BASIS EXPANSION                                       │
│  N-BEATS, N-HiTS, NBEATSx                                                │
│  ✓ Interpretable    ✓ No hand-crafted features    ✓ Fast                │
│                                                                          │
│  2020-2023: TRANSFORMERS                                                 │
│  Informer, Autoformer, TFT, PatchTST, iTransformer, TimeXer              │
│  ✓ Long-range dependencies    ✓ Attention    ✗ Quadratic complexity     │
│                                                                          │
│  2023-2024: STATE SPACE & DIFFUSION                                      │
│  Mamba, MambaTS, S-Mamba, TimeGrad, CSDI                                 │
│  ✓ Linear complexity    ✓ Probabilistic    ✓ Efficient                  │
│                                                                          │
│  2024-NOW: FOUNDATION MODELS                                             │
│  Chronos, TimesFM, Moirai, TiRex, Toto, Timer-XL                         │
│  ✓ Zero-shot    ✓ Universal    ✓ Pretrained on billions of points       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Classical Statistical Methods

Before diving into neural approaches, understanding classical methods is essential. They remain the right choice for many problems.

ARIMA: The Statistical Workhorse

ARIMA (AutoRegressive Integrated Moving Average) models time series as a combination of three components:

  • AR (AutoRegressive): Current value depends on previous values
  • I (Integrated): Differencing to make the series stationary
  • MA (Moving Average): Current value depends on previous forecast errors

Why ARIMA still matters: Excels at short-term forecasting with stationary data. Fast, interpretable, requires no training data beyond the series itself. For univariate forecasting with clear autocorrelation, ARIMA often matches neural networks—especially on small datasets.

Limitations: Non-linear relationships, multiple seasonalities, external variables, long-term dependencies.
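
As a quick reference, here is a minimal fit-and-forecast sketch with statsmodels; the (2, 1, 1) order is illustrative and would normally be chosen via ACF/PACF analysis or auto-ARIMA:

Python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative univariate series (e.g., daily demand)
series = np.cumsum(np.random.randn(200)) + 100

# order=(p, d, q): 2 AR lags, 1 difference, 1 MA term (illustrative choice)
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()

forecast = fitted.forecast(steps=12)                            # Point forecasts, 12 steps ahead
intervals = fitted.get_forecast(steps=12).conf_int(alpha=0.1)   # 90% prediction intervals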

Exponential Smoothing (ETS)

ETS decomposes time series into Error, Trend, and Seasonality. The family includes:

  • Simple Exponential Smoothing: No trend or seasonality
  • Holt's Linear: Adds trend
  • Holt-Winters: Adds seasonality (additive or multiplicative)
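
A minimal Holt-Winters sketch with statsmodels, assuming additive trend and seasonality with a period of 12:

Python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative series with trend + seasonality (period 12)
t = np.arange(120)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.randn(120)

model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12)
fitted = model.fit()
forecast = fitted.forecast(12)   # 12 steps ahead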

Prophet: Business Forecasting

Prophet (Meta) models time series as: y(t) = trend(t) + seasonality(t) + holidays(t) + error(t)

Strengths: Handles missing data and outliers, automatic changepoint detection, easy domain knowledge incorporation.

Limitations: Research shows Prophet underperforms ARIMA and neural networks on accuracy. Value lies in ease of use, not peak performance.
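
A minimal Prophet sketch on synthetic daily data (Prophet expects a DataFrame with ds and y columns):

Python
import numpy as np
import pandas as pd
from prophet import Prophet

# Prophet expects columns named 'ds' (timestamp) and 'y' (value)
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=365, freq="D"),
    "y": np.random.randn(365).cumsum() + 100,
})

m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=30)   # Extend 30 days beyond training data
forecast = m.predict(future)                   # Columns include yhat, yhat_lower, yhat_upper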


Deep Learning Architectures

Recurrent Networks: LSTM and GRU

Long Short-Term Memory (LSTM) networks solve vanilla RNNs' vanishing gradient problem with gates:

  • Forget gate: What to discard from memory
  • Input gate: What new information to store
  • Output gate: What to output

Gated Recurrent Units (GRUs) simplify to two gates (update, reset) with comparable performance.

Why LSTMs work: Learn long-range dependencies, handle variable-length sequences, incorporate multiple features.

Limitations: Sequential training (no parallelization), hyperparameter tuning difficulty, substantial data requirements.
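
A minimal PyTorch sketch of a direct multi-step LSTM forecaster (window construction and the training loop are omitted):

Python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, horizon=24):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, horizon)   # Direct multi-step output

    def forward(self, x):                    # x: (batch, lookback, input_size)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])      # Last hidden state -> (batch, horizon)

model = LSTMForecaster()
x = torch.randn(32, 168, 1)                  # 32 windows, 168-step lookback, 1 feature
y_hat = model(x)                             # (32, 24) forecasts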

Temporal Convolutional Networks (TCN)

TCNs apply 1D dilated convolutions across time:

Code
Dilation pattern:    1, 2, 4, 8, 16...
Receptive field grows exponentially while parameters grow linearly

TCN advantages: Parallelizable, flexible receptive field, faster training, stable gradients.
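
A sketch of stacked dilated causal convolutions in PyTorch; the left-only padding keeps each output from seeing future values:

Python
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # Left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))          # Pad the past, never the future
        return torch.relu(self.conv(x))

# Dilations 1, 2, 4, 8: receptive field grows exponentially with depth
tcn = nn.Sequential(*[CausalConvBlock(16, dilation=d) for d in (1, 2, 4, 8)])
out = tcn(torch.randn(8, 16, 256))                       # (8, 16, 256)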

DeepAR: Probabilistic Forecasting at Scale

Amazon's DeepAR trains a single LSTM across thousands of related series:

  • Probabilistic outputs: Predicts distributions, not point estimates
  • Global model: Learns cross-series patterns
  • Autoregressive generation: Uses sampled predictions for subsequent steps

N-BEATS and N-HiTS: Neural Basis Expansion

N-BEATS Architecture

N-BEATS (Neural Basis Expansion Analysis for Time Series), developed by ElementAI/ServiceNow, revolutionized interpretable deep forecasting. The key insight: use neural basis expansion to decompose forecasts.

How it works:

  1. Blocks process the input through fully-connected layers
  2. Each block outputs expansion coefficients (θ) for both backward (backcast) and forward (forecast)
  3. Coefficients are projected through basis functions to produce predictions
  4. Doubly residual stacking: Blocks are organized into stacks, with residual connections both within and across stacks
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    N-BEATS ARCHITECTURE                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Input: [x₁, x₂, ..., xₜ] (lookback window)                             │
│              │                                                           │
│              ▼                                                           │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  STACK 1: Trend                                                 │    │
│  │  ┌─────────┐   ┌─────────┐   ┌─────────┐                       │    │
│  │  │ Block 1 │ → │ Block 2 │ → │ Block 3 │                       │    │
│  │  └────┬────┘   └────┬────┘   └────┬────┘                       │    │
│  │       │             │             │                             │    │
│  │       ▼             ▼             ▼                             │    │
│  │  [θ_trend] → Polynomial Basis → Trend Forecast                 │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│              │ (residual)                                                │
│              ▼                                                           │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  STACK 2: Seasonality                                           │    │
│  │  ┌─────────┐   ┌─────────┐   ┌─────────┐                       │    │
│  │  │ Block 1 │ → │ Block 2 │ → │ Block 3 │                       │    │
│  │  └────┬────┘   └────┬────┘   └────┬────┘                       │    │
│  │       │             │             │                             │    │
│  │       ▼             ▼             ▼                             │    │
│  │  [θ_season] → Fourier Basis → Seasonality Forecast             │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│              │                                                           │
│              ▼                                                           │
│  Final Forecast = Σ(Trend + Seasonality forecasts from all blocks)      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Two configurations:

  1. Generic: No time-series-specific components, demonstrating that generic deep learning primitives alone can solve forecasting
  2. Interpretable: Uses trend (polynomial) and seasonality (Fourier) basis functions

Performance: Improved forecast accuracy by 11% over statistical benchmarks, 3% over M4 competition winner.
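
A sketch of a single generic N-BEATS block in PyTorch; the full model stacks many such blocks with doubly residual connections:

Python
import torch
import torch.nn as nn

class GenericNBeatsBlock(nn.Module):
    def __init__(self, lookback, horizon, width=256, theta_dim=32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(lookback, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.theta = nn.Linear(width, theta_dim)
        # Generic configuration: learned (not fixed) basis for backcast and forecast
        self.backcast_basis = nn.Linear(theta_dim, lookback)
        self.forecast_basis = nn.Linear(theta_dim, horizon)

    def forward(self, x):                     # x: (batch, lookback)
        theta = self.theta(self.fc(x))        # Expansion coefficients
        return self.backcast_basis(theta), self.forecast_basis(theta)

block = GenericNBeatsBlock(lookback=96, horizon=24)
backcast, forecast = block(torch.randn(8, 96))
# In the full model: residual = x - backcast feeds the next block,
# and forecasts from all blocks are summed.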

N-HiTS: Hierarchical Interpolation

N-HiTS extends N-BEATS for long-horizon forecasting with:

  • Hierarchical interpolation: Different blocks specialize in different frequency bands
  • Multi-rate data sampling: Blocks see the series at different temporal resolutions

Result: Outperforms Transformer-based models by 25%+ on benchmarks.

NBEATSx: Adding Exogenous Variables

NBEATSx extends N-BEATS to incorporate external features, improving accuracy by ~20% over vanilla N-BEATS.


Transformer Architectures for Time Series

Why Vanilla Transformers Struggle

Self-attention has O(n²) complexity—prohibitive for long sequences. Additional issues:

  • Permutation invariance: Transformers don't inherently understand order
  • Point-wise attention: May miss local patterns spanning multiple points

Informer: Efficient Long-Sequence Forecasting

Informer (2020) introduced ProbSparse self-attention:

  • Identifies important queries (those attending to many keys)
  • Only computes attention for those queries
  • Reduces complexity from O(n²) to O(n log n)

Additional innovations:

  • Self-attention distilling: Progressively halves sequence length between layers
  • Generative decoder: Predicts entire output in one pass, avoiding error accumulation

Autoformer: Decomposition Meets Attention

Autoformer (2021) exploits time series structure:

Series decomposition blocks: After each layer, decompose into trend and seasonal components using moving averages.

Auto-correlation attention: Instead of point-wise attention, compute attention based on series periodicity—find similar sub-series and aggregate them.

Results: 10-12% improvement over Informer on periodic data.

Temporal Fusion Transformer (TFT)

TFT (Google) prioritizes interpretability:

  • Variable selection networks: Learn which features matter for each prediction
  • Gating mechanisms: Control information flow
  • Multi-horizon attention: Different attention patterns for different forecast horizons
  • Quantile outputs: Prediction intervals, not just point forecasts

PatchTST: The Patching Breakthrough

PatchTST (2023) introduced patching—grouping time points into chunks:

Why patching works:

  • Local semantics: A patch captures local patterns as a unit
  • Sequence reduction: 512 points → 32 patches
  • Better representations: Encode local patterns into embeddings

iTransformer: Inverted Attention (ICLR 2024 Spotlight)

iTransformer made a simple but powerful change: swap the modeling axes.

Traditional transformers: Each time step is a token, attention captures temporal dependencies.

iTransformer: Each variate (channel) is embedded as a token. Attention models cross-variate correlations, while FFN learns temporal representations per variate.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    iTransformer INVERSION                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRADITIONAL TRANSFORMER (time as tokens):                               │
│  ──────────────────────────────────────────                              │
│  Tokens:    [t₁] [t₂] [t₃] ... [tₙ]                                     │
│              ↑    ↑    ↑        ↑                                        │
│  Each token = one timestamp, all variates                                │
│  Attention learns: temporal dependencies                                 │
│                                                                          │
│  ──────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  iTRANSFORMER (variates as tokens):                                     │
│  ──────────────────────────────────                                      │
│  Tokens:    [var₁] [var₂] [var₃] ... [varₘ]                             │
│               ↑      ↑      ↑          ↑                                 │
│  Each token = entire history of one variate                              │
│  Attention learns: cross-variate correlations                            │
│  FFN learns: temporal patterns per variate                               │
│                                                                          │
│  RESULT: Stronger accuracy, better scaling, no positional encoding       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key insight: This simple axis swap yields stronger accuracy without new modules.
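
A sketch of the inversion in PyTorch: each variate's full history is embedded as one token, and a standard encoder layer then attends across variates (simplified, not the official implementation):

Python
import torch
import torch.nn as nn

class InvertedEmbedding(nn.Module):
    """Embed each variate's entire lookback window as a single token."""
    def __init__(self, lookback, d_model=128):
        super().__init__()
        self.proj = nn.Linear(lookback, d_model)

    def forward(self, x):                  # x: (batch, lookback, n_variates)
        tokens = x.permute(0, 2, 1)        # (batch, n_variates, lookback)
        return self.proj(tokens)           # (batch, n_variates, d_model)

batch, lookback, n_variates = 8, 96, 7
embed = InvertedEmbedding(lookback)
encoder = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
head = nn.Linear(128, 24)                  # Project each variate token to the horizon

tokens = embed(torch.randn(batch, lookback, n_variates))
mixed = encoder(tokens)                    # Attention runs across variate tokens
forecast = head(mixed)                     # (batch, n_variates, horizon)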

TimeXer: Exogenous Variables (NeurIPS 2024)

TimeXer handles both endogenous and exogenous features:

  • Patch-wise self-attention: For endogenous series
  • Variate-wise cross-attention: To incorporate exogenous information
  • Global endogenous tokens: Bridge causal information from exogenous series

State-of-the-art on twelve benchmarks for forecasting with external variables.

Other Notable Transformers

  • FEDformer: Frequency-enhanced decomposed transformer using Fourier/wavelet transforms
  • ETSformer: Exponential smoothing attention with level, growth, and seasonal components
  • Crossformer: Two-stage attention (cross-time, cross-dimension) with segment embedding

MLP-Based Models (Efficient Alternatives)

Recent research shows simple MLPs can match or beat transformers:

TSMixer: MLP mixing in time and channel domains directly—surprisingly competitive.

WPMixer (AAAI 2025): Wavelet Patch Mixer combines:

  • Multi-resolution wavelet decomposition (time + frequency domains)
  • Patching for extended lookback and local information
  • MLP mixing for global information
  • 10x more efficient than TimeMixer, lower variance
  • Outperforms transformer-based models for long-term forecasting

DLinear: Simple linear layers beat many transformers—challenged the field's assumptions.


State Space Models: Linear Complexity Alternatives

Mamba for Time Series

Mamba, based on selective state space models (SSMs), offers transformer-competitive performance with linear complexity. For time series, this enables efficient processing of very long sequences.

MambaTS: Temporal Mamba Blocks

MambaTS addresses Mamba's limitations for long-term forecasting:

  • Variable scan along time: Arranges historical information of all variables together
  • Temporal Mamba Block (TMB): Specialized architecture for time series

Result: State-of-the-art on eight public datasets.

S-Mamba (Simple-Mamba)

S-Mamba takes a simpler approach:

  1. Tokenize time points per variate via linear layer
  2. Bidirectional Mamba layer extracts inter-variate correlations
  3. FFN learns temporal dependencies

Low computational overhead with leading performance on thirteen datasets.

Mamba4Cast: Zero-Shot with Mamba

Mamba4Cast is a zero-shot foundation model using Mamba:

  • Inspired by Prior-data Fitted Networks (PFNs)
  • Trained solely on synthetic data
  • Generates forecasts for entire horizons in one pass
  • Much lower inference time than transformers

SOR-Mamba: Order-Robust

Channels in a time series have no canonical order, so scanning them sequentially introduces an order bias. SOR-Mamba adds a regularization term that minimizes the discrepancy between embeddings computed from reversed channel orders.

TiRex: xLSTM Foundation Model (NeurIPS 2025)

TiRex demonstrates that LSTMs are back. Based on xLSTM (extended LSTM), TiRex is a 35M parameter foundation model that:

  • Sets state-of-the-art on GiftEval and Chronos-ZS benchmarks
  • Outperforms much larger models (Chronos-Bolt 200M, TimesFM 500M, Toto 151M)
  • Provides both point and quantile predictions

Key innovation: State-tracking capability critical for long-horizon forecasting. Unlike transformers or Mamba, TiRex retains explicit state tracking via sLSTM modules.

CPM (Causal Prediction Masking): Training-time masking strategy that enhances state-tracking ability.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              TiRex vs TRANSFORMER vs MAMBA                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Architecture      State Tracking    Complexity    Zero-Shot Rank       │
│  ────────────────────────────────────────────────────────────────────── │
│  Transformer       ✗ (attention)     O(n²)         Good                 │
│  Mamba             ✗ (selective)     O(n)          Good                 │
│  TiRex (xLSTM)     ✓ (explicit)      O(n)          BEST (NeurIPS 2025)  │
│                                                                          │
│  Key insight: State-tracking matters for forecasting                     │
│  TiRex's sLSTM maintains explicit state across sequence                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Diffusion Models for Time Series

Diffusion models, successful in image generation, have been adapted for probabilistic time series forecasting.

TimeGrad: Autoregressive Diffusion

TimeGrad (first diffusion model for time series) uses:

  • Conditional diffusion: Denoising guided by RNN hidden state
  • Autoregressive generation: Process sequence with recurrent cell, maintain hidden state

Output: Highly diverse probabilistic forecasts with uncertainty quantification.

CSDI: Non-Autoregressive Diffusion

CSDI (Conditional Score-based Diffusion for Imputation) uses:

  • Self-supervised strategy: Input masking for training
  • Stacked attention: Temporal and feature-wise attention for conditioning
  • Non-autoregressive: Faster predictions than TimeGrad

Diffusion Model Improvements (2024-2025)

Known limitations of TimeGrad/CSDI:

  • Optimize likelihood only—generate diverse but poorly aligned samples
  • Training instability
  • Boundary disharmony problems

Recent advances:

  • MG-TSD: Multi-scale generation—predict main components then details
  • mr-diff: Separately predict trend and seasonal components
  • CCDM, TimeDiff, S²DBM, SimDiff: Achieve best/second-best on benchmarks
  • LDM4TS: Translate time series to visual encodings, denoise in image-latent space
  • MCD-TSF: Multimodal (text, timestamp) cross-attention with classifier-free guidance
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DIFFUSION MODEL PIPELINE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRAINING (Forward Process):                                             │
│  ───────────────────────────                                             │
│  Clean forecast → Add noise (T steps) → Pure noise                       │
│  x₀ ────────────────────────────────────────────────→ xₜ                │
│                                                                          │
│  INFERENCE (Reverse Process):                                            │
│  ────────────────────────────                                            │
│  Pure noise → Denoise (T steps) → Clean forecast                         │
│  xₜ ────────────────────────────────────────────────→ x₀                │
│       ↑                                                                  │
│       │ Conditioned on historical observations                           │
│       │ (via attention or RNN hidden state)                              │
│                                                                          │
│  KEY BENEFIT: Probabilistic forecasts with uncertainty                   │
│  Sample multiple trajectories → get prediction intervals                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
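
A toy DDPM-style training step illustrating the shared recipe (not TimeGrad or CSDI specifically): noise the future window, then train a conditional network to predict that noise. Here `denoiser` is a placeholder for any suitable network:

Python
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)                  # Noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # Cumulative signal retention

def diffusion_training_step(denoiser, history, future):
    """One training step: noise the future window, predict the noise given the history."""
    t = torch.randint(0, T, (future.shape[0],))        # Random diffusion step per sample
    noise = torch.randn_like(future)
    a = alpha_bar[t].unsqueeze(-1)                      # (batch, 1)
    noisy_future = a.sqrt() * future + (1 - a).sqrt() * noise
    pred_noise = denoiser(noisy_future, history, t)     # Conditioned on the past
    return nn.functional.mse_loss(pred_noise, noise)

# At inference: start from pure noise, iteratively denoise conditioned on history,
# and sample multiple trajectories to obtain prediction intervals.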

Novel Architectures

Kolmogorov-Arnold Networks (KAN)

KAN, proposed by MIT researchers in 2024, is an architecture based on the Kolmogorov-Arnold representation theorem. Instead of fixed activation functions, KAN uses spline-parametrized learnable activation functions.

For time series:

  • T-KAN: Temporal KAN for univariate forecasting
  • MT-KAN: Multivariate Time Series KAN
  • TKAT: Temporal Kolmogorov-Arnold Transformer (combines KAN with attention)

Performance: Some studies report KAN-based models achieving up to 98% lower MSE than transformer baselines on selected benchmarks, alongside strong interpretability.

Key advantages:

  • Interpretability: Learnable activations reveal feature relationships
  • Efficiency: Fewer parameters for equivalent performance
  • Adaptivity: Dynamic learning of activation patterns

Limitations: ~10x slower training than MLPs (diverse activations can't fully leverage batching).

Neural ODEs and Neural CDEs

Neural ODEs model continuous-time dynamics:

  • Learn differential equation governing latent state
  • Natural handling of irregular time steps
  • Continuous trajectory between observations

Neural CDEs (Controlled Differential Equations) extend for time series:

  • Continuous analogue of RNNs
  • Handle irregularly sampled and partially observed data
  • State-of-the-art on irregular time series

Recent extensions:

  • ANCDE: Attentive Neural CDE with attention for dynamic path construction
  • DualDynamics (2025): Combines explicit (Neural ODE) and implicit updates
  • STG-NCDE: Extends to multivariate with graph convolution

Graph Neural Networks for Spatial-Temporal Forecasting

When time series have underlying graph structure (traffic networks, sensor grids), Graph Neural Networks capture spatial dependencies:

Key architectures:

  • STGCN: Graph convolutions + temporal convolutions
  • DCRNN: Diffusion convolutions + GRU
  • Graph WaveNet: Adaptive graph learning + dilated causal convolutions
  • MTGNN: Multivariate Time Series GNN
  • STG-NCDE: Graph + Neural CDE for irregular data

Frequency Domain Methods

FourierGNN: Rethinks forecasting from a pure graph perspective:

  • Hypervariate graph: Each series value (any variate, any timestamp) is a node
  • Fourier Graph Operator: Matrix multiplication in Fourier space
  • Much lower complexity with adequate expressiveness

FreTS: Frequency-domain MLPs:

  • Transform time signal to frequency domain using DFT
  • Apply MLP in frequency domain
  • Capture global patterns more effectively
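
A simplified sketch of the frequency-domain MLP idea: FFT the lookback window, run an MLP over the real and imaginary parts, and inverse-FFT back (FreTS itself adds channel-wise and more structured mixing):

Python
import torch
import torch.nn as nn

class FrequencyMLP(nn.Module):
    def __init__(self, lookback, hidden=128):
        super().__init__()
        self.n_freq = lookback // 2 + 1                # rfft output length
        self.mlp = nn.Sequential(
            nn.Linear(2 * self.n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * self.n_freq),
        )

    def forward(self, x):                              # x: (batch, lookback)
        spec = torch.fft.rfft(x, dim=-1)               # Complex spectrum
        feats = torch.cat([spec.real, spec.imag], dim=-1)
        out = self.mlp(feats)                          # Mix in the frequency domain
        real, imag = out[..., :self.n_freq], out[..., self.n_freq:]
        return torch.fft.irfft(torch.complex(real, imag), n=x.shape[-1], dim=-1)

y = FrequencyMLP(lookback=96)(torch.randn(8, 96))      # (8, 96)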

Recent improvements:

  • FreDN (2025): Learnable frequency decomposition, 10-14% better than FreTS
  • FBM: Fourier Basis Mapping addressing time-dependency issues

Self-Supervised Learning for Time Series

Self-supervised pretraining learns representations without labels, then transfers to downstream tasks.

TS2Vec: Universal Representations

TS2Vec learns representations through hierarchical contrastive learning:

  • Creates augmented context views
  • Contrastive loss at multiple temporal scales
  • Timestamp-level representations (not just sequence-level)

Results: State-of-the-art on 125 UCR and 29 UEA datasets for classification, plus strong forecasting and anomaly detection.

TNC (Temporal Neighborhood Coding)

TNC exploits local temporal smoothness:

  • Positive pairs: Neighboring segments (stationary neighborhoods)
  • Negative pairs: Distant segments
  • Assumes local segments share generative process

TS2Vec-Ensemble (2025)

Standard contrastive learning prioritizes instance discrimination over capturing deterministic patterns (seasonality, trend). TS2Vec-Ensemble addresses this:

  • Fuses learned dynamics from TS2Vec encoder
  • With explicit engineered time features encoding periodic cycles
  • Significantly outperforms TS2Vec and state-of-the-art on ETT benchmarks

Other Approaches

  • TF-C: Time-frequency consistency learning
  • TS-TCC/CA-TCC: Temporal-contextual contrasting
  • Series2Vec (2024): Predicts similarity in temporal AND spectral domains
  • SoftCLT (ICLR 2024): Soft contrastive learning with augmentation strategies

Foundation Models for Time Series

The Foundation Model Paradigm

Train once on massive diverse data, apply anywhere:

  1. Pretrain on billions of time points across domains
  2. Zero-shot: Feed any series, get forecasts without training
  3. Fine-tune (optional) for improved domain accuracy

Amazon Chronos Family

Chronos (March 2024): Treats time series as language

  1. Scale and quantize values to 4,096 discrete tokens
  2. Train T5 encoder-decoder with cross-entropy loss
  3. Probabilistic forecasting via sampling multiple trajectories

Chronos-Bolt (November 2024): Massive efficiency gains

  • 250x faster, 20x less memory, 5% lower error
  • Patch-based: chunks of observations instead of individual points
  • Direct multi-step forecasting (no autoregressive bottleneck)

Chronos-2 (October 2025): Universal forecasting

  • 120M parameters, encoder-only
  • Univariate + Multivariate + Covariates in single architecture
  • Time and Group Attention: Alternates between temporal and cross-series attention

Google TimesFM

Decoder-only approach inspired by GPT:

  • Patching: 32 time points per token
  • Pretrained on 100B+ real-world points

Versions:

  • TimesFM 1.0 (200M): Up to 512 context
  • TimesFM 2.0 (500M): 2048 context, 25% better
  • TimesFM 2.5: Architecture optimizations

2025: Integrated into BigQuery/AlloyDB; training corpus expanded to 400B+ points; few-shot learning capabilities.

Salesforce Moirai

Universal forecaster handling any domain, variable lengths, and exogenous features.

Moirai-MoE (October 2024): First mixture-of-experts time series foundation model

  • 17% better than dense Moirai
  • Comparable accuracy with 65x fewer activated parameters

Moirai 2.0 (November 2025): Decoder-only

  • Quantile forecasting + multi-token prediction
  • 2x faster, 30x smaller than Moirai 1.0-Large

TiRex (NeurIPS 2025)

xLSTM-based foundation model:

  • Only 35M parameters
  • State-of-the-art on GiftEval, Chronos-ZS benchmarks
  • Outperforms Chronos-Bolt (200M), TimesFM (500M), Toto (151M)
  • State-tracking via sLSTM critical for long-horizon forecasting

Datadog Toto

Time Series Optimized Transformer for Observability:

  • Trained on ~1 trillion data points (largest dataset among published models)
  • 750 billion anonymous numerical metrics from Datadog platform
  • Optimized for observability (monitoring, alerting) use cases

Timer-XL (ICLR 2025)

Causal transformer for unified forecasting:

  • Generalizes next-token prediction to multivariate next-patch prediction
  • Handles univariate, multivariate, and covariate-informed forecasting
  • Patch-level generation based on long-context sequences

TimeGPT (Nixtla)

Production-ready foundation model:

  • Trained on 100B+ data points
  • Zero-shot forecasting AND anomaly detection
  • Does NOT use patching (unlike most others)
  • Available via API and SDK

Lag-Llama (ServiceNow)

Probabilistic univariate foundation model:

  • Decoder-only transformer with lagged features (t-1, t-7, t-30, etc.)
  • Outputs Student's t distribution parameters
  • Strong zero-shot generalization

Sundial (ICML 2025 Oral - Top 1%)

Generative foundation model from Tsinghua University—current benchmark leader:

  • Pre-trained on TimeBench (10^12 time points—largest pretraining corpus)
  • TimeFlow Loss: Flow-matching for native continuous-valued training (no tokenization)
  • #1 on GIFT-Eval (MASE) and Time-Series-Library (MSE/MAE)
  • Zero-shot predictions in milliseconds
  • Generates multiple probable predictions without specifying prior distributions
  • Mitigates mode collapse via TimeFlow Loss

Time-MoE (ICLR 2025 Spotlight)

Billion-scale MoE foundation model—the largest open time series model:

  • Scales up to 2.4 billion parameters
  • Time-300B: Largest open time series dataset (300B+ points across 9 domains)
  • Sparse MoE activates only subset of experts per prediction
  • Context length up to 4096, arbitrary prediction horizons
  • Validates scaling laws for time series (more data + bigger model = better)

TabPFN-TS (January 2025)

Tabular foundation model adapted for time series—surprisingly effective:

  • Only 11M parameters but top rank on GIFT-Eval
  • Recasts forecasting as tabular regression problem
  • Outperforms Chronos-Mini (20M) by 7.7%
  • Outperforms Chronos-Large (710M) by 3.0% in zero-shot
  • No time series pretraining needed—uses only tabular/synthetic data
  • Supports both point and probabilistic forecasting

Time-LLM (ICLR 2024)

LLM reprogramming framework—1000+ citations in 2 years:

  • Repurposes frozen LLMs (Llama-7B, GPT-2, BERT) for forecasting
  • Prompt-as-Prefix (PaP): Enriches input with domain knowledge and task instructions
  • Reprogram time series into text prototypes the LLM understands
  • Excels in few-shot and zero-shot scenarios
  • Adopted by industry for solar, wind, and weather forecasting

LLM-Based Forecasting (GPT-4, Claude)

General-purpose LLMs can forecast time series via prompting:

  • LLMTIME: Encodes series as digit strings and converts the LLM's discrete token distributions into continuous forecast distributions
  • Surprising finding: GPT-4 performs worse than GPT-3 for forecasting (RLHF may hurt calibration)
  • Specialized foundation models still outperform general LLMs
  • Best use: Combine LLM reasoning with specialized forecasters

Benchmarks: GIFT-Eval

GIFT-Eval (General Time Series Forecasting Model Evaluation) is the new standard benchmark:

  • 28 datasets, 144,000+ time series, 177 million data points
  • 7 domains, 10 frequencies, short to long-term horizons
  • Non-leaking pretraining data: 230 billion points for fair evaluation
  • Evaluates univariate, multivariate, and zero-shot capabilities

Current Leaderboard (2025):

  1. Sundial (128M) - #1 MASE on GIFT-Eval, #1 MSE/MAE on TSLib (ICML 2025 Oral)
  2. TabPFN-TS (11M) - Top rank on point forecasting despite tiny size
  3. TiRex (35M) - State-of-the-art xLSTM, best zero-shot
  4. TimeCopilot - Ensemble of Chronos-2 + TimesFM-2.5 + TiRex
  5. Chronos-2 - Best full-featured single model

Key Finding: Short-term → foundation models win. Long-term → fine-tuned models (PatchTST, iTransformer) catch up.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              FOUNDATION MODEL COMPARISON (2025)                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Model         Arch        Params   Multi  Covar  Prob   Best For       │
│  ────────────────────────────────────────────────────────────────────── │
│  Sundial       Flow-Match  128M     ✓      ✗      ✓     GIFT-Eval #1    │
│  TabPFN-TS     PFN         11M      ✗      ✗      ✓     Point forecast  │
│  TiRex         xLSTM       35M      ✗      ✗      ✓     Zero-shot best  │
│  Chronos-2     Enc-only    120M     ✓      ✓      ✓     Full-featured   │
│  Chronos-Bolt  Enc-Dec     varies   ✗      ✗      ✓     Fast inference  │
│  Time-MoE      MoE         2.4B     ✓      ✗      ✓     Scale leader    │
│  TimesFM-2.5   Dec-only    500M     ✗      ✗      ✗     Google Cloud    │
│  Moirai-2.0    Dec-only    varies   ✓      ✓      ✓     Efficiency      │
│  Moirai-MoE    MoE         varies   ✓      ✓      ✓     Sparse compute  │
│  Toto          Transformer 151M     varies varies ✓     Observability   │
│  Timer-XL      Causal      varies   ✓      ✓      ✓     Unified tasks   │
│  Time-LLM      LLM-reprg   7B+      ✓      ✓      ✓     LLM leverage    │
│  TimeGPT       varies      varies   ✓      ✓      ✓     Production API  │
│  Lag-Llama     Dec-only    varies   ✗      ✗      ✓     Probabilistic   │
│                                                                          │
│  GIFT-EVAL LEADERBOARD (2025):                                          │
│  1. Sundial (128M) - Flow-matching, #1 MASE (ICML 2025 Oral)           │
│  2. TabPFN-TS (11M) - Tabular model beats specialized models            │
│  3. TiRex (35M) - xLSTM with state-tracking                             │
│  4. TimeCopilot - Ensemble (Chronos-2 + TimesFM-2.5 + TiRex)           │
│  5. Chronos-2 - Best single full-featured model                         │
│                                                                          │
│  KEY TRENDS 2025:                                                        │
│  • Smaller models with better architectures beating giants               │
│  • MoE for efficient scaling (Time-MoE, Moirai-MoE)                     │
│  • Tabular models surprisingly competitive (TabPFN-TS)                  │
│  • Ensembles of top models achieve best results                         │
│  • Scaling laws validated for time series                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

How Tokenization Works

The key challenge: language is discrete (words), time series are continuous (numbers).

Quantization (Chronos)

  1. Scale series to mean 0, std 1
  2. Bin values into 4,096 buckets
  3. Each bucket = one vocabulary token
  4. Train to predict tokens
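
A rough sketch of the idea; Chronos's real tokenizer also handles padding and special tokens, and the bin range below is an illustrative assumption:

Python
import numpy as np

def quantize(series, n_bins=4096, low=-15.0, high=15.0):
    """Mean-scale, then map each value to one of n_bins discrete token ids."""
    scale = np.abs(series).mean() + 1e-8
    scaled = series / scale                        # Mean scaling
    edges = np.linspace(low, high, n_bins - 1)     # Uniform bin edges (assumption)
    tokens = np.digitize(scaled, edges)            # Token id per time step
    return tokens, scale

def dequantize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    centers = np.linspace(low, high, n_bins)       # Map token id back to a bin center
    return centers[tokens] * scale

tokens, scale = quantize(np.sin(np.arange(200) / 10) * 50)
reconstructed = dequantize(tokens, scale)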

Patching (TimesFM, PatchTST)

  1. Chunk into non-overlapping patches (e.g., 32 points)
  2. Embed each patch via learned projection
  3. Feed patch embeddings to transformer

Benefits: Dramatic sequence reduction, local pattern capture.
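
A minimal patch-embedding sketch (non-overlapping patches; PatchTST also supports overlapping patches via a stride):

Python
import torch
import torch.nn as nn

patch_len, d_model = 32, 128
series = torch.randn(8, 512)                                 # (batch, time)

patches = series.reshape(8, 512 // patch_len, patch_len)     # (batch, 16, 32)
embed = nn.Linear(patch_len, d_model)
tokens = embed(patches)                                      # (batch, 16, d_model)
# 512 time steps -> 16 tokens: a 32x shorter sequence for the transformer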

Wavelet-Based (WaveToken)

  1. Wavelet decomposition: Separate coarse (trend) and fine (detail)
  2. Quantize wavelet coefficients
  3. Predict future coefficients, reconstruct

Multi-Resolution

Use patches of varying sizes simultaneously (8, 32, 128 steps) to capture patterns at different temporal scales.


Training Deep Learning Models for Time Series

Training time series forecasting models requires careful attention to loss functions, data preparation, validation strategies, and hyperparameter tuning. This section covers the complete training pipeline.

Loss Functions

The choice of loss function fundamentally shapes what your model learns. Different losses suit different forecasting objectives.

Point Forecast Losses

Mean Squared Error (MSE): \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

  • Penalizes large errors heavily (squared term)
  • Sensitive to outliers
  • Use when large errors are particularly costly

Mean Absolute Error (MAE): \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

  • More robust to outliers than MSE
  • All errors penalized equally
  • Better when outliers are noise, not signal

Mean Absolute Percentage Error (MAPE): \text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i|}

  • Scale-independent (useful for comparing across series)
  • Problem: Undefined when y=0, asymmetric

Symmetric MAPE (sMAPE): \text{sMAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}

  • Addresses MAPE's asymmetry
  • Bounded between 0% and 200%

Probabilistic Forecast Losses

Quantile Loss (Pinball Loss): L_q(y, \hat{y}) = q \cdot (y - \hat{y})^+ + (1-q) \cdot (\hat{y} - y)^+

  • Trains model to predict specific quantiles (e.g., 10th, 50th, 90th percentile)
  • Asymmetric: Higher quantiles penalize underestimates more
  • Sum quantile losses across multiple quantiles for full distribution
Python
import torch

def quantile_loss(y_true, y_pred, quantile):
    error = y_true - y_pred
    # Pinball loss: over- and under-prediction penalized asymmetrically by quantile
    return torch.max(quantile * error, (quantile - 1) * error).mean()

# Train for multiple quantiles
quantiles = [0.1, 0.5, 0.9]
total_loss = sum(quantile_loss(y, pred[q], q) for q in quantiles)

Continuous Ranked Probability Score (CRPS):

Code
CRPS = ∫ (F(x) - 1{x ≥ y})² dx
  • Evaluates the entire predicted distribution against the observation
  • Generalizes MAE: when prediction is a point (delta function), CRPS = MAE
  • Key advantage: Avoids quantile crossing (where predicted quantiles intersect)
  • Used by Chronos, DeepAR, and other probabilistic models
Python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y_true, mu, sigma):
    """CRPS for a Gaussian predictive distribution (closed form)"""
    z = (y_true - mu) / sigma
    crps = sigma * (z * (2 * norm.cdf(z) - 1) +
                    2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return crps.mean()

Negative Log-Likelihood (NLL):

  • For parametric distributions (Gaussian, Student's t, Negative Binomial)
  • Model predicts distribution parameters (μ, σ for Gaussian)
  • Loss is negative log probability of observation under predicted distribution
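
For the Gaussian case, PyTorch provides this loss directly; the model head predicts a mean and a variance per step:

Python
import torch
import torch.nn as nn

nll = nn.GaussianNLLLoss()

mu = torch.randn(32, 24)                                          # Predicted mean, (batch, horizon)
var = torch.nn.functional.softplus(torch.randn(32, 24)) + 1e-6    # Predicted variance (must be positive)
y = torch.randn(32, 24)                                           # Observations

loss = nll(mu, y, var)    # Negative log-likelihood of y under N(mu, var)
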
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LOSS FUNCTION SELECTION GUIDE                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OBJECTIVE                      RECOMMENDED LOSS                         │
│  ────────────────────────────────────────────────────────────────────── │
│  Point forecast, outliers OK    MSE                                      │
│  Point forecast, robust         MAE                                      │
│  Scale-independent comparison   sMAPE, MASE                              │
│  Prediction intervals           Quantile Loss (multiple quantiles)       │
│  Full distribution              CRPS or NLL                              │
│  Business asymmetric cost       Custom asymmetric loss                   │
│                                                                          │
│  FOUNDATION MODEL TRAINING:                                              │
│  Chronos: Cross-entropy on quantized tokens                              │
│  TimesFM: MSE on patches                                                 │
│  Moirai: Mixture distribution NLL                                        │
│  TiRex: Quantile loss                                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Preparation and Preprocessing

Normalization Strategies

Per-series normalization (most common):

Python
def normalize_series(series):
    mean = series.mean()
    std = series.std() + 1e-8  # Avoid division by zero
    normalized = (series - mean) / std
    return normalized, mean, std

def denormalize(normalized, mean, std):
    return normalized * std + mean

Global normalization (for related series):

  • Compute statistics across all series in dataset
  • Use when series share similar scales
  • Common in foundation model pretraining

Log transformation (for multiplicative patterns):

Python
import numpy as np

# For data with multiplicative seasonality/trends
log_series = np.log1p(series)  # log(1+x) handles zeros
# Train on log_series, then np.expm1() to reverse

Handling Missing Values

  • Forward fill: Use last known value
  • Interpolation: Linear or spline interpolation
  • Masking: Mark missing values and let the model handle them (Chronos-2 supports this)
  • Imputation models: Use CSDI or another imputation method first
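
Common options in pandas (the right choice is problem-dependent):

Python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, 5.0])

filled_ffill = s.ffill()                        # Forward fill: carry last known value
filled_interp = s.interpolate(method="linear")  # Linear interpolation
mask = s.isna()                                 # Keep a mask if the model supports masking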

Handling Irregular Timestamps

For irregular time series:

  • Resample to regular frequency (information loss)
  • Neural CDEs: Designed for irregular data
  • Time-aware models: Encode time gaps as features

Data Augmentation for Time Series

Data augmentation is critical for training generalizable models, especially foundation models.

TSMix (Time Series Mixup)

TSMix creates synthetic series by combining real series:

Python
import numpy as np

def tsmix(series_a, series_b, alpha=0.5):
    """Mix two time series via convex combination (mixup for time series)"""
    lam = np.random.beta(alpha, alpha)   # Random mixing coefficient
    mixed = lam * series_a + (1 - lam) * series_b
    return mixed

Used in Chronos training: 10M TSMix augmentations from 28 datasets.

Key insight: TSMix improves zero-shot performance but not in-domain performance.

KernelSynth (Synthetic from Gaussian Processes)

KernelSynth generates unlimited synthetic series using Gaussian Processes:

Python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

def kernel_synth(length=512, n_samples=1000):
    """Generate synthetic time series by sampling from randomly composed GP kernels"""
    synthetic_series = []

    for _ in range(n_samples):
        # Randomly compose kernels: smooth component + periodic component + noise
        kernel = (
            RBF(length_scale=np.random.uniform(10, 100)) +
            ExpSineSquared(length_scale=np.random.uniform(5, 50),
                           periodicity=np.random.uniform(10, 100)) +
            WhiteKernel(noise_level=np.random.uniform(0.01, 0.1))
        )

        # Sample one series from the GP prior
        X = np.arange(length).reshape(-1, 1)
        K = kernel(X)
        series = np.random.multivariate_normal(np.zeros(length), K)
        synthetic_series.append(series)

    return synthetic_series

Optimal ratio: ~10% synthetic data. More synthetic data tends to worsen performance.

Other Augmentation Techniques

Jittering: Add small random noise

Python
augmented = series + np.random.normal(0, 0.01, len(series))

Scaling: Random amplitude scaling

Python
augmented = series * np.random.uniform(0.8, 1.2)

Time warping: Stretch/compress the time axis locally
Window slicing: Random crops from longer series
Magnitude warping: Smooth amplitude variations

Validation Strategies for Time Series

Critical: Standard k-fold cross-validation violates temporal ordering and causes data leakage. Use time-respecting validation.

Train-Test Split

Simple temporal split—train on past, test on future:

Python
def temporal_split(series, train_ratio=0.8):
    split_idx = int(len(series) * train_ratio)
    train = series[:split_idx]
    test = series[split_idx:]
    return train, test

Problem: Single split may not be representative.

Walk-Forward Validation (Rolling Origin)

The gold standard for time series. Train on expanding/rolling window, test on next period:

Python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_validation(series, model, n_splits=5):
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = []

    for train_idx, test_idx in tscv.split(series):
        train, test = series[train_idx], series[test_idx]

        # Train on all data up to the split point
        model.fit(train)

        # Predict the held-out period and evaluate
        predictions = model.predict(len(test))
        score = evaluate(test, predictions)  # evaluate = your metric, e.g. MAE
        scores.append(score)

    return np.mean(scores), np.std(scores)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    WALK-FORWARD VALIDATION                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Time →                                                                  │
│  ────────────────────────────────────────────────────────────────────── │
│                                                                          │
│  Fold 1: [TRAIN════════] [TEST]                                         │
│  Fold 2: [TRAIN════════════════] [TEST]                                 │
│  Fold 3: [TRAIN════════════════════════] [TEST]                         │
│  Fold 4: [TRAIN════════════════════════════════] [TEST]                 │
│  Fold 5: [TRAIN════════════════════════════════════════] [TEST]         │
│                                                                          │
│  Each fold: Train on all prior data, test on next period                │
│  Never use future data to predict past                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Adding a Gap (Embargo Period)

Prevent information leakage from recent observations:

Python
def walk_forward_with_gap(series, gap=7):
    """Gap between train end and test start"""
    # If predicting weekly, gap=7 prevents day-of-week leakage
    # (scikit-learn's TimeSeriesSplit also accepts a `gap` argument)
    ...

Expanding vs Rolling Window

Expanding window: Train set grows with each fold (more data)
Rolling window: Fixed-size train set slides forward (handles non-stationarity)

Python
# Rolling window
tscv = TimeSeriesSplit(n_splits=5, max_train_size=365*3)  # Max 3 years

Hyperparameter Tuning

Key hyperparameters for time series deep learning models:

Lookback Window (Context Length)

How much history the model sees:

  • Too short: Misses seasonal patterns
  • Too long: Noise, computational cost, vanishing gradients

Rule of thumb: At least 2-3x the longest seasonality period.

Python
# If data has yearly seasonality (365 days)
lookback = 365 * 2  # ~2 years of history

# For foundation models
# Chronos: Up to 4096 tokens
# TimesFM: Up to 2048 points
# TiRex: Variable, designed for long context

Forecast Horizon

How far ahead to predict:

  • Longer horizons → harder, more uncertainty
  • Match to business needs

Learning Rate

Most important hyperparameter:

  • Too high: Divergence, oscillation
  • Too low: Slow convergence, stuck in local minima

Recommendations:

  • Start with 1e-3 to 1e-4 for Adam
  • Use learning rate schedulers (cosine, reduce-on-plateau)
  • Linear scaling rule: if you scale the batch size by k, scale the learning rate by k
Python
# Learning rate scheduling
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-6
)

# Or reduce on plateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10
)

Batch Size

  • Smaller (16-64): More noise, better generalization, slower
  • Larger (128-512): Stable gradients, faster, may overfit

For time series: Research suggests optimal batch size may correlate with data periodicity (use FFT to find dominant frequencies).

Python
# Typical starting points
batch_size = 32  # Good default
# Increase if training is stable, decrease if loss is noisy

Model-Specific Hyperparameters

LSTM/GRU:

  • Hidden size: 64-512
  • Number of layers: 1-3
  • Dropout: 0.1-0.3

Transformer:

  • d_model: 64-512
  • n_heads: 4-8
  • n_layers: 2-6
  • Patch size: 8-64

N-BEATS:

  • Stack types: [trend, seasonality] or [generic]
  • Blocks per stack: 3-5
  • Layer width: 256-512

Fine-Tuning Foundation Models

Foundation models work zero-shot, but fine-tuning can significantly improve domain-specific performance.

When to Fine-Tune

Fine-tune when:

  • Your domain has unique patterns not in pretraining data
  • You have sufficient domain data (1000+ series or 100K+ points)
  • Zero-shot performance is good but not good enough
  • Computational resources available

Don't fine-tune when:

  • Very limited data (risk of overfitting)
  • Zero-shot already meets requirements
  • Need rapid deployment

Fine-Tuning Strategies

Full Fine-Tuning

Update all model weights:

Python
# Load pretrained model
model = ChronosModel.from_pretrained("amazon/chronos-t5-base")

# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True

# Train with lower learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # 10x lower than pretraining

Pros: Maximum adaptation
Cons: Needs more data, risk of catastrophic forgetting

Adapter-Based Fine-Tuning (PEFT)

Add small trainable modules, freeze pretrained weights:

Python
from peft import get_peft_model, LoraConfig

# Add LoRA adapters
lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)
model = get_peft_model(base_model, lora_config)

# Only adapter parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# Typically <1% of total parameters

Pros: Efficient, less overfitting, can store multiple adapters
Cons: Less expressive than full fine-tuning

ChronosX uses adapters to add covariate support to Chronos, achieving ~22% improvement.

Last-Layer Fine-Tuning

Only train the output projection:

Python
# Freeze all but last layer
for param in model.parameters():
    param.requires_grad = False
for param in model.output_projection.parameters():
    param.requires_grad = True

Pros: Fast, minimal overfitting risk
Cons: Limited adaptation

Fine-Tuning Code Examples

Fine-Tuning Chronos

Python
import torch
from chronos import ChronosConfig, ChronosModel
from torch.utils.data import DataLoader

# Load pretrained
config = ChronosConfig.from_pretrained("amazon/chronos-t5-small")
model = ChronosModel.from_pretrained("amazon/chronos-t5-small")

# Prepare your data
train_dataset = YourTimeSeriesDataset(your_data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Fine-tuning setup
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

model.train()
for epoch in range(10):
    for batch in train_loader:
        context, target = batch

        # Tokenize (Chronos-specific)
        tokens = model.tokenizer.encode(context)
        target_tokens = model.tokenizer.encode(target)

        # Forward pass
        loss = model(tokens, labels=target_tokens).loss

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    scheduler.step()

# Save fine-tuned model
model.save_pretrained("./chronos-finetuned")

Fine-Tuning with NeuralForecast

Python
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, NHITS, TFT

# Define models with hyperparameters
models = [
    NBEATS(
        input_size=7*24,  # 1 week of hourly data
        h=24,  # Predict 24 hours ahead
        max_steps=1000,
        learning_rate=1e-3,
        batch_size=32,
    ),
    NHITS(
        input_size=7*24,
        h=24,
        max_steps=1000,
        n_freq_downsample=[24, 12, 1],  # Multi-resolution
    ),
    TFT(
        input_size=7*24,
        h=24,
        hidden_size=128,
        max_steps=1000,
    )
]

# Train
nf = NeuralForecast(models=models, freq='H')
nf.fit(df=train_df)

# Predict
forecasts = nf.predict()

Training Infrastructure

Memory Optimization

Python
# Gradient checkpointing (trade compute for memory)
model.gradient_checkpointing_enable()

# Mixed precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    loss = model(inputs).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Distributed Training

Python
# PyTorch DDP
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend='nccl')
model = DistributedDataParallel(model)

# Or use Hugging Face Accelerate
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

Evaluation Metrics

Point Forecast Metrics

Python
import numpy as np

def mase(y_true, y_pred, y_train, seasonality=1):
    """Mean Absolute Scaled Error - compares to naive baseline"""
    naive_mae = np.mean(np.abs(y_train[seasonality:] - y_train[:-seasonality]))
    forecast_mae = np.mean(np.abs(y_true - y_pred))
    return forecast_mae / naive_mae

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error"""
    return 100 * np.mean(2 * np.abs(y_true - y_pred) /
                         (np.abs(y_true) + np.abs(y_pred) + 1e-8))

Probabilistic Metrics

Python
import numpy as np

def coverage(y_true, lower, upper):
    """What fraction of observations fall within prediction interval"""
    return np.mean((y_true >= lower) & (y_true <= upper))

def interval_width(lower, upper):
    """Average width of prediction intervals"""
    return np.mean(upper - lower)

# Good model: High coverage + narrow intervals
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    COMPLETE TRAINING PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. DATA PREPARATION                                                     │
│     ├─ Handle missing values                                             │
│     ├─ Normalize (per-series or global)                                  │
│     ├─ Create train/val/test splits (temporal)                          │
│     └─ Apply augmentation (TSMix, KernelSynth)                          │
│                                                                          │
│  2. MODEL SELECTION                                                      │
│     ├─ Baseline: ARIMA/ETS for comparison                               │
│     ├─ Choose architecture based on data/requirements                    │
│     └─ Foundation model for zero-shot baseline                          │
│                                                                          │
│  3. TRAINING                                                             │
│     ├─ Set loss function (MSE/MAE/Quantile/CRPS)                        │
│     ├─ Configure optimizer (AdamW, lr=1e-3 to 1e-4)                     │
│     ├─ Add learning rate scheduler                                       │
│     ├─ Implement early stopping on validation loss                       │
│     └─ Use gradient clipping (max_norm=1.0)                             │
│                                                                          │
│  4. VALIDATION                                                           │
│     ├─ Walk-forward validation (5+ folds)                               │
│     ├─ Track point metrics (MAE, MASE, sMAPE)                           │
│     └─ Track probabilistic metrics (CRPS, coverage)                     │
│                                                                          │
│  5. FINE-TUNING (if using foundation model)                             │
│     ├─ Start with zero-shot baseline                                    │
│     ├─ Try adapter-based fine-tuning first                              │
│     ├─ Use lower learning rate (10x lower)                              │
│     └─ Monitor for overfitting                                          │
│                                                                          │
│  6. PRODUCTION                                                           │
│     ├─ Save model + normalization params                                │
│     ├─ Set up monitoring (accuracy drift)                               │
│     └─ Plan retraining schedule                                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Model Selection Guide

When Classical Methods Win

Use ARIMA/ETS when:

  • Series < 1000 points with clear autocorrelation
  • Interpretability > marginal accuracy gains
  • No deep learning infrastructure

Use Prophet when:

  • Business forecasting with seasonalities/holidays
  • Non-technical stakeholders need understanding
  • Speed of development > accuracy

When Deep Learning Excels

Use N-BEATS/N-HiTS when:

  • Interpretability needed but classical methods insufficient
  • Long-horizon forecasting
  • No time-series-specific feature engineering desired

Use LSTM/Transformer when:

  • Complex patterns, substantial data
  • Multivariate with interactions
  • Long sequences (use Informer/Autoformer for efficiency)

When Foundation Models Shine

Use foundation models when:

  • Zero-shot acceptable: No training time
  • Limited data: Cold-start problems
  • Diverse domains: One model across everything
  • Rapid prototyping: Test feasibility first

Model recommendations:

  • Overall best: TiRex (35M, state-of-the-art)
  • Full-featured: Chronos-2 (covariates, multivariate)
  • Fast inference: Chronos-Bolt (250x faster)
  • Efficiency: Moirai-2.0 (2x faster, 30x smaller)
  • Google Cloud: TimesFM (BigQuery integration)
  • Probabilistic: Lag-Llama (uncertainty quantification)

Production Considerations

Data Pipeline

  • Missing value handling strategy
  • Normalization parameters versioned
  • Frequency validation
  • Outlier detection

Inference

  • Context window limits
  • GPU batching optimization
  • Fallback models
  • Caching for repeated series

Output

  • Prediction intervals (not just point forecasts)
  • Denormalization
  • Forecast metadata (model version, confidence)

Monitoring

  • Accuracy metrics over time
  • Data/concept drift detection
  • Degradation alerts
  • Regular backtesting

Practical Examples

Using Chronos

Python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-base",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

context = torch.tensor([[1.2, 1.5, 1.3, 1.8, 2.1, 1.9, 2.3, 2.0]])

forecast = pipeline.predict(
    context,
    prediction_length=12,
    num_samples=20,
)

median = forecast.median(dim=1).values
lower = forecast.quantile(0.1, dim=1)
upper = forecast.quantile(0.9, dim=1)

Using TiRex

Python
from tirex import TiRexPipeline

pipeline = TiRexPipeline.from_pretrained("NX-AI/TiRex")

# TiRex provides both point and quantile predictions
forecast = pipeline.predict(
    context,
    prediction_length=24,
)

Using TimesFM

Python
import timesfm

tfm = timesfm.TimesFm(
    context_len=512,
    horizon_len=128,
    input_patch_len=32,
    output_patch_len=128,
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

forecast = tfm.forecast(context)


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
