
Time Series Forecasting with Foundation Models: From ARIMA to Chronos

A comprehensive guide to modern time series forecasting—from classical statistical methods to transformer-based architectures, state space models, diffusion models, and zero-shot foundation models like Chronos, TimesFM, Moirai, and TiRex.

19 min read

The Revolution in Time Series Forecasting

Time series forecasting has undergone a transformation as dramatic as the one that reshaped natural language processing. For decades, statistical methods like ARIMA dominated. Then deep learning approaches—LSTMs, GRUs, and neural basis expansion—showed promise for capturing complex temporal patterns. Transformers brought attention mechanisms, state space models offered linear complexity alternatives, and diffusion models introduced probabilistic generation. Now, we're witnessing foundation models for time series: large pretrained models that forecast any time series with zero-shot learning.

This shift matters because time series forecasting is everywhere: demand planning, financial markets, energy load prediction, traffic optimization, healthcare monitoring, climate modeling. The new generation of models delivers better accuracy, often without any training on your data.

This guide covers the complete landscape: classical methods, deep learning architectures (N-BEATS, LSTMs), transformers (Informer, Autoformer, iTransformer, TimeXer), state space models (Mamba, xLSTM/TiRex), diffusion models (TimeGrad, CSDI), novel architectures (KAN, Neural ODEs, Graph NNs), self-supervised learning (TS2Vec), and foundation models (Chronos, TimesFM, Moirai, TiRex, Toto).

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              EVOLUTION OF TIME SERIES FORECASTING                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1970s-2010s: STATISTICAL METHODS                                        │
│  ARIMA, Exponential Smoothing, Prophet                                   │
│  ✓ Interpretable    ✓ Fast    ✗ Limited patterns                        │
│                                                                          │
│  2015-2019: EARLY DEEP LEARNING                                          │
│  LSTM, GRU, TCN, DeepAR                                                  │
│  ✓ Complex patterns    ✓ Multivariate    ✗ Needs lots of data           │
│                                                                          │
│  2019-2021: NEURAL BASIS EXPANSION                                       │
│  N-BEATS, N-HiTS, NBEATSx                                                │
│  ✓ Interpretable    ✓ No hand-crafted features    ✓ Fast                │
│                                                                          │
│  2020-2023: TRANSFORMERS                                                 │
│  Informer, Autoformer, TFT, PatchTST, iTransformer, TimeXer              │
│  ✓ Long-range dependencies    ✓ Attention    ✗ Quadratic complexity     │
│                                                                          │
│  2023-2024: STATE SPACE & DIFFUSION                                      │
│  Mamba, MambaTS, S-Mamba, TimeGrad, CSDI                                 │
│  ✓ Linear complexity    ✓ Probabilistic    ✓ Efficient                  │
│                                                                          │
│  2024-NOW: FOUNDATION MODELS                                             │
│  Chronos, TimesFM, Moirai, TiRex, Toto, Timer-XL                         │
│  ✓ Zero-shot    ✓ Universal    ✓ Pretrained on billions of points       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Classical Statistical Methods

Before diving into neural approaches, understanding classical methods is essential. They remain the right choice for many problems.

ARIMA: The Statistical Workhorse

ARIMA (AutoRegressive Integrated Moving Average) models time series as a combination of three components:

  • AR (AutoRegressive): Current value depends on previous values
  • I (Integrated): Differencing to make the series stationary
  • MA (Moving Average): Current value depends on previous forecast errors

Why ARIMA still matters: Excels at short-term forecasting with stationary data. Fast, interpretable, requires no training data beyond the series itself. For univariate forecasting with clear autocorrelation, ARIMA often matches neural networks—especially on small datasets.

Limitations: Non-linear relationships, multiple seasonalities, external variables, long-term dependencies.
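
As a quick reference, here is a minimal fit-and-forecast sketch with statsmodels; the (2, 1, 1) order is illustrative and would normally be chosen via ACF/PACF analysis or auto-ARIMA:

Python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative univariate series (e.g., daily demand)
series = np.cumsum(np.random.randn(200)) + 100

# order=(p, d, q): 2 AR lags, 1 difference, 1 MA term (illustrative choice)
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()

forecast = fitted.forecast(steps=12)                            # Point forecasts, 12 steps ahead
intervals = fitted.get_forecast(steps=12).conf_int(alpha=0.1)   # 90% prediction intervals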

Exponential Smoothing (ETS)

ETS decomposes time series into Error, Trend, and Seasonality. The family includes:

  • Simple Exponential Smoothing: No trend or seasonality
  • Holt's Linear: Adds trend
  • Holt-Winters: Adds seasonality (additive or multiplicative)
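
A minimal Holt-Winters sketch with statsmodels, assuming additive trend and seasonality with a period of 12:

Python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative series with trend + seasonality (period 12)
t = np.arange(120)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.randn(120)

model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12)
fitted = model.fit()
forecast = fitted.forecast(12)   # 12 steps ahead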

Prophet: Business Forecasting

Prophet (Meta) models time series as: y(t) = trend(t) + seasonality(t) + holidays(t) + error(t)

Strengths: Handles missing data and outliers, automatic changepoint detection, easy domain knowledge incorporation.

Limitations: Research shows Prophet underperforms ARIMA and neural networks on accuracy. Value lies in ease of use, not peak performance.
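
A minimal Prophet sketch on synthetic daily data (Prophet expects a DataFrame with ds and y columns):

Python
import numpy as np
import pandas as pd
from prophet import Prophet

# Prophet expects columns named 'ds' (timestamp) and 'y' (value)
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=365, freq="D"),
    "y": np.random.randn(365).cumsum() + 100,
})

m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=30)   # Extend 30 days beyond training data
forecast = m.predict(future)                   # Columns include yhat, yhat_lower, yhat_upper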


Deep Learning Architectures

Recurrent Networks: LSTM and GRU

Long Short-Term Memory (LSTM) networks solve vanilla RNNs' vanishing gradient problem with gates:

  • Forget gate: What to discard from memory
  • Input gate: What new information to store
  • Output gate: What to output

Gated Recurrent Units (GRUs) simplify to two gates (update, reset) with comparable performance.

Why LSTMs work: Learn long-range dependencies, handle variable-length sequences, incorporate multiple features.

Limitations: Sequential training (no parallelization), hyperparameter tuning difficulty, substantial data requirements.
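
A minimal PyTorch sketch of a direct multi-step LSTM forecaster (window construction and the training loop are omitted):

Python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, horizon=24):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, horizon)   # Direct multi-step output

    def forward(self, x):                    # x: (batch, lookback, input_size)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])      # Last hidden state -> (batch, horizon)

model = LSTMForecaster()
x = torch.randn(32, 168, 1)                  # 32 windows, 168-step lookback, 1 feature
y_hat = model(x)                             # (32, 24) forecasts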

Temporal Convolutional Networks (TCN)

TCNs apply 1D dilated convolutions across time:

Code
Dilation pattern:    1, 2, 4, 8, 16...
Receptive field grows exponentially while parameters grow linearly

TCN advantages: Parallelizable, flexible receptive field, faster training, stable gradients.
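
A sketch of stacked dilated causal convolutions in PyTorch; the left-only padding keeps each output from seeing future values:

Python
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # Left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))          # Pad the past, never the future
        return torch.relu(self.conv(x))

# Dilations 1, 2, 4, 8: receptive field grows exponentially with depth
tcn = nn.Sequential(*[CausalConvBlock(16, dilation=d) for d in (1, 2, 4, 8)])
out = tcn(torch.randn(8, 16, 256))                       # (8, 16, 256)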

DeepAR: Probabilistic Forecasting at Scale

Amazon's DeepAR trains a single LSTM across thousands of related series:

  • Probabilistic outputs: Predicts distributions, not point estimates
  • Global model: Learns cross-series patterns
  • Autoregressive generation: Uses sampled predictions for subsequent steps

N-BEATS and N-HiTS: Neural Basis Expansion

N-BEATS Architecture

N-BEATS (Neural Basis Expansion Analysis for Time Series), developed by ElementAI/ServiceNow, revolutionized interpretable deep forecasting. The key insight: use neural basis expansion to decompose forecasts.

How it works:

  1. Blocks process the input through fully-connected layers
  2. Each block outputs expansion coefficients (θ) for both backward (backcast) and forward (forecast)
  3. Coefficients are projected through basis functions to produce predictions
  4. Doubly residual stacking: Blocks are organized into stacks, with residual connections both within and across stacks
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    N-BEATS ARCHITECTURE                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Input: [x₁, x₂, ..., xₜ] (lookback window)                             │
│              │                                                           │
│              ▼                                                           │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  STACK 1: Trend                                                 │    │
│  │  ┌─────────┐   ┌─────────┐   ┌─────────┐                       │    │
│  │  │ Block 1 │ → │ Block 2 │ → │ Block 3 │                       │    │
│  │  └────┬────┘   └────┬────┘   └────┬────┘                       │    │
│  │       │             │             │                             │    │
│  │       ▼             ▼             ▼                             │    │
│  │  [θ_trend] → Polynomial Basis → Trend Forecast                 │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│              │ (residual)                                                │
│              ▼                                                           │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │  STACK 2: Seasonality                                           │    │
│  │  ┌─────────┐   ┌─────────┐   ┌─────────┐                       │    │
│  │  │ Block 1 │ → │ Block 2 │ → │ Block 3 │                       │    │
│  │  └────┬────┘   └────┬────┘   └────┬────┘                       │    │
│  │       │             │             │                             │    │
│  │       ▼             ▼             ▼                             │    │
│  │  [θ_season] → Fourier Basis → Seasonality Forecast             │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│              │                                                           │
│              ▼                                                           │
│  Final Forecast = Σ(Trend + Seasonality forecasts from all blocks)      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Two configurations:

  1. Generic: No time-series-specific components, demonstrating that generic deep learning primitives alone can solve forecasting
  2. Interpretable: Uses trend (polynomial) and seasonality (Fourier) basis functions

Performance: Improved forecast accuracy by 11% over statistical benchmarks, 3% over M4 competition winner.
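
A sketch of a single generic N-BEATS block in PyTorch; the full model stacks many such blocks with doubly residual connections:

Python
import torch
import torch.nn as nn

class GenericNBeatsBlock(nn.Module):
    def __init__(self, lookback, horizon, width=256, theta_dim=32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(lookback, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.theta = nn.Linear(width, theta_dim)
        # Generic configuration: learned (not fixed) basis for backcast and forecast
        self.backcast_basis = nn.Linear(theta_dim, lookback)
        self.forecast_basis = nn.Linear(theta_dim, horizon)

    def forward(self, x):                     # x: (batch, lookback)
        theta = self.theta(self.fc(x))        # Expansion coefficients
        return self.backcast_basis(theta), self.forecast_basis(theta)

block = GenericNBeatsBlock(lookback=96, horizon=24)
backcast, forecast = block(torch.randn(8, 96))
# In the full model: residual = x - backcast feeds the next block,
# and forecasts from all blocks are summed.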

N-HiTS: Hierarchical Interpolation

N-HiTS extends N-BEATS for long-horizon forecasting with:

  • Hierarchical interpolation: Different blocks specialize in different frequency bands
  • Multi-rate data sampling: Blocks see the series at different temporal resolutions

Result: Outperforms Transformer-based models by 25%+ on benchmarks.

NBEATSx: Adding Exogenous Variables

NBEATSx extends N-BEATS to incorporate external features, improving accuracy by ~20% over vanilla N-BEATS.


Transformer Architectures for Time Series

Why Vanilla Transformers Struggle

Self-attention has O(n²) complexity—prohibitive for long sequences. Additional issues:

  • Permutation invariance: Transformers don't inherently understand order
  • Point-wise attention: May miss local patterns spanning multiple points

Informer: Efficient Long-Sequence Forecasting

Informer (2020) introduced ProbSparse self-attention:

  • Identifies important queries (those attending to many keys)
  • Only computes attention for those queries
  • Reduces complexity from O(n²) to O(n log n)

Additional innovations:

  • Self-attention distilling: Progressively halves sequence length between layers
  • Generative decoder: Predicts entire output in one pass, avoiding error accumulation

Autoformer: Decomposition Meets Attention

Autoformer (2021) exploits time series structure:

Series decomposition blocks: After each layer, decompose into trend and seasonal components using moving averages.

Auto-correlation attention: Instead of point-wise attention, compute attention based on series periodicity—find similar sub-series and aggregate them.

Results: 10-12% improvement over Informer on periodic data.

Temporal Fusion Transformer (TFT)

TFT (Google) prioritizes interpretability:

  • Variable selection networks: Learn which features matter for each prediction
  • Gating mechanisms: Control information flow
  • Multi-horizon attention: Different attention patterns for different forecast horizons
  • Quantile outputs: Prediction intervals, not just point forecasts

PatchTST: The Patching Breakthrough

PatchTST (2023) introduced patching—grouping time points into chunks:

Why patching works:

  • Local semantics: A patch captures local patterns as a unit
  • Sequence reduction: 512 points → 32 patches
  • Better representations: Encode local patterns into embeddings

iTransformer: Inverted Attention (ICLR 2024 Spotlight)

iTransformer made a simple but powerful change: swap the modeling axes.

Traditional transformers: Each time step is a token, attention captures temporal dependencies.

iTransformer: Each variate (channel) is embedded as a token. Attention models cross-variate correlations, while FFN learns temporal representations per variate.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    iTransformer INVERSION                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRADITIONAL TRANSFORMER (time as tokens):                               │
│  ──────────────────────────────────────────                              │
│  Tokens:    [t₁] [t₂] [t₃] ... [tₙ]                                     │
│              ↑    ↑    ↑        ↑                                        │
│  Each token = one timestamp, all variates                                │
│  Attention learns: temporal dependencies                                 │
│                                                                          │
│  ──────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  iTRANSFORMER (variates as tokens):                                     │
│  ──────────────────────────────────                                      │
│  Tokens:    [var₁] [var₂] [var₃] ... [varₘ]                             │
│               ↑      ↑      ↑          ↑                                 │
│  Each token = entire history of one variate                              │
│  Attention learns: cross-variate correlations                            │
│  FFN learns: temporal patterns per variate                               │
│                                                                          │
│  RESULT: Stronger accuracy, better scaling, no positional encoding       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key insight: This simple axis swap yields stronger accuracy without new modules.
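
A sketch of the inversion in PyTorch: each variate's full history is embedded as one token, and a standard encoder layer then attends across variates (simplified, not the official implementation):

Python
import torch
import torch.nn as nn

class InvertedEmbedding(nn.Module):
    """Embed each variate's entire lookback window as a single token."""
    def __init__(self, lookback, d_model=128):
        super().__init__()
        self.proj = nn.Linear(lookback, d_model)

    def forward(self, x):                  # x: (batch, lookback, n_variates)
        tokens = x.permute(0, 2, 1)        # (batch, n_variates, lookback)
        return self.proj(tokens)           # (batch, n_variates, d_model)

batch, lookback, n_variates = 8, 96, 7
embed = InvertedEmbedding(lookback)
encoder = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
head = nn.Linear(128, 24)                  # Project each variate token to the horizon

tokens = embed(torch.randn(batch, lookback, n_variates))
mixed = encoder(tokens)                    # Attention runs across variate tokens
forecast = head(mixed)                     # (batch, n_variates, horizon)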

TimeXer: Exogenous Variables (NeurIPS 2024)

TimeXer handles both endogenous and exogenous features:

  • Patch-wise self-attention: For endogenous series
  • Variate-wise cross-attention: To incorporate exogenous information
  • Global endogenous tokens: Bridge causal information from exogenous series

State-of-the-art on twelve benchmarks for forecasting with external variables.

Other Notable Transformers

  • FEDformer: Frequency-enhanced decomposed transformer using Fourier/wavelet transforms
  • ETSformer: Exponential smoothing attention with level, growth, and seasonal components
  • Crossformer: Two-stage attention (cross-time, cross-dimension) with segment embedding

MLP-Based Models (Efficient Alternatives)

Recent research shows simple MLPs can match or beat transformers:

TSMixer: MLP mixing in time and channel domains directly—surprisingly competitive.

WPMixer (AAAI 2025): Wavelet Patch Mixer combines:

  • Multi-resolution wavelet decomposition (time + frequency domains)
  • Patching for extended lookback and local information
  • MLP mixing for global information
  • 10x more efficient than TimeMixer, lower variance
  • Outperforms transformer-based models for long-term forecasting

DLinear: Simple linear layers beat many transformers—challenged the field's assumptions.


State Space Models: Linear Complexity Alternatives

Mamba for Time Series

Mamba, based on selective state space models (SSMs), offers transformer-competitive performance with linear complexity. For time series, this enables efficient processing of very long sequences.

MambaTS: Temporal Mamba Blocks

MambaTS addresses Mamba's limitations for long-term forecasting:

  • Variable scan along time: Arranges historical information of all variables together
  • Temporal Mamba Block (TMB): Specialized architecture for time series

Result: State-of-the-art on eight public datasets.

S-Mamba (Simple-Mamba)

S-Mamba takes a simpler approach:

  1. Tokenize time points per variate via linear layer
  2. Bidirectional Mamba layer extracts inter-variate correlations
  3. FFN learns temporal dependencies

Low computational overhead with leading performance on thirteen datasets.

Mamba4Cast: Zero-Shot with Mamba

Mamba4Cast is a zero-shot foundation model using Mamba:

  • Inspired by Prior-data Fitted Networks (PFNs)
  • Trained solely on synthetic data
  • Generates forecasts for entire horizons in one pass
  • Much lower inference time than transformers

SOR-Mamba: Order-Robust

Channels in a time series have no canonical order, so scanning them sequentially introduces an order bias. SOR-Mamba adds a regularization term that minimizes the discrepancy between embeddings computed from reversed channel orders.

TiRex: xLSTM Foundation Model (NeurIPS 2025)

TiRex demonstrates that LSTMs are back. Based on xLSTM (extended LSTM), TiRex is a 35M parameter foundation model that:

  • Sets state-of-the-art on GiftEval and Chronos-ZS benchmarks
  • Outperforms much larger models (Chronos-Bolt 200M, TimesFM 500M, Toto 151M)
  • Provides both point and quantile predictions

Key innovation: State-tracking capability critical for long-horizon forecasting. Unlike transformers or Mamba, TiRex retains explicit state tracking via sLSTM modules.

CPM (Causal Prediction Masking): Training-time masking strategy that enhances state-tracking ability.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              TiRex vs TRANSFORMER vs MAMBA                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Architecture      State Tracking    Complexity    Zero-Shot Rank       │
│  ────────────────────────────────────────────────────────────────────── │
│  Transformer       ✗ (attention)     O(n²)         Good                 │
│  Mamba             ✗ (selective)     O(n)          Good                 │
│  TiRex (xLSTM)     ✓ (explicit)      O(n)          BEST (NeurIPS 2025)  │
│                                                                          │
│  Key insight: State-tracking matters for forecasting                     │
│  TiRex's sLSTM maintains explicit state across sequence                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Diffusion Models for Time Series

Diffusion models, successful in image generation, have been adapted for probabilistic time series forecasting.

TimeGrad: Autoregressive Diffusion

TimeGrad (first diffusion model for time series) uses:

  • Conditional diffusion: Denoising guided by RNN hidden state
  • Autoregressive generation: Process sequence with recurrent cell, maintain hidden state

Output: Highly diverse probabilistic forecasts with uncertainty quantification.

CSDI: Non-Autoregressive Diffusion

CSDI (Conditional Score-based Diffusion for Imputation) uses:

  • Self-supervised strategy: Input masking for training
  • Stacked attention: Temporal and feature-wise attention for conditioning
  • Non-autoregressive: Faster predictions than TimeGrad

Diffusion Model Improvements (2024-2025)

Known limitations of TimeGrad/CSDI:

  • Optimize likelihood only—generate diverse but poorly aligned samples
  • Training instability
  • Boundary disharmony problems

Recent advances:

  • MG-TSD: Multi-scale generation—predict main components then details
  • mr-diff: Separately predict trend and seasonal components
  • CCDM, TimeDiff, S²DBM, SimDiff: Achieve best/second-best on benchmarks
  • LDM4TS: Translate time series to visual encodings, denoise in image-latent space
  • MCD-TSF: Multimodal (text, timestamp) cross-attention with classifier-free guidance
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DIFFUSION MODEL PIPELINE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRAINING (Forward Process):                                             │
│  ───────────────────────────                                             │
│  Clean forecast → Add noise (T steps) → Pure noise                       │
│  x₀ ────────────────────────────────────────────────→ xₜ                │
│                                                                          │
│  INFERENCE (Reverse Process):                                            │
│  ────────────────────────────                                            │
│  Pure noise → Denoise (T steps) → Clean forecast                         │
│  xₜ ────────────────────────────────────────────────→ x₀                │
│       ↑                                                                  │
│       │ Conditioned on historical observations                           │
│       │ (via attention or RNN hidden state)                              │
│                                                                          │
│  KEY BENEFIT: Probabilistic forecasts with uncertainty                   │
│  Sample multiple trajectories → get prediction intervals                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
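
A toy DDPM-style training step illustrating the shared recipe (not TimeGrad or CSDI specifically): noise the future window, then train a conditional network to predict that noise. Here `denoiser` is a placeholder for any suitable network:

Python
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)                  # Noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # Cumulative signal retention

def diffusion_training_step(denoiser, history, future):
    """One training step: noise the future window, predict the noise given the history."""
    t = torch.randint(0, T, (future.shape[0],))        # Random diffusion step per sample
    noise = torch.randn_like(future)
    a = alpha_bar[t].unsqueeze(-1)                      # (batch, 1)
    noisy_future = a.sqrt() * future + (1 - a).sqrt() * noise
    pred_noise = denoiser(noisy_future, history, t)     # Conditioned on the past
    return nn.functional.mse_loss(pred_noise, noise)

# At inference: start from pure noise, iteratively denoise conditioned on history,
# and sample multiple trajectories to obtain prediction intervals.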

Novel Architectures

Kolmogorov-Arnold Networks (KAN)

KAN, proposed by MIT researchers in 2024, is an architecture based on the Kolmogorov-Arnold representation theorem. Instead of fixed activation functions, KAN uses spline-parametrized learnable activation functions.

For time series:

  • T-KAN: Temporal KAN for univariate forecasting
  • MT-KAN: Multivariate Time Series KAN
  • TKAT: Temporal Kolmogorov-Arnold Transformer (combines KAN with attention)

Performance: Some studies report KAN-based models achieving up to 98% lower MSE than transformer baselines on selected benchmarks, alongside strong interpretability.

Key advantages:

  • Interpretability: Learnable activations reveal feature relationships
  • Efficiency: Fewer parameters for equivalent performance
  • Adaptivity: Dynamic learning of activation patterns

Limitations: ~10x slower training than MLPs (diverse activations can't fully leverage batching).

Neural ODEs and Neural CDEs

Neural ODEs model continuous-time dynamics:

  • Learn differential equation governing latent state
  • Natural handling of irregular time steps
  • Continuous trajectory between observations

Neural CDEs (Controlled Differential Equations) extend for time series:

  • Continuous analogue of RNNs
  • Handle irregularly sampled and partially observed data
  • State-of-the-art on irregular time series

Recent extensions:

  • ANCDE: Attentive Neural CDE with attention for dynamic path construction
  • DualDynamics (2025): Combines explicit (Neural ODE) and implicit updates
  • STG-NCDE: Extends to multivariate with graph convolution

Graph Neural Networks for Spatial-Temporal Forecasting

When time series have underlying graph structure (traffic networks, sensor grids), Graph Neural Networks capture spatial dependencies:

Key architectures:

  • STGCN: Graph convolutions + temporal convolutions
  • DCRNN: Diffusion convolutions + GRU
  • Graph WaveNet: Adaptive graph learning + dilated causal convolutions
  • MTGNN: Multivariate Time Series GNN
  • STG-NCDE: Graph + Neural CDE for irregular data

Frequency Domain Methods

FourierGNN: Rethinks forecasting from a pure graph perspective:

  • Hypervariate graph: Each series value (any variate, any timestamp) is a node
  • Fourier Graph Operator: Matrix multiplication in Fourier space
  • Much lower complexity with adequate expressiveness

FreTS: Frequency-domain MLPs:

  • Transform time signal to frequency domain using DFT
  • Apply MLP in frequency domain
  • Capture global patterns more effectively
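
A simplified sketch of the frequency-domain MLP idea: FFT the lookback window, run an MLP over the real and imaginary parts, and inverse-FFT back (FreTS itself adds channel-wise and more structured mixing):

Python
import torch
import torch.nn as nn

class FrequencyMLP(nn.Module):
    def __init__(self, lookback, hidden=128):
        super().__init__()
        self.n_freq = lookback // 2 + 1                # rfft output length
        self.mlp = nn.Sequential(
            nn.Linear(2 * self.n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * self.n_freq),
        )

    def forward(self, x):                              # x: (batch, lookback)
        spec = torch.fft.rfft(x, dim=-1)               # Complex spectrum
        feats = torch.cat([spec.real, spec.imag], dim=-1)
        out = self.mlp(feats)                          # Mix in the frequency domain
        real, imag = out[..., :self.n_freq], out[..., self.n_freq:]
        return torch.fft.irfft(torch.complex(real, imag), n=x.shape[-1], dim=-1)

y = FrequencyMLP(lookback=96)(torch.randn(8, 96))      # (8, 96)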

Recent improvements:

  • FreDN (2025): Learnable frequency decomposition, 10-14% better than FreTS
  • FBM: Fourier Basis Mapping addressing time-dependency issues

Self-Supervised Learning for Time Series

Self-supervised pretraining learns representations without labels, then transfers to downstream tasks.

TS2Vec: Universal Representations

TS2Vec learns representations through hierarchical contrastive learning:

  • Creates augmented context views
  • Contrastive loss at multiple temporal scales
  • Timestamp-level representations (not just sequence-level)

Results: State-of-the-art on 125 UCR and 29 UEA datasets for classification, plus strong forecasting and anomaly detection.

TNC (Temporal Neighborhood Coding)

TNC exploits local temporal smoothness:

  • Positive pairs: Neighboring segments (stationary neighborhoods)
  • Negative pairs: Distant segments
  • Assumes local segments share generative process

TS2Vec-Ensemble (2025)

Standard contrastive learning prioritizes instance discrimination over capturing deterministic patterns (seasonality, trend). TS2Vec-Ensemble addresses this:

  • Fuses learned dynamics from TS2Vec encoder
  • With explicit engineered time features encoding periodic cycles
  • Significantly outperforms TS2Vec and state-of-the-art on ETT benchmarks

Other Approaches

  • TF-C: Time-frequency consistency learning
  • TS-TCC/CA-TCC: Temporal-contextual contrasting
  • Series2Vec (2024): Predicts similarity in temporal AND spectral domains
  • SoftCLT (ICLR 2024): Soft contrastive learning with augmentation strategies

Foundation Models for Time Series

The Foundation Model Paradigm

Train once on massive diverse data, apply anywhere:

  1. Pretrain on billions of time points across domains
  2. Zero-shot: Feed any series, get forecasts without training
  3. Fine-tune (optional) for improved domain accuracy

Amazon Chronos Family

Chronos (March 2024): Treats time series as language

  1. Scale and quantize values to 4,096 discrete tokens
  2. Train T5 encoder-decoder with cross-entropy loss
  3. Probabilistic forecasting via sampling multiple trajectories

Chronos-Bolt (November 2024): Massive efficiency gains

  • 250x faster, 20x less memory, 5% lower error
  • Patch-based: chunks of observations instead of individual points
  • Direct multi-step forecasting (no autoregressive bottleneck)

Chronos-2 (October 2025): Universal forecasting

  • 120M parameters, encoder-only
  • Univariate + Multivariate + Covariates in single architecture
  • Time and Group Attention: Alternates between temporal and cross-series attention

Google TimesFM

Decoder-only approach inspired by GPT:

  • Patching: 32 time points per token
  • Pretrained on 100B+ real-world points

Versions:

  • TimesFM 1.0 (200M): Up to 512 context
  • TimesFM 2.0 (500M): 2048 context, 25% better
  • TimesFM 2.5: Architecture optimizations

2025: Integrated into BigQuery/AlloyDB; training corpus expanded to 400B+ points; few-shot learning capabilities.

Salesforce Moirai

Universal forecaster handling any domain, variable lengths, and exogenous features.

Moirai-MoE (October 2024): First mixture-of-experts time series foundation model

  • 17% better than dense Moirai
  • Comparable accuracy with 65x fewer activated parameters

Moirai 2.0 (November 2025): Decoder-only

  • Quantile forecasting + multi-token prediction
  • 2x faster, 30x smaller than Moirai 1.0-Large

TiRex (NeurIPS 2025)

xLSTM-based foundation model:

  • Only 35M parameters
  • State-of-the-art on GiftEval, Chronos-ZS benchmarks
  • Outperforms Chronos-Bolt (200M), TimesFM (500M), Toto (151M)
  • State-tracking via sLSTM critical for long-horizon forecasting

Datadog Toto

Time Series Optimized Transformer for Observability:

  • Trained on ~1 trillion data points (largest dataset among published models)
  • 750 billion anonymous numerical metrics from Datadog platform
  • Optimized for observability (monitoring, alerting) use cases

Timer-XL (ICLR 2025)

Causal transformer for unified forecasting:

  • Generalizes next-token prediction to multivariate next-patch prediction
  • Handles univariate, multivariate, and covariate-informed forecasting
  • Patch-level generation based on long-context sequences

TimeGPT (Nixtla)

Production-ready foundation model:

  • Trained on 100B+ data points
  • Zero-shot forecasting AND anomaly detection
  • Does NOT use patching (unlike most others)
  • Available via API and SDK

Lag-Llama (ServiceNow)

Probabilistic univariate foundation model:

  • Decoder-only transformer with lagged features (t-1, t-7, t-30, etc.)
  • Outputs Student's t distribution parameters
  • Strong zero-shot generalization

Sundial (ICML 2025 Oral - Top 1%)

Generative foundation model from Tsinghua University—current benchmark leader:

  • Pre-trained on TimeBench (10^12 time points—largest pretraining corpus)
  • TimeFlow Loss: Flow-matching for native continuous-valued training (no tokenization)
  • #1 on GIFT-Eval (MASE) and Time-Series-Library (MSE/MAE)
  • Zero-shot predictions in milliseconds
  • Generates multiple probable predictions without specifying prior distributions
  • Mitigates mode collapse via TimeFlow Loss

Time-MoE (ICLR 2025 Spotlight)

Billion-scale MoE foundation model—the largest open time series model:

  • Scales up to 2.4 billion parameters
  • Time-300B: Largest open time series dataset (300B+ points across 9 domains)
  • Sparse MoE activates only subset of experts per prediction
  • Context length up to 4096, arbitrary prediction horizons
  • Validates scaling laws for time series (more data + bigger model = better)

TabPFN-TS (January 2025)

Tabular foundation model adapted for time series—surprisingly effective:

  • Only 11M parameters but top rank on GIFT-Eval
  • Recasts forecasting as tabular regression problem
  • Outperforms Chronos-Mini (20M) by 7.7%
  • Outperforms Chronos-Large (710M) by 3.0% in zero-shot
  • No time series pretraining needed—uses only tabular/synthetic data
  • Supports both point and probabilistic forecasting

Time-LLM (ICLR 2024)

LLM reprogramming framework—1000+ citations in 2 years:

  • Repurposes frozen LLMs (Llama-7B, GPT-2, BERT) for forecasting
  • Prompt-as-Prefix (PaP): Enriches input with domain knowledge and task instructions
  • Reprogram time series into text prototypes the LLM understands
  • Excels in few-shot and zero-shot scenarios
  • Adopted by industry for solar, wind, and weather forecasting

LLM-Based Forecasting (GPT-4, Claude)

General-purpose LLMs can forecast time series via prompting:

  • LLMTIME: Encodes series as digit strings and converts the LLM's discrete token distributions into continuous forecast distributions
  • Surprising finding: GPT-4 performs worse than GPT-3 for forecasting (RLHF may hurt calibration)
  • Specialized foundation models still outperform general LLMs
  • Best use: Combine LLM reasoning with specialized forecasters

Benchmarks: GIFT-Eval

GIFT-Eval (General Time Series Forecasting Model Evaluation) is the new standard benchmark:

  • 28 datasets, 144,000+ time series, 177 million data points
  • 7 domains, 10 frequencies, short to long-term horizons
  • Non-leaking pretraining data: 230 billion points for fair evaluation
  • Evaluates univariate, multivariate, and zero-shot capabilities

Current Leaderboard (2025):

  1. Sundial (128M) - #1 MASE on GIFT-Eval, #1 MSE/MAE on TSLib (ICML 2025 Oral)
  2. TabPFN-TS (11M) - Top rank on point forecasting despite tiny size
  3. TiRex (35M) - State-of-the-art xLSTM, best zero-shot
  4. TimeCopilot - Ensemble of Chronos-2 + TimesFM-2.5 + TiRex
  5. Chronos-2 - Best full-featured single model

Key Finding: Short-term → foundation models win. Long-term → fine-tuned models (PatchTST, iTransformer) catch up.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│              FOUNDATION MODEL COMPARISON (2025)                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Model         Arch        Params   Multi  Covar  Prob   Best For       │
│  ────────────────────────────────────────────────────────────────────── │
│  Sundial       Flow-Match  128M     ✓      ✗      ✓     GIFT-Eval #1    │
│  TabPFN-TS     PFN         11M      ✗      ✗      ✓     Point forecast  │
│  TiRex         xLSTM       35M      ✗      ✗      ✓     Zero-shot best  │
│  Chronos-2     Enc-only    120M     ✓      ✓      ✓     Full-featured   │
│  Chronos-Bolt  Enc-Dec     varies   ✗      ✗      ✓     Fast inference  │
│  Time-MoE      MoE         2.4B     ✓      ✗      ✓     Scale leader    │
│  TimesFM-2.5   Dec-only    500M     ✗      ✗      ✗     Google Cloud    │
│  Moirai-2.0    Dec-only    varies   ✓      ✓      ✓     Efficiency      │
│  Moirai-MoE    MoE         varies   ✓      ✓      ✓     Sparse compute  │
│  Toto          Transformer 151M     varies varies ✓     Observability   │
│  Timer-XL      Causal      varies   ✓      ✓      ✓     Unified tasks   │
│  Time-LLM      LLM-reprg   7B+      ✓      ✓      ✓     LLM leverage    │
│  TimeGPT       varies      varies   ✓      ✓      ✓     Production API  │
│  Lag-Llama     Dec-only    varies   ✗      ✗      ✓     Probabilistic   │
│                                                                          │
│  GIFT-EVAL LEADERBOARD (2025):                                          │
│  1. Sundial (128M) - Flow-matching, #1 MASE (ICML 2025 Oral)           │
│  2. TabPFN-TS (11M) - Tabular model beats specialized models            │
│  3. TiRex (35M) - xLSTM with state-tracking                             │
│  4. TimeCopilot - Ensemble (Chronos-2 + TimesFM-2.5 + TiRex)           │
│  5. Chronos-2 - Best single full-featured model                         │
│                                                                          │
│  KEY TRENDS 2025:                                                        │
│  • Smaller models with better architectures beating giants               │
│  • MoE for efficient scaling (Time-MoE, Moirai-MoE)                     │
│  • Tabular models surprisingly competitive (TabPFN-TS)                  │
│  • Ensembles of top models achieve best results                         │
│  • Scaling laws validated for time series                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

How Tokenization Works

The key challenge: language is discrete (words), time series are continuous (numbers).

Quantization (Chronos)

  1. Scale series to mean 0, std 1
  2. Bin values into 4,096 buckets
  3. Each bucket = one vocabulary token
  4. Train to predict tokens
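
A rough sketch of the idea; Chronos's real tokenizer also handles padding and special tokens, and the bin range below is an illustrative assumption:

Python
import numpy as np

def quantize(series, n_bins=4096, low=-15.0, high=15.0):
    """Mean-scale, then map each value to one of n_bins discrete token ids."""
    scale = np.abs(series).mean() + 1e-8
    scaled = series / scale                        # Mean scaling
    edges = np.linspace(low, high, n_bins - 1)     # Uniform bin edges (assumption)
    tokens = np.digitize(scaled, edges)            # Token id per time step
    return tokens, scale

def dequantize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    centers = np.linspace(low, high, n_bins)       # Map token id back to a bin center
    return centers[tokens] * scale

tokens, scale = quantize(np.sin(np.arange(200) / 10) * 50)
reconstructed = dequantize(tokens, scale)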

Patching (TimesFM, PatchTST)

  1. Chunk into non-overlapping patches (e.g., 32 points)
  2. Embed each patch via learned projection
  3. Feed patch embeddings to transformer

Benefits: Dramatic sequence reduction, local pattern capture.
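
A minimal patch-embedding sketch (non-overlapping patches; PatchTST also supports overlapping patches via a stride):

Python
import torch
import torch.nn as nn

patch_len, d_model = 32, 128
series = torch.randn(8, 512)                                 # (batch, time)

patches = series.reshape(8, 512 // patch_len, patch_len)     # (batch, 16, 32)
embed = nn.Linear(patch_len, d_model)
tokens = embed(patches)                                      # (batch, 16, d_model)
# 512 time steps -> 16 tokens: a 32x shorter sequence for the transformer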

Wavelet-Based (WaveToken)

  1. Wavelet decomposition: Separate coarse (trend) and fine (detail)
  2. Quantize wavelet coefficients
  3. Predict future coefficients, reconstruct

Multi-Resolution

Use patches of varying sizes simultaneously (8, 32, 128 steps) to capture patterns at different temporal scales.


Training Deep Learning Models for Time Series

Training time series forecasting models requires careful attention to loss functions, data preparation, validation strategies, and hyperparameter tuning. This section covers the complete training pipeline.

Loss Functions

The choice of loss function fundamentally shapes what your model learns. Different losses suit different forecasting objectives.

Point Forecast Losses

Mean Squared Error (MSE): \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

  • Penalizes large errors heavily (squared term)
  • Sensitive to outliers
  • Use when large errors are particularly costly

Mean Absolute Error (MAE): \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

  • More robust to outliers than MSE
  • All errors penalized equally
  • Better when outliers are noise, not signal

Mean Absolute Percentage Error (MAPE): \text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i|}

  • Scale-independent (useful for comparing across series)
  • Problem: Undefined when y=0, asymmetric

Symmetric MAPE (sMAPE): \text{sMAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}

  • Addresses MAPE's asymmetry
  • Bounded between 0% and 200%

Probabilistic Forecast Losses

Quantile Loss (Pinball Loss): L_q(y, \hat{y}) = q \cdot (y - \hat{y})^+ + (1-q) \cdot (\hat{y} - y)^+

  • Trains model to predict specific quantiles (e.g., 10th, 50th, 90th percentile)
  • Asymmetric: Higher quantiles penalize underestimates more
  • Sum quantile losses across multiple quantiles for full distribution
Python
import torch

def quantile_loss(y_true, y_pred, quantile):
    error = y_true - y_pred
    # Pinball loss: over- and under-prediction penalized asymmetrically by quantile
    return torch.max(quantile * error, (quantile - 1) * error).mean()

# Train for multiple quantiles
quantiles = [0.1, 0.5, 0.9]
total_loss = sum(quantile_loss(y, pred[q], q) for q in quantiles)

Continuous Ranked Probability Score (CRPS):

Code
CRPS = ∫ (F(x) - 1{x ≥ y})² dx
  • Evaluates the entire predicted distribution against the observation
  • Generalizes MAE: when prediction is a point (delta function), CRPS = MAE
  • Key advantage: Avoids quantile crossing (where predicted quantiles intersect)
  • Used by Chronos, DeepAR, and other probabilistic models
Python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y_true, mu, sigma):
    """CRPS for a Gaussian predictive distribution (closed form)"""
    z = (y_true - mu) / sigma
    crps = sigma * (z * (2 * norm.cdf(z) - 1) +
                    2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return crps.mean()

Negative Log-Likelihood (NLL):

  • For parametric distributions (Gaussian, Student's t, Negative Binomial)
  • Model predicts distribution parameters (μ, σ for Gaussian)
  • Loss is negative log probability of observation under predicted distribution
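
For the Gaussian case, PyTorch provides this loss directly; the model head predicts a mean and a variance per step:

Python
import torch
import torch.nn as nn

nll = nn.GaussianNLLLoss()

mu = torch.randn(32, 24)                                          # Predicted mean, (batch, horizon)
var = torch.nn.functional.softplus(torch.randn(32, 24)) + 1e-6    # Predicted variance (must be positive)
y = torch.randn(32, 24)                                           # Observations

loss = nll(mu, y, var)    # Negative log-likelihood of y under N(mu, var)
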
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LOSS FUNCTION SELECTION GUIDE                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OBJECTIVE                      RECOMMENDED LOSS                         │
│  ────────────────────────────────────────────────────────────────────── │
│  Point forecast, outliers OK    MSE                                      │
│  Point forecast, robust         MAE                                      │
│  Scale-independent comparison   sMAPE, MASE                              │
│  Prediction intervals           Quantile Loss (multiple quantiles)       │
│  Full distribution              CRPS or NLL                              │
│  Business asymmetric cost       Custom asymmetric loss                   │
│                                                                          │
│  FOUNDATION MODEL TRAINING:                                              │
│  Chronos: Cross-entropy on quantized tokens                              │
│  TimesFM: MSE on patches                                                 │
│  Moirai: Mixture distribution NLL                                        │
│  TiRex: Quantile loss                                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Preparation and Preprocessing

Normalization Strategies

Per-series normalization (most common):

Python
def normalize_series(series):
    mean = series.mean()
    std = series.std() + 1e-8  # Avoid division by zero
    normalized = (series - mean) / std
    return normalized, mean, std

def denormalize(normalized, mean, std):
    return normalized * std + mean

Global normalization (for related series):

  • Compute statistics across all series in dataset
  • Use when series share similar scales
  • Common in foundation model pretraining

Log transformation (for multiplicative patterns):

Python
import numpy as np

# For data with multiplicative seasonality/trends
log_series = np.log1p(series)  # log(1+x) handles zeros
# Train on log_series, then np.expm1() to reverse

Handling Missing Values

  • Forward fill: Use last known value
  • Interpolation: Linear or spline interpolation
  • Masking: Mark missing values and let the model handle them (Chronos-2 supports this)
  • Imputation models: Use CSDI or another imputation method first
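
Common options in pandas (the right choice is problem-dependent):

Python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, 5.0])

filled_ffill = s.ffill()                        # Forward fill: carry last known value
filled_interp = s.interpolate(method="linear")  # Linear interpolation
mask = s.isna()                                 # Keep a mask if the model supports masking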

Handling Irregular Timestamps

For irregular time series:

  • Resample to regular frequency (information loss)
  • Neural CDEs: Designed for irregular data
  • Time-aware models: Encode time gaps as features

Data Augmentation for Time Series

Data augmentation is critical for training generalizable models, especially foundation models.

TSMix (Time Series Mixup)

TSMix creates synthetic series by combining real series:

Python
import numpy as np

def tsmix(series_a, series_b, alpha=0.5):
    """Mix two time series via convex combination (mixup for time series)"""
    lam = np.random.beta(alpha, alpha)   # Random mixing coefficient
    mixed = lam * series_a + (1 - lam) * series_b
    return mixed

Used in Chronos training: 10M TSMix augmentations from 28 datasets.

Key insight: TSMix improves zero-shot performance but not in-domain performance.

KernelSynth (Synthetic from Gaussian Processes)

KernelSynth generates unlimited synthetic series using Gaussian Processes:

Python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

def kernel_synth(length=512, n_samples=1000):
    """Generate synthetic time series by sampling from randomly composed GP kernels"""
    synthetic_series = []

    for _ in range(n_samples):
        # Randomly compose kernels: smooth component + periodic component + noise
        kernel = (
            RBF(length_scale=np.random.uniform(10, 100)) +
            ExpSineSquared(length_scale=np.random.uniform(5, 50),
                           periodicity=np.random.uniform(10, 100)) +
            WhiteKernel(noise_level=np.random.uniform(0.01, 0.1))
        )

        # Sample one series from the GP prior
        X = np.arange(length).reshape(-1, 1)
        K = kernel(X)
        series = np.random.multivariate_normal(np.zeros(length), K)
        synthetic_series.append(series)

    return synthetic_series

Optimal ratio: ~10% synthetic data. More synthetic data tends to worsen performance.

Other Augmentation Techniques

Jittering: Add small random noise

Python
augmented = series + np.random.normal(0, 0.01, len(series))

Scaling: Random amplitude scaling

Python
augmented = series * np.random.uniform(0.8, 1.2)

Time warping: Stretch/compress the time axis locally
Window slicing: Random crops from longer series
Magnitude warping: Smooth amplitude variations

Validation Strategies for Time Series

Critical: Standard k-fold cross-validation violates temporal ordering and causes data leakage. Use time-respecting validation.

Train-Test Split

Simple temporal split—train on past, test on future:

Python
def temporal_split(series, train_ratio=0.8):
    split_idx = int(len(series) * train_ratio)
    train = series[:split_idx]
    test = series[split_idx:]
    return train, test

Problem: Single split may not be representative.

Walk-Forward Validation (Rolling Origin)

The gold standard for time series. Train on expanding/rolling window, test on next period:

Python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_validation(series, model, n_splits=5):
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = []

    for train_idx, test_idx in tscv.split(series):
        train, test = series[train_idx], series[test_idx]

        # Train on all data up to the split point
        model.fit(train)

        # Predict the held-out period and evaluate
        predictions = model.predict(len(test))
        score = evaluate(test, predictions)  # evaluate = your metric, e.g. MAE
        scores.append(score)

    return np.mean(scores), np.std(scores)
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    WALK-FORWARD VALIDATION                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Time →                                                                  │
│  ────────────────────────────────────────────────────────────────────── │
│                                                                          │
│  Fold 1: [TRAIN════════] [TEST]                                         │
│  Fold 2: [TRAIN════════════════] [TEST]                                 │
│  Fold 3: [TRAIN════════════════════════] [TEST]                         │
│  Fold 4: [TRAIN════════════════════════════════] [TEST]                 │
│  Fold 5: [TRAIN════════════════════════════════════════] [TEST]         │
│                                                                          │
│  Each fold: Train on all prior data, test on next period                │
│  Never use future data to predict past                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Adding a Gap (Embargo Period)

Prevent information leakage from recent observations:

Python
def walk_forward_with_gap(series, gap=7):
    """Gap between train end and test start"""
    # If predicting weekly, gap=7 prevents day-of-week leakage
    # (scikit-learn's TimeSeriesSplit also accepts a `gap` argument)
    ...

Expanding vs Rolling Window

Expanding window: Train set grows with each fold (more data)
Rolling window: Fixed-size train set slides forward (handles non-stationarity)

Python
# Rolling window
tscv = TimeSeriesSplit(n_splits=5, max_train_size=365*3)  # Max 3 years

Hyperparameter Tuning

Key hyperparameters for time series deep learning models:

Lookback Window (Context Length)

How much history the model sees:

  • Too short: Misses seasonal patterns
  • Too long: Noise, computational cost, vanishing gradients

Rule of thumb: At least 2-3x the longest seasonality period.

Python
# If data has yearly seasonality (365 days)
lookback = 365 * 2  # ~2 years of history

# For foundation models
# Chronos: Up to 4096 tokens
# TimesFM: Up to 2048 points
# TiRex: Variable, designed for long context

Forecast Horizon

How far ahead to predict:

  • Longer horizons → harder, more uncertainty
  • Match to business needs

Learning Rate

Most important hyperparameter:

  • Too high: Divergence, oscillation
  • Too low: Slow convergence, stuck in local minima

Recommendations:

  • Start with 1e-3 to 1e-4 for Adam
  • Use learning rate schedulers (cosine, reduce-on-plateau)
  • Linear scaling rule: if you scale the batch size by k, scale the learning rate by k
Python
# Learning rate scheduling
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-6
)

# Or reduce on plateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10
)

Batch Size

  • Smaller (16-64): More noise, better generalization, slower
  • Larger (128-512): Stable gradients, faster, may overfit

For time series: Research suggests optimal batch size may correlate with data periodicity (use FFT to find dominant frequencies).

Python
# Typical starting points
batch_size = 32  # Good default
# Increase if training is stable, decrease if loss is noisy

Model-Specific Hyperparameters

LSTM/GRU:

  • Hidden size: 64-512
  • Number of layers: 1-3
  • Dropout: 0.1-0.3

Transformer:

  • d_model: 64-512
  • n_heads: 4-8
  • n_layers: 2-6
  • Patch size: 8-64

N-BEATS:

  • Stack types: [trend, seasonality] or [generic]
  • Blocks per stack: 3-5
  • Layer width: 256-512

Fine-Tuning Foundation Models

Foundation models work zero-shot, but fine-tuning can significantly improve domain-specific performance.

When to Fine-Tune

Fine-tune when:

  • Your domain has unique patterns not in pretraining data
  • You have sufficient domain data (1000+ series or 100K+ points)
  • Zero-shot performance is good but not good enough
  • Computational resources available

Don't fine-tune when:

  • Very limited data (risk of overfitting)
  • Zero-shot already meets requirements
  • Need rapid deployment

Fine-Tuning Strategies

Full Fine-Tuning

Update all model weights:

Python
# Load pretrained model
model = ChronosModel.from_pretrained("amazon/chronos-t5-base")

# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True

# Train with lower learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # 10x lower than pretraining

Pros: Maximum adaptation
Cons: Needs more data, risk of catastrophic forgetting

Adapter-Based Fine-Tuning (PEFT)

Add small trainable modules, freeze pretrained weights:

Python
from peft import get_peft_model, LoraConfig

# Add LoRA adapters
lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)
model = get_peft_model(base_model, lora_config)

# Only adapter parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# Typically <1% of total parameters

Pros: Efficient, less overfitting, can store multiple adapters
Cons: Less expressive than full fine-tuning

ChronosX uses adapters to add covariate support to Chronos, achieving ~22% improvement.

Last-Layer Fine-Tuning

Only train the output projection:

Python
# Freeze all but last layer
for param in model.parameters():
    param.requires_grad = False
for param in model.output_projection.parameters():
    param.requires_grad = True

Pros: Fast, minimal overfitting risk
Cons: Limited adaptation

Fine-Tuning Code Examples

Fine-Tuning Chronos

Python
import torch
from chronos import ChronosConfig, ChronosModel
from torch.utils.data import DataLoader

# Load pretrained
config = ChronosConfig.from_pretrained("amazon/chronos-t5-small")
model = ChronosModel.from_pretrained("amazon/chronos-t5-small")

# Prepare your data
train_dataset = YourTimeSeriesDataset(your_data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Fine-tuning setup
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

model.train()
for epoch in range(10):
    for batch in train_loader:
        context, target = batch

        # Tokenize (Chronos-specific)
        tokens = model.tokenizer.encode(context)
        target_tokens = model.tokenizer.encode(target)

        # Forward pass
        loss = model(tokens, labels=target_tokens).loss

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    scheduler.step()

# Save fine-tuned model
model.save_pretrained("./chronos-finetuned")

Fine-Tuning with NeuralForecast

Python
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, NHITS, TFT

# Define models with hyperparameters
models = [
    NBEATS(
        input_size=7*24,  # 1 week of hourly data
        h=24,  # Predict 24 hours ahead
        max_steps=1000,
        learning_rate=1e-3,
        batch_size=32,
    ),
    NHITS(
        input_size=7*24,
        h=24,
        max_steps=1000,
        n_freq_downsample=[24, 12, 1],  # Multi-resolution
    ),
    TFT(
        input_size=7*24,
        h=24,
        hidden_size=128,
        max_steps=1000,
    )
]

# Train
nf = NeuralForecast(models=models, freq='H')
nf.fit(df=train_df)

# Predict
forecasts = nf.predict()

Training Infrastructure

Memory Optimization

Python
# Gradient checkpointing (trade compute for memory)
model.gradient_checkpointing_enable()

# Mixed precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    loss = model(inputs).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Distributed Training

Python
# PyTorch DDP
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend='nccl')
model = DistributedDataParallel(model)

# Or use Hugging Face Accelerate
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

Evaluation Metrics

Point Forecast Metrics

Python
import numpy as np

def mase(y_true, y_pred, y_train, seasonality=1):
    """Mean Absolute Scaled Error - compares to naive baseline"""
    naive_mae = np.mean(np.abs(y_train[seasonality:] - y_train[:-seasonality]))
    forecast_mae = np.mean(np.abs(y_true - y_pred))
    return forecast_mae / naive_mae

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error"""
    return 100 * np.mean(2 * np.abs(y_true - y_pred) /
                         (np.abs(y_true) + np.abs(y_pred) + 1e-8))

Probabilistic Metrics

Python
import numpy as np

def coverage(y_true, lower, upper):
    """What fraction of observations fall within prediction interval"""
    return np.mean((y_true >= lower) & (y_true <= upper))

def interval_width(lower, upper):
    """Average width of prediction intervals"""
    return np.mean(upper - lower)

# Good model: High coverage + narrow intervals
Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    COMPLETE TRAINING PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. DATA PREPARATION                                                     │
│     ├─ Handle missing values                                             │
│     ├─ Normalize (per-series or global)                                  │
│     ├─ Create train/val/test splits (temporal)                          │
│     └─ Apply augmentation (TSMix, KernelSynth)                          │
│                                                                          │
│  2. MODEL SELECTION                                                      │
│     ├─ Baseline: ARIMA/ETS for comparison                               │
│     ├─ Choose architecture based on data/requirements                    │
│     └─ Foundation model for zero-shot baseline                          │
│                                                                          │
│  3. TRAINING                                                             │
│     ├─ Set loss function (MSE/MAE/Quantile/CRPS)                        │
│     ├─ Configure optimizer (AdamW, lr=1e-3 to 1e-4)                     │
│     ├─ Add learning rate scheduler                                       │
│     ├─ Implement early stopping on validation loss                       │
│     └─ Use gradient clipping (max_norm=1.0)                             │
│                                                                          │
│  4. VALIDATION                                                           │
│     ├─ Walk-forward validation (5+ folds)                               │
│     ├─ Track point metrics (MAE, MASE, sMAPE)                           │
│     └─ Track probabilistic metrics (CRPS, coverage)                     │
│                                                                          │
│  5. FINE-TUNING (if using foundation model)                             │
│     ├─ Start with zero-shot baseline                                    │
│     ├─ Try adapter-based fine-tuning first                              │
│     ├─ Use lower learning rate (10x lower)                              │
│     └─ Monitor for overfitting                                          │
│                                                                          │
│  6. PRODUCTION                                                           │
│     ├─ Save model + normalization params                                │
│     ├─ Set up monitoring (accuracy drift)                               │
│     └─ Plan retraining schedule                                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Model Selection Guide

When Classical Methods Win

Use ARIMA/ETS when:

  • Series < 1000 points with clear autocorrelation
  • Interpretability > marginal accuracy gains
  • No deep learning infrastructure

Use Prophet when:

  • Business forecasting with seasonalities/holidays
  • Non-technical stakeholders need understanding
  • Speed of development > accuracy

When Deep Learning Excels

Use N-BEATS/N-HiTS when:

  • Interpretability needed but classical methods insufficient
  • Long-horizon forecasting
  • No time-series-specific feature engineering desired

Use LSTM/Transformer when:

  • Complex patterns, substantial data
  • Multivariate with interactions
  • Long sequences (use Informer/Autoformer for efficiency)

When Foundation Models Shine

Use foundation models when:

  • Zero-shot acceptable: No training time
  • Limited data: Cold-start problems
  • Diverse domains: One model across everything
  • Rapid prototyping: Test feasibility first

Model recommendations:

  • Overall best: TiRex (35M, state-of-the-art)
  • Full-featured: Chronos-2 (covariates, multivariate)
  • Fast inference: Chronos-Bolt (250x faster)
  • Efficiency: Moirai-2.0 (2x faster, 30x smaller)
  • Google Cloud: TimesFM (BigQuery integration)
  • Probabilistic: Lag-Llama (uncertainty quantification)

Production Considerations

Data Pipeline

  • Missing value handling strategy
  • Normalization parameters versioned
  • Frequency validation
  • Outlier detection

Inference

  • Context window limits
  • GPU batching optimization
  • Fallback models
  • Caching for repeated series

Output

  • Prediction intervals (not just point forecasts)
  • Denormalization
  • Forecast metadata (model version, confidence)

Monitoring

  • Accuracy metrics over time
  • Data/concept drift detection
  • Degradation alerts
  • Regular backtesting

Practical Examples

Using Chronos

Python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-base",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

context = torch.tensor([[1.2, 1.5, 1.3, 1.8, 2.1, 1.9, 2.3, 2.0]])

forecast = pipeline.predict(
    context,
    prediction_length=12,
    num_samples=20,
)

median = forecast.median(dim=1).values
lower = forecast.quantile(0.1, dim=1)
upper = forecast.quantile(0.9, dim=1)

Using TiRex

Python
from tirex import TiRexPipeline

pipeline = TiRexPipeline.from_pretrained("NX-AI/TiRex")

# TiRex provides both point and quantile predictions
forecast = pipeline.predict(
    context,
    prediction_length=24,
)

Using TimesFM

Python
import timesfm

tfm = timesfm.TimesFm(
    context_len=512,
    horizon_len=128,
    input_patch_len=32,
    output_patch_len=128,
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

forecast = tfm.forecast(context)


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
