Time Series Forecasting with Foundation Models: From ARIMA to Chronos
A comprehensive guide to modern time series forecasting—from classical statistical methods to transformer-based architectures, state space models, diffusion models, and zero-shot foundation models like Chronos, TimesFM, Moirai, and TiRex.
The Revolution in Time Series Forecasting
Time series forecasting has undergone a transformation as dramatic as the one that reshaped natural language processing. For decades, statistical methods like ARIMA dominated. Then deep learning approaches—LSTMs, GRUs, and neural basis expansion—showed promise for capturing complex temporal patterns. Transformers brought attention mechanisms, state space models offered linear complexity alternatives, and diffusion models introduced probabilistic generation. Now, we're witnessing foundation models for time series: large pretrained models that forecast any time series with zero-shot learning.
This shift matters because time series forecasting is everywhere: demand planning, financial markets, energy load prediction, traffic optimization, healthcare monitoring, climate modeling. The new generation of models delivers better accuracy, often without any training on your data.
This guide covers the complete landscape: classical methods, deep learning architectures (N-BEATS, LSTMs), transformers (Informer, Autoformer, iTransformer, TimeXer), state space models (Mamba, xLSTM/TiRex), diffusion models (TimeGrad, CSDI), novel architectures (KAN, Neural ODEs, Graph NNs), self-supervised learning (TS2Vec), and foundation models (Chronos, TimesFM, Moirai, TiRex, Toto).
┌─────────────────────────────────────────────────────────────────────────┐
│ EVOLUTION OF TIME SERIES FORECASTING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1970s-2010s: STATISTICAL METHODS │
│ ARIMA, Exponential Smoothing, Prophet │
│ ✓ Interpretable ✓ Fast ✗ Limited patterns │
│ │
│ 2015-2019: EARLY DEEP LEARNING │
│ LSTM, GRU, TCN, DeepAR │
│ ✓ Complex patterns ✓ Multivariate ✗ Needs lots of data │
│ │
│ 2019-2021: NEURAL BASIS EXPANSION │
│ N-BEATS, N-HiTS, NBEATSx │
│ ✓ Interpretable ✓ No hand-crafted features ✓ Fast │
│ │
│ 2020-2023: TRANSFORMERS │
│ Informer, Autoformer, TFT, PatchTST, iTransformer, TimeXer │
│ ✓ Long-range dependencies ✓ Attention ✗ Quadratic complexity │
│ │
│ 2023-2024: STATE SPACE & DIFFUSION │
│ Mamba, MambaTS, S-Mamba, TimeGrad, CSDI │
│ ✓ Linear complexity ✓ Probabilistic ✓ Efficient │
│ │
│ 2024-NOW: FOUNDATION MODELS │
│ Chronos, TimesFM, Moirai, TiRex, Toto, Timer-XL │
│ ✓ Zero-shot ✓ Universal ✓ Pretrained on billions of points │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Classical Statistical Methods
Before diving into neural approaches, understanding classical methods is essential. They remain the right choice for many problems.
ARIMA: The Statistical Workhorse
ARIMA (AutoRegressive Integrated Moving Average) models time series as a combination of three components:
- AR (AutoRegressive): Current value depends on previous values
- I (Integrated): Differencing to make the series stationary
- MA (Moving Average): Current value depends on previous forecast errors
Why ARIMA still matters: Excels at short-term forecasting with stationary data. Fast, interpretable, requires no training data beyond the series itself. For univariate forecasting with clear autocorrelation, ARIMA often matches neural networks—especially on small datasets.
Limitations: Non-linear relationships, multiple seasonalities, external variables, long-term dependencies.
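A minimal fit-and-forecast sketch with statsmodels, assuming the order=(p, d, q) values below are placeholders you would normally choose via ACF/PACF inspection or auto-ARIMA:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Toy monthly series; replace with your own univariate data
series = np.sin(np.arange(120) * 2 * np.pi / 12) + np.random.normal(0, 0.1, 120)

# order=(p, d, q): AR lags, differencing order, MA lags
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()

forecast = fitted.forecast(steps=12)   # point forecasts, 12 steps ahead
print(fitted.summary())                # coefficients, AIC/BIC, diagnostics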
Exponential Smoothing (ETS)
ETS decomposes time series into Error, Trend, and Seasonality. The family includes:
- Simple Exponential Smoothing: No trend or seasonality
- Holt's Linear: Adds trend
- Holt-Winters: Adds seasonality (additive or multiplicative)
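A minimal Holt-Winters sketch with statsmodels, assuming `series` is a univariate array with monthly seasonality and that additive components are appropriate:

from statsmodels.tsa.holtwinters import ExponentialSmoothing

model = ExponentialSmoothing(
    series,
    trend="add",            # Holt's linear trend
    seasonal="add",         # additive seasonality ("mul" for multiplicative)
    seasonal_periods=12,
)
fitted = model.fit()
forecast = fitted.forecast(12)   # 12 steps ahead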
Prophet: Business Forecasting
Prophet (Meta) models time series as: y(t) = trend(t) + seasonality(t) + holidays(t) + error(t)
Strengths: Handles missing data and outliers, automatic changepoint detection, easy domain knowledge incorporation.
Limitations: Research shows Prophet underperforms ARIMA and neural networks on accuracy. Value lies in ease of use, not peak performance.
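A minimal Prophet sketch; the library expects a DataFrame with `ds` (timestamp) and `y` (value) columns, and the toy data below is only for illustration:

import pandas as pd
from prophet import Prophet

# df must have columns: ds (datetime) and y (value)
df = pd.DataFrame({"ds": pd.date_range("2023-01-01", periods=365), "y": range(365)})

m = Prophet(yearly_seasonality=True, weekly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=30)   # extend 30 days past history
forecast = m.predict(future)                   # yhat, yhat_lower, yhat_upper, trend, ...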
Deep Learning Architectures
Recurrent Networks: LSTM and GRU
Long Short-Term Memory (LSTM) networks solve vanilla RNNs' vanishing gradient problem with gates:
- Forget gate: What to discard from memory
- Input gate: What new information to store
- Output gate: What to output
Gated Recurrent Units (GRUs) simplify to two gates (update, reset) with comparable performance.
Why LSTMs work: Learn long-range dependencies, handle variable-length sequences, incorporate multiple features.
Limitations: Sequential training (no parallelization), hyperparameter tuning difficulty, substantial data requirements.
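A minimal PyTorch sketch of an LSTM forecaster: encode the lookback window, then project the last hidden state to the full horizon (a common, but not the only, head design):

import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features=1, hidden_size=128, num_layers=2, horizon=24):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers,
                            batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden_size, horizon)   # direct multi-step head

    def forward(self, x):                  # x: [batch, lookback, n_features]
        out, _ = self.lstm(x)              # out: [batch, lookback, hidden]
        return self.head(out[:, -1, :])    # forecast: [batch, horizon]

model = LSTMForecaster()
x = torch.randn(32, 168, 1)                # 32 series, 168-step lookback
y_hat = model(x)                           # [32, 24]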
Temporal Convolutional Networks (TCN)
TCNs apply 1D dilated convolutions across time:
Dilation pattern: 1, 2, 4, 8, 16...
Receptive field grows exponentially while parameters grow linearly
TCN advantages: Parallelizable, flexible receptive field, faster training, stable gradients.
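A sketch of the dilated causal convolution stack: left-only padding keeps the model causal, and the 1, 2, 4, 8, 16 dilation pattern grows the receptive field exponentially:

import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                 # x: [batch, channels, time]
        x = nn.functional.pad(x, (self.pad, 0))           # left padding => causal
        return torch.relu(self.conv(x))

tcn = nn.Sequential(*[CausalConvBlock(16, dilation=d) for d in (1, 2, 4, 8, 16)])
x = torch.randn(8, 16, 512)
out = tcn(x)                                              # same length, causal output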
DeepAR: Probabilistic Forecasting at Scale
Amazon's DeepAR trains a single LSTM across thousands of related series:
- Probabilistic outputs: Predicts distributions, not point estimates
- Global model: Learns cross-series patterns
- Autoregressive generation: Uses sampled predictions for subsequent steps
N-BEATS and N-HiTS: Neural Basis Expansion
N-BEATS Architecture
N-BEATS (Neural Basis Expansion Analysis for Time Series), developed by ElementAI/ServiceNow, revolutionized interpretable deep forecasting. The key insight: use neural basis expansion to decompose forecasts.
How it works:
- Blocks process the input through fully-connected layers
- Each block outputs expansion coefficients (θ) for both backward (backcast) and forward (forecast)
- Coefficients are projected through basis functions to produce predictions
- Doubly residual stacking: Blocks are organized into stacks, with residual connections both within and across stacks
┌─────────────────────────────────────────────────────────────────────────┐
│ N-BEATS ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Input: [x₁, x₂, ..., xₜ] (lookback window) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ STACK 1: Trend │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Block 1 │ → │ Block 2 │ → │ Block 3 │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ [θ_trend] → Polynomial Basis → Trend Forecast │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ (residual) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ STACK 2: Seasonality │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Block 1 │ → │ Block 2 │ → │ Block 3 │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ [θ_season] → Fourier Basis → Seasonality Forecast │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Final Forecast = Σ(Trend + Seasonality forecasts from all blocks) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Two configurations:
- Generic: No time-series-specific components—proves deep learning primitives alone solve forecasting
- Interpretable: Uses trend (polynomial) and seasonality (Fourier) basis functions
Performance: Improved forecast accuracy by 11% over statistical benchmarks, 3% over M4 competition winner.
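A stripped-down sketch of a generic N-BEATS block and the doubly residual loop; the basis-function projections are reduced to plain linear layers here for brevity:

import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    def __init__(self, lookback, horizon, width=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(lookback, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.backcast_head = nn.Linear(width, lookback)   # reconstructs the input
        self.forecast_head = nn.Linear(width, horizon)    # contributes to the forecast

    def forward(self, x):
        h = self.mlp(x)
        return self.backcast_head(h), self.forecast_head(h)

def nbeats_forward(blocks, x):
    """Doubly residual stacking: subtract each backcast, sum each forecast."""
    forecast, residual = 0, x
    for block in blocks:
        backcast, block_forecast = block(residual)
        residual = residual - backcast
        forecast = forecast + block_forecast
    return forecast

blocks = nn.ModuleList([NBeatsBlock(lookback=96, horizon=24) for _ in range(6)])
y_hat = nbeats_forward(blocks, torch.randn(32, 96))        # [32, 24]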
N-HiTS: Hierarchical Interpolation
N-HiTS extends N-BEATS for long-horizon forecasting with:
- Hierarchical interpolation: Different blocks specialize in different frequency bands
- Multi-rate data sampling: Blocks see the series at different temporal resolutions
Result: Outperforms Transformer-based models by 25%+ on benchmarks.
NBEATSx: Adding Exogenous Variables
NBEATSx extends N-BEATS to incorporate external features, improving accuracy by ~20% over vanilla N-BEATS.
Transformer Architectures for Time Series
Why Vanilla Transformers Struggle
Self-attention has O(n²) complexity—prohibitive for long sequences. Additional issues:
- Permutation invariance: Transformers don't inherently understand order
- Point-wise attention: May miss local patterns spanning multiple points
Informer: Efficient Long-Sequence Forecasting
Informer (2020) introduced ProbSparse self-attention:
- Identifies important queries (those attending to many keys)
- Only computes attention for those queries
- Reduces complexity from O(n²) to O(n log n)
Additional innovations:
- Self-attention distilling: Progressively halves sequence length between layers
- Generative decoder: Predicts entire output in one pass, avoiding error accumulation
Autoformer: Decomposition Meets Attention
Autoformer (2021) exploits time series structure:
Series decomposition blocks: After each layer, decompose into trend and seasonal components using moving averages.
Auto-correlation attention: Instead of point-wise attention, compute attention based on series periodicity—find similar sub-series and aggregate them.
Results: 10-12% improvement over Informer on periodic data.
Temporal Fusion Transformer (TFT)
TFT (Google) prioritizes interpretability:
- Variable selection networks: Learn which features matter for each prediction
- Gating mechanisms: Control information flow
- Multi-horizon attention: Different attention patterns for different forecast horizons
- Quantile outputs: Prediction intervals, not just point forecasts
PatchTST: The Patching Breakthrough
PatchTST (2023) introduced patching—grouping time points into chunks:
Why patching works:
- Local semantics: A patch captures local patterns as a unit
- Sequence reduction: 512 points → 32 patches
- Better representations: Encode local patterns into embeddings
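The patching step itself is just a reshape plus a learned projection; a minimal sketch with non-overlapping patches:

import torch
import torch.nn as nn

series = torch.randn(32, 512)                  # [batch, time]
patch_len, stride = 16, 16                     # non-overlapping patches

patches = series.unfold(dimension=1, size=patch_len, step=stride)  # [32, 32, 16]
embed = nn.Linear(patch_len, 128)              # each patch becomes one token
tokens = embed(patches)                        # [32, 32 tokens, 128] -> transformer input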
iTransformer: Inverted Attention (ICLR 2024 Spotlight)
iTransformer made a simple but powerful change: swap the modeling axes.
Traditional transformers: Each time step is a token, attention captures temporal dependencies.
iTransformer: Each variate (channel) is embedded as a token. Attention models cross-variate correlations, while FFN learns temporal representations per variate.
┌─────────────────────────────────────────────────────────────────────────┐
│ iTransformer INVERSION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL TRANSFORMER (time as tokens): │
│ ────────────────────────────────────────── │
│ Tokens: [t₁] [t₂] [t₃] ... [tₙ] │
│ ↑ ↑ ↑ ↑ │
│ Each token = one timestamp, all variates │
│ Attention learns: temporal dependencies │
│ │
│ ────────────────────────────────────────────────────────────────────── │
│ │
│ iTRANSFORMER (variates as tokens): │
│ ────────────────────────────────── │
│ Tokens: [var₁] [var₂] [var₃] ... [varₘ] │
│ ↑ ↑ ↑ ↑ │
│ Each token = entire history of one variate │
│ Attention learns: cross-variate correlations │
│ FFN learns: temporal patterns per variate │
│ │
│ RESULT: Stronger accuracy, better scaling, no positional encoding │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Key insight: This simple axis swap yields stronger accuracy without new modules.
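The inversion amounts to embedding each variate's whole history as one token and running attention over variates; a minimal sketch:

import torch
import torch.nn as nn

batch, lookback, n_variates, d_model, horizon = 32, 96, 7, 128, 24
x = torch.randn(batch, lookback, n_variates)

# Invert: each variate's full history -> one token
tokens = nn.Linear(lookback, d_model)(x.permute(0, 2, 1))   # [batch, n_variates, d_model]

# Attention now runs across variates, not across time steps
encoder = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
variate_repr = encoder(tokens)                               # [batch, n_variates, d_model]

# Per-variate temporal projection to the forecast horizon
forecast = nn.Linear(d_model, horizon)(variate_repr)         # [batch, n_variates, horizon]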
TimeXer: Exogenous Variables (NeurIPS 2024)
TimeXer handles both endogenous and exogenous features:
- Patch-wise self-attention: For endogenous series
- Variate-wise cross-attention: To incorporate exogenous information
- Global endogenous tokens: Bridge causal information from exogenous series
State-of-the-art on twelve benchmarks for forecasting with external variables.
Other Notable Transformers
- FEDformer: Frequency-enhanced decomposed transformer using Fourier/wavelet transforms
- ETSformer: Exponential smoothing attention with level, growth, and seasonal components
- Crossformer: Two-stage attention (cross-time, cross-dimension) with segment embedding
MLP-Based Models (Efficient Alternatives)
Recent research shows simple MLPs can match or beat transformers:
TSMixer: MLP mixing in time and channel domains directly—surprisingly competitive.
WPMixer (AAAI 2025): Wavelet Patch Mixer combines:
- Multi-resolution wavelet decomposition (time + frequency domains)
- Patching for extended lookback and local information
- MLP mixing for global information
- 10x more efficient than TimeMixer, lower variance
- Outperforms transformer-based models for long-term forecasting
DLinear: Simple linear layers beat many transformers—challenged the field's assumptions.
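DLinear's whole model fits in a few lines: a moving-average trend/seasonal split followed by one linear layer per component. A sketch:

import torch
import torch.nn as nn

class DLinear(nn.Module):
    def __init__(self, lookback, horizon, kernel_size=25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2)
        self.linear_trend = nn.Linear(lookback, horizon)
        self.linear_seasonal = nn.Linear(lookback, horizon)

    def forward(self, x):                        # x: [batch, lookback]
        trend = self.avg(x.unsqueeze(1)).squeeze(1)[:, : x.shape[1]]  # moving average
        seasonal = x - trend                     # remainder after removing the trend
        return self.linear_trend(trend) + self.linear_seasonal(seasonal)

model = DLinear(lookback=336, horizon=96)
y_hat = model(torch.randn(32, 336))              # [32, 96]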
State Space Models: Linear Complexity Alternatives
Mamba for Time Series
Mamba, based on selective state space models (SSMs), offers transformer-competitive performance with linear complexity. For time series, this enables efficient processing of very long sequences.
MambaTS: Temporal Mamba Blocks
MambaTS addresses Mamba's limitations for long-term forecasting:
- Variable scan along time: Arranges historical information of all variables together
- Temporal Mamba Block (TMB): Specialized architecture for time series
Result: State-of-the-art on eight public datasets.
S-Mamba (Simple-Mamba)
S-Mamba takes a simpler approach:
- Tokenize time points per variate via linear layer
- Bidirectional Mamba layer extracts inter-variate correlations
- FFN learns temporal dependencies
Low computational overhead with leading performance on thirteen datasets.
Mamba4Cast: Zero-Shot with Mamba
Mamba4Cast is a zero-shot foundation model using Mamba:
- Inspired by Prior-data Fitted Networks (PFNs)
- Trained solely on synthetic data
- Generates forecasts for entire horizons in one pass
- Much lower inference time than transformers
SOR-Mamba: Order-Robust
Channels in time series have no specific order, introducing sequential bias. SOR-Mamba uses regularization to minimize discrepancy between embeddings from reversed channel orders.
TiRex: xLSTM Foundation Model (NeurIPS 2025)
TiRex demonstrates that LSTMs are back. Based on xLSTM (extended LSTM), TiRex is a 35M parameter foundation model that:
- Sets state-of-the-art on GiftEval and Chronos-ZS benchmarks
- Outperforms much larger models (Chronos-Bolt 200M, TimesFM 500M, Toto 151M)
- Provides both point and quantile predictions
Key innovation: State-tracking capability critical for long-horizon forecasting. Unlike transformers or Mamba, TiRex retains explicit state tracking via sLSTM modules.
CPM (Causal Prediction Masking): Training-time masking strategy that enhances state-tracking ability.
┌─────────────────────────────────────────────────────────────────────────┐
│ TiRex vs TRANSFORMER vs MAMBA │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Architecture State Tracking Complexity Zero-Shot Rank │
│ ────────────────────────────────────────────────────────────────────── │
│ Transformer ✗ (attention) O(n²) Good │
│ Mamba ✗ (selective) O(n) Good │
│ TiRex (xLSTM) ✓ (explicit) O(n) BEST (NeurIPS 2025) │
│ │
│ Key insight: State-tracking matters for forecasting │
│ TiRex's sLSTM maintains explicit state across sequence │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Diffusion Models for Time Series
Diffusion models, successful in image generation, have been adapted for probabilistic time series forecasting.
TimeGrad: Autoregressive Diffusion
TimeGrad (first diffusion model for time series) uses:
- Conditional diffusion: Denoising guided by RNN hidden state
- Autoregressive generation: Process sequence with recurrent cell, maintain hidden state
Output: Highly diverse probabilistic forecasts with uncertainty quantification.
CSDI: Non-Autoregressive Diffusion
CSDI (Conditional Score-based Diffusion for Imputation) uses:
- Self-supervised strategy: Input masking for training
- Stacked attention: Temporal and feature-wise attention for conditioning
- Non-autoregressive: Faster predictions than TimeGrad
Diffusion Model Improvements (2024-2025)
Known limitations of TimeGrad/CSDI:
- Optimize likelihood only—generate diverse but poorly aligned samples
- Training instability
- Boundary disharmony problems
Recent advances:
- MG-TSD: Multi-scale generation—predict main components then details
- mr-diff: Separately predict trend and seasonal components
- CCDM, TimeDiff, S²DBM, SimDiff: Achieve best/second-best on benchmarks
- LDM4TS: Translate time series to visual encodings, denoise in image-latent space
- MCD-TSF: Multimodal (text, timestamp) cross-attention with classifier-free guidance
┌─────────────────────────────────────────────────────────────────────────┐
│ DIFFUSION MODEL PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING (Forward Process): │
│ ─────────────────────────── │
│ Clean forecast → Add noise (T steps) → Pure noise │
│ x₀ ────────────────────────────────────────────────→ xₜ │
│ │
│ INFERENCE (Reverse Process): │
│ ──────────────────────────── │
│ Pure noise → Denoise (T steps) → Clean forecast │
│ xₜ ────────────────────────────────────────────────→ x₀ │
│ ↑ │
│ │ Conditioned on historical observations │
│ │ (via attention or RNN hidden state) │
│ │
│ KEY BENEFIT: Probabilistic forecasts with uncertainty │
│ Sample multiple trajectories → get prediction intervals │
│ │
└─────────────────────────────────────────────────────────────────────────┘
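A minimal sketch of the DDPM-style training objective for forecasting: noise the future window and train a network to predict the noise, conditioned on an encoding of the history. The conditioning and denoising networks below are placeholder MLPs, not any specific published architecture:

import torch
import torch.nn as nn

T = 100                                              # diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)

def forward_noise(x0, t):
    """q(x_t | x_0): add Gaussian noise at step t."""
    noise = torch.randn_like(x0)
    a = alphas_bar[t].sqrt().view(-1, 1)
    s = (1 - alphas_bar[t]).sqrt().view(-1, 1)
    return a * x0 + s * noise, noise

history_enc = nn.Sequential(nn.Linear(96, 64), nn.ReLU())        # encodes the context
denoiser = nn.Sequential(nn.Linear(24 + 64 + 1, 128), nn.ReLU(), nn.Linear(128, 24))

def training_step(history, future):                  # history: [B, 96], future: [B, 24]
    t = torch.randint(0, T, (future.shape[0],))
    x_t, noise = forward_noise(future, t)
    cond = history_enc(history)
    pred_noise = denoiser(torch.cat([x_t, cond, t.float().unsqueeze(1) / T], dim=1))
    return nn.functional.mse_loss(pred_noise, noise)  # denoising objective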
Novel Architectures
Kolmogorov-Arnold Networks (KAN)
KAN, proposed by MIT in 2024, is a revolutionary architecture based on the Kolmogorov-Arnold representation theorem. Instead of fixed activation functions, KAN uses spline-parametrized learnable activations.
For time series:
- T-KAN: Temporal KAN for univariate forecasting
- MT-KAN: Multivariate Time Series KAN
- TKAT: Temporal Kolmogorov-Arnold Transformer (combines KAN with attention)
Performance: KAN-based models achieve state-of-the-art with up to 98% lower MSE than transformers on some benchmarks, while offering unparalleled interpretability.
Key advantages:
- Interpretability: Learnable activations reveal feature relationships
- Efficiency: Fewer parameters for equivalent performance
- Adaptivity: Dynamic learning of activation patterns
Limitations: ~10x slower training than MLPs (diverse activations can't fully leverage batching).
Neural ODEs and Neural CDEs
Neural ODEs model continuous-time dynamics:
- Learn differential equation governing latent state
- Natural handling of irregular time steps
- Continuous trajectory between observations
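A minimal latent-ODE sketch using the torchdiffeq package (assumed installed): the solver integrates a learned vector field between observation times, which is what makes irregular sampling natural:

import torch
import torch.nn as nn
from torchdiffeq import odeint    # pip install torchdiffeq (assumed available)

class ODEFunc(nn.Module):
    """Learned vector field dz/dt = f(z)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, z):
        return self.net(z)

func = ODEFunc()
z0 = torch.randn(8, 16)                          # latent state at the last observation
t = torch.tensor([0.0, 0.3, 1.1, 2.0])           # irregular future time points
z_traj = odeint(func, z0, t)                     # [len(t), batch, dim]
forecasts = nn.Linear(16, 1)(z_traj)             # readout at each requested time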
Neural CDEs (Controlled Differential Equations) extend for time series:
- Continuous analogue of RNNs
- Handle irregularly sampled and partially observed data
- State-of-the-art on irregular time series
Recent extensions:
- ANCDE: Attentive Neural CDE with attention for dynamic path construction
- DualDynamics (2025): Combines explicit (Neural ODE) and implicit updates
- STG-NCDE: Extends to multivariate with graph convolution
Graph Neural Networks for Spatial-Temporal Forecasting
When time series have underlying graph structure (traffic networks, sensor grids), Graph Neural Networks capture spatial dependencies:
Key architectures:
- STGCN: Graph convolutions + temporal convolutions
- DCRNN: Diffusion convolutions + GRU
- Graph WaveNet: Adaptive graph learning + dilated causal convolutions
- MTGNN: Multivariate Time Series GNN
- STG-NCDE: Graph + Neural CDE for irregular data
Frequency Domain Methods
FourierGNN: Rethinks forecasting from pure graph perspective:
- Hypervariate graph: Each series value (any variate, any timestamp) is a node
- Fourier Graph Operator: Matrix multiplication in Fourier space
- Much lower complexity with adequate expressiveness
FreTS: Frequency-domain MLPs:
- Transform time signal to frequency domain using DFT
- Apply MLP in frequency domain
- Capture global patterns more effectively
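A simplified sketch of the frequency-MLP idea (FreTS itself uses complex-valued MLPs; here the real and imaginary parts are stacked and mixed with an ordinary linear layer):

import torch
import torch.nn as nn

x = torch.randn(32, 512)                             # [batch, time]

spec = torch.fft.rfft(x, dim=-1)                     # [batch, 257] complex spectrum
feats = torch.cat([spec.real, spec.imag], dim=-1)    # [batch, 514]

mixer = nn.Sequential(nn.Linear(514, 514), nn.GELU(), nn.Linear(514, 514))
mixed = mixer(feats)                                 # mixing in the frequency domain

real, imag = mixed.chunk(2, dim=-1)                  # back to complex
y = torch.fft.irfft(torch.complex(real, imag), n=512, dim=-1)   # [batch, 512]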
Recent improvements:
- FreDN (2025): Learnable frequency decomposition, 10-14% better than FreTS
- FBM: Fourier Basis Mapping addressing time-dependency issues
Self-Supervised Learning for Time Series
Self-supervised pretraining learns representations without labels, then transfers to downstream tasks.
TS2Vec: Universal Representations
TS2Vec learns representations through hierarchical contrastive learning:
- Creates augmented context views
- Contrastive loss at multiple temporal scales
- Timestamp-level representations (not just sequence-level)
Results: State-of-the-art on 125 UCR and 29 UEA datasets for classification, plus strong forecasting and anomaly detection.
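A simplified temporal contrastive loss in the spirit of TS2Vec: two augmented views of the same series, with the same timestamp as the positive pair and other timestamps as negatives (the full method also adds instance-wise contrast and hierarchical pooling):

import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: [batch, time, dim] encodings of two augmented views."""
    B, T, D = z1.shape
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Similarity of every timestamp in view 1 to every timestamp in view 2
    sim = torch.einsum('btd,bsd->bts', z1, z2) / temperature   # [B, T, T]
    labels = torch.arange(T, device=z1.device).repeat(B)       # positive = same timestamp
    return F.cross_entropy(sim.reshape(B * T, T), labels)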
TNC (Temporal Neighborhood Coding)
TNC exploits local temporal smoothness:
- Positive pairs: Neighboring segments (stationary neighborhoods)
- Negative pairs: Distant segments
- Assumes local segments share generative process
TS2Vec-Ensemble (2025)
Standard contrastive learning prioritizes instance discrimination over capturing deterministic patterns (seasonality, trend). TS2Vec-Ensemble addresses this:
- Fuses learned dynamics from TS2Vec encoder
- With explicit engineered time features encoding periodic cycles
- Significantly outperforms TS2Vec and state-of-the-art on ETT benchmarks
Other Approaches
- TF-C: Time-frequency consistency learning
- TS-TCC/CA-TCC: Temporal-contextual contrasting
- Series2Vec (2024): Predicts similarity in temporal AND spectral domains
- SoftCLT (ICLR 2024): Soft contrastive learning with augmentation strategies
Foundation Models for Time Series
The Foundation Model Paradigm
Train once on massive diverse data, apply anywhere:
- Pretrain on billions of time points across domains
- Zero-shot: Feed any series, get forecasts without training
- Fine-tune (optional) for improved domain accuracy
Amazon Chronos Family
Chronos (March 2024): Treats time series as language
- Scale and quantize values to 4,096 discrete tokens
- Train T5 encoder-decoder with cross-entropy loss
- Probabilistic forecasting via sampling multiple trajectories
Chronos-Bolt (November 2024): Massive efficiency gains
- 250x faster, 20x less memory, 5% lower error
- Patch-based: chunks of observations instead of individual points
- Direct multi-step forecasting (no autoregressive bottleneck)
Chronos-2 (October 2025): Universal forecasting
- 120M parameters, encoder-only
- Univariate + Multivariate + Covariates in single architecture
- Time and Group Attention: Alternates between temporal and cross-series attention
Google TimesFM
Decoder-only approach inspired by GPT:
- Patching: 32 time points per token
- Pretrained on 100B+ real-world points
Versions:
- TimesFM 1.0 (200M): Up to 512 context
- TimesFM 2.0 (500M): 2048 context, 25% better
- TimesFM 2.5: Architecture optimizations
2025: Integrated into BigQuery/AlloyDB; training corpus expanded to 400B+ points; few-shot learning capabilities.
Salesforce Moirai
Universal forecaster handling any domain, variable lengths, and exogenous features.
Moirai-MoE (October 2024): First mixture-of-experts time series foundation model
- 17% better than dense Moirai
- Comparable accuracy with 65x fewer activated parameters
Moirai 2.0 (November 2025): Decoder-only
- Quantile forecasting + multi-token prediction
- 2x faster, 30x smaller than Moirai 1.0-Large
TiRex (NeurIPS 2025)
xLSTM-based foundation model:
- Only 35M parameters
- State-of-the-art on GiftEval, Chronos-ZS benchmarks
- Outperforms Chronos-Bolt (200M), TimesFM (500M), Toto (151M)
- State-tracking via sLSTM critical for long-horizon forecasting
Datadog Toto
Time Series Optimized Transformer for Observability:
- Trained on ~1 trillion data points (largest dataset among published models)
- 750 billion anonymous numerical metrics from Datadog platform
- Optimized for observability (monitoring, alerting) use cases
Timer-XL (ICLR 2025)
Causal transformer for unified forecasting:
- Generalizes next-token prediction to multivariate next-patch prediction
- Handles univariate, multivariate, and covariate-informed forecasting
- Patch-level generation based on long-context sequences
TimeGPT (Nixtla)
Production-ready foundation model:
- Trained on 100B+ data points
- Zero-shot forecasting AND anomaly detection
- Does NOT use patching (unlike most others)
- Available via API and SDK
Lag-Llama (ServiceNow)
Probabilistic univariate foundation model:
- Decoder-only transformer with lagged features (t-1, t-7, t-30, etc.)
- Outputs Student's t distribution parameters
- Strong zero-shot generalization
Sundial (ICML 2025 Oral - Top 1%)
Generative foundation model from Tsinghua University—current benchmark leader:
- Pre-trained on TimeBench (10^12 time points—largest pretraining corpus)
- TimeFlow Loss: Flow-matching for native continuous-valued training (no tokenization)
- #1 on GIFT-Eval (MASE) and Time-Series-Library (MSE/MAE)
- Zero-shot predictions in milliseconds
- Generates multiple probable predictions without specifying prior distributions
- Mitigates mode collapse via TimeFlow Loss
Time-MoE (ICLR 2025 Spotlight)
Billion-scale MoE foundation model—the largest open time series model:
- Scales up to 2.4 billion parameters
- Time-300B: Largest open time series dataset (300B+ points across 9 domains)
- Sparse MoE activates only subset of experts per prediction
- Context length up to 4096, arbitrary prediction horizons
- Validates scaling laws for time series (more data + bigger model = better)
TabPFN-TS (January 2025)
Tabular foundation model adapted for time series—surprisingly effective:
- Only 11M parameters but top rank on GIFT-Eval
- Recasts forecasting as tabular regression problem
- Outperforms Chronos-Mini (20M) by 7.7%
- Outperforms Chronos-Large (710M) by 3.0% in zero-shot
- No time series pretraining needed—uses only tabular/synthetic data
- Supports both point and probabilistic forecasting
Time-LLM (ICLR 2024)
LLM reprogramming framework—1000+ citations in 2 years:
- Repurposes frozen LLMs (Llama-7B, GPT-2, BERT) for forecasting
- Prompt-as-Prefix (PaP): Enriches input with domain knowledge and task instructions
- Reprogram time series into text prototypes the LLM understands
- Excels in few-shot and zero-shot scenarios
- Adopted by industry for solar, wind, and weather forecasting
LLM-Based Forecasting (GPT-4, Claude)
General-purpose LLMs can forecast time series via prompting:
- LLMTIME: Encodes series as digit strings, adapts discrete distributions
- Surprising finding: GPT-4 performs worse than GPT-3 for forecasting (RLHF may hurt calibration)
- Specialized foundation models still outperform general LLMs
- Best use: Combine LLM reasoning with specialized forecasters
Benchmarks: GIFT-Eval
GIFT-Eval (General Time Series Forecasting Model Evaluation) is the new standard benchmark:
- 28 datasets, 144,000+ time series, 177 million data points
- 7 domains, 10 frequencies, short to long-term horizons
- Non-leaking pretraining data: 230 billion points for fair evaluation
- Evaluates univariate, multivariate, and zero-shot capabilities
Current Leaderboard (2025):
- Sundial (128M) - #1 MASE on GIFT-Eval, #1 MSE/MAE on TSLib (ICML 2025 Oral)
- TabPFN-TS (11M) - Top rank on point forecasting despite tiny size
- TiRex (35M) - State-of-the-art xLSTM, best zero-shot
- TimeCopilot - Ensemble of Chronos-2 + TimesFM-2.5 + TiRex
- Chronos-2 - Best full-featured single model
Key Finding: Short-term → foundation models win. Long-term → fine-tuned models (PatchTST, iTransformer) catch up.
┌─────────────────────────────────────────────────────────────────────────┐
│ FOUNDATION MODEL COMPARISON (2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Arch Params Multi Covar Prob Best For │
│ ────────────────────────────────────────────────────────────────────── │
│ Sundial Flow-Match 128M ✓ ✗ ✓ GIFT-Eval #1 │
│ TabPFN-TS PFN 11M ✗ ✗ ✓ Point forecast │
│ TiRex xLSTM 35M ✗ ✗ ✓ Zero-shot best │
│ Chronos-2 Enc-only 120M ✓ ✓ ✓ Full-featured │
│ Chronos-Bolt Enc-Dec varies ✗ ✗ ✓ Fast inference │
│ Time-MoE MoE 2.4B ✓ ✗ ✓ Scale leader │
│ TimesFM-2.5 Dec-only 500M ✗ ✗ ✗ Google Cloud │
│ Moirai-2.0 Dec-only varies ✓ ✓ ✓ Efficiency │
│ Moirai-MoE MoE varies ✓ ✓ ✓ Sparse compute │
│ Toto Transformer 151M varies varies ✓ Observability │
│ Timer-XL Causal varies ✓ ✓ ✓ Unified tasks │
│ Time-LLM LLM-reprg 7B+ ✓ ✓ ✓ LLM leverage │
│ TimeGPT varies varies ✓ ✓ ✓ Production API │
│ Lag-Llama Dec-only varies ✗ ✗ ✓ Probabilistic │
│ │
│ GIFT-EVAL LEADERBOARD (2025): │
│ 1. Sundial (128M) - Flow-matching, #1 MASE (ICML 2025 Oral) │
│ 2. TabPFN-TS (11M) - Tabular model beats specialized models │
│ 3. TiRex (35M) - xLSTM with state-tracking │
│ 4. TimeCopilot - Ensemble (Chronos-2 + TimesFM-2.5 + TiRex) │
│ 5. Chronos-2 - Best single full-featured model │
│ │
│ KEY TRENDS 2025: │
│ • Smaller models with better architectures beating giants │
│ • MoE for efficient scaling (Time-MoE, Moirai-MoE) │
│ • Tabular models surprisingly competitive (TabPFN-TS) │
│ • Ensembles of top models achieve best results │
│ • Scaling laws validated for time series │
│ │
└─────────────────────────────────────────────────────────────────────────┘
How Tokenization Works
The key challenge: language is discrete (words), time series are continuous (numbers).
Quantization (Chronos)
- Scale series to mean 0, std 1
- Bin values into 4,096 buckets
- Each bucket = one vocabulary token
- Train to predict tokens
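A sketch of this mean-scaling-plus-binning step (bin range and vocabulary size chosen for illustration; Chronos's actual tokenizer also reserves special tokens):

import numpy as np

def quantize(series, n_bins=4096, limit=15.0):
    """Mean-scale a series and map each value to a discrete token id."""
    scale = np.abs(series).mean() + 1e-8
    scaled = series / scale                                  # mean absolute value ~ 1
    edges = np.linspace(-limit, limit, n_bins - 1)           # uniform bin edges
    tokens = np.digitize(scaled, edges)                      # ids in [0, n_bins - 1]
    return tokens, scale

def dequantize(tokens, scale, n_bins=4096, limit=15.0):
    centers = np.linspace(-limit, limit, n_bins)             # approximate bin centers
    return centers[tokens] * scale

tokens, scale = quantize(np.random.randn(256))
reconstructed = dequantize(tokens, scale)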
Patching (TimesFM, PatchTST)
- Chunk into non-overlapping patches (e.g., 32 points)
- Embed each patch via learned projection
- Feed patch embeddings to transformer
Benefits: Dramatic sequence reduction, local pattern capture.
Wavelet-Based (WaveToken)
- Wavelet decomposition: Separate coarse (trend) and fine (detail)
- Quantize wavelet coefficients
- Predict future coefficients, reconstruct
Multi-Resolution
Use patches of varying sizes simultaneously (8, 32, 128 steps) to capture patterns at different temporal scales.
Training Deep Learning Models for Time Series
Training time series forecasting models requires careful attention to loss functions, data preparation, validation strategies, and hyperparameter tuning. This section covers the complete training pipeline.
Loss Functions
The choice of loss function fundamentally shapes what your model learns. Different losses suit different forecasting objectives.
Point Forecast Losses
Mean Squared Error (MSE):
- Penalizes large errors heavily (squared term)
- Sensitive to outliers
- Use when large errors are particularly costly
Mean Absolute Error (MAE):
- More robust to outliers than MSE
- All errors penalized equally
- Better when outliers are noise, not signal
Mean Absolute Percentage Error (MAPE):
- Scale-independent (useful for comparing across series)
- Problem: Undefined when y=0, asymmetric
Symmetric MAPE (sMAPE):
- Addresses MAPE's asymmetry
- Bounded between 0% and 200%
Probabilistic Forecast Losses
Quantile Loss (Pinball Loss):
- Trains model to predict specific quantiles (e.g., 10th, 50th, 90th percentile)
- Asymmetric: Higher quantiles penalize underestimates more
- Sum quantile losses across multiple quantiles for full distribution
import torch

def quantile_loss(y_true, y_pred, quantile):
    error = y_true - y_pred
    return torch.max(quantile * error, (quantile - 1) * error).mean()

# Train for multiple quantiles (pred: dict mapping quantile level -> prediction tensor)
quantiles = [0.1, 0.5, 0.9]
total_loss = sum(quantile_loss(y, pred[q], q) for q in quantiles)
Continuous Ranked Probability Score (CRPS):
CRPS = ∫ (F(x) - 1{x ≥ y})² dx
- Evaluates the entire predicted distribution against the observation
- Generalizes MAE: when prediction is a point (delta function), CRPS = MAE
- Key advantage: Avoids quantile crossing (where predicted quantiles intersect)
- Used by Chronos, DeepAR, and other probabilistic models
import numpy as np
from scipy.stats import norm

def crps_gaussian(y_true, mu, sigma):
    """CRPS for a Gaussian predictive distribution (closed form)"""
    z = (y_true - mu) / sigma
    crps = sigma * (z * (2 * norm.cdf(z) - 1) +
                    2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return crps.mean()
Negative Log-Likelihood (NLL):
- For parametric distributions (Gaussian, Student's t, Negative Binomial)
- Model predicts distribution parameters (μ, σ for Gaussian)
- Loss is negative log probability of observation under predicted distribution
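A sketch of a Gaussian NLL head: the network outputs μ and a positive σ per forecast step, and the loss is the negative log-density of the observation (PyTorch also ships nn.GaussianNLLLoss for this). The layer sizes are illustrative:

import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, hidden=128, horizon=24):
        super().__init__()
        self.mu = nn.Linear(hidden, horizon)
        self.sigma = nn.Linear(hidden, horizon)

    def forward(self, h):
        # softplus keeps sigma positive
        return self.mu(h), nn.functional.softplus(self.sigma(h)) + 1e-6

def gaussian_nll(y, mu, sigma):
    dist = torch.distributions.Normal(mu, sigma)
    return -dist.log_prob(y).mean()

head = GaussianHead()
mu, sigma = head(torch.randn(32, 128))         # encoder output -> distribution parameters
loss = gaussian_nll(torch.randn(32, 24), mu, sigma)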
┌─────────────────────────────────────────────────────────────────────────┐
│ LOSS FUNCTION SELECTION GUIDE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ OBJECTIVE RECOMMENDED LOSS │
│ ────────────────────────────────────────────────────────────────────── │
│ Point forecast, outliers OK MSE │
│ Point forecast, robust MAE │
│ Scale-independent comparison sMAPE, MASE │
│ Prediction intervals Quantile Loss (multiple quantiles) │
│ Full distribution CRPS or NLL │
│ Business asymmetric cost Custom asymmetric loss │
│ │
│ FOUNDATION MODEL TRAINING: │
│ Chronos: Cross-entropy on quantized tokens │
│ TimesFM: MSE on patches │
│ Moirai: Mixture distribution NLL │
│ TiRex: Quantile loss │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Data Preparation and Preprocessing
Normalization Strategies
Per-series normalization (most common):
def normalize_series(series):
    mean = series.mean()
    std = series.std() + 1e-8  # Avoid division by zero
    normalized = (series - mean) / std
    return normalized, mean, std

def denormalize(normalized, mean, std):
    return normalized * std + mean
Global normalization (for related series):
- Compute statistics across all series in dataset
- Use when series share similar scales
- Common in foundation model pretraining
Log transformation (for multiplicative patterns):
# For data with multiplicative seasonality/trends
log_series = np.log1p(series) # log(1+x) handles zeros
# Train on log_series, then expm1() to reverse
Handling Missing Values
- Forward fill: Use last known value
- Interpolation: Linear or spline interpolation
- Masking: Mark missing values, let the model handle them (Chronos-2 supports this)
- Imputation models: Run CSDI or another imputation method first
Handling Irregular Timestamps
For irregular time series:
- Resample to regular frequency (information loss)
- Neural CDEs: Designed for irregular data
- Time-aware models: Encode time gaps as features
Data Augmentation for Time Series
Data augmentation is critical for training generalizable models, especially foundation models.
TSMix (Time Series Mixup)
TSMix creates synthetic series by combining real series:
import numpy as np

def tsmix(series_a, series_b, alpha=0.5):
    """Mix two time series by convex combination (mixup)"""
    # Random mixing coefficient from a Beta distribution
    lam = np.random.beta(alpha, alpha)
    mixed = lam * series_a + (1 - lam) * series_b
    return mixed
Used in Chronos training: 10M TSMix augmentations from 28 datasets.
Key insight: TSMix improves zero-shot performance but not in-domain performance.
KernelSynth (Synthetic from Gaussian Processes)
KernelSynth generates unlimited synthetic series using Gaussian Processes:
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

def kernel_synth(length=512, n_samples=1000):
    """Generate synthetic time series from GP kernels"""
    synthetic_series = []
    for _ in range(n_samples):
        # Randomly compose kernels (ExpSineSquared is sklearn's periodic kernel)
        kernel = (
            RBF(length_scale=np.random.uniform(10, 100)) +
            ExpSineSquared(length_scale=np.random.uniform(5, 50),
                           periodicity=np.random.uniform(10, 100)) +
            WhiteKernel(noise_level=np.random.uniform(0.01, 0.1))
        )
        # Sample one series from the GP prior
        X = np.arange(length).reshape(-1, 1)
        K = kernel(X)
        series = np.random.multivariate_normal(np.zeros(length), K)
        synthetic_series.append(series)
    return synthetic_series
Optimal ratio: ~10% synthetic data. More synthetic data tends to worsen performance.
Other Augmentation Techniques
Jittering: Add small random noise
augmented = series + np.random.normal(0, 0.01, len(series))
Scaling: Random amplitude scaling
augmented = series * np.random.uniform(0.8, 1.2)
Time warping: Stretch/compress time axis locally
Window slicing: Random crops from longer series
Magnitude warping: Smooth amplitude variations
Validation Strategies for Time Series
Critical: Standard k-fold cross-validation violates temporal ordering and causes data leakage. Use time-respecting validation.
Train-Test Split
Simple temporal split—train on past, test on future:
def temporal_split(series, train_ratio=0.8):
    split_idx = int(len(series) * train_ratio)
    train = series[:split_idx]
    test = series[split_idx:]
    return train, test
Problem: Single split may not be representative.
Walk-Forward Validation (Rolling Origin)
The gold standard for time series. Train on expanding/rolling window, test on next period:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_validation(series, model, n_splits=5):
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = []
    for train_idx, test_idx in tscv.split(series):
        train, test = series[train_idx], series[test_idx]
        # Train model on all data up to the split point
        model.fit(train)
        # Predict and evaluate on the next period
        predictions = model.predict(len(test))
        score = evaluate(test, predictions)   # evaluate() = your metric, e.g. MAE
        scores.append(score)
    return np.mean(scores), np.std(scores)
┌─────────────────────────────────────────────────────────────────────────┐
│ WALK-FORWARD VALIDATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Time → │
│ ────────────────────────────────────────────────────────────────────── │
│ │
│ Fold 1: [TRAIN════════] [TEST] │
│ Fold 2: [TRAIN════════════════] [TEST] │
│ Fold 3: [TRAIN════════════════════════] [TEST] │
│ Fold 4: [TRAIN════════════════════════════════] [TEST] │
│ Fold 5: [TRAIN════════════════════════════════════════] [TEST] │
│ │
│ Each fold: Train on all prior data, test on next period │
│ Never use future data to predict past │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Adding a Gap (Embargo Period)
Prevent information leakage from recent observations:
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_with_gap(series, n_splits=5, gap=7):
    """Leave a gap between train end and test start.
    If predicting weekly, gap=7 prevents day-of-week leakage."""
    # TimeSeriesSplit supports this directly via its gap parameter
    return TimeSeriesSplit(n_splits=n_splits, gap=gap).split(series)
Expanding vs Rolling Window
- Expanding window: Train set grows with each fold (more data)
- Rolling window: Fixed-size train set slides forward (handles non-stationarity)
# Rolling window
tscv = TimeSeriesSplit(n_splits=5, max_train_size=365*3) # Max 3 years
Hyperparameter Tuning
Key hyperparameters for time series deep learning models:
Lookback Window (Context Length)
How much history the model sees:
- Too short: Misses seasonal patterns
- Too long: Noise, computational cost, vanishing gradients
Rule of thumb: At least 2-3x the longest seasonality period.
# If data has yearly seasonality (365 days)
lookback = 365 * 2 # ~2 years of history
# For foundation models
# Chronos: Up to 4096 tokens
# TimesFM: Up to 2048 points
# TiRex: Variable, designed for long context
Forecast Horizon
How far ahead to predict:
- Longer horizons → harder, more uncertainty
- Match to business needs
Learning Rate
Most important hyperparameter:
- Too high: Divergence, oscillation
- Too low: Slow convergence, stuck in local minima
Recommendations:
- Start with 1e-3 to 1e-4 for Adam
- Use learning rate schedulers (cosine, reduce-on-plateau)
- Linear scaling rule: When batch size × k, learning rate × k
# Learning rate scheduling
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-6
)

# Or reduce on plateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10
)
Batch Size
- Smaller (16-64): More noise, better generalization, slower
- Larger (128-512): Stable gradients, faster, may overfit
For time series: Research suggests optimal batch size may correlate with data periodicity (use FFT to find dominant frequencies).
# Typical starting points
batch_size = 32 # Good default
# Increase if training is stable, decrease if loss is noisy
Model-Specific Hyperparameters
LSTM/GRU:
- Hidden size: 64-512
- Number of layers: 1-3
- Dropout: 0.1-0.3
Transformer:
- d_model: 64-512
- n_heads: 4-8
- n_layers: 2-6
- Patch size: 8-64
N-BEATS:
- Stack types: [trend, seasonality] or [generic]
- Blocks per stack: 3-5
- Layer width: 256-512
Fine-Tuning Foundation Models
Foundation models work zero-shot, but fine-tuning can significantly improve domain-specific performance.
When to Fine-Tune
Fine-tune when:
- Your domain has unique patterns not in pretraining data
- You have sufficient domain data (1000+ series or 100K+ points)
- Zero-shot performance is good but not good enough
- Computational resources available
Don't fine-tune when:
- Very limited data (risk of overfitting)
- Zero-shot already meets requirements
- Need rapid deployment
Fine-Tuning Strategies
Full Fine-Tuning
Update all model weights:
# Load pretrained model
model = ChronosModel.from_pretrained("amazon/chronos-t5-base")

# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True

# Train with a lower learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # 10x lower than pretraining
Pros: Maximum adaptation
Cons: Needs more data, risk of catastrophic forgetting
Adapter-Based Fine-Tuning (PEFT)
Add small trainable modules, freeze pretrained weights:
from peft import get_peft_model, LoraConfig
# Add LoRA adapters
lora_config = LoraConfig(
    r=8,                                  # Rank of the low-rank update
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)
model = get_peft_model(base_model, lora_config)

# Only adapter parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# Typically <1% of total parameters
Pros: Efficient, less overfitting, can store multiple adapters
Cons: Less expressive than full fine-tuning
ChronosX uses adapters to add covariate support to Chronos, achieving ~22% improvement.
Last-Layer Fine-Tuning
Only train the output projection:
# Freeze all but the last layer
for param in model.parameters():
    param.requires_grad = False
for param in model.output_projection.parameters():
    param.requires_grad = True
Pros: Fast, minimal overfitting risk
Cons: Limited adaptation
Fine-Tuning Code Examples
Fine-Tuning Chronos
import torch
from chronos import ChronosConfig, ChronosModel
from torch.utils.data import DataLoader

# Load pretrained (schematic: exact class/method names vary across
# chronos-forecasting versions)
config = ChronosConfig.from_pretrained("amazon/chronos-t5-small")
model = ChronosModel.from_pretrained("amazon/chronos-t5-small")

# Prepare your data
train_dataset = YourTimeSeriesDataset(your_data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Fine-tuning setup
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

model.train()
for epoch in range(10):
    for batch in train_loader:
        context, target = batch
        # Tokenize (Chronos-specific)
        tokens = model.tokenizer.encode(context)
        target_tokens = model.tokenizer.encode(target)
        # Forward pass
        loss = model(tokens, labels=target_tokens).loss
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()

# Save fine-tuned model
model.save_pretrained("./chronos-finetuned")
Fine-Tuning with NeuralForecast
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, NHITS, TFT

# Define models with hyperparameters
models = [
    NBEATS(
        input_size=7*24,    # 1 week of hourly data
        h=24,               # Predict 24 hours ahead
        max_steps=1000,
        learning_rate=1e-3,
        batch_size=32,
    ),
    NHITS(
        input_size=7*24,
        h=24,
        max_steps=1000,
        n_freq_downsample=[24, 12, 1],  # Multi-resolution
    ),
    TFT(
        input_size=7*24,
        h=24,
        hidden_size=128,
        max_steps=1000,
    ),
]

# Train
nf = NeuralForecast(models=models, freq='H')
nf.fit(df=train_df)

# Predict
forecasts = nf.predict()
Training Infrastructure
Memory Optimization
# Gradient checkpointing (trade compute for memory)
model.gradient_checkpointing_enable()

# Mixed precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    loss = model(inputs).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Distributed Training
# PyTorch DDP
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
dist.init_process_group(backend='nccl')
model = DistributedDataParallel(model)
# Or use Hugging Face Accelerate
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(
model, optimizer, train_loader
)
Evaluation Metrics
Point Forecast Metrics
import numpy as np

def mase(y_true, y_pred, y_train, seasonality=1):
    """Mean Absolute Scaled Error - compares to the seasonal naive baseline"""
    naive_mae = np.mean(np.abs(y_train[seasonality:] - y_train[:-seasonality]))
    forecast_mae = np.mean(np.abs(y_true - y_pred))
    return forecast_mae / naive_mae

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error"""
    return 100 * np.mean(2 * np.abs(y_true - y_pred) /
                         (np.abs(y_true) + np.abs(y_pred) + 1e-8))
Probabilistic Metrics
def coverage(y_true, lower, upper):
    """What fraction of observations fall within the prediction interval"""
    return np.mean((y_true >= lower) & (y_true <= upper))

def interval_width(lower, upper):
    """Average width of prediction intervals"""
    return np.mean(upper - lower)

# Good model: High coverage + narrow intervals
┌─────────────────────────────────────────────────────────────────────────┐
│ COMPLETE TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATA PREPARATION │
│ ├─ Handle missing values │
│ ├─ Normalize (per-series or global) │
│ ├─ Create train/val/test splits (temporal) │
│ └─ Apply augmentation (TSMix, KernelSynth) │
│ │
│ 2. MODEL SELECTION │
│ ├─ Baseline: ARIMA/ETS for comparison │
│ ├─ Choose architecture based on data/requirements │
│ └─ Foundation model for zero-shot baseline │
│ │
│ 3. TRAINING │
│ ├─ Set loss function (MSE/MAE/Quantile/CRPS) │
│ ├─ Configure optimizer (AdamW, lr=1e-3 to 1e-4) │
│ ├─ Add learning rate scheduler │
│ ├─ Implement early stopping on validation loss │
│ └─ Use gradient clipping (max_norm=1.0) │
│ │
│ 4. VALIDATION │
│ ├─ Walk-forward validation (5+ folds) │
│ ├─ Track point metrics (MAE, MASE, sMAPE) │
│ └─ Track probabilistic metrics (CRPS, coverage) │
│ │
│ 5. FINE-TUNING (if using foundation model) │
│ ├─ Start with zero-shot baseline │
│ ├─ Try adapter-based fine-tuning first │
│ ├─ Use lower learning rate (10x lower) │
│ └─ Monitor for overfitting │
│ │
│ 6. PRODUCTION │
│ ├─ Save model + normalization params │
│ ├─ Set up monitoring (accuracy drift) │
│ └─ Plan retraining schedule │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Model Selection Guide
When Classical Methods Win
Use ARIMA/ETS when:
- Series < 1000 points with clear autocorrelation
- Interpretability > marginal accuracy gains
- No deep learning infrastructure
Use Prophet when:
- Business forecasting with seasonalities/holidays
- Non-technical stakeholders need understanding
- Speed of development > accuracy
When Deep Learning Excels
Use N-BEATS/N-HiTS when:
- Interpretability needed but classical methods insufficient
- Long-horizon forecasting
- No time-series-specific feature engineering desired
Use LSTM/Transformer when:
- Complex patterns, substantial data
- Multivariate with interactions
- Long sequences (use Informer/Autoformer for efficiency)
When Foundation Models Shine
Use foundation models when:
- Zero-shot acceptable: No training time
- Limited data: Cold-start problems
- Diverse domains: One model across everything
- Rapid prototyping: Test feasibility first
Model recommendations:
- Overall best: TiRex (35M, state-of-the-art)
- Full-featured: Chronos-2 (covariates, multivariate)
- Fast inference: Chronos-Bolt (250x faster)
- Efficiency: Moirai-2.0 (2x faster, 30x smaller)
- Google Cloud: TimesFM (BigQuery integration)
- Probabilistic: Lag-Llama (uncertainty quantification)
Production Considerations
Data Pipeline
- Missing value handling strategy
- Normalization parameters versioned
- Frequency validation
- Outlier detection
Inference
- Context window limits
- GPU batching optimization
- Fallback models
- Caching for repeated series
Output
- Prediction intervals (not just point forecasts)
- Denormalization
- Forecast metadata (model version, confidence)
Monitoring
- Accuracy metrics over time
- Data/concept drift detection
- Degradation alerts
- Regular backtesting
Practical Examples
Using Chronos
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-base",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

context = torch.tensor([[1.2, 1.5, 1.3, 1.8, 2.1, 1.9, 2.3, 2.0]])
forecast = pipeline.predict(
    context,
    prediction_length=12,
    num_samples=20,
)  # shape: [batch, num_samples, prediction_length]

median = forecast.median(dim=1).values
lower = forecast.quantile(0.1, dim=1)   # torch.quantile returns a tensor, no .values
upper = forecast.quantile(0.9, dim=1)
Using TiRex
from tirex import TiRexPipeline
pipeline = TiRexPipeline.from_pretrained("NX-AI/TiRex")
# TiRex provides both point and quantile predictions
forecast = pipeline.predict(
context,
prediction_length=24,
)
Using TimesFM
import timesfm

tfm = timesfm.TimesFm(
    context_len=512,
    horizon_len=128,
    input_patch_len=32,
    output_patch_len=128,
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

forecast = tfm.forecast(context)