
OLMo-core: A Deep Dive into Allen AI's Open LLM Training Pipeline

A comprehensive exploration of OLMo-core's architecture—examining Allen AI's building blocks for fully open LLM training including the distributed trainer, model architectures, optimization strategies, and the design that enabled OLMo-2 and OLMo-3 training.


Introduction

Allen AI's OLMo (Open Language Model) project represents one of the most ambitious efforts in open LLM development. Unlike other open-weight models that release only the final checkpoint, OLMo releases everything: model weights, training code, data, and intermediate checkpoints. The OLMo-core library provides the foundational building blocks that make this possible.

This post explores OLMo-core's architecture by examining its Python implementation. We'll understand how the distributed trainer orchestrates multi-node training, how the model architecture is defined, how checkpointing enables reproducibility, and how the modular design allows customization for different model sizes and configurations.

Project Philosophy

OLMo-core embodies the philosophy that advancing AI requires transparency. The codebase is designed not just for internal use but as a reference implementation that others can study, modify, and build upon. Every design decision prioritizes clarity and reproducibility over clever optimizations that obscure behavior.

The official training scripts for OLMo-2 and OLMo-3 are included directly in the repository. These aren't simplified examples but the actual scripts used to train the released models. This transparency allows researchers to understand exactly how frontier models are built.

Core Architecture

OLMo-core is organized into focused modules that can be composed for different training scenarios.

Module Organization

The main package contains several key directories. The nn directory provides neural network building blocks including attention, feed-forward layers, and transformer blocks. The train directory contains the trainer and training-related utilities. The optim directory implements optimizers and learning rate schedulers. The data directory handles data loading and preprocessing. The distributed directory manages multi-GPU and multi-node coordination.

Additional utilities include io for checkpoint and data I/O, eval for evaluation during training, and generate for inference after training.

Neural Network Components

The nn module provides the core model components, designed for flexibility while maintaining performance.

Attention Mechanisms

The attention directory implements multi-head attention with multiple backends. The module supports standard attention, Flash Attention 2, Flash Attention 3, and ring attention for sequence parallelism. Backend selection happens automatically based on available libraries and configuration.

Rotary Position Embeddings (RoPE) are implemented in rope.py with extensive support for different scaling strategies. The implementation supports standard RoPE, YaRN scaling for extended context, and various interpolation methods. The attention module uses these embeddings transparently based on configuration.
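
To make the rotation concrete, here is a minimal sketch of standard (non-scaled) RoPE applied to a query or key tensor. The function names, tensor layout, and the half-split rotation convention are illustrative assumptions, not OLMo-core's actual rope.py API.

```python
import torch

def rope_angles(head_dim: int, seq_len: int, theta: float = 10_000.0):
    # One rotation frequency per pair of dimensions, as in standard RoPE.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); rotate the two halves of each head.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    cos = cos[None, :, None, :]  # broadcast over batch and heads
    sin = sin[None, :, None, :]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
```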

Feed-Forward Networks

The feed_forward.py module implements the feed-forward component of transformer layers. It supports standard MLP architectures, gated variants (SwiGLU, GeGLU), and configurable activation functions. The implementation includes proper initialization and supports tensor parallelism.
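
As an illustration of the gated variant, here is a minimal SwiGLU block in plain PyTorch; the class and argument names are assumptions rather than the feed_forward.py interface, and tensor parallelism is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, hidden, bias=False)
        self.up = nn.Linear(d_model, hidden, bias=False)
        self.down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```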

Mixture of Experts

The moe directory implements Mixture of Experts architectures. The implementation supports both standard MoE with token dropping and dropless MoE using grouped GEMMs. Load balancing losses encourage even expert utilization, and the auxiliary loss can be configured independently.
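
The load-balancing idea can be sketched with the Switch-Transformer-style auxiliary loss, which penalizes the product of each expert's token fraction and mean router probability. This is a generic formulation for illustration, not necessarily the exact loss used in the moe directory.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int, top_k: int = 2):
    # router_logits: (num_tokens, num_experts)
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert under top-k routing.
    top_k_idx = probs.topk(top_k, dim=-1).indices
    dispatch = F.one_hot(top_k_idx, num_experts).float().sum(dim=1)  # (num_tokens, num_experts)
    tokens_per_expert = dispatch.mean(dim=0)
    # Mean router probability assigned to each expert.
    prob_per_expert = probs.mean(dim=0)
    # Minimized when both distributions are uniform across experts.
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```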

Transformer Architecture

The transformer directory composes attention, feed-forward, and normalization into complete transformer blocks and models. The architecture is configurable through dataclass configs that specify layer counts, dimensions, attention heads, and other hyperparameters.

Layer normalization options include standard LayerNorm, RMSNorm, and fused variants for performance. The residual stream supports pre-norm and post-norm configurations.
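
A stripped-down pre-norm block looks roughly like the following. It uses PyTorch's built-in LayerNorm and MultiheadAttention (RMSNorm and fused kernels would be drop-in substitutes), omits the causal mask and RoPE for brevity, and its names are illustrative rather than OLMo-core's.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: x + attn(norm(x)), then x + mlp(norm(x))."""

    def __init__(self, d_model: int, n_heads: int, mlp: nn.Module):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)   # RMSNorm is a common alternative
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal masking is omitted here to keep the sketch short.
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.mlp_norm(x))
```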

Language Model Head

The lm_head.py module implements the output projection and loss computation. It supports weight tying with the input embeddings and integrates with various loss implementations including fused cross-entropy for memory efficiency.
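
Weight tying itself is a one-line idea: the output projection reuses the embedding matrix. A minimal sketch, with illustrative sizes:

```python
import torch.nn as nn

vocab_size, d_model = 50_304, 4096
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight  # one parameter tensor serves both roles
```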

Training Infrastructure

The train module provides comprehensive training infrastructure.

The Trainer

The trainer.py file is the central coordinator for training. It manages the training loop, handles distributed training, coordinates checkpointing, and integrates with callbacks for extensibility.

The trainer accepts a TrainConfig that specifies all training parameters: model configuration, optimizer settings, data loading parameters, checkpoint frequency, and more. This config-driven approach ensures reproducibility and makes it easy to modify training runs.

The training loop follows a standard pattern: load batch, forward pass, compute loss, backward pass, optimizer step. But the implementation handles many complexities: gradient accumulation, gradient clipping, mixed precision, distributed synchronization, and checkpoint saving.
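
A bare-bones version of that loop, with gradient accumulation and clipping but none of the distributed, mixed-precision, or checkpointing machinery, might look like this. It assumes batches are dictionaries and the model returns an object with a .loss attribute; both are assumptions for illustration.

```python
import torch

def train_steps(model, optimizer, scheduler, loader, *, accum_steps=4, max_grad_norm=1.0):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader, start=1):
        loss = model(**batch).loss / accum_steps   # scale loss for accumulation
        loss.backward()
        if step % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```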

Training Modules

The train_module directory provides abstractions for different training scenarios. A training module encapsulates the model, optimizer, and any module-specific logic. This abstraction allows different training approaches (pre-training, fine-tuning, different objectives) to share the same trainer infrastructure.

Checkpointing

Checkpoint management in checkpoint.py handles saving and loading training state. Checkpoints include model weights, optimizer state, scheduler state, training progress (step count, tokens processed), and random number generator states for reproducibility.
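
Conceptually, the saved state is just a nested dictionary. The sketch below shows a single-process version with hypothetical field names; distributed checkpoints instead write one shard per rank.

```python
import random
import numpy as np
import torch

def build_checkpoint_state(model, optimizer, scheduler, step, tokens_seen):
    return {
        "model": model.state_dict(),
        "optim": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "step": step,
        "tokens_seen": tokens_seen,
        "rng": {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all(),
        },
    }

# torch.save(build_checkpoint_state(...), "step_1000.pt")
```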

The checkpoint format is designed for distributed training. When using tensor parallelism or pipeline parallelism, each rank saves its shard of the model. Loading handles resharding when the parallelism configuration changes between saving and loading.

OLMo checkpoints are stored in a consistent format that enables conversion to other frameworks (HuggingFace Transformers, vLLM) through the conversion utilities.

Callbacks

The callbacks directory provides hooks for extending trainer behavior. Callbacks can run at various points: before/after training steps, at checkpoint saves, at evaluation points, and at training completion.

Built-in callbacks include logging callbacks for metrics, evaluation callbacks that run periodic assessments, checkpoint callbacks that manage save logic, and profiling callbacks that gather performance data.
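
The hook pattern can be sketched as a small base class plus a concrete logging callback. The method names and the trainer attributes referenced (global_step, last_loss) are hypothetical, chosen only to illustrate the shape of the interface.

```python
class Callback:
    """Minimal callback interface; hook names are illustrative."""
    def pre_step(self, trainer): ...
    def post_step(self, trainer): ...
    def post_checkpoint(self, trainer, path): ...
    def post_train(self, trainer): ...

class LossLogger(Callback):
    def __init__(self, every: int = 10):
        self.every = every

    def post_step(self, trainer):
        # Assumes the trainer exposes a step counter and the latest loss value.
        if trainer.global_step % self.every == 0:
            print(f"step={trainer.global_step} loss={trainer.last_loss:.4f}")
```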

Optimization

The optim module implements optimizers and learning rate scheduling.

Optimizers

The optimizer implementations include AdamW with careful handling of weight decay and epsilon. For large-scale training, the optimizer state can dominate memory usage, so the implementations support various memory optimization strategies.
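
One common convention for careful weight-decay handling is to exclude biases and other one-dimensional parameters (such as norm scales) from decay. A sketch of that grouping, with assumed hyperparameter values:

```python
import torch

def build_optimizer(model, lr=3e-4, weight_decay=0.1):
    decay, no_decay = [], []
    for _, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters (biases, norm scales) typically get no weight decay.
        (no_decay if param.ndim < 2 else decay).append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95), eps=1e-8)
```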

Distributed optimizer variants shard optimizer state across ranks, reducing per-GPU memory at the cost of additional communication. This enables training larger models within fixed memory budgets.

Learning Rate Scheduling

Learning rate schedulers implement warmup, decay, and annealing patterns. The standard OLMo schedule uses linear warmup followed by cosine decay. Annealing scripts implement final-stage training with modified schedules.
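
A linear-warmup-plus-cosine-decay schedule can be expressed as a single LambdaLR multiplier. This is a generic sketch of the pattern, not OLMo-core's scheduler implementation:

```python
import math
import torch

def warmup_cosine(optimizer, warmup_steps: int, total_steps: int, min_ratio: float = 0.1):
    # Linear warmup to the peak LR, then cosine decay down to min_ratio * peak.
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return min_ratio + (1.0 - min_ratio) * cosine
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```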

Schedulers are configured through dataclasses that specify warmup steps, decay type, minimum learning rate, and total training steps.

Data Pipeline

The data module handles data loading for training.

Data Formats

OLMo uses pre-tokenized data stored in memory-mapped formats for efficient loading. The memmap format allows random access without loading entire datasets into memory, crucial for training on datasets with trillions of tokens.
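
The access pattern is easy to sketch with NumPy's memmap; the file name, dtype, and flat token layout below are assumptions for illustration.

```python
import numpy as np

# Pre-tokenized data stored as a flat on-disk array of token IDs (illustrative layout).
tokens = np.memmap("train_tokens.bin", dtype=np.uint32, mode="r")

def get_sequence(index: int, seq_len: int) -> np.ndarray:
    # Random access: only the pages backing this slice are actually read from disk.
    start = index * seq_len
    return np.asarray(tokens[start : start + seq_len])
```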

The data module supports various input formats and handles conversion to the efficient training format. Metadata tracks document boundaries, enabling proper handling of attention masks across concatenated documents.

Data Mixing

Training on multiple data sources requires mixing strategies. The data module supports weighted mixing where different sources contribute different proportions to training batches. Mixing weights can change during training following predefined schedules.
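
A toy version of weighted mixing draws a source per example according to the mixing weights. The source samplers here are placeholders standing in for real dataset readers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-source samplers; in practice these index into memmapped datasets.
sources = {"web": lambda: "web doc", "code": lambda: "code doc", "papers": lambda: "paper doc"}
weights = np.array([0.7, 0.2, 0.1])

def sample_batch(batch_size: int):
    names = list(sources)
    picks = rng.choice(names, size=batch_size, p=weights / weights.sum())
    return [sources[name]() for name in picks]
```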

Sequence Packing

For efficiency, sequences are packed into fixed-length chunks. The packing handles document boundaries appropriately—attention doesn't cross between documents even within packed sequences. Position IDs are adjusted to reflect the actual position within each document.
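
The bookkeeping for packed sequences can be sketched as two helpers: per-document position IDs that restart at zero, and a block-diagonal causal mask that prevents cross-document attention. The helper names are illustrative.

```python
import torch

def packed_position_ids(doc_lens: list[int]) -> torch.Tensor:
    # Positions restart at 0 for each document inside the packed sequence.
    return torch.cat([torch.arange(n) for n in doc_lens])

def packed_attention_mask(doc_lens: list[int]) -> torch.Tensor:
    # Block-diagonal causal mask: tokens attend only within their own document.
    total = sum(doc_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in doc_lens:
        mask[start : start + n, start : start + n] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask

# packed_position_ids([3, 2]) -> tensor([0, 1, 2, 0, 1])
```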

Distributed Training

The distributed module provides infrastructure for multi-GPU and multi-node training.

Parallelism Strategies

OLMo-core supports multiple parallelism strategies. Data parallelism replicates the model across ranks, each processing different batches. Tensor parallelism splits individual layers across ranks, reducing per-GPU memory. Pipeline parallelism distributes layers across ranks in a pipeline.

Fully Sharded Data Parallel (FSDP) shards model parameters, gradients, and optimizer state across ranks. The implementation integrates with PyTorch's FSDP while providing OLMo-specific configuration and utilities.
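
Using PyTorch's FSDP API directly, the wrapping described above looks roughly like this; block_cls stands in for the model's transformer block class, and the mixed-precision and sharding choices shown are assumptions rather than OLMo-core's defaults.

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def wrap_with_fsdp(model: torch.nn.Module, block_cls: type) -> FSDP:
    # Shard parameters at transformer-block granularity; assumes torch.distributed
    # has already been initialized (e.g. via torchrun).
    policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={block_cls})
    return FSDP(
        model,
        auto_wrap_policy=policy,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
        device_id=torch.cuda.current_device(),
    )
```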

Communication

Communication primitives handle the collective operations needed for distributed training: all-reduce for gradient synchronization, all-gather for collecting sharded tensors, and reduce-scatter for distributing gradients. The implementation handles different communication backends and optimizes for network topology when possible.
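
For example, gradient synchronization in data parallelism reduces to an all-reduce followed by an average. A minimal sketch, assuming a process group has already been initialized under torchrun:

```python
import torch
import torch.distributed as dist

# Requires dist.init_process_group(...) to have been called on every rank.
grad = torch.ones(4) * dist.get_rank()
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()  # every rank now holds the averaged gradient
```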

Sequence Parallelism

For very long sequences, sequence parallelism distributes the sequence dimension across ranks. The ring attention implementation enables efficient attention computation across distributed sequences, crucial for training with long context lengths.

Float8 Training

The float8 directory implements 8-bit floating point training for memory efficiency.

Float8 training quantizes activations and weights to 8-bit formats during forward and backward passes. This reduces memory usage and can improve throughput on hardware with float8 support (like NVIDIA H100). The implementation handles the scaling and conversion automatically while maintaining training stability.
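
The scaling idea can be illustrated with a per-tensor cast in plain PyTorch; production float8 training (for example via torchao) also handles scaled matmuls, delayed scaling, and gradient casting, none of which are shown here.

```python
import torch

def to_float8(x: torch.Tensor, dtype=torch.float8_e4m3fn):
    # Conceptual per-tensor scaling: map the tensor's max magnitude onto the
    # format's representable range before casting down to 8 bits.
    finfo = torch.finfo(dtype)
    scale = finfo.max / x.abs().max().clamp(min=1e-12)
    x_f8 = (x * scale).clamp(finfo.min, finfo.max).to(dtype)
    return x_f8, scale  # keep the scale to dequantize: x_f8.float() / scale
```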

Model Ladder

The model_ladder.py module implements the "model ladder" concept—a series of smaller models used to predict optimal hyperparameters for larger models.

Training large models is expensive, and hyperparameter tuning at scale is impractical. The model ladder trains progressively larger models, each informing hyperparameter choices for the next. This systematic approach to hyperparameter transfer significantly reduces the cost of training frontier models.

Evaluation

The eval directory provides evaluation infrastructure.

In-Training Evaluation

Periodic evaluation during training tracks model quality. The evaluation callbacks run standard benchmarks at configurable intervals, logging results for monitoring training progress.

Benchmark Integration

The evaluation code integrates with standard LLM benchmarks. This enables direct comparison with other models and tracking of capability development during training.

Launch Utilities

The launch directory provides utilities for launching training jobs.

Cluster Integration

The launch utilities handle different cluster environments. Beaker integration supports AI2's internal cluster, while standard torchrun support enables running on any PyTorch-compatible cluster.

Job configuration handles resource requests, environment setup, and coordination across multiple nodes. The launch scripts used for official OLMo training are included as references.

Configuration System

OLMo-core uses dataclasses for configuration, providing type safety and easy serialization.

Config Dataclasses

All major components have associated config dataclasses. ModelConfig specifies architecture parameters. TrainConfig specifies training parameters. OptimizerConfig specifies optimizer settings. These configs compose hierarchically—a TrainConfig contains a ModelConfig, OptimizerConfig, and DataConfig.
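
A toy version of that hierarchy shows how the composition and serialization work; the field names and defaults are assumptions, not OLMo-core's actual schema.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ModelConfig:
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32

@dataclass
class OptimizerConfig:
    lr: float = 3e-4
    weight_decay: float = 0.1

@dataclass
class TrainConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
    global_batch_size: int = 1024
    max_steps: int = 100_000

config = TrainConfig()
print(asdict(config))  # serializes cleanly for logging and reproducibility
```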

Config Override

Command-line override of config values uses a dot-notation path syntax. This enables modifying specific values without changing config files, useful for hyperparameter sweeps and debugging.
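
Dot-notation overrides reduce to walking attribute paths on the nested config. A minimal sketch, reusing the toy TrainConfig from the previous example:

```python
def apply_override(config, dotted_key: str, value) -> None:
    # e.g. apply_override(cfg, "optimizer.lr", 1e-4) walks the nested
    # dataclasses and sets the leaf attribute.
    *path, leaf = dotted_key.split(".")
    target = config
    for part in path:
        target = getattr(target, part)
    setattr(target, leaf, value)

cfg = TrainConfig()                      # toy config from the sketch above
apply_override(cfg, "optimizer.lr", 1e-4)
```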

