
OLMo-core: A Deep Dive into Allen AI's Open LLM Training Pipeline

A comprehensive exploration of OLMo-core's architecture—examining Allen AI's building blocks for fully open LLM training including the distributed trainer, model architectures, optimization strategies, and the design that enabled OLMo-2 and OLMo-3 training.


Introduction

Allen AI's OLMo (Open Language Model) project represents one of the most ambitious efforts in open LLM development. Unlike other open-weight models that release only the final checkpoint, OLMo releases everything: model weights, training code, data, and intermediate checkpoints. The OLMo-core library provides the foundational building blocks that make this possible.

This post explores OLMo-core's architecture by examining its Python implementation. We'll understand how the distributed trainer orchestrates multi-node training, how the model architecture is defined, how checkpointing enables reproducibility, and how the modular design allows customization for different model sizes and configurations.

Project Philosophy

OLMo-core embodies the philosophy that advancing AI requires transparency. The codebase is designed not just for internal use but as a reference implementation that others can study, modify, and build upon. Every design decision prioritizes clarity and reproducibility over clever optimizations that obscure behavior.

The official training scripts for OLMo-2 and OLMo-3 are included directly in the repository. These aren't simplified examples but the actual scripts used to train the released models. This transparency allows researchers to understand exactly how frontier models are built.

Core Architecture

OLMo-core is organized into focused modules that can be composed for different training scenarios.

Module Organization

The main package contains several key directories. The nn directory provides neural network building blocks including attention, feed-forward layers, and transformer blocks. The train directory contains the trainer and training-related utilities. The optim directory implements optimizers and learning rate schedulers. The data directory handles data loading and preprocessing. The distributed directory manages multi-GPU and multi-node coordination.

Additional utilities include io for checkpoint and data I/O, eval for evaluation during training, and generate for inference after training.

Neural Network Components

The nn module provides the core model components, designed for flexibility while maintaining performance.

Attention Mechanisms

The attention directory implements multi-head attention with multiple backends. The module supports standard attention, Flash Attention 2, Flash Attention 3, and ring attention for sequence parallelism. Backend selection happens automatically based on available libraries and configuration.

Rotary Position Embeddings (RoPE) are implemented in rope.py with extensive support for different scaling strategies. The implementation supports standard RoPE, YaRN scaling for extended context, and various interpolation methods. The attention module uses these embeddings transparently based on configuration.
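
To make the rotation concrete, here is a minimal sketch of standard (non-scaled) RoPE applied to a query or key tensor. The function names, tensor layout, and the half-split rotation convention are illustrative assumptions, not OLMo-core's actual rope.py API.

```python
import torch

def rope_angles(head_dim: int, seq_len: int, theta: float = 10_000.0):
    # One rotation frequency per pair of dimensions, as in standard RoPE.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); rotate the two halves of each head.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    cos = cos[None, :, None, :]  # broadcast over batch and heads
    sin = sin[None, :, None, :]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
```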

Feed-Forward Networks

The feed_forward.py module implements the feed-forward component of transformer layers. It supports standard MLP architectures, gated variants (SwiGLU, GeGLU), and configurable activation functions. The implementation includes proper initialization and supports tensor parallelism.
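
As an illustration of the gated variant, here is a minimal SwiGLU block in plain PyTorch; the class and argument names are assumptions rather than the feed_forward.py interface, and tensor parallelism is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, hidden, bias=False)
        self.up = nn.Linear(d_model, hidden, bias=False)
        self.down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```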

Mixture of Experts

The moe directory implements Mixture of Experts architectures. The implementation supports both standard MoE with token dropping and dropless MoE using grouped GEMMs. Load balancing losses encourage even expert utilization, and the auxiliary loss can be configured independently.
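
The load-balancing idea can be sketched with the Switch-Transformer-style auxiliary loss, which penalizes the product of each expert's token fraction and mean router probability. This is a generic formulation for illustration, not necessarily the exact loss used in the moe directory.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int, top_k: int = 2):
    # router_logits: (num_tokens, num_experts)
    probs = F.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert under top-k routing.
    top_k_idx = probs.topk(top_k, dim=-1).indices
    dispatch = F.one_hot(top_k_idx, num_experts).float().sum(dim=1)  # (num_tokens, num_experts)
    tokens_per_expert = dispatch.mean(dim=0)
    # Mean router probability assigned to each expert.
    prob_per_expert = probs.mean(dim=0)
    # Minimized when both distributions are uniform across experts.
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```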

Transformer Architecture

The transformer directory composes attention, feed-forward, and normalization into complete transformer blocks and models. The architecture is configurable through dataclass configs that specify layer counts, dimensions, attention heads, and other hyperparameters.

Layer normalization options include standard LayerNorm, RMSNorm, and fused variants for performance. The residual stream supports pre-norm and post-norm configurations.
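
A stripped-down pre-norm block looks roughly like the following. It uses PyTorch's built-in LayerNorm and MultiheadAttention (RMSNorm and fused kernels would be drop-in substitutes), omits the causal mask and RoPE for brevity, and its names are illustrative rather than OLMo-core's.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: x + attn(norm(x)), then x + mlp(norm(x))."""

    def __init__(self, d_model: int, n_heads: int, mlp: nn.Module):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)   # RMSNorm is a common alternative
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal masking is omitted here to keep the sketch short.
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.mlp_norm(x))
```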

Language Model Head

The lm_head.py module implements the output projection and loss computation. It supports weight tying with the input embeddings and integrates with various loss implementations including fused cross-entropy for memory efficiency.
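
Weight tying itself is a one-line idea: the output projection reuses the embedding matrix. A minimal sketch, with illustrative sizes:

```python
import torch.nn as nn

vocab_size, d_model = 50_304, 4096
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight  # one parameter tensor serves both roles
```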

Training Infrastructure

The train module provides comprehensive training infrastructure.

The Trainer

The trainer.py file is the central coordinator for training. It manages the training loop, handles distributed training, coordinates checkpointing, and integrates with callbacks for extensibility.

The trainer accepts a TrainConfig that specifies all training parameters: model configuration, optimizer settings, data loading parameters, checkpoint frequency, and more. This config-driven approach ensures reproducibility and makes it easy to modify training runs.

The training loop follows a standard pattern: load batch, forward pass, compute loss, backward pass, optimizer step. But the implementation handles many complexities: gradient accumulation, gradient clipping, mixed precision, distributed synchronization, and checkpoint saving.
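
A bare-bones version of that loop, with gradient accumulation and clipping but none of the distributed, mixed-precision, or checkpointing machinery, might look like this. It assumes batches are dictionaries and the model returns an object with a .loss attribute; both are assumptions for illustration.

```python
import torch

def train_steps(model, optimizer, scheduler, loader, *, accum_steps=4, max_grad_norm=1.0):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader, start=1):
        loss = model(**batch).loss / accum_steps   # scale loss for accumulation
        loss.backward()
        if step % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```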

Training Modules

The train_module directory provides abstractions for different training scenarios. A training module encapsulates the model, optimizer, and any module-specific logic. This abstraction allows different training approaches (pre-training, fine-tuning, different objectives) to share the same trainer infrastructure.

Checkpointing

Checkpoint management in checkpoint.py handles saving and loading training state. Checkpoints include model weights, optimizer state, scheduler state, training progress (step count, tokens processed), and random number generator states for reproducibility.
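
Conceptually, the saved state is just a nested dictionary. The sketch below shows a single-process version with hypothetical field names; distributed checkpoints instead write one shard per rank.

```python
import random
import numpy as np
import torch

def build_checkpoint_state(model, optimizer, scheduler, step, tokens_seen):
    return {
        "model": model.state_dict(),
        "optim": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "step": step,
        "tokens_seen": tokens_seen,
        "rng": {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all(),
        },
    }

# torch.save(build_checkpoint_state(...), "step_1000.pt")
```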

The checkpoint format is designed for distributed training. When using tensor parallelism or pipeline parallelism, each rank saves its shard of the model. Loading handles resharding when the parallelism configuration changes between saving and loading.

OLMo checkpoints are stored in a consistent format that enables conversion to other frameworks (HuggingFace Transformers, vLLM) through the conversion utilities.

Callbacks

The callbacks directory provides hooks for extending trainer behavior. Callbacks can run at various points: before/after training steps, at checkpoint saves, at evaluation points, and at training completion.

Built-in callbacks include logging callbacks for metrics, evaluation callbacks that run periodic assessments, checkpoint callbacks that manage save logic, and profiling callbacks that gather performance data.
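
The hook pattern can be sketched as a small base class plus a concrete logging callback. The method names and the trainer attributes referenced (global_step, last_loss) are hypothetical, chosen only to illustrate the shape of the interface.

```python
class Callback:
    """Minimal callback interface; hook names are illustrative."""
    def pre_step(self, trainer): ...
    def post_step(self, trainer): ...
    def post_checkpoint(self, trainer, path): ...
    def post_train(self, trainer): ...

class LossLogger(Callback):
    def __init__(self, every: int = 10):
        self.every = every

    def post_step(self, trainer):
        # Assumes the trainer exposes a step counter and the latest loss value.
        if trainer.global_step % self.every == 0:
            print(f"step={trainer.global_step} loss={trainer.last_loss:.4f}")
```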

Optimization

The optim module implements optimizers and learning rate scheduling.

Optimizers

The optimizer implementations include AdamW with careful handling of weight decay and epsilon. For large-scale training, the optimizer state can dominate memory usage, so the implementations support various memory optimization strategies.
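
One common convention for careful weight-decay handling is to exclude biases and other one-dimensional parameters (such as norm scales) from decay. A sketch of that grouping, with assumed hyperparameter values:

```python
import torch

def build_optimizer(model, lr=3e-4, weight_decay=0.1):
    decay, no_decay = [], []
    for _, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters (biases, norm scales) typically get no weight decay.
        (no_decay if param.ndim < 2 else decay).append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95), eps=1e-8)
```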

Distributed optimizer variants shard optimizer state across ranks, reducing per-GPU memory at the cost of additional communication. This enables training larger models within fixed memory budgets.

Learning Rate Scheduling

Learning rate schedulers implement warmup, decay, and annealing patterns. The standard OLMo schedule uses linear warmup followed by cosine decay. Annealing scripts implement final-stage training with modified schedules.
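
A linear-warmup-plus-cosine-decay schedule can be expressed as a single LambdaLR multiplier. This is a generic sketch of the pattern, not OLMo-core's scheduler implementation:

```python
import math
import torch

def warmup_cosine(optimizer, warmup_steps: int, total_steps: int, min_ratio: float = 0.1):
    # Linear warmup to the peak LR, then cosine decay down to min_ratio * peak.
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return min_ratio + (1.0 - min_ratio) * cosine
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```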

Schedulers are configured through dataclasses that specify warmup steps, decay type, minimum learning rate, and total training steps.

Data Pipeline

The data module handles data loading for training.

Data Formats

OLMo uses pre-tokenized data stored in memory-mapped formats for efficient loading. The memmap format allows random access without loading entire datasets into memory, crucial for training on datasets with trillions of tokens.
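
The access pattern is easy to sketch with NumPy's memmap; the file name, dtype, and flat token layout below are assumptions for illustration.

```python
import numpy as np

# Pre-tokenized data stored as a flat on-disk array of token IDs (illustrative layout).
tokens = np.memmap("train_tokens.bin", dtype=np.uint32, mode="r")

def get_sequence(index: int, seq_len: int) -> np.ndarray:
    # Random access: only the pages backing this slice are actually read from disk.
    start = index * seq_len
    return np.asarray(tokens[start : start + seq_len])
```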

The data module supports various input formats and handles conversion to the efficient training format. Metadata tracks document boundaries, enabling proper handling of attention masks across concatenated documents.

Data Mixing

Training on multiple data sources requires mixing strategies. The data module supports weighted mixing where different sources contribute different proportions to training batches. Mixing weights can change during training following predefined schedules.
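
A toy version of weighted mixing draws a source per example according to the mixing weights. The source samplers here are placeholders standing in for real dataset readers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-source samplers; in practice these index into memmapped datasets.
sources = {"web": lambda: "web doc", "code": lambda: "code doc", "papers": lambda: "paper doc"}
weights = np.array([0.7, 0.2, 0.1])

def sample_batch(batch_size: int):
    names = list(sources)
    picks = rng.choice(names, size=batch_size, p=weights / weights.sum())
    return [sources[name]() for name in picks]
```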

Sequence Packing

For efficiency, sequences are packed into fixed-length chunks. The packing handles document boundaries appropriately—attention doesn't cross between documents even within packed sequences. Position IDs are adjusted to reflect the actual position within each document.
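
The bookkeeping for packed sequences can be sketched as two helpers: per-document position IDs that restart at zero, and a block-diagonal causal mask that prevents cross-document attention. The helper names are illustrative.

```python
import torch

def packed_position_ids(doc_lens: list[int]) -> torch.Tensor:
    # Positions restart at 0 for each document inside the packed sequence.
    return torch.cat([torch.arange(n) for n in doc_lens])

def packed_attention_mask(doc_lens: list[int]) -> torch.Tensor:
    # Block-diagonal causal mask: tokens attend only within their own document.
    total = sum(doc_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in doc_lens:
        mask[start : start + n, start : start + n] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask

# packed_position_ids([3, 2]) -> tensor([0, 1, 2, 0, 1])
```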

Distributed Training

The distributed module provides infrastructure for multi-GPU and multi-node training.

Parallelism Strategies

OLMo-core supports multiple parallelism strategies. Data parallelism replicates the model across ranks, each processing different batches. Tensor parallelism splits individual layers across ranks, reducing per-GPU memory. Pipeline parallelism distributes layers across ranks in a pipeline.

Fully Sharded Data Parallel (FSDP) shards model parameters, gradients, and optimizer state across ranks. The implementation integrates with PyTorch's FSDP while providing OLMo-specific configuration and utilities.
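
Using PyTorch's FSDP API directly, the wrapping described above looks roughly like this; block_cls stands in for the model's transformer block class, and the mixed-precision and sharding choices shown are assumptions rather than OLMo-core's defaults.

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def wrap_with_fsdp(model: torch.nn.Module, block_cls: type) -> FSDP:
    # Shard parameters at transformer-block granularity; assumes torch.distributed
    # has already been initialized (e.g. via torchrun).
    policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={block_cls})
    return FSDP(
        model,
        auto_wrap_policy=policy,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
        device_id=torch.cuda.current_device(),
    )
```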

Communication

Communication primitives handle the collective operations needed for distributed training: all-reduce for gradient synchronization, all-gather for collecting sharded tensors, and reduce-scatter for distributing gradients. The implementation handles different communication backends and optimizes for network topology when possible.
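
For example, gradient synchronization in data parallelism reduces to an all-reduce followed by an average. A minimal sketch, assuming a process group has already been initialized under torchrun:

```python
import torch
import torch.distributed as dist

# Requires dist.init_process_group(...) to have been called on every rank.
grad = torch.ones(4) * dist.get_rank()
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()  # every rank now holds the averaged gradient
```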

Sequence Parallelism

For very long sequences, sequence parallelism distributes the sequence dimension across ranks. The ring attention implementation enables efficient attention computation across distributed sequences, crucial for training with long context lengths.

Float8 Training

The float8 directory implements 8-bit floating point training for memory efficiency.

Float8 training quantizes activations and weights to 8-bit formats during forward and backward passes. This reduces memory usage and can improve throughput on hardware with float8 support (like NVIDIA H100). The implementation handles the scaling and conversion automatically while maintaining training stability.
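
The scaling idea can be illustrated with a per-tensor cast in plain PyTorch; production float8 training (for example via torchao) also handles scaled matmuls, delayed scaling, and gradient casting, none of which are shown here.

```python
import torch

def to_float8(x: torch.Tensor, dtype=torch.float8_e4m3fn):
    # Conceptual per-tensor scaling: map the tensor's max magnitude onto the
    # format's representable range before casting down to 8 bits.
    finfo = torch.finfo(dtype)
    scale = finfo.max / x.abs().max().clamp(min=1e-12)
    x_f8 = (x * scale).clamp(finfo.min, finfo.max).to(dtype)
    return x_f8, scale  # keep the scale to dequantize: x_f8.float() / scale
```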

Model Ladder

The model_ladder.py module implements the "model ladder" concept—a series of smaller models used to predict optimal hyperparameters for larger models.

Training large models is expensive, and hyperparameter tuning at scale is impractical. The model ladder trains progressively larger models, each informing hyperparameter choices for the next. This systematic approach to hyperparameter transfer significantly reduces the cost of training frontier models.

Evaluation

The eval directory provides evaluation infrastructure.

In-Training Evaluation

Periodic evaluation during training tracks model quality. The evaluation callbacks run standard benchmarks at configurable intervals, logging results for monitoring training progress.

Benchmark Integration

The evaluation code integrates with standard LLM benchmarks. This enables direct comparison with other models and tracking of capability development during training.

Launch Utilities

The launch directory provides utilities for launching training jobs.

Cluster Integration

The launch utilities handle different cluster environments. Beaker integration supports AI2's internal cluster, while standard torchrun support enables running on any PyTorch-compatible cluster.

Job configuration handles resource requests, environment setup, and coordination across multiple nodes. The launch scripts used for official OLMo training are included as references.

Configuration System

OLMo-core uses dataclasses for configuration, providing type safety and easy serialization.

Config Dataclasses

All major components have associated config dataclasses. ModelConfig specifies architecture parameters. TrainConfig specifies training parameters. OptimizerConfig specifies optimizer settings. These configs compose hierarchically—a TrainConfig contains a ModelConfig, OptimizerConfig, and DataConfig.
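
A toy version of that hierarchy shows how the composition and serialization work; the field names and defaults are assumptions, not OLMo-core's actual schema.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ModelConfig:
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32

@dataclass
class OptimizerConfig:
    lr: float = 3e-4
    weight_decay: float = 0.1

@dataclass
class TrainConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
    global_batch_size: int = 1024
    max_steps: int = 100_000

config = TrainConfig()
print(asdict(config))  # serializes cleanly for logging and reproducibility
```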

Config Override

Command-line override of config values uses a dot-notation path syntax. This enables modifying specific values without changing config files, useful for hyperparameter sweeps and debugging.
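
Dot-notation overrides reduce to walking attribute paths on the nested config. A minimal sketch, reusing the toy TrainConfig from the previous example:

```python
def apply_override(config, dotted_key: str, value) -> None:
    # e.g. apply_override(cfg, "optimizer.lr", 1e-4) walks the nested
    # dataclasses and sets the leaf attribute.
    *path, leaf = dotted_key.split(".")
    target = config
    for part in path:
        target = getattr(target, part)
    setattr(target, leaf, value)

cfg = TrainConfig()                      # toy config from the sketch above
apply_override(cfg, "optimizer.lr", 1e-4)
```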

