OLMoE#
OLMoE is a 7B-parameter Mixture-of-Experts (MoE) model from the Allen Institute for AI (AI2) featuring 64 experts with top-8 routing. The model is designed to be fully open-source, with training data, code, and model weights publicly available. It is named "OLMoE-1B-7B", where 1B refers to the activated parameters and 7B to the total parameters.
The latest version (OLMoE-1B-7B-0125, released January 2025) is an improved version of the original September 2024 release (OLMoE-1B-7B-0924), trained on 5T tokens with performance improvements across multiple benchmarks.
The model features 16 decoder layers with 64 routed experts per layer, activating 8 experts per token for a total of approximately 1.3B active parameters per forward pass out of 7B total.
OLMoE models are supported via the Bridge system with specialized configurations for MoE optimizations.
Model Architecture#
Parameters: 7B total, 1.3B activated per forward pass
Layers: 16 decoder layers
Attention: Multi-head attention with QK LayerNorm and RoPE
MoE: 64 routed experts per layer with top-8 routing
Hidden size: 2048
FFN hidden size: 1024 (dense and expert layers)
Attention heads: 16 query heads, 16 key-value groups
Vocab size: 50,304
Context Length: 4K tokens
Activation: SiLU with gated linear units
Training: 5T tokens (OLMoE-1B-7B-0125)
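These values can be cross-checked against the Hugging Face configuration. A minimal sketch, assuming the field names used by the Hugging Face OlmoeConfig (verify the exact names against your transformers version):
from transformers import AutoConfig

# Assumed OlmoeConfig field names; verify against your transformers version.
hf_cfg = AutoConfig.from_pretrained("allenai/OLMoE-1B-7B-0125")
print(hf_cfg.num_hidden_layers)                                 # 16 decoder layers
print(hf_cfg.hidden_size)                                       # 2048
print(hf_cfg.intermediate_size)                                 # 1024 (expert FFN hidden size)
print(hf_cfg.num_experts, hf_cfg.num_experts_per_tok)           # 64 experts, top-8 routing
print(hf_cfg.num_attention_heads, hf_cfg.num_key_value_heads)   # 16 / 16
print(hf_cfg.vocab_size)                                        # 50304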
Key Features#
QK LayerNorm: Applies LayerNorm to query and key projections for training stability
RoPE: Rotary Position Embeddings with base 10000
MoE Routing: Softmax-based router with auxiliary loss for load balancing (see the routing sketch after this list)
Router Pre-Softmax: Softmax is applied to router logits before top-k expert selection
Grouped GEMM: Optimized grouped matrix multiplications for expert computation
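The routing described above is standard top-k softmax gating. A minimal, self-contained sketch of top-8 selection over 64 experts (illustrative only, not the Megatron-Core implementation):
import torch
import torch.nn.functional as F

num_experts, top_k, hidden = 64, 8, 2048
router = torch.nn.Linear(hidden, num_experts, bias=False)

tokens = torch.randn(4, hidden)                  # a small batch of token embeddings
logits = router(tokens)                          # [tokens, num_experts]
probs = F.softmax(logits, dim=-1)                # pre-softmax routing: softmax before top-k
weights, expert_ids = probs.topk(top_k, dim=-1)  # top-8 experts and their gate weights per token
print(expert_ids.shape, weights.shape)           # torch.Size([4, 8]) twice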
Conversion with 🤗 Hugging Face#
Load HF → Megatron#
from megatron.bridge import AutoBridge
# Example: OLMoE-1B-7B-0125 (latest version)
bridge = AutoBridge.from_hf_pretrained("allenai/OLMoE-1B-7B-0125")
provider = bridge.to_megatron_provider()
# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1
provider.expert_model_parallel_size = 8
provider.sequence_parallel = False
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
# You can also use older versions:
# bridge = AutoBridge.from_hf_pretrained("allenai/OLMoE-1B-7B-0924")
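As a quick sanity check after conversion, you can count the parameters held by the instantiated model. A small sketch, assuming provide_distributed_model may return a list of model chunks:
chunks = model if isinstance(model, list) else [model]
total = sum(p.numel() for chunk in chunks for p in chunk.parameters())
# With EP=8, each rank holds 8 of the 64 experts, so the per-rank count is well below 7B.
print(f"Parameters on this rank: {total / 1e9:.2f}B")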
Export Megatron → HF#
# Convert from a Megatron checkpoint directory to HF format
bridge.export_ckpt(
megatron_path="/results/olmoe_7b/checkpoints/iter_0500000",
hf_path="./olmoe-hf-export",
)
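The exported directory can then be loaded back with standard Hugging Face APIs to spot-check the export, for example:
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./olmoe-hf-export")
hf_model = AutoModelForCausalLM.from_pretrained("./olmoe-hf-export")
inputs = tok("The capital of France is", return_tensors="pt")
out = hf_model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))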
Examples#
Checkpoint conversion: examples/conversion/convert_checkpoints.py
Recipes#
See: bridge.recipes.olmoe
Available Recipes#
Pretrain recipes:
olmoe_7b_pretrain_config: Pre-training for OLMoE-7B (7B parameters, 1.3B activated per token)
Finetune recipes:
olmoe_7b_finetune_config: Finetuning for OLMoE-7B with PEFT support (LoRA, DoRA)
Parallelism Configurations#
| Model | Mode | TP | PP | EP | Total GPUs | Use Case |
|---|---|---|---|---|---|---|
| OLMoE-7B | Pretrain | 1 | 1 | 8 | 8 | Pre-training (single node) |
| OLMoE-7B | Full SFT | 1 | 1 | 8 | 8 | Full supervised finetuning |
| OLMoE-7B | LoRA/DoRA | 1 | 1 | 1 | 8 | PEFT finetuning (single node) |
Key Features:
Expert Parallelism: EP=8 for efficient MoE training (64 experts)
Selective Recomputation: Enabled by default for memory optimization
RoPE Fusion: Optional RoPE fusion optimization (apply_rope_fusion=True)
MoE Optimizations: Grouped GEMM and permute fusion enabled by default
Performance Optimizations:
MoE Permute Fusion: Fused expert permutation operations
Grouped GEMM: Optimized expert computation
Router Load Balancing: Auxiliary loss for balanced expert utilization (see the sketch after this list)
Manual GC: Aggressive garbage collection (interval=5)
Precision-Aware Optimizer: BF16 gradients and optimizer states with FP32 master weights
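The load-balancing term mentioned above is the standard Switch-Transformer-style auxiliary loss. A minimal sketch of how such a loss is computed from router probabilities and expert assignments (illustrative, not the Megatron-Core code; the result is scaled by the aux loss coefficient in the config):
import torch

def load_balancing_aux_loss(router_probs, expert_ids, num_experts=64):
    # router_probs: [tokens, num_experts] softmax outputs; expert_ids: [tokens, top_k] selected experts
    # f_i: fraction of token-to-expert assignments routed to expert i
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    f = counts / counts.sum()
    # p_i: mean router probability assigned to expert i
    p = router_probs.mean(dim=0)
    # Switch-Transformer-style balance loss: minimized when assignments are uniform
    return num_experts * torch.sum(f * p)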
Pipeline Layouts (optional):
PP=1: No pipelining (default)
PP=2: 8+8 layer split with embedding/loss
PP=4: 4+4+4+4 layer split
VP: PP=2, VP=2 supported
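The non-default layouts can be requested by overriding the parallelism fields on a recipe config before training. A hedged sketch; the attribute paths below are assumptions to verify against your Megatron Bridge version:
from megatron.bridge.recipes.olmoe import olmoe_7b_pretrain_config

cfg = olmoe_7b_pretrain_config(name="olmoe_pp2", data_paths=["/path/to/dataset"])
# Assumed attribute paths: split the 16 layers 8+8 across two pipeline stages,
# with 2 virtual pipeline stages per rank (PP=2, VP=2).
cfg.model.pipeline_model_parallel_size = 2
cfg.model.virtual_pipeline_model_parallel_size = 2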
Pre-training Example#
from megatron.bridge.recipes.olmoe import olmoe_7b_pretrain_config
cfg = olmoe_7b_pretrain_config(
name="olmoe_pretrain",
data_paths=["/path/to/dataset.nvjsonl"],
dir="/results/olmoe_7b",
train_iters=500_000,
global_batch_size=2048,
seq_length=4096,
# Uses TP=1, PP=1, EP=8 (8 GPUs) automatically
)
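To run the configured job, launch one process per GPU (for example with torchrun on the 8 GPUs implied by EP=8). A hedged launch sketch; the pretrain entry point and forward_step import paths are assumptions to verify against your installed Megatron Bridge version:
# Assumed entry points; verify these import paths for your installed version.
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

pretrain(cfg, forward_step)
# e.g. saved as train_olmoe.py and launched with:
#   torchrun --nproc-per-node=8 train_olmoe.py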
Finetuning Examples#
Full Finetuning#
from megatron.bridge.recipes.olmoe import olmoe_7b_finetune_config
cfg = olmoe_7b_finetune_config(
tokenizer_path="allenai/OLMoE-1B-7B-0125",
name="olmoe_full_sft",
pretrained_checkpoint="path/to/olmoe/checkpoint",
peft=None, # Full supervised finetuning
train_iters=1000,
global_batch_size=128,
finetune_lr=5e-6,
# Uses TP=1, PP=1, EP=8 (8 GPUs) automatically
)
LoRA Finetuning#
from megatron.bridge.recipes.olmoe import olmoe_7b_finetune_config
cfg = olmoe_7b_finetune_config(
tokenizer_path="allenai/OLMoE-1B-7B-0125",
name="olmoe_lora_finetune",
pretrained_checkpoint="path/to/olmoe/checkpoint",
peft="lora", # or "dora" for DoRA
train_iters=1000,
global_batch_size=128,
finetune_lr=1e-4,
# Uses TP=1, PP=1, EP=1 (8 GPUs) automatically
)
Hugging Face model cards#
Latest (January 2025)#
OLMoE-1B-7B-0125 (Base): allenai/OLMoE-1B-7B-0125
OLMoE-1B-7B-0125-SFT: allenai/OLMoE-1B-7B-0125-SFT
OLMoE-1B-7B-0125-Instruct: allenai/OLMoE-1B-7B-0125-Instruct
Previous (September 2024)#
OLMoE-1B-7B-0924 (Base): allenai/OLMoE-1B-7B-0924
OLMoE-1B-7B-0924-Instruct: allenai/OLMoE-1B-7B-0924-Instruct
Technical Resources#
OLMoE Paper: OLMoE: Open Mixture-of-Experts Language Models
OLMoE Model Card (Latest): HuggingFace Model Card
OLMoE GitHub Repository: allenai/OLMoE