OLMoE#

OLMoE is a 7B-parameter Mixture-of-Experts (MoE) model from the Allen Institute for AI (AI2) featuring 64 experts with top-8 routing. The model is fully open source, with training data, code, and model weights publicly available. It's named "OLMoE-1B-7B", where 1B refers to the activated parameters and 7B to the total parameters.

The latest version, OLMoE-1B-7B-0125 (released January 2025), is an improved revision of the original September 2024 release (OLMoE-1B-7B-0924); it was trained on 5T tokens and improves performance across multiple benchmarks.

The model features 16 decoder layers with 64 routed experts per layer, activating 8 experts per token for a total of approximately 1.3B active parameters per forward pass out of 7B total.
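The headline parameter counts can be sanity-checked with quick arithmetic from the architecture numbers listed below. This is a rough sketch only: it ignores norms and biases and assumes untied input/output embeddings.

# Rough parameter count from the architecture numbers below
# (ignores norms/biases; assumes untied embeddings).
hidden, ffn, layers = 2048, 1024, 16
experts, top_k, vocab = 64, 8, 50304

attn = 4 * hidden * hidden   # Q, K, V, O projections per layer
expert = 3 * hidden * ffn    # gate/up/down of one SwiGLU expert
router = hidden * experts    # router weights per layer
embed = 2 * vocab * hidden   # input embedding + output head

total = layers * (attn + router + experts * expert) + embed
active = layers * (attn + router + top_k * expert) + embed
print(f"total  ~{total / 1e9:.1f}B")   # ~6.9B
print(f"active ~{active / 1e9:.1f}B")  # ~1.3B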

OLMoE models are supported via the Bridge system with specialized configurations for MoE optimizations.

Model Architecture#

  • Parameters: 7B total, 1.3B activated per forward pass

  • Layers: 16 decoder layers

  • Attention: Multi-head attention with QK LayerNorm and RoPE

  • MoE: 64 routed experts per layer with top-8 routing

  • Hidden size: 2048

  • FFN hidden size: 1024 for both dense and expert (MoE) layers

  • Attention heads: 16 query heads, 16 key-value groups

  • Vocab size: 50,304

  • Context length: 4K (4,096) tokens

  • Activation: SiLU with gated linear units

  • Training: 5T tokens (OLMoE-1B-7B-0125)
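These values can be cross-checked against the published Hugging Face configuration. A minimal sketch, assuming a transformers version with OLMoE support (field names per OlmoeConfig):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("allenai/OLMoE-1B-7B-0125")
print(cfg.num_hidden_layers)        # 16
print(cfg.hidden_size)              # 2048
print(cfg.intermediate_size)        # 1024
print(cfg.num_experts)              # 64
print(cfg.num_experts_per_tok)      # 8
print(cfg.max_position_embeddings)  # 4096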

Key Features#

  • QK LayerNorm: Applies LayerNorm to query and key projections for training stability

  • RoPE: Rotary Position Embeddings with base 10000

  • MoE Routing: Softmax-based router with auxiliary loss for load balancing

  • Router Pre-Softmax: Applies softmax over all 64 expert scores before top-8 selection (illustrated in the sketch after this list)

  • Grouped GEMM: Optimized grouped matrix multiplications for expert computation
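To make the routing concrete, here is a minimal PyTorch sketch of pre-softmax top-8 routing with a simplified Switch-style load-balancing loss. It is illustrative only, not Megatron's exact implementation:

import torch
import torch.nn.functional as F

def route(hidden_states, router_weight, top_k=8, aux_coef=0.01):
    # hidden_states: [num_tokens, hidden]; router_weight: [num_experts, hidden]
    logits = hidden_states @ router_weight.t()      # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)               # pre-softmax: normalize over all experts first
    top_probs, top_idx = probs.topk(top_k, dim=-1)  # then select the top-8 per token

    # Simplified auxiliary loss: fraction of tokens routed to each expert
    # times that expert's mean routing probability, summed over experts.
    num_experts = logits.size(-1)
    assigned = F.one_hot(top_idx, num_experts).sum(dim=1).float()  # [num_tokens, num_experts]
    frac_tokens = assigned.mean(dim=0) / top_k
    aux_loss = aux_coef * num_experts * (frac_tokens * probs.mean(dim=0)).sum()
    return top_probs, top_idx, aux_loss

top_probs, top_idx, aux_loss = route(torch.randn(4, 2048), torch.randn(64, 2048))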

Conversion with 🤗 Hugging Face#

Load HF → Megatron#

from megatron.bridge import AutoBridge

# Example: OLMoE-1B-7B-0125 (latest version)
bridge = AutoBridge.from_hf_pretrained("allenai/OLMoE-1B-7B-0125")
provider = bridge.to_megatron_provider()

# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1
provider.expert_model_parallel_size = 8
provider.sequence_parallel = False

provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
# You can also use older versions:
# bridge = AutoBridge.from_hf_pretrained("allenai/OLMoE-1B-7B-0924")

Export Megatron → HF#

# Convert from a Megatron checkpoint directory to HF format
bridge.export_ckpt(
    megatron_path="/results/olmoe_7b/checkpoints/iter_0500000",
    hf_path="./olmoe-hf-export",
)
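If the model is already instantiated in memory (as in the loading example above), the bridge can also write HF-format weights directly; save_hf_pretrained is assumed here to be available per recent Megatron Bridge releases:

# Save an in-memory Megatron model back to HF format
bridge.save_hf_pretrained(model, "./olmoe-hf-export")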

Examples#

Recipes#

See: bridge.recipes.olmoe

Available Recipes#

  • Pretrain recipes:

    • olmoe_7b_pretrain_config: Pre-training for OLMoE-7B (7B parameters, 1.3B activated per token)

  • Finetune recipes:

    • olmoe_7b_finetune_config: Finetuning for OLMoE-7B with PEFT support (LoRA, DoRA)

Parallelism Configurations#

| Model    | Mode      | TP | PP | EP | Total GPUs | Use Case                      |
|----------|-----------|----|----|----|------------|-------------------------------|
| OLMoE-7B | Pretrain  | 1  | 1  | 8  | 8          | Pre-training (single node)    |
| OLMoE-7B | Full SFT  | 1  | 1  | 8  | 8          | Full supervised finetuning    |
| OLMoE-7B | LoRA/DoRA | 1  | 1  | 1  | 8          | PEFT finetuning (single node) |

Key Features:

  • Expert Parallelism: EP=8 for efficient MoE training (64 experts, 8 per GPU)

  • Selective Recomputation: Enabled by default for memory optimization

  • RoPE Fusion: Optional fused rotary embedding kernels (apply_rope_fusion=True)

  • MoE Optimizations: Grouped GEMM and permute fusion enabled by default
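These features correspond to fields on the model config inside the recipe-produced config container. A hypothetical override sketch, using cfg as built in the examples below (field names follow Megatron-Core's TransformerConfig and may vary across versions):

# Hypothetical manual overrides on a recipe config;
# field names follow Megatron-Core's TransformerConfig.
cfg.model.expert_model_parallel_size = 8       # 64 experts / EP=8 = 8 experts per GPU
cfg.model.moe_grouped_gemm = True              # grouped GEMM for expert computation
cfg.model.moe_permute_fusion = True            # fused token (un)permutation
cfg.model.recompute_granularity = "selective"  # selective activation recomputation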

Performance Optimizations:

  • MoE Permute Fusion: Fused expert permutation operations

  • Grouped GEMM: Optimized expert computation

  • Router Load Balancing: Auxiliary loss for balanced expert utilization

  • Manual GC: Aggressive garbage collection (interval=5)

  • Precision-Aware Optimizer: BF16 gradients and optimizer states with FP32 master weights
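The last two items map to training and optimizer fields along these lines. This is a sketch only; the field names follow Megatron-Core's OptimizerConfig and the Bridge training config, and may differ in your installed version:

import torch

cfg.train.manual_gc = True                       # manual garbage collection...
cfg.train.manual_gc_interval = 5                 # ...every 5 iterations
cfg.optimizer.use_precision_aware_optimizer = True
cfg.optimizer.main_grads_dtype = torch.bfloat16  # BF16 gradients
cfg.optimizer.exp_avg_dtype = torch.bfloat16     # BF16 optimizer states
cfg.optimizer.exp_avg_sq_dtype = torch.bfloat16
cfg.optimizer.main_params_dtype = torch.float32  # FP32 master weights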

Pipeline Layouts (optional):

  • PP=1: No pipelining (default)

  • PP=2: 8+8 layer split with embedding/loss

  • PP=4: 4+4+4+4 layer split

  • VP: PP=2 with VP=2 supported
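For instance, a PP=2 run with virtual pipelining could be configured on the provider from the loading example (a sketch; field names follow Megatron-Core's ModelParallelConfig):

# Hypothetical PP=2, VP=2 override, set before provider.finalize()
provider.pipeline_model_parallel_size = 2
provider.virtual_pipeline_model_parallel_size = 2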

Pre-training Example#

from megatron.bridge.recipes.olmoe import olmoe_7b_pretrain_config

cfg = olmoe_7b_pretrain_config(
    name="olmoe_pretrain",
    data_paths=["/path/to/dataset.nvjsonl"],
    dir="/results/olmoe_7b",
    train_iters=500_000,
    global_batch_size=2048,
    seq_length=4096,
    # Uses TP=1, PP=1, EP=8 (8 GPUs) automatically
)
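The returned config container can then be passed to the Bridge training entry point. The import paths below follow the Megatron Bridge examples and should be verified against your installed version; launch with torchrun across the node's 8 GPUs:

from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

pretrain(config=cfg, forward_step_func=forward_step)

# Example launch:
#   torchrun --nproc-per-node=8 pretrain_olmoe.py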

Finetuning Examples#

Full Finetuning#

from megatron.bridge.recipes.olmoe import olmoe_7b_finetune_config

cfg = olmoe_7b_finetune_config(
    tokenizer_path="allenai/OLMoE-1B-7B-0125",
    name="olmoe_full_sft",
    pretrained_checkpoint="path/to/olmoe/checkpoint",
    peft=None,  # Full supervised finetuning
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=5e-6,
    # Uses TP=1, PP=1, EP=8 (8 GPUs) automatically
)

LoRA Finetuning#

from megatron.bridge.recipes.olmoe import olmoe_7b_finetune_config

cfg = olmoe_7b_finetune_config(
    tokenizer_path="allenai/OLMoE-1B-7B-0125",
    name="olmoe_lora_finetune",
    pretrained_checkpoint="path/to/olmoe/checkpoint",
    peft="lora",  # or "dora" for DoRA
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
    # Uses TP=1, PP=1, EP=1 (8 GPUs) automatically
)

Hugging Face model cards#

Latest (January 2025)#

  • allenai/OLMoE-1B-7B-0125

Previous (September 2024)#

  • allenai/OLMoE-1B-7B-0924

Technical Resources#