Moonlight#
Moonlight is a 16B-parameter Mixture-of-Experts (MoE) model from Moonshot AI, trained on 5.7T tokens with the Muon optimizer. While Moonlight shares the same architecture as DeepSeek-V3 (featuring Multi-head Latent Attention and MoE), it is a distinct model that advances the Pareto frontier of performance versus training FLOPs: Muon is roughly 2× more sample-efficient than AdamW under compute-optimal training.
The model features 27 decoder layers with 64 routed experts and 8 shared experts per layer, with 3B activated parameters per forward pass out of 16B total parameters.
Moonlight models are supported via the Bridge system with specialized configurations for MoE and MLA optimizations.
Model Architecture#
Parameters: 16B total, 3B activated per forward pass
Layers: 27 decoder layers
Attention: Multi-head Latent Attention (MLA) with RoPE fusion support
MoE: 64 routed experts + 8 shared experts per layer
Hidden size: 2048
Intermediate size: 10944 (with MLP and expert gating)
Vocab size: 151,936
Context length: 8K tokens
Training: 5.7T tokens with Muon optimizer
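The reported shapes can be sanity-checked directly from the checkpoint's configuration. Below is a minimal sketch using 🤗 transformers; since Moonlight ships a DeepSeek-V3-style config, the attribute names are assumptions to verify against the downloaded config.json.
from transformers import AutoConfig
# Moonlight reuses the DeepSeek-V3 architecture, so trust_remote_code is
# required; attribute names assume the DeepSeek-style config schema.
config = AutoConfig.from_pretrained("moonshotai/Moonlight-16B-A3B", trust_remote_code=True)
print(config.num_hidden_layers)  # expected: 27
print(config.hidden_size)        # expected: 2048
print(config.n_routed_experts)   # expected: 64
print(config.vocab_size)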
Conversion with 🤗 Hugging Face#
Moonlight shares the same architecture as DeepSeek-V3, which makes it compatible with inference engines such as vLLM and SGLang. The model can be loaded from Hugging Face or used with Megatron checkpoints.
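For a quick compatibility check outside of Megatron, the HF checkpoint can be served directly. A minimal vLLM sketch (assumes a vLLM build with DeepSeek-V3 architecture support):
# Load the HF checkpoint in vLLM; trust_remote_code is required for the
# DeepSeek-V3-style modeling code.
from vllm import LLM, SamplingParams
llm = LLM(model="moonshotai/Moonlight-16B-A3B", trust_remote_code=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)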
Load HF → Megatron#
from megatron.bridge import AutoBridge
# Example: Moonlight-16B-A3B
bridge = AutoBridge.from_hf_pretrained("moonshotai/Moonlight-16B-A3B")
provider = bridge.to_megatron_provider()
# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 2
provider.pipeline_model_parallel_size = 1
provider.expert_model_parallel_size = 8
provider.sequence_parallel = True
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
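To sanity-check a conversion, the loaded model's weights can be streamed back in HF naming and compared against the source checkpoint. A sketch assuming the generator-style export_hf_weights API from the Bridge conversion examples:
# Stream weights in HF format; the exact signature is an assumption
# based on the Bridge conversion examples.
for name, weight in bridge.export_hf_weights(model, cpu=True):
    print(name, tuple(weight.shape))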
Export Megatron → HF#
# Convert from a Megatron checkpoint directory to HF format
bridge.export_ckpt(
    megatron_path="/results/moonlight_16b/checkpoints/iter_0500000",
    hf_path="./moonlight-hf-export",
)
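If the model is already instantiated in memory (as in the load example above), it can also be written straight to an HF checkpoint directory. A sketch assuming the save_hf_pretrained convenience method shown in the Bridge examples:
# Write config/weights as a regular HF checkpoint directory
# (save_hf_pretrained is an assumption based on the Bridge examples).
bridge.save_hf_pretrained(model, "./moonlight-hf-export")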
Examples#
Checkpoint conversion: examples/conversion/convert_checkpoints.py
Recipes#
Available Recipes#
Pretrain recipes:
moonlight_16b_pretrain_config: Pre-training for Moonlight-16B (16B parameters, 3B activated per token)
Finetune recipes:
moonlight_16b_finetune_config: Finetuning for Moonlight-16B with PEFT support (LoRA, DoRA)
Parallelism Configurations#
| Model | Mode | TP | PP | EP | Total GPUs | Use Case |
|---|---|---|---|---|---|---|
| Moonlight-16B | Pretrain | 2 | 1 | 8 | 16 | Pre-training (2 nodes) |
| Moonlight-16B | Full SFT | 2 | 1 | 8 | 16 | Full supervised finetuning (2 nodes) |
| Moonlight-16B | LoRA/DoRA | 1 | 1 | 1 | 8 | PEFT finetuning (single node) |
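As a sanity check on the GPU counts: the world size is TP × PP × DP, so the 16-GPU layouts imply DP = 16 / (2 × 1) = 8; in Megatron, the EP=8 expert-parallel groups are carved out of those data-parallel ranks, which places 64 / 8 = 8 routed experts on each rank. The LoRA/DoRA row fits on a single 8-GPU node because PEFT trains only the small adapter weights.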
Key Features:
Expert Parallelism: EP=8 for efficient MoE training (64 experts)
Sequence Parallel: Enabled by default for memory efficiency
Selective Recomputation: Reduces activation memory
RoPE Fusion: Optional MLA-specific optimization (apply_rope_fusion=True)
DeepEP: Optional expert permutation optimization (enable_deepep=True)
Performance Optimizations:
MoE Permute Fusion: Fused expert permutation operations
RoPE Fusion: Optional fusion for Multi-head Latent Attention
Manual GC: Aggressive garbage collection (interval=5)
Precision-Aware Optimizer: BF16 gradients and optimizer states with FP32 master weights
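A sketch of how these switches map onto a recipe config object (the attribute names follow Megatron-Core / Bridge conventions and are assumptions to verify against the recipe source):
from megatron.bridge.recipes.moonlight import moonlight_16b_pretrain_config
cfg = moonlight_16b_pretrain_config(name="moonlight_pretrain")
cfg.model.moe_permute_fusion = True   # fused expert permute/unpermute
cfg.model.apply_rope_fusion = True    # MLA-specific RoPE fusion
cfg.train.manual_gc = True            # take over Python garbage collection
cfg.train.manual_gc_interval = 5      # collect every 5 iterations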
Pipeline Layouts (optional):
PP=1: No pipelining (default)
PP=2: 14+13 layer split with embedding/loss
PP=4: 8+7+7+6 layer split
PP=8: 5+4+4+4+4+4+4+4 layer split
VP: Virtual pipelining supported with PP=2, VP=2 and PP=4, VP=2
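A sketch of selecting the PP=2 (14+13) layout on the provider from the load example above; the uneven-stage knob is an assumption based on Megatron-Core's num_layers_in_first_pipeline_stage option:
# Put 14 of the 27 decoder layers on the first stage; the remaining 13
# land on the second stage. Knob name assumed from Megatron-Core.
provider.pipeline_model_parallel_size = 2
provider.num_layers_in_first_pipeline_stage = 14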
Pre-training Example#
from megatron.bridge.recipes.moonlight import moonlight_16b_pretrain_config
cfg = moonlight_16b_pretrain_config(
    name="moonlight_pretrain",
    data_paths=["/path/to/dataset"],  # prefix of preprocessed Megatron .bin/.idx data
    dir="/results/moonlight_16b",
    train_iters=500_000,
    global_batch_size=2048,
    seq_length=4096,
    # Uses TP=2, PP=1, EP=8 (16 GPUs) automatically
)
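The returned config can then be handed to the Bridge training entry point. The module paths below follow the Bridge pretraining examples and may shift between releases:
# Launch under torchrun (e.g. 16 GPUs across 2 nodes for TP=2, EP=8).
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain
pretrain(config=cfg, forward_step_func=forward_step)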
Finetuning Examples#
Full Finetuning (2 Nodes)#
from megatron.bridge.recipes.moonlight import moonlight_16b_finetune_config
cfg = moonlight_16b_finetune_config(
    tokenizer_path="moonshotai/Moonlight-16B-A3B",
    name="moonlight_full_sft",
    pretrained_checkpoint="/results/moonlight_16b/checkpoints/iter_0500000",
    peft=None,  # Full supervised finetuning
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=5e-6,
    # Uses TP=2, PP=1, EP=8 (16 GPUs) automatically
)
LoRA Finetuning#
from megatron.bridge.recipes.moonlight import moonlight_16b_finetune_config
cfg = moonlight_16b_finetune_config(
    tokenizer_path="moonshotai/Moonlight-16B-A3B",
    name="moonlight_lora_finetune",
    pretrained_checkpoint="/results/moonlight_16b/checkpoints/iter_0500000",
    peft="lora",  # or "dora" for DoRA
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
    # Uses TP=1, PP=1, EP=1 (8 GPUs) automatically
)
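Instead of the "lora" string, a configured PEFT object can be passed for finer control. The import path and field names below are assumptions based on the Bridge PEFT API; verify them against your install:
# Hypothetical customized LoRA: adapter rank 16, scaling alpha 32, applied
# to the attention projections.
from megatron.bridge.peft.lora import LoRA
custom_lora = LoRA(target_modules=["linear_qkv", "linear_proj"], dim=16, alpha=32)
cfg = moonlight_16b_finetune_config(
    tokenizer_path="moonshotai/Moonlight-16B-A3B",
    name="moonlight_custom_lora",
    pretrained_checkpoint="/results/moonlight_16b/checkpoints/iter_0500000",
    peft=custom_lora,
)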
Hugging Face model cards#
Moonlight-16B-A3B (Base): moonshotai/Moonlight-16B-A3B
Moonlight-16B-A3B-Instruct: moonshotai/Moonlight-16B-A3B-Instruct
Technical Paper#
Muon is Scalable for LLM Training: arXiv:2502.16982