Moonlight#

Moonlight is a 16B-parameter Mixture-of-Experts (MoE) model from Moonshot AI, trained on 5.7T tokens with the Muon optimizer. Moonlight shares the same architecture as DeepSeek-V3 (Multi-head Latent Attention plus MoE), but it is a distinct model that advances the Pareto frontier of performance versus training FLOPs: Muon is roughly 2× more sample-efficient than Adam under compute-optimal training.

The model has 27 decoder layers, each with 64 routed experts and 8 shared experts, and activates 3B of its 16B total parameters per forward pass.

Moonlight models are supported via the Bridge system with specialized configurations for MoE and MLA optimizations.

Model Architecture#

  • Parameters: 16B total, 3B activated per forward pass

  • Layers: 27 decoder layers

  • Attention: Multi-head Latent Attention (MLA) with RoPE fusion support

  • MoE: 64 routed experts + 8 shared experts per layer

  • Hidden size: 2048

  • Intermediate size: 10944 (with MLP and expert gating)

  • Vocab size: 151,936

  • Context Length: 8K tokens

  • Training: 5.7T tokens with Muon optimizer
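
The architecture parameters above can be cross-checked against the published checkpoint configuration. A minimal sketch, assuming the checkpoint exposes a DeepSeek-V3-style config and that trust_remote_code=True is acceptable; the expert-count field names are assumptions based on that architecture:

from transformers import AutoConfig

# Inspect the published Hugging Face config. The expert-count field names
# follow the DeepSeek-V3-style convention (an assumption); getattr guards
# against checkpoints that name them differently.
config = AutoConfig.from_pretrained(
    "moonshotai/Moonlight-16B-A3B",
    trust_remote_code=True,
)
print(config.num_hidden_layers)                    # decoder layers (27 above)
print(config.hidden_size)                          # hidden size (2048 above)
print(config.vocab_size)                           # vocabulary size
print(getattr(config, "n_routed_experts", None))   # routed experts per layer
print(getattr(config, "n_shared_experts", None))   # shared experts per layer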

Conversion with 🤗 Hugging Face#

Moonlight shares the same architecture as DeepSeek-V3, which makes it compatible with inference engines such as vLLM and SGLang. The model can be loaded directly from Hugging Face or converted to and from Megatron checkpoints.
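
For quick inference outside of Megatron, the Hugging Face checkpoint can be served with one of those engines directly. A minimal sketch using vLLM, assuming a build with DeepSeek-V3-style model support; the parallelism and sampling settings are illustrative only:

from vllm import LLM, SamplingParams

# Serve the HF checkpoint directly with vLLM. tensor_parallel_size and the
# prompt are illustrative; trust_remote_code is assumed to be required for
# the DeepSeek-V3-style modeling code.
llm = LLM(
    model="moonshotai/Moonlight-16B-A3B",
    trust_remote_code=True,
    tensor_parallel_size=2,
)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)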

Load HF → Megatron#

from megatron.bridge import AutoBridge

# Example: Moonlight-16B-A3B
bridge = AutoBridge.from_hf_pretrained("moonshotai/Moonlight-16B-A3B")
provider = bridge.to_megatron_provider()

# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 2
provider.pipeline_model_parallel_size = 1
provider.expert_model_parallel_size = 8
provider.sequence_parallel = True

provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

Export Megatron → HF#

# Convert from a Megatron checkpoint directory to HF format
bridge.export_ckpt(
    megatron_path="/results/moonlight_16b/checkpoints/iter_0500000",
    hf_path="./moonlight-hf-export",
)
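
As a sanity check, the exported directory can be loaded back with 🤗 Transformers. A minimal sketch; it assumes the tokenizer files were exported alongside the weights and that trust_remote_code=True is needed for the DeepSeek-V3-style modeling code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the exported checkpoint to confirm the conversion round-trips.
tokenizer = AutoTokenizer.from_pretrained(
    "./moonlight-hf-export", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "./moonlight-hf-export",
    trust_remote_code=True,
    torch_dtype="auto",
)
print(model.config.num_hidden_layers)  # expect 27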

Examples#

Recipes#

See: bridge.recipes.moonlight

Available Recipes#

  • Pretrain recipes:

    • moonlight_16b_pretrain_config: Pre-training for Moonlight-16B (16B parameters, 3B activated per token)

  • Finetune recipes:

    • moonlight_16b_finetune_config: Finetuning for Moonlight-16B with PEFT support (LoRA, DoRA)

Parallelism Configurations#

| Model | Mode | TP | PP | EP | Total GPUs | Use Case |
|---|---|---|---|---|---|---|
| Moonlight-16B | Pretrain | 2 | 1 | 8 | 16 | Pre-training (2 nodes) |
| Moonlight-16B | Full SFT | 2 | 1 | 8 | 16 | Full supervised finetuning (2 nodes) |
| Moonlight-16B | LoRA/DoRA | 1 | 1 | 1 | 8 | PEFT finetuning (single node) |
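
The layouts in the table are the recipe defaults. To run a different layout, the parallelism settings can be overridden on the returned config before training starts. A minimal sketch, assuming the recipe exposes the model provider under cfg.model with the standard Megatron parallelism attributes (the attribute path is an assumption; check the recipe source):

from megatron.bridge.recipes.moonlight import moonlight_16b_pretrain_config

cfg = moonlight_16b_pretrain_config(
    name="moonlight_pretrain_pp2",
    data_paths=["/path/to/dataset"],  # see the pre-training example below
)

# Switch from the default TP=2 / PP=1 / EP=8 layout to a PP=2 layout.
# Attribute names mirror the provider fields in the conversion example above.
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 2
cfg.model.expert_model_parallel_size = 8
cfg.model.sequence_parallel = True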

Key Features:

  • Expert Parallelism: EP=8 for efficient MoE training (64 experts)

  • Sequence Parallel: Enabled by default for memory efficiency

  • Selective Recomputation: Reduces activation memory

  • RoPE Fusion: Optional MLA-specific optimization (apply_rope_fusion=True)

  • DeepEP: Optional DeepEP dispatch/combine communication for expert parallelism (enable_deepep=True)
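
Both optional toggles can be enabled when building a recipe config. A minimal sketch, assuming apply_rope_fusion and enable_deepep are accepted as recipe keyword arguments, as the parameter names above suggest; if the signature differs, the equivalent fields can be set on the returned model config instead:

from megatron.bridge.recipes.moonlight import moonlight_16b_pretrain_config

# Enable the optional MLA RoPE fusion and DeepEP paths. The keyword names are
# taken from the feature list above and should be checked against the recipe
# signature.
cfg = moonlight_16b_pretrain_config(
    name="moonlight_pretrain_fused",
    data_paths=["/path/to/dataset"],
    apply_rope_fusion=True,
    enable_deepep=True,
)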

Performance Optimizations:

  • MoE Permute Fusion: Fused expert permutation operations

  • RoPE Fusion: Optional fusion for Multi-head Latent Attention

  • Manual GC: Manually triggered garbage collection at a fixed interval (interval=5)

  • Precision-Aware Optimizer: BF16 gradients and optimizer states with FP32 master weights

Pipeline Layouts (optional):

  • PP=1: No pipelining (default)

  • PP=2: 14+13 layer split, with the embedding and loss layers on the first and last stages respectively

  • PP=4: 8+7+7+6 layer split

  • PP=8: 5+4+4+4+4+4+4+4 layer split

  • Virtual pipeline (VP): PP=2 with VP=2 and PP=4 with VP=2 are supported

Pre-training Example#

from megatron.bridge.recipes.moonlight import moonlight_16b_pretrain_config

cfg = moonlight_16b_pretrain_config(
    name="moonlight_pretrain",
    data_paths=["/path/to/dataset.nvjsonl"],
    dir="/results/moonlight_16b",
    train_iters=500_000,
    global_batch_size=2048,
    seq_length=4096,
    # Uses TP=2, PP=1, EP=8 (16 GPUs) automatically
)
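
The returned config is then handed to a training entry point and launched with torchrun across the GPUs implied by the layout. A minimal sketch, assuming the generic GPT entry points used elsewhere in Megatron Bridge (megatron.bridge.training.gpt_step.forward_step and megatron.bridge.training.pretrain.pretrain); check the recipe documentation for the exact launch pattern:

# pretrain_moonlight.py
from megatron.bridge.recipes.moonlight import moonlight_16b_pretrain_config
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

if __name__ == "__main__":
    cfg = moonlight_16b_pretrain_config(
        name="moonlight_pretrain",
        data_paths=["/path/to/dataset.nvjsonl"],
    )
    # Launch with, e.g.: torchrun --nnodes=2 --nproc-per-node=8 pretrain_moonlight.py
    pretrain(config=cfg, forward_step_func=forward_step)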

Finetuning Examples#

Full Finetuning (2 Nodes)#

from megatron.bridge.recipes.moonlight import moonlight_16b_finetune_config

cfg = moonlight_16b_finetune_config(
    tokenizer_path="moonshotai/Moonlight-16B-A3B",
    name="moonlight_full_sft",
    pretrained_checkpoint="/results/moonlight_16b/checkpoints/iter_0500000",
    peft=None,  # Full supervised finetuning
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=5e-6,
    # Uses TP=2, PP=1, EP=8 (16 GPUs) automatically
)

LoRA Finetuning#

from megatron.bridge.recipes.moonlight import moonlight_16b_finetune_config

cfg = moonlight_16b_finetune_config(
    tokenizer_path="moonshotai/Moonlight-16B-A3B",
    name="moonlight_lora_finetune",
    pretrained_checkpoint="/results/moonlight_16b/checkpoints/iter_0500000",
    peft="lora",  # or "dora" for DoRA
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
    # Uses TP=1, PP=1, EP=1 (8 GPUs) automatically
)

Hugging Face model cards#

Technical Paper#