Fine-Tune Large MoE LLMs#
Introduction#
Mixture-of-Experts (MoE) architectures have become the dominant design for frontier language models, activating only a fraction of their total parameters per token to deliver strong performance at reduced compute cost. This guide walks through fine-tuning four example MoE LLMs with NVIDIA NeMo Automodel. For a full list of supported architectures, see the LLM model coverage page.
| Model | HF Checkpoint | Validated Using |
|---|---|---|
| GLM-5 | | 256 H100 GPUs (32 nodes x 8) |
| MiniMax-M2.5 | | 64 H100 GPUs (8 nodes x 8) |
| Step-3.5 Flash | | 64 H100 GPUs (8 nodes x 8) |
| DeepSeek-V3.2 | | 256 H100 GPUs (32 nodes x 8) |
To set up your environment to run NeMo Automodel, follow the installation guide.
Data#
HellaSwag Dataset#
All four recipes use the HellaSwag dataset, a commonsense natural language inference benchmark where the model must predict the most plausible continuation of a given scenario.
Source: rowan/hellaswag
Split: train (used for both training and validation in these recipes)
Task: Next-token prediction on commonsense sentence completions
For details on how to swap in your own dataset, see the LLM Dataset Guide and the Dataset Overview.
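As a rough illustration of what next-token prediction on a sentence completion looks like for one HellaSwag record, the sketch below joins the context with the labeled ending. The field names `ctx`, `endings`, and `label` follow the public `rowan/hellaswag` schema; the sample record and the `to_completion_text` helper are made up for illustration and are not NeMo Automodel's actual preprocessing.

```python
def to_completion_text(record: dict) -> dict:
    """Split a HellaSwag-style record into a prompt and its correct continuation."""
    # In the public dataset, `label` is a string index into `endings`.
    correct = record["endings"][int(record["label"])]
    return {"prompt": record["ctx"].strip(), "completion": " " + correct.strip()}


# Hypothetical record shaped like a HellaSwag row (not taken from the dataset).
sample = {
    "ctx": "A man pours water into a kettle. He",
    "endings": [
        "throws the kettle into the sea.",
        "places the kettle on the stove and turns on the burner.",
        "paints the kettle blue.",
        "reads the kettle a story.",
    ],
    "label": "1",
}

example = to_completion_text(sample)
full_text = example["prompt"] + example["completion"]
print(full_text)
```

The concatenated text is what a causal language model would be trained to continue token by token; whether the loss covers the prompt tokens as well depends on the dataset configuration in the recipe.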
Recipes#
MiniMax-M2.5#
examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).
Key distributed settings:
distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
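To make `layers_per_stage` and the interleaved schedule more concrete, here is a small sketch of the stage arithmetic. It assumes that virtual stages are formed by grouping consecutive layers and are dealt to pipeline ranks round-robin, which is the usual interleaved 1F1B layout. The 48-layer depth and the `plan_stages` helper are hypothetical, and NeMo Automodel's own splitting logic may differ (for example, in how the embedding and output head are counted).

```python
import math


def plan_stages(num_layers: int, layers_per_stage: int, pp_size: int) -> dict:
    """Assign layer indices to pipeline ranks for an interleaved schedule (sketch)."""
    num_stages = math.ceil(num_layers / layers_per_stage)
    if num_stages % pp_size != 0:
        raise ValueError(
            f"{num_stages} stages cannot be split evenly over pp_size={pp_size}"
        )
    plan = {rank: [] for rank in range(pp_size)}
    for stage in range(num_stages):
        start = stage * layers_per_stage
        end = min(start + layers_per_stage, num_layers)
        # Round-robin placement: stage s lives on rank s % pp_size.
        plan[stage % pp_size].append(list(range(start, end)))
    return plan


# Hypothetical 48-layer model with pp_size=2 and layers_per_stage=2.
plan = plan_stages(num_layers=48, layers_per_stage=2, pp_size=2)
print(len(plan[0]), "virtual stages per rank")
print("rank 0 starts with layers", plan[0][:2])
print("rank 1 starts with layers", plan[1][:2])
```

Under these assumptions each rank holds several small, non-adjacent chunks of the model instead of one contiguous block, which is what lets an interleaved schedule reduce pipeline idle time.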
GLM-5#
examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).
Key distributed settings:
distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
Step-3.5 Flash (StepFun)#
examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).
Key distributed settings:
distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
DeepSeek-V3.2#
examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).
Key distributed settings:
distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
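Each recipe pairs `pp_size` and `ep_size` with a specific GPU count. The following sketch checks that arithmetic under two assumptions that this guide does not state: that the ranks left over after pipeline parallelism form the data-parallel group, and that `ep_size` must divide that group. The `check_layout` helper is hypothetical.

```python
RECIPES = {
    # model: (validated GPU count, pp_size, ep_size), copied from this guide
    "MiniMax-M2.5": (64, 2, 32),
    "GLM-5": (256, 4, 64),
    "Step-3.5 Flash": (64, 2, 32),
    "DeepSeek-V3.2": (256, 4, 64),
}


def check_layout(world_size: int, pp_size: int, ep_size: int) -> dict:
    """Derive ranks per pipeline stage and the number of expert-parallel groups."""
    if world_size % pp_size != 0:
        raise ValueError("world_size must be divisible by pp_size")
    ranks_per_stage = world_size // pp_size
    if ranks_per_stage % ep_size != 0:
        raise ValueError("ep_size must divide the ranks in one pipeline stage")
    return {
        "ranks_per_stage": ranks_per_stage,
        "expert_groups": ranks_per_stage // ep_size,
    }


layouts = {name: check_layout(*cfg) for name, cfg in RECIPES.items()}
for name, layout in layouts.items():
    print(name, layout)
```

With the validated GPU counts, `ep_size` equals the number of ranks in one pipeline stage for all four recipes, so under these assumptions a smaller cluster would require lowering `pp_size` or `ep_size` accordingly.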
Launch Training#
NeMo Automodel supports several ways to launch training—via the Automodel CLI with Slurm, interactive sessions, torchrun, and more. For full details on all launch options (Slurm batch jobs, multi-node configuration, environment variables, etc.), see the Run on a Cluster guide.
Automodel CLI#
automodel finetune llm -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml
Replace the recipe path with the one for your target model.
torchrun#
export TRANSFORMERS_OFFLINE=1
export HF_HOME=your/path/to/hf_cache
export HF_DATASETS_OFFLINE=1
export WANDB_API_KEY=your_wandb_key
torchrun --nproc_per_node=8 \
  --nnodes=32 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=${MASTER_ADDR}:${PORT} \
  nemo_automodel/recipes/llm/benchmark.py \
  -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml \
  --model.pretrained_model_name_or_path=/your/local/model_weights
Replace the -c path, --nnodes, and --model.pretrained_model_name_or_path for the model you want to fine-tune.
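Because both the recipe path and the node count change per model, it can help to keep them in one place. The sketch below builds the `torchrun` argument list from the recipe paths and validated node counts listed in this guide. The `build_torchrun_args` helper, the example rendezvous endpoint, and the weights path are placeholders and are not part of NeMo Automodel.

```python
import shlex

# Recipe path and validated node count per model, as listed in this guide.
RECIPES = {
    "MiniMax-M2.5": ("examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml", 8),
    "GLM-5": ("examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml", 32),
    "Step-3.5 Flash": ("examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml", 8),
    "DeepSeek-V3.2": ("examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml", 32),
}


def build_torchrun_args(model: str, weights_dir: str, rdzv_endpoint: str) -> list:
    """Assemble the torchrun command for one model as a list of arguments."""
    recipe, nnodes = RECIPES[model]
    return [
        "torchrun",
        "--nproc_per_node=8",
        f"--nnodes={nnodes}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        "nemo_automodel/recipes/llm/benchmark.py",
        "-c",
        recipe,
        f"--model.pretrained_model_name_or_path={weights_dir}",
    ]


# Placeholder endpoint and weights path; substitute your own values.
args = build_torchrun_args("DeepSeek-V3.2", "/your/local/model_weights", "node0:29500")
print(shlex.join(args))
```

The printed command can be pasted into a job script, or the list can be passed directly to `subprocess.run` on each node.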
Before you start:
Hugging Face applies rate limits on downloads. We recommend cloning the model repository to your local filesystem beforehand.
Ensure your Hugging Face cache (HF_HOME) is configured and that the dataset is already cached locally.
To enable Weights & Biases logging, set your WANDB_API_KEY and configure the wandb section in the YAML file.
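The checklist above can be turned into a quick pre-flight check before submitting a job. This is a minimal sketch using only the standard library; the `preflight` helper is hypothetical, and it only inspects the environment variables named in this guide.

```python
import os


def preflight(env: dict, model_dir: str) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    hf_home = env.get("HF_HOME")
    if not hf_home:
        problems.append("HF_HOME is not set")
    elif not os.path.isdir(hf_home):
        problems.append(f"HF_HOME does not exist: {hf_home}")
    # Offline mode avoids hitting Hugging Face rate limits from every rank.
    for flag in ("TRANSFORMERS_OFFLINE", "HF_DATASETS_OFFLINE"):
        if env.get(flag) != "1":
            problems.append(f"{flag} is not set to 1")
    if not os.path.isdir(model_dir):
        problems.append(f"model weights not found: {model_dir}")
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set (only needed for W&B logging)")
    return problems


if __name__ == "__main__":
    # Placeholder path; point this at your local copy of the model repository.
    for problem in preflight(dict(os.environ), "/your/local/model_weights"):
        print("WARNING:", problem)
```

Running this on the login node before launching catches a missing cache or weights directory early, rather than after a multi-node allocation has started.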