Fine-Tune Large MoE LLMs
Introduction
Mixture-of-Experts (MoE) architectures have become the dominant design for frontier language models, activating only a fraction of their total parameters per token to deliver strong performance at reduced compute cost. This guide walks through fine-tuning four example MoE LLMs with NVIDIA NeMo Automodel. For a full list of supported architectures, see the LLM model coverage page.
To set up your environment to run NeMo Automodel, follow the installation guide.
Data
HellaSwag Dataset
All four recipes use the HellaSwag dataset, a commonsense natural language inference benchmark where the model must predict the most plausible continuation of a given scenario.
- Source: rowan/hellaswag
- Split: train (used for both training and validation in these recipes)
- Task: Next-token prediction on commonsense sentence completions
For details on how to swap in your own dataset, see the LLM Dataset Guide and the Dataset Overview.
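To make the task concrete, here is a minimal sketch of turning one HellaSwag record into a context and its gold continuation. It assumes the field names published with rowan/hellaswag (ctx, endings, label, with label stored as a string index in the train split). The helper name hellaswag_to_text and the sample record are hypothetical, and the formatting NeMo Automodel applies internally may differ.

```python
def hellaswag_to_text(record: dict) -> dict:
    """Split a HellaSwag-style record into context, gold ending, and full text.

    Assumes the record has `ctx` (str), `endings` (list of str), and
    `label` (index of the correct ending, stored as a string).
    """
    context = record["ctx"]
    gold = record["endings"][int(record["label"])]
    return {"context": context, "target": gold, "text": context + " " + gold}


# Hypothetical record in the HellaSwag layout (not taken from the dataset).
sample = {
    "ctx": "A woman fills a kettle with water. She",
    "endings": [
        "throws the kettle into the garden.",
        "places it on the stove and turns on the burner.",
        "paints the kettle with a brush.",
        "reads the kettle a story.",
    ],
    "label": "1",
}

example = hellaswag_to_text(sample)
print(example["text"])
```

For next-token prediction, the model is trained on the joined text; whether the loss covers the context tokens or only the continuation depends on the dataset configuration in the recipe.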
Recipes
MiniMax-M2.5
examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).
Key distributed settings:
GLM-5
examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).
Key distributed settings:
Step-3.5 Flash (StepFun)
examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).
Key distributed settings:
DeepSeek-V3.2
examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).
Key distributed settings:
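As a quick cross-reference for all four recipes, the sketch below collects the recipe paths and validated node counts stated above and derives the total GPU count for each. The dictionary layout and the helper name total_gpus are illustrative only, not part of NeMo Automodel.

```python
GPUS_PER_NODE = 8  # every recipe above was validated on 8-GPU H100 nodes

# model name -> (recipe path, validated node count), as listed in this guide
RECIPES = {
    "MiniMax-M2.5": ("examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml", 8),
    "GLM-5": ("examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml", 32),
    "Step-3.5 Flash": ("examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml", 8),
    "DeepSeek-V3.2": ("examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml", 32),
}


def total_gpus(model: str) -> int:
    """Return the validated GPU count (nodes x GPUs per node) for a model."""
    _, nodes = RECIPES[model]
    return nodes * GPUS_PER_NODE


for name, (path, nodes) in RECIPES.items():
    print(f"{name}: {nodes} nodes, {total_gpus(name)} GPUs, recipe={path}")
```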
Launch Training
NeMo Automodel supports several ways to launch training, including the Automodel CLI with Slurm, interactive sessions, and torchrun. For full details on all launch options (Slurm batch jobs, multi-node configuration, environment variables, etc.), see the Run on a Cluster guide.
Automodel CLI
Replace the recipe path with the one for your target model.
torchrun
Replace the -c path, --nnodes, and --model.pretrained_model_name_or_path for the model you want to fine-tune.
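The three values to replace can be seen together in the following sketch, which assembles a torchrun argument list in Python. The -c, --nnodes, and --model.pretrained_model_name_or_path arguments come from the text above. The entry script path, the local model path, and the helper name build_torchrun_cmd are assumptions for illustration, so check them against your checkout. Multi-node rendezvous options are omitted here; see the Run on a Cluster guide for those.

```python
import shlex


def build_torchrun_cmd(
    recipe: str,
    nnodes: int,
    model_path: str,
    nproc_per_node: int = 8,
    script: str = "examples/llm_finetune/finetune.py",  # assumed entry script
) -> list[str]:
    """Assemble a torchrun command line for one fine-tuning recipe."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        script,
        "-c", recipe,
        "--model.pretrained_model_name_or_path", model_path,
    ]


cmd = build_torchrun_cmd(
    recipe="examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml",
    nnodes=32,
    model_path="/path/to/local/model",  # hypothetical local clone of the model
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if it is later passed to subprocess.run.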
Before you start:
- Hugging Face applies rate limits on downloads. We recommend cloning the model repository to your local filesystem beforehand.
- Ensure your Hugging Face cache (HF_HOME) is configured and that the dataset is already cached locally.
- To enable Weights & Biases logging, set your WANDB_API_KEY and configure the wandb section in the YAML file.
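The environment-related items in this checklist can be verified before submitting a job. The following is a minimal sketch that only inspects HF_HOME and WANDB_API_KEY; the function name preflight and its messages are hypothetical, and it does not confirm that the model or dataset is actually present in the cache.

```python
import os


def preflight(env=None) -> list[str]:
    """Return a list of human-readable problems found in the environment."""
    env = os.environ if env is None else env
    problems = []

    hf_home = env.get("HF_HOME")
    if not hf_home:
        problems.append("HF_HOME is not set")
    elif not os.path.isdir(hf_home):
        problems.append(f"HF_HOME points to a missing directory: {hf_home}")

    # Only needed when the wandb section of the YAML file is enabled.
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set")

    return problems


for problem in preflight():
    print("WARNING:", problem)
```

Passing a dictionary instead of relying on os.environ makes the check easy to exercise in isolation, for example preflight({"HF_HOME": "/some/cache"}).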