Fine-Tune Large MoE LLMs

Introduction

Mixture-of-Experts (MoE) architectures have become the dominant design for frontier language models, activating only a fraction of their total parameters per token to deliver strong performance at reduced compute cost. This guide walks through fine-tuning four example MoE LLMs with NVIDIA NeMo Automodel. For a full list of supported architectures, see the LLM model coverage page.

| Model | HF Checkpoint | Validated Using |
|---|---|---|
| GLM-5 | zai-org/GLM-5 | 256 H100 GPUs (32 nodes x 8) |
| MiniMax-M2.5 | MiniMaxAI/MiniMax-M2.5 | 64 H100 GPUs (8 nodes x 8) |
| Step-3.5 Flash | stepfun-ai/Step-3.5-Flash | 64 H100 GPUs (8 nodes x 8) |
| DeepSeek-V3.2 | deepseek-ai/DeepSeek-V3.2 | 256 H100 GPUs (32 nodes x 8) |

To set up your environment to run NeMo Automodel, follow the installation guide.

Data

HellaSwag Dataset

All four recipes use the HellaSwag dataset, a commonsense natural language inference benchmark where the model must predict the most plausible continuation of a given scenario.

  • Source: rowan/hellaswag
  • Split: train (used for both training and validation in these recipes)
  • Task: Next-token prediction on commonsense sentence completions

For details on how to swap in your own dataset, see the LLM Dataset Guide and the Dataset Overview.
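As a rough illustration of the task described above, the sketch below turns one HellaSwag-style record into a single training string by joining the context with its labeled ending. The field names (`ctx`, `endings`, `label`) follow the public rowan/hellaswag schema. The `format_example` helper and the sample record are hypothetical, and the way NeMo Automodel formats samples internally may differ.

```python
# Hypothetical helper for illustration; not part of NeMo Automodel.
def format_example(record: dict) -> str:
    """Join the context with the correct ending to form one training string."""
    # In the train split, `label` is a string index into `endings`.
    correct_ending = record["endings"][int(record["label"])]
    return f'{record["ctx"].strip()} {correct_ending.strip()}'


# Made-up record that mimics the HellaSwag schema (not real dataset content).
sample = {
    "ctx": "A man is standing on a ladder next to a house. He",
    "endings": [
        "jumps into a swimming pool.",
        "starts cleaning the gutters.",
        "throws the ladder over the roof.",
        "begins playing the violin.",
    ],
    "label": "1",
}

print(format_example(sample))
```

A model trained with next-token prediction on strings like this one learns to continue the context with the plausible ending.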

Recipes

MiniMax-M2.5

examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2

GLM-5

examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2

Step-3.5 Flash (StepFun)

examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2

DeepSeek-V3.2

examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
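One pattern is visible across the four recipes: in each validated setup, the total GPU count divided by `pp_size` equals `ep_size`. The sketch below checks that arithmetic using only the numbers listed in this guide. The `summarize` helper is hypothetical, and whether NeMo Automodel requires this relationship (as opposed to it simply holding for these recipes) is an assumption this guide does not confirm.

```python
# (model, nodes, gpus_per_node, pp_size, ep_size), copied from the recipes above.
VALIDATED = [
    ("MiniMax-M2.5", 8, 8, 2, 32),
    ("GLM-5", 32, 8, 4, 64),
    ("Step-3.5 Flash", 8, 8, 2, 32),
    ("DeepSeek-V3.2", 32, 8, 4, 64),
]


def summarize(nodes: int, gpus_per_node: int, pp_size: int, ep_size: int) -> dict:
    """Compute the GPU count and how many ranks remain after pipeline parallelism."""
    world_size = nodes * gpus_per_node
    if world_size % pp_size != 0:
        raise ValueError("pp_size must divide the total GPU count")
    # Ranks left for the other parallel dimensions once the pipeline is laid out.
    ranks_outside_pp = world_size // pp_size
    return {
        "world_size": world_size,
        "ranks_outside_pp": ranks_outside_pp,
        "matches_ep_size": ranks_outside_pp == ep_size,
    }


for model, nodes, gpus, pp, ep in VALIDATED:
    print(model, summarize(nodes, gpus, pp, ep))
```

If you adapt a recipe to a different cluster size, a check like this is a cheap way to catch a mismatch between the node count and the parallelism settings before submitting a job.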

Launch Training

NeMo Automodel supports several ways to launch training, including the Automodel CLI with Slurm, interactive sessions, and torchrun. For full details on all launch options (Slurm batch jobs, multi-node configuration, environment variables, and so on), see the Run on a Cluster guide.

Automodel CLI

automodel finetune llm -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml

Replace the recipe path with the one for your target model.

torchrun

export TRANSFORMERS_OFFLINE=1
export HF_HOME=your/path/to/hf_cache
export HF_DATASETS_OFFLINE=1
export WANDB_API_KEY=your_wandb_key

# --nnodes=32 matches the validated GLM-5 setup (32 nodes x 8 GPUs).
torchrun --nproc-per-node=8 \
    --nnodes=32 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${PORT} \
    nemo_automodel/recipes/llm/benchmark.py \
    -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml \
    --model.pretrained_model_name_or_path=/your/local/model_weights

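Replace the -c path, --nnodes, and --model.pretrained_model_name_or_path to match the model you want to fine-tune. Set --nnodes to the node count validated for that recipe (32 for GLM-5 and DeepSeek-V3.2, 8 for MiniMax-M2.5 and Step-3.5 Flash).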
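To make the per-model substitutions less error-prone, the following sketch assembles the torchrun argument list from a small lookup table. The recipe paths and node counts are copied from this guide. The `torchrun_command` helper, the `node0` host name, and port `29500` are illustrative placeholders rather than anything NeMo Automodel provides.

```python
import shlex

# Recipe path and validated node count per model, copied from this guide.
RECIPES = {
    "MiniMax-M2.5": ("examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml", 8),
    "GLM-5": ("examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml", 32),
    "Step-3.5 Flash": ("examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml", 8),
    "DeepSeek-V3.2": ("examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml", 32),
}


def torchrun_command(model: str, weights_path: str, master_addr: str,
                     port: int, gpus_per_node: int = 8) -> list:
    """Build the torchrun argument list for one of the recipes in this guide."""
    recipe, nnodes = RECIPES[model]
    return [
        "torchrun",
        f"--nproc-per-node={gpus_per_node}",
        f"--nnodes={nnodes}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_addr}:{port}",
        "nemo_automodel/recipes/llm/benchmark.py",
        "-c", recipe,
        f"--model.pretrained_model_name_or_path={weights_path}",
    ]


# Placeholder host and port; substitute the values for your cluster.
cmd = torchrun_command("MiniMax-M2.5", "/your/local/model_weights", "node0", 29500)
print(shlex.join(cmd))
```

Printing the command instead of executing it lets you review the arguments or paste them into a batch script.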

Before you start:

  • Hugging Face applies rate limits on downloads. We recommend cloning the model repository to your local filesystem beforehand.
  • Ensure your Hugging Face cache (HF_HOME) is configured and that the dataset is already cached locally.
  • To enable Weights & Biases logging, set your WANDB_API_KEY and configure the wandb section in the YAML file.
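The first two items can be handled with a short script run once on a machine with internet access. This is a minimal sketch that assumes the `huggingface_hub` and `datasets` packages are installed; `prefetch` and `offline_env` are hypothetical helper names, and depending on your `datasets` version, loading rowan/hellaswag may need additional arguments.

```python
import os


def offline_env(hf_home: str) -> dict:
    """Environment variables used by the torchrun example for offline runs."""
    return {
        "HF_HOME": hf_home,
        "TRANSFORMERS_OFFLINE": "1",
        "HF_DATASETS_OFFLINE": "1",
    }


def prefetch(repo_id: str, weights_dir: str, hf_home: str) -> None:
    """Download model weights and the HellaSwag train split while online."""
    # Set HF_HOME before importing the Hugging Face libraries so that
    # their caches resolve under it.
    os.environ["HF_HOME"] = hf_home
    from datasets import load_dataset
    from huggingface_hub import snapshot_download

    # Copy the model repository into a local directory.
    snapshot_download(repo_id=repo_id, local_dir=weights_dir)
    # Populate the datasets cache with the split the recipes use.
    load_dataset("rowan/hellaswag", split="train")


# Example call (requires network access and substantial disk space):
# prefetch("MiniMaxAI/MiniMax-M2.5", "/your/local/model_weights", "your/path/to/hf_cache")

# Print the exports to use on the training nodes afterwards.
for key, value in offline_env("your/path/to/hf_cache").items():
    print(f"export {key}={value}")
```

After the download completes, pass the local weights directory as --model.pretrained_model_name_or_path and export the printed variables before launching.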