Fine-Tune Large MoE LLMs#

Introduction#

Mixture-of-Experts (MoE) architectures are widely used in frontier language models: they activate only a fraction of their total parameters per token, which lowers compute cost per token compared with a dense model of the same total size. This guide walks through fine-tuning four example MoE LLMs with NVIDIA NeMo Automodel. For a full list of supported architectures, see the LLM model coverage page.

Model	HF Checkpoint	Validated Using
GLM-5	`zai-org/GLM-5`	256 H100 GPUs (32 nodes x 8)
MiniMax-M2.5	`MiniMaxAI/MiniMax-M2.5`	64 H100 GPUs (8 nodes x 8)
Step-3.5 Flash	`stepfun-ai/Step-3.5-Flash`	64 H100 GPUs (8 nodes x 8)
DeepSeek-V3.2	`deepseek-ai/DeepSeek-V3.2`	256 H100 GPUs (32 nodes x 8)

To set up your environment to run NeMo Automodel, follow the installation guide.

Data#

HellaSwag Dataset#

All four recipes use the HellaSwag dataset, a commonsense natural language inference benchmark where the model must predict the most plausible continuation of a given scenario.

Source: rowan/hellaswag
Split: train (used for both training and validation in these recipes)
Task: Next-token prediction on commonsense sentence completions

For details on how to swap in your own dataset, see the LLM Dataset Guide and the Dataset Overview.
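
To make the task concrete, the sketch below turns one HellaSwag-style record into a single training string by joining the context with its labeled ending. The field names (`ctx`, `endings`, `label`) follow the public `rowan/hellaswag` dataset; the joining format and the sample record are assumptions for illustration and may differ from what the NeMo Automodel recipes do internally.

```python
from typing import Mapping


def format_hellaswag_example(row: Mapping[str, object]) -> str:
    """Join the context with its correct ending (illustrative format only)."""
    ctx = str(row["ctx"]).strip()
    endings = list(row["endings"])  # four candidate continuations
    label = int(row["label"])  # the dataset stores the label as a string
    return f"{ctx} {str(endings[label]).strip()}"


# Hypothetical record with the same fields as a real one (not taken from the dataset).
sample = {
    "ctx": "A chef cracks two eggs into a bowl. She",
    "endings": [
        "throws the bowl out of the window.",
        "whisks them with a fork.",
        "starts painting the kitchen wall.",
        "puts on a pair of skis.",
    ],
    "label": "1",
}
print(format_hellaswag_example(sample))
```

With the `datasets` package installed, records returned by `load_dataset("rowan/hellaswag", split="train")` can be passed to the same function.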

Recipes#

MiniMax-M2.5#

examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
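
As a rough illustration of how `pp_size` and `layers_per_stage` interact under an interleaved schedule, the sketch below splits a stack of layers into stages of `layers_per_stage` layers and assigns the stages to pipeline ranks round-robin, so each rank holds several non-adjacent stages. The 16-layer model is hypothetical, and the round-robin placement is the layout commonly used for interleaved schedules; NeMo Automodel's actual splitting (for example, how embeddings and the output head are counted) may differ.

```python
import math


def assign_stages(num_layers: int, layers_per_stage: int, pp_size: int) -> dict:
    """Map each pipeline rank to the layer indices of the stages it holds."""
    num_stages = math.ceil(num_layers / layers_per_stage)
    if num_stages % pp_size != 0:
        raise ValueError(
            f"{num_stages} stages cannot be spread evenly over {pp_size} ranks"
        )
    placement = {rank: [] for rank in range(pp_size)}
    for stage in range(num_stages):
        start = stage * layers_per_stage
        layers = list(range(start, min(start + layers_per_stage, num_layers)))
        placement[stage % pp_size].append(layers)  # round-robin over ranks
    return placement


# Hypothetical 16-layer model with pp_size=2 and layers_per_stage=2.
placement = assign_stages(num_layers=16, layers_per_stage=2, pp_size=2)
for rank, stages in placement.items():
    print(f"rank {rank}: {stages}")
```

In this toy case each of the two ranks holds four stages of two layers each, which is what lets an interleaved schedule overlap work across stages on the same rank.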

GLM-5#

examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2

Step-3.5 Flash (StepFun)#

examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2

DeepSeek-V3.2#

examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
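
Across the four recipes above, `pp_size` multiplied by `ep_size` equals the validated GPU count. The sketch below checks that relationship for the values listed in this guide. It is an observation about these specific configurations, not a documented NeMo Automodel rule, so treat it as a sanity check when adapting a recipe rather than a hard constraint.

```python
# (pp_size, ep_size, nodes, GPUs per node) as listed in this guide.
RECIPE_LAYOUTS = {
    "MiniMax-M2.5": (2, 32, 8, 8),
    "GLM-5": (4, 64, 32, 8),
    "Step-3.5 Flash": (2, 32, 8, 8),
    "DeepSeek-V3.2": (4, 64, 32, 8),
}


def layout_summary(pp_size: int, ep_size: int, nodes: int, gpus_per_node: int) -> dict:
    """Derive the GPU count and how many ranks share each pipeline position."""
    world_size = nodes * gpus_per_node
    if world_size % pp_size != 0:
        raise ValueError("GPU count must be divisible by pp_size")
    ranks_per_pp_position = world_size // pp_size
    return {
        "world_size": world_size,
        "ranks_per_pp_position": ranks_per_pp_position,
        # True when the expert-parallel group fits evenly into those ranks.
        "ep_fits": ranks_per_pp_position % ep_size == 0,
    }


summaries = {name: layout_summary(*cfg) for name, cfg in RECIPE_LAYOUTS.items()}
for name, summary in summaries.items():
    print(name, summary)
```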

Launch Training#

NeMo Automodel supports several ways to launch training, including the Automodel CLI with Slurm, interactive sessions, and torchrun. For full details on all launch options (Slurm batch jobs, multi-node configuration, environment variables, and so on), see the Run on a Cluster guide.

Automodel CLI#

automodel finetune llm -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml

Replace the recipe path with the one for your target model.

torchrun#

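# Run fully offline: use only locally cached models and datasets.
export TRANSFORMERS_OFFLINE=1
export HF_HOME=your/path/to/hf_cache
export HF_DATASETS_OFFLINE=1
# Only needed for Weights & Biases logging.
export WANDB_API_KEY=your_wandb_key

# MASTER_ADDR and PORT must identify the rendezvous host and port, identically on every node.
# --nnodes=32 matches the GLM-5 recipe (32 nodes x 8 GPUs).
torchrun --nproc_per_node=8 \
         --nnodes=32 \
         --rdzv_backend=c10d \
         --rdzv_endpoint="${MASTER_ADDR}:${PORT}" \
  nemo_automodel/recipes/llm/benchmark.py \
    -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml \
    --model.pretrained_model_name_or_path=/your/local/model_weights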

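Set the -c path, --nnodes, and --model.pretrained_model_name_or_path to match the model you want to fine-tune. The recipes in this guide were validated with --nnodes=8 (MiniMax-M2.5, Step-3.5 Flash) and --nnodes=32 (GLM-5, DeepSeek-V3.2), each with 8 GPUs per node.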
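
The three substitutions can be kept together in a small lookup. The sketch below builds the `torchrun` argument list for a given model from the recipe paths and node counts listed in this guide. The helper name `build_torchrun_args` and the example endpoint `node0:29500` are hypothetical; the flags are the same ones used in the command above.

```python
import shlex

# Recipe path and validated node count per model, as listed in this guide.
RECIPES = {
    "GLM-5": ("examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml", 32),
    "MiniMax-M2.5": ("examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml", 8),
    "Step-3.5 Flash": ("examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml", 8),
    "DeepSeek-V3.2": ("examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml", 32),
}


def build_torchrun_args(model: str, weights_dir: str, endpoint: str) -> list:
    """Return the torchrun argv for one model (hypothetical helper)."""
    recipe, nnodes = RECIPES[model]
    return [
        "torchrun",
        "--nproc_per_node=8",
        f"--nnodes={nnodes}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={endpoint}",
        "nemo_automodel/recipes/llm/benchmark.py",
        "-c",
        recipe,
        f"--model.pretrained_model_name_or_path={weights_dir}",
    ]


# Example endpoint and weights path are placeholders.
args = build_torchrun_args("MiniMax-M2.5", "/your/local/model_weights", "node0:29500")
print(shlex.join(args))
```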

Before you start:

Hugging Face applies rate limits on downloads. We recommend cloning the model repository to your local filesystem beforehand.
Ensure your Hugging Face cache (HF_HOME) is configured and that the dataset is already cached locally.
To enable Weights & Biases logging, set your WANDB_API_KEY and configure the wandb section in the YAML file.
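
These checks can be scripted before a long job is queued. The sketch below is a minimal, hypothetical pre-flight helper that inspects the environment variables named in this guide and reports anything missing. It does not verify that the model or dataset is actually present in the cache.

```python
import os
from typing import Mapping


def preflight_problems(env: Mapping[str, str]) -> list:
    """Return human-readable warnings about the launch environment."""
    problems = []
    for flag in ("TRANSFORMERS_OFFLINE", "HF_DATASETS_OFFLINE"):
        if env.get(flag) != "1":
            problems.append(f"{flag} is not set to 1")
    hf_home = env.get("HF_HOME", "")
    if not hf_home:
        problems.append("HF_HOME is not set")
    elif not os.path.isdir(hf_home):
        problems.append(f"HF_HOME does not exist: {hf_home}")
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set (only needed for W&B logging)")
    return problems


for problem in preflight_problems(os.environ):
    print("warning:", problem)
```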