Fine-Tune Large MoE LLMs#

Introduction#

Mixture-of-Experts (MoE) architectures are now widely used in frontier language models. They activate only a fraction of their total parameters for each token, which delivers strong performance at reduced compute cost. This guide walks through fine-tuning four example MoE LLMs with NVIDIA NeMo Automodel. For a full list of supported architectures, see the LLM model coverage page.

| Model          | HF Checkpoint             | Validated Using              |
|----------------|---------------------------|------------------------------|
| GLM-5          | zai-org/GLM-5             | 256 H100 GPUs (32 nodes x 8) |
| MiniMax-M2.5   | MiniMaxAI/MiniMax-M2.5    | 64 H100 GPUs (8 nodes x 8)   |
| Step-3.5 Flash | stepfun-ai/Step-3.5-Flash | 64 H100 GPUs (8 nodes x 8)   |
| DeepSeek-V3.2  | deepseek-ai/DeepSeek-V3.2 | 256 H100 GPUs (32 nodes x 8) |

To set up your environment to run NeMo Automodel, follow the installation guide.

Data#

HellaSwag Dataset#

All four recipes use the HellaSwag dataset, a commonsense natural language inference benchmark where the model must predict the most plausible continuation of a given scenario.

  • Source: rowan/hellaswag

  • Split: train (used for both training and validation in these recipes)

  • Task: Next-token prediction on commonsense sentence completions

For details on how to swap in your own dataset, see the LLM Dataset Guide and the Dataset Overview.
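To make the next-token-prediction setup concrete, the sketch below splits a HellaSwag-style record into its context and its correct ending, then joins them into one training string. The field names `ctx`, `endings`, and `label` follow the rowan/hellaswag dataset on the Hugging Face Hub; the helper name `to_training_text` and the sample record are illustrative only, and the recipes' own preprocessing may differ.

```python
from typing import Dict, Tuple


def to_training_text(record: Dict) -> Tuple[str, str]:
    """Split a HellaSwag-style record into (context, correct ending).

    Assumes the record has `ctx` (str), `endings` (list of str), and
    `label` (index of the correct ending, stored as a string).
    """
    ending = record["endings"][int(record["label"])]
    return record["ctx"], ending


# Hand-written sample in the same shape as a HellaSwag row (not real data).
sample = {
    "ctx": "A man pours water into a kettle. He",
    "endings": [
        "throws the kettle into the sea.",
        "places the kettle on the stove.",
        "paints the kettle blue.",
        "reads the kettle a story.",
    ],
    "label": "1",
}

context, ending = to_training_text(sample)
print(f"{context} {ending}")

# To pull the real data (requires network access or a warm cache):
#   from datasets import load_dataset
#   train = load_dataset("rowan/hellaswag", split="train")
```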

Recipes#

MiniMax-M2.5#

examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
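
As a rough illustration of what `layers_per_stage: 2` implies for an interleaved schedule, the sketch below counts pipeline stages for a hypothetical layer count. The rule used here (total stages equal the layer count divided by `layers_per_stage`, rounded up, and spread evenly across `pp_size` ranks) is an assumption about how such settings are commonly interpreted, not a statement of NeMo Automodel's exact implementation. The 64-layer figure is made up for the example.

```python
import math


def stages_per_rank(num_layers: int, layers_per_stage: int, pp_size: int) -> int:
    """Return how many pipeline stages each pipeline rank would hold.

    Assumption: total stages = ceil(num_layers / layers_per_stage), and the
    stages must divide evenly across the pp_size pipeline ranks.
    """
    total_stages = math.ceil(num_layers / layers_per_stage)
    if total_stages % pp_size != 0:
        raise ValueError(
            f"{total_stages} stages do not divide evenly across {pp_size} ranks"
        )
    return total_stages // pp_size


# Hypothetical 64-layer model with the settings shown above.
print(stages_per_rank(num_layers=64, layers_per_stage=2, pp_size=2))  # 16
```

Interleaved schedules place several stages on each rank, which is why the result here is greater than 1.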

GLM-5#

examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2

Step-3.5 Flash (StepFun)#

examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml — validated using 64 H100 GPUs (8 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2

DeepSeek-V3.2#

examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml — validated using 256 H100 GPUs (32 nodes x 8).

Key distributed settings:

distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
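
The four configurations share a pattern: the validated GPU count divided by `pp_size` equals `ep_size` (64 / 2 = 32 and 256 / 4 = 64). The sketch below checks that arithmetic for the recipes listed above. Treating this as a sizing rule is an inference from the numbers in this guide, not a documented constraint, and the helper name `gpus_per_pipeline_rank` is illustrative.

```python
# (total GPUs, pp_size, ep_size) as listed in this guide.
RECIPES = {
    "MiniMax-M2.5": (64, 2, 32),
    "GLM-5": (256, 4, 64),
    "Step-3.5 Flash": (64, 2, 32),
    "DeepSeek-V3.2": (256, 4, 64),
}


def gpus_per_pipeline_rank(world_size: int, pp_size: int) -> int:
    """GPUs that share each position in the pipeline.

    These are the GPUs left over for sharding (for example, expert
    parallelism) once the model has been split into pp_size pipeline ranks.
    """
    if world_size % pp_size != 0:
        raise ValueError("world_size must be a multiple of pp_size")
    return world_size // pp_size


for name, (world_size, pp_size, ep_size) in RECIPES.items():
    per_rank = gpus_per_pipeline_rank(world_size, pp_size)
    print(f"{name}: {per_rank} GPUs per pipeline rank, ep_size={ep_size}, "
          f"match={per_rank == ep_size}")
```

If you change the node count for a recipe, this kind of check is a quick way to see whether `pp_size` and `ep_size` still fit the new world size.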

Launch Training#

NeMo Automodel supports several ways to launch training, including the Automodel CLI with Slurm, interactive sessions, and torchrun. For full details on all launch options (Slurm batch jobs, multi-node configuration, environment variables, and so on), see the Run on a Cluster guide.

Automodel CLI#

automodel finetune llm -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml

Replace the recipe path with the one for your target model.
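
When scripting launches for more than one model, it can help to keep the recipe paths in one place. The sketch below builds the CLI argument list from the four recipe paths given in the Recipes section. The dictionary keys and the helper name `automodel_command` are illustrative choices, not part of NeMo Automodel.

```python
from typing import List

# Recipe paths as listed in the Recipes section of this guide.
RECIPE_PATHS = {
    "minimax-m2.5": "examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml",
    "glm-5": "examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml",
    "step-3.5-flash": "examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml",
    "deepseek-v3.2": "examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml",
}


def automodel_command(model_key: str) -> List[str]:
    """Build the `automodel finetune llm -c <recipe>` argument list."""
    return ["automodel", "finetune", "llm", "-c", RECIPE_PATHS[model_key]]


print(" ".join(automodel_command("deepseek-v3.2")))
```

The returned list can be passed directly to `subprocess.run` if you want to launch from Python.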

torchrun#

export TRANSFORMERS_OFFLINE=1
export HF_HOME=/your/path/to/hf_cache
export HF_DATASETS_OFFLINE=1
export WANDB_API_KEY=your_wandb_key

torchrun --nproc_per_node=8 \
         --nnodes=32 \
         --rdzv_backend=c10d \
         --rdzv_endpoint=${MASTER_ADDR}:${PORT} \
  nemo_automodel/recipes/llm/benchmark.py \
    -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml \
    --model.pretrained_model_name_or_path=/your/local/model_weights

Adjust the -c path, --nnodes, and --model.pretrained_model_name_or_path for the model you want to fine-tune; --nnodes should match the node count listed for that recipe.
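
The substitutions described above can be sketched as a small helper that derives `--nnodes` from the total GPU count and assembles the torchrun arguments. The script path defaults to the one in the command above; the helper name `torchrun_command`, the endpoint `node0:29500`, and the weights directory are placeholders for illustration.

```python
import shlex
from typing import List


def torchrun_command(
    recipe: str,
    weights_dir: str,
    total_gpus: int,
    rdzv_endpoint: str,
    gpus_per_node: int = 8,
    script: str = "nemo_automodel/recipes/llm/benchmark.py",
) -> List[str]:
    """Assemble torchrun arguments, deriving --nnodes from the GPU count."""
    if total_gpus % gpus_per_node != 0:
        raise ValueError("total_gpus must be a multiple of gpus_per_node")
    nnodes = total_gpus // gpus_per_node
    return [
        "torchrun",
        f"--nproc_per_node={gpus_per_node}",
        f"--nnodes={nnodes}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        "-c",
        recipe,
        f"--model.pretrained_model_name_or_path={weights_dir}",
    ]


# DeepSeek-V3.2 was validated on 256 GPUs, so this yields --nnodes=32.
cmd = torchrun_command(
    recipe="examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml",
    weights_dir="/your/local/model_weights",  # placeholder path
    total_gpus=256,
    rdzv_endpoint="node0:29500",  # placeholder host and port
)
print(shlex.join(cmd))
```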

Before you start:

  • Hugging Face applies rate limits on downloads. We recommend cloning the model repository to your local filesystem beforehand.

  • Ensure your Hugging Face cache (HF_HOME) is configured and that the dataset is already cached locally.

  • To enable Weights & Biases logging, set your WANDB_API_KEY and configure the wandb section in the YAML file.
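
The checklist above can be turned into a quick pre-flight script that runs before submitting a job. This is a minimal sketch: it only inspects the environment variables used in the torchrun example, and the function name `preflight_problems` is illustrative. It does not verify that the model or dataset is actually present in the cache.

```python
import os
from typing import List, Mapping


def preflight_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable warnings for missing launch prerequisites."""
    problems = []

    # The Hugging Face cache must be configured and point at a real directory.
    hf_home = env.get("HF_HOME", "")
    if not hf_home:
        problems.append("HF_HOME is not set")
    elif not os.path.isdir(hf_home):
        problems.append(f"HF_HOME does not exist: {hf_home}")

    # Offline flags avoid download attempts (and rate limits) during training.
    for flag in ("TRANSFORMERS_OFFLINE", "HF_DATASETS_OFFLINE"):
        if env.get(flag) != "1":
            problems.append(f"{flag} is not set to 1")

    # Only required when Weights & Biases logging is enabled in the YAML.
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set (only needed for W&B logging)")

    return problems


if __name__ == "__main__":
    for problem in preflight_problems(os.environ):
        print("warning:", problem)
```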