> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# Fine-Tune Large MoE LLMs

## Introduction

Mixture-of-Experts (MoE) architectures have become the dominant design for frontier language models, activating only a fraction of their total parameters per token to deliver strong performance at reduced compute cost. This guide walks through fine-tuning four example MoE LLMs with NVIDIA NeMo Automodel. For a full list of supported architectures, see the [LLM model coverage](/model-coverage/large-language-models/overview) page.

| Model          | HF Checkpoint                                                                   | Validated Using              |
| -------------- | ------------------------------------------------------------------------------- | ---------------------------- |
| GLM-5          | [`zai-org/GLM-5`](https://huggingface.co/zai-org/GLM-5)                         | 256 H100 GPUs (32 nodes x 8) |
| MiniMax-M2.5   | [`MiniMaxAI/MiniMax-M2.5`](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)       | 64 H100 GPUs (8 nodes x 8)   |
| Step-3.5 Flash | [`stepfun-ai/Step-3.5-Flash`](https://huggingface.co/stepfun-ai/Step-3.5-Flash) | 64 H100 GPUs (8 nodes x 8)   |
| DeepSeek-V3.2  | [`deepseek-ai/DeepSeek-V3.2`](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) | 256 H100 GPUs (32 nodes x 8) |

To set up your environment to run NeMo Automodel, follow the [installation guide](https://github.com/NVIDIA-NeMo/Automodel#-install-nemo-automodel).

## Data

### HellaSwag Dataset

All four recipes use the [HellaSwag](https://huggingface.co/datasets/rowan/hellaswag) dataset, a commonsense natural language inference benchmark where the model must predict the most plausible continuation of a given scenario.

* **Source**: `rowan/hellaswag`
* **Split**: `train` (used for both training and validation in these recipes)
* **Task**: Next-token prediction on commonsense sentence completions

For details on how to swap in your own dataset, see the [LLM Dataset Guide](/datasets/text-dataset) and the [Dataset Overview](/datasets/overview).

## Recipes

### MiniMax-M2.5

[`examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml`](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/minimax_m2/minimax_m2.5_hellaswag_pp.yaml) — validated using **64 H100 GPUs** (8 nodes x 8).

Key distributed settings:

```yaml
distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
```

### GLM-5

[`examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml`](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml) — validated using **256 H100 GPUs** (32 nodes x 8).

Key distributed settings:

```yaml
distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
```

### Step-3.5 Flash (StepFun)

[`examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml`](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/stepfun/step_3.5_flash_hellaswag_pp.yaml) — validated using **64 H100 GPUs** (8 nodes x 8).

Key distributed settings:

```yaml
distributed:
  strategy: fsdp2
  pp_size: 2
  ep_size: 32
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
```

### DeepSeek-V3.2

[`examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml`](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v32/deepseek_v32_hellaswag_pp.yaml) — validated using **256 H100 GPUs** (32 nodes x 8).

Key distributed settings:

```yaml
distributed:
  strategy: fsdp2
  pp_size: 4
  ep_size: 64
  activation_checkpointing: true
  pipeline:
    pp_schedule: interleaved1f1b
    layers_per_stage: 2
```

## Launch Training

NeMo Automodel supports several ways to launch training—via the Automodel CLI with Slurm, interactive sessions, `torchrun`, and more. For full details on all launch options (Slurm batch jobs, multi-node configuration, environment variables, etc.), see the [Run on a Cluster](/job-launchers/slurm-cluster) guide.

### Automodel CLI

```bash
automodel finetune llm -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml
```

Replace the recipe path with the one for your target model.

### torchrun

```bash
export TRANSFORMERS_OFFLINE=1
export HF_HOME=your/path/to/hf_cache
export HF_DATASETS_OFFLINE=1
export WANDB_API_KEY=your_wandb_key

torchrun --nproc-per-node=8 \
         --nnodes=8 \
         --rdzv_backend=c10d \
         --rdzv_endpoint=${MASTER_ADDR}:${PORT} \
  nemo_automodel/recipes/llm/benchmark.py \
    -c examples/llm_finetune/glm/glm_5_hellaswag_pp.yaml \
    --model.pretrained_model_name_or_path=/your/local/model_weights
```

Replace the `-c` path, `--nnodes`, and `--model.pretrained_model_name_or_path` for the model you want to fine-tune.

**Before you start**:

* Hugging Face applies rate limits on downloads. We recommend cloning the model repository to your local filesystem beforehand.
* Ensure your Hugging Face cache (`HF_HOME`) is configured and that the dataset is already cached locally.
* To enable Weights & Biases logging, set your `WANDB_API_KEY` and configure the `wandb` section in the YAML file.