# Auto Recipe — Recipe Index & Recommendation

This skill indexes every shipped recipe and helps users pick the right starting config, adjust parallelism, and avoid common pitfalls.

## How to Use This Skill

  1. Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).

  2. Look up the best-match recipe in the index below.

  3. Recommend the recipe function name + entry-point command.

  4. Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).


## Entry Points

### Library recipes (functional training)

```bash
# Pretrain with mock data
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```

### Performance recipes (throughput benchmarks)

```bash
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```

Perf recipes are NOT fully validated for correctness; most of their testing has been done on mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.


## Recipe Unification (Coming Soon — PR #2803)

PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:

- Perf recipes move from `scripts/performance/configs/` → `src/megatron/bridge/recipes/<family>/<model>_perf.py`
- Each perf recipe becomes a self-contained Python function (e.g. `llama3_8b_h100_bf16_pretrain_config()`)
- The old `WorkloadBaseConfig` → `set_workload_base_configs` → `get_perf_optimized_recipe` pipeline is removed
- Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG) and `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4)

Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.

After #2803 lands, both library and perf recipes will be invocable through the same run_recipe.py entry point.
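As a rough illustration of the target pattern (hypothetical names throughout — the `PerfConfig` dataclass and function bodies below are a toy stand-in for the real `ConfigContainer`-returning recipes, not the Bridge API), a self-contained perf recipe function might look like:

```python
from dataclasses import dataclass

# Toy stand-in for the real config object; actual recipes return a
# ConfigContainer with many more fields.
@dataclass
class PerfConfig:
    precision: str = "bf16"
    train_iters: int = 0
    tensor_model_parallel_size: int = 1

def _benchmark_common(cfg: PerfConfig) -> PerfConfig:
    # Shared benchmark defaults: the PR's helper pins a short timing run.
    cfg.train_iters = 50
    return cfg

def llama3_8b_h100_bf16_pretrain_config() -> PerfConfig:
    # Everything the workload needs is visible in one greppable function.
    cfg = PerfConfig(precision="bf16", tensor_model_parallel_size=1)
    return _benchmark_common(cfg)
```

Because each recipe is an ordinary function, composing variants (other precisions, other TP sizes) is just another function call rather than a YAML indirection layer.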


## Library Recipe Index

All recipes live under `src/megatron/bridge/recipes/`. Each function returns a `ConfigContainer` with model, training, optimizer, and data settings.

### Llama

| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| `llama2_7b_pretrain_config` | Pretrain | 2 | 1 | — | — | 2 | 4K |
| `llama3_8b_pretrain_config` | Pretrain | 2 | 1 | — | ✓ | 2 | 8K |
| `llama3_8b_16k_pretrain_config` | Pretrain | 2 | 1 | 2 | ✓ | 4 | 16K |
| `llama3_8b_64k_pretrain_config` | Pretrain | 2 | 1 | 4 | ✓ | 8 | 64K |
| `llama3_8b_128k_pretrain_config` | Pretrain | 2 | 1 | 8 | ✓ | 16 | 128K |
| `llama3_70b_pretrain_config` | Pretrain | 8 | 4 | — | ✓ | 32 | 8K |
| `llama3_70b_16k_pretrain_config` | Pretrain | 8 | 4 | 2 | ✓ | 64 | 16K |
| `llama3_70b_64k_pretrain_config` | Pretrain | 8 | 4 | 4 | ✓ | 128 | 64K |
| `llama31_405b_pretrain_config` | Pretrain | 8 | 16 | — | ✓ | 128 | 8K |
| `llama3_8b_sft_config` | SFT | 2 | 1 | — | ✓ | 2 | 8K |
| `llama3_70b_sft_config` | SFT | 4 | 4 | — | ✓ | 16 | 8K |
| `llama31_405b_sft_config` | SFT | 8 | 8 | — | ✓ | 64 | 8K |
| `llama3_8b_peft_config` | PEFT | 1 | 1 | — | — | 1 | 8K |
| `llama3_70b_peft_config` | PEFT | 2 | 4 | — | ✓ | 8 | 8K |
| `llama31_405b_peft_config` | PEFT | 4 | 8 | — | ✓ | 32 | 8K |

### Qwen2 / Qwen2.5

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `qwen2_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| `qwen25_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |

### Qwen3 (Dense)

| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| `qwen3_*_pretrain_config` | Pretrain | 1–8 | 1–2 | — | 600M–32B |
| `qwen3_*_sft_config` | SFT | 1–8 | 1–2 | — | 600M–32B |
| `qwen3_600m_sft_128k_config` | SFT | 1 | 1 | 8 | 600M (128K seq) |
| `qwen3_*_peft_config` | PEFT | 1 | 1 | — | 600M–32B |

### Qwen3 MoE

| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| `qwen3_30b_a3b_pretrain_config` | Pretrain | 1 | 1 | 8 | — | 8 |
| `qwen3_30b_a3b_sft_config` | SFT | 1 | 1 | 8 | — | 8 |
| `qwen3_30b_a3b_peft_config` | PEFT | 1 | 1 | 1 | — | 1 |
| `qwen3_235b_a22b_pretrain_config` | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| `qwen3_235b_a22b_sft_config` | SFT | 4 | 8 | 8 | — | 256 |
| `qwen3_235b_a22b_peft_config` | PEFT | 1 | 4 | 4 | — | 16 |

### Qwen3-Next

| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| `qwen3_next_80b_a3b_pretrain_config` | Pretrain | 1 | 4 | 8 |
| `qwen3_next_80b_a3b_sft_config` | SFT | 1 | 2 | 8 |
| `qwen3_next_80b_a3b_peft_config` | PEFT | 1 | 1 | 4 |

### DeepSeek

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `deepseek_v2_lite_pretrain_config` | Pretrain | 1 | 1 | 8 | 8 |
| `deepseek_v2_pretrain_config` | Pretrain | 1 | 4 | 32 | 128 |
| `deepseek_v3_pretrain_config` | Pretrain | 2 | 16 | 64 | 2048 |
| `deepseek_v3_pretrain_config_32nodes` | Pretrain | 2 | 8 | 32 | 256 |

### GLM-4.5

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `glm45_355b_pretrain_config` | Pretrain | 2 | 8 | 16 | 256 |
| `glm45_air_106b_pretrain_config` | Pretrain | 1 | 4 | 8 | 32 |
| `glm45_355b_sft_config` | SFT | 2 | 8 | 16 | 256 |
| `glm45_air_106b_sft_config` | SFT | 1 | 4 | 8 | 32 |
| `glm45_355b_peft_config` | PEFT | 2 | 4 | 4 | 32 |
| `glm45_air_106b_peft_config` | PEFT | 1 | 2 | 4 | 8 |

### Gemma

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `gemma2_*_{pretrain,sft,peft}_config` | All | 2–8 | 1–2 | 2B, 9B, 27B |
| `gemma3_1b_{pretrain,sft,peft}_config` | All | 1 | 1 | 1B (32K seq) |

### NemotronH / Nemotron

| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| `nemotronh_{4b,8b,47b,56b}_*_config` | P/S/PEFT | 1–8 | 1–4 | — | Dense SSM-hybrid |
| `nemotron_3_nano_*_config` | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| `nemotron_3_super_*_config` | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| `nemotron_nano_{9b,12b}_v2_*_config` | P/S/PEFT | varies | 1 | — | Dense |

### Other Models

| Recipe | Mode | Notes |
|---|---|---|
| `moonlight_16b_{pretrain,sft,peft}_config` | All | MoE EP=8 |
| `olmoe_7b_{pretrain,sft,peft}_config` | All | MoE EP=8 |
| `ministral3_{3b,8b,14b}_{sft,peft}_config` | SFT/PEFT | Dense |
| `gpt_oss_20b_*_config` | All | MoE + FP8/MXFP8 variants |
| `gpt_oss_120b_*_config` | All | MoE |
| `vanilla_gpt_pretrain_config` | Pretrain | MLM/Bridge parity baseline |
| `gpt3_175b_pretrain_config` | Pretrain | TP=4, PP=8, VP=6 |
| `kimi_k2_pretrain_config` | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |

### VLM Recipes

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `gemma3_vl_{4b,12b,27b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–2 | — | 1–16 |
| `qwen25_vl_{3b,7b,32b,72b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–4 | — | 1–32 |
| `qwen3_vl_{8b,30b_a3b,235b_a22b}_{sft,peft}_config` | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| `qwen35_vl_*_{sft,peft}_config` | SFT/PEFT | varies | varies | varies | varies |
| `glm_45v_{sft,peft}_config` | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| `nemotron_nano_v2_vl_12b_{sft,peft}_config` | SFT/PEFT | 2–4 | 1 | — | 8 |

### Diffusion Recipes

| Recipe | Mode | TP | CP |
|---|---|---|---|
| `wan_1_3B_{pretrain,sft}_config` | P/SFT | 1 | 8 |
| `wan_14B_{pretrain,sft}_config` | P/SFT | 2 | 4 |
| `flux_12b_{pretrain,sft}_config` | P/SFT | 2 | 1 |


## Performance Recipe Index

All perf recipes live under `scripts/performance/`. They are invoked via `run_script.py` and use `WorkloadBaseConfig` presets per GPU type.

**Important:** Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.

### Llama 3 / 3.1

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |

SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).

### DeepSeek V3

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |

### Qwen3 MoE

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |

### Qwen3-VL

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |

### Kimi K2

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |

### NemotronH

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |

### GPT-OSS

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |


## Recommendation Decision Tree

```text
User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```
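The same triage can be scripted. A rough sketch for the Llama family only (the recipe names come from the Library Recipe Index, the GPU thresholds mirror the tree above, and the helper name is ours):

```python
def suggest_llama_recipe(goal: str, num_gpus: int) -> str:
    """Map (training goal, GPU count) to a starting Llama recipe.

    goal is 'pretrain', 'sft', or 'peft'. Treat the result as a starting
    point only; re-check the index for the exact model size and mode.
    """
    if num_gpus == 1 or goal == "peft":
        return "llama3_8b_peft_config"  # only PEFT fits a single GPU
    if goal == "sft":
        return "llama3_70b_sft_config" if num_gpus >= 16 else "llama3_8b_sft_config"
    if goal == "pretrain":
        if num_gpus >= 128:
            return "llama31_405b_pretrain_config"
        if num_gpus >= 32:
            return "llama3_70b_pretrain_config"
        return "llama3_8b_pretrain_config"
    raise ValueError(f"unknown goal: {goal}")
```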

## Adjustment Advice (When Recommending)

### Parallelism Resizing Rules

When the user’s GPU count differs from the recipe default:

1. TP must divide `num_key_value_heads` (GQA constraint). E.g. if `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}.
2. TP should stay within a single node (NVLink). TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
3. PP adds pipeline bubbles. Minimize PP; only increase it when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
4. EP doesn't reduce dense-layer memory. Only expert parameters shard with EP; shared attention/embedding weights are replicated. For "OOM with MoE", increase EP first, not TP.
5. SP should be True whenever TP > 1. It eliminates redundant activation copies and is essentially free.
6. CP requires all-to-all or ring attention. Check `cp_comm_type`. For GQA models, a2a+p2p hierarchical CP allows CP > num_kv_heads.
7. world_size = DP × TP × PP × CP × EP. DP is implicit. Make sure the product of explicit parallelisms divides your total GPU count.
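Rules 1 and 7 are mechanical enough to check before launching. A minimal sketch (it applies the world-size formula from rule 7 as stated above; the helper name is ours, not a Bridge API):

```python
def check_parallelism(world_size: int, tp: int, pp: int,
                      cp: int = 1, ep: int = 1, num_kv_heads: int = 8) -> int:
    """Validate a layout against rules 1 and 7; return the implied DP size."""
    # Rule 1: GQA constraint — TP must divide the number of KV heads.
    if num_kv_heads % tp != 0:
        raise ValueError(f"TP={tp} must divide num_kv_heads={num_kv_heads}")
    # Rule 7: the explicit parallelisms must divide the total GPU count.
    explicit = tp * pp * cp * ep
    if world_size % explicit != 0:
        raise ValueError(f"TP*PP*CP*EP={explicit} must divide world_size={world_size}")
    return world_size // explicit  # DP is whatever is left over

# e.g. llama3_70b_pretrain_config defaults (TP=8, PP=4) on 64 GPUs imply DP=2
```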

### Batch Size Tuning

- Start with the recipe's `micro_batch_size`. If you hit OOM, reduce it to 1.
- `global_batch_size` determines learning dynamics. Scale with DP: GBS = micro_batch_size × DP × gradient_accumulation_steps.
- For MoE, `micro_batch_size=1` is typical at scale.
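The GBS relation above also pins down gradient accumulation once micro-batch size and DP are fixed (a quick arithmetic helper; the function name is illustrative):

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Solve GBS = micro_batch_size * DP * grad_accum for grad_accum."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * DP")
    return global_batch_size // per_step

# e.g. GBS=64 with micro_batch_size=1 on DP=8 ranks gives 8 accumulation steps
```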

### Common Pitfalls to Warn About

| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large pipeline bubble) | Set `virtual_pipeline_model_parallel_size` |
| EP too low for large MoE | OOM on expert params | Increase EP; each rank holds num_experts/EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use local full-iteration graphs |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to local impl |
| `use_te_rng_tracker` not set | Assert on provider init when CUDA graphs enabled | Set `cfg.model.use_te_rng_tracker = True` and `cfg.rng.te_rng_tracker = True` |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use `*_16k`, `*_64k`, or `*_128k` recipe variants |
| MoE `overlap_grad_reduce` on H100 | May hurt perf (False in many H100 presets) | Set `overlap_grad_reduce=False` for MoE on H100 |
| VLM SFT missing image data | Runs but produces garbage | Provide an actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |

## Recipe Override Examples

```bash
# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run torchrun --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```

## Quick Reference: Which Recipe for My Situation?

| I want to… | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | `llama3_8b_sft_config` + mock data | 2 |
| Fine-tune a 7–8B model | `llama3_8b_sft_config` or `qwen3_8b_sft_config` | 2–8 |
| LoRA on 1 GPU | `llama3_8b_peft_config` or `qwen3_8b_peft_config` | 1 |
| Pretrain a dense 70B | `llama3_70b_pretrain_config` | 32–64 |
| Train a small MoE | `qwen3_30b_a3b_pretrain_config` | 8 |
| Train a large MoE (235B+) | `qwen3_235b_a22b_pretrain_config` | 256–512 |
| Benchmark throughput | Perf recipes via `run_script.py` | Varies |
| Long-context training | `llama3_8b_128k_pretrain_config` or add CP override | 16+ |
| VLM fine-tuning | `qwen3_vl_8b_sft_config` or `gemma3_vl_*_sft_config` | 4–8 |
| Diffusion training | `wan_1_3B_pretrain_config` or `flux_12b_pretrain_config` | 8 |


## Code Anchors

| What | Path |
|---|---|
| Library recipes root | `src/megatron/bridge/recipes/` |
| Recipe `__init__.py` (all exports) | `src/megatron/bridge/recipes/__init__.py` |
| Common recipe helpers | `src/megatron/bridge/recipes/common.py` |
| Training entry point | `scripts/training/run_recipe.py` |
| Perf recipes root | `scripts/performance/` |
| Perf entry point | `scripts/performance/run_script.py` |
| Perf workload configs | `scripts/performance/configs/<family>/` |
| Perf overrides (benchmark defaults) | `scripts/performance/utils/overrides.py` |