# Auto Recipe – Recipe Index & Recommendation
This skill indexes every shipped recipe and helps users pick the right starting config, adjust parallelism, and avoid common pitfalls.
## How to Use This Skill
1. Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).
2. Look up the best-match recipe in the index below.
3. Recommend the recipe function name + entry-point command.
4. Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).
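For step 3, most recipe function names follow a predictable `<family>_<size>_<mode>_config` pattern (visible in the examples throughout this document). A minimal sketch of the lookup; `build_recipe_name` is a hypothetical helper, not a Bridge API:

```python
def build_recipe_name(family: str, size: str, mode: str) -> str:
    """Compose a candidate recipe function name from the user's answers.

    Hypothetical helper (not a Bridge API) -- always confirm the resulting
    name actually exists in the recipe index before recommending it.
    mode is "pretrain", "sft", or "peft" (PEFT covers LoRA/DoRA).
    """
    return f"{family}_{size}_{mode}_config".lower()

# Names matching recipes referenced elsewhere in this document:
print(build_recipe_name("llama3", "8b", "pretrain"))  # llama3_8b_pretrain_config
print(build_recipe_name("qwen3", "30b_a3b", "sft"))   # qwen3_30b_a3b_sft_config
```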
## Entry Points
### Library recipes (functional training)
```bash
# Pretrain with mock data
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```
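The trailing `'section.field=value'` arguments are dotted-path overrides applied on top of the recipe's defaults. A rough sketch of the semantics (illustrative only; the real CLI operates on typed config objects, not dicts):

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one 'section.field=value' override in place (sketch)."""
    path, raw = override.split("=", 1)
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    # Best-effort typing: integers stay integers, everything else is a string.
    try:
        node[leaf] = int(raw)
    except ValueError:
        node[leaf] = raw

cfg: dict = {}
apply_override(cfg, "model.tensor_model_parallel_size=2")
apply_override(cfg, "training.global_batch_size=64")
print(cfg)
# {'model': {'tensor_model_parallel_size': 2}, 'training': {'global_batch_size': 64}}
```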
### Performance recipes (throughput benchmarks)
```bash
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```
Perf recipes are NOT fully validated for correctness; most testing has been done on mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.
## Recipe Unification (Coming Soon – PR #2803)
PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:

- Perf recipes move from `scripts/performance/configs/` → `src/megatron/bridge/recipes/<family>/<model>_perf.py`
- Each perf recipe becomes a self-contained Python function (e.g. `llama3_8b_h100_bf16_pretrain_config()`)
- The old `WorkloadBaseConfig` → `set_workload_base_configs` → `get_perf_optimized_recipe` pipeline is removed
- Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG), `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4)
Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.
After #2803 lands, both library and perf recipes will be invocable through the same `run_recipe.py` entry point.
## Library Recipe Index
All recipes live under `src/megatron/bridge/recipes/`. Each function returns a `ConfigContainer` with model, training, optimizer, and data settings.
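As a rough mental model, a recipe function builds a container of defaults, tweaks a few fields, and returns it. The sketch below is illustrative only: the field names are taken from the override examples in this document, not from the real `ConfigContainer` definition.

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- field names come from the override strings
# used in this document, not from the actual ConfigContainer class.
@dataclass
class ModelSettings:
    tensor_model_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    context_parallel_size: int = 1
    seq_length: int = 8192

@dataclass
class TrainingSettings:
    global_batch_size: int = 32
    micro_batch_size: int = 1

@dataclass
class ConfigContainerSketch:
    model: ModelSettings = field(default_factory=ModelSettings)
    training: TrainingSettings = field(default_factory=TrainingSettings)

def example_pretrain_config() -> ConfigContainerSketch:
    """Shape of a recipe function: build defaults, tweak, return."""
    cfg = ConfigContainerSketch()
    cfg.model.tensor_model_parallel_size = 2  # placeholder value
    return cfg
```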
### Llama
| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| | Pretrain | 2 | 1 | – | – | 2 | 4K |
| | Pretrain | 2 | 1 | – | – | 2 | 8K |
| | Pretrain | 2 | 1 | 2 | – | 4 | 16K |
| | Pretrain | 2 | 1 | 4 | – | 8 | 64K |
| | Pretrain | 2 | 1 | 8 | – | 16 | 128K |
| | Pretrain | 8 | 4 | – | – | 32 | 8K |
| | Pretrain | 8 | 4 | 2 | – | 64 | 16K |
| | Pretrain | 8 | 4 | 4 | – | 128 | 64K |
| | Pretrain | 8 | 16 | – | – | 128 | 8K |
| | SFT | 2 | 1 | – | – | 2 | 8K |
| | SFT | 4 | 4 | – | – | 16 | 8K |
| | SFT | 8 | 8 | – | – | 64 | 8K |
| | PEFT | 1 | 1 | – | – | 1 | 8K |
| | PEFT | 2 | 4 | – | – | 8 | 8K |
| | PEFT | 4 | 8 | – | – | 32 | 8K |
### Qwen2 / Qwen2.5
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |
### Qwen3 (Dense)
| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| | Pretrain | 1–8 | 1–2 | – | 600M–32B |
| | SFT | 1–8 | 1–2 | – | 600M–32B |
| | SFT | 1 | 1 | 8 | 600M (128K seq) |
| | PEFT | 1 | 1 | – | 600M–32B |
### Qwen3 MoE
| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | – | 8 |
| | SFT | 1 | 1 | 8 | – | 8 |
| | PEFT | 1 | 1 | 1 | – | 1 |
| | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| | SFT | 4 | 8 | 8 | – | 256 |
| | PEFT | 1 | 4 | 4 | – | 16 |
### Qwen3-Next
| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| | Pretrain | 1 | 4 | 8 |
| | SFT | 1 | 2 | 8 |
| | PEFT | 1 | 1 | 4 |
### DeepSeek
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | 8 |
| | Pretrain | 1 | 4 | 32 | 128 |
| | Pretrain | 2 | 16 | 64 | 2048 |
| | Pretrain | 2 | 8 | 32 | 256 |
### GLM-4.5
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 2 | 8 | 16 | 256 |
| | Pretrain | 1 | 4 | 8 | 32 |
| | SFT | 2 | 8 | 16 | 256 |
| | SFT | 1 | 4 | 8 | 32 |
| | PEFT | 2 | 4 | 4 | 32 |
| | PEFT | 1 | 2 | 4 | 8 |
### Gemma
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 2–8 | 1–2 | 2B, 9B, 27B |
| | All | 1 | 1 | 1B (32K seq) |
### NemotronH / Nemotron
| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| | P/S/PEFT | 1–8 | 1–4 | – | Dense SSM-hybrid |
| | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| | P/S/PEFT | varies | 1 | – | Dense |
### Other Models
| Recipe | Mode | Notes |
|---|---|---|
| | All | MoE EP=8 |
| | All | MoE EP=8 |
| | SFT/PEFT | Dense |
| | All | MoE + FP8/MXFP8 variants |
| | All | MoE |
| | Pretrain | MLM/Bridge parity baseline |
| | Pretrain | TP=4, PP=8, VP=6 |
| | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |
### VLM Recipes
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | SFT/PEFT | 1–8 | 1–2 | – | 1–16 |
| | SFT/PEFT | 1–8 | 1–4 | – | 1–32 |
| | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| | SFT/PEFT | varies | varies | varies | varies |
| | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| | SFT/PEFT | 2–4 | 1 | – | 8 |
### Diffusion Recipes
| Recipe | Mode | TP | CP |
|---|---|---|---|
| | P/SFT | 1 | 8 |
| | P/SFT | 2 | 4 |
| | P/SFT | 2 | 1 |
## Performance Recipe Index
All perf recipes live under `scripts/performance/`. They are invoked via `run_script.py` and use `WorkloadBaseConfig` presets per GPU type.

**Important:** Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.
### Llama 3 / 3.1
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |
SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).
### DeepSeek V3
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |
### Qwen3 MoE
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |
### Qwen3-VL
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |
### Kimi K2
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |
### NemotronH
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |
### GPT-OSS
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |
## Recommendation Decision Tree
```
User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```
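The GPU-count branch of the tree can be mirrored in code; a hedged sketch (`recommend_scale` is illustrative, not a shipped helper, and the boundaries come straight from the tree above):

```python
def recommend_scale(num_gpus: int) -> str:
    """Map a GPU budget to the recipe class suggested by the decision tree."""
    if num_gpus == 1:
        return "PEFT recipes only (TP=1, PP=1)"
    if num_gpus <= 8:
        return "8B-16B models or small MoE (EP=8)"
    if num_gpus <= 64:
        return "70B dense or medium MoE"
    return "405B+ or large MoE (DeepSeek V3, Kimi K2)"

print(recommend_scale(8))    # 8B-16B models or small MoE (EP=8)
print(recommend_scale(256))  # 405B+ or large MoE (DeepSeek V3, Kimi K2)
```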
## Adjustment Advice (When Recommending)
### Parallelism Resizing Rules
When the user's GPU count differs from the recipe default:

- **TP** must divide `num_key_value_heads` (GQA constraint). E.g. if `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}. TP should stay within a single node (NVLink); TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
- **PP** adds pipeline bubbles. Minimize PP; only increase it when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
- **EP** doesn't reduce dense-layer memory: only expert parameters shard with EP, while shared attention/embeddings are replicated. For "OOM with MoE", increase EP first, not TP.
- **SP** should be `True` whenever TP > 1. It eliminates redundant activation copies and is essentially free.
- **CP** requires all-to-all or ring attention; check `cp_comm_type`. For GQA models, `a2a+p2p` hierarchical CP allows CP > num_kv_heads.
- `world_size = DP × TP × PP × CP × EP`. DP is implicit; make sure the product of the explicit parallelisms divides your total GPU count.
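These rules can be checked mechanically before launching. A minimal sketch (hypothetical helper, using the `world_size = DP × TP × PP × CP × EP` identity stated above):

```python
def check_parallelism(num_gpus: int, tp: int, pp: int, cp: int = 1,
                      ep: int = 1, num_kv_heads: int = 8) -> list:
    """Return a list of rule violations (empty list means the layout is sane)."""
    problems = []
    if num_kv_heads % tp != 0:  # GQA constraint: TP must divide num_key_value_heads
        problems.append(f"TP={tp} must divide num_key_value_heads={num_kv_heads}")
    if tp > 8:
        problems.append("TP>8 needs inter-node NVLink (e.g. GB200 NVL72)")
    explicit = tp * pp * cp * ep
    if num_gpus % explicit != 0:  # DP is implicit: DP = num_gpus / (TP*PP*CP*EP)
        problems.append(f"TP*PP*CP*EP={explicit} must divide num_gpus={num_gpus}")
    return problems

print(check_parallelism(64, tp=8, pp=4))  # [] -> DP=2, layout is valid
print(check_parallelism(48, tp=3, pp=1))  # GQA violation: 3 does not divide 8
```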
### Batch Size Tuning
- Start with the recipe's `micro_batch_size`. If OOM, reduce it to 1.
- `global_batch_size` determines learning dynamics. Scale with DP: `GBS = micro_batch_size × DP × gradient_accumulation_steps`.
- For MoE, `micro_batch_size=1` is typical at scale.
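The GBS identity above pins down gradient accumulation once MBS and DP are chosen; a quick sketch for sanity-checking a batch configuration (hypothetical helper):

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Solve GBS = micro_batch_size * DP * gradient_accumulation_steps for GA."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"GBS={global_batch_size} is not divisible by MBS*DP={per_step}")
    return global_batch_size // per_step

# E.g. 8 GPUs with TP=2, PP=1 gives DP=4; GBS=64 at MBS=2 -> 8 accumulation steps
print(grad_accum_steps(64, micro_batch_size=2, dp=4))  # 8
```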
### Common Pitfalls to Warn About
| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large bubble) | Set |
| EP too low for large MoE | OOM on expert params | Increase EP; each expert lives on EP/num_experts ranks |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to |
| | Assert on provider init when CUDA graphs enabled | Set |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use |
| MoE | May hurt perf (False in many H100 presets) | Set |
| VLM SFT missing image data | Runs but produces garbage | Provide actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
## Recipe Override Examples
```bash
# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run torchrun --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```
## Quick Reference: Which Recipe for My Situation?
| I want to… | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | | 2 |
| Fine-tune a 7-8B model | | 2–8 |
| LoRA on 1 GPU | | 1 |
| Pretrain a dense 70B | | 32–64 |
| Train a small MoE | | 8 |
| Train a large MoE (235B+) | | 256–512 |
| Benchmark throughput | Perf recipes via | Varies |
| Long-context training | | 16+ |
| VLM fine-tuning | | 4–8 |
| Diffusion training | | 8 |
## Code Anchors
| What | Path |
|---|---|
| Library recipes root | |
| Recipe | |
| Common recipe helpers | |
| Training entry point | |
| Perf recipes root | |
| Perf entry point | |
| Perf workload configs | |
| Perf overrides (benchmark defaults) | |