# Auto Recipe – Recipe Index & Recommendation
This skill indexes every shipped recipe and helps users pick the right starting config, adjust parallelism, and avoid common pitfalls.
## How to Use This Skill
1. Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).
2. Look up the best-match recipe in the index below.
3. Recommend the recipe function name + entry-point command.
4. Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).
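For step 3, most recipe function names follow a predictable `<family>_<size>_<mode>_config` pattern (visible in the examples throughout this document). A minimal sketch of the lookup; `build_recipe_name` is a hypothetical helper, not a Bridge API:

```python
def build_recipe_name(family: str, size: str, mode: str) -> str:
    """Compose a candidate recipe function name from the user's answers.

    Hypothetical helper (not a Bridge API) -- always confirm the resulting
    name actually exists in the recipe index before recommending it.
    mode is "pretrain", "sft", or "peft" (PEFT covers LoRA/DoRA).
    """
    return f"{family}_{size}_{mode}_config".lower()

# Names matching recipes referenced elsewhere in this document:
print(build_recipe_name("llama3", "8b", "pretrain"))  # llama3_8b_pretrain_config
print(build_recipe_name("qwen3", "30b_a3b", "sft"))   # qwen3_30b_a3b_sft_config
```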
## Entry Points
### Library recipes (functional training)
```bash
# Pretrain with mock data
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```
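The trailing `'section.field=value'` arguments are dotted-path overrides applied on top of the recipe's defaults. A rough sketch of the semantics (illustrative only; the real CLI operates on typed config objects, not dicts):

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one 'section.field=value' override in place (sketch)."""
    path, raw = override.split("=", 1)
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    # Best-effort typing: integers stay integers, everything else is a string.
    try:
        node[leaf] = int(raw)
    except ValueError:
        node[leaf] = raw

cfg: dict = {}
apply_override(cfg, "model.tensor_model_parallel_size=2")
apply_override(cfg, "training.global_batch_size=64")
print(cfg)
# {'model': {'tensor_model_parallel_size': 2}, 'training': {'global_batch_size': 64}}
```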
### Performance recipes (throughput benchmarks)
```bash
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```
Perf recipes are NOT fully validated for correctness; most testing has been done on mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.
## Recipe Unification (Coming Soon – PR #2803)
PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:

- Perf recipes move from `scripts/performance/configs/` → `src/megatron/bridge/recipes/<family>/<model>_perf.py`
- Each perf recipe becomes a self-contained Python function (e.g. `llama3_8b_h100_bf16_pretrain_config()`)
- The old `WorkloadBaseConfig` → `set_workload_base_configs` → `get_perf_optimized_recipe` pipeline is removed
- Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG), `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4)
Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.
After #2803 lands, both library and perf recipes will be invocable through the same `run_recipe.py` entry point.
## Library Recipe Index
All recipes live under `src/megatron/bridge/recipes/`. Each function returns a `ConfigContainer` with model, training, optimizer, and data settings.
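As a rough mental model, a recipe function builds a container of defaults, tweaks a few fields, and returns it. The sketch below is illustrative only: the field names are taken from the override examples in this document, not from the real `ConfigContainer` definition.

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- field names come from the override strings
# used in this document, not from the actual ConfigContainer class.
@dataclass
class ModelSettings:
    tensor_model_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    context_parallel_size: int = 1
    seq_length: int = 8192

@dataclass
class TrainingSettings:
    global_batch_size: int = 32
    micro_batch_size: int = 1

@dataclass
class ConfigContainerSketch:
    model: ModelSettings = field(default_factory=ModelSettings)
    training: TrainingSettings = field(default_factory=TrainingSettings)

def example_pretrain_config() -> ConfigContainerSketch:
    """Shape of a recipe function: build defaults, tweak, return."""
    cfg = ConfigContainerSketch()
    cfg.model.tensor_model_parallel_size = 2  # placeholder value
    return cfg
```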
### Llama
| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| | Pretrain | 2 | 1 | – | – | 2 | 4K |
| | Pretrain | 2 | 1 | – | – | 2 | 8K |
| | Pretrain | 2 | 1 | 2 | – | 4 | 16K |
| | Pretrain | 2 | 1 | 4 | – | 8 | 64K |
| | Pretrain | 2 | 1 | 8 | – | 16 | 128K |
| | Pretrain | 8 | 4 | – | – | 32 | 8K |
| | Pretrain | 8 | 4 | 2 | – | 64 | 16K |
| | Pretrain | 8 | 4 | 4 | – | 128 | 64K |
| | Pretrain | 8 | 16 | – | – | 128 | 8K |
| | SFT | 2 | 1 | – | – | 2 | 8K |
| | SFT | 4 | 4 | – | – | 16 | 8K |
| | SFT | 8 | 8 | – | – | 64 | 8K |
| | PEFT | 1 | 1 | – | – | 1 | 8K |
| | PEFT | 2 | 4 | – | – | 8 | 8K |
| | PEFT | 4 | 8 | – | – | 32 | 8K |
### Qwen2 / Qwen2.5
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |
### Qwen3 (Dense)
| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| | Pretrain | 1–8 | 1–2 | – | 600M–32B |
| | SFT | 1–8 | 1–2 | – | 600M–32B |
| | SFT | 1 | 1 | 8 | 600M (128K seq) |
| | PEFT | 1 | 1 | – | 600M–32B |
### Qwen3 MoE
| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | – | 8 |
| | SFT | 1 | 1 | 8 | – | 8 |
| | PEFT | 1 | 1 | 1 | – | 1 |
| | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| | SFT | 4 | 8 | 8 | – | 256 |
| | PEFT | 1 | 4 | 4 | – | 16 |
### Qwen3-Next
| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| | Pretrain | 1 | 4 | 8 |
| | SFT | 1 | 2 | 8 |
| | PEFT | 1 | 1 | 4 |
### DeepSeek
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | 8 |
| | Pretrain | 1 | 4 | 32 | 128 |
| | Pretrain | 2 | 16 | 64 | 2048 |
| | Pretrain | 2 | 8 | 32 | 256 |
### GLM-4.5
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 2 | 8 | 16 | 256 |
| | Pretrain | 1 | 4 | 8 | 32 |
| | SFT | 2 | 8 | 16 | 256 |
| | SFT | 1 | 4 | 8 | 32 |
| | PEFT | 2 | 4 | 4 | 32 |
| | PEFT | 1 | 2 | 4 | 8 |
### Gemma
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 2–8 | 1–2 | 2B, 9B, 27B |
| | All | 1 | 1 | 1B (32K seq) |
### NemotronH / Nemotron
| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| | P/S/PEFT | 1–8 | 1–4 | – | Dense SSM-hybrid |
| | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| | P/S/PEFT | varies | 1 | – | Dense |
### Other Models
| Recipe | Mode | Notes |
|---|---|---|
| | All | MoE EP=8 |
| | All | MoE EP=8 |
| | SFT/PEFT | Dense |
| | All | MoE + FP8/MXFP8 variants |
| | All | MoE |
| | Pretrain | MLM/Bridge parity baseline |
| | Pretrain | TP=4, PP=8, VP=6 |
| | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |
### VLM Recipes
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | SFT/PEFT | 1–8 | 1–2 | – | 1–16 |
| | SFT/PEFT | 1–8 | 1–4 | – | 1–32 |
| | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| | SFT/PEFT | varies | varies | varies | varies |
| | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| | SFT/PEFT | 2–4 | 1 | – | 8 |
### Diffusion Recipes
| Recipe | Mode | TP | CP |
|---|---|---|---|
| | P/SFT | 1 | 8 |
| | P/SFT | 2 | 4 |
| | P/SFT | 2 | 1 |
## Performance Recipe Index
All perf recipes live under `scripts/performance/`. They are invoked via `run_script.py` and use `WorkloadBaseConfig` presets per GPU type.

**Important:** Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.
### Llama 3 / 3.1
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |
SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).
### DeepSeek V3
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |
### Qwen3 MoE
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |
### Qwen3-VL
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |
### Kimi K2
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |
### NemotronH
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |
### GPT-OSS
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |
## Recommendation Decision Tree
```
User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```
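The GPU-count branch of the tree can be mirrored in code; a hedged sketch (`recommend_scale` is illustrative, not a shipped helper, and the boundaries come straight from the tree above):

```python
def recommend_scale(num_gpus: int) -> str:
    """Map a GPU budget to the recipe class suggested by the decision tree."""
    if num_gpus == 1:
        return "PEFT recipes only (TP=1, PP=1)"
    if num_gpus <= 8:
        return "8B-16B models or small MoE (EP=8)"
    if num_gpus <= 64:
        return "70B dense or medium MoE"
    return "405B+ or large MoE (DeepSeek V3, Kimi K2)"

print(recommend_scale(8))    # 8B-16B models or small MoE (EP=8)
print(recommend_scale(256))  # 405B+ or large MoE (DeepSeek V3, Kimi K2)
```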
## Adjustment Advice (When Recommending)
### Parallelism Resizing Rules
When the user's GPU count differs from the recipe default:

- **TP** must divide `num_key_value_heads` (GQA constraint). E.g. if `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}. TP should stay within a single node (NVLink); TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
- **PP** adds pipeline bubbles. Minimize PP; only increase it when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
- **EP** doesn't reduce dense-layer memory: only expert parameters shard with EP, while shared attention/embeddings are replicated. For "OOM with MoE", increase EP first, not TP.
- **SP** should be `True` whenever TP > 1. It eliminates redundant activation copies and is essentially free.
- **CP** requires all-to-all or ring attention; check `cp_comm_type`. For GQA models, `a2a+p2p` hierarchical CP allows CP > num_kv_heads.
- `world_size = DP × TP × PP × CP × EP`. DP is implicit; make sure the product of the explicit parallelisms divides your total GPU count.
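These rules can be checked mechanically before launching. A minimal sketch (hypothetical helper, using the `world_size = DP × TP × PP × CP × EP` identity stated above):

```python
def check_parallelism(num_gpus: int, tp: int, pp: int, cp: int = 1,
                      ep: int = 1, num_kv_heads: int = 8) -> list:
    """Return a list of rule violations (empty list means the layout is sane)."""
    problems = []
    if num_kv_heads % tp != 0:  # GQA constraint: TP must divide num_key_value_heads
        problems.append(f"TP={tp} must divide num_key_value_heads={num_kv_heads}")
    if tp > 8:
        problems.append("TP>8 needs inter-node NVLink (e.g. GB200 NVL72)")
    explicit = tp * pp * cp * ep
    if num_gpus % explicit != 0:  # DP is implicit: DP = num_gpus / (TP*PP*CP*EP)
        problems.append(f"TP*PP*CP*EP={explicit} must divide num_gpus={num_gpus}")
    return problems

print(check_parallelism(64, tp=8, pp=4))  # [] -> DP=2, layout is valid
print(check_parallelism(48, tp=3, pp=1))  # GQA violation: 3 does not divide 8
```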
### Batch Size Tuning
- Start with the recipe's `micro_batch_size`. If OOM, reduce it to 1.
- `global_batch_size` determines learning dynamics. Scale with DP: `GBS = micro_batch_size × DP × gradient_accumulation_steps`.
- For MoE, `micro_batch_size=1` is typical at scale.
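The GBS identity above pins down gradient accumulation once MBS and DP are chosen; a quick sketch for sanity-checking a batch configuration (hypothetical helper):

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Solve GBS = micro_batch_size * DP * gradient_accumulation_steps for GA."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"GBS={global_batch_size} is not divisible by MBS*DP={per_step}")
    return global_batch_size // per_step

# E.g. 8 GPUs with TP=2, PP=1 gives DP=4; GBS=64 at MBS=2 -> 8 accumulation steps
print(grad_accum_steps(64, micro_batch_size=2, dp=4))  # 8
```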
### Common Pitfalls to Warn About
| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large bubble) | Set |
| EP too low for large MoE | OOM on expert params | Increase EP; each expert lives on EP/num_experts ranks |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to |
| | Assert on provider init when CUDA graphs enabled | Set |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use |
| MoE | May hurt perf (False in many H100 presets) | Set |
| VLM SFT missing image data | Runs but produces garbage | Provide actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
## Recipe Override Examples
```bash
# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run torchrun --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run torchrun --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```
## Quick Reference: Which Recipe for My Situation?
| I want to… | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | | 2 |
| Fine-tune a 7-8B model | | 2–8 |
| LoRA on 1 GPU | | 1 |
| Pretrain a dense 70B | | 32–64 |
| Train a small MoE | | 8 |
| Train a large MoE (235B+) | | 256–512 |
| Benchmark throughput | Perf recipes via | Varies |
| Long-context training | | 16+ |
| VLM fine-tuning | | 4–8 |
| Diffusion training | | 8 |
## Code Anchors
| What | Path |
|---|---|
| Library recipes root | |
| Recipe | |
| Common recipe helpers | |
| Training entry point | |
| Perf recipes root | |
| Perf entry point | |
| Perf workload configs | |
| Perf overrides (benchmark defaults) | |