bridge.perf_recipes.llama.b200.llama3

B200 performance recipes for Llama 3.

Module Contents

Functions

llama3_8b_pretrain_8gpu_b200_bf16_config

Llama3 8B pretrain: 8× B200, BF16, CUDA graph local.

llama3_8b_pretrain_8gpu_b200_fp8cs_config

Llama3 8B pretrain: 8× B200, FP8 current-scaling, CUDA graph local.

llama3_8b_pretrain_8gpu_b200_fp8mx_config

Llama3 8B pretrain: 8× B200, MXFP8, CUDA graph local.

llama3_8b_pretrain_8gpu_b200_nvfp4_config

Llama3 8B pretrain: 8× B200, NVFP4.

llama3_70b_pretrain_64gpu_b200_bf16_config

Llama3 70B pretrain: 64× B200, BF16, TP=2 PP=4 CP=2, CUDA graph local, GBS=256.

llama3_70b_pretrain_64gpu_b200_fp8cs_config

Llama3 70B pretrain: 64× B200, FP8 current-scaling, FSDP, GBS=256.

llama3_70b_pretrain_64gpu_b200_fp8mx_config

Llama3 70B pretrain: 64× B200, MXFP8, TP=2 PP=4, GBS=256.

llama3_70b_pretrain_64gpu_b200_nvfp4_config

Llama3 70B pretrain: 64× B200, NVFP4, TP=2 PP=4, GBS=256.

llama3_70b_lora_8gpu_b200_bf16_config

Llama3 70B LoRA: 8× B200, BF16, PP=2.

llama3_70b_lora_8gpu_b200_fp8cs_config

Llama3 70B LoRA: 8× B200, FP8 current-scaling, PP=2.

llama3_70b_lora_8gpu_b200_fp8mx_config

Llama3 70B LoRA: 8× B200, MXFP8, PP=2.

llama3_8b_pretrain_64gpu_b200_bf16_config

Llama3 8B pretrain: 64× B200, BF16, legacy-scaled GBS.

llama3_8b_pretrain_64gpu_b200_fp8cs_config

Llama3 8B pretrain: 64× B200, FP8 current-scaling, legacy-scaled GBS.
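The function names above appear to follow one pattern: model, parameter count, task, GPU count, hardware, precision, then a `_config` suffix. The sketch below parses a name into those fields. The regular expression is inferred from the names listed on this page and is not an official schema; `RecipeName` and `parse_recipe_name` are hypothetical helpers, not part of the library.

```python
import re
from typing import NamedTuple

# Pattern inferred from the recipe names on this page (an assumption, not an official schema).
RECIPE_NAME = re.compile(
    r"^(?P<model>llama3)_(?P<size>\d+b)_(?P<task>pretrain|lora)_"
    r"(?P<gpus>\d+)gpu_(?P<hardware>b200)_"
    r"(?P<precision>bf16|fp8cs|fp8mx|nvfp4)_config$"
)


class RecipeName(NamedTuple):
    """Fields encoded in a recipe function name (hypothetical helper type)."""
    model: str
    size: str
    task: str
    gpus: int
    hardware: str
    precision: str


def parse_recipe_name(name: str) -> RecipeName:
    """Split a recipe function name into its components, or raise ValueError."""
    match = RECIPE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized recipe name: {name!r}")
    fields = match.groupdict()
    return RecipeName(
        model=fields["model"],
        size=fields["size"],
        task=fields["task"],
        gpus=int(fields["gpus"]),
        hardware=fields["hardware"],
        precision=fields["precision"],
    )


print(parse_recipe_name("llama3_70b_lora_8gpu_b200_fp8mx_config"))
```

A parser like this can help when selecting a recipe from command-line arguments, for example by filtering the listed names on `task` and `precision`.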

API

bridge.perf_recipes.llama.b200.llama3.llama3_8b_pretrain_8gpu_b200_bf16_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 8B pretrain: 8× B200, BF16, CUDA graph local.
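Each recipe on this page takes no arguments and returns a `ConfigContainer`. A minimal sketch of looking one up by name and calling it follows. It assumes the module is importable with a `megatron.` prefix, as the return-type path suggests; `load_recipe` and `DEFAULT_MODULE` are hypothetical names introduced for illustration. The helper returns `None` when the package is not installed, so the snippet runs without Megatron Bridge.

```python
import importlib

# Assumption: the module lives under the "megatron." prefix, as the
# return-type path on this page suggests. Adjust for your installation.
DEFAULT_MODULE = "megatron.bridge.perf_recipes.llama.b200.llama3"


def load_recipe(name: str, module_path: str = DEFAULT_MODULE):
    """Build the config for recipe `name`, or return None if the module is unavailable."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    recipe_fn = getattr(module, name)  # raises AttributeError on a misspelled name
    return recipe_fn()  # the recipes documented here take no arguments


config = load_recipe("llama3_8b_pretrain_8gpu_b200_bf16_config")
print("recipe loaded" if config is not None else "Megatron Bridge not installed")
```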

bridge.perf_recipes.llama.b200.llama3.llama3_8b_pretrain_8gpu_b200_fp8cs_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 8B pretrain: 8× B200, FP8 current-scaling, CUDA graph local.

bridge.perf_recipes.llama.b200.llama3.llama3_8b_pretrain_8gpu_b200_fp8mx_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 8B pretrain: 8× B200, MXFP8, CUDA graph local.

bridge.perf_recipes.llama.b200.llama3.llama3_8b_pretrain_8gpu_b200_nvfp4_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 8B pretrain: 8× B200, NVFP4.

bridge.perf_recipes.llama.b200.llama3.llama3_70b_pretrain_64gpu_b200_bf16_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 70B pretrain: 64× B200, BF16, TP=2 PP=4 CP=2, CUDA graph local, GBS=256.
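The parallelism settings in this description determine how many data-parallel replicas the 64 GPUs form. Under the usual convention that the world size equals TP × PP × CP × DP, the arithmetic can be sketched as below. `data_parallel_size` is a hypothetical helper written for this example, not a library function, and the micro-batch size is not stated on this page, so the sketch stops at sequences per replica.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel size, assuming world_size = TP * PP * CP * DP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel


# Values taken from the recipe description: 64 GPUs, TP=2, PP=4, CP=2, GBS=256.
dp = data_parallel_size(64, tp=2, pp=4, cp=2)
sequences_per_replica = 256 // dp

print(dp)                     # 4 data-parallel replicas
print(sequences_per_replica)  # 64 sequences per replica per global step
```

With these values each replica spans 16 GPUs, and every global step feeds 64 sequences through each replica's pipeline.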

bridge.perf_recipes.llama.b200.llama3.llama3_70b_pretrain_64gpu_b200_fp8cs_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 70B pretrain: 64× B200, FP8 current-scaling, FSDP, GBS=256.

bridge.perf_recipes.llama.b200.llama3.llama3_70b_pretrain_64gpu_b200_fp8mx_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 70B pretrain: 64× B200, MXFP8, TP=2 PP=4, GBS=256.

bridge.perf_recipes.llama.b200.llama3.llama3_70b_pretrain_64gpu_b200_nvfp4_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 70B pretrain: 64× B200, NVFP4, TP=2 PP=4, GBS=256.

bridge.perf_recipes.llama.b200.llama3.llama3_70b_lora_8gpu_b200_bf16_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 70B LoRA: 8× B200, BF16, PP=2.

bridge.perf_recipes.llama.b200.llama3.llama3_70b_lora_8gpu_b200_fp8cs_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 70B LoRA: 8× B200, FP8 current-scaling, PP=2.

bridge.perf_recipes.llama.b200.llama3.llama3_70b_lora_8gpu_b200_fp8mx_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 70B LoRA: 8× B200, MXFP8, PP=2.

bridge.perf_recipes.llama.b200.llama3.llama3_8b_pretrain_64gpu_b200_bf16_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 8B pretrain: 64× B200, BF16, legacy-scaled GBS.

bridge.perf_recipes.llama.b200.llama3.llama3_8b_pretrain_64gpu_b200_fp8cs_config() → megatron.bridge.perf_recipes.llama.common.ConfigContainer

Llama3 8B pretrain: 64× B200, FP8 current-scaling, legacy-scaled GBS.