Training Examples#
Get started with Megatron Core training using these practical examples.
Simple Training Example#
The simplest way to get started is with the basic training loop using mock data:
# Distributed training on 2 GPUs with mock data
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
This example:
Runs on 2 GPUs
Uses generated mock data (no data preparation needed)
Demonstrates basic distributed training setup
Perfect for testing your installation
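Before launching, you can quickly confirm that PyTorch sees your GPUs and that the NCCL backend is available. This is a sanity check for the environment, not part of the example itself:
# Should report at least 2 GPUs and NCCL support
python -c "import torch; print(torch.cuda.device_count(), torch.distributed.is_nccl_available())"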
LLaMA-3 Training Examples#
LLaMA-3 8B with FP8#
Train the LLaMA-3 8B model with FP8 mixed precision on 8 GPUs:
./examples/llama/train_llama3_8b_fp8.sh
Configuration:
8 GPUs
FP8 mixed precision (requires Hopper/Ada/Blackwell GPUs)
Mock data for quick testing
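The script wraps a pretrain_gpt.py invocation, and the FP8 behavior comes from a few extra flags layered on top of BF16. The sketch below shows options commonly used for FP8 training; exact flag names and defaults vary between Megatron-LM versions, so treat these as illustrative and check the script or pretrain_gpt.py --help:
# FP8 options typically appended to a pretrain_gpt.py command line (illustrative)
--bf16 \
--fp8-format hybrid \
--fp8-amax-history-len 1024 \
--fp8-amax-compute-algo max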
Custom LLaMA Training#
For training with your own data:
torchrun --nproc_per_node=8 pretrain_gpt.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers 32 \
--hidden-size 4096 \
--num-attention-heads 32 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--micro-batch-size 4 \
--global-batch-size 32 \
--train-iters 100000 \
--lr 3.0e-4 \
--min-lr 3.0e-5 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay 0.1 \
--clip-grad 1.0 \
--bf16 \
--data-path /path/to/your/preprocessed_data \
--split 949,50,1 \
--save /path/to/checkpoints \
--load /path/to/checkpoints \
--log-interval 10 \
--save-interval 1000 \
--eval-interval 1000
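The --data-path above must point at data already converted to Megatron's indexed binary format (.bin/.idx files). The following is a sketch of that preprocessing step with placeholder paths and tokenizer settings; see the Data Preparation guide for the full workflow:
# Convert a JSON-lines corpus into Megatron's indexed binary format (placeholder paths)
python tools/preprocess_data.py \
--input /path/to/corpus.jsonl \
--output-prefix /path/to/your/preprocessed_data \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer \
--workers 8 \
--append-eod
At training time, --data-path should reference the same output prefix (without the .bin/.idx extension), and the training command must be configured with the matching tokenizer (for example via --tokenizer-type and --tokenizer-model).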
GPT-3 Training Example#
Train a GPT-3 style model:
torchrun --nproc_per_node=8 pretrain_gpt.py \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 2 \
--num-layers 24 \
--hidden-size 2048 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 2 \
--global-batch-size 16 \
--train-iters 100000 \
--lr 1.5e-4 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--lr-warmup-iters 1000 \
--weight-decay 0.1 \
--clip-grad 1.0 \
--fp16 \
--data-path /path/to/preprocessed_data \
--split 949,50,1 \
--save /path/to/checkpoints \
--load /path/to/checkpoints
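A note on how these numbers fit together: with tensor parallel size 2 and pipeline parallel size 2, the 8 GPUs form 8 / (2 × 2) = 2 data-parallel replicas. Each replica processes micro-batches of 2 samples, so reaching the global batch size of 16 takes 16 / (2 × 2) = 4 gradient-accumulation micro-batches per iteration.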
Key Training Arguments#
Model Architecture#
| Argument | Description |
|---|---|
| --num-layers | Number of transformer layers |
| --hidden-size | Hidden dimension size |
| --num-attention-heads | Number of attention heads |
| --seq-length | Sequence length for training |
Training Configuration#
| Argument | Description |
|---|---|
| --micro-batch-size | Batch size per data-parallel rank for a single forward/backward pass |
| --global-batch-size | Total batch size per optimizer step, across all data-parallel ranks |
| --train-iters | Number of training iterations |
Learning Rate#
| Argument | Description |
|---|---|
| --lr | Peak learning rate |
| --min-lr | Minimum learning rate |
| --lr-decay-style | LR schedule (cosine, linear, constant) |
| --lr-warmup-iters | Warmup iterations |
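As a worked example, the custom LLaMA command above (--lr 3.0e-4, --min-lr 3.0e-5, --lr-warmup-iters 2000, --lr-decay-style cosine) ramps the learning rate linearly from 0 to 3.0e-4 over the first 2,000 iterations, then decays it along a cosine curve toward 3.0e-5 over the remaining decay window (by default, the rest of --train-iters).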
Mixed Precision#
| Argument | Description |
|---|---|
| --fp16 | FP16 mixed precision |
| --bf16 | BF16 mixed precision (recommended) |
| --fp8-format | FP8 mixed precision (Hopper/Ada/Blackwell) |
Data and Checkpointing#
| Argument | Description |
|---|---|
| --data-path | Path to preprocessed data |
| --split | Train/validation/test split (e.g., 949,50,1) |
| --save | Checkpoint save directory |
| --load | Checkpoint load directory |
| --save-interval | Save checkpoint every N iterations |
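The examples above point --save and --load at the same directory: the first run finds no checkpoint there and starts from scratch, while later runs pick up the latest checkpoint and resume training where it left off.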
Next Steps#
Optimize Performance: See Advanced Features for FSDP, distributed optimizer, and other optimizations
Scale Up: Learn about Parallelism Strategies to train larger models across more GPUs
Prepare Data: Follow the Data Preparation guide to process your own datasets