Your First Training Run
This guide walks you through two training examples and then covers data preparation for your own datasets. You start with a minimal distributed loop to validate your environment, then run a full LLaMA-3 training job. Make sure you have completed installation before proceeding.
Minimal Training Example
Start with the simplest possible setup: a distributed training loop that uses mock data on two GPUs. A successful run confirms that your environment is configured correctly before you move on to real models.
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
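Before launching, it can help to confirm that enough GPUs are visible for the number of processes you request. The following is a minimal pre-flight sketch, not part of Megatron; the helper names `gpus_visible` and `can_launch` are hypothetical, and it assumes PyTorch is the installed framework.

```python
import importlib.util


def gpus_visible() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if PyTorch is missing)."""
    if importlib.util.find_spec("torch") is None:
        return 0
    import torch  # imported lazily so the helper still loads without PyTorch

    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def can_launch(nproc_per_node: int, visible: int) -> bool:
    """True when there is one visible GPU for every process torchrun will start."""
    return 0 < nproc_per_node <= visible


if __name__ == "__main__":
    n = gpus_visible()
    print(f"visible GPUs: {n}; enough for --nproc_per_node=2: {can_launch(2, n)}")
```

If the check reports too few GPUs, inspect `CUDA_VISIBLE_DEVICES` and your driver installation before retrying the `torchrun` command.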
LLaMA-3 Training Example
With the environment validated, run a production-scale example. The following script trains a LLaMA-3 8B model with FP8 mixed precision on eight GPUs using mock data, demonstrating tensor parallelism and optimized kernels. As the script name indicates, it targets H100 GPUs, so the FP8 path requires hardware with FP8 support.
./examples/llama/train_llama3_8b_h100_fp8.sh
Data Preparation
To train on your own data, first convert it into the preprocessed binary format that Megatron expects: a .bin file holding the tokenized data and a matching .idx index file.
1. Prepare a JSONL File
Each line must be a single, complete JSON object with a text field:
{"text": "Your training text here..."}
{"text": "Another training sample..."}
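If your documents live in memory or in another format, a short script can produce and check such a file. This is a minimal sketch using only the Python standard library; `write_jsonl` and `validate_jsonl` are hypothetical helper names, not Megatron utilities.

```python
import json
from pathlib import Path


def write_jsonl(texts, path):
    """Write one {"text": ...} object per line, skipping empty documents."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for text in texts:
            if not text.strip():
                continue
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


def validate_jsonl(path, key="text"):
    """Return the 1-based line numbers that are not JSON objects with a string `key`."""
    bad = []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if not isinstance(record, dict) or not isinstance(record.get(key), str):
                bad.append(lineno)
    return bad
```

For example, `write_jsonl(["first doc", "second doc"], "data.jsonl")` writes two records, and `validate_jsonl("data.jsonl")` returns an empty list when every line is well formed.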
2. Preprocess the Data
Run the preprocessing script to tokenize and convert your data into binary format:
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix processed_data \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --workers 8 \
    --append-eod
Key Arguments
--input: Path to input JSON/JSONL file
--output-prefix: Prefix for output binary files (.bin and .idx)
--tokenizer-type: Tokenizer type (HuggingFaceTokenizer, GPT2BPETokenizer, and so on)
--tokenizer-model: Path to tokenizer model file
--workers: Number of parallel workers for processing
--append-eod: Add end-of-document token
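After preprocessing finishes, you may want to confirm that both output files exist and are non-empty before pointing a training job at them. The sketch below assumes the script names its outputs `<output-prefix>_<json-key>_document.bin` and `.idx` (for the command above, `processed_data_text_document`); that naming is an assumption to verify against your Megatron version, and `expected_outputs` and `missing_outputs` are hypothetical helpers.

```python
from pathlib import Path


def expected_outputs(output_prefix, json_key="text", level="document"):
    """Build the .bin/.idx paths under the assumed <prefix>_<key>_<level> naming."""
    stem = f"{output_prefix}_{json_key}_{level}"
    return Path(stem + ".bin"), Path(stem + ".idx")


def missing_outputs(output_prefix, **kwargs):
    """Return the expected files that are absent or empty."""
    return [
        p
        for p in expected_outputs(output_prefix, **kwargs)
        if not p.is_file() or p.stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_outputs("processed_data")
    print("all outputs present" if not missing else f"missing or empty: {missing}")
```

If the files are reported missing, list the output directory to see the names the script actually produced and adjust the helper accordingly.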
Next Steps
Explore Parallelism Strategies to scale your training
Learn about Data Preparation best practices
Explore Advanced Features for FP8 training, context parallelism, and more