Data Preparation
Preparing your data correctly is essential for successful training with Megatron Core.
Data Format
Megatron Core expects training data in JSONL (JSON Lines) format, where each line is a JSON object:
{"text": "Your training text here..."}
{"text": "Another training sample..."}
{"text": "More training data..."}
Preprocessing Data
Use the preprocess_data.py tool to convert your JSONL data into Megatron’s binary format:
python tools/preprocess_data.py \
--input data.jsonl \
--output-prefix processed_data \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model \
--workers 8 \
--append-eod
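By default the tool reads the "text" field shown above. If your samples store text under a different key, recent Megatron-LM checkouts expose a --json-keys argument for selecting one or more fields; verify that it exists in your checkout's tools/preprocess_data.py before relying on it:

--json-keys content  # e.g., for samples shaped like {"content": "..."}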
Key Arguments
| Argument | Description |
|---|---|
| --input | Path to input JSON/JSONL file |
| --output-prefix | Prefix for output binary files (.bin and .idx) |
| --tokenizer-type | Tokenizer type (e.g., HuggingFaceTokenizer, GPT2BPETokenizer) |
| --tokenizer-model | Path to tokenizer model file |
| --workers | Number of parallel workers for processing |
| --append-eod | Add end-of-document token |
Output Files
The preprocessing tool generates two files:
- processed_data.bin - Binary file containing tokenized sequences
- processed_data.idx - Index file for fast random access
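A quick sanity check is to confirm that both files exist and are non-empty. A minimal sketch using only the Python standard library, assuming the processed_data prefix above:

import os

prefix = "processed_data"  # matches --output-prefix
for ext in (".bin", ".idx"):
    path = prefix + ext
    # os.path.getsize raises FileNotFoundError if preprocessing did not produce the file
    print(f"{path}: {os.path.getsize(path):,} bytes")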
Using Preprocessed Data
Reference your preprocessed data in training scripts:
--data-path processed_data \
--split 949,50,1 # Train/validation/test split

Note that --data-path takes the output prefix, not a file path with an extension. The --split weights are relative: 949,50,1 sums to 1000, so 94.9% of the data is used for training, 5% for validation, and 0.1% for test.
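Some Megatron versions also accept a weighted blend of several preprocessed prefixes in --data-path; the weights and prefix names below are illustrative, so confirm this form against your version's argument parsing:

--data-path 0.7 corpus_a 0.3 corpus_b \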
Common Tokenizers
HuggingFace Tokenizers
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model
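In many Megatron Core versions, HuggingFaceTokenizer forwards the --tokenizer-model value to transformers' AutoTokenizer.from_pretrained, in which case a local directory or a Hugging Face Hub model ID also works; the Hub ID below is purely illustrative:

--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model gpt2  # example Hub ID; confirm support in your version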
GPT-2 BPE Tokenizer
--tokenizer-type GPT2BPETokenizer \
--vocab-file gpt2-vocab.json \
--merge-file gpt2-merges.txt
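If you do not already have these files, one way to generate them is to export the pretrained GPT-2 tokenizer from the transformers package; a sketch, assuming transformers is installed:

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
# Writes gpt2-vocab.json and gpt2-merges.txt into the current directory
vocab_file, merge_file = tok.save_vocabulary(".", filename_prefix="gpt2")
print(vocab_file, merge_file)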