Data Preparation#
Megatron Bridge uses different dataset config objects for pretraining, text fine-tuning, and multimodal fine-tuning. Choose the data path by workflow first, then keep the dataset sequence length aligned with model.seq_length.
Data Formats by Workflow#
Workflow |
Data format |
Config or provider |
Required path fields |
|---|---|---|---|
LLM pretraining |
Megatron binary |
|
|
LLM SFT or PEFT from local files |
JSONL split files |
|
|
LLM SFT or PEFT from Hugging Face datasets |
Hugging Face dataset processed to JSONL |
|
|
VLM SFT or PEFT |
Energon/WebDataset, Hugging Face VLM dataset, or preloaded JSON |
VLM |
Provider-specific fields such as |
Use seq_length in Bridge examples and CLI overrides. GPTDatasetConfig also stores this value as Megatron Core’s inherited sequence_length field internally, but FinetuningDatasetConfig uses seq_length.
LLM Pretraining Data#
LLM pretraining uses Megatron binary indexed datasets. Each dataset is represented by a prefix with matching .bin and .idx files:
/data/dclm/preprocessed_text_document.bin
/data/dclm/preprocessed_text_document.idx
Pass the prefix without the .bin or .idx suffix:
from megatron.bridge.training.config import GPTDatasetConfig
dataset = GPTDatasetConfig(
seq_length=8192,
data_path="/data/dclm/preprocessed_text_document",
split="9999,8,2",
random_seed=1234,
reset_attention_mask=False,
reset_position_ids=False,
eod_mask_loss=False,
)
The CLI-friendly data_path field is converted to Megatron Core’s blend field during config finalization. For weighted multi-dataset training, use either a flattened data_path list with weights and prefixes or set blend/blend_per_split directly.
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe llama32_1b_pretrain_config \
--dataset llm-pretrain \
dataset.data_path=/data/dclm/preprocessed_text_document \
dataset.seq_length=8192
To create Megatron binary data from JSONL text, use the Megatron-LM tools/preprocess_data.py workflow. The DCLM tutorial shows a complete download, merge, shuffle, and preprocessing flow: DCLM Data Preprocessing Tutorial.
Local JSONL SFT and PEFT Data#
Text SFT and PEFT use a directory containing split files named training.jsonl, validation.jsonl, and optionally test.jsonl:
/data/sft_jsonl/
training.jsonl
validation.jsonl
test.jsonl
The default text SFT dataset expects each JSONL record to contain prompt and answer fields compatible with the configured prompt_template. The common input/output format is:
{"input": "Question: What is Megatron Bridge?", "output": "A PyTorch-native bridge for Megatron-Core workflows."}
Configure local JSONL data with FinetuningDatasetConfig.dataset_root:
from megatron.bridge.training.config import FinetuningDatasetConfig
dataset = FinetuningDatasetConfig(
dataset_root="/data/sft_jsonl",
seq_length=4096,
)
Launch the generic recipe runner with the preloaded local JSONL dataset type:
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe llama32_1b_sft_config \
--dataset llm-finetune-preloaded \
dataset.dataset_root=/data/sft_jsonl \
dataset.seq_length=4096 \
checkpoint.pretrained_checkpoint=/checkpoints/base_model
For PEFT, use the PEFT recipe or set cfg.peft; the data layout stays the same. checkpoint.pretrained_checkpoint is required for the frozen base model, and checkpoint.load is used only when resuming adapter checkpoints.
Hugging Face Datasets for SFT and PEFT#
HFDatasetConfig downloads or reads a Hugging Face dataset, applies a processing function to each example, writes Bridge-compatible JSONL split files, and then builds the same fine-tuning dataset used by local JSONL.
from megatron.bridge.data.builders.hf_dataset import HFDatasetConfig
from megatron.bridge.data.hf_processors.squad import process_squad_example
dataset = HFDatasetConfig(
dataset_name="rajpurkar/squad",
process_example_fn=process_squad_example,
dataset_root="/data/processed/squad",
seq_length=512,
val_proportion=0.1,
do_test=False,
)
If dataset_root is omitted, the processed JSONL is cached under the NeMo datasets cache for the dataset name. Set rewrite=False when you want later runs to reuse already processed files.
The generic launcher provides preset Hugging Face text datasets through --dataset llm-finetune:
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe llama32_1b_peft_config \
--dataset llm-finetune \
dataset.dataset_name=gsm8k \
checkpoint.pretrained_checkpoint=/checkpoints/base_model
VLM Fine-Tuning Data#
VLM recipes usually use a dataset provider instead of FinetuningDatasetConfig. The provider owns both the storage format and the processor needed to turn image, video, audio, and text records into batches.
For Energon/WebDataset data, create tar shards plus .nv-meta metadata and pass the dataset root to the recipe provider:
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe qwen3_vl_8b_peft_energon_config \
--dataset vlm-energon \
--step_func qwen3_vl_step \
dataset.path=/data/vlm_energon \
checkpoint.pretrained_checkpoint=/checkpoints/qwen3_vl_base
For preloaded VLM JSON or JSONL, use records with messages or conversations plus media paths. Relative image and video paths are resolved against dataset.image_folder by PreloadedVLMConversationProvider:
{"messages": [{"role": "user", "content": "<image>Describe the image."}, {"role": "assistant", "content": "A receipt."}], "images": ["receipt_0001.jpg"]}
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe qwen3_vl_8b_peft_config \
--dataset vlm-preloaded \
--step_func qwen3_vl_step \
dataset.train_data_path=/data/vlm/train.jsonl \
dataset.valid_data_path=/data/vlm/validation.jsonl \
dataset.image_folder=/data/vlm/images \
dataset.hf_processor_path=Qwen/Qwen3-VL-8B-Instruct \
checkpoint.pretrained_checkpoint=/checkpoints/qwen3_vl_base
For a complete WebDataset/Energon preparation example, see VALOR32K-AVQA Dataset Preparation Guide.
Checkpoint Conversion Reminder#
Data preparation and checkpoint preparation are separate. From-scratch pretraining does not require a checkpoint. SFT and PEFT require base model weights through checkpoint.pretrained_checkpoint unless you are resuming from a complete native Megatron checkpoint with checkpoint.load.
checkpoint.pretrained_checkpoint may point to a native Megatron checkpoint directory, a specific native iter_N directory, or a local Hugging Face full-model directory. For production and multi-node jobs, converting Hugging Face checkpoints to native Megatron format first is usually more repeatable.