convert/megatron_to_hf#
This step converts a Megatron distributed checkpoint into Hugging Face safetensors layout.
Use it when a downstream Hugging Face-native consumer expects checkpoint_hf but the upstream artifact is checkpoint_megatron.
Syntax#
nemotron steps run convert/megatron_to_hf \
[-c <config-name-or-path>] \
[-r <run-profile> | -b <batch-profile>] \
[-d] \
[<dotlist-overrides>...]
Refer to the Nemotron Steps CLI Reference for the shared flag set.
Configuration Files#
The step ships config/default.yaml under src/nemotron/steps/convert/megatron_to_hf/config/.
The default hf_model_id is nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16, and megatron_path must be supplied for real runs.
Inputs and Outputs#
Direction |
Artifact Type |
Required |
Description |
|---|---|---|---|
Consumes |
|
Yes |
A specific Megatron checkpoint directory, normally an |
Produces |
|
- |
A Hugging Face safetensors checkpoint directory. |
Parameters#
- megatron_path=<path>#
Specific Megatron checkpoint directory to export. Point this at the concrete checkpoint iteration, not the parent training run folder.
Example:
megatron_path=/lustre/runs/sft/checkpoints/iter_0000100
- hf_model_id=<id-or-path>#
Hugging Face model id or config source used by AutoBridge to reconstruct the Hugging Face architecture and tokenizer expectations.
Example:
hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
- hf_path=<path>#
Output directory for the exported Hugging Face safetensors checkpoint.
Example:
hf_path=/lustre/output/convert/sft-hf
- trust_remote_code=<true-or-false>#
Whether to trust custom Hugging Face model code while reconstructing the export target.
Default:
true.
- show_progress=<true-or-false>#
Whether to show Megatron-Bridge export progress output.
Default:
true.
- strict=<true-or-false>#
Whether Megatron-Bridge should require source and target checkpoint keys to match strictly.
Default:
true.
- distributed=<true-or-false-or-auto>#
Use the mounted multi-GPU converter instead of the single-process AutoBridge helper. Keep this enabled for large checkpoints that cannot be loaded on one GPU.
Default:
true.
- tp=<int> pp=<int> ep=<int> etp=<int>#
Tensor, pipeline, expert, and expert-tensor parallel sizes used by the source Megatron checkpoint. These values must match the checkpoint layout.
Defaults:
tp=1,pp=1,ep=8,etp=1.
- distributed_save=<true-or-false>#
Let ranks write assigned Hugging Face shards independently, reducing rank-0 memory pressure during export.
Default:
true.
- torchrun.nproc_per_node=<int>#
Number of local conversion ranks when the step has to launch
torchrunitself. When a backend already launches the step withtorchrun, the existing distributed world is reused.Default:
NEMOTRON_CONVERT_NPROC_PER_NODEor8.
Command Examples#
Export a validated Megatron checkpoint iteration to Hugging Face layout:
$ nemotron steps run convert/megatron_to_hf -c default \
megatron_path=/path/to/megatron/iter_0000100 \
hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
hf_path=./output/convert/sft-hf \
tp=1 pp=1 ep=8
Submit the export through a generated Lepton profile:
$ nemotron steps run convert/megatron_to_hf -c default --batch lepton_convert_model \
megatron_path=/mnt/lustre-shared/output/sft/megatron_bridge/iter_0000100 \
hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
hf_path=/mnt/lustre-shared/output/convert/sft-hf
Recovery Notes#
If export fails because the checkpoint is incomplete, wait for async checkpoint save to finish and retry from a complete
iter_*directory.If tokenizer or config reconstruction fails, set
hf_model_idto the original base model or config source.If
distributed=truelaunches multiple ranks withtp=pp=ep=etp=1, the step fails early because that would not shard the model. Set the real source checkpoint parallelism, such astp=1 pp=1 ep=8 etp=1for Nemotron MoE ortp=8 pp=1 ep=1 etp=1for a dense checkpoint.Validate the exported Hugging Face checkpoint with a small generation or evaluation job before deployment.