Convert Checkpoints Between Training Steps#

Use a conversion step when one step produces a checkpoint layout that the next step cannot consume directly. The converter is an explicit pipeline step, not an implicit side effect of training.

Choose the Converter#

Source artifact	Target artifact	Step
`checkpoint_hf`	`checkpoint_megatron`	`convert/hf_to_megatron`
`checkpoint_megatron`	`checkpoint_hf`	`convert/megatron_to_hf`
`checkpoint_lora` plus original base checkpoint	merged `checkpoint_hf`	`convert/merge_lora`

Common cases:

AutoModel SFT or PEFT produces Hugging Face layout checkpoints. Use convert/hf_to_megatron before Megatron-Bridge consumers that require Megatron layout.
Megatron-Bridge SFT, RL, and some optimization steps produce Megatron distributed checkpoints. Use convert/megatron_to_hf before Hugging Face-native evaluation, deployment, or pruning flows.
PEFT steps produce adapter checkpoints. Use convert/merge_lora when deployment or evaluation needs a single merged Hugging Face checkpoint.

Preflight Checks#

Before conversion:

Pick one validated checkpoint iteration. For Megatron exports, point megatron_path at the concrete iter_* checkpoint directory, not the parent run directory.
Keep output paths separate from input paths. A failed conversion should never overwrite the source checkpoint.
Keep tokenizer and chat-template provenance with the checkpoint. If the converter needs hf_model_id, use the original model or config source used by training.
For LoRA merge, use the exact base checkpoint the adapter was trained against.
For large Megatron checkpoints, use the default distributed conversion path. The default config runs nvcr.io/nvidia/nemo:26.04, which ships the multi-GPU Megatron-Bridge conversion script.
Keep tp, pp, ep, and etp aligned with the model or checkpoint layout. The default distributed conversion path uses tp=1 pp=1 ep=8 etp=1 for Nemotron MoE checkpoints; dense models usually need an override such as tp=8 pp=1 ep=1 etp=1.

Convert Hugging Face to Megatron#

Use this path when a Megatron-Bridge consumer needs a Megatron distributed checkpoint.

$ nemotron steps run convert/hf_to_megatron -c default \
    hf_model_id=/path/to/hf_checkpoint_or_model_id \
    megatron_path=/path/to/output_megatron_checkpoint \
    tp=1 pp=1 ep=8

For NVIDIA Nemotron checkpoints, keep dtype=bfloat16 unless the source checkpoint requires another dtype. The step fails early if multiple ranks are launched but all model-parallel values are left at 1, because that would not reduce per-GPU model memory.

Convert Megatron to Hugging Face#

Use this path when the next consumer is Hugging Face-native evaluation, deployment, pruning, or a tool that expects safetensors.

$ nemotron steps run convert/megatron_to_hf -c default \
    megatron_path=/path/to/megatron/iter_0000100 \
    hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
    hf_path=/path/to/output_hf_checkpoint \
    tp=1 pp=1 ep=8

The hf_model_id value supplies the model configuration and tokenizer expectations used to reconstruct the Hugging Face layout. Keep tp, pp, ep, and etp aligned with the source Megatron checkpoint for export.

Merge LoRA Into a Hugging Face Base#

Use this path for adapters produced by Hugging Face-native PEFT flows.

$ nemotron steps run convert/merge_lora -c default \
    backend=hf_peft \
    lora_checkpoint=/path/to/adapter_checkpoint \
    base_hf_path=/path/to/original_hf_base \
    output_hf_path=/path/to/merged_hf_checkpoint

Do not merge into a different base model, even if the architecture name matches.

Merge Megatron-Bridge LoRA#

Use this path for adapters produced by Megatron-Bridge PEFT flows. The step can write a merged Megatron checkpoint and export a Hugging Face checkpoint when export_hf=true.

$ nemotron steps run convert/merge_lora -c default \
    backend=megatron_bridge \
    lora_checkpoint=/path/to/lora_megatron_checkpoint \
    base_megatron_path=/path/to/original_dense_megatron_checkpoint \
    hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
    output_megatron_path=/path/to/merged_megatron_checkpoint \
    output_hf_path=/path/to/merged_hf_checkpoint

Use tp, pp, and ep overrides when the merge must match a specific tensor, pipeline, or expert parallel layout.

Run on a Cluster Profile#

Generated environment files include one shared conversion profile per executor family. Use the profile that matches your site:

$ nemotron steps run convert/megatron_to_hf -c default --batch lepton_convert_model \
    megatron_path=/mnt/lustre-shared/output/sft/megatron_bridge/iter_0000100 \
    hf_model_id=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
    hf_path=/mnt/lustre-shared/output/convert/sft-hf

Equivalent profile names are slurm_convert_model for Slurm and dgxcloud_convert_model for DGX Cloud.

Validate the Output#

After conversion:

Confirm the output directory exists and contains model weights plus tokenizer/config files for Hugging Face outputs.
Run a small generation or evaluation smoke test before using the checkpoint for a larger training or evaluation job.
Preserve the source checkpoint until the converted checkpoint has passed validation.