Diffusion Language Model (dLLM) Fine-Tuning and Generation with NeMo AutoModel#
Introduction#
Diffusion language models (dLLMs) generate text by iteratively denoising masked tokens, rather than generating one token at a time left-to-right like autoregressive (AR) models. Starting from a sequence of [MASK] tokens, the model progressively unmasks the most confident positions over multiple denoising steps until the full response is revealed.
This approach enables parallel token generation and bidirectional attention, which gives the model more context for each prediction compared to AR models.
NeMo AutoModel currently supports the following dLLM model families:
LLaDA / LLaDA2 (MDLM) β Bidirectional masked diffusion. The model receives corrupted tokens and predicts the clean token at each masked position (see LLaDA2 paper).
Nemotron-Labs-Diffusion (Hybrid) β Combines diffusion with an autoregressive loss. During training, the model processes clean tokens plus a
masked_indicessidecar and learns both a diffusion objective and an AR objective simultaneously.DFlash β Speculative block diffusion. A small draft model proposes tokens for a block conditioned on frozen target LM hidden states; a decay-weighted loss trains it to predict the targetβs distribution (see DFlash paper).
Workflow Overview#
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β 1. Install β--->β 2. Configure β--->β 3. Train β--->β 4. Generate β
β β β YAML β β β β β
β pip install β β Recipe + β β torchrun β β Run dLLM β
β or Docker β β dLLM config β β β β inference β
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
Step |
Section |
What You Do |
|---|---|---|
1. Install |
Install the package using pip or Docker |
|
2. Configure |
Write a YAML config specifying model, data, dLLM mode, and training settings |
|
3. Train |
Launch training with |
|
4. Generate |
Generate text from a fine-tuned checkpoint |
Supported Models#
Model Family |
dLLM Mode |
Loss |
Inference |
Example Config |
|---|---|---|---|---|
LLaDA / LLaDA2 |
|
MDLM cross-entropy |
Block-by-block, full-forward (no KV cache) |
|
Nemotron-Labs-Diffusion |
|
Diffusion + AR (alpha-weighted) |
Block diffusion with KV cache |
|
DFlash |
|
Decay-weighted cross-entropy (Eq. 4) |
Speculative block decoding (draft + target) |
Install NeMo AutoModel#
pip3 install nemo-automodel
Alternatively, use the pre-built Docker container:
docker pull nvcr.io/nvidia/nemo-automodel:26.02.00
docker run --gpus all -it --rm --shm-size=8g nvcr.io/nvidia/nemo-automodel:26.02.00
For the full set of installation methods, see the installation guide.
Configure Your Training Recipe#
dLLM fine-tuning is driven by:
A recipe script (
train_ft.py) β orchestrates the training loop with dLLM-specific corruption, loss, and batch handling.A YAML configuration file β specifies the model, data, optimizer, dLLM-specific settings, and distributed training strategy.
The recipe uses a strategy pattern to handle differences between model families. The dllm.mode field in the YAML selects the strategy:
Mode |
Strategy |
Description |
|---|---|---|
|
|
LLaDA-style: model receives corrupted tokens, MDLM cross-entropy loss |
|
|
Nemotron-Labs-Diffusion-style: model receives clean tokens + |
|
|
DFlash: frozen target LM provides hidden states; draft model trained with decay-weighted loss |
LLaDA Configuration#
See llada_sft.yaml for the full working config. The key dLLM-specific sections are:
model:
pretrained_model_name_or_path: GSAI-ML/LLaDA-8B-Base
torch_dtype: float32
trust_remote_code: true
dllm:
mode: mdlm
mask_token_id: 126336 # LLaDA mask token
eps: 0.001 # Minimum corruption ratio
dataset:
unshifted: true # Required for dLLM training
Nemotron-Labs-Diffusion Configuration#
See nemotron_labs_diffusion_sft.yaml for the full working config. The key dLLM-specific sections are:
model:
pretrained_model_name_or_path: nvidia/Nemotron-Labs-Diffusion-8B-Base
torch_dtype: float32 # Master-weight dtype. Use `float32` for an fp32 master copy or `bfloat16` for bf16.
trust_remote_code: true
dlm_paradigm: block_diff # required for SFT: HF default "bidirectional" is the inference mode
block_size: 32
dllm:
mode: hybrid
mask_token_id: 100 # Nemotron-Labs-Diffusion mask token
eps: 0.001
ar_loss_alpha: 0.3 # weight on the diffusion branch (AR branch is unweighted)
pad_seq_len_divisible: 1024
dataset:
unshifted: true
Key dLLM Config Fields#
Field |
Description |
|---|---|
|
Training strategy ( |
|
Token ID used for masking ( |
|
Minimum corruption ratio to avoid zero-corruption samples |
|
When set, use block-wise corruption (otherwise uniform). Hybrid mode only. |
|
Half-life ratio for block-wise corruption (defaults to 0.25 when unset). Hybrid mode only. |
|
Weight applied to the diffusion branch in the hybrid loss. Hybrid mode only. |
|
Must be |
DFlash Configuration#
DFlash trains a small draft model to predict tokens conditioned on a frozen causal target LM. Only the draft modelβs weights are updated; the target LM is loaded once and kept frozen.
See dflash_sft.yaml for the full working config. The key DFlash-specific sections are:
model: # Draft model
_target_: transformers.AutoModel.from_pretrained
pretrained_model_name_or_path: z-lab/Qwen3-4B-DFlash-b16
trust_remote_code: true
torch_dtype: bfloat16
dllm:
mode: dflash
mask_token_id: null # Resolved automatically from target tokenizer
eps: 0.001
dflash:
target_model_id: Qwen/Qwen3-4B # Frozen causal LM
target_torch_dtype: bfloat16
block_size: 0 # 0 reads from draft model config
loss_decay_gamma: 0.0 # 0 uses paper defaults (Ξ³=7 for block_size=16)
num_blocks_per_sample: 512 # Paper default (Appendix A.1)
attention_backend: flex_attention # required for N > ~64; sdpa OOMs
overlap_anchors: true # paper samples anchors independently
Field |
Description |
|---|---|
|
Hub ID of the frozen causal LM that conditions the draft |
|
Tokens per draft block; |
|
Decay Ξ³ for Eq. 4; |
|
Number of anchor blocks processed per sequence per step (paper default: 512, Appendix A.1) |
|
|
|
|
DFlash Training Metrics#
In addition to the shared metrics (loss, grad_norm, lr, mem, tps,
mfu), DFlash runs log a draft top-1 accuracy proxy for acceptance length:
Metric |
Meaning |
Where |
|---|---|---|
|
Overall fraction of valid block positions where |
Console line + wandb / mlflow / comet + file logger |
|
Same fraction restricted to block offset |
wandb / mlflow / comet + file logger (one panel per offset); intentionally omitted from the console line to keep it readable |
Both are computed for free inside the chunked linear-CE path (same logits used
for the loss) and DP/CP-reduced via per-rank raw (correct, count) sums that
are SUM-allreduced and then divided post-reduction, so the values are correct
across arbitrary per-rank token distributions under any of AutoModelβs
distributed modes.
Prepare DFlash Training Data#
The paper trains on responses regenerated by the target model (Β§5.1): βInstead of directly using the original dataset, we construct our training set with the responses generated by the target model for better target alignment.β Skipping this step trains the draft on a different output distribution than the target produces at inference, which directly reduces acceptance length.
The existing nemo_automodel.components.speculative.regenerate script handles
this. Stand up an SGLang server hosting the target, then re-roll the assistant
turns:
# 1. Serve the target model on the local node (default port 30000)
python -m sglang.launch_server \
--model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--served-model-name nemotron-30b \
--trust-remote-code
# 2. Regenerate the dataset's assistant turns through the target (separate shell)
python -m nemo_automodel.components.speculative.regenerate \
--input-data nvidia/Nemotron-Post-Training-Dataset-v2 \
--output-dir /data/dflash-train-regen \
--model nemotron-30b \
--temperature 0.8 \
--shard-size 1000 \
--concurrency 64 \
--resume
--temperature 0.8 (vs the scriptβs EAGLE-oriented 0.0 default) follows the
DFlash paper: sampling diversity in the supervised tokens teaches the draft to
handle a wider target distribution, improving acceptance length.
--concurrency 64 better saturates one vLLM/SGLang server.
Then point the recipeβs dataset.path_or_dataset_id at the regenerated
parquet shards (/data/dflash-train-regen) instead of the raw HF dataset.
Fine-Tune the Model#
Fine-Tune LLaDA2#
torchrun --nproc-per-node=8 \
examples/dllm_sft/finetune.py \
-c examples/dllm_sft/llada2_sft.yaml
Fine-Tune with DFlash#
torchrun --nproc-per-node=8 \
examples/dllm_sft/finetune.py \
-c examples/dllm_sft/dflash_sft.yaml
Fine-Tune Nemotron-Labs-Diffusion#
torchrun --nproc-per-node=8 \
nemo_automodel/recipes/dllm/train_ft.py \
-c examples/dllm_sft/nemotron_labs_diffusion_sft.yaml
Run Inference#
The generation script (generate.py) supports chat, raw, and infilling modes. Pick the sampler that matches the trained family with --sampler {llada,nemotron}.
--checkpoint accepts any of: a path to a consolidated/ directory, a step directory (.../epoch_0_step_499), or the top-level checkpoint dir (the script will follow LATEST/model/consolidated/).
Generate with LLaDA#
python examples/dllm_generate/generate.py \
--checkpoint <path> \
--prompt "Explain what a neural network is." \
--sampler llada
Generate with Nemotron-Labs-Diffusion#
python examples/dllm_generate/generate.py \
--checkpoint <path> \
--prompt "What is 2+2?" \
--sampler nemotron
The nemotron path internally invokes the modelβs built-in block-diffusion model.generate(...) (with the AR-seed mechanism), while the llada path uses the standalone DLLMSampler.sample(...).