bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm

Module Contents

Functions

nemotron_labs_diffusion_pretrain_config

Return a pre-training config for NemotronLabsDiffusion.

nemotron_labs_diffusion_3b_pretrain_config

Return a pre-training config for NemotronLabsDiffusion 3B (TP=1, MBS=1, GBS=512, 12.5k iters, WSD LR).

nemotron_labs_diffusion_8b_pretrain_config

Return a pre-training config for NemotronLabsDiffusion 8B (TP=4, MBS=1, GBS=512, 12.5k iters, WSD LR).

nemotron_labs_diffusion_14b_pretrain_config

Return a pre-training config for NemotronLabsDiffusion 14B (TP=8, MBS=1, GBS=512, 12.5k iters, WSD LR).

_nemotron_labs_diffusion_common

Create a pre-training configuration for NemotronLabsDiffusion models using a given model provider.

distributed_fused_adam_with_cosine_annealing_dllm

Creates a distributed fused Adam optimizer with a cosine annealing scheduler, using the default parameters from Megatron-Bridge.

API

bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm.nemotron_labs_diffusion_pretrain_config(
**user_kwargs,
) → megatron.bridge.training.config.ConfigContainer

Return a pre-training config for NemotronLabsDiffusion.

See _nemotron_labs_diffusion_common for the full list of parameters.
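A minimal usage sketch follows. It assumes the recipe module is importable either under the path shown in the page title or with a `megatron.` prefix (the prefix is a guess based on the type annotations on this page), and that the keyword overrides are forwarded to `_nemotron_labs_diffusion_common`. The helper names `load_recipe` and `build_demo_config` are hypothetical.

```python
import importlib

# Assumption: the page title may omit the top-level package, so try both forms.
MODULE_CANDIDATES = (
    "megatron.bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm",
    "bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm",
)

# Overrides taken from the documented parameter list; the values are illustrative.
OVERRIDES = {"name": "demo_run", "mock": True, "train_iters": 100, "seq_length": 4096}


def load_recipe(name):
    """Return the recipe function called `name`, or None if it cannot be found."""
    for module_path in MODULE_CANDIDATES:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # this candidate is not installed, try the next one
        return getattr(module, name, None)
    return None


def build_demo_config():
    """Build a config with mock data, or return None when the recipe is unavailable."""
    recipe = load_recipe("nemotron_labs_diffusion_pretrain_config")
    if recipe is None:
        return None
    return recipe(**OVERRIDES)


if __name__ == "__main__":
    cfg = build_demo_config()
    print("config built" if cfg is not None else "recipe module not importable here")
```

Setting `mock=True` keeps the sketch independent of any dataset paths; whether the recipe additionally needs `hf_path` or `model_provider` is not stated on this page.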

bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm.nemotron_labs_diffusion_3b_pretrain_config(
**user_kwargs,
) → megatron.bridge.training.config.ConfigContainer

Return a pre-training config for NemotronLabsDiffusion 3B (TP=1, MBS=1, GBS=512, 12.5k iters, WSD LR).

bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm.nemotron_labs_diffusion_8b_pretrain_config(
**user_kwargs,
) → megatron.bridge.training.config.ConfigContainer

Return a pre-training config for NemotronLabsDiffusion 8B (TP=4, MBS=1, GBS=512, 12.5k iters, WSD LR).

bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm.nemotron_labs_diffusion_14b_pretrain_config(
**user_kwargs,
) → megatron.bridge.training.config.ConfigContainer

Return a pre-training config for NemotronLabsDiffusion 14B (TP=8, MBS=1, GBS=512, 12.5k iters, WSD LR).
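The three size presets fix TP, MBS, and GBS, so the gradient accumulation count depends only on the number of GPUs. The sketch below works this out under the usual Megatron-style relationship (data-parallel size = world size / (TP × PP × CP)). It assumes PP = CP = 1 for every preset, which this page does not state, and uses a hypothetical 64-GPU job; `grad_accum_steps` is an illustrative helper, not part of the module.

```python
# Preset values copied from the docstrings above: tensor parallelism,
# micro batch size, global batch size, training iterations.
PRESETS = {
    "3b": {"tp": 1, "mbs": 1, "gbs": 512, "iters": 12_500},
    "8b": {"tp": 4, "mbs": 1, "gbs": 512, "iters": 12_500},
    "14b": {"tp": 8, "mbs": 1, "gbs": 512, "iters": 12_500},
}


def grad_accum_steps(world_size, tp, mbs, gbs, pp=1, cp=1):
    """Micro-batches each data-parallel rank processes per optimizer step."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    dp = world_size // model_parallel  # data-parallel replicas
    if gbs % (mbs * dp):
        raise ValueError("gbs must be divisible by mbs * data-parallel size")
    return gbs // (mbs * dp)


WORLD_SIZE = 64  # hypothetical cluster size, not taken from the recipe

for size, p in PRESETS.items():
    steps = grad_accum_steps(WORLD_SIZE, p["tp"], p["mbs"], p["gbs"])
    sequences = p["gbs"] * p["iters"]  # sequences consumed over the whole run
    print(f"{size}: grad_accum={steps}, total_sequences={sequences:,}")
```

With a fixed GPU count, a larger TP leaves fewer data-parallel replicas, so each replica accumulates more micro-batches to reach the same GBS of 512.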

bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm._nemotron_labs_diffusion_common(
model_provider: megatron.bridge.diffusion.models.nemotron_labs_diffusion.nemotron_labs_diffusion_provider.NemotronLabsDiffusionModelProvider | None = None,
hf_path: str | None = None,
dir: str | None = None,
name: str = 'default',
data_paths: list[str] | None = None,
data_args_path: str | None = None,
train_data_path: list[str] | None = None,
valid_data_path: list[str] | None = None,
test_data_path: list[str] | None = None,
per_split_data_args_path: str | None = None,
mock: bool = False,
tensor_parallelism: int = 1,
pipeline_parallelism: int = 1,
pipeline_parallelism_dtype: torch.dtype | None = None,
virtual_pipeline_parallelism: int | None = None,
context_parallelism: int = 1,
sequence_parallelism: bool = False,
use_megatron_fsdp: bool = False,
enable_recompute: bool = False,
train_iters: int = 300000,
global_batch_size: int = 32,
micro_batch_size: int = 2,
seq_length: int = 4096,
lr: float = 0.0003,
min_lr: float = 3e-05,
lr_warmup_iters: int = 500,
lr_decay_iters: int | None = None,
lr_decay_style: str = 'cosine',
lr_warmup_fraction: float | None = None,
lr_wsd_decay_iters: int | None = None,
tokenizer_model: str | None = None,
eval_interval: int = 500,
save_interval: int = 500,
load_hf_checkpoint: str | None = None,
precision_config: megatron.bridge.training.mixed_precision.MixedPrecisionConfig | str | None = 'bf16_mixed',
comm_overlap_config: megatron.bridge.training.comm_overlap.CommOverlapConfig | None = None,
) → megatron.bridge.training.config.ConfigContainer

Create a pre-training configuration for NemotronLabsDiffusion models using a given model provider.

Parameters:
  • hf_path (Optional[str]) – HuggingFace model path (e.g., "Qwen/Qwen3-1.7B").

  • model_provider (Optional[NemotronLabsDiffusionModelProvider]) – Provider used to build the model.

  • dir (Optional[str]) – Base directory for saving logs and checkpoints.

  • name (str) – Name of the pre-training run.

  • data_paths (Optional[List[str]]) – List of paths to dataset files. If None, mock data will be used.

  • data_args_path (Optional[str]) – Path to file containing data arguments.

  • train_data_path (Optional[List[str]]) – List of training data paths.

  • valid_data_path (Optional[List[str]]) – List of validation data paths.

  • test_data_path (Optional[List[str]]) – List of test data paths.

  • per_split_data_args_path (Optional[str]) – Path to JSON file with per-split data configuration.

  • mock (bool) – Whether to use mock data. If True, ignores data_paths.

  • tensor_parallelism (int) – Degree of tensor model parallelism.

  • pipeline_parallelism (int) – Degree of pipeline model parallelism.

  • pipeline_parallelism_dtype (Optional[torch.dtype]) – Data type for pipeline parallelism.

  • virtual_pipeline_parallelism (Optional[int]) – Size of virtual pipeline parallelism.

  • context_parallelism (int) – Degree of context parallelism to be passed to model_config.

  • sequence_parallelism (bool) – Whether to use sequence parallelism.

  • use_megatron_fsdp (bool) – Whether to use Megatron FSDP.

  • enable_recompute (bool) – Whether to enable recompute for memory optimization.

  • train_iters (int) – Total number of training iterations.

  • global_batch_size (int) – Global batch size for training.

  • micro_batch_size (int) – Micro batch size for training.

  • seq_length (int) – Sequence length for training data.

  • lr (float) – Learning rate.

  • min_lr (float) – Minimum learning rate for cosine decay.

  • lr_warmup_iters (int) – Number of warmup iterations for the learning rate.

  • lr_decay_iters (Optional[int]) – Number of iterations over which to decay the LR.

  • lr_decay_style (str) – LR decay style ("cosine" or "WSD").

  • lr_warmup_fraction (Optional[float]) – Fraction of train_iters for warmup (WSD only).

  • lr_wsd_decay_iters (Optional[int]) – Number of decay iterations for WSD scheduler.

  • tokenizer_model (Optional[str]) – HuggingFace tokenizer model ID. If None, uses NullTokenizer.

  • precision_config (Optional[Union[MixedPrecisionConfig, str]]) – Precision configuration for the model.

  • comm_overlap_config (Optional[CommOverlapConfig]) – Communication overlap configuration.

Returns:

Configuration for pre-training.

Return type:

ConfigContainer
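The learning-rate parameters above interact in a way that is easier to see numerically. The following is a generic sketch of the two documented decay styles, not the scheduler Megatron-Bridge actually runs. It assumes linear warmup from zero, a decay horizon equal to `train_iters`, and a linear decay shape for the final WSD phase; `lr_at` is a hypothetical helper whose defaults mirror the signature above.

```python
import math


def lr_at(step, *, train_iters, lr=3e-4, min_lr=3e-5, lr_warmup_iters=500,
          lr_decay_style="cosine", lr_wsd_decay_iters=None):
    """Illustrative learning rate at `step` for the two documented decay styles."""
    if step < lr_warmup_iters:
        # Linear warmup from 0 up to the peak learning rate.
        return lr * step / lr_warmup_iters
    if lr_decay_style == "cosine":
        # Fraction of the post-warmup horizon completed, clamped to [0, 1].
        progress = (step - lr_warmup_iters) / max(1, train_iters - lr_warmup_iters)
        progress = min(progress, 1.0)
        return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    if lr_decay_style == "WSD":
        decay_iters = lr_wsd_decay_iters or 0
        decay_start = train_iters - decay_iters
        if decay_iters == 0 or step < decay_start:
            return lr  # stable phase at the peak learning rate
        # Assumed linear decay over the final `lr_wsd_decay_iters` steps.
        progress = min((step - decay_start) / decay_iters, 1.0)
        return lr - (lr - min_lr) * progress
    raise ValueError(f"unknown lr_decay_style: {lr_decay_style!r}")


if __name__ == "__main__":
    for s in (0, 250, 500, 6_500, 12_500):
        cosine = lr_at(s, train_iters=12_500)
        wsd = lr_at(s, train_iters=12_500, lr_decay_style="WSD",
                    lr_wsd_decay_iters=2_500)
        print(f"step {s:>6}: cosine={cosine:.2e}  WSD={wsd:.2e}")
```

The cosine style starts decaying immediately after warmup, while WSD holds the peak rate and spends only the last `lr_wsd_decay_iters` steps decaying.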

bridge.diffusion.recipes.nemotron_labs_diffusion.ar_to_dlm.distributed_fused_adam_with_cosine_annealing_dllm(
precision: str = 'bf16-mixed',
lr_warmup_iters: int = 2000,
lr_decay_iters: Optional[int] = None,
weight_decay: float = 0.1,
max_lr: float = 0.0001,
min_lr: Optional[float] = None,
clip_grad: float = 1.0,
) → tuple[megatron.bridge.training.config.OptimizerConfig, megatron.bridge.training.config.SchedulerConfig]

Creates a distributed fused Adam optimizer with a cosine annealing scheduler, using the default parameters from Megatron-Bridge.

Parameters:
  • precision – Mixed precision type ("bf16-mixed", "16-mixed", etc.)

  • lr_warmup_iters – Number of iterations for learning rate warmup

  • lr_decay_iters – Number of iterations for learning rate decay. If None, defaults to train_iters during training.

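  • adam_beta1 – Adam optimizer beta1 parameter (not present in the signature shown above)

  • adam_beta2 – Adam optimizer beta2 parameter (not present in the signature shown above)

  • adam_eps – Adam optimizer epsilon parameter (not present in the signature shown above)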

  • weight_decay – Weight decay coefficient

  • max_lr – Maximum learning rate

  • min_lr – Minimum learning rate (defaults to 0.1 * max_lr)

  • clip_grad – Gradient clipping value

Returns:

Tuple of (OptimizerConfig, SchedulerConfig)
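Two of the parameters above have documented fallbacks: `min_lr` defaults to `0.1 * max_lr`, and `lr_decay_iters` defaults to `train_iters`. A small sketch of that resolution is shown below. The helper `resolve_schedule_defaults` is hypothetical, and where the library actually applies these fallbacks (inside this function or later during training) is not specified on this page.

```python
def resolve_schedule_defaults(train_iters, max_lr=1e-4, min_lr=None, lr_decay_iters=None):
    """Apply the two documented fallbacks and return (min_lr, lr_decay_iters)."""
    if min_lr is None:
        min_lr = 0.1 * max_lr  # documented default: one tenth of the peak rate
    if lr_decay_iters is None:
        lr_decay_iters = train_iters  # documented default: decay over the full run
    return min_lr, lr_decay_iters


if __name__ == "__main__":
    # Explicit values are passed through unchanged; None triggers the fallback.
    print(resolve_schedule_defaults(train_iters=12_500))
    print(resolve_schedule_defaults(train_iters=12_500, min_lr=5e-6, lr_decay_iters=10_000))
```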