Configuration Overview#

The ConfigContainer is the central configuration object in Megatron Bridge that holds all settings for training. It acts as a single source of truth that brings together model architecture, training parameters, data loading, optimization, checkpointing, logging, and distributed training settings.

What is ConfigContainer#

ConfigContainer is a dataclass that holds all the configuration objects needed for training:

from megatron.bridge.training.config import ConfigContainer

# ConfigContainer brings together all training configurations
config = ConfigContainer(
    model=model_provider,             # Model architecture and parallelism
    train=training_config,            # Training loop parameters  
    optimizer=optimizer_config,       # Megatron Optimization settings
    scheduler=scheduler_config,       # Learning rate scheduling
    dataset=dataset_config,           # Data loading configuration
    logger=logger_config,             # Logging and monitoring
    tokenizer=tokenizer_config,       # Tokenization settings
    checkpoint=checkpoint_config,     # Checkpointing and resuming
    dist=distributed_config,          # Distributed training setup
    ddp=ddp_config,                   # Megatron Distributed Data Parallel settings
    # Optional configurations
    peft=peft_config,                 # Parameter-efficient fine-tuning
    profiling=profiling_config,       # Performance profiling
    mixed_precision=mp_config,        # Mixed precision training
    comm_overlap=comm_overlap_config, # Communication overlap settings
    # ... and more
)

Configuration Components#

| Component | Purpose | Required | Default |
|---|---|---|---|
| model | Model architecture and parallelism strategy (GPT, T5, Mamba) | ✅ | - |
| train | Training loop parameters (batch sizes, iterations, validation) | ✅ | - |
| optimizer | Optimizer type and hyperparameters (from Megatron Core) | ✅ | - |
| scheduler | Learning rate and weight decay scheduling | ✅ | - |
| dataset | Data loading and preprocessing configuration | ✅ | - |
| logger | Logging, TensorBoard, and WandB configuration | ✅ | - |
| tokenizer | Tokenizer settings and vocabulary | ✅ | - |
| checkpoint | Checkpointing, saving, and loading | ✅ | - |
| dist | Distributed training initialization | | DistributedInitConfig() |
| ddp | Data parallel configuration (from Megatron Core) | | DistributedDataParallelConfig() |
| rng | Random number generation settings | | RNGConfig() |
| rerun_state_machine | Result validation and error injection | | RerunStateMachineConfig() |
| mixed_precision | Mixed precision training settings | | None |
| comm_overlap | Communication overlap optimizations | | None |
| peft | Parameter-efficient fine-tuning (LoRA, DoRA, etc.) | | None |
| profiling | Performance profiling with nsys or PyTorch profiler | | None |
| ft | Fault tolerance and automatic recovery | | None |
| straggler | GPU straggler detection | | None |
| nvrx_straggler | NVIDIA Resiliency Extension straggler detection | | None |
| inprocess_restart | In-process restart for fault tolerance | | None |
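
Only the components marked ✅ must be provided; every other component falls back to the default listed above. A minimal sketch, reusing the placeholder configs from the example at the top of this page:

# Only required components are passed; dist, ddp, rng, and
# rerun_state_machine fall back to the defaults listed above
config = ConfigContainer(
    model=model_provider,
    train=training_config,
    optimizer=optimizer_config,
    scheduler=scheduler_config,
    dataset=dataset_config,
    logger=logger_config,
    tokenizer=tokenizer_config,
    checkpoint=checkpoint_config,
)

print(config.dist)  # DistributedInitConfig()
print(config.rng)   # RNGConfig()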

Design Philosophy#

Interoperability with External Config Systems#

Megatron Bridge's Python configurations are designed to work alongside configuration systems you may already use, such as:

  • Programmatic configuration: Direct Python object manipulation

  • argparse: Command-line arguments can be easily mapped to dataclass fields

  • File-based overrides: JSON, YAML, or other config files can override Python configs

All of these approaches can be translated into Python dataclass instances. The framework provides convenience utilities for YAML-based overrides via OmegaConf, but it is not tied to any particular configuration system, as the examples below illustrate.

# All of these approaches work seamlessly:

# 1. Direct Python configuration
config = ConfigContainer(
    model=GPTModelProvider(num_layers=24, hidden_size=2048),
    train=TrainingConfig(global_batch_size=256, train_iters=10000),
    # ... other configs
)

# 2. YAML-based serialization and deserialization (round-trip)
config.to_yaml("my_config.yaml")
config = ConfigContainer.from_yaml("my_config.yaml")  # Load previously saved config

# 3. Programmatic override after creation
config.train.global_batch_size = 512  # Override after instantiation
config.model.num_layers = 48          # Modify model architecture
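
As an illustration of the argparse point above, command-line flags can be mapped onto dataclass fields by hand. This is a generic pattern, not a built-in CLI of Megatron Bridge, and the flag names below are hypothetical:

import argparse

# 4. argparse-based override (generic sketch; flag names are hypothetical)
parser = argparse.ArgumentParser()
parser.add_argument("--train-iters", type=int, default=None)
parser.add_argument("--global-batch-size", type=int, default=None)
args = parser.parse_args()

# Map each flag onto the corresponding dataclass field
if args.train_iters is not None:
    config.train.train_iters = args.train_iters
if args.global_batch_size is not None:
    config.train.global_batch_size = args.global_batch_size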

Centralized Configuration#

Megatron provides extensive flexibility through a rich set of configuration options. The ConfigContainer brings all of these settings together in a single, organized object. This centralization makes configuration discoverable and maintainable: you have one place to understand and control all aspects of your training run.

Unlike pure YAML-based configuration systems, ConfigContainer provides centralization with the full power of Python. You get the organizational benefits of a single configuration file combined with the programmatic flexibility of Python.

The configuration system is built using nested dataclasses, providing:

  • Modularity: Each config component is independently defined and testable

  • Type safety: Full static type checking

  • IDE support: Autocomplete and type hints in development environments

  • Serialization: Easy conversion to/from YAML, JSON, or other formats

  • Validation: Built-in field validation

@dataclass
class ConfigContainer:
    model: GPTModelProvider      # Dataclass for model architecture
    train: TrainingConfig        # Dataclass for training parameters
    optimizer: OptimizerConfig   # Dataclass for optimization settings
    # ... nested dataclasses for each concern
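
Because each component is itself a dataclass, it can be built and checked in isolation before being placed in a ConfigContainer. A minimal sketch, assuming TrainingConfig is importable alongside ConfigContainer (the import path is an assumption):

from megatron.bridge.training.config import TrainingConfig  # import path assumed

# Build one component on its own and inspect its fields directly
train_cfg = TrainingConfig(global_batch_size=256, train_iters=10000)
assert train_cfg.train_iters == 10000

# The same instance can later be passed to a ConfigContainer unchanged:
# config = ConfigContainer(train=train_cfg, ...)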

Lazy Configuration and Deferred Validation#

For training workloads, configuration validation and finalization are deferred (lazy) to support flexible user workflows:

Problem with Eager Validation:

# This would be problematic with eager validation:
config = TrainingConfig(train_iters=1000)
# __post_init__ calculates dependent values immediately

config.train_iters = 5000  # User override
# Dependent values are now stale and incorrect!

Solution with Lazy Finalization:

# Megatron Bridge approach - deferred validation
config = TrainingConfig(train_iters=1000)
config.train_iters = 5000  # User can safely override

# Validation happens automatically when training starts
pretrain(config, forward_step_func)  # All dependent values calculated correctly

Benefits:

  • Users can instantiate configs and subsequently override fields safely

  • Dependent values are calculated correctly after all user modifications are applied

  • Validation happens at the right time, just before training begins

  • Flexible configuration workflows are supported

Model Independence#

Model configurations are designed to be independently usable outside the full training loop provided by the framework:

# Models can be used standalone
model_provider = GPTModelProvider(
    num_layers=24,
    hidden_size=2048,
    vocab_size=50000,    # Must be explicitly set
    seq_length=2048,     # Must be explicitly set
)

# This works independently of other configs
model_provider.finalize()
model = model_provider.provide()

Trade-off: The price of this flexibility is that values such as seq_length must be set explicitly in more than one place (for example, on both the model provider and the dataset config). These settings are checked for consistency at the start of training.

Usage#

# Create and configure
config = ConfigContainer(
    model=GPTModelProvider(num_layers=24, seq_length=2048),
    train=TrainingConfig(train_iters=1000),
    dataset=GPTDatasetConfig(sequence_length=2048),  # Must match model seq_length
    # ... other required configs
)

# Modify as needed
config.train.train_iters = 5000
config.model.hidden_size = 4096

# Start training - validation happens automatically
pretrain(config, forward_step_func)

Configuration Export and Import#

Export to YAML#

# Print YAML configuration to console
config.print_yaml()

# Save to file
config.to_yaml("config.yaml")

Load from YAML#

# Load configuration from YAML file
config = ConfigContainer.from_yaml("config.yaml")
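
Putting the pieces together: a saved YAML can be reloaded, selectively overridden in Python, and passed straight to training; thanks to deferred validation, the modified values are checked when pretrain is called.

# Load a saved config, override a few fields, then train
config = ConfigContainer.from_yaml("config.yaml")
config.train.train_iters = 20000   # illustrative override
config.model.hidden_size = 4096    # validated when training starts

pretrain(config, forward_step_func)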