Training Loop Configuration#
The bridge.training.config.TrainingConfig contains settings related to the training loop bounds, exit conditions, validation, batch sizing, and memory management.
Key Parameters#
Configure these parameters to control core training behavior, resource utilization, and monitoring across distributed setups.
Batch Configuration#
Define how data is batched and distributed across devices during training.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Batch size per model instance (local batch size) |
|
|
|
Training batch size across all devices |
|
|
|
Batch size ramp up: |
|
|
|
Automatically decrease batch size if needed for fault tolerance |
The relationship between batch sizes:
Global batch size =
micro_batch_size×data_parallel_size×gradient_accumulation_stepsIf
global_batch_sizeis not set, it defaults tomicro_batch_size×data_parallel_size
Training Duration#
Control when training stops using iteration counts, sample counts, or time-based limits.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Total number of iterations to train |
|
|
|
Total number of samples to train |
|
|
|
Exit after iteration divisible by this value |
|
|
|
Exit after this many minutes |
Training Mode Selection
Megatron-Bridge supports two modes for specifying training duration:
Iteration-based training: Specify
train_itersto control the total number of training iterations.Sample-based training: Specify
train_samplesto control the total number of training samples.
Important constraints:
You must specify exactly one of
train_itersortrain_samples- not both.When using
train_samples, training iterations are automatically calculated astrain_samples // global_batch_size.Batch size rampup (
rampup_batch_size) is not currently supported with sample-based training.Your scheduler configuration should match your training mode (see Learning Rate Scheduling).
Validation#
Configure validation frequency, duration, and evaluation-only modes.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Number of iterations for validation/test evaluation |
|
|
|
Interval between validation runs |
|
|
|
Skip training, only do evaluation and exit |
Note: To control validation behavior:
Set
eval_itersto0to disable validation entirely (both during and after training).Set
eval_intervaltoNoneto skip validation during training, but still run validation after training completes.
Memory Management#
Control GPU memory cleanup and garbage collection to prevent memory issues during training.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Call |
|
|
|
Synchronize Python garbage collection across ranks to avoid stragglers |
|
|
|
Training step interval for manual garbage collection (0=disabled) |
|
|
|
Enable garbage collection during evaluation when using manual GC |
Signal Handling and Exit Conditions#
Set up automatic checkpoint saving and clean exit procedures for signal-based interruptions.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Save checkpoint and shutdown gracefully on signal detection |
|
|
|
Signal to handle for graceful shutdown |
|
|
|
Use signal handler for dataloader workers |
Performance Monitoring#
Monitor training consistency and synchronization across distributed processes.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Check weight hash consistency across data parallel replicas |
|
|
|
CPU-GPU synchronization interval to prevent CPU running ahead |