Training Loop Configuration
The `bridge.training.config.TrainingConfig` class contains settings for training loop bounds, exit conditions, validation, batch sizing, and memory management.
Key Parameters
Configure these parameters to control core training behavior, resource utilization, and monitoring across distributed setups.
Batch Configuration
Define how data is batched and distributed across devices during training.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `micro_batch_size` | | | Batch size per model instance (local batch size) |
| `global_batch_size` | | | Training batch size across all devices |
| | | | Batch size ramp up: |
| | | | Automatically decrease batch size if needed for fault tolerance |
The relationship between batch sizes:

- Global batch size = `micro_batch_size` × `data_parallel_size` × `gradient_accumulation_steps`
- If `global_batch_size` is not set, it defaults to `micro_batch_size` × `data_parallel_size`
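
As a worked example of the batch size relationship, the following minimal sketch solves it for the number of gradient accumulation steps. The helper name `gradient_accumulation_steps` and the choice to raise `ValueError` on a non-divisible global batch size are assumptions made for illustration; they are not taken from the library.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Solve global = micro * data_parallel * accumulation for the accumulation steps.

    Hypothetical helper for illustration only; not part of TrainingConfig.
    """
    samples_per_step = micro_batch_size * data_parallel_size
    if global_batch_size is None:
        # Documented default: global batch size = micro_batch_size * data_parallel_size.
        global_batch_size = samples_per_step
    if global_batch_size % samples_per_step != 0:
        # Assumption of this sketch: reject sizes that do not divide evenly.
        raise ValueError(
            f"global batch size {global_batch_size} is not a multiple of "
            f"micro_batch_size * data_parallel_size = {samples_per_step}"
        )
    return global_batch_size // samples_per_step


print(gradient_accumulation_steps(512, 4, 16))   # 8
print(gradient_accumulation_steps(None, 4, 16))  # 1
```

With a micro batch size of 4 on 16 data-parallel replicas, each step processes 64 samples, so a global batch size of 512 needs 8 accumulation steps.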
Training Duration
Control when training stops using iteration counts or time-based limits.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Total number of iterations to train |
| | | | Exit after iteration divisible by this value |
| | | | Exit after this many minutes |
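
The three stopping rules described in this section can be combined into a single check, as in the rough sketch below. The function `should_exit` and all of its argument names are hypothetical and chosen only for readability; the actual training loop may order or implement these checks differently.

```python
import time


def should_exit(iteration, train_iters, exit_interval=None,
                start_time=None, exit_duration_in_mins=None, now=None):
    """Return a reason string if training should stop after `iteration`, else None.

    Hypothetical helper; argument names are assumptions for this sketch.
    """
    # Bound 1: total number of iterations to train.
    if iteration >= train_iters:
        return "train_iters reached"
    # Bound 2: exit after an iteration divisible by the configured value.
    if exit_interval is not None and iteration > 0 and iteration % exit_interval == 0:
        return "exit_interval reached"
    # Bound 3: exit after the configured number of minutes has elapsed.
    if exit_duration_in_mins is not None and start_time is not None:
        current = time.monotonic() if now is None else now
        elapsed_mins = (current - start_time) / 60.0
        if elapsed_mins > exit_duration_in_mins:
            return "time limit reached"
    return None


print(should_exit(4, 10, exit_interval=2))  # exit_interval reached
```

Passing `now` explicitly keeps the time-based branch easy to test; in a real loop it would be left as `None` so the monotonic clock is read on each call.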
Validation
Configure validation frequency, duration, and evaluation-only modes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `eval_iters` | | | Number of iterations for validation/test evaluation |
| `eval_interval` | | | Interval between validation runs |
| | | | Skip training, only do evaluation and exit |
Note: To control validation behavior:

- Set `eval_iters` to `0` to disable validation entirely (both during and after training).
- Set `eval_interval` to `None` to skip validation during training, but still run validation after training completes.
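
The two rules in the note can be expressed as a pair of predicates. This is a minimal sketch of the documented behavior, not the library's implementation; the function names `validate_during_training` and `validate_after_training` are hypothetical.

```python
def validate_during_training(iteration, eval_iters, eval_interval):
    """Decide whether to run validation after the given training iteration."""
    if eval_iters == 0:
        # eval_iters = 0 disables validation entirely.
        return False
    if eval_interval is None:
        # eval_interval = None skips validation during training only.
        return False
    return iteration > 0 and iteration % eval_interval == 0


def validate_after_training(eval_iters):
    """Final validation still runs unless eval_iters is 0."""
    return eval_iters > 0


# eval_interval=None: nothing during training, but a final validation pass.
print(validate_during_training(100, eval_iters=32, eval_interval=None))  # False
print(validate_after_training(eval_iters=32))                            # True
```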
Memory Management
Control GPU memory cleanup and garbage collection to prevent memory issues during training.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Call |
| | | | Synchronize Python garbage collection across ranks to avoid stragglers |
| | | | Training step interval for manual garbage collection (0=disabled) |
| | | | Enable garbage collection during evaluation when using manual GC |
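
To illustrate what manual garbage collection at a fixed step interval means, the sketch below uses only Python's standard `gc` module. It turns off automatic collection and collects every `gc_interval` steps, so that in a multi-rank job every rank would pause at the same step. The function `run_steps` and its arguments are hypothetical, and treating an interval of `0` as "leave automatic collection on" is an assumption of this sketch.

```python
import gc


def run_steps(num_steps, gc_interval, train_step):
    """Run `train_step` for `num_steps` steps, collecting garbage every `gc_interval` steps.

    Returns the list of steps at which a manual collection happened.
    """
    collected_at = []
    manual = gc_interval > 0  # 0 means manual GC is disabled
    was_enabled = gc.isenabled()
    if manual:
        gc.disable()  # stop the collector from firing at unpredictable times
    try:
        for step in range(1, num_steps + 1):
            train_step(step)
            if manual and step % gc_interval == 0:
                gc.collect()  # every rank would collect at this same step
                collected_at.append(step)
    finally:
        if manual and was_enabled:
            gc.enable()  # restore the interpreter's previous setting
    return collected_at


print(run_steps(10, 4, lambda step: None))  # [4, 8]
```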
Signal Handling and Exit Conditions
Set up automatic checkpoint saving and clean exit procedures for signal-based interruptions.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Save checkpoint and shutdown gracefully on signal detection |
| | | | Signal to handle for graceful shutdown |
| | | | Use signal handler for dataloader workers |
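
The general pattern behind a graceful, signal-driven shutdown is to record the signal in a flag and let the training loop act on it at a safe point. The sketch below shows that pattern with Python's standard `signal` module. The `GracefulExit` class, the `train` function, and the `save_checkpoint` callback are hypothetical names for illustration, and the sketch assumes it runs in the main thread, where Python allows handlers to be installed.

```python
import signal


class GracefulExit:
    """Context manager that records whether a given signal was received."""

    def __init__(self, sig=signal.SIGTERM):
        self.sig = sig
        self.received = False
        self._previous = None

    def _handle(self, signum, frame):
        # Only set a flag; the loop decides when it is safe to stop.
        self.received = True

    def __enter__(self):
        self._previous = signal.signal(self.sig, self._handle)
        return self

    def __exit__(self, exc_type, exc, tb):
        if self._previous is not None:
            signal.signal(self.sig, self._previous)  # restore the old handler
        return False


def train(num_steps, exit_flag, save_checkpoint):
    """Run up to `num_steps` steps; checkpoint and stop early if the flag is set."""
    for step in range(1, num_steps + 1):
        # ... one training step would run here ...
        if exit_flag.received:
            save_checkpoint(step)
            return step
    return num_steps
```

Checking the flag once per step means the checkpoint is always written between steps, never in the middle of an optimizer update.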
Performance Monitoring
Monitor training consistency and synchronization across distributed processes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Check weight hash consistency across data parallel replicas |
| | | | CPU-GPU synchronization interval to prevent CPU running ahead |
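
The idea of a weight hash consistency check can be illustrated without any distributed runtime. The sketch below simulates data parallel replicas as plain lists of floats, hashes each one, and compares the digests. In a real job the hashes would be exchanged with a collective operation rather than gathered in one process. The helper names `weights_hash` and `replicas_consistent` are hypothetical, and SHA-256 is an arbitrary choice for this example.

```python
import hashlib
import struct


def weights_hash(weights):
    """Hash a flat list of float weights by their exact 64-bit representation."""
    packed = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(packed).hexdigest()


def replicas_consistent(replica_weights):
    """Return True if every simulated replica holds bit-identical weights."""
    hashes = {weights_hash(w) for w in replica_weights}
    return len(hashes) <= 1


in_sync = [[0.5, -1.25, 3.0], [0.5, -1.25, 3.0]]
diverged = [[0.5, -1.25, 3.0], [0.5, -1.25, 3.0000001]]
print(replicas_consistent(in_sync))   # True
print(replicas_consistent(diverged))  # False
```

Comparing hashes instead of full tensors keeps the amount of data exchanged small, at the cost of only detecting that a divergence exists, not where it is.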