nemo_automodel.components.datasets.diffusion.sampler
Module Contents
Classes
Data
API
Bases: Sampler[List[int]]
Production-grade Sampler that:
- Supports Distributed Data Parallel (DDP) by splitting data across GPUs
- Shuffles deterministically via torch.Generator (enables resumable training)
- Generates batches lazily (saves RAM compared to pre-computing all batches)
- Guarantees equal batch counts across all ranks (prevents DDP deadlocks)
- Processes all images in bucket A before moving to bucket B
- Shuffles samples within each bucket (deterministically)
- Drops incomplete batches at end of each bucket
- Uses dynamic batch sizes based on resolution
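The behaviour listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: `iter_bucket_batches` and its parameters are hypothetical names, `random.Random` stands in for a seeded `torch.Generator`, and one fixed batch size is used for every bucket to keep the example short.

```python
import random
from typing import Dict, Iterator, List, Tuple

Resolution = Tuple[int, int]


def iter_bucket_batches(
    bucket_groups: Dict[Resolution, List[int]],
    batch_size: int,
    rank: int,
    num_replicas: int,
    seed: int = 0,
    epoch: int = 0,
) -> Iterator[List[int]]:
    """Lazily yield this rank's batches, finishing one bucket before the next."""
    # Assumption: random.Random is a stand-in for a seeded torch.Generator.
    rng = random.Random(seed + epoch)
    for key in sorted(bucket_groups):
        indices = list(bucket_groups[key])
        # Every rank uses the same seed, so every rank sees the same order.
        rng.shuffle(indices)
        # One global step consumes num_replicas * batch_size samples.
        chunk = num_replicas * batch_size
        # Drop the incomplete tail of the bucket so all ranks stay in step.
        usable = len(indices) - len(indices) % chunk
        for start in range(0, usable, chunk):
            group = indices[start:start + chunk]
            # Each rank takes its own contiguous slice of the global step.
            yield group[rank * batch_size:(rank + 1) * batch_size]


buckets = {(512, 512): list(range(10)), (1024, 1024): list(range(10, 17))}
rank0 = list(iter_bucket_batches(buckets, batch_size=2, rank=0, num_replicas=2))
rank1 = list(iter_bucket_batches(buckets, batch_size=2, rank=1, num_replicas=2))
```

Because the function is a generator, no batch exists in memory until it is requested, and both ranks produce the same number of batches with no shared indices.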
_batches_to_skip
_batches_yielded
_total_batches
bucket_groups
bucket_keys
calculator
epoch
Calculate the total number of batches, ensuring that all ranks get the same count. Each bucket is padded so that its sample count is divisible by (num_replicas * batch_size).
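The padding arithmetic can be illustrated as follows. This is a sketch under assumptions: `total_batches_per_rank` and its dictionary arguments are hypothetical names, and the bucket sizes are made-up numbers chosen only to show the calculation.

```python
import math
from typing import Dict


def total_batches_per_rank(
    bucket_sizes: Dict[str, int],
    batch_sizes: Dict[str, int],
    num_replicas: int,
) -> int:
    """Count the batches each rank yields after per-bucket padding."""
    total = 0
    for key, num_samples in bucket_sizes.items():
        # One global step consumes num_replicas * batch_size samples.
        chunk = num_replicas * batch_sizes[key]
        # Round the bucket up to the next multiple of the chunk size.
        padded = math.ceil(num_samples / chunk) * chunk
        # Each global step gives every rank exactly one batch.
        total += padded // chunk
    return total


# Illustrative numbers only: two buckets, two ranks.
sizes = {"512x512": 10, "1024x1024": 7}
per_bucket_batch = {"512x512": 4, "1024x1024": 2}
count = total_batches_per_rank(sizes, per_bucket_batch, num_replicas=2)
```

Since the result depends only on bucket sizes, batch sizes, and the replica count, every rank computes the same number without communicating.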
Get batch size for resolution (dynamic or fixed based on setting).
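The source does not state how the dynamic batch size is derived, so the following is only one plausible rule: keep the number of pixels per batch roughly constant. The function name, the `base_resolution` default, and the scaling formula are all assumptions made for illustration.

```python
from typing import Tuple


def batch_size_for_resolution(
    resolution: Tuple[int, int],
    base_batch_size: int,
    base_resolution: Tuple[int, int] = (512, 512),
    dynamic: bool = True,
) -> int:
    """Return a batch size for a resolution (hypothetical scaling rule)."""
    if not dynamic:
        # Fixed mode: every bucket uses the same batch size.
        return base_batch_size
    base_pixels = base_resolution[0] * base_resolution[1]
    pixels = resolution[0] * resolution[1]
    # Assumed rule: scale inversely with pixel count, never below one sample.
    return max(1, (base_batch_size * base_pixels) // pixels)
```

Under this assumed rule, a bucket with four times the pixels of the base resolution gets a quarter of the base batch size.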
Get information about a specific batch.
Note: With lazy evaluation, we don’t pre-compute batches, so this returns bucket-level info for the estimated batch.
Restore sampler state; the next iter will skip already-yielded batches.
Setting the epoch is crucial for reproducibility and for getting a different shuffle each epoch.
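A common way to get both properties is to derive the shuffle seed from a base seed plus the epoch number. The sketch below assumes that scheme; `epoch_shuffle` is a hypothetical helper, and `random.Random` stands in for a seeded `torch.Generator`.

```python
import random
from typing import List


def epoch_shuffle(indices: List[int], seed: int, epoch: int) -> List[int]:
    """Return a shuffled copy whose order depends only on seed and epoch."""
    # Assumption: the effective seed is seed + epoch.
    rng = random.Random(seed + epoch)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return shuffled


data = list(range(100))
first = epoch_shuffle(data, seed=42, epoch=0)
again = epoch_shuffle(data, seed=42, epoch=0)
later = epoch_shuffle(data, seed=42, epoch=1)
```

Repeating an epoch reproduces its order exactly, while advancing the epoch changes the order without changing the set of samples.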
Return sampler state for mid-epoch checkpointing.
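The save and restore cycle might look like the sketch below. The attribute names `epoch`, `_batches_yielded`, and `_batches_to_skip` come from this page; the class `ResumableBatches`, the method names, the state dictionary keys, and the pre-built batch list are assumptions made to keep the example self-contained.

```python
import itertools
from typing import Dict, Iterator, List


class ResumableBatches:
    """Toy sampler that can resume iteration in the middle of an epoch."""

    def __init__(self, batches: List[List[int]]) -> None:
        self._batches = batches
        self.epoch = 0
        self._batches_yielded = 0
        self._batches_to_skip = 0

    def state_dict(self) -> Dict[str, int]:
        # Hypothetical key names; enough to resume at the same position.
        return {"epoch": self.epoch, "batches_yielded": self._batches_yielded}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.epoch = state["epoch"]
        # The next iteration skips what was already consumed.
        self._batches_to_skip = state["batches_yielded"]

    def __iter__(self) -> Iterator[List[int]]:
        # Consume the skip count once so later epochs start from the beginning.
        skip, self._batches_to_skip = self._batches_to_skip, 0
        self._batches_yielded = skip
        for batch in itertools.islice(self._batches, skip, None):
            self._batches_yielded += 1
            yield batch


all_batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
sampler = ResumableBatches(all_batches)
iterator = iter(sampler)
next(iterator)
next(iterator)
checkpoint = sampler.state_dict()  # taken after two batches

restored = ResumableBatches(all_batches)
restored.load_state_dict(checkpoint)
remaining = list(restored)
next_epoch = list(restored)
```

Resuming only reproduces the original run if the restored sampler shuffles with the same seed and epoch, which is why the epoch is part of the saved state.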