bridge.data.energon.base_energon_datamodule#
Module Contents#
Classes#
EnergonMultiModalDataModule | A DataModule for handling multimodal datasets with images and text.
EnergonDataloader | A wrapper to use Megatron Energon dataloader with the Megatron-LM training loop.
Functions#
cyclic_iter | Create a cyclic iterator that loops over the given iterable indefinitely.
Data#
API#
- bridge.data.energon.base_energon_datamodule.logger#
‘getLogger(…)’
- class bridge.data.energon.base_energon_datamodule.EnergonMultiModalDataModule(
- path: str,
- tokenizer,
- image_processor,
- seq_length: int = 2048,
- micro_batch_size: int = 1,
- global_batch_size: int = 1,
- num_workers: int = 1,
- num_val_workers: int | None = None,
- pin_memory: bool = True,
- shuffle_buffer_size: int = 100,
- max_samples_per_sequence: int | None = None,
- multimodal_sample_config: Optional[Any] = None,
- task_encoder: Optional[Any] = None,
- decoder_seq_length: Optional[int] = None,
- packing_buffer_size: Optional[int] = None,
- validation_task_encoder: Optional[Any] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- **kwargs,
- )
A DataModule for handling multimodal datasets with images and text.
This data module loads training and validation data for datasets that combine images and text, manages batching, and tracks the state of the data pipeline across training epochs. It integrates with the Megatron-Energon framework for efficient data handling in large-scale distributed training.
Attributes:
path (str): Path to the energon dataset.
tokenizer (Tokenizer): The tokenizer used for processing text.
image_processor (ImageProcessor): The image processor used for preprocessing images.
seq_length (int): The maximum sequence length for tokenized text.
micro_batch_size (int): The batch size for training and validation.
num_workers (int): Number of workers for data loading.
pin_memory (bool): Whether to pin memory in the DataLoader.
multimodal_sample_config (MultiModalSampleConfig): Configuration object for multimodal samples.
task_encoder (MultiModalTaskEncoder): Encoder responsible for encoding and batching samples.
init_global_step (int): The initial global step for the trainer, used for resuming training.
train_dataloader_object (Optional): The DataLoader object for training data.
val_dataloader_object (Optional): The DataLoader object for validation data.
Initialization
Initialize the EnergonMultiModalDataModule.
Parameters:
path (str): Path to the dataset.
tokenizer (Tokenizer): The tokenizer used for processing text.
image_processor (ImageProcessor): The image processor used for preprocessing images.
seq_length (int, optional): The maximum sequence length for tokenized text. Defaults to 2048.
micro_batch_size (int, optional): The batch size for training and validation. Defaults to 1.
global_batch_size (int, optional): The global batch size. Defaults to 1.
num_workers (int, optional): Number of workers for data loading. Defaults to 1.
num_val_workers (int, optional): Number of workers for validation data loading. Defaults to num_workers.
pin_memory (bool, optional): Whether to pin memory in the DataLoader. Defaults to True.
multimodal_sample_config (MultiModalSampleConfig, optional): Configuration object for multimodal samples. Defaults to MultiModalSampleConfig().
shuffle_buffer_size (int, optional): Size of the shuffle buffer. Defaults to 100.
max_samples_per_sequence (int, optional): Maximum number of samples per sequence to load from memory. Defaults to None (loads the whole tar file at once).
task_encoder (MultiModalTaskEncoder, optional): Encoder responsible for encoding and batching samples. If not provided, a default MultiModalTaskEncoder will be created. Defaults to None.
decoder_seq_length (int, optional): The maximum sequence length for the decoder. Used in encoder-decoder models. Defaults to None.
packing_buffer_size (int, optional): Size of the packing buffer for batched samples. Defaults to None.
validation_task_encoder (MultiModalTaskEncoder, optional): Encoder responsible for encoding and batching samples for validation. Defaults to None, in which case task_encoder is used.
pg_collection (ProcessGroupCollection, optional): Process group collection for distributed training. If provided, used instead of the global parallel_state. Defaults to None.
**kwargs: Additional keyword arguments, passed to Energon's get_train_dataset().
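The fallbacks described for num_val_workers and validation_task_encoder can be pictured with a small helper. This is only a sketch of the documented behavior, not the module's actual code; the function name `resolve_validation_defaults` is hypothetical and does not exist in the library.

```python
from typing import Any, Optional, Tuple


def resolve_validation_defaults(
    num_workers: int,
    num_val_workers: Optional[int],
    task_encoder: Any,
    validation_task_encoder: Optional[Any],
) -> Tuple[int, Any]:
    """Hypothetical helper mirroring the documented fallbacks.

    - num_val_workers falls back to num_workers when it is None.
    - validation_task_encoder falls back to task_encoder when it is None.
    """
    if num_val_workers is None:
        num_val_workers = num_workers
    if validation_task_encoder is None:
        validation_task_encoder = task_encoder
    return num_val_workers, validation_task_encoder
```

With this reading, passing only `num_workers=4` and a single task encoder means validation uses four workers and the same encoder as training.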
- _build_worker_config(
- num_workers: int,
- split: str = 'train',
- )
Build a WorkerConfig using pg_collection, falling back to default_worker_config.
NOTE: We intentionally use the pure DP rank (pg_collection.dp) rather than the combined DP-CP rank. With Megatron’s rank ordering (default “tp-cp-ep-dp-pp”), all CP ranks within the same DP replica already share the same pure DP rank. This ensures that CP ranks processing different sequence portions of the same batch receive identical data from the dataloader. Using dp_cp would be INCORRECT here — it would assign each CP rank a unique rank, causing them to read different data shards.
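The rank argument in the note can be checked with a little arithmetic. The sketch below is a simplified model, not Megatron's implementation: it assumes the "tp-cp-ep-dp-pp" order means TP varies fastest and PP slowest, and it fixes the expert-parallel size at 1. All function names are hypothetical.

```python
from typing import NamedTuple


class RankCoords(NamedTuple):
    tp: int
    cp: int
    dp: int
    pp: int


def decompose_rank(rank: int, tp_size: int, cp_size: int, dp_size: int) -> RankCoords:
    """Split a global rank into per-dimension ranks.

    Assumption: TP is the fastest-varying dimension, then CP, then DP,
    then PP, and the expert-parallel size is 1.
    """
    tp = rank % tp_size
    cp = (rank // tp_size) % cp_size
    dp = (rank // (tp_size * cp_size)) % dp_size
    pp = rank // (tp_size * cp_size * dp_size)
    return RankCoords(tp, cp, dp, pp)


def pure_dp_rank(rank: int, tp_size: int, cp_size: int, dp_size: int) -> int:
    """Rank within the pure data-parallel dimension."""
    return decompose_rank(rank, tp_size, cp_size, dp_size).dp


def dp_cp_rank(rank: int, tp_size: int, cp_size: int, dp_size: int) -> int:
    """Rank within the combined DP-CP dimension (CP varies faster than DP)."""
    coords = decompose_rank(rank, tp_size, cp_size, dp_size)
    return coords.cp + cp_size * coords.dp
```

Under these assumptions, with tp_size = cp_size = dp_size = 2, global ranks 0 and 2 differ only in their CP rank. Both get pure DP rank 0, so they would read the same shard, whereas their combined DP-CP ranks are 0 and 1, which would send them to different shards.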
- datasets_provider(
- worker_config,
- split: Literal['train', 'val'] = 'val',
- )
Provide the dataset for training or validation.
This method retrieves the dataset for the specified split (either ‘train’ or ‘val’) and configures it according to the worker configuration.
Parameters: worker_config: Configuration for the data loader workers. split (Literal[‘train’, ‘val’], optional): The data split to retrieve (‘train’ or ‘val’). Defaults to ‘val’.
Returns: Dataset: The dataset configured for the specified split.
- build()#
- train_dataloader() Any#
Initialize and return the training DataLoader.
Returns: TRAIN_DATALOADERS: The DataLoader for the training dataset.
- val_dataloader()#
Initialize and return the validation DataLoader.
This method initializes the DataLoader for the validation dataset. It ensures that the parallel state is initialized correctly for distributed training and returns a configured DataLoader object.
Returns: EVAL_DATALOADERS: The DataLoader for the validation dataset.
- test_dataloader() None#
Return None, because the test dataset split does not exist.
This method overrides test_dataloader and returns None because the test dataset split is not defined or used in this module.
Returns: None
- class bridge.data.energon.base_energon_datamodule.EnergonDataloader(dataloader)#
A wrapper to use Megatron Energon dataloader with the Megatron-LM training loop.
Initialization
- __next__()#
- __iter__()#
- save_state()#
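The three methods listed above suggest a thin iterator wrapper. The following is a minimal sketch of how such a wrapper might look, not the class's actual source. `FakeSavableLoader` is a hypothetical stand-in for an Energon dataloader, and the `save_state_rank()` method on the wrapped loader is an assumption about its interface.

```python
class FakeSavableLoader:
    """Hypothetical stand-in for an Energon dataloader (illustration only)."""

    def __init__(self, batches):
        self._batches = list(batches)

    def __iter__(self):
        return iter(self._batches)

    def save_state_rank(self):
        # Assumed method name; returns an opaque state object.
        return {"num_batches": len(self._batches)}


def _cycle(iterable):
    """Re-iterate the iterable forever so the training loop never sees StopIteration."""
    while True:
        for item in iterable:
            yield item


class DataloaderWrapperSketch:
    """Sketch of a wrapper exposing __next__, __iter__, and save_state."""

    def __init__(self, dataloader):
        self._dataloader = dataloader
        self._iter = _cycle(dataloader)

    def __next__(self):
        return next(self._iter)

    def __iter__(self):
        return self

    def save_state(self):
        # Delegate checkpointing of the data pipeline to the wrapped loader.
        return self._dataloader.save_state_rank()
```

A step-driven training loop can then call `next(wrapper)` once per iteration and `wrapper.save_state()` at checkpoint time, without handling epoch boundaries itself.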
- bridge.data.energon.base_energon_datamodule.cyclic_iter(iter)#
Create a cyclic iterator that loops over the given iterable indefinitely.
- Parameters:
iter (iterable) – The input iterable to cycle through.
- Yields:
Any – The next item from the iterable, looping back to the start when exhausted.
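A cyclic iterator of this kind is commonly written as a generator with an outer infinite loop. The sketch below shows that pattern under the name `cyclic_iter_sketch`, which is hypothetical; it is not copied from the module.

```python
from itertools import islice


def cyclic_iter_sketch(iterable):
    """Yield items from the iterable, restarting from the beginning when it is exhausted.

    The iterable must be re-iterable (for example a list or a dataloader).
    A one-shot iterator would yield nothing after its first pass, and an
    empty iterable would make this loop spin without ever yielding.
    """
    while True:
        for item in iterable:
            yield item


def take(n, iterable):
    """Collect the first n items of a (possibly infinite) iterable."""
    return list(islice(iterable, n))
```

For example, `take(7, cyclic_iter_sketch([1, 2, 3]))` gives `[1, 2, 3, 1, 2, 3, 1]`. Unlike `itertools.cycle`, this pattern does not cache the items from the first pass; it calls `iter()` on the source again at each restart.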