bridge.data.iterator_utils#
Iterator utilities for handling virtual pipeline parallelism.
Module Contents#
Functions#
Convert data iterator into form expected by Megatron with virtual pipeline parallelism.
Data#
API#
- bridge.data.iterator_utils.DataT#
`TypeVar(…)`
- bridge.data.iterator_utils.make_data_iterator_list(model: list, data_iterator: Iterator[bridge.data.iterator_utils.DataT])
Convert data iterator into form expected by Megatron with virtual pipeline parallelism.
With interleaved/virtual pipeline parallelism, Megatron expects a list of one data iterator per model chunk. Each model chunk independently gets data from its data iterator, so we need to interact with the data iterator multiple times for each microbatch step. Instead of incorporating this logic into the data loader, we cache the iterator’s output to the first model chunk and reuse it in the other model chunks.
- Parameters:
model – List of model chunks (when virtual PP is used) or single-element list
data_iterator – Iterator yielding microbatch data
- Returns:
If model has only 1 chunk: returns the iterator as-is.
If model has multiple chunks: returns a list of iterators with caching behavior:
  - First iterator in the list consumes from data_iterator and caches values
  - Remaining iterators are proxies that read from the cache
Example
```python
# With virtual PP size = 2 (model has 2 chunks)
iters = make_data_iterator_list(model=[chunk1, chunk2], data_iterator=iter(microbatches))
len(iters) == 2
# Both iters[0] and iters[1] will yield the same microbatch data
batch_from_chunk0 = next(iters[0])  # Fetches from data_iterator, caches
batch_from_chunk1 = next(iters[1])  # Reads from cache, same data
```
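The caching behavior described above can be sketched in plain Python. This is a minimal illustration, not the actual Megatron Bridge implementation: the names `make_data_iterator_list_sketch`, `_CachingIterator`, and `_ProxyIterator` are hypothetical, and it assumes (as the docstring states) that the first chunk's iterator is advanced before the proxies each microbatch step.

```python
from collections import deque
from typing import Iterator, List, Union

class _CachingIterator:
    """Iterator for the first model chunk: pulls each item from the source
    and appends a copy of it to a replay queue for every proxy iterator."""

    def __init__(self, source: Iterator, num_proxies: int):
        self.source = source
        # One replay queue per remaining model chunk.
        self.queues = [deque() for _ in range(num_proxies)]

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self.source)
        for q in self.queues:
            q.append(item)  # cache the item for each proxy
        return item

class _ProxyIterator:
    """Iterator for a later model chunk: replays cached items instead of
    consuming from the underlying data iterator."""

    def __init__(self, queue: deque):
        self.queue = queue

    def __iter__(self):
        return self

    def __next__(self):
        # Raises IndexError if advanced before the caching iterator;
        # the pipeline schedule advances the first chunk's iterator first.
        return self.queue.popleft()

def make_data_iterator_list_sketch(
    model: list, data_iterator: Iterator
) -> Union[Iterator, List[Iterator]]:
    # Single chunk (no virtual PP): return the iterator unchanged.
    if len(model) <= 1:
        return data_iterator
    primary = _CachingIterator(data_iterator, num_proxies=len(model) - 1)
    return [primary] + [_ProxyIterator(q) for q in primary.queues]
```

With two chunks, `next(iters[0])` fetches and caches a microbatch, and `next(iters[1])` then replays the same microbatch from the cache, so both chunks see identical data without the data loader being consumed twice.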