Multi epoch dataset
            EpochIndex
    
              Bases: NamedTuple
A tuple that contains both the current epoch and index for multi-epoch training.
Source code in bionemo/core/data/multi_epoch_dataset.py
            epoch: int
  
      instance-attribute
  
    An integer representing the current epoch.
            idx: int
  
      instance-attribute
  
    An integer representing the index within the current epoch.
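As a minimal sketch of how such a tuple behaves, the example below defines a local stand-in with the two documented fields. The stand-in class is an assumption made so the snippet is self-contained; in practice the type would be imported from the module named above.

```python
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Local stand-in mirroring the documented fields (not imported from bionemo)."""

    epoch: int
    idx: int


position = EpochIndex(epoch=2, idx=17)

# Because this is a NamedTuple, both attribute access and unpacking work.
epoch, idx = position
```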
            IdentityMultiEpochDatasetWrapper
  
      dataclass
  
    
              Bases: MultiEpochDatasetWrapper[T, T]
An implementation of the MultiEpochDatasetWrapper that does not apply any transformations.
            apply_transform(sample, index)
    Return the sample as is.
            MultiEpochDataset
    
              Bases: Protocol[T_co]
A protocol for datasets for multi-epoch training in Megatron-LM.
Dataset determinism in Megatron-LM
In Megatron-LM training, the sampler and dataset objects are used to ensure consistent data loading across
model-parallel ranks. For datasets to work with Megatron-LM training, they must return exactly the same data for
every call to __getitem__ with the same index.
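One way to satisfy this requirement while still using randomness is to derive the random state from the index alone. The sketch below is illustrative only: `NoisyNumbers` is a hypothetical dataset, and `EpochIndex` is a local stand-in for the documented tuple.

```python
import random
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Local stand-in for the documented tuple."""

    epoch: int
    idx: int


class NoisyNumbers:
    """Hypothetical dataset: the stored value plus epoch-dependent noise."""

    def __init__(self, values, seed=0):
        self.values = list(values)
        self.seed = seed

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index: EpochIndex) -> float:
        # The RNG depends only on (seed, epoch, idx), so repeated calls with
        # the same index return the same value on every rank.
        rng = random.Random(f"{self.seed}-{index.epoch}-{index.idx}")
        return self.values[index.idx] + rng.random()


ds = NoisyNumbers([10.0, 20.0])
first = ds[EpochIndex(epoch=0, idx=1)]
second = ds[EpochIndex(epoch=0, idx=1)]
```

Seeding from the index, rather than sharing one global generator, keeps the result independent of call order.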
            MultiEpochDatasetResampler
  
      dataclass
  
    
              Bases: Dataset[T_co]
A dataset wrapper class that converts the sequential sampling from Megatron-LM to epoch-based sampling.
Provide either num_epochs or num_samples. If neither is provided, the dataset uses a single epoch. If
num_epochs is given, the resampled dataset has len(dataset) * num_epochs samples. If num_samples is given,
the resampled dataset has num_samples samples: the dataset is repeated for as many epochs as needed to
reach that count, with the final epoch truncated.
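The length rules above can be sketched as plain arithmetic. Both helper functions are hypothetical and are not part of the documented class. Raising an error when both arguments are given is an assumption, since the text only says that one of them should be provided.

```python
import math


def resampled_length(dataset_len, num_epochs=None, num_samples=None):
    """Hypothetical helper mirroring the documented length rules."""
    if num_epochs is not None and num_samples is not None:
        # Assumption: treat passing both arguments as an error.
        raise ValueError("Provide num_epochs or num_samples, not both.")
    if num_samples is not None:
        return num_samples
    if num_epochs is None:
        num_epochs = 1  # neither given: a single epoch
    return dataset_len * num_epochs


def epochs_needed(dataset_len, num_samples):
    """Number of epochs touched, counting the truncated final epoch."""
    return math.ceil(num_samples / dataset_len)
```

For a dataset of 10 items, `num_samples=25` spans three epochs, the last of which contributes only 5 samples.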
            dataset: MultiEpochDataset[T_co]
  
      instance-attribute
  
    The dataset to resample. Must support indexing with an EpochIndex.
            num_epochs: int | None = None
  
      class-attribute
      instance-attribute
  
    The total number of epochs. The length of the resampled dataset will be len(dataset) * num_epochs.
            num_samples: int | None = None
  
      class-attribute
      instance-attribute
  
    The total number of samples to draw.
The number of epochs will be determined by the number of samples and the length of the dataset.
            seed: int = 42
  
      class-attribute
      instance-attribute
  
    A random seed for reproducibility.
            shuffle: bool = True
  
      class-attribute
      instance-attribute
  
    Whether to shuffle the samples in the dataset each epoch.
            __getitem__(index)
    Get the sample at the given index.
            __len__()
    Return the length of the resampled dataset.
            __post_init__()
    Pre-shuffle each epoch's samples.
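A rough sketch of per-epoch pre-shuffling is shown below. It is not the library's implementation, which may store per-epoch seeds or permutations differently. `build_epoch_indices` is a hypothetical helper and `EpochIndex` is a local stand-in.

```python
import random
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Local stand-in for the documented tuple."""

    epoch: int
    idx: int


def build_epoch_indices(dataset_len, num_epochs, seed=42, shuffle=True):
    """Hypothetical helper: a flat list of EpochIndex, permuted within each epoch."""
    rng = random.Random(seed)
    flat = []
    for epoch in range(num_epochs):
        order = list(range(dataset_len))
        if shuffle:
            rng.shuffle(order)  # a fresh permutation for every epoch
        flat.extend(EpochIndex(epoch, idx) for idx in order)
    return flat


indices = build_epoch_indices(dataset_len=4, num_epochs=2)
```

A sequential index from the sampler then maps to `indices[i]`, so every item is visited exactly once per epoch and the order is reproducible for a fixed seed.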
            MultiEpochDatasetWrapper
  
      dataclass
  
    
              Bases: Dataset[U_co], Generic[T, U_co], ABC
A wrapper to convert a standard PyTorch dataset into one that supports multi-epoch Megatron-LM training.
The underlying dataset's __getitem__ method must be deterministic, i.e. it must return the same data for the same
index every time it is called. If there are any non-deterministic operations, they should be moved to the
apply_transform method. This method must also be deterministic for every (epoch, index) pair, but it can use
the epoch to vary data augmentation from one epoch to the next.
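The split between a deterministic base dataset and an epoch-aware transform might look like the sketch below. `MaskOneToken` is a hypothetical class that imitates the documented interface (`dataset`, `__getitem__`, `__len__`, `apply_transform`) without inheriting from the real wrapper, and `EpochIndex` is a local stand-in.

```python
import random
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Local stand-in for the documented tuple."""

    epoch: int
    idx: int


class MaskOneToken:
    """Hypothetical wrapper: masks one token, chosen per (epoch, idx)."""

    def __init__(self, dataset, seed=0):
        self.dataset = dataset  # deterministic, integer-indexed
        self.seed = seed

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index: EpochIndex):
        # Fetch by integer index, then apply the epoch-aware transform.
        return self.apply_transform(self.dataset[index.idx], index)

    def apply_transform(self, sample, index: EpochIndex):
        # Seeding from (seed, epoch, idx) keeps the augmentation reproducible.
        rng = random.Random(f"{self.seed}-{index.epoch}-{index.idx}")
        masked = list(sample)  # copy so the base dataset is untouched
        masked[rng.randrange(len(masked))] = "<mask>"
        return masked


base = [["a", "b", "c"], ["d", "e", "f"]]
wrapped = MaskOneToken(base)
sample = wrapped[EpochIndex(epoch=0, idx=0)]
```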
            dataset: SizedDataset[T]
  
      instance-attribute
  
    A deterministic dataset that supports indexing with an integer index.
            __getitem__(index)
    Get the sample at the given epoch and index.
            __len__()
    Return the length of the dataset.
            apply_transform(sample, index)
  
      abstractmethod
  
    Apply any transformations to the sample for the given epoch.
            SizedDataset
    
              Bases: Protocol[T_co]
A protocol for integer-indexed datasets that have a fixed length.