core.datasets.retro.db.dataset#
A DBDataset iterates over the chunks of the chunk database.
This dataset is used both for training a vector index and for adding vectors to a trained index.
Module Contents#
Classes#
DBDataset – Dataset for iterating chunks.
API#
- class core.datasets.retro.db.dataset.DBDataset(
- db_path: str,
- indexed_datasets: List[megatron.core.datasets.indexed_dataset.IndexedDataset],
- chunks: numpy.ndarray,
- chunk_length: int,
- eod_token_id: int,
- )
Bases:
torch.utils.data.Dataset

Dataset for iterating chunks.
- Parameters:
db_path (str) – Path of HDF5-format chunk database.
indexed_datasets (List[IndexedDataset]) – Indexed datasets used to build database.
chunks (np.ndarray) – Array of chunk indexes, for indexing into indexed datasets. Format [dataset_idx, doc_id, start_idx, end_idx, bert_length].
chunk_length (int) – Max GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
Initialization
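A minimal construction sketch is shown below. The paths, the dataset prefix, the chunk array contents, and the import paths are illustrative assumptions; only the parameter names and the chunk row format follow the documentation above.

```python
import numpy as np

# Import paths assumed from the fully qualified names shown in this page.
from megatron.core.datasets.indexed_dataset import IndexedDataset
from megatron.core.datasets.retro.db.dataset import DBDataset

# Hypothetical indexed dataset(s) that were used to build the chunk database.
indexed_datasets = [IndexedDataset("/path/to/dataset_0_text_document")]

# Each row follows the documented format:
# [dataset_idx, doc_id, start_idx, end_idx, bert_length]
chunks = np.array(
    [
        [0, 0, 0, 64, 64],
        [0, 0, 64, 128, 61],
    ],
    dtype=np.int64,
)

db_dataset = DBDataset(
    db_path="/path/to/db.hdf5",   # hypothetical HDF5 chunk database path
    indexed_datasets=indexed_datasets,
    chunks=chunks,
    chunk_length=64,              # max GPT chunk length
    eod_token_id=0,               # hypothetical EOD token ID
)
```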
- __len__() → int#
Length of DB dataset.
- Returns:
Number of chunks contained in the dataset.
- __getitem__(chunk_id: int) → dict#
DB dataset sample.
- Parameters:
chunk_id (int) – Index of chunk within dataset.
- Returns:
A dict containing:
'doc_id': Document index within indexed dataset.
'text': GPT token IDs.
- Return type:
dict
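A short usage sketch, assuming a db_dataset built as in the construction example above; the dictionary keys follow the return description of __getitem__:

```python
num_chunks = len(db_dataset)   # number of chunks in the database

sample = db_dataset[0]         # dict for chunk_id 0
doc_id = sample["doc_id"]      # document index within the indexed dataset
token_ids = sample["text"]     # GPT token IDs for this chunk
```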
- load_doc_tuples() → None#
Load the dataset and document IDs.
Load the dataset ID and document ID of each chunk in the database, to be used for causality filtering during querying.
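A hedged sketch of how the loaded (dataset ID, document ID) pairs might be used for causality filtering during querying; the doc_tuples attribute name, its layout, and the query loop are assumptions for illustration only.

```python
db_dataset.load_doc_tuples()

query_chunk_id = 0
# Assumed layout: one (dataset_idx, doc_id) pair per chunk.
query_doc = tuple(db_dataset.doc_tuples[query_chunk_id])

# Hypothetical neighbor chunk IDs returned by the vector index for the query.
retrieved_chunk_ids = [3, 7, 0]

# Drop any neighbor chunk that comes from the same source document as the
# query chunk, so a chunk cannot retrieve text from its own document.
filtered_neighbors = [
    cid for cid in retrieved_chunk_ids
    if tuple(db_dataset.doc_tuples[cid]) != query_doc
]
```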