core.datasets.retro.db.dataset#

A DBDataset iterates over the chunks of the chunk database.

This dataset is used both for training a vector index and for adding vectors to a trained index.

Module Contents#

Classes#

DBDataset

Dataset for iterating chunks.

API#

class core.datasets.retro.db.dataset.DBDataset(
db_path: str,
indexed_datasets: List[megatron.core.datasets.indexed_dataset.IndexedDataset],
chunks: numpy.ndarray,
chunk_length: int,
eod_token_id: int,
)#

Bases: torch.utils.data.Dataset

Dataset for iterating chunks.

Parameters:
  • db_path (str) – Path of HDF5-format chunk database.

  • indexed_datasets (List[IndexedDataset]) – Indexed datasets used to build database.

  • chunks (np.ndarray) – Array of chunk indexes used for indexing into the indexed datasets. Each row has the format [dataset_idx, doc_id, start_idx, end_idx, bert_length].

  • chunk_length (int) – Max GPT chunk length (e.g., 64).

  • eod_token_id (int) – EOD token ID.

Initialization
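
A minimal construction sketch follows. The paths, chunk rows, and EOD token ID are placeholders, and the chunk database and tokenized indexed datasets are assumed to have been produced by an earlier preprocessing step.

```python
import numpy as np
from megatron.core.datasets.indexed_dataset import IndexedDataset
from megatron.core.datasets.retro.db.dataset import DBDataset

# Placeholder inputs: the HDF5 chunk database and the tokenized indexed
# dataset(s) are assumed to exist already.
db_path = "/path/to/db.hdf5"
indexed_datasets = [IndexedDataset("/path/to/tokenized/dataset_prefix")]

# One row per chunk: [dataset_idx, doc_id, start_idx, end_idx, bert_length].
chunks = np.array(
    [[0, 0, 0, 64, 48],
     [0, 0, 64, 128, 51]],
    dtype=np.int64,
)

db_dataset = DBDataset(
    db_path=db_path,
    indexed_datasets=indexed_datasets,
    chunks=chunks,
    chunk_length=64,
    eod_token_id=0,  # placeholder; use the tokenizer's actual EOD token ID
)

print(len(db_dataset))   # number of chunks in the database
sample = db_dataset[0]   # dict with 'doc_id' and 'text'
```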

__len__() → int#

Length of DB dataset.

Returns:

Number of chunks contained in the dataset.

__getitem__(chunk_id: int) → dict#

DB dataset sample.

Parameters:

chunk_id (int) – Index of chunk within dataset.

Returns:

A dict containing:

  • ‘doc_id’: Document index within the indexed dataset.

  • ‘text’: GPT token IDs.

Return type:

dict
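
Because DBDataset subclasses torch.utils.data.Dataset, its samples can be batched with a standard PyTorch DataLoader, for example when embedding chunks for the index. The sketch below is illustrative only; it assumes each sample’s ‘text’ array has a fixed length of chunk_length tokens and that db_dataset is an instance constructed as in the sketch above.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def collate(samples):
    """Stack per-chunk samples into batched tensors (assumes fixed-length 'text')."""
    return {
        "doc_id": torch.tensor([s["doc_id"] for s in samples], dtype=torch.long),
        "text": torch.from_numpy(np.stack([s["text"] for s in samples])).long(),
    }

# db_dataset: a DBDataset instance (see the construction sketch above).
loader = DataLoader(db_dataset, batch_size=32, collate_fn=collate)
for batch in loader:
    tokens = batch["text"]   # shape: [batch_size, chunk_length]
    ...                      # e.g., feed the tokens to an embedding model
```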

load_doc_tuples() → None#

Load the dataset and document IDs.

Load the dataset ID and document ID of each chunk in the database; these are used for causality filtering during querying.
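
As one illustration of how the loaded (dataset ID, document ID) tuples can support causality filtering, the hypothetical helper below drops retrieved neighbor chunks that come from the same source document as the query chunk. The function name and array layout are assumptions for this sketch, not part of this module's API.

```python
import numpy as np

def filter_same_document(
    query_chunk_id: int,
    neighbor_chunk_ids: np.ndarray,
    doc_tuples: np.ndarray,  # shape [num_chunks, 2]: (dataset_id, doc_id) per chunk
) -> np.ndarray:
    """Drop neighbors whose (dataset_id, doc_id) matches the query chunk's."""
    query_doc = doc_tuples[query_chunk_id]
    same_doc = np.all(doc_tuples[neighbor_chunk_ids] == query_doc, axis=1)
    return neighbor_chunk_ids[~same_doc]
```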