core.datasets.retro.db.dataset#

A DBDataset iterates over the chunks of the chunk database.

This dataset is used both for training a vector index and for adding vectors to a trained index.

Module Contents#

Classes#

DBDataset

Dataset for iterating chunks.

API#

class core.datasets.retro.db.dataset.DBDataset(
db_path: str,
indexed_datasets: List[megatron.core.datasets.indexed_dataset.IndexedDataset],
chunks: numpy.ndarray,
chunk_length: int,
eod_token_id: int,
)#

Bases: torch.utils.data.Dataset

Dataset for iterating chunks.

Parameters:
  • db_path (str) – Path of HDF5-format chunk database.

  • indexed_datasets (List[IndexedDataset]) – Indexed datasets used to build database.

  • chunks (np.ndarray) – Array of chunk indexes used for indexing into the indexed datasets. Each row has the format [dataset_idx, doc_id, start_idx, end_idx, bert_length].

  • chunk_length (int) – Max GPT chunk length (e.g., 64).

  • eod_token_id (int) – EOD token ID.

Initialization
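
A minimal construction sketch follows. The paths, chunk rows, and EOD token ID are placeholders, and the chunk database and tokenized indexed datasets are assumed to have been produced by an earlier preprocessing step.

```python
import numpy as np
from megatron.core.datasets.indexed_dataset import IndexedDataset
from megatron.core.datasets.retro.db.dataset import DBDataset

# Placeholder inputs: the HDF5 chunk database and the tokenized indexed
# dataset(s) are assumed to exist already.
db_path = "/path/to/db.hdf5"
indexed_datasets = [IndexedDataset("/path/to/tokenized/dataset_prefix")]

# One row per chunk: [dataset_idx, doc_id, start_idx, end_idx, bert_length].
chunks = np.array(
    [[0, 0, 0, 64, 48],
     [0, 0, 64, 128, 51]],
    dtype=np.int64,
)

db_dataset = DBDataset(
    db_path=db_path,
    indexed_datasets=indexed_datasets,
    chunks=chunks,
    chunk_length=64,
    eod_token_id=0,  # placeholder; use the tokenizer's actual EOD token ID
)

print(len(db_dataset))   # number of chunks in the database
sample = db_dataset[0]   # dict with 'doc_id' and 'text'
```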

__len__() → int#

Length of DB dataset.

Returns:

Number of chunks contained in the dataset.

__getitem__(chunk_id: int) → dict#

DB dataset sample.

Parameters:

chunk_id (int) – Index of chunk within dataset.

Returns:

A dict containing:

  • ‘doc_id’: Document index within the indexed dataset.

  • ‘text’: GPT token IDs.

Return type:

dict
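
Because DBDataset subclasses torch.utils.data.Dataset, its samples can be batched with a standard PyTorch DataLoader, for example when embedding chunks for the index. The sketch below is illustrative only; it assumes each sample’s ‘text’ array has a fixed length of chunk_length tokens and that db_dataset is an instance constructed as in the sketch above.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def collate(samples):
    """Stack per-chunk samples into batched tensors (assumes fixed-length 'text')."""
    return {
        "doc_id": torch.tensor([s["doc_id"] for s in samples], dtype=torch.long),
        "text": torch.from_numpy(np.stack([s["text"] for s in samples])).long(),
    }

# db_dataset: a DBDataset instance (see the construction sketch above).
loader = DataLoader(db_dataset, batch_size=32, collate_fn=collate)
for batch in loader:
    tokens = batch["text"]   # shape: [batch_size, chunk_length]
    ...                      # e.g., feed the tokens to an embedding model
```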

load_doc_tuples() → None#

Load the dataset and document IDs.

Load the dataset ID and document ID of each chunk in the database; these are used for causality filtering during querying.
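
As one illustration of how the loaded (dataset ID, document ID) tuples can support causality filtering, the hypothetical helper below drops retrieved neighbor chunks that come from the same source document as the query chunk. The function name and array layout are assumptions for this sketch, not part of this module's API.

```python
import numpy as np

def filter_same_document(
    query_chunk_id: int,
    neighbor_chunk_ids: np.ndarray,
    doc_tuples: np.ndarray,  # shape [num_chunks, 2]: (dataset_id, doc_id) per chunk
) -> np.ndarray:
    """Drop neighbors whose (dataset_id, doc_id) matches the query chunk's."""
    query_doc = doc_tuples[query_chunk_id]
    same_doc = np.all(doc_tuples[neighbor_chunk_ids] == query_doc, axis=1)
    return neighbor_chunk_ids[~same_doc]
```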