core.datasets.retro.query.utils
Utilities for querying the pretraining dataset.
Module Contents
Functions
get_query_dir | Get root directory of all saved query data.
get_neighbor_dir | Get directory containing neighbor IDs for a dataset (i.e., train, valid, or test).
API
- core.datasets.retro.query.utils.get_query_dir(project_dir: str) → str
Get root directory of all saved query data.
- Parameters:
project_dir (str) – Retro project dir.
- Returns:
Path to query sub-directory in Retro project.
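A minimal sketch of how a helper like this might behave. The function name `sketch_get_query_dir` and the sub-directory name `query` are illustrative assumptions, not details confirmed by this page:

```python
import os


def sketch_get_query_dir(project_dir: str) -> str:
    """Hypothetical stand-in for get_query_dir.

    Assumption: all query data lives in a sub-directory named "query"
    directly under the Retro project directory.
    """
    return os.path.join(project_dir, "query")


# Example call with a placeholder project path.
query_dir = sketch_get_query_dir("retro_project")
print(query_dir)
```

Returning a plain string rather than creating the directory keeps the helper free of side effects; under that assumption the caller decides when the directory is actually created.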
- core.datasets.retro.query.utils.get_neighbor_dir(project_dir: str, key: str, dataset: megatron.core.datasets.megatron_dataset.MegatronDataset) → str
Get directory containing neighbor IDs for a dataset (i.e., train, valid, or test).
- Parameters:
project_dir (str) – Retro project dir.
key (str) – Dataset split key; ‘train’, ‘valid’, or ‘test’.
dataset (MegatronDataset) – Dataset containing unique hash for finding corresponding neighbors.
- Returns:
Path to directory containing this dataset’s neighbors within Retro project.
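The sketch below illustrates one way a per-split neighbor directory could be derived from the split key and the dataset's unique hash. Everything in it is an assumption made for illustration: the `FakeDataset` class, the `unique_description_hash` attribute name, the `query` sub-directory, and the `<key>_<hash>` naming scheme are not stated on this page.

```python
import os
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Hypothetical stand-in for MegatronDataset; it only carries a hash."""

    unique_description_hash: str  # assumed attribute name


def sketch_get_neighbor_dir(project_dir: str, key: str, dataset: FakeDataset) -> str:
    """Hypothetical stand-in for get_neighbor_dir.

    Assumptions: neighbors are stored under "<project_dir>/query", in a
    directory named "<key>_<hash>".
    """
    query_dir = os.path.join(project_dir, "query")
    return os.path.join(query_dir, f"{key}_{dataset.unique_description_hash}")


# Example call with a placeholder project path and a made-up hash.
train_ds = FakeDataset(unique_description_hash="abc123")
neighbor_dir = sketch_get_neighbor_dir("retro_project", "train", train_ds)
print(neighbor_dir)
```

Including the hash in the directory name means that two datasets with the same split key but different configurations would not share a neighbor directory, which matches the description of the `dataset` parameter above.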