core.datasets.retro.query.utils

Utilities for querying the pretraining dataset.

Module Contents

Functions

get_query_dir

Get root directory of all saved query data.

get_neighbor_dir

Get directory containing neighbor IDs for a dataset (i.e., train, valid, or test).

API

core.datasets.retro.query.utils.get_query_dir(project_dir: str) → str

Get root directory of all saved query data.

Parameters:

project_dir (str) – Retro project dir.

Returns:

Path to query sub-directory in Retro project.
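As a rough illustration of the kind of path this function returns, the sketch below joins a project directory with a sub-directory name. The helper `sketch_query_dir` and the sub-directory name `query` are assumptions made for illustration; the documentation above only states that the result is a query sub-directory of the Retro project.

```python
import os

# Assumption: the query sub-directory is called "query". The reference text
# above does not state the actual name.
ASSUMED_QUERY_SUBDIR = "query"


def sketch_query_dir(project_dir: str) -> str:
    """Hypothetical stand-in for get_query_dir: project dir plus query sub-directory."""
    return os.path.join(project_dir, ASSUMED_QUERY_SUBDIR)


# The exact separator in the printed path depends on the operating system.
print(sketch_query_dir("retro_project"))
```

Because the result is derived from `project_dir` alone, every caller that shares a project directory resolves the same query root.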

core.datasets.retro.query.utils.get_neighbor_dir(
project_dir: str,
key: str,
dataset: megatron.core.datasets.megatron_dataset.MegatronDataset,
) → str

Get directory containing neighbor IDs for a dataset (i.e., train, valid, or test).

Parameters:
  • project_dir (str) – Retro project dir.

  • key (str) – Dataset split key; ‘train’, ‘valid’, or ‘test’.

  • dataset (MegatronDataset) – Dataset containing unique hash for finding corresponding neighbors.

Returns:

Path to directory containing this dataset’s neighbors within Retro project.
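To make the role of `key` and the dataset hash concrete, here is a minimal sketch of how a per-dataset neighbor directory could be derived. Everything in it is hypothetical: `FakeDataset` stands in for `MegatronDataset`, the attribute name `unique_description_hash`, the `query` sub-directory, and the `<key>_<hash>` naming scheme are assumptions, not details confirmed by the text above.

```python
import os
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Hypothetical stand-in for MegatronDataset; it only carries a hash string."""

    unique_description_hash: str  # assumed attribute name


def sketch_neighbor_dir(project_dir: str, key: str, dataset: FakeDataset) -> str:
    """Hypothetical stand-in for get_neighbor_dir."""
    # Assumption: neighbors live under "<project_dir>/query/<key>_<hash>".
    query_dir = os.path.join(project_dir, "query")
    return os.path.join(query_dir, f"{key}_{dataset.unique_description_hash}")


train = FakeDataset(unique_description_hash="abc123")
valid = FakeDataset(unique_description_hash="def456")

# Different splits and hashes resolve to different directories.
print(sketch_neighbor_dir("retro_project", "train", train))
print(sketch_neighbor_dir("retro_project", "valid", valid))
```

Under these assumptions, including the hash in the directory name means that changing a dataset's configuration yields a new directory rather than reusing neighbors computed for a different dataset.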