core.datasets.retro.query.query

Entry point for querying an index using a GPTChunkDataset.

Querying involves:

  • Iterating over all chunks in the GPTChunkDataset.

  • Querying the index for neighbor chunk IDs (i.e., the IDs of chunks from the chunk database).

  • Saving the neighbor chunk IDs to disk, for use in building RetroDataset samples during pretraining.
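
The three steps above can be sketched with NumPy alone. This is only an illustration under stated assumptions, not the module's implementation: a brute-force L2 search stands in for the Faiss index, and the helper names (`brute_force_neighbors`, `query_all_blocks`), the block size, and the `.npy` file naming are all hypothetical.

```python
import os
import tempfile

import numpy as np


def brute_force_neighbors(db_embeddings, query_embeddings, k):
    """Return the IDs of the k nearest database chunks per query (L2 distance).

    Hypothetical stand-in for a vector index; not the Faiss index used by the module.
    """
    # Pairwise squared distances, shape (n_queries, n_db).
    diffs = query_embeddings[:, None, :] - db_embeddings[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Sort each row by distance and keep the k closest database chunk IDs.
    return np.argsort(dists, axis=1, kind="stable")[:, :k]


def query_all_blocks(db_embeddings, chunk_embeddings, block_size, k, out_dir):
    """Iterate over the chunk embeddings block by block and save neighbor IDs."""
    paths = []
    for start in range(0, len(chunk_embeddings), block_size):
        end = min(start + block_size, len(chunk_embeddings))
        neighbor_ids = brute_force_neighbors(
            db_embeddings, chunk_embeddings[start:end], k
        )
        # Assumed naming scheme: one file per block, keyed by its chunk range.
        path = os.path.join(out_dir, f"{start:08d}-{end:08d}.npy")
        np.save(path, neighbor_ids)
        paths.append(path)
    return paths


rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(50, 8)).astype("float32")     # chunk database
chunk_embeddings = rng.normal(size=(10, 8)).astype("float32")  # chunks to query

with tempfile.TemporaryDirectory() as out_dir:
    paths = query_all_blocks(
        db_embeddings, chunk_embeddings, block_size=4, k=3, out_dir=out_dir
    )
    saved = [np.load(p) for p in paths]
```

Writing one file per block means an interrupted run can, in principle, resume from the first missing block rather than start over.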

Module Contents

Functions

get_index

Read index from disk.

embed_block

Embed block of chunks.

query_embeddings

Query neighbors of a block of embeddings.

query_embedding_block

Query a block of embeddings.

query_block_neighbors

Query neighbors of a dataset block (i.e., range).

query_dataset_neighbors

Query neighbors of each chunk within a dataset.

query_neighbors

Query pretraining datasets (train & valid).

API

core.datasets.retro.query.query.get_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
ondisk: bool = False,
) → faiss.Index

Read index from disk.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • ondisk (bool) – If True, memory-map the index. (For debugging purposes only; querying a memory-mapped index is very slow.)

Returns:

A Faiss index, loaded from storage.

core.datasets.retro.query.query.embed_block(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
gpt_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset,
block: dict,
) → numpy.ndarray

Embed block of chunks.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • gpt_dataset (GPTChunkDataset) – Chunk dataset to be embedded.

  • block (dict) – Range information containing start/end indices of subset of chunk dataset.

Returns:

Embeddings array, with shape (len(block["range"]), dimension(embedder)).
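
To make the returned shape concrete, here is a minimal sketch. It assumes, for illustration only, that `block["range"]` holds a Python `range` over chunk indices, and it substitutes a toy bag-of-tokens function for the real embedder; `toy_embed`, `embed_block_sketch`, and `EMBED_DIM` are hypothetical names.

```python
import numpy as np

EMBED_DIM = 16  # hypothetical embedding dimension


def toy_embed(token_ids):
    """Hypothetical embedder: a deterministic bag-of-tokens count vector."""
    vec = np.zeros(EMBED_DIM, dtype="float32")
    for token in token_ids:
        vec[token % EMBED_DIM] += 1.0
    return vec


def embed_block_sketch(chunks, block):
    """Embed only the chunks whose indices fall inside block["range"]."""
    return np.stack([toy_embed(chunks[i]) for i in block["range"]])


# Five toy chunks (lists of token IDs); the block covers chunks 1, 2, and 3.
chunks = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10], [11]]
block = {"range": range(1, 4)}  # assumed layout of the block dict

embeddings = embed_block_sketch(chunks, block)
```

One row is produced per chunk in the block, so the first dimension tracks the block size rather than the size of the whole dataset.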

core.datasets.retro.query.query.query_embeddings(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
index: megatron.core.datasets.retro.index.index.Index,
embeddings: numpy.ndarray,
chunk_id_range: range,
sample_map: dict,
n_chunks_per_sample: int,
verbose: bool = True,
) → Tuple[numpy.ndarray, numpy.ndarray]

Query neighbors of a block of embeddings.

Querying includes:

  • Query the index for neighbor chunk IDs.

  • Filter out neighbor chunk IDs that have the same document ID as the queried embedding.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • db_dataset (DBDataset) – Dataset containing chunk database entries.

  • index (Index) – Vector index populated with chunk database indices.

  • embeddings (np.ndarray) – Embeddings from GPT chunk dataset.

  • chunk_id_range (range) – Chunk ID range from GPT chunk dataset.

  • sample_map (dict) – Mapping of sample_idx to dataset_idx and document_ids. Used for document filtering.

  • n_chunks_per_sample (int) – Number of chunks per sample (e.g., sequence_length / chunk_length).

  • verbose (bool) – Log querying progress.

Returns:

A tuple of original (unfiltered) neighbor IDs, and filtered (by document ID) neighbor IDs.
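
The document-ID filtering step can be sketched as follows. This is an illustration under assumptions, not the library's code: `filter_neighbors`, `sample_doc_ids`, `db_doc_ids`, and `n_keep` are hypothetical names, the sample map is reduced to a plain dict of document-ID sets, and padding short rows with -1 is an assumed convention.

```python
import numpy as np


def filter_neighbors(neighbor_ids, chunk_id_range, sample_doc_ids, db_doc_ids,
                     n_chunks_per_sample, n_keep):
    """Drop neighbors that come from the same document(s) as the query chunk."""
    filtered = np.full((len(neighbor_ids), n_keep), -1, dtype="int64")
    for row, chunk_id in enumerate(chunk_id_range):
        # Each sample holds n_chunks_per_sample consecutive chunks.
        sample_idx = chunk_id // n_chunks_per_sample
        query_docs = sample_doc_ids[sample_idx]
        kept = [
            int(n) for n in neighbor_ids[row]
            if int(db_doc_ids[n]) not in query_docs
        ]
        kept = kept[:n_keep]
        # Rows with fewer than n_keep survivors stay padded with -1 (assumption).
        filtered[row, :len(kept)] = kept
    return filtered


# Raw neighbor IDs for chunk IDs 4 and 5, both of which belong to sample 1.
neighbor_ids = np.array([[0, 1, 2, 3], [2, 3, 0, 1]])
chunk_id_range = range(4, 6)
sample_doc_ids = {0: {7}, 1: {10, 11}}   # sample_idx -> document IDs
db_doc_ids = np.array([10, 20, 11, 30])  # document ID of each database chunk

filtered = filter_neighbors(
    neighbor_ids, chunk_id_range, sample_doc_ids, db_doc_ids,
    n_chunks_per_sample=4, n_keep=3,
)
# filtered is [[1, 3, -1], [3, 1, -1]]: database chunks 0 and 2 are removed
# because they come from documents 10 and 11, which sample 1 also contains.
```

Returning both the raw and the filtered IDs, as the function documented above does, makes it possible to check how many neighbors the filter removed.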

core.datasets.retro.query.query.query_embedding_block(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
index: megatron.core.datasets.retro.index.index.Index,
embeddings: numpy.ndarray,
chunk_id_range: range,
sample_map: dict,
n_chunks_per_sample: int,
) → Tuple[numpy.ndarray, numpy.ndarray]

Query a block of embeddings.

The block is broken into smaller sub-blocks, which makes progress easier to track. Both the raw neighbor IDs and the filtered neighbor IDs (i.e., with neighbors that share the query chunk's document ID removed) are collected.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • db_dataset (DBDataset) – Dataset containing chunk database entries.

  • index (Index) – Vector index populated with chunk database indices.

  • embeddings (np.ndarray) – Embeddings from GPT chunk dataset.

  • chunk_id_range (range) – Chunk ID range from GPT chunk dataset.

  • sample_map (dict) – Mapping of sample_idx to dataset_idx and document_ids. Used for document filtering.

  • n_chunks_per_sample (int) – Number of chunks per sample (e.g., sequence_length / chunk_length).

Returns:

A tuple of original (unfiltered) neighbor IDs, and filtered (by document ID) neighbor IDs.
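
A minimal sketch of the sub-block pattern described for this function, assuming a caller-supplied query function. The names `iter_sub_blocks`, `query_in_sub_blocks`, and `fake_query`, as well as the sub-block size, are hypothetical; the real function delegates to the module's own query routine.

```python
import numpy as np


def iter_sub_blocks(n_rows, sub_block_size):
    """Yield (start, end) offsets that cover n_rows in fixed-size pieces."""
    for start in range(0, n_rows, sub_block_size):
        yield start, min(start + sub_block_size, n_rows)


def query_in_sub_blocks(embeddings, chunk_id_range, query_fn, sub_block_size):
    """Query one sub-block at a time and concatenate the partial results."""
    raw_parts, filtered_parts = [], []
    for start, end in iter_sub_blocks(len(embeddings), sub_block_size):
        # Slicing a range yields the matching sub-range of chunk IDs.
        raw, filtered = query_fn(embeddings[start:end], chunk_id_range[start:end])
        raw_parts.append(raw)
        filtered_parts.append(filtered)
        print(f"queried rows {start}-{end} of {len(embeddings)}")
    return np.concatenate(raw_parts), np.concatenate(filtered_parts)


def fake_query(sub_embeddings, sub_chunk_ids):
    """Hypothetical query function returning placeholder neighbor IDs."""
    ids = np.array(list(sub_chunk_ids), dtype="int64")
    raw = np.stack([ids, ids + 1], axis=1)
    return raw, raw[:, :1]


embeddings = np.zeros((10, 4), dtype="float32")
raw_ids, filtered_ids = query_in_sub_blocks(
    embeddings, range(100, 110), fake_query, sub_block_size=4
)
```

Smaller sub-blocks give finer-grained progress output at the cost of more calls into the index.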

core.datasets.retro.query.query.query_block_neighbors(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
query_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset,
index: megatron.core.datasets.retro.index.index.Index,
block: dict,
) → None

Query neighbors of a dataset block (i.e., range).

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • db_dataset (DBDataset) – Dataset containing chunk database entries.

  • query_dataset (GPTChunkDataset) – GPT chunk dataset to be queried.

  • index (Index) – Vector index populated with chunk database indices.

  • block (dict) – Range information containing start/end indices for querying GPT chunk dataset.

core.datasets.retro.query.query.query_dataset_neighbors(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
query_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset,
num_active_chunks: int,
prefix: str,
neighbor_dir: str,
index: megatron.core.datasets.retro.index.index.Index,
) → None

Query neighbors of each chunk within a dataset.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • db_dataset (DBDataset) – Dataset containing chunk database entries.

  • query_dataset (GPTChunkDataset) – GPT chunk dataset to be queried.

  • num_active_chunks (int) – The ‘active’ chunks are the subset of the GPT chunk dataset that aren’t being queried. This argument is used when validating the correctness of a subset of the GPT chunk dataset.

  • prefix (str) – Extra string for logging progress.

  • neighbor_dir (str) – File path to directory for saving neighbor IDs.

  • index (Index) – Vector index populated with chunk database indices.

core.datasets.retro.query.query.query_neighbors(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None

Query pretraining datasets (train & valid).

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.