`core.datasets.retro.query.query`#

Entry point for querying an index using a GPTChunkDataset.

Querying involves:

Iterate over all chunks in the GPTChunkDataset.
Query the index for neighbor chunk IDs (i.e., IDs of chunks in the chunk database).
Save the neighbor chunk IDs to disk, for use in building a RetroDataset sample during pretraining.
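
The three steps above can be illustrated with a minimal, self-contained sketch. It is not the module's implementation: a brute-force distance search stands in for the Faiss index, plain Python lists stand in for embeddings, and a JSON file stands in for the on-disk neighbor format. The names `nearest_neighbors` and `query_and_save` are hypothetical.

```python
import json
import math
import os
import tempfile
from typing import List


def nearest_neighbors(query: List[float], database: List[List[float]], k: int) -> List[int]:
    """Brute-force stand-in for an index lookup: IDs of the k closest database vectors."""
    by_distance = sorted(range(len(database)), key=lambda i: math.dist(query, database[i]))
    return by_distance[:k]


def query_and_save(query_chunks, database, k: int, out_path: str) -> List[List[int]]:
    """Iterate over the query chunks, look up neighbor IDs, and save them to disk."""
    neighbors = [nearest_neighbors(q, database, k) for q in query_chunks]
    with open(out_path, "w") as f:
        json.dump(neighbors, f)
    return neighbors


# Toy 2-D "embeddings"; row i of `database` plays the role of chunk ID i.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query_chunks = [[0.1, 0.0], [4.0, 3.0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "neighbors.json")
    saved = query_and_save(query_chunks, database, k=2, out_path=path)
    with open(path) as f:
        reloaded = json.load(f)
```

Reading the file back yields the same neighbor IDs that were computed, which is what lets a later stage consume them without re-querying the index.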

Module Contents#

Functions#

`get_index`	Read index from disk.
`embed_block`	Embed block of chunks.
`query_embeddings`	Query neighbors of a block of embeddings.
`query_embedding_block`	Query a block of embeddings.
`query_block_neighbors`	Query neighbors of a dataset block (i.e., range).
`query_dataset_neighbors`	Query neighbors of each chunk within a dataset.
`query_neighbors`	Query pretraining datasets (train & valid).

API#

core.datasets.retro.query.query.get_index( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, ondisk: bool = False, ) → faiss.Index#

Read index from disk.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
ondisk (bool) – If ondisk = True, memory-map the index. (For debugging purposes only; very slow.)

Returns:

A Faiss index, loaded from storage.

core.datasets.retro.query.query.embed_block( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, gpt_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset, block: dict, ) → numpy.ndarray#

Embed block of chunks.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
gpt_dataset (GPTChunkDataset) – Chunk dataset to be embedded.
block (dict) – Range information containing the start/end indices of a subset of the chunk dataset.

Returns:

Embeddings array, with shape (len(block["range"]), dimension(embedder)).
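
The relationship between the block and the returned shape can be sketched as follows. This is an illustration under stated assumptions: `block["range"]` is taken to be a Python `range`, `toy_embed` is a hypothetical stand-in for the real embedder, and nested lists replace the `numpy.ndarray` that the actual function returns.

```python
from typing import Dict, List

EMBED_DIM = 4  # hypothetical embedder dimension


def toy_embed(chunk_tokens: List[int]) -> List[float]:
    """Hypothetical embedder: normalized counts of token IDs bucketed into EMBED_DIM bins."""
    vec = [0.0] * EMBED_DIM
    for tok in chunk_tokens:
        vec[tok % EMBED_DIM] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def embed_block_sketch(chunks: List[List[int]], block: Dict[str, range]) -> List[List[float]]:
    """Embed only the chunks whose indices fall inside block["range"]."""
    return [toy_embed(chunks[i]) for i in block["range"]]


chunks = [[1, 2, 3], [4, 4, 4], [0, 5, 9], [7, 7, 2], [3, 3, 3]]
block = {"range": range(1, 4)}  # assumption: the block stores a Python range

embeddings = embed_block_sketch(chunks, block)
shape = (len(embeddings), len(embeddings[0]))  # one row per chunk in the block
```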

core.datasets.retro.query.query.query_embeddings( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset, index: megatron.core.datasets.retro.index.index.Index, embeddings: numpy.ndarray, chunk_id_range: range, sample_map: dict, n_chunks_per_sample: int, verbose: bool = True, ) → Tuple[numpy.ndarray, numpy.ndarray]#

Query neighbors of a block of embeddings.

Querying includes:

Query index for neighbor chunk IDs.
Filter out neighbor chunk IDs that have the same document ID as the queried embedding.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
index (Index) – Vector index populated with chunk database indices.
embeddings (np.ndarray) – Embeddings from GPT chunk dataset.
chunk_id_range (range) – Chunk ID range from GPT chunk dataset.
sample_map (dict) – Mapping of sample_idx to dataset_idx and document_ids. Used for document filtering.
n_chunks_per_sample (int) – Number of chunks per sample (e.g., sequence_length / chunk_length).
verbose (bool) – Log querying progress.

Returns:

A tuple of original (unfiltered) neighbor IDs, and filtered (by document ID) neighbor IDs.
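
The document-ID filtering step can be sketched in plain Python. The layout below is an assumption made for illustration: `sample_map` entries are dicts with hypothetical keys `dataset_idx` and `doc_ids`, `db_doc_tuples[i]` gives the (dataset index, document ID) pair of database chunk `i`, and rows left short after filtering are padded with -1. None of these details are confirmed by the reference text above.

```python
from typing import Dict, List, Tuple


def filter_neighbors(
    neighbor_ids: List[List[int]],
    chunk_id_range: range,
    sample_map: Dict[int, dict],
    n_chunks_per_sample: int,
    db_doc_tuples: List[Tuple[int, int]],
    n_keep: int,
) -> List[List[int]]:
    """Drop neighbors that come from the same document(s) as the querying chunk."""
    filtered = []
    for row, chunk_id in zip(neighbor_ids, chunk_id_range):
        # Each sample holds n_chunks_per_sample consecutive chunks.
        sample = sample_map[chunk_id // n_chunks_per_sample]
        own_docs = {(sample["dataset_idx"], d) for d in sample["doc_ids"]}
        kept = [n for n in row if n >= 0 and db_doc_tuples[n] not in own_docs]
        kept = kept[:n_keep]
        kept += [-1] * (n_keep - len(kept))  # assumption: pad short rows with -1
        filtered.append(kept)
    return filtered


db_doc_tuples = [(0, 10), (0, 10), (0, 11), (1, 10), (0, 12)]
sample_map = {
    0: {"dataset_idx": 0, "doc_ids": [10]},
    1: {"dataset_idx": 0, "doc_ids": [11, 12]},
}
raw = [[0, 3, 2, 1], [2, 4, 0, -1]]  # unfiltered neighbor IDs for chunks 1 and 2
filtered = filter_neighbors(raw, range(1, 3), sample_map, 2, db_doc_tuples, n_keep=2)
result = (raw, filtered)  # mirrors the (unfiltered, filtered) tuple described above
```

Chunk 1 belongs to sample 0 (document 10 of dataset 0), so database chunks 0 and 1 are dropped. Database chunk 3 is kept because it has the same document ID in a different dataset.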

core.datasets.retro.query.query.query_embedding_block( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset, index: megatron.core.datasets.retro.index.index.Index, embeddings: numpy.ndarray, chunk_id_range: range, sample_map: dict, n_chunks_per_sample: int, ) → Tuple[numpy.ndarray, numpy.ndarray]#

Query a block of embeddings.

The block is broken into smaller sub-blocks, for easier tracking of progress. Both the raw neighbor IDs and the filtered neighbor IDs (i.e., chunks with the same document ID are removed) are collected.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
index (Index) – Vector index populated with chunk database indices.
embeddings (np.ndarray) – Embeddings from GPT chunk dataset.
chunk_id_range (range) – Chunk ID range from GPT chunk dataset.
sample_map (dict) – Mapping of sample_idx to dataset_idx and document_ids. Used for document filtering.
n_chunks_per_sample (int) – Number of chunks per sample (e.g., sequence_length / chunk_length).

Returns:

A tuple of original (unfiltered) neighbor IDs, and filtered (by document ID) neighbor IDs.
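
Splitting a block into sub-blocks and collecting both result sets might look like the sketch below. The sub-block size, the `query_fn` callback, and the progress message are hypothetical; the actual function's sub-block size and logging are not specified here.

```python
from typing import Callable, List, Tuple

Rows = List[List[int]]


def split_into_sub_blocks(chunk_id_range: range, sub_block_size: int) -> List[range]:
    """Split a chunk ID range into consecutive sub-ranges of at most sub_block_size IDs."""
    return [
        range(start, min(start + sub_block_size, chunk_id_range.stop))
        for start in range(chunk_id_range.start, chunk_id_range.stop, sub_block_size)
    ]


def query_in_sub_blocks(
    embeddings: List[List[float]],
    chunk_id_range: range,
    sub_block_size: int,
    query_fn: Callable[[List[List[float]], range], Tuple[Rows, Rows]],
) -> Tuple[Rows, Rows]:
    """Query one sub-block at a time and concatenate raw and filtered neighbor IDs."""
    raw_ids: Rows = []
    filtered_ids: Rows = []
    for sub_range in split_into_sub_blocks(chunk_id_range, sub_block_size):
        # Embedding rows are offset by the first chunk ID of the block.
        lo = sub_range.start - chunk_id_range.start
        hi = sub_range.stop - chunk_id_range.start
        raw, filtered = query_fn(embeddings[lo:hi], sub_range)
        raw_ids.extend(raw)
        filtered_ids.extend(filtered)
        print(f"queried chunks {sub_range.start}..{sub_range.stop - 1}")
    return raw_ids, filtered_ids


def toy_query(sub_embeddings: List[List[float]], sub_range: range) -> Tuple[Rows, Rows]:
    """Hypothetical query: fabricates neighbor IDs so the bookkeeping can be checked."""
    assert len(sub_embeddings) == len(sub_range)
    raw = [[chunk_id, chunk_id + 1] for chunk_id in sub_range]
    filtered = [[chunk_id + 1] for chunk_id in sub_range]
    return raw, filtered


chunk_id_range = range(100, 110)
embeddings = [[float(i)] for i in chunk_id_range]
sub_blocks = split_into_sub_blocks(chunk_id_range, 4)
raw_ids, filtered_ids = query_in_sub_blocks(embeddings, chunk_id_range, 4, toy_query)
```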

core.datasets.retro.query.query.query_block_neighbors( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset, query_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset, index: megatron.core.datasets.retro.index.index.Index, block: dict, ) → None#

Query neighbors of a dataset block (i.e., range).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
query_dataset (GPTChunkDataset) – GPT chunk dataset to be queried.
index (Index) – Vector index populated with chunk database indices.
block (dict) – Range information containing the start/end indices for querying the GPT chunk dataset.

core.datasets.retro.query.query.query_dataset_neighbors( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset, query_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset, num_active_chunks: int, prefix: str, neighbor_dir: str, index: megatron.core.datasets.retro.index.index.Index, ) → None#

Query neighbors of each chunk within a dataset.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
query_dataset (GPTChunkDataset) – GPT chunk dataset to be queried.
num_active_chunks (int) – The ‘active’ chunks are the subset of the GPT chunk dataset that aren’t being queried. This argument is used when validating the correctness of a subset of the GPT chunk dataset.
prefix (str) – Extra string for logging progress.
neighbor_dir (str) – File path to directory for saving neighbor IDs.
index (Index) – Vector index populated with chunk database indices.
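
One way to picture how a dataset's chunks could be processed block by block, with results written under `neighbor_dir`, is the sketch below. It is an assumption-heavy illustration: the file naming scheme, the JSON format, the skip-if-present behavior, and the helpers `make_blocks` and `query_missing_blocks` are all hypothetical and are not taken from the module.

```python
import json
import os
import tempfile
from typing import Callable, List


def make_blocks(n_chunks: int, block_size: int, neighbor_dir: str) -> List[dict]:
    """Cover chunk IDs [0, n_chunks) with fixed-size blocks, one output file per block."""
    blocks = []
    for start in range(0, n_chunks, block_size):
        end = min(start + block_size, n_chunks)
        blocks.append({
            "range": range(start, end),
            # Hypothetical naming scheme, used only in this sketch.
            "path": os.path.join(neighbor_dir, f"{start:08d}-{end:08d}.json"),
        })
    return blocks


def query_missing_blocks(blocks: List[dict], query_fn: Callable[[range], list]) -> List[range]:
    """Query and save only the blocks whose output file does not exist yet."""
    processed = []
    for block in blocks:
        if os.path.exists(block["path"]):
            continue  # already saved by an earlier run
        with open(block["path"], "w") as f:
            json.dump(query_fn(block["range"]), f)
        processed.append(block["range"])
    return processed


def toy_query(ids: range) -> list:
    """Hypothetical query: each chunk's only 'neighbor' is its own ID."""
    return [[i] for i in ids]


with tempfile.TemporaryDirectory() as neighbor_dir:
    blocks = make_blocks(10, 4, neighbor_dir)
    first_pass = query_missing_blocks(blocks, toy_query)
    second_pass = query_missing_blocks(blocks, toy_query)  # nothing left to do
    saved_files = sorted(os.listdir(neighbor_dir))
```

Keeping one file per block makes the work resumable in this sketch: a second pass finds every file present and does no further queries.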

core.datasets.retro.query.query.query_neighbors( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, ) → None#

Query pretraining datasets (train & valid).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
