core.datasets.retro.query.query
Entry point for querying an index using a GPTChunkDataset.
Querying involves:
- Iterating over all chunks in the GPTChunkDataset.
- Querying the index for neighbor chunk IDs (i.e., chunks from the chunk database).
- Saving the neighbor chunk IDs to disk, for use in building a RetroDataset sample during pretraining.
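The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: a brute-force distance search stands in for the real vector index, embeddings are small lists of floats, and the helper names (`nearest_neighbors`, `query_and_save`) and the JSON output format are hypothetical, not part of this module.

```python
import json
import math
import tempfile
from pathlib import Path


def nearest_neighbors(query, database, k):
    """Return the indices of the k database vectors closest to `query`.

    Brute-force stand-in for a real vector index (assumption for illustration).
    """
    order = sorted(range(len(database)), key=lambda i: math.dist(query, database[i]))
    return order[:k]


def query_and_save(chunk_embeddings, database, k, out_path):
    """Iterate all chunks, query neighbor IDs, and save them to disk."""
    neighbor_ids = [nearest_neighbors(q, database, k) for q in chunk_embeddings]
    Path(out_path).write_text(json.dumps(neighbor_ids))
    return neighbor_ids


# Toy chunk database (4 entries) and 3 query chunks, all 2-dimensional.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
chunk_embeddings = [[0.1, 0.0], [0.9, 0.1], [4.0, 4.5]]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "neighbors.json"
    query_and_save(chunk_embeddings, database, k=2, out_path=out_path)
    saved = json.loads(out_path.read_text())
```

Each row of `saved` holds the neighbor IDs for one query chunk, which is the kind of information a later stage could read back when assembling retrieval-augmented samples.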
Module Contents
Functions
| get_index | Read index from disk. |
| embed_block | Embed block of chunks. |
| query_embeddings | Query neighbors of a block of embeddings. |
| query_embedding_block | Query a block of embeddings. |
| query_block_neighbors | Query neighbors of a dataset block (i.e., range). |
| query_dataset_neighbors | Query neighbors of each chunk within a dataset. |
| query_neighbors | Query pretraining datasets (train & valid). |
API
- core.datasets.retro.query.query.get_index(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- ondisk: bool = False,
- )
Read index from disk.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
ondisk (bool) – If ondisk = True, memory-map the index. (For debugging purposes only; very slow.)
- Returns:
A Faiss index, loaded from storage.
- core.datasets.retro.query.query.embed_block(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- gpt_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset,
- block: dict,
- )
Embed block of chunks.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
gpt_dataset (GPTChunkDataset) – Chunk dataset to be embedded.
block (dict) – Range information containing start/end indices of subset of chunk dataset.
- Returns:
Embeddings array, with shape (len(block["range"]), dimension(embedder)).
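The relationship between the block and the output shape can be illustrated with a toy embedder. This is a sketch under assumptions: `toy_embed` is a hypothetical stand-in for the real embedder, the chunk dataset is a plain list of strings, and `block["range"]` is treated as a Python `range`; the real structure may differ (for example, a (start, end) pair).

```python
def toy_embed(text, dim=4):
    """Hypothetical embedder: fold character codes into a fixed-length vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def embed_block_sketch(chunks, block, dim=4):
    """Embed only the chunks whose indices fall inside block["range"]."""
    return [toy_embed(chunks[i], dim) for i in block["range"]]


chunks = ["alpha", "beta", "gamma", "delta", "epsilon"]
block = {"range": range(1, 4)}  # assumed layout: covers chunks 1, 2 and 3

embeddings = embed_block_sketch(chunks, block)
shape = (len(embeddings), len(embeddings[0]))
```

Here `shape` has one row per chunk in the block and one column per embedding dimension, matching the documented return shape.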
- core.datasets.retro.query.query.query_embeddings(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
- index: megatron.core.datasets.retro.index.index.Index,
- embeddings: numpy.ndarray,
- chunk_id_range: range,
- sample_map: dict,
- n_chunks_per_sample: int,
- verbose: bool = True,
- )
Query neighbors of a block of embeddings.
Querying includes:
- Querying the index for neighbor chunk IDs.
- Filtering out neighbor chunk IDs that have the same document ID as the queried embedding.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
index (Index) – Vector index populated with chunk database indices.
embeddings (np.ndarray) – Embeddings from GPT chunk dataset.
chunk_id_range (range) – Chunk ID range from GPT chunk dataset.
sample_map (dict) – Mapping of sample_idx to dataset_idx and document_ids. Used for document filtering.
n_chunks_per_sample (int) – Number of chunks per sample (e.g., sequence_length / chunk_length).
verbose (bool) – Log querying progress.
- Returns:
A tuple of original (unfiltered) neighbor IDs, and filtered (by document ID) neighbor IDs.
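The document-ID filtering step can be sketched as follows. This is an illustration, not the module's implementation: the key names `dataset_idx` and `doc_ids` inside `sample_map`, the `db_doc_tuples` lookup list, the `num_keep` parameter, and the use of -1 as padding for missing neighbors are all assumptions made for the example.

```python
def filter_same_document(neighbor_ids, chunk_id_range, sample_map,
                         n_chunks_per_sample, db_doc_tuples, num_keep):
    """Drop neighbors that come from the same document as the query chunk.

    Returns (unfiltered, filtered) neighbor IDs, one row per query chunk.
    """
    filtered = []
    for row, chunk_id in zip(neighbor_ids, chunk_id_range):
        # Each sample holds n_chunks_per_sample consecutive chunks.
        sample = sample_map[chunk_id // n_chunks_per_sample]
        own_docs = {(sample["dataset_idx"], doc_id) for doc_id in sample["doc_ids"]}
        # Skip padding entries (-1) and neighbors from the query's own documents.
        kept = [n for n in row if n >= 0 and db_doc_tuples[n] not in own_docs]
        kept = kept[:num_keep]
        kept += [-1] * (num_keep - len(kept))  # pad rows to a fixed width
        filtered.append(kept)
    return neighbor_ids, filtered


# (dataset_idx, doc_id) for each entry of a toy chunk database.
db_doc_tuples = [(0, 10), (0, 10), (0, 11), (1, 10), (0, 12)]
sample_map = {
    0: {"dataset_idx": 0, "doc_ids": [10]},
    1: {"dataset_idx": 0, "doc_ids": [11, 12]},
}
raw = [[0, 1, 2, 3], [2, 4, 0, 1], [4, 3, -1, -1]]

unfiltered, filtered = filter_same_document(
    raw, range(1, 4), sample_map, n_chunks_per_sample=2,
    db_doc_tuples=db_doc_tuples, num_keep=2,
)
```

With `n_chunks_per_sample=2`, chunk 1 belongs to sample 0 and chunks 2 and 3 belong to sample 1, so each row is filtered against a different set of documents.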
- core.datasets.retro.query.query.query_embedding_block(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
- index: megatron.core.datasets.retro.index.index.Index,
- embeddings: numpy.ndarray,
- chunk_id_range: range,
- sample_map: dict,
- n_chunks_per_sample: int,
- )
Query a block of embeddings.
The block is broken into smaller sub-blocks, for easier tracking of progress. Both the raw neighbor IDs and the filtered neighbor IDs (i.e., chunks with the same document ID are removed) are collected.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
index (Index) – Vector index populated with chunk database indices.
embeddings (np.ndarray) – Embeddings from GPT chunk dataset.
chunk_id_range (range) – Chunk ID range from GPT chunk dataset.
sample_map (dict) – Mapping of sample_idx to dataset_idx and document_ids. Used for document filtering.
n_chunks_per_sample (int) – Number of chunks per sample (e.g., sequence_length / chunk_length).
- Returns:
A tuple of original (unfiltered) neighbor IDs, and filtered (by document ID) neighbor IDs.
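Splitting a block into sub-blocks and collecting both result sets might look like the sketch below. The names `query_in_sub_blocks` and `toy_query`, the `sub_block_size` parameter, and the progress message are hypothetical; `toy_query` only mimics a function that returns raw and filtered neighbor IDs.

```python
def query_in_sub_blocks(embeddings, query_fn, sub_block_size):
    """Query `embeddings` in slices, accumulating raw and filtered neighbor IDs."""
    raw_ids, filtered_ids = [], []
    total = len(embeddings)
    for start in range(0, total, sub_block_size):
        end = min(start + sub_block_size, total)
        raw, filtered = query_fn(embeddings[start:end])
        raw_ids.extend(raw)
        filtered_ids.extend(filtered)
        print(f"queried {end} / {total} embeddings")  # progress tracking
    return raw_ids, filtered_ids


def toy_query(batch):
    """Hypothetical query: fabricate two neighbor IDs per row, keep the even ones."""
    raw = [[int(v[0]), int(v[0]) + 1] for v in batch]
    filtered = [[n for n in row if n % 2 == 0] for row in raw]
    return raw, filtered


embeddings = [[1.0], [2.0], [3.0], [4.0], [5.0]]
raw_ids, filtered_ids = query_in_sub_blocks(embeddings, toy_query, sub_block_size=2)
```

Because each row is queried independently, the concatenated results equal what a single call over the whole block would return; the split only changes how often progress can be reported.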
- core.datasets.retro.query.query.query_block_neighbors(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
- query_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset,
- index: megatron.core.datasets.retro.index.index.Index,
- block: dict,
- )
Query neighbors of a dataset block (i.e., range).
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
query_dataset (GPTChunkDataset) – GPT chunk dataset to be queried.
index (Index) – Vector index populated with chunk database indices.
block (dict) – Range information containing start/end indices for querying GPT chunk dataset.
- core.datasets.retro.query.query.query_dataset_neighbors(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- db_dataset: megatron.core.datasets.retro.db.dataset.DBDataset,
- query_dataset: megatron.core.datasets.retro.query.gpt_chunk_dataset.GPTChunkDataset,
- num_active_chunks: int,
- prefix: str,
- neighbor_dir: str,
- index: megatron.core.datasets.retro.index.index.Index,
- )
Query neighbors of each chunk within a dataset.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
db_dataset (DBDataset) – Dataset containing chunk database entries.
query_dataset (GPTChunkDataset) – GPT chunk dataset to be queried.
num_active_chunks (int) – The ‘active’ chunks are the subset of the GPT chunk dataset that aren’t being queried. This argument is used when validating the correctness of a subset of the GPT chunk dataset.
prefix (str) – Extra string for logging progress.
neighbor_dir (str) – File path to directory for saving neighbor IDs.
index (Index) – Vector index populated with chunk database indices.
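One way to picture how a dataset is divided into blocks that are each saved under neighbor_dir is sketched below. The block size, the `path` key, and the file naming scheme are assumptions made for illustration; only the idea of a block carrying range information comes from the descriptions above.

```python
from pathlib import Path


def make_blocks(n_chunks, block_size, neighbor_dir):
    """Partition chunk IDs [0, n_chunks) into contiguous blocks.

    Each block records its chunk ID range and a hypothetical output file path.
    """
    blocks = []
    for start in range(0, n_chunks, block_size):
        end = min(start + block_size, n_chunks)
        blocks.append({
            "range": range(start, end),
            "path": Path(neighbor_dir) / f"{start:08d}-{end:08d}.json",
        })
    return blocks


blocks = make_blocks(n_chunks=10, block_size=4, neighbor_dir="neighbors")
bounds = [(b["range"].start, b["range"].stop) for b in blocks]
names = [b["path"].name for b in blocks]
```

The last block is shorter when the number of chunks is not a multiple of the block size, and every chunk ID appears in exactly one block.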
- core.datasets.retro.query.query.query_neighbors(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- )
Query pretraining datasets (train & valid).
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.