nemo_automodel.components.datasets.llm.retrieval_dataset
Module Contents
Classes
Functions
Data
API
Interface for corpus datasets addressable by document id.
Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.
Get corpus ID from metadata
Get passage instruction from metadata
Get corpus path from the corpus object
Get query instruction from metadata
Get task type from metadata
Delegate to corpus for convenience
Delegate to corpus for convenience
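As a rough illustration of the container described above, the sketch below pairs a metadata dictionary with a corpus object and exposes the accessors listed. The class names `InMemoryCorpus` and `CorpusInfoSketch`, the metadata keys, and the method names `get_document_by_id` and `get_all_ids` are assumptions made for this example; they are not taken from the module.

```python
from dataclasses import dataclass
from typing import Any


class InMemoryCorpus:
    """Hypothetical stand-in for a corpus dataset addressable by document id."""

    def __init__(self, path: str, docs: dict[str, dict[str, Any]]):
        self.path = path
        self._docs = docs

    def get_document_by_id(self, doc_id: str) -> dict[str, Any]:
        return self._docs[doc_id]

    def get_all_ids(self) -> list[str]:
        return sorted(self._docs)


@dataclass
class CorpusInfoSketch:
    """Holds corpus metadata and the corpus object together (illustrative only)."""

    metadata: dict[str, Any]
    corpus: InMemoryCorpus

    @property
    def corpus_id(self) -> str:
        return self.metadata["corpus_id"]

    @property
    def query_instruction(self) -> str:
        # Assumed metadata key; falls back to an empty instruction.
        return self.metadata.get("query_instruction", "")

    @property
    def passage_instruction(self) -> str:
        return self.metadata.get("passage_instruction", "")

    @property
    def task_type(self) -> str:
        return self.metadata.get("task_type", "")

    @property
    def path(self) -> str:
        # The path comes from the corpus object, not from the metadata.
        return self.corpus.path

    def get_document_by_id(self, doc_id: str) -> dict[str, Any]:
        # Delegate to the corpus for convenience.
        return self.corpus.get_document_by_id(doc_id)

    def get_all_ids(self) -> list[str]:
        # Delegate to the corpus for convenience.
        return self.corpus.get_all_ids()
```

Keeping the metadata and the corpus behind one object means callers can read an instruction and fetch a document without tracking two parallel dictionaries.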
Bases: AbstractDataset
Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).
Stateful transform for retrieval datasets with epoch-based positive cycling.
This class encapsulates the transform state (epoch, corpus_dict, etc.) and provides a clean interface for updating the epoch without recreating the transform.
Update the epoch for positive document cycling.
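One plausible way to realize epoch-based positive cycling is to index the list of positives with the epoch modulo its length. The sketch below assumes that rule, which the description above does not spell out, and uses hypothetical names (`select_positive`, `EpochCyclingTransformSketch`, `set_epoch`) throughout.

```python
from typing import Any, Sequence


def select_positive(pos_docs: Sequence[Any], epoch: int, cycle: bool = True) -> Any:
    """Pick one positive document; rotate by epoch when cycling is enabled."""
    if not pos_docs:
        raise ValueError("each query needs at least one positive document")
    # Assumed rule: epoch modulo the number of positives.
    index = epoch % len(pos_docs) if cycle else 0
    return pos_docs[index]


class EpochCyclingTransformSketch:
    """Stateful callable whose epoch can be updated without rebuilding it."""

    def __init__(self, cycle: bool = True):
        self.cycle = cycle
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch

    def __call__(self, batch: dict[str, list]) -> dict[str, list]:
        return {
            "question": list(batch["question"]),
            "pos_doc": [
                select_positive(pos, self.epoch, self.cycle)
                for pos in batch["pos_doc"]
            ],
        }
```

Because the epoch lives on the object, a training loop can call `set_epoch` at the start of each epoch while the dataset keeps the same transform instance.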
Transform function to convert from raw format to cross-encoder training format.
Discover all subset names in repo_id by finding dataset_metadata.json files.
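The discovery step can be pictured as a filter over the repository's file listing. The sketch below works on a plain list of paths and assumes that a subset's name is the directory containing its dataset_metadata.json; both the helper name `discover_subsets_sketch` and that naming rule are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable


def discover_subsets_sketch(repo_files: Iterable[str]) -> list[str]:
    """Return sorted subset names inferred from dataset_metadata.json paths."""
    subsets = set()
    for file_path in repo_files:
        path = PurePosixPath(file_path)
        # Skip a metadata file at the repository root: it has no subset directory.
        if path.name == "dataset_metadata.json" and len(path.parts) > 1:
            subsets.add(str(path.parent))
    return sorted(subsets)
```

In practice the file listing would come from the hosting service; it is passed in here so the sketch stays runnable offline.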
Load one or more hf:// URIs and return (Dataset, corpus_dict).
Load a single HF subset and return (normalized_data_list, CorpusInfo).
Normalize a single source or list of sources into parsed entries.
Parse a data entry.
Supported forms:
- "path_or_hf_uri": use all samples
- {"path": "path_or_hf_uri", "num_samples": N}: sample N examples once from that source
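A minimal sketch of the two supported forms follows, assuming a missing `num_samples` means "use all samples". The names `ParsedEntry`, `parse_data_entry_sketch`, and `normalize_sources_sketch` are hypothetical, and the validation rules are illustrative rather than the module's own.

```python
from typing import NamedTuple, Optional, Union

Source = Union[str, dict]


class ParsedEntry(NamedTuple):
    path: str
    num_samples: Optional[int]  # None means "use all samples"


def parse_data_entry_sketch(entry: Source) -> ParsedEntry:
    """Parse a string or {"path": ..., "num_samples": N} entry."""
    if isinstance(entry, str):
        return ParsedEntry(entry, None)
    if isinstance(entry, dict):
        if "path" not in entry:
            raise ValueError("dictionary entries need a 'path' key")
        num_samples = entry.get("num_samples")
        if num_samples is not None:
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(num_samples, bool) or not isinstance(num_samples, int):
                raise ValueError("'num_samples' must be an integer")
            if num_samples <= 0:
                raise ValueError("'num_samples' must be positive")
        return ParsedEntry(entry["path"], num_samples)
    raise TypeError(f"unsupported entry type: {type(entry).__name__}")


def normalize_sources_sketch(sources: Union[Source, list]) -> list:
    """Accept a single source or a list of sources and parse each one."""
    if isinstance(sources, (str, dict)):
        sources = [sources]
    return [parse_data_entry_sketch(source) for source in sources]
```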
Parse an hf:// URI into (repo_id, subset_or_none).
Examples::
"hf://nvidia/embed-nemotron-dataset-v1/FEVER" -> ("nvidia/embed-nemotron-dataset-v1", "FEVER")
"hf://nvidia/embed-nemotron-dataset-v1" -> ("nvidia/embed-nemotron-dataset-v1", None)
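The mapping in these examples can be reproduced with simple string splitting. The sketch below assumes the repository id is always the first two path components (owner and name) and that anything after them is the subset; the function name `parse_hf_uri_sketch` is hypothetical.

```python
from typing import Optional, Tuple


def parse_hf_uri_sketch(uri: str) -> Tuple[str, Optional[str]]:
    """Split an hf:// URI into (repo_id, subset_or_none)."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    parts = uri[len(prefix):].strip("/").split("/")
    # Assumption: repo_id is "<owner>/<name>", so two components are required.
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected hf://<owner>/<name>[/<subset>], got {uri!r}")
    repo_id = "/".join(parts[:2])
    subset = "/".join(parts[2:]) or None
    return repo_id, subset
```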
Transform function to convert from raw format to training format.
Parameters:
Batch of examples with question, corpus_id, pos_doc, neg_doc
Number of negative documents to use
Dictionary mapping corpus_id to corpus objects
Whether to use instruction from dataset’s metadata
Current epoch for cycling through positive documents
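To make the parameters above concrete, here is a simplified sketch of a batch transform. It assumes that `pos_doc` and `neg_doc` hold lists of document ids, that each corpus is a plain dictionary with `docs` and `query_instruction` keys, and that the output uses `question` and `doc_text` columns. None of those shapes, nor the parameter names, are confirmed by the description; they are placeholders for illustration.

```python
from typing import Any


def transform_batch_sketch(
    examples: dict[str, list],
    num_neg_docs: int,
    corpus_dict: dict[str, dict[str, Any]],
    use_dataset_instruction: bool = False,
    epoch: int = 0,
) -> dict[str, list]:
    """Resolve document ids to text: one positive followed by the negatives."""
    questions, doc_texts = [], []
    rows = zip(
        examples["question"],
        examples["corpus_id"],
        examples["pos_doc"],
        examples["neg_doc"],
    )
    for question, corpus_id, pos_ids, neg_ids in rows:
        corpus = corpus_dict[corpus_id]
        # Assumed cycling rule: rotate through positives by epoch.
        pos_id = pos_ids[epoch % len(pos_ids)]
        chosen_ids = [pos_id] + list(neg_ids[:num_neg_docs])
        if use_dataset_instruction:
            instruction = corpus.get("query_instruction", "")
            question = f"{instruction} {question}".strip()
        questions.append(question)
        doc_texts.append([corpus["docs"][doc_id] for doc_id in chosen_ids])
    return {"question": questions, "doc_text": doc_texts}
```

Placing the positive first in each passage list keeps its position fixed, which lets a downstream loss treat index 0 as the label.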
Add one or more corpus paths to a corpus dictionary.
Instantiate a corpus dataset from a path and optional metadata.
Load Merlin corpus metadata from a corpus directory.
Load datasets from JSON files.
Entries can be strings (use all samples) or dictionaries with path and optional num_samples fields (sample a fixed subset once while loading).
Returns:
Tuple of (dataset, corpus_dict)
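Sampling "once while loading" suggests that the subset is drawn a single time and then stays fixed. A minimal sketch of that idea follows, assuming a seeded random draw over already-loaded records; the helper name `sample_once_sketch`, the seed default, and the choice to preserve original order are assumptions, not documented behavior.

```python
import random
from typing import Any, Optional, Sequence


def sample_once_sketch(
    records: Sequence[Any],
    num_samples: Optional[int] = None,
    seed: int = 42,
) -> list:
    """Return all records, or a fixed random subset of num_samples records."""
    if num_samples is None or num_samples >= len(records):
        return list(records)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    indices = sorted(rng.sample(range(len(records)), num_samples))
    return [records[i] for i in indices]
```

Drawing with a local, seeded generator makes the subset reproducible across runs that use the same seed and input order.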
Load and return dataset in retrieval format for encoder training.
Entries in data_dir_list can be local JSON file paths or hf:// URIs
pointing to a HuggingFace dataset repository (e.g.
hf://nvidia/embed-nemotron-dataset-v1/SciFact). A source can also be
provided as {"path": path_or_uri, "num_samples": N} to sample a fixed
subset once while loading. The returned dataset uses set_transform() for
lazy evaluation; tokenization is handled by the collator.
Parameters:
Path(s) to JSON file(s), hf:// URIs, or dictionary entries with path and
num_samples.
"bi_encoder" (default) or "cross_encoder"
Type of data ("train" or "eval")
Number of passages (1 positive + n-1 negatives)
Number of negative documents for evaluation
Random seed for reproducibility (for shuffling if needed)
Shuffle dataset rows before subset selection. Only applied when
max_train_samples is set; otherwise iteration order is controlled by
the dataloader’s sampler (e.g. StatefulDistributedSampler).
Maximum number of training samples to use
Offset for selecting training samples
Whether to use instruction from dataset’s metadata
Whether training should cycle through positive documents across epochs.
Defaults to False (always use the first positive document). Set to True only
when a query has multiple positive documents and you want to rotate through them by epoch.
Returns:
A HuggingFace Dataset where each example is a dict with keys: