nemo_automodel.components.datasets.llm.retrieval_dataset#

Module Contents#

Classes#

AbstractDataset

TextQADataset

HFCorpusDataset

Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).

CorpusInfo

Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.

Functions#

load_corpus_metadata

load_corpus

add_corpus

load_datasets

Load datasets from JSON files.

_parse_hf_uri

Parse an hf:// URI into (repo_id, subset_or_none).

_list_hf_subsets

Discover all subset names in repo_id by finding dataset_metadata.json files.

_load_hf_subset

Load a single HF subset and return (normalized_data_list, CorpusInfo).

_load_hf_sources

Load one or more hf:// URIs and return (Dataset, corpus_dict).

_transform_func

Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.

_create_transform_func

Create transform function with specified number of negative documents.

make_retrieval_dataset

Load and return dataset in retrieval format for biencoder training.

Data#

API#

nemo_automodel.components.datasets.llm.retrieval_dataset.EXAMPLE_TEMPLATE#

None

class nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#

Bases: abc.ABC

abstractmethod get_document_by_id(id)#
abstractmethod get_all_ids()#
class nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset(path)#

Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset

get_document_by_id(id)#
get_all_ids()#
class nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset(hf_dataset: datasets.Dataset, path: str = '')#

Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset

Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).

Initialization

get_document_by_id(id)#
get_all_ids()#
nemo_automodel.components.datasets.llm.retrieval_dataset.DATASETS#

None

class nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo#

Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.

metadata: dict#

None

corpus: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#

None

property corpus_id: str#

Get the corpus ID from metadata.

property query_instruction: str#

Get the query instruction from metadata.

property passage_instruction: str#

Get the passage instruction from metadata.

property task_type: str#

Get the task type from metadata.

property path: str#

Get the corpus path from the corpus object.

get_document_by_id(doc_id: str)#

Delegate to the corpus for convenience.

get_all_ids()#

Delegate to the corpus for convenience.
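To make the shape of CorpusInfo concrete, here is a minimal standalone sketch of the documented behavior: a metadata dict plus a corpus object, with properties reading from metadata and convenience methods delegating to the corpus. This is an illustrative stand-in, not the module's actual code; the `DictCorpus` helper and the metadata key names other than `corpus_id` are assumptions.

```python
from dataclasses import dataclass, field


class DictCorpus:
    """Minimal in-memory corpus for illustration (hypothetical helper)."""

    def __init__(self, docs: dict):
        self._docs = docs

    def get_document_by_id(self, doc_id):
        return self._docs[doc_id]

    def get_all_ids(self):
        return list(self._docs)


@dataclass
class CorpusInfoSketch:
    """Illustrative stand-in for CorpusInfo: metadata + corpus together."""

    metadata: dict = field(default_factory=dict)
    corpus: object = None  # an AbstractDataset subclass in the real module

    @property
    def corpus_id(self) -> str:
        # Properties read straight out of the metadata dict.
        return self.metadata["corpus_id"]

    def get_document_by_id(self, doc_id: str):
        # Convenience delegation to the wrapped corpus.
        return self.corpus.get_document_by_id(doc_id)

    def get_all_ids(self):
        return self.corpus.get_all_ids()
```

The design keeps metadata and the dataset object bundled, so callers can pass a single handle around instead of parallel dicts.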

nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus_metadata(path: str)#
nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus(path, metadata: Optional[dict] = None)#
nemo_automodel.components.datasets.llm.retrieval_dataset.add_corpus(
qa_corpus_paths: Union[dict, list],
corpus_dict: dict,
)#
nemo_automodel.components.datasets.llm.retrieval_dataset.load_datasets(
data_dir_list: Union[List[str], str],
concatenate: bool = True,
)#

Load datasets from JSON files.

Copied from nemo-retriever-research/src/data/datasets.py

Returns:

Tuple of (dataset, corpus_dict)

nemo_automodel.components.datasets.llm.retrieval_dataset._HF_PREFIX#

'hf://'

nemo_automodel.components.datasets.llm.retrieval_dataset._parse_hf_uri(uri: str)#

Parse an hf:// URI into (repo_id, subset_or_none).

Examples::

"hf://nvidia/embed-nemotron-dataset-v1/FEVER"  -> ("nvidia/embed-nemotron-dataset-v1", "FEVER")
"hf://nvidia/embed-nemotron-dataset-v1"         -> ("nvidia/embed-nemotron-dataset-v1", None)
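The parsing shown in the examples can be sketched as a small standalone helper. This is an illustrative reimplementation of the documented behavior, not the module's actual code; the real `_parse_hf_uri` may differ in validation and error handling.

```python
def parse_hf_uri(uri: str):
    """Split "hf://{org}/{name}[/{subset}]" into (repo_id, subset_or_none)."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri}")
    parts = uri[len(prefix):].strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"expected at least org/name in: {uri}")
    # The first two path segments form the repo id, e.g.
    # "nvidia/embed-nemotron-dataset-v1"; a third segment, if present,
    # names the subset, e.g. "FEVER".
    repo_id = "/".join(parts[:2])
    subset = parts[2] if len(parts) > 2 else None
    return repo_id, subset
```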
nemo_automodel.components.datasets.llm.retrieval_dataset._list_hf_subsets(repo_id: str) List[str]#

Discover all subset names in repo_id by finding dataset_metadata.json files.

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_subset(repo_id: str, subset: str)#

Load a single HF subset and return (normalized_data_list, CorpusInfo).

.. note::

The direct hf:// path currently expects the Automodel retrieval schema:

  • {subset}/dataset_metadata.json with corpus_id metadata

  • {subset}_corpus split with corpus columns like id and text

  • {subset} split with query columns like question and pos_doc

FEVER and SyntheticClassificationData from nvidia/embed-nemotron-dataset-v1 are examples that follow this layout. Datasets with different structures should use a custom adapter/preprocessor before calling this loader.
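A `{subset}/dataset_metadata.json` following this layout might look roughly like the dict below. Only the `corpus_id` field is confirmed by the schema note above; the remaining keys and all field values are hypothetical, assumed from the CorpusInfo properties (query_instruction, passage_instruction, task_type).

```python
# Hypothetical contents of {subset}/dataset_metadata.json, shown as a
# Python dict. Only "corpus_id" is confirmed by the schema note; the other
# keys mirror the CorpusInfo properties and are assumptions.
example_metadata = {
    "corpus_id": "fever",
    "query_instruction": "Given a claim, retrieve documents that support or refute it.",
    "passage_instruction": "",
    "task_type": "retrieval",
}
```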

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_sources(hf_uris: List[str])#

Load one or more hf:// URIs and return (Dataset, corpus_dict).

nemo_automodel.components.datasets.llm.retrieval_dataset._transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.

Parameters:
  • examples – Batch of examples with question, corpus_id, pos_doc, neg_doc

  • num_neg_docs – Number of negative documents to use

  • corpus_dict – Dictionary mapping corpus_id to corpus objects

  • use_dataset_instruction – Whether to use instruction from dataset’s metadata

nemo_automodel.components.datasets.llm.retrieval_dataset._create_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset(
data_dir_list: Union[List[str], str] = None,
data_type: str = 'train',
train_n_passages: int = 5,
eval_negative_size: int = 10,
seed: int = 42,
do_shuffle: bool = False,
max_train_samples: int = None,
train_data_select_offset: int = 0,
use_dataset_instruction: bool = False,
)#

Load and return dataset in retrieval format for biencoder training.

Entries in data_dir_list can be local JSON file paths or hf:// URIs pointing to a HuggingFace dataset repository (e.g. hf://nvidia/embed-nemotron-dataset-v1/SciFact). Uses set_transform() for lazy evaluation — tokenization is handled by the collator.

Parameters:
  • data_dir_list – Path(s) to JSON file(s) or hf:// URIs.

  • data_type – Type of data (“train” or “eval”)

  • train_n_passages – Number of passages for training (1 positive + n-1 negatives)

  • eval_negative_size – Number of negative documents for evaluation

  • seed – Random seed for reproducibility (for shuffling if needed)

  • do_shuffle – Whether to shuffle the dataset

  • max_train_samples – Maximum number of training samples to use

  • train_data_select_offset – Offset for selecting training samples

  • use_dataset_instruction – Whether to use instruction from dataset’s metadata

Returns:

A HuggingFace Dataset where each example is a dict with keys:

  • ‘question’: Query text

  • ‘doc_text’: List of document texts [positive, negatives…]

  • ‘doc_image’: List of images or empty strings

Return type:

datasets.Dataset

.. note::

Direct hf:// loading currently supports HF datasets that already follow the Automodel retrieval schema (corpus-id based layout used by nvidia/embed-nemotron-dataset-v1 subsets such as FEVER and SyntheticClassificationData). For other HF dataset formats, implement a custom adapter/preprocessor before calling this loader.

Tokenization should be handled by a collator (e.g., RetrievalBiencoderCollator), which is more efficient for batch padding and supports dynamic processing.
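For orientation, here is a hypothetical call (shown in comments, since it needs the installed package and network access) together with the per-example shape the return documentation describes. The concrete field values below are invented; only the key names and the 1-positive-plus-negatives layout come from the documentation above.

```python
# Hypothetical usage (not executed here):
#
#   from nemo_automodel.components.datasets.llm.retrieval_dataset import (
#       make_retrieval_dataset,
#   )
#   ds = make_retrieval_dataset(
#       data_dir_list=["hf://nvidia/embed-nemotron-dataset-v1/SciFact"],
#       data_type="train",
#       train_n_passages=5,
#   )
#
# Each example is documented to be a dict of this shape. With
# train_n_passages=5, doc_text holds 1 positive + 4 negatives:
example = {
    "question": "What does the claim assert?",            # query text
    "doc_text": ["pos"] + [f"neg{i}" for i in range(4)],  # [positive, negatives...]
    "doc_image": [""] * 5,                                # empty strings when text-only
}
```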