nemo_automodel.components.datasets.llm.retrieval_dataset#

Module Contents#

Classes#

AbstractDataset

TextQADataset

HFCorpusDataset

Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).

CorpusInfo

Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.

Functions#

load_corpus_metadata

load_corpus

add_corpus

load_datasets

Load datasets from JSON files.

_parse_hf_uri

Parse an hf:// URI into (repo_id, subset_or_none).

_list_hf_subsets

Discover all subset names in repo_id by finding dataset_metadata.json files.

_load_hf_subset

Load a single HF subset and return (normalized_data_list, CorpusInfo).

_load_hf_sources

Load one or more hf:// URIs and return (Dataset, corpus_dict).

_transform_func

Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.

_create_transform_func

Create transform function with specified number of negative documents.

make_retrieval_dataset

Load and return dataset in retrieval format for biencoder training.

Data#

API#

nemo_automodel.components.datasets.llm.retrieval_dataset.EXAMPLE_TEMPLATE#

None

class nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#

Bases: abc.ABC

abstractmethod get_document_by_id(id)#
abstractmethod get_all_ids()#
class nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset(path)#

Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset

get_document_by_id(id)#
get_all_ids()#
class nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset(hf_dataset: datasets.Dataset, path: str = '')#

Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset

Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).

Initialization

get_document_by_id(id)#
get_all_ids()#
nemo_automodel.components.datasets.llm.retrieval_dataset.DATASETS#

None

class nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo#

Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.

metadata: dict#

None

corpus: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#

None

property corpus_id: str#

Get the corpus ID from metadata.

property query_instruction: str#

Get the query instruction from metadata.

property passage_instruction: str#

Get the passage instruction from metadata.

property task_type: str#

Get the task type from metadata.

property path: str#

Get the corpus path from the corpus object.

get_document_by_id(doc_id: str)#

Delegate to the corpus for convenience.

get_all_ids()#

Delegate to the corpus for convenience.
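To make the shape of CorpusInfo concrete, here is a minimal standalone sketch of the documented behavior: a metadata dict plus a corpus object, with properties reading from metadata and convenience methods delegating to the corpus. This is an illustrative stand-in, not the module's actual code; the `DictCorpus` helper and the metadata key names other than `corpus_id` are assumptions.

```python
from dataclasses import dataclass, field


class DictCorpus:
    """Minimal in-memory corpus for illustration (hypothetical helper)."""

    def __init__(self, docs: dict):
        self._docs = docs

    def get_document_by_id(self, doc_id):
        return self._docs[doc_id]

    def get_all_ids(self):
        return list(self._docs)


@dataclass
class CorpusInfoSketch:
    """Illustrative stand-in for CorpusInfo: metadata + corpus together."""

    metadata: dict = field(default_factory=dict)
    corpus: object = None  # an AbstractDataset subclass in the real module

    @property
    def corpus_id(self) -> str:
        # Properties read straight out of the metadata dict.
        return self.metadata["corpus_id"]

    def get_document_by_id(self, doc_id: str):
        # Convenience delegation to the wrapped corpus.
        return self.corpus.get_document_by_id(doc_id)

    def get_all_ids(self):
        return self.corpus.get_all_ids()
```

The design keeps metadata and the dataset object bundled, so callers can pass a single handle around instead of parallel dicts.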

nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus_metadata(path: str)#
nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus(path, metadata: Optional[dict] = None)#
nemo_automodel.components.datasets.llm.retrieval_dataset.add_corpus(
qa_corpus_paths: Union[dict, list],
corpus_dict: dict,
)#
nemo_automodel.components.datasets.llm.retrieval_dataset.load_datasets(
data_dir_list: Union[List[str], str],
concatenate: bool = True,
)#

Load datasets from JSON files.

Copied from nemo-retriever-research/src/data/datasets.py

Returns:

Tuple of (dataset, corpus_dict)

nemo_automodel.components.datasets.llm.retrieval_dataset._HF_PREFIX#

'hf://'

nemo_automodel.components.datasets.llm.retrieval_dataset._parse_hf_uri(uri: str)#

Parse an hf:// URI into (repo_id, subset_or_none).

Examples::

"hf://nvidia/embed-nemotron-dataset-v1/FEVER"  -> ("nvidia/embed-nemotron-dataset-v1", "FEVER")
"hf://nvidia/embed-nemotron-dataset-v1"         -> ("nvidia/embed-nemotron-dataset-v1", None)
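The parsing shown in the examples can be sketched as a small standalone helper. This is an illustrative reimplementation of the documented behavior, not the module's actual code; the real `_parse_hf_uri` may differ in validation and error handling.

```python
def parse_hf_uri(uri: str):
    """Split "hf://{org}/{name}[/{subset}]" into (repo_id, subset_or_none)."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri}")
    parts = uri[len(prefix):].strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"expected at least org/name in: {uri}")
    # The first two path segments form the repo id, e.g.
    # "nvidia/embed-nemotron-dataset-v1"; a third segment, if present,
    # names the subset, e.g. "FEVER".
    repo_id = "/".join(parts[:2])
    subset = parts[2] if len(parts) > 2 else None
    return repo_id, subset
```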
nemo_automodel.components.datasets.llm.retrieval_dataset._list_hf_subsets(repo_id: str) List[str]#

Discover all subset names in repo_id by finding dataset_metadata.json files.

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_subset(repo_id: str, subset: str)#

Load a single HF subset and return (normalized_data_list, CorpusInfo).

.. note::

The direct hf:// path currently expects the Automodel retrieval schema:

  • {subset}/dataset_metadata.json with corpus_id metadata

  • {subset}_corpus split with corpus columns like id and text

  • {subset} split with query columns like question and pos_doc

FEVER and SyntheticClassificationData from nvidia/embed-nemotron-dataset-v1 are examples that follow this layout. Datasets with different structures should use a custom adapter/preprocessor before calling this loader.
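A `{subset}/dataset_metadata.json` following this layout might look roughly like the dict below. Only the `corpus_id` field is confirmed by the schema note above; the remaining keys and all field values are hypothetical, assumed from the CorpusInfo properties (query_instruction, passage_instruction, task_type).

```python
# Hypothetical contents of {subset}/dataset_metadata.json, shown as a
# Python dict. Only "corpus_id" is confirmed by the schema note; the other
# keys mirror the CorpusInfo properties and are assumptions.
example_metadata = {
    "corpus_id": "fever",
    "query_instruction": "Given a claim, retrieve documents that support or refute it.",
    "passage_instruction": "",
    "task_type": "retrieval",
}
```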

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_sources(hf_uris: List[str])#

Load one or more hf:// URIs and return (Dataset, corpus_dict).

nemo_automodel.components.datasets.llm.retrieval_dataset._transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.

Parameters:
  • examples – Batch of examples with question, corpus_id, pos_doc, neg_doc

  • num_neg_docs – Number of negative documents to use

  • corpus_dict – Dictionary mapping corpus_id to corpus objects

  • use_dataset_instruction – Whether to use instruction from dataset’s metadata

nemo_automodel.components.datasets.llm.retrieval_dataset._create_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset(
data_dir_list: Union[List[str], str] = None,
data_type: str = 'train',
train_n_passages: int = 5,
eval_negative_size: int = 10,
seed: int = 42,
do_shuffle: bool = False,
max_train_samples: int = None,
train_data_select_offset: int = 0,
use_dataset_instruction: bool = False,
)#

Load and return dataset in retrieval format for biencoder training.

Entries in data_dir_list can be local JSON file paths or hf:// URIs pointing to a HuggingFace dataset repository (e.g. hf://nvidia/embed-nemotron-dataset-v1/SciFact). Uses set_transform() for lazy evaluation — tokenization is handled by the collator.

Parameters:
  • data_dir_list – Path(s) to JSON file(s) or hf:// URIs.

  • data_type – Type of data (“train” or “eval”)

  • train_n_passages – Number of passages for training (1 positive + n-1 negatives)

  • eval_negative_size – Number of negative documents for evaluation

  • seed – Random seed for reproducibility (for shuffling if needed)

  • do_shuffle – Whether to shuffle the dataset

  • max_train_samples – Maximum number of training samples to use

  • train_data_select_offset – Offset for selecting training samples

  • use_dataset_instruction – Whether to use instruction from dataset’s metadata

Returns:

A HuggingFace Dataset where each example is a dict with keys:

  • ‘question’: Query text

  • ‘doc_text’: List of document texts [positive, negatives…]

  • ‘doc_image’: List of images or empty strings

Return type:

datasets.Dataset

.. note::

Direct hf:// loading currently supports HF datasets that already follow the Automodel retrieval schema (corpus-id based layout used by nvidia/embed-nemotron-dataset-v1 subsets such as FEVER and SyntheticClassificationData). For other HF dataset formats, implement a custom adapter/preprocessor before calling this loader.

Tokenization should be handled by a collator (e.g., RetrievalBiencoderCollator), which is more efficient for batch padding and supports dynamic processing.
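For orientation, here is a hypothetical call (shown in comments, since it needs the installed package and network access) together with the per-example shape the return documentation describes. The concrete field values below are invented; only the key names and the 1-positive-plus-negatives layout come from the documentation above.

```python
# Hypothetical usage (not executed here):
#
#   from nemo_automodel.components.datasets.llm.retrieval_dataset import (
#       make_retrieval_dataset,
#   )
#   ds = make_retrieval_dataset(
#       data_dir_list=["hf://nvidia/embed-nemotron-dataset-v1/SciFact"],
#       data_type="train",
#       train_n_passages=5,
#   )
#
# Each example is documented to be a dict of this shape. With
# train_n_passages=5, doc_text holds 1 positive + 4 negatives:
example = {
    "question": "What does the claim assert?",            # query text
    "doc_text": ["pos"] + [f"neg{i}" for i in range(4)],  # [positive, negatives...]
    "doc_image": [""] * 5,                                # empty strings when text-only
}
```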