nemo_automodel.components.datasets.llm.retrieval_dataset

Module Contents

Classes

Name	Description
`AbstractDataset`	Interface for corpus datasets addressable by document id.
`CorpusInfo`	Data structure to hold corpus metadata and dataset object together.
`HFCorpusDataset`	Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).
`RetrievalTransform`	Stateful transform for retrieval datasets with epoch-based positive cycling.
`TextQADataset`	Load TextQA corpus documents from a HuggingFace dataset path.

Functions

Name	Description
`_cross_encoder_transform_func`	Transform function to convert from raw format to cross-encoder training format.
`_list_hf_subsets`	Discover all subset names in repo_id by finding `dataset_metadata.json` files.
`_load_hf_sources`	Load one or more `hf://` URIs and return `(Dataset, corpus_dict)`.
`_load_hf_subset`	Load a single HF subset and return `(normalized_data_list, CorpusInfo)`.
`_normalize_data_entries`	Normalize a single source or list of sources into parsed entries.
`_parse_data_entry`	Parse a data entry.
`_parse_hf_uri`	Parse an `hf://` URI into `(repo_id, subset_or_none)`.
`_sample_data_items`	-
`_transform_func`	Transform function to convert from raw format to training format.
`add_corpus`	Add one or more corpus paths to a corpus dictionary.
`load_corpus`	Instantiate a corpus dataset from a path and optional metadata.
`load_corpus_metadata`	Load Merlin corpus metadata from a corpus directory.
`load_datasets`	Load datasets from JSON files.
`make_retrieval_dataset`	Load and return dataset in retrieval format for encoder training.

Data

_OVERSAMPLING_WARNED_CORPORA

API

class nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset()

Abstract

Interface for corpus datasets addressable by document id.

nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset.get_all_ids()

abstract

nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset.get_document_by_id(
    id
)

abstract
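The two abstract methods above define the whole contract a corpus must satisfy. The following is a minimal sketch of that contract, not the library's code: `CorpusInterface` and `InMemoryCorpus` are hypothetical names, and the dict-of-documents layout is an assumption made for illustration.

```python
from abc import ABC, abstractmethod


class CorpusInterface(ABC):
    """Hypothetical stand-in for AbstractDataset (not the library class)."""

    @abstractmethod
    def get_document_by_id(self, id):
        """Return the document stored under ``id``."""

    @abstractmethod
    def get_all_ids(self):
        """Return every document id in the corpus."""


class InMemoryCorpus(CorpusInterface):
    """Toy corpus backed by a dict; the document layout is an assumption."""

    def __init__(self, docs):
        self._docs = dict(docs)

    def get_document_by_id(self, id):
        return self._docs[id]

    def get_all_ids(self):
        # Sorted only to make the toy example deterministic.
        return sorted(self._docs)


corpus = InMemoryCorpus({"d1": {"text": "first"}, "d2": {"text": "second"}})
```

Any class that implements both methods can be used wherever a corpus addressable by document id is expected; instantiating the abstract base directly raises `TypeError`.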

class nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo(
    metadata: dict,
    corpus: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset
)

Dataclass

Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.

corpus

AbstractDataset

corpus_id

str

Get corpus ID from metadata

metadata

dict

passage_instruction

str

Get passage instruction from metadata

path

str

Get corpus path from the corpus object

query_instruction

str

Get query instruction from metadata

task_type

str

Get task type from metadata

nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo.get_all_ids()

Delegate to corpus for convenience

nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo.get_document_by_id(
    doc_id: str
)

Delegate to corpus for convenience
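To illustrate how a metadata dict and a corpus object can be bundled with read-only accessors and delegation, here is a small sketch. It is not the library's implementation: `ToyCorpus` and `ToyCorpusInfo` are hypothetical, and the metadata key names (`"corpus_id"`, `"query_instruction"`) and the empty-string default are assumptions.

```python
from dataclasses import dataclass


class ToyCorpus:
    """Hypothetical corpus exposing the two lookup methods and a path."""

    def __init__(self, docs, path=""):
        self.docs = docs
        self.path = path

    def get_document_by_id(self, doc_id):
        return self.docs[doc_id]

    def get_all_ids(self):
        return list(self.docs)


@dataclass
class ToyCorpusInfo:
    """Sketch of a metadata + corpus pair; metadata key names are assumed."""

    metadata: dict
    corpus: ToyCorpus

    @property
    def corpus_id(self) -> str:
        return self.metadata["corpus_id"]  # assumed key name

    @property
    def query_instruction(self) -> str:
        # Assumed key name and assumed default for a missing instruction.
        return self.metadata.get("query_instruction", "")

    @property
    def path(self) -> str:
        return self.corpus.path  # read from the corpus object, not the metadata

    def get_document_by_id(self, doc_id: str):
        return self.corpus.get_document_by_id(doc_id)  # delegate to the corpus

    def get_all_ids(self):
        return self.corpus.get_all_ids()  # delegate to the corpus


info = ToyCorpusInfo(
    metadata={"corpus_id": "toy"},
    corpus=ToyCorpus({"a": "alpha"}, path="/tmp/toy"),
)
```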

class nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset(
    hf_dataset: datasets.Dataset,
    path: str = ''
)

Bases: AbstractDataset

Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).

_docid2idx

nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset.get_all_ids()

nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset.get_document_by_id(
    id
)
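The `_docid2idx` attribute suggests an id-to-row-index map built once at construction time. The sketch below shows that general technique over a plain list of rows; `IndexedCorpus` is a hypothetical class, and the `"id"` column name is an assumption rather than something this page confirms.

```python
class IndexedCorpus:
    """Hypothetical corpus over a list of row dicts; the "id" column is assumed."""

    def __init__(self, rows):
        self._rows = rows
        # Build the id -> row index map once so lookups avoid scanning all rows.
        self._docid2idx = {row["id"]: idx for idx, row in enumerate(rows)}

    def get_document_by_id(self, id):
        return self._rows[self._docid2idx[id]]

    def get_all_ids(self):
        # Dicts preserve insertion order, so ids come back in row order.
        return list(self._docid2idx)


rows = [{"id": "p7", "text": "seven"}, {"id": "p3", "text": "three"}]
indexed = IndexedCorpus(rows)
```

Building the map costs one pass over the data; each subsequent lookup is a dictionary access plus one row read.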

class nemo_automodel.components.datasets.llm.retrieval_dataset.RetrievalTransform(
    num_neg_docs: int,
    corpus_dict: dict,
    use_dataset_instruction: bool = False,
    model_type: str = 'bi_encoder',
    cycle_positive_docs: bool = False
)

Stateful transform for retrieval datasets with epoch-based positive cycling.

This class encapsulates the transform state (epoch, corpus_dict, etc.) and provides a clean interface for updating the epoch without recreating the transform.

epoch

= 0

nemo_automodel.components.datasets.llm.retrieval_dataset.RetrievalTransform.__call__(
    examples
)

nemo_automodel.components.datasets.llm.retrieval_dataset.RetrievalTransform.set_epoch(
    epoch: int
)

Update the epoch for positive document cycling.
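The pattern described here, a callable that keeps the epoch as state so it can be updated without rebuilding the transform, can be sketched as follows. `CyclingPicker` is hypothetical, and selecting the positive with `epoch % len(pos_docs)` is an assumption about how rotation might work, not confirmed behavior of `RetrievalTransform`.

```python
class CyclingPicker:
    """Hypothetical stateful callable; modulo rotation is an assumption."""

    def __init__(self, cycle_positive_docs=False):
        self.cycle_positive_docs = cycle_positive_docs
        self.epoch = 0

    def set_epoch(self, epoch):
        # Update state in place instead of recreating the transform.
        self.epoch = epoch

    def __call__(self, pos_docs):
        if not self.cycle_positive_docs:
            return pos_docs[0]  # always the first positive
        return pos_docs[self.epoch % len(pos_docs)]


picker = CyclingPicker(cycle_positive_docs=True)
picked = []
for epoch in range(4):
    picker.set_epoch(epoch)
    picked.append(picker(["a", "b", "c"]))
```

With three positives and four epochs, the sketch wraps around to the first positive on the fourth epoch.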

class nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset(
    path
)

Bases: AbstractDataset

Load TextQA corpus documents from a HuggingFace dataset path.

data

= load_dataset(path)['train']

nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset.get_all_ids()

nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset.get_document_by_id(
    id
)

nemo_automodel.components.datasets.llm.retrieval_dataset._cross_encoder_transform_func(
    examples,
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False,
    epoch: int = 0
)

Transform function to convert from raw format to cross-encoder training format.

nemo_automodel.components.datasets.llm.retrieval_dataset._list_hf_subsets(
    repo_id: str
) -> typing.List[str]

Discover all subset names in repo_id by finding dataset_metadata.json files.

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_sources(
    hf_entries: typing.List[typing.Tuple[typing.Optional[int], str]],
    seed: int = 42
)

Load one or more hf:// URIs and return (Dataset, corpus_dict).

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_subset(
    repo_id: str,
    subset: str
)

Load a single HF subset and return (normalized_data_list, CorpusInfo).

nemo_automodel.components.datasets.llm.retrieval_dataset._normalize_data_entries(
    data_dir_list: typing.Union[typing.List[nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry], nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry]
) -> typing.List[typing.Tuple[typing.Optional[int], str]]

Normalize a single source or list of sources into parsed entries.

nemo_automodel.components.datasets.llm.retrieval_dataset._parse_data_entry(
    entry: nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry
) -> typing.Tuple[typing.Optional[int], str]

Parse a data entry.

Supported forms:

"path_or_hf_uri": use all samples
{"path": "path_or_hf_uri", "num_samples": N}: sample N examples once from that source
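The two supported forms can be handled with a short type dispatch. This is a sketch only: `parse_entry` and `normalize_entries` are hypothetical helpers, and the `(num_samples, path)` tuple order is inferred from the documented return type `Tuple[Optional[int], str]` rather than confirmed.

```python
from typing import List, Optional, Tuple, Union

Entry = Union[str, dict]


def parse_entry(entry: Entry) -> Tuple[Optional[int], str]:
    """Return (num_samples, path); the tuple order is an assumption."""
    if isinstance(entry, str):
        return None, entry  # plain string: use all samples
    if isinstance(entry, dict) and "path" in entry:
        return entry.get("num_samples"), entry["path"]
    raise ValueError(f"unsupported data entry: {entry!r}")


def normalize_entries(entries: Union[List[Entry], Entry]) -> List[Tuple[Optional[int], str]]:
    """Accept a single entry or a list of entries and parse each one."""
    if isinstance(entries, (str, dict)):
        entries = [entries]
    return [parse_entry(e) for e in entries]


parsed = normalize_entries(["train.json", {"path": "hf://org/repo/sub", "num_samples": 100}])
```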

nemo_automodel.components.datasets.llm.retrieval_dataset._parse_hf_uri(
    uri: str
)

Parse an hf:// URI into (repo_id, subset_or_none).

Examples::

"hf://nvidia/embed-nemotron-dataset-v1/FEVER" -> ("nvidia/embed-nemotron-dataset-v1", "FEVER")
"hf://nvidia/embed-nemotron-dataset-v1" -> ("nvidia/embed-nemotron-dataset-v1", None)
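One way to reproduce the two documented examples is to strip the prefix and treat the first two path segments as the repository id. The function below is a hypothetical re-implementation for illustration; treating everything after the second segment as the subset, and the error handling, are assumptions.

```python
from typing import Optional, Tuple

HF_PREFIX = "hf://"


def parse_hf_uri(uri: str) -> Tuple[str, Optional[str]]:
    """Split an hf:// URI into (repo_id, subset_or_none)."""
    if not uri.startswith(HF_PREFIX):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    parts = uri[len(HF_PREFIX):].strip("/").split("/")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"expected hf://<org>/<repo>[/<subset>], got {uri!r}")
    repo_id = "/".join(parts[:2])
    # Assumption: anything after <org>/<repo> names the subset.
    subset = "/".join(parts[2:]) or None
    return repo_id, subset
```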

nemo_automodel.components.datasets.llm.retrieval_dataset._sample_data_items(
    data_items: typing.List[dict],
    num_samples: typing.Optional[int],
    source: str,
    seed: int
) -> typing.List[dict]

nemo_automodel.components.datasets.llm.retrieval_dataset._transform_func(
    examples,
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False,
    epoch: int = 0
)

Transform function to convert from raw format to training format.

Parameters:

examples

Batch of examples with question, corpus_id, pos_doc, neg_doc

num_neg_docs

Number of negative documents to use

corpus_dict

Dictionary mapping corpus_id to corpus objects

use_dataset_instruction

bool, defaults to False

Whether to use instruction from dataset’s metadata

epoch

int, defaults to 0

Current epoch for cycling through positive documents

nemo_automodel.components.datasets.llm.retrieval_dataset.add_corpus(
    qa_corpus_paths: typing.Union[dict, list],
    corpus_dict: dict
)

Add one or more corpus paths to a corpus dictionary.

nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus(
    path,
    metadata: typing.Optional[dict] = None
)

Instantiate a corpus dataset from a path and optional metadata.

nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus_metadata(
    path: str
)

Load Merlin corpus metadata from a corpus directory.

nemo_automodel.components.datasets.llm.retrieval_dataset.load_datasets(
    data_dir_list: typing.Union[typing.List[nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry], nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry],
    concatenate: bool = True,
    seed: int = 42
)

Load datasets from JSON files.

Entries can be strings (use all samples) or dictionaries with path and optional num_samples fields (sample a fixed subset once while loading).

Returns:

Tuple of (dataset, corpus_dict)

nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset(
    data_dir_list: typing.Union[typing.List[nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry], nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry] = None,
    model_type: str = 'bi_encoder',
    data_type: str = 'train',
    n_passages: int = 5,
    eval_negative_size: int = None,
    seed: int = 42,
    do_shuffle: bool = False,
    max_train_samples: int = None,
    train_data_select_offset: int = 0,
    use_dataset_instruction: bool = False,
    cycle_positive_docs: bool = False
)

Load and return dataset in retrieval format for encoder training.

Entries in data_dir_list can be local JSON file paths or hf:// URIs pointing to a HuggingFace dataset repository (e.g. hf://nvidia/embed-nemotron-dataset-v1/SciFact). A source can also be provided as {"path": path_or_uri, "num_samples": N} to sample a fixed subset once while loading. The transform is attached with set_transform(), so it is evaluated lazily when examples are accessed; tokenization is handled by the collator.

Parameters:

data_dir_list

Union[List[DataEntry], DataEntry], defaults to None

Path(s) to JSON file(s), hf:// URIs, or dictionary entries with path and num_samples.

model_type

str, defaults to 'bi_encoder'

Model type: "bi_encoder" (default) or "cross_encoder"

data_type

str, defaults to 'train'

Type of data ("train" or "eval")

n_passages

int, defaults to 5

Number of passages per query (1 positive + n_passages - 1 negatives)

eval_negative_size

int, defaults to None

Number of negative documents for evaluation

seed

int, defaults to 42

Random seed for reproducibility (for shuffling if needed)

do_shuffle

bool, defaults to False

Shuffle dataset rows before subset selection. Only applied when max_train_samples is set; otherwise iteration order is controlled by the dataloader’s sampler (e.g. StatefulDistributedSampler).

max_train_samples

int, defaults to None

Maximum number of training samples to use

train_data_select_offset

int, defaults to 0

Offset for selecting training samples

use_dataset_instruction

bool, defaults to False

Whether to use instruction from dataset’s metadata

cycle_positive_docs

bool, defaults to False

Whether training should cycle through positive documents across epochs. Defaults to False (always use the first positive document). Set to True only when a query has multiple positive documents and you want to rotate through them by epoch.
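The interaction between do_shuffle, max_train_samples, and train_data_select_offset described in the parameters above can be sketched on a plain list. `select_train_rows` is a hypothetical helper, and applying the offset as the start of a contiguous slice is an assumption about the selection logic, not confirmed by this page.

```python
import random


def select_train_rows(rows, max_train_samples=None, train_data_select_offset=0,
                      do_shuffle=False, seed=42):
    """Optionally shuffle, then take a window of rows (assumed slice semantics)."""
    rows = list(rows)
    if max_train_samples is None:
        # No subset requested: shuffling is skipped, order is left to the sampler.
        return rows
    if do_shuffle:
        # Seeded shuffle so the same subset is chosen on every run.
        random.Random(seed).shuffle(rows)
    start = train_data_select_offset
    return rows[start:start + max_train_samples]


window = select_train_rows(range(10), max_train_samples=3, train_data_select_offset=2)
```

Shuffling before slicing means the subset is a seeded random sample rather than the first rows of the file, while leaving the full dataset unshuffled keeps ordering under the dataloader sampler's control.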

Returns:

A HuggingFace Dataset where each example is a dict with keys:

nemo_automodel.components.datasets.llm.retrieval_dataset.DATASETS = {'TextQADataset': TextQADataset}

nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry = Union[str, dict[str, Any]]

nemo_automodel.components.datasets.llm.retrieval_dataset.EXAMPLE_TEMPLATE = {'text': '', 'image': '', 'nr_ocr': ''}

nemo_automodel.components.datasets.llm.retrieval_dataset._HF_PREFIX = 'hf://'

nemo_automodel.components.datasets.llm.retrieval_dataset._OVERSAMPLING_WARNED_CORPORA: set[str] = set()

nemo_automodel.components.datasets.llm.retrieval_dataset._VALID_MODEL_TYPES = ('bi_encoder', 'cross_encoder')

nemo_automodel.components.datasets.llm.retrieval_dataset.args = parser.parse_args()

nemo_automodel.components.datasets.llm.retrieval_dataset.dataset = make_retrieval_dataset(data_dir_list=(args.data_dir_list), data_type=(args.data_...

nemo_automodel.components.datasets.llm.retrieval_dataset.example = dataset[0]

nemo_automodel.components.datasets.llm.retrieval_dataset.parser = argparse.ArgumentParser(description='Load and transform dataset to retrieval for...