nemo_automodel.components.datasets.llm.retrieval_dataset


Module Contents

Classes

| Name | Description |
| --- | --- |
| AbstractDataset | Interface for corpus datasets addressable by document id. |
| CorpusInfo | Data structure to hold corpus metadata and dataset object together. |
| HFCorpusDataset | Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet). |
| RetrievalTransform | Stateful transform for retrieval datasets with epoch-based positive cycling. |
| TextQADataset | Load TextQA corpus documents from a HuggingFace dataset path. |

Functions

| Name | Description |
| --- | --- |
| _cross_encoder_transform_func | Transform function to convert from raw format to cross-encoder training format. |
| _list_hf_subsets | Discover all subset names in repo_id by finding dataset_metadata.json files. |
| _load_hf_sources | Load one or more hf:// URIs and return (Dataset, corpus_dict). |
| _load_hf_subset | Load a single HF subset and return (normalized_data_list, CorpusInfo). |
| _normalize_data_entries | Normalize a single source or list of sources into parsed entries. |
| _parse_data_entry | Parse a data entry. |
| _parse_hf_uri | Parse an hf:// URI into (repo_id, subset_or_none). |
| _sample_data_items | - |
| _transform_func | Transform function to convert from raw format to training format. |
| add_corpus | Add one or more corpus paths to a corpus dictionary. |
| load_corpus | Instantiate a corpus dataset from a path and optional metadata. |
| load_corpus_metadata | Load Merlin corpus metadata from a corpus directory. |
| load_datasets | Load datasets from JSON files. |
| make_retrieval_dataset | Load and return dataset in retrieval format for encoder training. |

Data

DATASETS

DataEntry

EXAMPLE_TEMPLATE

_HF_PREFIX

_OVERSAMPLING_WARNED_CORPORA

_VALID_MODEL_TYPES

args

dataset

example

parser

API

class nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset()
Abstract

Interface for corpus datasets addressable by document id.

nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset.get_all_ids()
abstract
nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset.get_document_by_id(
id
)
abstract
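
The two abstract methods above define the whole contract a corpus must satisfy. The following is a minimal sketch of that contract, assuming the interface is built on Python's `abc` module; `CorpusInterface`, `DictCorpus`, and `IncompleteCorpus` are hypothetical names defined here so the example runs without the package installed.

```python
from abc import ABC, abstractmethod


class CorpusInterface(ABC):
    """Stand-in for AbstractDataset (hypothetical, redefined so the sketch runs alone)."""

    @abstractmethod
    def get_document_by_id(self, id):
        """Return the document stored under the given id."""

    @abstractmethod
    def get_all_ids(self):
        """Return every document id in the corpus."""


class DictCorpus(CorpusInterface):
    """Hypothetical corpus backed by a plain dict of id -> document."""

    def __init__(self, docs):
        self._docs = dict(docs)

    def get_document_by_id(self, id):
        return self._docs[id]

    def get_all_ids(self):
        return sorted(self._docs)


class IncompleteCorpus(CorpusInterface):
    """Implements only one of the two methods, so it cannot be instantiated."""

    def get_all_ids(self):
        return []


corpus = DictCorpus({"d2": {"text": "beta"}, "d1": {"text": "alpha"}})

try:
    IncompleteCorpus()
    instantiation_failed = False
except TypeError:
    # abc refuses to build a class that leaves an abstract method undefined.
    instantiation_failed = True
```

A subclass that omits either method fails at construction time rather than at the first lookup, which is the main benefit of declaring the methods abstract.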
class nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo(
metadata: dict,
corpus: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset
)
Dataclass

Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.

corpus
AbstractDataset
corpus_id
str

Get corpus ID from metadata

metadata
dict
passage_instruction
str

Get passage instruction from metadata

path
str

Get corpus path from the corpus object

query_instruction
str

Get query instruction from metadata

task_type
str

Get task type from metadata

nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo.get_all_ids()

Delegate to corpus for convenience

nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo.get_document_by_id(
doc_id: str
)

Delegate to corpus for convenience
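
As a rough illustration of how the attributes and delegating methods above could fit together, here is a small sketch. It is not the library's code: `CorpusInfoSketch` and `TinyCorpus` are hypothetical, and the metadata key names (`"corpus_id"`, `"query_instruction"`) are assumptions, since this reference does not show which keys the real properties read.

```python
from dataclasses import dataclass


class TinyCorpus:
    """Hypothetical corpus object exposing a path and the two lookup methods."""

    def __init__(self, path, docs):
        self.path = path
        self._docs = dict(docs)

    def get_all_ids(self):
        return list(self._docs)

    def get_document_by_id(self, doc_id):
        return self._docs[doc_id]


@dataclass
class CorpusInfoSketch:
    metadata: dict
    corpus: TinyCorpus

    @property
    def corpus_id(self) -> str:
        # Assumed metadata key, chosen only for illustration.
        return self.metadata["corpus_id"]

    @property
    def query_instruction(self) -> str:
        # Assumed metadata key, chosen only for illustration.
        return self.metadata.get("query_instruction", "")

    @property
    def path(self) -> str:
        # The path comes from the corpus object, not from the metadata.
        return self.corpus.path

    def get_all_ids(self):
        return self.corpus.get_all_ids()

    def get_document_by_id(self, doc_id: str):
        return self.corpus.get_document_by_id(doc_id)


info = CorpusInfoSketch(
    metadata={"corpus_id": "demo", "query_instruction": "Find the passage."},
    corpus=TinyCorpus("/tmp/demo_corpus", {"d1": {"text": "alpha"}}),
)
```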

class nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset(
hf_dataset: datasets.Dataset,
path: str = ''
)

Bases: AbstractDataset

Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).

_docid2idx
nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset.get_all_ids()
nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset.get_document_by_id(
id
)
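
The `_docid2idx` attribute suggests an id-to-row-index map built once over the wrapped dataset. The sketch below shows that idea on a plain list of rows instead of a `datasets.Dataset`; `ListCorpus` is a hypothetical class and the `"id"` column name is an assumption.

```python
class ListCorpus:
    """Hypothetical in-memory corpus with an id -> row index lookup table."""

    def __init__(self, rows, id_column="id"):
        self._rows = rows
        # Built once so each lookup is a dict access instead of a scan over all rows.
        self._docid2idx = {row[id_column]: idx for idx, row in enumerate(rows)}

    def get_all_ids(self):
        return list(self._docid2idx)

    def get_document_by_id(self, id):
        return self._rows[self._docid2idx[id]]


rows = [
    {"id": "doc-a", "text": "first passage"},
    {"id": "doc-b", "text": "second passage"},
]
list_corpus = ListCorpus(rows)
```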
class nemo_automodel.components.datasets.llm.retrieval_dataset.RetrievalTransform(
num_neg_docs: int,
corpus_dict: dict,
use_dataset_instruction: bool = False,
model_type: str = 'bi_encoder',
cycle_positive_docs: bool = False
)

Stateful transform for retrieval datasets with epoch-based positive cycling.

This class encapsulates the transform state (epoch, corpus_dict, etc.) and provides a clean interface for updating the epoch without recreating the transform.

epoch = 0
nemo_automodel.components.datasets.llm.retrieval_dataset.RetrievalTransform.__call__(
examples
)
nemo_automodel.components.datasets.llm.retrieval_dataset.RetrievalTransform.set_epoch(
epoch: int
)

Update the epoch for positive document cycling.
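
One way to picture epoch-based positive cycling is a small stateful object whose epoch is updated between passes. This is a sketch only: `EpochCyclingPicker` is hypothetical, and the modulo rotation rule is an assumption about how the cycling might work, not a statement of what `RetrievalTransform` does internally.

```python
class EpochCyclingPicker:
    """Hypothetical helper: keeps the current epoch and picks one positive per query."""

    def __init__(self, cycle_positive_docs: bool = False):
        self.cycle_positive_docs = cycle_positive_docs
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Updating the state avoids rebuilding the transform each epoch.
        self.epoch = epoch

    def pick_positive(self, pos_docs: list):
        if not self.cycle_positive_docs:
            return pos_docs[0]
        # Assumed rotation rule: epoch modulo the number of positives.
        return pos_docs[self.epoch % len(pos_docs)]


picker = EpochCyclingPicker(cycle_positive_docs=True)
positives = ["p0", "p1", "p2"]
chosen = []
for epoch in range(4):
    picker.set_epoch(epoch)
    chosen.append(picker.pick_positive(positives))

fixed_picker = EpochCyclingPicker(cycle_positive_docs=False)
fixed_picker.set_epoch(2)
fixed_choice = fixed_picker.pick_positive(positives)
```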

class nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset(
path
)

Bases: AbstractDataset

Load TextQA corpus documents from a HuggingFace dataset path.

data = load_dataset(path)['train']
nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset.get_all_ids()
nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset.get_document_by_id(
id
)
nemo_automodel.components.datasets.llm.retrieval_dataset._cross_encoder_transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
epoch: int = 0
)

Transform function to convert from raw format to cross-encoder training format.

nemo_automodel.components.datasets.llm.retrieval_dataset._list_hf_subsets(
repo_id: str
) -> typing.List[str]

Discover all subset names in repo_id by finding dataset_metadata.json files.

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_sources(
hf_entries: typing.List[typing.Tuple[typing.Optional[int], str]],
seed: int = 42
)

Load one or more hf:// URIs and return (Dataset, corpus_dict).

nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_subset(
repo_id: str,
subset: str
)

Load a single HF subset and return (normalized_data_list, CorpusInfo).

nemo_automodel.components.datasets.llm.retrieval_dataset._normalize_data_entries(
data_dir_list: typing.Union[typing.List[nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry], nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry]
) -> typing.List[typing.Tuple[typing.Optional[int], str]]

Normalize a single source or list of sources into parsed entries.

nemo_automodel.components.datasets.llm.retrieval_dataset._parse_data_entry(
entry: nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry
) -> typing.Tuple[typing.Optional[int], str]

Parse a data entry.

Supported forms:

  • "path_or_hf_uri": use all samples
  • {"path": "path_or_hf_uri", "num_samples": N}: sample N examples once from that source
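
The two supported forms map naturally onto the `(Optional[int], str)` return type in the signature. The sketch below is a hypothetical re-implementation for illustration; `parse_data_entry_sketch` and its error messages are not taken from the library.

```python
from typing import Any, Dict, Optional, Tuple, Union

DataEntrySketch = Union[str, Dict[str, Any]]


def parse_data_entry_sketch(entry: DataEntrySketch) -> Tuple[Optional[int], str]:
    """Return (num_samples, path); num_samples is None when all samples are used."""
    if isinstance(entry, str):
        return None, entry
    if isinstance(entry, dict):
        if "path" not in entry:
            raise ValueError("dictionary entries need a 'path' field")
        # num_samples is optional; a missing key means "use everything".
        return entry.get("num_samples"), entry["path"]
    raise TypeError(f"unsupported data entry type: {type(entry).__name__}")


plain = parse_data_entry_sketch("train.json")
sampled = parse_data_entry_sketch({"path": "train.json", "num_samples": 100})
```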
nemo_automodel.components.datasets.llm.retrieval_dataset._parse_hf_uri(
uri: str
)

Parse an hf:// URI into (repo_id, subset_or_none).

Examples:

"hf://nvidia/embed-nemotron-dataset-v1/FEVER" -> ("nvidia/embed-nemotron-dataset-v1", "FEVER")
"hf://nvidia/embed-nemotron-dataset-v1" -> ("nvidia/embed-nemotron-dataset-v1", None)
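
A minimal sketch that reproduces the two documented examples is shown below. It assumes the repository id is always the first two path segments (`owner/name`) and that anything after them is the subset; `parse_hf_uri_sketch` is a hypothetical function, not the library's implementation.

```python
from typing import Optional, Tuple

HF_PREFIX = "hf://"  # mirrors the documented _HF_PREFIX value


def parse_hf_uri_sketch(uri: str) -> Tuple[str, Optional[str]]:
    """Split an hf:// URI into (repo_id, subset_or_none)."""
    if not uri.startswith(HF_PREFIX):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    parts = uri[len(HF_PREFIX):].strip("/").split("/")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"expected hf://owner/name[/subset], got {uri!r}")
    repo_id = "/".join(parts[:2])
    # Anything after owner/name is treated as the subset (assumption).
    subset = "/".join(parts[2:]) or None
    return repo_id, subset


with_subset = parse_hf_uri_sketch("hf://nvidia/embed-nemotron-dataset-v1/FEVER")
without_subset = parse_hf_uri_sketch("hf://nvidia/embed-nemotron-dataset-v1")
```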

nemo_automodel.components.datasets.llm.retrieval_dataset._sample_data_items(
data_items: typing.List[dict],
num_samples: typing.Optional[int],
source: str,
seed: int
) -> typing.List[dict]
nemo_automodel.components.datasets.llm.retrieval_dataset._transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
epoch: int = 0
)

Transform function to convert from raw format to training format.

Parameters:

examples

Batch of examples with question, corpus_id, pos_doc, neg_doc

num_neg_docs

Number of negative documents to use

corpus_dict

Dictionary mapping corpus_id to corpus objects

use_dataset_instruction
bool, defaults to False

Whether to use instruction from dataset’s metadata

epoch
int, defaults to 0

Current epoch for cycling through positive documents
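
To make the parameter list concrete, the sketch below walks a batch in the columnar layout that a HuggingFace `set_transform` callback receives (a dict of lists). Everything about the record shapes is an assumption for illustration: `transform_batch_sketch` is hypothetical, each `pos_doc`/`neg_doc` item is assumed to be a dict with an `"id"` key, the corpus is a plain dict standing in for a corpus object, and the output keys are invented.

```python
def transform_batch_sketch(examples, num_neg_docs, corpus_dict, epoch=0):
    """Hypothetical batch transform; output keys are illustrative only."""
    out = {"question": [], "doc_text": []}
    columns = zip(
        examples["question"],
        examples["corpus_id"],
        examples["pos_doc"],
        examples["neg_doc"],
    )
    for question, corpus_id, pos_docs, neg_docs in columns:
        corpus = corpus_dict[corpus_id]
        # Assumed positive selection: rotate by epoch when several positives exist.
        positive = pos_docs[epoch % len(pos_docs)]
        selected = [positive] + list(neg_docs[:num_neg_docs])
        out["question"].append(question)
        # Resolve document ids to text through the per-corpus lookup.
        out["doc_text"].append([corpus[doc["id"]] for doc in selected])
    return out


corpus_dict = {"c1": {"d1": "pos text", "d2": "neg A", "d3": "neg B", "d4": "neg C"}}
batch = {
    "question": ["what?"],
    "corpus_id": ["c1"],
    "pos_doc": [[{"id": "d1"}]],
    "neg_doc": [[{"id": "d2"}, {"id": "d3"}, {"id": "d4"}]],
}
result = transform_batch_sketch(batch, num_neg_docs=2, corpus_dict=corpus_dict)
```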

nemo_automodel.components.datasets.llm.retrieval_dataset.add_corpus(
qa_corpus_paths: typing.Union[dict, list],
corpus_dict: dict
)

Add one or more corpus paths to a corpus dictionary.

nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus(
path,
metadata: typing.Optional[dict] = None
)

Instantiate a corpus dataset from a path and optional metadata.

nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus_metadata(
path: str
)

Load Merlin corpus metadata from a corpus directory.

nemo_automodel.components.datasets.llm.retrieval_dataset.load_datasets(
data_dir_list: typing.Union[typing.List[nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry], nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry],
concatenate: bool = True,
seed: int = 42
)

Load datasets from JSON files.

Entries can be strings (use all samples) or dictionaries with path and optional num_samples fields (sample a fixed subset once while loading).

Returns:

Tuple of (dataset, corpus_dict)
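
Sampling "a fixed subset once while loading" can be pictured as a seeded draw, so the same seed always yields the same subset. The sketch below assumes sampling without replacement; `sample_once` is a hypothetical helper. How the module handles a request larger than the source is not described in this reference, so the sketch simply rejects that case.

```python
import random
from typing import List, Optional


def sample_once(data_items: List[dict], num_samples: Optional[int], seed: int) -> List[dict]:
    """Return all items when num_samples is None, else a seeded random subset."""
    if num_samples is None:
        return list(data_items)
    if num_samples > len(data_items):
        # Oversampling behaviour is not documented here; left out of this sketch.
        raise ValueError("num_samples exceeds the number of available items")
    # A private Random instance keeps the draw independent of global random state.
    return random.Random(seed).sample(data_items, num_samples)


items = [{"question_id": i} for i in range(10)]
first_draw = sample_once(items, 3, seed=42)
second_draw = sample_once(items, 3, seed=42)
everything = sample_once(items, None, seed=42)
```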

nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset(
data_dir_list: typing.Union[typing.List[nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry], nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry] = None,
model_type: str = 'bi_encoder',
data_type: str = 'train',
n_passages: int = 5,
eval_negative_size: int = None,
seed: int = 42,
do_shuffle: bool = False,
max_train_samples: int = None,
train_data_select_offset: int = 0,
use_dataset_instruction: bool = False,
cycle_positive_docs: bool = False
)

Load and return dataset in retrieval format for encoder training.

Entries in data_dir_list can be local JSON file paths or hf:// URIs pointing to a HuggingFace dataset repository (e.g. hf://nvidia/embed-nemotron-dataset-v1/SciFact). A source can also be provided as {"path": path_or_uri, "num_samples": N} to sample a fixed subset once while loading. Uses set_transform() for lazy evaluation — tokenization is handled by the collator.
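
A call mixing the entry forms described above might look like the following sketch. The local path is hypothetical, the `hf://` URI is the one quoted in the description, and the keyword arguments come from the signature; the import is kept inside the function so the snippet loads even where the package is not installed.

```python
# Mixed entry forms: a plain path (all samples) and a dict entry (fixed-size sample).
data_dir_list = [
    "/path/to/train.json",  # hypothetical local JSON file
    {"path": "hf://nvidia/embed-nemotron-dataset-v1/SciFact", "num_samples": 1000},
]


def build_train_dataset(entries):
    """Build a bi-encoder training dataset from the given entries."""
    # Imported lazily: requires nemo_automodel to be installed when called.
    from nemo_automodel.components.datasets.llm.retrieval_dataset import (
        make_retrieval_dataset,
    )

    return make_retrieval_dataset(
        data_dir_list=entries,
        model_type="bi_encoder",
        data_type="train",
        n_passages=5,
        seed=42,
    )
```

Calling `build_train_dataset(data_dir_list)` needs the package, the local file, and network access for the `hf://` source, so it is left uncalled here.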

Parameters:

data_dir_list
Union[List[DataEntry], DataEntry], defaults to None

Path(s) to JSON file(s), hf:// URIs, or dictionary entries with path and num_samples.

model_type
str, defaults to 'bi_encoder'

"bi_encoder" (default) or "cross_encoder"

data_type
str, defaults to 'train'

Type of data ("train" or "eval")

n_passages
int, defaults to 5

Number of passages (1 positive + n-1 negatives)

eval_negative_size
int, defaults to None

Number of negative documents for evaluation

seed
int, defaults to 42

Random seed for reproducibility (for shuffling if needed)

do_shuffle
bool, defaults to False

Shuffle dataset rows before subset selection. Only applied when max_train_samples is set; otherwise iteration order is controlled by the dataloader’s sampler (e.g. StatefulDistributedSampler).

max_train_samples
int, defaults to None

Maximum number of training samples to use

train_data_select_offset
int, defaults to 0

Offset for selecting training samples

use_dataset_instruction
bool, defaults to False

Whether to use instruction from dataset’s metadata

cycle_positive_docs
bool, defaults to False

Whether training should cycle through positive documents across epochs. Defaults to False (always use the first positive document). Set to True only when a query has multiple positive documents and you want to rotate through them by epoch.
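
The interaction between `do_shuffle`, `max_train_samples`, and `train_data_select_offset` can be sketched on row indices alone. This is an interpretation of the parameter descriptions, not the library's code: `select_train_indices` is hypothetical, and treating the offset as the start of a contiguous window is an assumption.

```python
import random
from typing import List, Optional


def select_train_indices(
    num_rows: int,
    max_train_samples: Optional[int] = None,
    train_data_select_offset: int = 0,
    do_shuffle: bool = False,
    seed: int = 42,
) -> List[int]:
    """Return the row indices kept for training."""
    indices = list(range(num_rows))
    if max_train_samples is None:
        # No subset selection, so shuffling is skipped and order is left to the sampler.
        return indices
    if do_shuffle:
        random.Random(seed).shuffle(indices)
    start = train_data_select_offset
    # Assumed semantics: a window of max_train_samples rows starting at the offset.
    return indices[start:start + max_train_samples]


all_rows = select_train_indices(10, do_shuffle=True)
window = select_train_indices(10, max_train_samples=3, train_data_select_offset=2)
shuffled_a = select_train_indices(10, max_train_samples=4, do_shuffle=True, seed=0)
shuffled_b = select_train_indices(10, max_train_samples=4, do_shuffle=True, seed=0)
```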

Returns:

A HuggingFace Dataset where each example is a dict with keys:

nemo_automodel.components.datasets.llm.retrieval_dataset.DATASETS = {'TextQADataset': TextQADataset}
nemo_automodel.components.datasets.llm.retrieval_dataset.DataEntry = Union[str, dict[str, Any]]
nemo_automodel.components.datasets.llm.retrieval_dataset.EXAMPLE_TEMPLATE = {'text': '', 'image': '', 'nr_ocr': ''}
nemo_automodel.components.datasets.llm.retrieval_dataset._HF_PREFIX = 'hf://'
nemo_automodel.components.datasets.llm.retrieval_dataset._OVERSAMPLING_WARNED_CORPORA: set[str] = set()
nemo_automodel.components.datasets.llm.retrieval_dataset._VALID_MODEL_TYPES = ('bi_encoder', 'cross_encoder')
nemo_automodel.components.datasets.llm.retrieval_dataset.args = parser.parse_args()
nemo_automodel.components.datasets.llm.retrieval_dataset.dataset = make_retrieval_dataset(data_dir_list=(args.data_dir_list), data_type=(args.data_...
nemo_automodel.components.datasets.llm.retrieval_dataset.example = dataset[0]
nemo_automodel.components.datasets.llm.retrieval_dataset.parser = argparse.ArgumentParser(description='Load and transform dataset to retrieval for...