nemo_automodel.components.datasets.llm.retrieval_dataset#

Module Contents#

Classes#

AbstractDataset

TextQADataset

CorpusInfo

Data structure that holds corpus metadata and the corresponding dataset object together. Provides easy access to both components with descriptive attribute names.

Functions#

load_corpus_metadata

load_corpus

add_corpus

load_datasets

Load datasets from JSON files.

_transform_func

Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.

_create_transform_func

Create transform function with specified number of negative documents.

make_retrieval_dataset

Load and return a dataset in retrieval format for biencoder training.

Data#

EXAMPLE_TEMPLATE

DATASETS

API#

nemo_automodel.components.datasets.llm.retrieval_dataset.EXAMPLE_TEMPLATE#

None

class nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#

Bases: abc.ABC

abstractmethod get_document_by_id(id)#
abstractmethod get_all_ids()#
class nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset(path)#

Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset

get_document_by_id(id)#
get_all_ids()#
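
The abstract interface is intentionally small, so a corpus can be backed by any storage. A minimal illustrative sketch (the dict payload shape is an assumption for illustration, not the on-disk schema TextQADataset reads):

```python
from nemo_automodel.components.datasets.llm.retrieval_dataset import AbstractDataset


class InMemoryCorpus(AbstractDataset):
    """Illustrative corpus backed by a plain dict (hypothetical helper)."""

    def __init__(self, docs: dict):
        # docs maps a document id to its payload, e.g. {"text": ...}
        self._docs = docs

    def get_document_by_id(self, id):
        return self._docs[id]

    def get_all_ids(self):
        return list(self._docs.keys())


corpus = InMemoryCorpus({"d0": {"text": "Example passage."}})
assert corpus.get_all_ids() == ["d0"]
```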
nemo_automodel.components.datasets.llm.retrieval_dataset.DATASETS#

None

class nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo#

Data structure that holds corpus metadata and the corresponding dataset object together. Provides easy access to both components with descriptive attribute names.

metadata: dict#

None

corpus: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#

None

property corpus_id: str#

Get the corpus ID from metadata.

property query_instruction: str#

Get the query instruction from metadata.

property passage_instruction: str#

Get the passage instruction from metadata.

property task_type: str#

Get the task type from metadata.

property path: str#

Get the corpus path from the corpus object.

get_document_by_id(doc_id: str)#

Delegate to the corpus for convenience.

get_all_ids()#

Delegate to the corpus for convenience.
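
A hedged usage sketch, assuming CorpusInfo can be constructed from its two documented fields (e.g., as a dataclass) and that the metadata keys mirror the property names above; InMemoryCorpus is the illustrative class from the sketch under TextQADataset:

```python
from nemo_automodel.components.datasets.llm.retrieval_dataset import CorpusInfo

info = CorpusInfo(
    metadata={
        "corpus_id": "wiki_qa",              # assumed metadata keys that
        "query_instruction": "query: ",      # mirror the properties above
        "passage_instruction": "passage: ",
        "task_type": "retrieval",
    },
    corpus=InMemoryCorpus({"d0": {"text": "Example passage."}}),
)

print(info.corpus_id)                # read from metadata
doc = info.get_document_by_id("d0")  # delegates to info.corpus
```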

nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus_metadata(path: str)#
nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus(path, metadata: Optional[dict] = None)#
nemo_automodel.components.datasets.llm.retrieval_dataset.add_corpus(
qa_corpus_paths: Union[dict, list],
corpus_dict: dict,
)#
nemo_automodel.components.datasets.llm.retrieval_dataset.load_datasets(
data_dir_list: Union[List[str], str],
concatenate: bool = True,
)#

Load datasets from JSON files.

Copied from nemo-retriever-research/src/data/datasets.py

Returns:

Tuple of (dataset, corpus_dict), where corpus_dict maps each corpus_id to its corpus object
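
A minimal usage sketch (the paths are placeholders; the expected JSON layout is defined by the copied loader and not documented here):

```python
from nemo_automodel.components.datasets.llm.retrieval_dataset import load_datasets

dataset, corpus_dict = load_datasets(
    data_dir_list=["/data/part_a.json", "/data/part_b.json"],  # placeholders
    concatenate=True,  # merge the per-file datasets into one dataset
)
```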

nemo_automodel.components.datasets.llm.retrieval_dataset._transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.

Parameters:
  • examples – Batch of examples with question, corpus_id, pos_doc, neg_doc (see the sketch after this list)

  • num_neg_docs – Number of negative documents to use

  • corpus_dict – Dictionary mapping corpus_id to corpus objects

  • use_dataset_instruction – Whether to use instruction from dataset’s metadata
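
For orientation, a hypothetical raw batch in the shape the parameter description implies; whether pos_doc and neg_doc carry document ids (resolved via corpus_dict) or inline texts depends on the data files, so treat this purely as an illustration:

```python
# Hypothetical raw batch; field names follow the parameter description above.
examples = {
    "question": ["What is a biencoder?"],
    "corpus_id": ["wiki_qa"],
    "pos_doc": [["d12"]],              # positive document reference(s)
    "neg_doc": [["d3", "d44", "d7"]],  # negative document references
}
```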

nemo_automodel.components.datasets.llm.retrieval_dataset._create_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Create transform function with specified number of negative documents.
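
A sketch of the intended closure pattern (these are private helpers, so treat the exact call as illustrative): the returned callable can be handed to datasets.Dataset.set_transform(), which is how make_retrieval_dataset below applies it lazily:

```python
from nemo_automodel.components.datasets.llm.retrieval_dataset import (
    _create_transform_func,
    load_datasets,
)

dataset, corpus_dict = load_datasets("/data/part_a.json")  # placeholder path
transform = _create_transform_func(
    num_neg_docs=4,
    corpus_dict=corpus_dict,
    use_dataset_instruction=False,
)
dataset.set_transform(transform)  # applied lazily, per accessed batch
```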

nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset(
data_dir_list: Union[List[str], str],
data_type: str = 'train',
train_n_passages: int = 5,
eval_negative_size: int = 10,
seed: int = 42,
do_shuffle: bool = False,
max_train_samples: Optional[int] = None,
train_data_select_offset: int = 0,
use_dataset_instruction: bool = False,
)#

Load and return a dataset in retrieval format for biencoder training.

This function loads data from JSON files using the same method as RetrievalMultiModalDatasetLoader and returns it ready for training. It uses set_transform() for lazy evaluation; tokenization is handled by the collator.

Parameters:
  • data_dir_list – Path(s) to JSON file(s) containing training data

  • data_type – Type of data (“train” or “eval”)

  • train_n_passages – Number of passages for training (1 positive + n-1 negatives)

  • eval_negative_size – Number of negative documents for evaluation

  • seed – Random seed for reproducibility (for shuffling if needed)

  • do_shuffle – Whether to shuffle the dataset

  • max_train_samples – Maximum number of training samples to use

  • train_data_select_offset – Offset for selecting training samples

  • use_dataset_instruction – Whether to use the instruction from the dataset's metadata

Returns:

A HuggingFace Dataset where each example is a dict with keys:

  • 'question': Query text

  • 'doc_text': List of document texts [positive, negatives…]

  • 'doc_image': List of images or empty strings

Note:

Tokenization should be handled by a collator (e.g., RetrievalBiencoderCollator), which is more efficient for batch padding and supports dynamic processing.
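
Putting it together, a usage sketch with placeholder paths and sizes:

```python
from nemo_automodel.components.datasets.llm.retrieval_dataset import (
    make_retrieval_dataset,
)

train_ds = make_retrieval_dataset(
    data_dir_list="/data/train.json",  # placeholder path
    data_type="train",
    train_n_passages=5,  # 1 positive + 4 negatives per query
    seed=42,
    do_shuffle=True,
)

example = train_ds[0]            # transformed lazily via set_transform()
print(example["question"])       # query text
print(len(example["doc_text"]))  # [positive, negatives...]
```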