nemo_automodel.components.datasets.llm.retrieval_dataset#
Module Contents#
Classes#
CorpusInfo — Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.
Functions#
load_datasets — Load datasets from JSON files.
_transform_func — Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.
_create_transform_func — Create transform function with specified number of negative documents.
make_retrieval_dataset — Load and return dataset in retrieval format for biencoder training.
Data#
API#
- nemo_automodel.components.datasets.llm.retrieval_dataset.EXAMPLE_TEMPLATE#
None
- class nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#
Bases: abc.ABC
- abstractmethod get_document_by_id(id)#
- abstractmethod get_all_ids()#
- class nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset(path)#
Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset
- get_document_by_id(id)#
- get_all_ids()#
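The abstract interface above can be exercised without the library. The sketch below re-declares the two abstract methods with stdlib `abc` and substitutes a hypothetical in-memory subclass for `TextQADataset` (which instead loads its corpus from `path`).

```python
from abc import ABC, abstractmethod


class AbstractDataset(ABC):
    """Minimal sketch of the corpus interface: look up a document by ID,
    and enumerate all document IDs."""

    @abstractmethod
    def get_document_by_id(self, id):
        ...

    @abstractmethod
    def get_all_ids(self):
        ...


class InMemoryQADataset(AbstractDataset):
    """Hypothetical stand-in for TextQADataset, backed by a dict
    instead of a file at `path`."""

    def __init__(self, docs):
        self._docs = docs  # {doc_id: document text}

    def get_document_by_id(self, id):
        return self._docs[id]

    def get_all_ids(self):
        return list(self._docs)


corpus = InMemoryQADataset({"d0": "Paris is the capital of France."})
```

Any concrete corpus only needs these two methods, which is what lets `CorpusInfo` delegate lookups uniformly.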
- nemo_automodel.components.datasets.llm.retrieval_dataset.DATASETS#
None
- class nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo#
Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.
- metadata: dict#
None
- property corpus_id: str#
Get corpus ID from metadata
- property query_instruction: str#
Get query instruction from metadata
- property passage_instruction: str#
Get passage instruction from metadata
- property task_type: str#
Get task type from metadata
- property path: str#
Get corpus path from the corpus object
- get_document_by_id(doc_id: str)#
Delegate to corpus for convenience
- get_all_ids()#
Delegate to corpus for convenience
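The properties and delegation described above can be sketched as a dataclass. The `corpus` attribute name and the metadata keys here are assumptions for illustration, not the module's exact field names.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusInfoSketch:
    """Sketch of CorpusInfo: a metadata dict plus a corpus object,
    exposed through descriptive accessors."""

    metadata: dict = field(default_factory=dict)
    corpus: object = None  # assumed attribute name for the dataset object

    @property
    def corpus_id(self) -> str:
        return self.metadata["corpus_id"]

    @property
    def query_instruction(self) -> str:
        return self.metadata.get("query_instruction", "")

    def get_document_by_id(self, doc_id: str):
        # Delegate to the wrapped corpus for convenience
        return self.corpus.get_document_by_id(doc_id)


class _FakeCorpus:
    def get_document_by_id(self, doc_id):
        return f"text of {doc_id}"


info = CorpusInfoSketch(
    metadata={"corpus_id": "nq", "query_instruction": "query:"},
    corpus=_FakeCorpus(),
)
```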
- nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus_metadata(path: str)#
- nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus(path, metadata: Optional[dict] = None)#
- nemo_automodel.components.datasets.llm.retrieval_dataset.add_corpus(
- qa_corpus_paths: Union[dict, list],
- corpus_dict: dict,
- )#
- nemo_automodel.components.datasets.llm.retrieval_dataset.load_datasets(
- data_dir_list: Union[List[str], str],
- concatenate: bool = True,
- )#
Load datasets from JSON files.
Copied from nemo-retriever-research/src/data/datasets.py
- Returns:
Tuple of (dataset, corpus_dict)
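As a rough illustration of the (dataset, corpus_dict) contract, the sketch below reads JSON files of examples. The field names (question, corpus_id, pos_doc, neg_doc) are taken from the _transform_func parameter docs; the real on-disk schema may differ.

```python
import json
import os
import tempfile


def load_datasets_sketch(data_dir_list, concatenate=True):
    """Hedged sketch of load_datasets: read one or more JSON files of
    examples and return (dataset, corpus_dict)."""
    paths = data_dir_list if isinstance(data_dir_list, list) else [data_dir_list]
    per_file, corpus_dict = [], {}
    for p in paths:
        with open(p) as f:
            examples = json.load(f)
        per_file.append(examples)
        for ex in examples:
            # Corpus objects are built elsewhere (see add_corpus); record IDs only.
            corpus_dict.setdefault(ex["corpus_id"], None)
    dataset = [ex for chunk in per_file for ex in chunk] if concatenate else per_file
    return dataset, corpus_dict


# Demo against a temporary JSON file
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "train.json")
    with open(path, "w") as f:
        json.dump([{"question": "q1", "corpus_id": "c1",
                    "pos_doc": "p1", "neg_doc": ["n1", "n2"]}], f)
    dataset, corpus_dict = load_datasets_sketch([path])
```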
- nemo_automodel.components.datasets.llm.retrieval_dataset._transform_func(
- examples,
- num_neg_docs,
- corpus_dict,
- use_dataset_instruction: bool = False,
- )#
Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.
- Parameters:
examples – Batch of examples with question, corpus_id, pos_doc, neg_doc
num_neg_docs – Number of negative documents to use
corpus_dict – Dictionary mapping corpus_id to corpus objects
use_dataset_instruction – Whether to use instruction from dataset’s metadata
- nemo_automodel.components.datasets.llm.retrieval_dataset._create_transform_func(
- num_neg_docs,
- corpus_dict,
- use_dataset_instruction: bool = False,
- )#
Create transform function with specified number of negative documents.
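The factory pattern above — binding num_neg_docs into a closure — can be sketched as follows. Field names are assumptions drawn from the _transform_func parameter docs, not the module's exact keys.

```python
def create_transform_func_sketch(num_neg_docs, corpus_dict):
    """Sketch of the factory: close over num_neg_docs (and corpus_dict)
    and return a batch-level transform."""

    def transform(examples):
        doc_text = []
        for pos, negs in zip(examples["pos_doc"], examples["neg_doc"]):
            # one positive document followed by num_neg_docs negatives
            doc_text.append([pos] + list(negs)[:num_neg_docs])
        return {"question": examples["question"], "doc_text": doc_text}

    return transform


tfm = create_transform_func_sketch(num_neg_docs=1, corpus_dict={})
batch = {"question": ["q1"], "pos_doc": ["p1"], "neg_doc": [["n1", "n2"]]}
result = tfm(batch)
```

Returning a closure lets the same transform be handed to a dataset's lazy-transform hook without re-passing configuration on every call.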
- nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset(
- data_dir_list: Union[List[str], str],
- data_type: str = 'train',
- train_n_passages: int = 5,
- eval_negative_size: int = 10,
- seed: int = 42,
- do_shuffle: bool = False,
- max_train_samples: int = None,
- train_data_select_offset: int = 0,
- use_dataset_instruction: bool = False,
- )#
Load and return dataset in retrieval format for biencoder training.
This function loads data from JSON files using the same method as RetrievalMultiModalDatasetLoader and returns it ready for training. It uses set_transform() for lazy evaluation; tokenization is handled by the collator.
- Parameters:
data_dir_list – Path(s) to JSON file(s) containing training data
data_type – Type of data (“train” or “eval”)
train_n_passages – Number of passages for training (1 positive + n-1 negatives)
eval_negative_size – Number of negative documents for evaluation
seed – Random seed for reproducibility (for shuffling if needed)
do_shuffle – Whether to shuffle the dataset
max_train_samples – Maximum number of training samples to use
train_data_select_offset – Offset for selecting training samples
use_dataset_instruction – Whether to use instruction from the dataset’s metadata
- Returns:
A HuggingFace Dataset where each example is a dict with keys:
'question': Query text
'doc_text': List of document texts [positive, negatives...]
'doc_image': List of images or empty strings
Note: Tokenization should be handled by a collator (e.g., RetrievalBiencoderCollator), which is more efficient for batch padding and supports dynamic processing.
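The lazy-evaluation behavior can be illustrated with a stdlib sketch of set_transform()-style access: the transform runs when an item is read, not when the dataset is built, and tokenization is deferred to the collator (represented here only by a comment).

```python
class LazyDataset:
    """Sketch of set_transform-style lazy evaluation: the transform is
    applied on item access rather than eagerly over the whole dataset."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform  # batch-level function, as in HF datasets

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        row = self.rows[i]
        batch = {k: [v] for k, v in row.items()}  # wrap as a 1-row batch
        out = self.transform(batch)
        # Tokenization would happen later, in the collator, at batch time.
        return {k: v[0] for k, v in out.items()}


def to_training_format(batch):
    # positive first, then negatives, matching the documented output keys
    return {"question": batch["question"],
            "doc_text": [[p] + n for p, n in zip(batch["pos_doc"], batch["neg_doc"])]}


rows = [{"question": "q1", "pos_doc": "p1", "neg_doc": ["n1"]}]
ds = LazyDataset(rows, to_training_format)
example = ds[0]
```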