nemo_automodel.components.datasets.llm.retrieval_dataset#
Module Contents#
Classes#

| HFCorpusDataset | Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet). |
| CorpusInfo | Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names. |

Functions#

| load_datasets | Load datasets from JSON files. |
| _parse_hf_uri | Parse an hf:// URI into (repo_id, subset_or_none). |
| _list_hf_subsets | Discover all subset names in repo_id by finding dataset_metadata.json files. |
| _load_hf_subset | Load a single HF subset and return (normalized_data_list, CorpusInfo). |
| _load_hf_sources | Load one or more hf:// URIs and return (Dataset, corpus_dict). |
| _transform_func | Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader. |
| _create_transform_func | Create transform function with specified number of negative documents. |
| make_retrieval_dataset | Load and return dataset in retrieval format for biencoder training. |
Data#
API#
- nemo_automodel.components.datasets.llm.retrieval_dataset.EXAMPLE_TEMPLATE#
None
- class nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset#
Bases: abc.ABC
- abstractmethod get_document_by_id(id)#
- abstractmethod get_all_ids()#
- class nemo_automodel.components.datasets.llm.retrieval_dataset.TextQADataset(path)#
Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset
- get_document_by_id(id)#
- get_all_ids()#
- class nemo_automodel.components.datasets.llm.retrieval_dataset.HFCorpusDataset(hf_dataset: datasets.Dataset, path: str = '')#
Bases: nemo_automodel.components.datasets.llm.retrieval_dataset.AbstractDataset
Wraps an already-loaded HuggingFace Dataset as a corpus (in-memory, no local Parquet).
Initialization
- get_document_by_id(id)#
- get_all_ids()#
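The in-memory lookup pattern that HFCorpusDataset describes can be sketched as follows. This is a hypothetical stand-in (InMemoryCorpus, the sample rows, and the id/text column names are illustrative assumptions), not the library's implementation:

```python
class InMemoryCorpus:
    # Build an id -> row-index map once, then serve O(1) document
    # lookups from memory without touching local Parquet files.
    def __init__(self, rows):
        # rows: list of dicts with at least "id" and "text" columns,
        # mirroring the corpus split layout described in this module.
        self._rows = rows
        self._index = {row["id"]: i for i, row in enumerate(rows)}

    def get_document_by_id(self, doc_id):
        return self._rows[self._index[doc_id]]

    def get_all_ids(self):
        return list(self._index)


corpus = InMemoryCorpus([
    {"id": "d1", "text": "Paris is the capital of France."},
    {"id": "d2", "text": "The FEVER dataset targets fact verification."},
])
```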
- nemo_automodel.components.datasets.llm.retrieval_dataset.DATASETS#
None
- class nemo_automodel.components.datasets.llm.retrieval_dataset.CorpusInfo#
Data structure to hold corpus metadata and dataset object together. Provides easy access to both components with descriptive attribute names.
- metadata: dict#
None
- property corpus_id: str#
Get corpus ID from metadata
- property query_instruction: str#
Get query instruction from metadata
- property passage_instruction: str#
Get passage instruction from metadata
- property task_type: str#
Get task type from metadata
- property path: str#
Get corpus path from the corpus object
- get_document_by_id(doc_id: str)#
Delegate to corpus for convenience
- get_all_ids()#
Delegate to corpus for convenience
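The pattern CorpusInfo describes (a metadata dict paired with a corpus object, exposed through descriptive accessors that delegate downward) might look like this minimal sketch. CorpusInfoSketch, its corpus field name, and the sample instruction string are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class CorpusInfoSketch:
    # Metadata dict plus the corpus object, with descriptive
    # property accessors over the raw metadata keys.
    metadata: dict
    corpus: object

    @property
    def corpus_id(self) -> str:
        return self.metadata["corpus_id"]

    @property
    def query_instruction(self) -> str:
        return self.metadata.get("query_instruction", "")

    def get_document_by_id(self, doc_id):
        # Delegate to the wrapped corpus for convenience.
        return self.corpus.get_document_by_id(doc_id)


info = CorpusInfoSketch(
    metadata={"corpus_id": "fever", "query_instruction": "Given a claim, retrieve evidence."},
    corpus=None,  # a real AbstractDataset instance in practice
)
```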
- nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus_metadata(path: str)#
- nemo_automodel.components.datasets.llm.retrieval_dataset.load_corpus(path, metadata: Optional[dict] = None)#
- nemo_automodel.components.datasets.llm.retrieval_dataset.add_corpus(
- qa_corpus_paths: Union[dict, list],
- corpus_dict: dict,
- )#
- nemo_automodel.components.datasets.llm.retrieval_dataset.load_datasets(
- data_dir_list: Union[List[str], str],
- concatenate: bool = True,
- )#
Load datasets from JSON files.
Copied from nemo-retriever-research/src/data/datasets.py
- Returns:
Tuple of (dataset, corpus_dict)
- nemo_automodel.components.datasets.llm.retrieval_dataset._HF_PREFIX#
'hf://'
- nemo_automodel.components.datasets.llm.retrieval_dataset._parse_hf_uri(uri: str)#
Parse an hf:// URI into (repo_id, subset_or_none).
Examples:
"hf://nvidia/embed-nemotron-dataset-v1/FEVER" -> ("nvidia/embed-nemotron-dataset-v1", "FEVER")
"hf://nvidia/embed-nemotron-dataset-v1" -> ("nvidia/embed-nemotron-dataset-v1", None)
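The two documented examples can be reproduced with a short parser sketch; parse_hf_uri here is a hypothetical re-implementation for illustration, not the module's private helper:

```python
_HF_PREFIX = "hf://"


def parse_hf_uri(uri: str):
    # Split "hf://org/repo[/subset]" into (repo_id, subset_or_none).
    if not uri.startswith(_HF_PREFIX):
        raise ValueError(f"not an hf:// URI: {uri}")
    parts = uri[len(_HF_PREFIX):].split("/")
    repo_id = "/".join(parts[:2])  # HF repo ids are "org/name"
    subset = parts[2] if len(parts) > 2 else None
    return repo_id, subset
```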
- nemo_automodel.components.datasets.llm.retrieval_dataset._list_hf_subsets(repo_id: str) List[str]#
Discover all subset names in repo_id by finding dataset_metadata.json files.
- nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_subset(repo_id: str, subset: str)#
Load a single HF subset and return (normalized_data_list, CorpusInfo).
Note:
The direct hf:// path currently expects the Automodel retrieval schema:
- {subset}/dataset_metadata.json with corpus_id metadata
- {subset}_corpus split with corpus columns like id and text
- {subset} split with query columns like question and pos_doc
FEVER and SyntheticClassificationData from nvidia/embed-nemotron-dataset-v1 are examples that follow this layout. Datasets with different structures should use a custom adapter/preprocessor before calling this loader.
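The expected subset layout can be made concrete with a small validator sketch; follows_retrieval_schema and the dict-based repo model are illustrative assumptions that mirror the documented schema, not the library's actual discovery code:

```python
def follows_retrieval_schema(name, repo):
    # `repo` maps file/split names to their parsed metadata or column
    # lists, modeling the layout documented above for one subset.
    meta = repo.get(f"{name}/dataset_metadata.json", {})
    corpus_cols = repo.get(f"{name}_corpus", [])
    query_cols = repo.get(name, [])
    return (
        "corpus_id" in meta
        and {"id", "text"} <= set(corpus_cols)
        and {"question", "pos_doc"} <= set(query_cols)
    )


repo = {
    "FEVER/dataset_metadata.json": {"corpus_id": "fever"},
    "FEVER_corpus": ["id", "text"],
    "FEVER": ["question", "pos_doc", "neg_doc"],
}
```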
- nemo_automodel.components.datasets.llm.retrieval_dataset._load_hf_sources(hf_uris: List[str])#
Load one or more hf:// URIs and return (Dataset, corpus_dict).
- nemo_automodel.components.datasets.llm.retrieval_dataset._transform_func(
- examples,
- num_neg_docs,
- corpus_dict,
- use_dataset_instruction: bool = False,
- )#
Transform function to convert from raw format to training format. Same as _format_process_data in RetrievalMultiModalDatasetLoader.
- Parameters:
examples – Batch of examples with question, corpus_id, pos_doc, neg_doc
num_neg_docs – Number of negative documents to use
corpus_dict – Dictionary mapping corpus_id to corpus objects
use_dataset_instruction – Whether to use instruction from dataset’s metadata
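The raw-to-training conversion can be sketched as a batched transform that pairs each query with its positive document followed by num_neg_docs negatives. This is a simplified assumption-based sketch: the corpus_dict lookup and instruction handling of the real _transform_func are omitted:

```python
def transform_batch(examples, num_neg_docs):
    # examples: batch dict with "question", "pos_doc", "neg_doc" columns.
    # Emits the 'question'/'doc_text' keys described for the
    # retrieval training format: [positive, negatives...].
    out = {"question": [], "doc_text": []}
    for q, pos, negs in zip(examples["question"], examples["pos_doc"], examples["neg_doc"]):
        out["question"].append(q)
        out["doc_text"].append([pos] + negs[:num_neg_docs])
    return out


batch = {
    "question": ["capital of France?"],
    "pos_doc": ["Paris is the capital of France."],
    "neg_doc": [["Berlin is in Germany.", "Rome is in Italy.", "Madrid is in Spain."]],
}
result = transform_batch(batch, num_neg_docs=2)
```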
- nemo_automodel.components.datasets.llm.retrieval_dataset._create_transform_func(
- num_neg_docs,
- corpus_dict,
- use_dataset_instruction: bool = False,
- )#
Create transform function with specified number of negative documents.
- nemo_automodel.components.datasets.llm.retrieval_dataset.make_retrieval_dataset(
- data_dir_list: Union[List[str], str] = None,
- data_type: str = 'train',
- train_n_passages: int = 5,
- eval_negative_size: int = 10,
- seed: int = 42,
- do_shuffle: bool = False,
- max_train_samples: int = None,
- train_data_select_offset: int = 0,
- use_dataset_instruction: bool = False,
- )#
Load and return dataset in retrieval format for biencoder training.
Entries in data_dir_list can be local JSON file paths or hf:// URIs pointing to a HuggingFace dataset repository (e.g. hf://nvidia/embed-nemotron-dataset-v1/SciFact). Uses set_transform() for lazy evaluation; tokenization is handled by the collator.
- Parameters:
data_dir_list – Path(s) to JSON file(s) or hf:// URIs.
data_type – Type of data (“train” or “eval”)
train_n_passages – Number of passages for training (1 positive + n-1 negatives)
eval_negative_size – Number of negative documents for evaluation
seed – Random seed for reproducibility (for shuffling if needed)
do_shuffle – Whether to shuffle the dataset
max_train_samples – Maximum number of training samples to use
train_data_select_offset – Offset for selecting training samples
use_dataset_instruction – Whether to use instruction from dataset’s metadata
- Returns:
Dataset where each example is a dict with keys:
‘question’: Query text
‘doc_text’: List of document texts [positive, negatives…]
‘doc_image’: List of images or empty strings
- Return type:
A HuggingFace Dataset
Note:
Direct hf:// loading currently supports HF datasets that already follow the Automodel retrieval schema (the corpus-id based layout used by nvidia/embed-nemotron-dataset-v1 subsets such as FEVER and SyntheticClassificationData). For other HF dataset formats, implement a custom adapter/preprocessor before calling this loader.
Tokenization should be handled by a collator (e.g., RetrievalBiencoderCollator), which is more efficient for batch padding and supports dynamic processing.
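The lazy-evaluation idea behind set_transform() can be illustrated with a plain-Python stand-in: raw rows stay untouched, and the formatting transform runs only when a row is accessed, leaving tokenization to the collator. LazyTransformDataset and the sample transform are hypothetical, not the datasets library's implementation:

```python
class LazyTransformDataset:
    # Applies a batched transform on access rather than up front.
    def __init__(self, rows, transform):
        self._rows = rows
        self._transform = transform

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, i):
        # Wrap the single row as a batch of size 1, transform it,
        # then unwrap, mimicking how batched transforms see data.
        batch = {k: [v] for k, v in self._rows[i].items()}
        out = self._transform(batch)
        return {k: v[0] for k, v in out.items()}


def fmt(batch):
    # Produce the retrieval training format: positive first, then negatives.
    return {
        "question": batch["question"],
        "doc_text": [[p] + n for p, n in zip(batch["pos_doc"], batch["neg_doc"])],
    }


rows = [{"question": "q1", "pos_doc": "p1", "neg_doc": ["n1", "n2"]}]
ds = LazyTransformDataset(rows, fmt)
```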