nemo_automodel.components.datasets.llm.retrieval_collator#
Module Contents#
Classes#
RetrievalBiencoderCollator: Collator for biencoder retrieval training.
Functions#
_unpack_doc_values: Unpack document lists into individual examples.
API#
- nemo_automodel.components.datasets.llm.retrieval_collator._unpack_doc_values(
- features: List[Dict[str, Any]],
- )
Unpack document lists into individual examples.
.. rubric:: Example
Input: [{'input_ids': [[1, 2], [3, 4]], 'attention_mask': [[1, 1], [1, 1]]}]
Output: [{'input_ids': [1, 2], 'attention_mask': [1, 1]}, {'input_ids': [3, 4], 'attention_mask': [1, 1]}]
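The unpacking behavior implied by this example can be sketched as follows. This is a minimal re-implementation for illustration (the helper name is the module's, the body is an assumption, not the actual source):

```python
from typing import Any, Dict, List


def unpack_doc_values_sketch(features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten per-example document lists into one entry per document."""
    unpacked = []
    for feature in features:
        keys = list(feature.keys())
        # Every key holds a list with one element per document.
        num_docs = len(feature[keys[0]])
        for i in range(num_docs):
            unpacked.append({key: feature[key][i] for key in keys})
    return unpacked


print(unpack_doc_values_sketch(
    [{"input_ids": [[1, 2], [3, 4]], "attention_mask": [[1, 1], [1, 1]]}]
))
# [{'input_ids': [1, 2], 'attention_mask': [1, 1]},
#  {'input_ids': [3, 4], 'attention_mask': [1, 1]}]
```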
- class nemo_automodel.components.datasets.llm.retrieval_collator.RetrievalBiencoderCollator(
- tokenizer: transformers.PreTrainedTokenizerBase,
- q_max_len: int = 512,
- p_max_len: int = 512,
- query_prefix: str = '',
- passage_prefix: str = '',
- padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = True,
- pad_to_multiple_of: int = None,
- )
Collator for biencoder retrieval training.
This collator handles tokenization of queries and documents at batch time, which is more memory-efficient than pre-tokenization and allows for dynamic padding based on batch max length.
Based on BiencoderCollator from nemo-retriever-research but adapted for Automodel.
Initialization
Initialize the collator.
- Parameters:
tokenizer – Tokenizer to use for encoding
q_max_len – Maximum length for queries
p_max_len – Maximum length for passages
query_prefix – Prefix to add to queries (e.g., "query: ")
passage_prefix – Prefix to add to passages (e.g., "passage: ")
padding – Padding strategy (“longest”, “max_length”, or “do_not_pad”)
pad_to_multiple_of – Pad to multiple of this value (e.g., 8 for FP16)
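A typical construction looks like the sketch below. The tokenizer checkpoint and the E5-style prefixes are illustrative assumptions, not values prescribed by the module:

```python
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.retrieval_collator import RetrievalBiencoderCollator

# Hypothetical checkpoint; any Hugging Face tokenizer works here.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

collator = RetrievalBiencoderCollator(
    tokenizer=tokenizer,
    q_max_len=64,
    p_max_len=256,
    query_prefix="query: ",
    passage_prefix="passage: ",
    padding="longest",          # dynamic padding to the batch max length
    pad_to_multiple_of=8,       # multiples of 8 help FP16/BF16 kernels
)
```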
- __call__(
- batch: List[Dict[str, Any]],
- )
Collate a batch of examples.
- Parameters:
batch – List of examples, each with 'question', 'doc_text', and 'doc_image' keys
- Returns:
Dictionary with:
q_input_ids: Query input IDs [batch_size, q_seq_len]
q_attention_mask: Query attention mask [batch_size, q_seq_len]
d_input_ids: Document input IDs [batch_size * num_docs, d_seq_len]
d_attention_mask: Document attention mask [batch_size * num_docs, d_seq_len]
labels: Dummy labels for compatibility [batch_size]
- Return type:
dict
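A sketch of one collation step, continuing the construction example above. The example payloads, including passing None for 'doc_image' in a text-only setup, are assumptions rather than documented requirements:

```python
# `collator` is built as in the construction example above.
batch = [
    {
        "question": "what is the capital of france?",
        "doc_text": ["Paris is the capital of France.", "Berlin is in Germany."],
        "doc_image": [None, None],  # placeholder; text-only assumption
    },
    {
        "question": "who wrote hamlet?",
        "doc_text": ["Hamlet was written by Shakespeare.", "Moby-Dick is a novel."],
        "doc_image": [None, None],
    },
]

features = collator(batch)
# Expected shapes per the return description:
#   features["q_input_ids"]       -> [2, q_seq_len]
#   features["q_attention_mask"]  -> [2, q_seq_len]
#   features["d_input_ids"]       -> [4, d_seq_len]   (batch_size * num_docs)
#   features["d_attention_mask"]  -> [4, d_seq_len]
#   features["labels"]            -> [2]
```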
- _merge_batch_dict(
- query_batch_dict: Dict[str, List],
- doc_batch_dict: Dict[str, List],
- train_n_passages: int,
- )
Merge query and document batches into a single dictionary.
Adapted from nemo-retriever-research/src/loaders/loader_utils.py
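The merge step can be pictured roughly as namespacing the two tokenized batches under q_ and d_ keys, consistent with the keys returned by __call__. This simplified sketch omits the train_n_passages handling and is an assumption about the helper's behavior, not its actual source:

```python
from typing import Dict, List


def merge_batch_dict_sketch(
    query_batch_dict: Dict[str, List],
    doc_batch_dict: Dict[str, List],
) -> Dict[str, List]:
    """Namespace query fields with q_ and document fields with d_ so both
    encoders' inputs fit in a single batch dictionary."""
    merged = {}
    for key, value in query_batch_dict.items():
        merged[f"q_{key}"] = value  # e.g. q_input_ids, q_attention_mask
    for key, value in doc_batch_dict.items():
        merged[f"d_{key}"] = value  # e.g. d_input_ids, d_attention_mask
    return merged
```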
- _convert_dict_to_list(
- input_dict: Dict[str, List],
- )
Convert dictionary of lists to list of dictionaries.
.. rubric:: Example
Input: {'a': [1, 2], 'b': [3, 4]}
Output: [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
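The conversion shown above amounts to transposing a dict of equal-length lists into a list of dicts. A minimal sketch of that logic (illustrative only, not the module's source):

```python
from typing import Any, Dict, List


def convert_dict_to_list_sketch(input_dict: Dict[str, List]) -> List[Dict[str, Any]]:
    """Transpose a dict of equal-length lists into a list of dicts."""
    keys = list(input_dict.keys())
    length = len(input_dict[keys[0]])
    return [{key: input_dict[key][i] for key in keys} for i in range(length)]


print(convert_dict_to_list_sketch({"a": [1, 2], "b": [3, 4]}))
# [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
```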