nemo_automodel.components.datasets.llm.retrieval_collator#

Module Contents#

Classes#

RetrievalBiencoderCollator

Collator for biencoder retrieval training.

Functions#

_unpack_doc_values

Unpack document lists into individual examples.

API#

nemo_automodel.components.datasets.llm.retrieval_collator._unpack_doc_values(
features: List[Dict[str, Any]],
) → List[Dict[str, Any]]#

Unpack document lists into individual examples.

Example

Input: [{'input_ids': [[1, 2], [3, 4]], 'attention_mask': [[1, 1], [1, 1]]}]

Output: [{'input_ids': [1, 2], 'attention_mask': [1, 1]}, {'input_ids': [3, 4], 'attention_mask': [1, 1]}]
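The transformation above can be sketched in plain Python. This is an illustrative stand-in for the private helper, not the module's actual implementation:

```python
from typing import Any, Dict, List


def unpack_doc_values(features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten per-document value lists into one example per document."""
    unpacked: List[Dict[str, Any]] = []
    for feature in features:
        keys = list(feature.keys())
        # Every key is assumed to hold a list with one entry per document.
        for i in range(len(feature[keys[0]])):
            unpacked.append({k: feature[k][i] for k in keys})
    return unpacked
```
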

class nemo_automodel.components.datasets.llm.retrieval_collator.RetrievalBiencoderCollator(
tokenizer: transformers.PreTrainedTokenizerBase,
q_max_len: int = 512,
p_max_len: int = 512,
query_prefix: str = '',
passage_prefix: str = '',
padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = True,
pad_to_multiple_of: Optional[int] = None,
)#

Collator for biencoder retrieval training.

This collator handles tokenization of queries and documents at batch time, which is more memory-efficient than pre-tokenization and allows for dynamic padding based on batch max length.

Based on BiencoderCollator from nemo-retriever-research but adapted for Automodel.

Initialization

Initialize the collator.

Parameters:
  • tokenizer – Tokenizer to use for encoding

  • q_max_len – Maximum length for queries

  • p_max_len – Maximum length for passages

  • query_prefix – Prefix to add to queries (e.g., "query: ")

  • passage_prefix – Prefix to add to passages (e.g., "passage: ")

  • padding – Padding strategy ("longest", "max_length", or "do_not_pad")

  • pad_to_multiple_of – Pad to multiple of this value (e.g., 8 for FP16)

__call__(
batch: List[Dict[str, Any]],
) → Dict[str, torch.Tensor]#

Collate a batch of examples.

Parameters:

batch – List of examples, each with 'question', 'doc_text', 'doc_image' keys

Returns:

  • q_input_ids: Query input IDs [batch_size, q_seq_len]

  • q_attention_mask: Query attention mask [batch_size, q_seq_len]

  • d_input_ids: Document input IDs [batch_size * num_docs, d_seq_len]

  • d_attention_mask: Document attention mask [batch_size * num_docs, d_seq_len]

  • labels: Dummy labels for compatibility [batch_size]

Return type:

Dict[str, torch.Tensor] containing the keys listed above

_merge_batch_dict(
query_batch_dict: Dict[str, List],
doc_batch_dict: Dict[str, List],
train_n_passages: int,
) → Dict[str, List]#

Merge query and document batches into a single dictionary.

Adapted from nemo-retriever-research/src/loaders/loader_utils.py
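A hedged sketch of the merge, assuming (from the q_/d_ key names in __call__'s return value) that query keys are prefixed with "q_" and document keys with "d_", and that train_n_passages is used only to sanity-check that each query contributes that many passages; the actual adapted code may differ:

```python
from typing import Dict, List


def merge_batch_dict(
    query_batch_dict: Dict[str, List],
    doc_batch_dict: Dict[str, List],
    train_n_passages: int,
) -> Dict[str, List]:
    """Combine query and document batches under q_/d_ prefixed keys."""
    merged = {f"q_{k}": v for k, v in query_batch_dict.items()}
    for k, v in doc_batch_dict.items():
        if k in query_batch_dict:
            # Each query is expected to pair with train_n_passages documents.
            assert len(v) == len(query_batch_dict[k]) * train_n_passages
        merged[f"d_{k}"] = v
    return merged
```
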

_convert_dict_to_list(
input_dict: Dict[str, List],
) → List[Dict[str, Any]]#

Convert dictionary of lists to list of dictionaries.

Example

Input: {'a': [1, 2], 'b': [3, 4]}

Output: [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
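This conversion is a standard dict-of-lists to list-of-dicts transpose; a minimal sketch (illustrative, not the module's private implementation):

```python
from typing import Any, Dict, List


def convert_dict_to_list(input_dict: Dict[str, List]) -> List[Dict[str, Any]]:
    """Transpose a dictionary of equal-length lists into a list of dicts."""
    keys = list(input_dict.keys())
    n = len(input_dict[keys[0]])  # all lists are assumed to share one length
    return [{k: input_dict[k][i] for k in keys} for i in range(n)]
```
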