nemo_automodel.components.datasets.llm.retrieval_collator
Module Contents
Classes
Functions
API
Collator for encoder retrieval training.
This collator tokenizes queries and documents at batch time. This is more memory-efficient than pre-tokenization and allows dynamic padding to the maximum sequence length in each batch.
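The dynamic padding described above can be sketched in plain Python. This is a minimal illustration, not the collator's actual implementation: the helper name `pad_to_batch_max`, the `pad_id` default, and the use of plain lists instead of tensors are all assumptions made for the example.

```python
from typing import Dict, List


def pad_to_batch_max(
    sequences: List[List[int]], pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Right-pad token id sequences to the longest length in this batch.

    Hypothetical helper for illustration; the real collator is expected to
    delegate this work to a tokenizer rather than pad by hand.
    """
    # The padding target is computed per batch, not fixed globally.
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_to_batch_max([[5, 6, 7], [8]])
```

Because the target length is taken from the batch itself, a batch of short texts is never padded out to a global maximum length.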
Based on EncoderCollator from nemo-retriever-research but adapted for Automodel.
Collate a batch of examples.
Parameters:
List of examples, each with 'question', 'doc_text', and 'doc_image' keys.
Returns: Dict[str, torch.Tensor]
Dictionary with:
Convert dictionary of lists to list of dictionaries.
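A conversion of this kind can be sketched as follows. The function name `dict_of_lists_to_list_of_dicts` and the equal-length check are assumptions for illustration; the module's own function may differ in name and in how it validates input.

```python
from typing import Any, Dict, List


def dict_of_lists_to_list_of_dicts(
    columns: Dict[str, List[Any]]
) -> List[Dict[str, Any]]:
    """Turn {"a": [1, 2], "b": [3, 4]} into [{"a": 1, "b": 3}, {"a": 2, "b": 4}].

    Hypothetical helper; the length check is an assumption of this sketch.
    """
    if not columns:
        return []
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    keys = list(columns)
    # zip(*values) walks the columns row by row.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


rows = dict_of_lists_to_list_of_dicts(
    {"question": ["q1", "q2"], "doc_text": ["d1", "d2"]}
)
```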
Merge query and document batches into a single dictionary.
Adapted from nemo-retriever-research/src/loaders/loader_utils.py
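One way to merge two tokenized batches without key collisions is to prefix each key by its source. The sketch below assumes this approach; the function name `merge_batches` and the `q_` / `d_` prefixes are hypothetical and are not confirmed by this page.

```python
from typing import Any, Dict


def merge_batches(
    query_batch: Dict[str, Any],
    doc_batch: Dict[str, Any],
    query_prefix: str = "q_",  # assumed prefix, for illustration only
    doc_prefix: str = "d_",  # assumed prefix, for illustration only
) -> Dict[str, Any]:
    """Combine query and document batches into one flat dictionary."""
    merged: Dict[str, Any] = {}
    for key, value in query_batch.items():
        merged[f"{query_prefix}{key}"] = value
    for key, value in doc_batch.items():
        merged[f"{doc_prefix}{key}"] = value
    return merged


merged = merge_batches(
    {"input_ids": [[1, 2]], "attention_mask": [[1, 1]]},
    {"input_ids": [[3, 4]], "attention_mask": [[1, 1]]},
)
```

Prefixing keeps both `input_ids` entries available to the model under distinct names, which a plain `dict.update` would not.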
Bases: DataCollatorWithPadding
Collate query-document pairs for cross-encoder reranking.
Stable 63-bit integer derived from a corpus document ID string (used for in-batch duplicate masking).
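A stable mapping of this kind can be sketched with a cryptographic digest. The choice of SHA-256, the function name `stable_doc_id_to_int63`, and the mask construction are assumptions for illustration; this page does not state which hash the module uses. Python's built-in `hash()` is unsuitable here because string hashes are randomized per process.

```python
import hashlib


def stable_doc_id_to_int63(doc_id: str) -> int:
    """Map a document ID string to a non-negative integer below 2**63.

    Hypothetical helper: SHA-256 is an assumed choice of hash.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Take 8 bytes and clear the top bit so the value fits a signed 64-bit int.
    return int.from_bytes(digest[:8], "big") & ((1 << 63) - 1)


# Entries that share a document ID can then be found by integer comparison.
ids = [stable_doc_id_to_int63(d) for d in ["doc-1", "doc-2", "doc-1"]]
duplicate_mask = [[a == b for b in ids] for a in ids]
```

Keeping the value within 63 bits means it can be stored in a signed 64-bit integer tensor without overflow.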
Unpack document lists into individual examples.
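The unpacking step can be sketched as below, assuming each example stores its documents as parallel lists under the 'doc_text' and 'doc_image' keys. The function name `unpack_documents` and the exact input layout are assumptions, since this page does not show the data format.

```python
from typing import Any, Dict, List, Sequence


def unpack_documents(
    example: Dict[str, List[Any]],
    doc_keys: Sequence[str] = ("doc_text", "doc_image"),
) -> List[Dict[str, Any]]:
    """Expand parallel document lists into one dictionary per document.

    Hypothetical helper; assumes every key in doc_keys holds a list of the
    same length.
    """
    n_docs = len(example[doc_keys[0]])
    return [{key: example[key][i] for key in doc_keys} for i in range(n_docs)]


docs = unpack_documents(
    {"doc_text": ["first passage", "second passage"], "doc_image": [None, None]}
)
```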