# nemo_automodel.components.datasets.llm.retrieval_collator

## Module Contents

### Classes

| Name                                                                                                      | Description                                               |
| --------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- |
| [`BiEncoderCollator`](#nemo_automodel-components-datasets-llm-retrieval_collator-BiEncoderCollator)       | Collator for encoder retrieval training.                  |
| [`CrossEncoderCollator`](#nemo_automodel-components-datasets-llm-retrieval_collator-CrossEncoderCollator) | Collate query-document pairs for cross-encoder reranking. |

### Functions

| Name                                                                                                      | Description                                                                   |
| --------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| [`_doc_id_str_to_int64`](#nemo_automodel-components-datasets-llm-retrieval_collator-_doc_id_str_to_int64) | Stable 63-bit int for corpus doc id strings (for in-batch duplicate masking). |
| [`_unpack_doc_values`](#nemo_automodel-components-datasets-llm-retrieval_collator-_unpack_doc_values)     | Unpack document lists into individual examples.                               |

### API

```python
class nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator(
    tokenizer: transformers.PreTrainedTokenizerBase,
    q_max_len: int = 512,
    p_max_len: int = 512,
    query_prefix: str = '',
    passage_prefix: str = '',
    padding: typing.Union[bool, str, transformers.file_utils.PaddingStrategy] = True,
    pad_to_multiple_of: int = None,
    use_dataset_instruction: bool = False
)
```

Collator for encoder retrieval training.

This collator tokenizes queries and documents at batch time. Compared with
pre-tokenizing the dataset, this is more memory-efficient and allows
dynamic padding to the longest sequence in each batch.

Based on EncoderCollator from nemo-retriever-research but adapted for Automodel.
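
The effect of padding at batch time can be illustrated without a real tokenizer. The following is a minimal sketch, not the collator's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the rounding step only mirrors the general idea behind a `pad_to_multiple_of` option.

```python
from typing import Dict, List, Optional


def pad_batch(
    sequences: List[List[int]],
    pad_id: int = 0,
    max_len: int = 512,
    pad_to_multiple_of: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Truncate to max_len, then pad to the longest sequence in this batch."""
    truncated = [seq[:max_len] for seq in sequences]
    target = max((len(seq) for seq in truncated), default=0)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple (ceiling division).
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids: the batch is padded to length 4, not to max_len.
batch = pad_batch([[5, 6, 7], [8, 9]], pad_to_multiple_of=4)
```

Because the target length is computed per batch, a batch of short queries is never padded out to `max_len`.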

```python
nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator.__call__(
    batch: typing.List[typing.Dict[str, typing.Any]]
) -> typing.Dict[str, torch.Tensor]
```

Collate a batch of examples.

**Parameters:**

`batch`: List of examples, each with `question`, `doc_text`, and `doc_image` keys.

**Returns:** `Dict[str, torch.Tensor]`

Dictionary mapping field names to batched tensors.

```python
nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator._convert_dict_to_list(
    input_dict: typing.Dict[str, typing.List]
) -> typing.List[typing.Dict[str, typing.Any]]
```

Convert dictionary of lists to list of dictionaries.
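
A dictionary-of-lists to list-of-dictionaries conversion can be sketched as follows. This is an illustrative stand-in, not the method's source: the function name `dict_of_lists_to_list_of_dicts` and the sample values are assumptions, and the sketch assumes all lists have the same length.

```python
from typing import Any, Dict, List


def dict_of_lists_to_list_of_dicts(
    input_dict: Dict[str, List[Any]],
) -> List[Dict[str, Any]]:
    """Turn {"a": [1, 2], "b": [3, 4]} into [{"a": 1, "b": 3}, {"a": 2, "b": 4}]."""
    keys = list(input_dict.keys())
    # zip(*values) walks all columns in lockstep, yielding one row per position.
    return [dict(zip(keys, row)) for row in zip(*input_dict.values())]


rows = dict_of_lists_to_list_of_dicts(
    {"input_ids": [[1, 2], [3]], "attention_mask": [[1, 1], [1]]}
)
```

Note that `zip` stops at the shortest list, so mismatched lengths would silently drop trailing items in this sketch.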

```python
nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator._merge_batch_dict(
    query_batch_dict: typing.Dict[str, typing.List],
    doc_batch_dict: typing.Dict[str, typing.List],
    train_n_passages: int
) -> typing.Dict[str, typing.List]
```

Merge query and document batches into a single dictionary.

Adapted from `nemo-retriever-research/src/loaders/loader_utils.py`.
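
One plausible way to merge the two batches is shown below. This is a sketch under stated assumptions, not the library's code: the function name `merge_query_doc_batches`, the `q_` and `d_` key prefixes, and the check that each query contributes exactly `train_n_passages` documents are all hypothetical.

```python
from typing import Dict, List


def merge_query_doc_batches(
    query_batch: Dict[str, List],
    doc_batch: Dict[str, List],
    train_n_passages: int,
) -> Dict[str, List]:
    """Combine both batches into one dict with prefixed keys (assumed scheme)."""
    n_queries = len(next(iter(query_batch.values())))
    n_docs = len(next(iter(doc_batch.values())))
    # Assumption: documents are laid out as n_queries groups of train_n_passages.
    if n_docs != n_queries * train_n_passages:
        raise ValueError(
            f"expected {n_queries * train_n_passages} documents, got {n_docs}"
        )
    merged = {f"q_{key}": value for key, value in query_batch.items()}
    for key, value in doc_batch.items():
        merged[f"d_{key}"] = value
    return merged


merged = merge_query_doc_batches(
    {"input_ids": [[1]]}, {"input_ids": [[2], [3]]}, train_n_passages=2
)
```

Prefixing keeps query and document tensors distinguishable in a single dictionary even though both come from the same tokenizer and share field names.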

```python
class nemo_automodel.components.datasets.llm.retrieval_collator.CrossEncoderCollator(
    rerank_max_length: int,
    args = (),
    prompt_template: str = 'question:{query} \n \n pas...,
    kwargs = {}
)
```

**Bases:** `DataCollatorWithPadding`

Collate query-document pairs for cross-encoder reranking.

```python
nemo_automodel.components.datasets.llm.retrieval_collator.CrossEncoderCollator.__call__(
    features: typing.List[typing.Dict[str, typing.Any]]
) -> transformers.BatchEncoding
```

```python
nemo_automodel.components.datasets.llm.retrieval_collator._doc_id_str_to_int64(
    doc_id: str
) -> int
```

Stable 63-bit int for corpus doc id strings (for in-batch duplicate masking).
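
A stable mapping of this kind can be sketched with `hashlib`. The hash function used by the actual helper is not stated here, so the choice of SHA-256, the name `doc_id_to_int63`, and the duplicate-mask construction are assumptions made for illustration.

```python
import hashlib
from typing import List


def doc_id_to_int63(doc_id: str) -> int:
    """Map a string id to a non-negative integer below 2**63."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Take 8 bytes and clear the top bit so the value fits in a signed int64.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF


def duplicate_mask(doc_ids: List[str]) -> List[List[bool]]:
    """mask[i][j] is True when positions i and j (i != j) hold the same document."""
    ids = [doc_id_to_int63(d) for d in doc_ids]
    n = len(ids)
    return [[i != j and ids[i] == ids[j] for j in range(n)] for i in range(n)]


a = doc_id_to_int63("doc-123")
b = doc_id_to_int63("doc-123")
mask = duplicate_mask(["a", "b", "a"])
```

A cryptographic digest is used in the sketch because Python's built-in `hash()` for strings is salted per process by default, so it does not give the same value across runs or workers. Limiting the result to 63 bits keeps it representable as a signed 64-bit integer.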

```python
nemo_automodel.components.datasets.llm.retrieval_collator._unpack_doc_values(
    features: typing.List[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Dict[str, typing.Any]]
```

Unpack document lists into individual examples.
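
The unpacking step might look like the sketch below. The input layout is an assumption: each feature is taken to hold scalar fields (such as `question`) alongside equal-length list fields (such as `doc_text`), and the name `unpack_docs` is hypothetical.

```python
from typing import Any, Dict, List


def unpack_docs(features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Emit one example per document, repeating the scalar fields."""
    examples = []
    for feature in features:
        # Assumption: every list-valued field has one entry per document.
        list_keys = [k for k, v in feature.items() if isinstance(v, list)]
        n_docs = len(feature[list_keys[0]]) if list_keys else 1
        for i in range(n_docs):
            examples.append(
                {k: (v[i] if k in list_keys else v) for k, v in feature.items()}
            )
    return examples


flat = unpack_docs([{"question": "q1", "doc_text": ["a", "b"]}])
```

After unpacking, each example pairs one query with one document, which is the shape a cross-encoder scores.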