# nemo_automodel.components.datasets.llm.retrieval_dataset_inline

## Module Contents

### Functions

| Name                                                                                                                                            | Description                                                                     |
| ----------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [`_coerce_to_list`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_coerce_to_list)                                           | -                                                                               |
| [`_create_cross_encoder_transform_func`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_create_cross_encoder_transform_func) | Create transform function with specified number of negative documents.          |
| [`_create_retrieval_transform_func`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_create_retrieval_transform_func)         | Create transform function with specified number of negative documents.          |
| [`_cross_encoder_transform_func`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_cross_encoder_transform_func)               | Transform function to convert from raw format to cross-encoder training format. |
| [`_load_json_or_jsonl`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_load_json_or_jsonl)                                   | Load a JSON file, falling back to JSONL (one JSON object per line).             |
| [`_normalize_inline_doc`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_normalize_inline_doc)                               | Normalize an inline doc (text/image provided) into a canonical dict shape.      |
| [`_resolve_doc_to_example`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_resolve_doc_to_example)                           | Resolve a doc reference into an example dict with keys: text, image, nr\_ocr.   |
| [`_retrieval_transform_func`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-_retrieval_transform_func)                       | Transform function to convert from raw format to training format.               |
| [`flatten_bi_encoder_to_cross_encoder`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-flatten_bi_encoder_to_cross_encoder)   | Flatten grouped bi-encoder output into cross-encoder format.                    |
| [`load_datasets`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-load_datasets)                                               | Load retrieval datasets from JSON/JSONL files.                                  |
| [`make_retrieval_dataset`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-make_retrieval_dataset)                             | Load and return dataset in retrieval format for encoder training.               |

### Data

[`INLINE_CORPUS_ID`](#nemo_automodel-components-datasets-llm-retrieval_dataset_inline-INLINE_CORPUS_ID)

### API

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._coerce_to_list(
    value: typing.Any
) -> list
```

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_cross_encoder_transform_func(
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)
```

Create transform function with specified number of negative documents.

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_retrieval_transform_func(
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)
```

Create transform function with specified number of negative documents.

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._cross_encoder_transform_func(
    examples,
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)
```

Transform function to convert from raw format to cross-encoder training format.

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._load_json_or_jsonl(
    path: str
) -> typing.Union[dict, list]
```

Load a JSON file, falling back to JSONL (one JSON object per line).
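The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the module's actual implementation; the name `load_json_or_jsonl_sketch` is hypothetical, and the real function may detect the format differently (for example, by file extension).

```python
import json
import os
import tempfile
from typing import Union


def load_json_or_jsonl_sketch(path: str) -> Union[dict, list]:
    """Hypothetical stand-in: parse as JSON first, then fall back to JSONL."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        # First attempt: the whole file is a single JSON document.
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: treat every non-empty line as its own JSON object.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


# Usage: write one file of each kind into a temporary directory and load both.
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "sample.json")
    jsonl_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(json_path, "w", encoding="utf-8") as f:
        f.write('{"data": [1, 2]}')
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write('{"a": 1}\n{"a": 2}\n')
    json_result = load_json_or_jsonl_sketch(json_path)
    jsonl_result = load_json_or_jsonl_sketch(jsonl_path)
```

Note that a JSONL file containing exactly one line is also valid JSON, so under this sketch it would be returned as a single object rather than a one-element list.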

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._normalize_inline_doc(
    doc: typing.Any
) -> typing.Dict[str, typing.Any]
```

Normalize an inline doc (text/image provided) into a canonical dict shape.

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._resolve_doc_to_example(
    doc: typing.Any
) -> dict
```

Resolve a doc reference into an example dict with keys: text, image, nr\_ocr.

Supported doc forms:

* `str`: interpreted as inline document text
* `dict`: must include `text` (optionally `image`, `nr_ocr`)
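To make the two supported doc forms concrete, here is a minimal sketch of the normalization. The function name `resolve_doc_sketch` is hypothetical, and the use of `None` for a missing `image` or `nr_ocr` is an assumption; the library may use a different default such as an empty string.

```python
from typing import Any, Dict


def resolve_doc_sketch(doc: Any) -> Dict[str, Any]:
    """Hypothetical stand-in: map a str or dict doc to text/image/nr_ocr keys."""
    if isinstance(doc, str):
        # A bare string is treated as the inline document text.
        return {"text": doc, "image": None, "nr_ocr": None}
    if isinstance(doc, dict):
        if "text" not in doc:
            raise ValueError("dict docs must include a 'text' key")
        # Optional fields default to None here (an assumption of this sketch).
        return {
            "text": doc["text"],
            "image": doc.get("image"),
            "nr_ocr": doc.get("nr_ocr"),
        }
    raise TypeError(f"unsupported doc type: {type(doc).__name__}")
```

Both `resolve_doc_sketch("some passage")` and `resolve_doc_sketch({"text": "some passage"})` produce the same three-key dict in this sketch.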

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline._retrieval_transform_func(
    examples,
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)
```

Transform function to convert from raw format to training format.

**Parameters:**

* `examples`: Batch of examples with `question`, `corpus_id`, `pos_doc`, `neg_doc`
* `num_neg_docs`: Number of negative documents to use
* `corpus_dict`: Dictionary mapping `corpus_id` to corpus objects
* `use_dataset_instruction`: Whether to use the instruction from the dataset's metadata

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline.flatten_bi_encoder_to_cross_encoder(
    data: dict
) -> dict
```

Flatten grouped bi-encoder output into cross-encoder format.

Takes bi-encoder-style data (queries with grouped doc lists) and flattens it
so each query-doc pair becomes a separate entry. Used by cross-encoder transforms
in both retrieval\_dataset.py and retrieval\_dataset\_inline.py.
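The flattening idea can be illustrated with a small sketch. The key names `question` and `doc_text` and the function name `flatten_sketch` are assumptions made for illustration; the actual input and output schema of `flatten_bi_encoder_to_cross_encoder` is not specified on this page and may carry additional fields.

```python
from typing import Dict, List


def flatten_sketch(data: Dict[str, list]) -> Dict[str, list]:
    """Hypothetical stand-in: expand grouped doc lists into query-doc pairs."""
    flat_questions: List[str] = []
    flat_docs: List[str] = []
    # Each query comes with a list of docs; emit one entry per (query, doc) pair.
    for question, docs in zip(data["question"], data["doc_text"]):
        for doc in docs:
            flat_questions.append(question)
            flat_docs.append(doc)
    return {"question": flat_questions, "doc_text": flat_docs}


grouped = {"question": ["q1", "q2"], "doc_text": [["a", "b"], ["c"]]}
flat = flatten_sketch(grouped)
```

In this sketch, a batch of two queries with two and one docs respectively becomes three query-doc entries, with each query repeated once per doc.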

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline.load_datasets(
    data_dir_list: typing.Union[typing.List[str], str],
    concatenate: bool = True
)
```

Load retrieval datasets from JSON/JSONL files.

Copied from nemo-retriever-research/src/data/datasets.py

**Returns:**

Tuple of (dataset, corpus\_dict)

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset(
    data_dir_list: typing.Union[typing.List[str], str],
    model_type: str = 'bi_encoder',
    data_type: str = 'train',
    n_passages: int = 5,
    eval_negative_size: int = None,
    seed: int = 42,
    do_shuffle: bool = False,
    max_train_samples: int = None,
    train_data_select_offset: int = 0,
    use_dataset_instruction: bool = False
)
```

Load and return dataset in retrieval format for encoder training.

This function loads data from JSON files and returns it ready for training.
It uses `set_transform()` for lazy evaluation; tokenization is handled by the collator.

**Parameters:**

* `data_dir_list`: Path(s) to JSON file(s) containing training data
* `model_type`: `"bi_encoder"` (default) or `"cross_encoder"`
* `data_type`: Type of data (`"train"` or `"eval"`)
* `n_passages`: Number of passages (1 positive + n-1 negatives)
* `eval_negative_size`: Number of negative documents for evaluation
* `seed`: Random seed for reproducibility (for shuffling if needed)
* `do_shuffle`: Whether to shuffle the dataset
* `max_train_samples`: Maximum number of training samples to use
* `train_data_select_offset`: Offset for selecting training samples
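The arithmetic behind `n_passages` (one positive plus `n_passages - 1` negatives) can be sketched as below. The function name `select_passages_sketch` is hypothetical, and taking the first positive and the first negatives in order is an assumption; the library may sample or rotate documents instead, and may handle examples with too few negatives differently.

```python
from typing import List


def select_passages_sketch(
    pos_docs: List[str], neg_docs: List[str], n_passages: int
) -> List[str]:
    """Hypothetical stand-in: build a passage list of 1 positive + n-1 negatives."""
    if n_passages < 1:
        raise ValueError("n_passages must be at least 1")
    num_neg_docs = n_passages - 1
    # Slicing silently yields fewer passages when not enough negatives exist.
    return pos_docs[:1] + neg_docs[:num_neg_docs]


passages = select_passages_sketch(["p"], ["n1", "n2", "n3", "n4", "n5"], n_passages=3)
```

With the default `n_passages=5`, this sketch would keep one positive and four negatives per query.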

**Returns:**

A HuggingFace Dataset where each example is a dict with keys:

```python
nemo_automodel.components.datasets.llm.retrieval_dataset_inline.INLINE_CORPUS_ID = '__inline__'
```