# Dataset Format Requirements

<a id="customizer-data-format" />

Use the following guidelines to prepare your training dataset for the supported models.

## Dataset Preparation Guidelines

* **File Format**: Save your training data as `.jsonl` files (one JSON object per line).
* **Validation**: Each record is automatically validated against the appropriate schema when training begins. The required format depends on the `training.type` value (for example, `sft`) specified in your job configuration.
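
As a rough illustration of the one-object-per-line rule, the following sketch writes and re-reads a `.jsonl` file using only the Python standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not part of any NeMo API; this is a local convenience check, not the validation that runs when training begins.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write each record as a single-line JSON object (hypothetical helper)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            # json.dumps without `indent` escapes newlines inside strings,
            # so every record occupies exactly one line of the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse a .jsonl file and report the line number of any malformed record."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines rather than failing on them
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
    return records
```

Note that the example entries on this page are pretty-printed across several lines for readability; in an actual `.jsonl` file each entry would sit on a single line.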

For dataset creation tutorials, refer to [Format Training Dataset](/documentation/customizer-reference/tutorials/format-training-dataset).

## Dataset Formats

The following sections describe the required schema and example dataset entries for each model type.

### Embedding Model

For embedding model training, the following additional guidelines apply:

* **Negative Examples**: Provide multiple negative documents per query for better contrastive learning.
* **Content Quality**: Ensure positive documents are semantically relevant to the query.
* **JSONL Format**: The supported embedding models require training data in JSONL format with a specific triplet structure for contrastive learning.

#### Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

* **`query`** (string): The query text to use as an anchor for similarity learning.
* **`pos_doc`** (string): A document that should match positively with the query.
* **`neg_doc`** (array of strings): One or more documents that should not match with the query.

#### Example Dataset Entry

```json
{
  "query": "3D trajectory recovery for tracking multiple objects",
  "pos_doc": "Recursive Estimation of Motion, Structure, and Focal Length",
  "neg_doc": [
    "Characterization of 1.2 kV SiC super-junction SBD implemented by trench technique"
  ]
}
```
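
Before uploading, it can help to check records locally against the schema above. The sketch below is one possible pre-check; `validate_embedding_record` is a hypothetical helper, and the rule that strings must be non-empty is an assumption of this sketch rather than a documented requirement.

```python
def validate_embedding_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field in ("query", "pos_doc"):
        value = record.get(field)
        # Assumption: empty or whitespace-only strings are treated as invalid.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{field}' must be a non-empty string")
    neg_docs = record.get("neg_doc")
    if not isinstance(neg_docs, list) or not neg_docs:
        problems.append("'neg_doc' must be a non-empty array of strings")
    elif not all(isinstance(doc, str) for doc in neg_docs):
        problems.append("'neg_doc' must contain only strings")
    return problems
```

A common mistake this catches is passing `neg_doc` as a single string instead of an array.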

### OpenAI Chat Model

#### Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

* **`messages`** (array of objects): The messages in the conversation. Each message object contains:
  * **`role`** (string): The role of the message sender (for example, `system`).
  * **`content`** (string): The content of the message.

#### Example Dataset Entry

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are an email writing assistant. Please help people write cogent emails."
    },
    ...
  ]
}
```
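
The structural requirements above can be checked locally with a short function such as the one below. This is a minimal sketch: `validate_chat_record` is a hypothetical name, and it only verifies the shape described in this section (a non-empty `messages` array whose items have string `role` and `content` fields). It does not check which role values are accepted.

```python
def validate_chat_record(record):
    """Return a list of structural problems found in a chat-format record."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty array of objects"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{index}] must be an object")
            continue
        for field in ("role", "content"):
            if not isinstance(message.get(field), str):
                problems.append(f"messages[{index}].{field} must be a string")
    return problems
```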

### OpenAI Chat With Tool

#### Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

* **`messages`** (array of objects): The messages in the conversation, including tool calls.
* **`tools`** (array of objects): The available tools that can be called.

#### Example Dataset Entry

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Retrieve information on crimes with no location from the West Yorkshire Police for May 2023 in the categories 'Shoplifting' and 'Violence and Sexual Offences'"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "crimes_with_no_location",
            "arguments": {
              "date": "2023-05",
              "force": "West Yorkshire Police",
              "category": "Shoplifting"
            }
          }
        },
        {
          "type": "function",
          "function": {
            "name": "crimes_with_no_location",
            "arguments": {
              "date": "2023-05",
              "force": "West Yorkshire Police",
              "category": "Violence and Sexual Offences"
            }
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "crimes_with_no_location",
        "description": "Fetches a list of crimes from a specified police force on a given date and category, where the crimes have no specified location.",
        "parameters": {
          "type": "object",
          "properties": {
            "date": {
              "description": "The date of the crimes to retrieve in 'YYYY-MM' format.",
              "type": "string",
              "default": "2011-09"
            },
            "force": {
              "description": "The identifier for the police force responsible for handling the crimes.",
              "type": "string",
              "default": "warwickshire"
            },
            "category": {
              "description": "The category of the crimes to retrieve.",
              "type": "string",
              "default": "all-crime"
            }
          }
        }
      }
    }
  ]
}
```
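
One consistency check that may be worth running on tool-calling data is that every function named in a `tool_calls` entry is also declared in the record's `tools` array, as it is in the example above. The following sketch assumes the nesting shown in that example; `undeclared_tool_calls` is a hypothetical helper, and whether the platform itself enforces this rule is not stated here.

```python
def undeclared_tool_calls(record):
    """Return sorted names of called functions that are missing from `tools`."""
    declared = {tool["function"]["name"] for tool in record.get("tools", [])}
    called = {
        call["function"]["name"]
        for message in record.get("messages", [])
        # Messages without tool calls (for example, user turns) contribute nothing.
        for call in message.get("tool_calls", [])
    }
    return sorted(called - declared)
```

An empty return value means every called function has a matching declaration.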

### Basic Prompt Completion

#### Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

* **`prompt`** (string): The input prompt or question.
* **`completion`** (string): The expected response or completion.

#### Example Dataset Entry

```json
{
  "prompt": "your string",
  "completion": "your expected response"
}
```

### SFT Legacy Conversational

#### Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

* **`system`** (string): The system message that defines the assistant's role or behavior.
* **`conversations`** (array of objects): The conversation turns between user and assistant. Each turn object contains:
  * **`from`** (string): The role of the message sender (`"User"` or `"Assistant"`).
  * **`value`** (string): The content of the message.

#### Example Dataset Entry

```json
{
  "system": "you are a robot",
  "conversations": [
    {"from": "User", "value": "Choose a number that is greater than 0 and less than 2\n"},
    {"from": "Assistant", "value": "1"}
  ]
}
```
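
If you have legacy conversational data and want to express it in the `messages` structure described under OpenAI Chat Model, a conversion along the following lines might work. This is a sketch under assumptions: `legacy_to_chat` is a hypothetical helper, and mapping `"User"` and `"Assistant"` to lowercase role names is an assumed convention, not something this page specifies.

```python
def legacy_to_chat(record):
    """Map a legacy conversational record onto a `messages` array."""
    messages = []
    if record.get("system"):
        messages.append({"role": "system", "content": record["system"]})
    for turn in record["conversations"]:
        # Assumed mapping: "User" -> "user", "Assistant" -> "assistant".
        messages.append({"role": turn["from"].lower(), "content": turn["value"]})
    return {"messages": messages}
```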