Dataset Format Requirements

Use the following guidelines to prepare your training dataset for the supported models.

Dataset Preparation Guidelines

  • File Format: Save your training data as .jsonl files (one JSON object per line).
  • Validation: Each record is automatically validated against the appropriate schema when training begins. The required schema depends on the training.type value (for example, sft) specified in your job configuration.
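The one-object-per-line rule can be sketched as follows. The examples on this page are pretty-printed for readability, but in the file each record has to occupy a single line. The helper names `write_jsonl` and `read_jsonl` are hypothetical and only use the Python standard library; this is a minimal sketch, not part of the training service.

```python
import json


def write_jsonl(records, path):
    """Write each record as one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps without indent escapes newlines inside strings,
            # so every record stays on a single physical line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with `read_jsonl` before submitting a job is a cheap way to catch lines that are not valid JSON.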

For dataset creation tutorials, refer to Format Training Dataset.

Dataset Formats

The following sections describe the required schema and example dataset entries for each model type.

Embedding Model

For embedding model training, the following additional guidelines apply:

  • Negative Examples: Provide multiple negative documents per query for better contrastive learning.
  • Content Quality: Ensure positive documents are semantically relevant to the query.
  • JSONL Format: The supported embedding models require training data in JSONL format, with each record structured as a triplet (a query, a positive document, and one or more negative documents) for contrastive learning.

Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

  • query (string): The query text to use as an anchor for similarity learning.
  • pos_doc (string): A document that should match positively with the query.
  • neg_doc (array of strings): One or more documents that should not match with the query.

Example Dataset Entry

{
  "query": "3D trajectory recovery for tracking multiple objects",
  "pos_doc": "Recursive Estimation of Motion, Structure, and Focal Length",
  "neg_doc": [
    "Characterization of 1.2 kV SiC super-junction SBD implemented by trench technique"
  ]
}
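A local pre-check for the triplet schema above might look like the following. This is a minimal sketch, assuming only the three fields listed under Required Schema; the function name `check_embedding_record` is hypothetical, and the service's own validation may apply further rules.

```python
def check_embedding_record(record):
    """Return a list of problems found in one embedding training record."""
    problems = []
    # query and pos_doc are both plain strings.
    for field in ("query", "pos_doc"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field} must be a string")
    # neg_doc is an array holding one or more strings.
    neg = record.get("neg_doc")
    if not isinstance(neg, list) or not neg:
        problems.append("neg_doc must be a non-empty array")
    elif not all(isinstance(doc, str) for doc in neg):
        problems.append("neg_doc must contain only strings")
    return problems
```

An empty list means the record matches the documented shape; a common mistake this catches is passing `neg_doc` as a single string instead of an array.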

OpenAI Chat Model

Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

  • messages (array of objects): The messages in the conversation. Each message object has the following fields:
      • role (string): The role of the message author (for example, system).
      • content (string): The content of the message.

Example Dataset Entry

{
  "messages": [
    {
      "role": "system",
      "content": "You are an email writing assistant. Please help people write cogent emails."
    },
    ...
  ]
}
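The nesting of role and content inside each messages entry can be checked locally with a sketch like the one below. The function name `check_chat_record` is hypothetical, and treating an empty messages array as an error is an assumption rather than a documented rule.

```python
def check_chat_record(record):
    """Return a list of problems found in one chat-format training record."""
    messages = record.get("messages")
    # Assumption: a record with no messages is not useful for training.
    if not isinstance(messages, list) or not messages:
        return ["messages must be a non-empty array"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        # Both fields are strings on every message object.
        for field in ("role", "content"):
            if not isinstance(message.get(field), str):
                problems.append(f"messages[{i}].{field} must be a string")
    return problems
```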

OpenAI Chat With Tool

Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

  • messages (array of objects): The messages in the conversation, including tool calls.
  • tools (array of objects): The available tools that can be called.

Example Dataset Entry

{
  "messages": [
    {
      "role": "user",
      "content": "Retrieve information on crimes with no location from the West Yorkshire Police for May 2023 in the categories 'Shoplifting' and 'Violence and Sexual Offences'"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "crimes_with_no_location",
            "arguments": {
              "date": "2023-05",
              "force": "West Yorkshire Police",
              "category": "Shoplifting"
            }
          }
        },
        {
          "type": "function",
          "function": {
            "name": "crimes_with_no_location",
            "arguments": {
              "date": "2023-05",
              "force": "West Yorkshire Police",
              "category": "Violence and Sexual Offences"
            }
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "crimes_with_no_location",
        "description": "Fetches a list of crimes from a specified police force on a given date and category, where the crimes have no specified location.",
        "parameters": {
          "type": "object",
          "properties": {
            "date": {
              "description": "The date of the crimes to retrieve in 'YYYY-MM' format.",
              "type": "string",
              "default": "2011-09"
            },
            "force": {
              "description": "The identifier for the police force responsible for handling the crimes.",
              "type": "string",
              "default": "warwickshire"
            },
            "category": {
              "description": "The category of the crimes to retrieve.",
              "type": "string",
              "default": "all-crime"
            }
          }
        }
      }
    }
  ]
}
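In the example above, every function named in tool_calls also appears in tools. A consistency check along those lines could be sketched as follows; the function name `undeclared_tool_calls` is hypothetical, and whether the service enforces this relationship is an assumption, not something stated on this page.

```python
def undeclared_tool_calls(record):
    """Return names of called functions that are missing from the tools array."""
    # Names of every function declared in the record's tools array.
    declared = {tool["function"]["name"] for tool in record.get("tools", [])}
    # Names of every function invoked in any message's tool_calls array.
    called = [
        call["function"]["name"]
        for message in record.get("messages", [])
        for call in message.get("tool_calls", [])
    ]
    return [name for name in called if name not in declared]
```

An empty result means each tool call refers to a declared tool.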

Basic Prompt Completion

Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

  • prompt (string): The input prompt or question.
  • completion (string): The expected response or completion.

Example Dataset Entry

{
  "prompt": "your string",
  "completion": "your expected response"
}

SFT Legacy Conversational

Required Schema

Each line in your JSONL file must contain a JSON object with these required fields:

  • system (string): The system message that defines the assistant’s role or behavior.
  • conversations (array of objects): The conversation turns between user and assistant. Each turn object has the following fields:
      • from (string): The role of the message sender (“User” or “Assistant”).
      • value (string): The content of the message.

Example Dataset Entry

{
  "system": "you are a robot",
  "conversations": [
    {"from": "User", "value": "Choose a number that is greater than 0 and less than 2\n"},
    {"from": "Assistant", "value": "1"}
  ]
}
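If a legacy dataset needs to be expressed in the messages layout described under OpenAI Chat Model, the field mapping could be sketched as below. The function name `legacy_to_messages` and the `ROLE_MAP` table are hypothetical, and mapping “User” and “Assistant” to lowercase role names is an assumption based on the role values used in the examples on this page.

```python
# Assumed mapping from legacy sender names to chat role names.
ROLE_MAP = {"User": "user", "Assistant": "assistant"}


def legacy_to_messages(record):
    """Convert one legacy conversational record to the messages layout."""
    # The legacy system field becomes the first message.
    messages = [{"role": "system", "content": record["system"]}]
    # Each legacy turn maps from/value to role/content.
    for turn in record["conversations"]:
        messages.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return {"messages": messages}
```

A sender name outside `ROLE_MAP` raises a `KeyError`, which surfaces unexpected values instead of silently passing them through.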