Dataset Format Requirements
Use the following guidelines to prepare your training dataset for the supported models.
Dataset Preparation Guidelines
- File Format: Save your training data as
.jsonlfiles (one JSON object per line). - Validation: Each record is automatically validated against the appropriate schema when training begins. The required format depends on
training.type(for example,sft) specified in your job configuration.
For dataset creation tutorials, refer to Format Training Dataset.
Dataset Formats
The following sections describe the required schema and example dataset entries for each model type.
Embedding Model
For embedding model training, the following additional guidelines apply:
- Negative Examples: Provide multiple negative documents per query for better contrastive learning.
- Content Quality: Ensure positive documents are semantically relevant to the query.
- JSONL Format: The supported embedding models require training data in JSONL format with a specific triplet structure for contrastive learning.
Required Schema
Each line in your JSONL file must contain a JSON object with these required fields:
query(string): The query text to use as an anchor for similarity learning.pos_doc(string): A document that should match positively with the query.neg_doc(array of strings): One or more documents that should not match with the query.
Example Dataset Entry
OpenAI Chat Model
Required Schema
Each line in your JSONL file must contain a JSON object with these required fields:
messages(array of objects): The messages in the conversation.role(string): The role of the message.content(string): The content of the message.
Example Dataset Entry
OpenAI Chat With Tool
Required Schema
Each line in your JSONL file must contain a JSON object with these required fields:
messages(array of objects): The messages in the conversation, including tool calls.tools(array of objects): The available tools that can be called.
Example Dataset Entry
Basic Prompt Completion
Required Schema
Each line in your JSONL file must contain a JSON object with these required fields:
prompt(string): The input prompt or question.completion(string): The expected response or completion.
Example Dataset Entry
SFT Legacy Conversational
Required Schema
Each line in your JSONL file must contain a JSON object with these required fields:
system(string): The system message that defines the assistant’s role or behavior.conversations(array of objects): The conversation turns between user and assistant.from(string): The role of the message sender (“User” or “Assistant”).value(string): The content of the message.