Format Training Dataset#
Learn how to format a training dataset to work with the model type you want to train, such as a chat or completion model.
Chat Model — Requires data that adheres to the messages schema.
Completion Model — Requires data that adheres to the prompt-completion schema.
Customizer expects all datasets to use JSONL format, where each line in the dataset is a training example serialized in JSON.
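Because a single malformed line will fail the whole dataset, it can help to sanity-check your JSONL file before uploading. The sketch below is a minimal, hypothetical validator (the helper name `validate_jsonl` and the file path are illustrative, not part of the Customizer API):

```python
import json

def validate_jsonl(path):
    """Check that every line in a JSONL file parses as valid JSON."""
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                raise ValueError(f"Line {line_num} is empty")
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Line {line_num} is not valid JSON: {e}")
    return True
```

Run it once over your training file before uploading; it raises on the first bad line instead of letting the error surface later in a training job.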
Prerequisites#
Platform Prerequisites#
New to using NeMo microservices?
NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.
If you’re new to the platform, complete these foundational tutorials first:
Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end
Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work
If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.
Learn more: Entity Concepts
NeMo Customizer Prerequisites#
Microservice Setup Requirements and Environment Variables
Before starting, make sure you have:
Access to NeMo Customizer
The huggingface_hub Python package installed
(Optional) Weights & Biases account and API key for enhanced visualization
Set up environment variables:
# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NAMESPACE="default"
export DATASET_NAME="test-dataset"
# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value" # Or your actual HF token
# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"
Replace the placeholder values with your actual service URLs and credentials.
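If you script against these services from Python, it can be useful to fail fast when a required variable is unset. A minimal sketch (the helper name `check_env` is hypothetical):

```python
import os

def check_env(required):
    """Return the names of required environment variables that are not set."""
    return [name for name in required if not os.environ.get(name)]

missing = check_env(["CUSTOMIZER_BASE_URL", "ENTITY_HOST", "DS_HOST"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```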
Chat Models#
Chat and instruction models require additional structure in their data to capture concepts such as multi-turn conversations. We support the OpenAI messages format.
Format a Conversation Dataset#
Train a chat model to generate responses using multiple messages as context.
Basic Schema#
A conversational dataset contains a sequence of messages that represent interactions between users and assistants. Each message has:
A role field to categorize the message text. Options include system, user, and assistant.
A content field for the actual body of information communicated by that role.
Tip
For best training results, the assistant role should be the last message in each training example.
Example entry formatted for JSONL dataset file:
{"messages": [{"role": "system","content": "<system message>"}, {"role": "user","content": "<user message>"}, {"role": "assistant","content": "<assistant message>"}]}
Expanded JSON example
For illustrative purposes only, we show an example entry as multi-line JSON.
{
"messages": [
{
"role": "system",
"content": "<system message>"
}, {
"role": "user",
"content": "<user message>"
}, {
"role": "assistant",
"content": "<assistant message>"
}
]
}
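To produce a file in this format programmatically, you can serialize each example onto its own line. The sketch below is one way to do it (the helper name `write_chat_dataset` and the output file name are illustrative); it also enforces the tip above that the assistant message comes last:

```python
import json

def write_chat_dataset(examples, path):
    """Serialize chat examples to JSONL, one training example per line.

    Each example is a list of {"role", "content"} message dicts; for best
    training results the last message should come from the assistant.
    """
    with open(path, "w", encoding="utf-8") as f:
        for messages in examples:
            if messages[-1]["role"] != "assistant":
                raise ValueError("Last message should come from the assistant")
            f.write(json.dumps({"messages": messages}) + "\n")

examples = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi! How can I help?"},
    ]
]
write_chat_dataset(examples, "chat-train.jsonl")
```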
Reasoning Considerations#
Some models (such as Llama Nemotron) support a detailed thinking mode, which you can toggle in the system message. This setting controls whether the model is encouraged to show step-by-step reasoning in its responses.
Training data without reasoning: Use detailed thinking off in the system message.
Training data with reasoning: Use detailed thinking on in the system message.
Note
If you have an existing system message that must be preserved, prepend detailed thinking on or detailed thinking off to it.
For example, if your original system message is You are a helpful assistant., use detailed thinking on\nYou are a helpful assistant. or detailed thinking off\nYou are a helpful assistant.
{"messages": [
{"role": "system", "content": "detailed thinking off"},
{"role": "user", "content": "What is 2 + 2?"},
{"role": "assistant", "content": "4"}
]}
{"messages": [
{"role": "system", "content": "detailed thinking on"},
{"role": "user", "content": "What is 2 + 2?"},
{"role": "assistant", "content": "<think>To solve 2 + 2, add 2 and 2 together.</think> The answer is 4."}
]}
You can adjust the system message for each training example to match the style of your data and the behavior you want the model to learn.
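When preparing data at scale, the prefixing rule above is easy to get wrong by hand. A small helper such as the following (the name `with_thinking_mode` is a hypothetical example, not part of any SDK) applies the toggle consistently:

```python
def with_thinking_mode(system_message, enabled):
    """Prepend the detailed-thinking toggle to an existing system message."""
    prefix = "detailed thinking on" if enabled else "detailed thinking off"
    if not system_message:
        return prefix
    return f"{prefix}\n{system_message}"

# with_thinking_mode("You are a helpful assistant.", True)
# -> "detailed thinking on\nYou are a helpful assistant."
```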
Schema with Tool Calling#
Tool calling (also known as function calling) allows the model to directly interact with external systems based on user inputs. By integrating the model with external applications and APIs, you can significantly expand its capabilities to include:
Data retrieval from databases or services
Execution of specific actions in connected systems
Performance of computation tasks
Implementation of complex business logic
Inline Tools#
To train a model with tool calling capabilities, use the conversational dataset format with additional fields. Beyond the standard messages structure, you’ll need to define the list of tools the model can access along with their function parameters. Within the messages, include tool_calls when you want the model to invoke a tool with specific arguments.
Every sample must be a single line, as in the example below.
{"messages": [{"role": "user","content": ""},{"role": "assistant","content": "","tool_calls": [{"type": "function","function": {"name": "fibonacci","arguments": {"n": 20}}}]}],"tools": [{"type": "function","function": {"name": "fibonacci","description": "Calculates the nth Fibonacci number.","parameters": {"type": "object","properties": {"n": {"description": "The position of the Fibonacci number.","type": "integer"}}}}}]}
Expanded JSON example
For illustrative purposes only, we show an example entry as multi-line JSON.
{
"messages": [
{
"role": "user",
"content": ""
},
{
"role": "assistant",
"content": "",
"tool_calls": [{
"type": "function",
"function": {
"name": "fibonacci",
"arguments": {"n": 20}
}
}]
}
],
"tools": [{
"type": "function",
"function": {
"name": "fibonacci",
"description": "Calculates the nth Fibonacci number.",
"parameters": {
"type": "object",
"properties": {
"n": {
"description": "The position of the Fibonacci number.",
"type": "integer"
}
}
}
}
}]
}
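One way to build records in this schema without hand-writing JSON is a small constructor function. The sketch below (the helper name `make_tool_call_example` is hypothetical) assembles the fibonacci example shown above and serializes it to a single JSONL line:

```python
import json

def make_tool_call_example(user_msg, tool_name, arguments, tools):
    """Build one tool-calling training example in the schema shown above."""
    return {
        "messages": [
            {"role": "user", "content": user_msg},
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {"type": "function",
                     "function": {"name": tool_name, "arguments": arguments}}
                ],
            },
        ],
        "tools": tools,
    }

fib_tool = {
    "type": "function",
    "function": {
        "name": "fibonacci",
        "description": "Calculates the nth Fibonacci number.",
        "parameters": {
            "type": "object",
            "properties": {
                "n": {"description": "The position of the Fibonacci number.",
                      "type": "integer"}
            },
        },
    },
}

example = make_tool_call_example("", "fibonacci", {"n": 20}, [fib_tool])
line = json.dumps(example)  # one training example per JSONL line
```

Because `json.dumps` emits no newlines by default, each serialized example stays on a single line, as the schema requires.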
Find Chat Models#
You can perform a GET request to Entity Store’s model registry to discover chat models, which have spec.is_chat set to true.
from nemo_microservices import NeMoMicroservices
import os
# Initialize the client
client = NeMoMicroservices(
base_url=os.environ['CUSTOMIZER_BASE_URL']
)
# Get a specific model
model = client.models.retrieve(
name="model-name@version",
namespace="my-namespace"
)
print(f"Model: {model.name}")
print(f"Is chat model: {model.spec.is_chat}")
print(f"Parameters: {model.spec.num_parameters}")
print(f"Context size: {model.spec.context_size}")
# Display model details
if model.spec.is_chat:
print("This is a chat model - use /chat/completions")
else:
print("This is a completion model - use /completions")
curl http://nemo.test/v1/models/my-namespace/model-name@version | jq
Example Response
{
"name": "model-name@version",
"namespace": "my-namespace",
"spec": {
"num_parameters": 8000000000,
"context_size": 4096,
"num_virtual_tokens": 0,
"is_chat": true
},
"artifact": {
"gpu_arch": "Ampere",
"precision": "bf16",
"tensor_parallelism": 1,
"backend_engine": "nemo",
"status": "upload_completed",
"files_url": "hf://my-namespace/model-name@version"
},
"base_model": "meta/llama-3.1-8b-instruct",
"peft": {
"finetuning_type": "lora"
}
}
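Given a raw model registry response like the one above, the spec.is_chat flag determines which inference endpoint to use. A minimal sketch using only the standard library (the embedded JSON is a trimmed copy of the example response):

```python
import json

# Trimmed copy of the example registry response above
response_json = """{
  "name": "model-name@version",
  "namespace": "my-namespace",
  "spec": {"num_parameters": 8000000000, "context_size": 4096, "is_chat": true}
}"""

model = json.loads(response_json)
# Chat models use /chat/completions; completion models use /completions
endpoint = "/chat/completions" if model["spec"]["is_chat"] else "/completions"
```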
Chat with the Model#
To run inference with a chat model, use the /chat/completions endpoint for the model.
from nemo_microservices import NeMoMicroservices
import os
# Initialize the client
client = NeMoMicroservices(
base_url=os.environ['CUSTOMIZER_BASE_URL']
)
# Create a chat completion
response = client.chat.completions.create(
model="my-namespace/model-name@version",
messages=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello! How are you?"
}
],
temperature=0.5,
top_p=1,
max_tokens=1024
)
print(f"Response: {response.choices[0].message.content}")
curl -X POST http://nim.test/v1/chat/completions \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "my-namespace/model-name@version",
"messages": [
{
"role":"system",
"content":""
},
{
"role":"user",
"content":""
}
],
"temperature": 0.5,
"top_p": 1,
"max_tokens": 1024
}'
Next Steps#
Now that you know how to format your training datasets, you can proceed with creating customization jobs:
Start a LoRA Model Customization Job - For parameter-efficient fine-tuning
Start a Full SFT Customization Job - For full model fine-tuning
Start a DPO Customization Job - For preference-based alignment
Start a Knowledge Distillation Job - For model compression
Completion Models#
Train a model using the completion dataset format for tasks like text summarization, information extraction, question answering, text classification, reasoning, or story writing.
Format a Prompt-Completion Dataset#
Prompt completion datasets have a simple schema. Each datum has:
A prompt field for the body of information provided by the user.
A completion field for the output of the model.
{"prompt": "Hello", "completion": " world."}
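If you later decide to train a chat model on the same data, a prompt-completion record maps naturally onto the messages schema described earlier. A minimal conversion sketch (the helper name `prompt_completion_to_messages` is hypothetical):

```python
import json

def prompt_completion_to_messages(record, system_message=None):
    """Convert a prompt-completion record into the chat messages schema."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": record["prompt"]})
    messages.append({"role": "assistant", "content": record["completion"]})
    return {"messages": messages}

record = {"prompt": "Hello", "completion": " world."}
converted = prompt_completion_to_messages(record)
line = json.dumps(converted)  # one training example per JSONL line
```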
Prompt the Model#
To run inference with a completion model, use the /completions endpoint for the model.
from nemo_microservices import NeMoMicroservices
import os
# Initialize the client
client = NeMoMicroservices(
base_url=os.environ['CUSTOMIZER_BASE_URL']
)
# Create a completion
response = client.completions.create(
model="my-namespace/model-name@version",
prompt="Once upon a time",
temperature=0.5,
top_p=1,
max_tokens=1024
)
print(f"Response: {response.choices[0].text}")
curl -X POST http://nim.test/v1/completions \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "my-namespace/model-name@version",
"prompt": "",
"temperature": 0.5,
"top_p": 1,
"max_tokens": 1024
}'