Generate Tool-Calling Data for SFT
Use this guide when you need multi-turn chat JSONL where the assistant issues OpenAI-style tool_calls and a tool role returns structured results, suitable for supervised fine-tuning (SFT) with a tools definition array.
You will use the sample config customer_support_tools.yaml, which produces ecommerce-style support threads. Each output row includes a messages array (with tool turns) and a tools array, ready for packing and training.
Outcomes
Understand how the shipped config asks one llm_text column to emit a full JSON multi-turn trace in a single model call.
Preview, generate, and validate records before training.
Know how to retarget seeds, prompts, and schema for your own domain.
How It Works
Compared with single-turn configs such as default.yaml, this setup drives the whole conversation from one llm_text column.
The prompt tells the model to return a JSON object with tools and messages keys.
The structured_messages output projection parses that JSON object, extracts messages and tools, adds metadata, and serializes nested tool payload objects into OpenAI-compatible string fields.
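The projection behavior described above can be pictured with a short Python sketch. This is not Data Designer's implementation: `project_record` and its parameters are hypothetical names, and the sketch assumes the column text is a single JSON object with `messages` and `tools` keys, as the prompt requests.

```python
import json


def project_record(raw_text: str, metadata: dict) -> dict:
    """Illustrative stand-in for a structured_messages-style projection."""
    parsed = json.loads(raw_text)  # the llm_text column holds one JSON object
    messages = []
    for msg in parsed["messages"]:
        msg = dict(msg)  # shallow copy so the parsed input is not mutated
        # Tool results: nested object -> JSON string.
        if msg.get("role") == "tool" and not isinstance(msg.get("content"), str):
            msg["content"] = json.dumps(msg["content"], separators=(",", ":"))
        # Tool calls: function.arguments object -> JSON string.
        if msg.get("tool_calls"):
            calls = []
            for call in msg["tool_calls"]:
                fn = dict(call["function"])
                if not isinstance(fn.get("arguments"), str):
                    fn["arguments"] = json.dumps(fn["arguments"], separators=(",", ":"))
                calls.append({**call, "function": fn})
            msg["tool_calls"] = calls
        messages.append(msg)
    # messages and tools land at the top level, followed by metadata fields.
    return {"messages": messages, "tools": parsed["tools"], **metadata}
```

Called with the generated text and a dictionary such as `{"urgency": "calm", "channel": "web_chat"}`, the function returns a record whose `function.arguments` and tool `content` are strings, which is the shape shown in the output example later in this guide.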
# Multi-turn customer-support SFT data with OpenAI-style tool calls.
#
# Output records are training-ready JSONL:
# {"messages": [...], "tools": [...], ...metadata}
#
# Each generated conversation includes one assistant tool call, one matching
# tool response, and a final assistant answer grounded in that tool result.
output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/customer_support_tool_sft.jsonl
num_records: 100
seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/customer_support_tool_seeds.jsonl
  strategy: shuffle
  fields: [customer_name, issue, order_id, product, policy_hint]
# Optional custom endpoint example:
#
# To route this pipeline through an OpenAI-compatible endpoint instead of the
# built-in NVIDIA provider, uncomment and edit both blocks below.
# Keep providers[].api_key as the environment variable name. Data Designer
# resolves it at request time; using `${oc.env:OPENAI_API_KEY}` here would put
# the secret into the resolved config.
#
# providers:
#   - name: my-provider
#     endpoint: ${oc.env:OPENAI_BASE_URL}
#     provider_type: openai
#     api_key: OPENAI_API_KEY
#
# models:
#   - alias: nvidia-text
#     model: google/gemma-4-31B-it
#     provider: my-provider
#     skip_health_check: true
#     inference_parameters:
#       temperature: 0.75
#       top_p: 0.95
#       max_tokens: 1800
models:
  - alias: nvidia-text
    model: openai/gpt-oss-20b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.75
      top_p: 0.95
      max_tokens: 1800
columns:
  - name: urgency
    type: category
    values: [calm, frustrated, rushed, confused]
  - name: channel
    type: category
    values: [web_chat, mobile_app, email_followup]
  - name: conversation
    # Keep this as text instead of llm_structured: Designer writes intermediate
    # batches to parquet before this step can project records, and complex
    # nested objects can produce mixed object schemas across rows.
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      Generate one realistic multi-turn ecommerce customer-support chat for SFT.
      Seed facts:
      - customer_name: {{ customer_name }}
      - customer_issue: {{ issue }}
      - order_id: {{ order_id }}
      - product: {{ product }}
      - policy_hint: {{ policy_hint }}
      - customer_tone: {{ urgency }}
      - channel: {{ channel }}
      Requirements:
      - Return ONLY a JSON object. The first character MUST be `{`, the last MUST be `}`.
      - No prose, no preamble, no apology, no commentary, no markdown fences (no ```).
      - Top-level keys MUST be exactly "tools" and "messages".
      - Produce 6 to 10 messages in OpenAI chat format.
      - Include a brief system message that defines the support-agent behavior.
      - The user should speak naturally and provide imperfect information at first.
      - The assistant should ask at least one clarifying question before using a tool.
      - Include exactly one assistant message with tool_calls.
      - Include exactly one tool message that has the matching tool_call_id.
      - The assistant's final answer must use the tool result and the policy_hint.
      - Do not include markdown in message content.
      - In the generated JSON object, function arguments must be JSON objects, not escaped JSON strings.
      - In the generated JSON object, tool message content must be a JSON object, not an escaped JSON string.
      Available tool functions:
      - lookup_order(order_id: string)
      - check_refund_eligibility(order_id: string, reason: string)
      - update_shipping_address(order_id: string, new_address: string)
      - verify_warranty(order_id: string, product: string)
      - get_subscription_status(order_id: string)
      Example assistant tool-call message shape:
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_lookup_001",
            "type": "function",
            "function": {
              "name": "lookup_order",
              "arguments": {"order_id": "ORD-10492"}
            }
          }
        ]
      }
      Example tool message shape:
      {
        "role": "tool",
        "tool_call_id": "call_lookup_001",
        "name": "lookup_order",
        "content": {"status": "delayed", "eta": "tomorrow"}
      }
      Schema:
      {
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "lookup_order",
              "description": "Look up an order by ID.",
              "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"]
              }
            }
          }
        ],
        "messages": [
          {"role": "system", "content": "You are a helpful support agent."},
          {"role": "user", "content": "User message"},
          {
            "role": "assistant",
            "content": "",
            "tool_calls": [
              {
                "id": "call_lookup_001",
                "type": "function",
                "function": {
                  "name": "lookup_order",
                  "arguments": {"order_id": "ORD-10492"}
                }
              }
            ]
          },
          {
            "role": "tool",
            "tool_call_id": "call_lookup_001",
            "name": "lookup_order",
            "content": {"status": "delayed", "eta": "tomorrow"}
          },
          {"role": "assistant", "content": "Final grounded answer"}
        ]
      }
output_projection:
  type: structured_messages
  source_field: conversation
  messages_field: messages
  tools_field: tools
  metadata_fields: [customer_name, issue, order_id, product, urgency, channel]
Each seed row supplies five anchor fields the prompt interpolates: customer_name, issue, order_id, product, and policy_hint. Two extra category columns (urgency, channel) add variety without multiplying seed rows for every combination.
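As a rough illustration of how the seed fields and category columns combine, the Python sketch below attaches one sampled value per category column to a seed row and fills `{{ name }}` placeholders. The seed values are hypothetical, uniform sampling is an assumption about how category values are drawn, and `build_context` and `render` are illustrative helpers, not Data Designer functions.

```python
import random
import re

URGENCY = ["calm", "frustrated", "rushed", "confused"]
CHANNEL = ["web_chat", "mobile_app", "email_followup"]

# Hypothetical seed row: field names come from seed_dataset.fields,
# values are placeholders for illustration only.
SEED_ROW = {
    "customer_name": "Priya",
    "issue": "late delivery",
    "order_id": "ORD-10492",
    "product": "wireless headphones",
    "policy_hint": "offer an expedited replacement for delayed orders",
}


def build_context(seed_row: dict, rng: random.Random) -> dict:
    """Attach one sampled value per category column to a seed row."""
    return {**seed_row, "urgency": rng.choice(URGENCY), "channel": rng.choice(CHANNEL)}


def render(template: str, context: dict) -> str:
    """Fill simple {{ name }} placeholders; a stand-in for real templating."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(context[m.group(1)]), template)


context = build_context(SEED_ROW, random.Random(0))
print(render("- customer_issue: {{ issue }}\n- customer_tone: {{ urgency }}", context))
print(len(URGENCY) * len(CHANNEL), "category combinations per seed row")
```

With four urgency values and three channels, each seed row can appear in 12 distinct tone and channel combinations without any extra rows in the seed file.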
Prerequisites
Nemotron CLI available and working; if this is your first SDG run, complete Generate Your First Synthetic Dataset.
NVIDIA_API_KEY set in the environment.
The bundled seed file data/customer_support_tool_seeds.jsonl (shipped with the step). Add rows, or point the config at your own JSONL.
Procedure
Preview two records to confirm that the structured output matches the schema:
$ nemotron steps run sdg/data_designer -c customer_support_tools preview=true num_records=2
In the preview, confirm:
Exactly one assistant message with tool_calls.
Exactly one tool message whose tool_call_id matches the call.
function.arguments and tool-message content are JSON strings after projection.
The assistant’s closing turn references the tool result (not a generic reply).
No markdown in message content if your trainer expects plain text.
Generate the dataset:
$ nemotron steps run sdg/data_designer -c customer_support_tools num_records=200
Output path: ./output/sdg/customer_support_tool_sft.jsonl. Spot-check a few lines. Each record exposes top-level messages and tools plus metadata, like the following example:
{
  "messages": [
    {"role": "system", "content": "You are a helpful ecommerce support agent..."},
    {"role": "user", "content": "Hi, I haven't received my headphones yet..."},
    {"role": "assistant", "content": "I'd be happy to help. Could you share your order number?"},
    {"role": "user", "content": "It's ORD-10492."},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_001", "type": "function", "function": {"name": "lookup_order", "arguments": "{\"order_id\":\"ORD-10492\"}"}}]},
    {"role": "tool", "tool_call_id": "call_001", "name": "lookup_order", "content": "{\"status\":\"delayed\",\"eta\":\"tomorrow\"}"},
    {"role": "assistant", "content": "Your order is delayed and should arrive tomorrow. Per our policy, I can arrange an expedited replacement if you prefer."}
  ],
  "tools": [{"type": "function", "function": {"name": "lookup_order", "description": "...", "parameters": {...}}}],
  "customer_name": "Priya",
  "issue": "late delivery",
  "urgency": "frustrated",
  "channel": "web_chat"
}
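Because the projected tool payloads are JSON strings, they are awkward to read by eye. One possible way to spot-check is the Python sketch below, which decodes those strings back into objects. `summarize_record` and `spot_check` are hypothetical helpers, and the sketch assumes each line follows the record shape described above.

```python
import json
from itertools import islice


def summarize_record(line: str) -> dict:
    """Decode one JSONL line and unpack its string-encoded tool payloads."""
    record = json.loads(line)
    calls, results = [], []
    for msg in record["messages"]:
        for call in msg.get("tool_calls") or []:
            fn = call["function"]
            # arguments is a JSON string after projection, so decode it.
            calls.append((fn["name"], json.loads(fn["arguments"])))
        if msg["role"] == "tool":
            results.append(json.loads(msg["content"]))
    return {
        "turns": len(record["messages"]),
        "calls": calls,
        "results": results,
        "final": record["messages"][-1]["content"],
    }


def spot_check(path: str, n: int = 3) -> list:
    """Summarize the first n non-empty lines of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        lines = (line for line in handle if line.strip())
        return [summarize_record(line) for line in islice(lines, n)]
```

Calling `spot_check("output/sdg/customer_support_tool_sft.jsonl")` returns one summary per record, which makes it easier to compare the tool result against the assistant's final reply.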
Adapt to Your Domain
Replace or extend the seed file so rows cover your entities. You may rename the five anchor fields as long as the prompt and YAML refer to the same names.
Update seed_dataset.fields in the YAML to match those names.
Rewrite the prompt for your scenario and tool surface.
Adjust the JSON schema described in the prompt if the message layout changes, for example multiple tool calls per conversation.
Keep output_projection as structured_messages so the step extracts messages and tools from the conversation column and merges the fields listed in metadata_fields (seed and category values) onto each record.
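After renaming fields, a quick consistency check can catch placeholders that no longer match. The Python sketch below is one possible approach and not part of the Nemotron CLI: `check_prompt_fields` is a hypothetical helper that compares `{{ name }}` placeholders in the prompt against the seed fields and category column names. It only recognizes simple placeholders, so expressions with filters or attribute access would need a real template parser.

```python
import re

# Matches simple placeholders such as {{ order_id }} or {{order_id}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def check_prompt_fields(prompt: str, seed_fields: list, category_columns: list) -> dict:
    """Report placeholders with no source and sources the prompt never uses."""
    used = set(PLACEHOLDER.findall(prompt))
    available = set(seed_fields) | set(category_columns)
    return {
        "undefined": sorted(used - available),  # in the prompt, not in the config
        "unused": sorted(available - used),     # in the config, not in the prompt
    }


report = check_prompt_fields(
    prompt="- patient: {{ patient_name }}\n- tone: {{ urgency }}\n- visit: {{ visit_id }}",
    seed_fields=["patient_name", "appointment_id"],  # hypothetical renamed fields
    category_columns=["urgency", "channel"],
)
print(report)
# → {'undefined': ['visit_id'], 'unused': ['appointment_id', 'channel']}
```

An empty `undefined` list is the important result; `unused` entries are only a hint that a seed field or category column no longer influences the prompt.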
Validation Checklist
Before training, sample at least 50 records and verify:
Every tool_calls block has a matching tool message with the same tool_call_id.
function.arguments and tool-message content values are JSON strings in the projected JSONL.
The assistant’s final reply uses the tool result (not a canned answer that ignores it).
No unexpected markdown in content if the trainer assumes plain text.
tools is present and non-empty on every record.
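Most of the checklist above can be automated. The following Python sketch is one way to do it under the record shape described in this guide; `validate_record` and `validate_sample` are hypothetical helpers, the markdown pattern is a coarse heuristic, and whether the final reply actually uses the tool result still needs a human read.

```python
import json
import random
import re
from collections import Counter

# Coarse heuristic for markdown: code fences, bold markers, ATX headings.
MARKDOWN_HINT = re.compile(r"```|\*\*|^#{1,6} ", re.MULTILINE)


def _is_json_string(value) -> bool:
    """True when value is a str that parses as JSON."""
    if not isinstance(value, str):
        return False
    try:
        json.loads(value)
    except json.JSONDecodeError:
        return False
    return True


def validate_record(record: dict) -> list:
    """Return a list of problems found in one projected record."""
    problems = []
    messages = record.get("messages") or []
    if not record.get("tools"):
        problems.append("tools is missing or empty")
    call_ids, tool_ids = [], []
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            call_ids.append(call.get("id"))
            if not _is_json_string(call.get("function", {}).get("arguments")):
                problems.append("function.arguments is not a JSON string")
        if msg.get("role") == "tool":
            tool_ids.append(msg.get("tool_call_id"))
            if not _is_json_string(msg.get("content")):
                problems.append("tool content is not a JSON string")
        elif isinstance(msg.get("content"), str) and MARKDOWN_HINT.search(msg["content"]):
            problems.append("possible markdown in message content")
    if Counter(call_ids) != Counter(tool_ids):
        problems.append("tool_calls ids and tool_call_id values do not pair up")
    last = messages[-1] if messages else {}
    if last.get("role") != "assistant" or not last.get("content") or last.get("tool_calls"):
        problems.append("conversation does not end with a plain assistant reply")
    return problems


def validate_sample(path: str, k: int = 50, seed: int = 0) -> dict:
    """Validate up to k randomly chosen lines; map 1-based line index to problems."""
    with open(path, encoding="utf-8") as handle:
        lines = [line for line in handle if line.strip()]
    picked = random.Random(seed).sample(range(len(lines)), min(k, len(lines)))
    report = {}
    for index in sorted(picked):
        problems = validate_record(json.loads(lines[index]))
        if problems:
            report[index + 1] = problems
    return report
```

An empty dictionary from `validate_sample("output/sdg/customer_support_tool_sft.jsonl")` means the sampled records passed the structural checks. The line index counts non-empty lines only.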
Downstream Use
customer_support_tool_sft.jsonl → data_prep/sft_packing → SFT training
The structured_messages projection writes messages and tools at the top level, matching formats common to AutoModel-style SFT and Megatron-Bridge-style workflows. Run data_prep/sft_packing in dry-run mode before a large training job to confirm the packer accepts your file.
Next Steps
Output projection reference: Output Projections to learn the structured_messages schema.
Config schema: Config Schema for column types and output projections.