Generate Tool-Calling Data for SFT

Use this guide when you need multi-turn chat JSONL in which the assistant issues OpenAI-style tool_calls and tool-role messages return structured results, suitable for supervised fine-tuning (SFT) with a tools definition array.

You will use the sample config customer_support_tools.yaml, which produces ecommerce-style support threads. Each output row includes a messages array (with tool turns) and a tools array, ready for packing and training.

Outcomes

  • Understand how the shipped config asks one llm_text column to emit a full JSON multi-turn trace in a single model call.

  • Preview, generate, and validate records before training.

  • Know how to retarget seeds, prompts, and schema for your own domain.

How It Works

Compared with single-turn configs such as default.yaml, this setup drives the whole conversation from one llm_text column. The prompt tells the model to return a JSON object with tools and messages keys. The structured_messages output projection parses that JSON object, extracts messages and tools, adds metadata, and serializes nested tool payload objects into OpenAI-compatible string fields.
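The projection step described above can be pictured with a minimal Python sketch. This is not the structured_messages implementation; the function name project_record and its parameters are hypothetical, and the sketch only assumes the behavior stated in this guide: parse the JSON text, pull out messages and tools, stringify nested tool payloads, and copy metadata fields.

```python
import json

def project_record(row, source_field="conversation",
                   metadata_fields=("customer_name", "issue", "order_id",
                                    "product", "urgency", "channel")):
    """Illustrative stand-in for the projection; not the shipped code."""
    # The llm_text column holds the model's reply as raw JSON text.
    parsed = json.loads(row[source_field])
    messages = []
    for msg in parsed["messages"]:
        msg = dict(msg)  # shallow copy so the parsed input stays untouched
        # Tool results arrive as objects and are serialized to strings.
        if msg.get("role") == "tool" and not isinstance(msg.get("content"), str):
            msg["content"] = json.dumps(msg["content"])
        if msg.get("tool_calls"):
            calls = []
            for call in msg["tool_calls"]:
                fn = dict(call["function"])
                # function.arguments also becomes a JSON string.
                if not isinstance(fn.get("arguments"), str):
                    fn["arguments"] = json.dumps(fn["arguments"])
                calls.append({**call, "function": fn})
            msg["tool_calls"] = calls
        messages.append(msg)
    record = {"messages": messages, "tools": parsed["tools"]}
    # Metadata is copied from the row's other columns onto the record.
    record.update({k: row[k] for k in metadata_fields if k in row})
    return record
```

The sketch shows why the prompt asks for JSON objects rather than escaped strings: the escaping happens once, in the projection, instead of being left to the model.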

# Multi-turn customer-support SFT data with OpenAI-style tool calls.
#
# Output records are training-ready JSONL:
#   {"messages": [...], "tools": [...], ...metadata}
#
# Each generated conversation includes one assistant tool call, one matching
# tool response, and a final assistant answer grounded in that tool result.

output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/customer_support_tool_sft.jsonl
num_records: 100

seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/customer_support_tool_seeds.jsonl
  strategy: shuffle
  fields: [customer_name, issue, order_id, product, policy_hint]

# Optional custom endpoint example:
#
# To route this pipeline through an OpenAI-compatible endpoint instead of the
# built-in NVIDIA provider, uncomment and edit both blocks below.
# Keep providers[].api_key as the environment variable name. Data Designer
# resolves it at request time; using `${oc.env:OPENAI_API_KEY}` here would put
# the secret into the resolved config.
#
# providers:
#   - name: my-provider
#     endpoint: ${oc.env:OPENAI_BASE_URL}
#     provider_type: openai
#     api_key: OPENAI_API_KEY
#
# models:
#   - alias: nvidia-text
#     model: google/gemma-4-31B-it
#     provider: my-provider
#     skip_health_check: true
#     inference_parameters:
#       temperature: 0.75
#       top_p: 0.95
#       max_tokens: 1800

models:
  - alias: nvidia-text
    model: openai/gpt-oss-20b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.75
      top_p: 0.95
      max_tokens: 1800

columns:
  - name: urgency
    type: category
    values: [calm, frustrated, rushed, confused]

  - name: channel
    type: category
    values: [web_chat, mobile_app, email_followup]

  - name: conversation
    # Keep this as text instead of llm_structured: Designer writes intermediate
    # batches to parquet before this step can project records, and complex
    # nested objects can produce mixed object schemas across rows.
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      Generate one realistic multi-turn ecommerce customer-support chat for SFT.

      Seed facts:
      - customer_name: {{ customer_name }}
      - customer_issue: {{ issue }}
      - order_id: {{ order_id }}
      - product: {{ product }}
      - policy_hint: {{ policy_hint }}
      - customer_tone: {{ urgency }}
      - channel: {{ channel }}

      Requirements:
      - Return ONLY a JSON object. The first character MUST be `{`, the last MUST be `}`.
      - No prose, no preamble, no apology, no commentary, no markdown fences (no ```).
      - Top-level keys MUST be exactly "tools" and "messages".
      - Produce 6 to 10 messages in OpenAI chat format.
      - Include a brief system message that defines the support-agent behavior.
      - The user should speak naturally and provide imperfect information at first.
      - The assistant should ask at least one clarifying question before using a tool.
      - Include exactly one assistant message with tool_calls.
      - Include exactly one tool message that has the matching tool_call_id.
      - The assistant's final answer must use the tool result and the policy_hint.
      - Do not include markdown in message content.
      - In the generated JSON object, function arguments must be JSON objects, not escaped JSON strings.
      - In the generated JSON object, tool message content must be a JSON object, not an escaped JSON string.

      Available tool functions:
      - lookup_order(order_id: string)
      - check_refund_eligibility(order_id: string, reason: string)
      - update_shipping_address(order_id: string, new_address: string)
      - verify_warranty(order_id: string, product: string)
      - get_subscription_status(order_id: string)

      Example assistant tool-call message shape:
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_lookup_001",
            "type": "function",
            "function": {
              "name": "lookup_order",
              "arguments": {"order_id": "ORD-10492"}
            }
          }
        ]
      }

      Example tool message shape:
      {
        "role": "tool",
        "tool_call_id": "call_lookup_001",
        "name": "lookup_order",
        "content": {"status": "delayed", "eta": "tomorrow"}
      }
      Schema:
      {
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "lookup_order",
              "description": "Look up an order by ID.",
              "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"]
              }
            }
          }
        ],
        "messages": [
          {"role": "system", "content": "You are a helpful support agent."},
          {"role": "user", "content": "User message"},
          {
            "role": "assistant",
            "content": "",
            "tool_calls": [
              {
                "id": "call_lookup_001",
                "type": "function",
                "function": {
                  "name": "lookup_order",
                  "arguments": {"order_id": "ORD-10492"}
                }
              }
            ]
          },
          {
            "role": "tool",
            "tool_call_id": "call_lookup_001",
            "name": "lookup_order",
            "content": {"status": "delayed", "eta": "tomorrow"}
          },
          {"role": "assistant", "content": "Final grounded answer"}
        ]
      }

output_projection:
  type: structured_messages
  source_field: conversation
  messages_field: messages
  tools_field: tools
  metadata_fields: [customer_name, issue, order_id, product, urgency, channel]

Each seed row supplies five anchor fields the prompt interpolates: customer_name, issue, order_id, product, and policy_hint. Two extra category columns (urgency, channel) add variety without multiplying seed rows for every combination.
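To make the seed-plus-category idea concrete, the sketch below builds the variable set for one prompt. It assumes the category values are drawn independently and uniformly, which this guide does not confirm; the seed row values and the helper build_prompt_context are hypothetical.

```python
import random

# Category values copied from the config above.
URGENCY = ["calm", "frustrated", "rushed", "confused"]
CHANNEL = ["web_chat", "mobile_app", "email_followup"]

def build_prompt_context(seed_row, rng):
    """Merge one seed row with one sampled value per category column."""
    context = dict(seed_row)
    context["urgency"] = rng.choice(URGENCY)  # assumption: uniform sampling
    context["channel"] = rng.choice(CHANNEL)
    return context

# Hypothetical seed row; the real rows live in customer_support_tool_seeds.jsonl.
seed_row = {
    "customer_name": "Priya",
    "issue": "late delivery",
    "order_id": "ORD-10492",
    "product": "wireless headphones",
    "policy_hint": "Offer an expedited replacement for delayed orders.",
}

context = build_prompt_context(seed_row, random.Random(0))
# Each seed row can pair with 4 * 3 = 12 tone/channel combinations.
combinations = len(URGENCY) * len(CHANNEL)
```

The seed file only needs one row per scenario; tone and channel variety comes from the category columns.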

Prerequisites

  • Nemotron CLI available and working; if this is your first SDG run, complete Generate Your First Synthetic Dataset first.

  • NVIDIA_API_KEY set in the environment.

  • The bundled seed file data/customer_support_tool_seeds.jsonl (shipped with the step). Add rows, or point the config at your own JSONL.

Procedure

  1. Preview two records to confirm that the projected output matches the schema:

    $ nemotron steps run sdg/data_designer -c customer_support_tools preview=true num_records=2
    

    In the preview, confirm:

    • Exactly one assistant message with tool_calls.

    • Exactly one tool message whose tool_call_id matches the call.

    • function.arguments and tool-message content are JSON strings after projection.

    • The assistant’s closing turn references the tool result (not a generic reply).

    • No markdown in message content if your trainer expects plain text.

  2. Generate the dataset:

    $ nemotron steps run sdg/data_designer -c customer_support_tools num_records=200
    

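    Output path: ./output/sdg/customer_support_tool_sft.jsonl by default; setting SDG_OUTPUT_DIR or NEMO_RUN_DIR changes the location, per the output_dir expression in the config. Spot-check a few lines. Each record exposes top-level messages and tools plus metadata, like the following example: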

    {
      "messages": [
        {"role": "system", "content": "You are a helpful ecommerce support agent..."},
        {"role": "user", "content": "Hi, I haven't received my headphones yet..."},
        {"role": "assistant", "content": "I'd be happy to help. Could you share your order number?"},
        {"role": "user", "content": "It's ORD-10492."},
        {"role": "assistant", "content": "", "tool_calls": [{"id": "call_001", "type": "function", "function": {"name": "lookup_order", "arguments": "{\"order_id\":\"ORD-10492\"}"}}]},
        {"role": "tool", "tool_call_id": "call_001", "name": "lookup_order", "content": "{\"status\":\"delayed\",\"eta\":\"tomorrow\"}"},
        {"role": "assistant", "content": "Your order is delayed and should arrive tomorrow. Per our policy, I can arrange an expedited replacement if you prefer."}
      ],
      "tools": [{"type": "function", "function": {"name": "lookup_order", "description": "...", "parameters": {...}}}],
      "customer_name": "Priya", "issue": "late delivery", "urgency": "frustrated", "channel": "web_chat"
    }
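For spot-checking, a short script can load the first few lines and decode the stringified payloads back into objects, which is easier to read than escaped JSON. This is a sketch under the record layout shown above; read_records and decode_tool_payloads are hypothetical helper names, not part of the Nemotron CLI.

```python
import json
from itertools import islice

def read_records(path, limit=3):
    """Load up to `limit` lines from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, limit) if line.strip()]

def decode_tool_payloads(record):
    """Return decoded tool calls and tool results for one record."""
    calls, results = [], []
    for msg in record["messages"]:
        for call in msg.get("tool_calls") or []:
            fn = call["function"]
            # After projection, arguments is a JSON string; decode it to inspect.
            calls.append((call["id"], fn["name"], json.loads(fn["arguments"])))
        if msg["role"] == "tool":
            results.append((msg["tool_call_id"], json.loads(msg["content"])))
    return calls, results
```

A json.loads failure here is itself a useful signal: it means a payload was not serialized as a JSON string.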
    

Adapt to Your Domain

  1. Replace or extend the seed file so rows cover your entities. You may rename the five anchor fields as long as the prompt and YAML refer to the same names.

  2. Update seed_dataset.fields in the YAML to match those names, and update output_projection.metadata_fields if it lists any field you renamed.

  3. Rewrite the prompt for your scenario and tool surface.

  4. Adjust the JSON schema described in the prompt if the message layout changes, for example multiple tool calls per conversation.

Keep output_projection as structured_messages so the step extracts messages and tools from the conversation column and merges the fields listed in metadata_fields onto each record.
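When renaming fields, it is easy to leave a stale placeholder in the prompt. The sketch below compares the placeholders in a prompt against the names the config provides. It handles only simple `{{ name }}` placeholders, and check_prompt_variables is a hypothetical helper, not a Data Designer API.

```python
import re

# Matches simple placeholders such as {{ order_id }}; filters and
# expressions inside the braces are deliberately not handled.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def check_prompt_variables(prompt, seed_fields, column_names):
    """Report placeholders with no source and sources the prompt never uses."""
    used = set(PLACEHOLDER.findall(prompt))
    available = set(seed_fields) | set(column_names)
    return {
        "undefined": sorted(used - available),
        "unused": sorted(available - used),
    }
```

An empty "undefined" list means every placeholder maps to a seed field or a category column; entries under "unused" are not errors, but they may indicate a field the prompt was meant to mention.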

Validation Checklist

Before training, sample at least 50 records and verify:

  • Every tool_calls block has a matching tool message with the same tool_call_id.

  • function.arguments and tool-message content values are JSON strings in the projected JSONL.

  • The assistant’s final reply uses the tool result (not a canned answer that ignores it).

  • No unexpected markdown in content if the trainer assumes plain text.

  • tools is present and non-empty on every record.
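Most of the checklist can be automated. The following sketch covers the mechanical checks; whether the final reply actually uses the tool result still needs a human read, so the code only confirms that the last message is a plain assistant turn. The markdown test is a rough heuristic, and validate_record is a hypothetical name.

```python
import json
import re

# Rough markdown signals: code fences, bold markers, heading lines.
MARKDOWN_HINT = re.compile(r"```|\*\*|^\s*#{1,6}\s", re.MULTILINE)

def _is_json_string(value):
    """True when value is a str that parses as JSON."""
    if not isinstance(value, str):
        return False
    try:
        json.loads(value)
    except ValueError:
        return False
    return True

def validate_record(record):
    """Return a list of problems found in one projected record."""
    problems = []
    if not record.get("tools"):
        problems.append("tools is missing or empty")
    messages = record.get("messages") or []
    call_ids, result_ids = set(), set()
    for i, msg in enumerate(messages):
        for call in msg.get("tool_calls") or []:
            call_ids.add(call.get("id"))
            if not _is_json_string(call.get("function", {}).get("arguments")):
                problems.append(f"message {i}: function.arguments is not a JSON string")
        if msg.get("role") == "tool":
            result_ids.add(msg.get("tool_call_id"))
            if not _is_json_string(msg.get("content")):
                problems.append(f"message {i}: tool content is not a JSON string")
        elif isinstance(msg.get("content"), str) and MARKDOWN_HINT.search(msg["content"]):
            problems.append(f"message {i}: content looks like markdown")
    if call_ids != result_ids:
        problems.append("tool_calls ids and tool_call_id values do not match")
    last = messages[-1] if messages else {}
    if last.get("role") != "assistant" or last.get("tool_calls"):
        problems.append("last message is not a plain assistant reply")
    return problems
```

Running validate_record over a sample and counting non-empty results gives a quick failure rate to track between prompt revisions.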

Downstream Use

customer_support_tool_sft.jsonl  →  data_prep/sft_packing  →  SFT training

The structured_messages projection writes messages and tools at the top level, matching formats common to AutoModel-style SFT and Megatron-Bridge-style workflows. Run data_prep/sft_packing in dry-run mode before a large training job to confirm the packer accepts your file.

Next Steps

  • Output projection reference: see Output Projections for the structured_messages schema.

  • Config schema: see Config Schema for column types and output projections.