Create a Domain Dataset for Airlines Customer Service#

What You’ll Build: A domain-adapted SFT chat dataset modeled on fictional airlines customer-service conversations.

In this how-to guide, you will:

Create an airline-domain pipeline config.
Create a seed file of airline inquiry scenarios.
Swap the category columns for three airline-relevant dimensions.
Rewrite the LLM prompts for the airline domain.
Update the output projection and output path.
Run a preview to verify, then generate 100 records.

This guide requires between 20 and 30 minutes to complete.

Sample Prompt

Adapt the default SDG pipeline for Greenteme Airlines customer service with three category dimensions, run a 2-record preview, then generate 100 records and show me one output record.

Prerequisites#

✅ Completed Generate Your First Synthetic Dataset — at least one successful preview and full run of default.yaml so you know the pipeline works end-to-end.
✅ NVIDIA_API_KEY set in your environment.

How This Differs From the Default Pipeline#

The default pipeline mixes a single category dimension, persona, with seed topics. This example adds category dimensions, traveler_segment, inquiry_type, and channel, on top of seed scenarios so that diversity comes from explicit, controllable values.

Procedure#

Create a src/nemotron/steps/sdg/data_designer/config/greenteme.yaml (download) file like the following example:

output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/greenteme_sft.jsonl
num_records: 100

seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/greenteme_inquiry_seeds.jsonl
  strategy: shuffle
  fields: [scenario]

models:
  - alias: nvidia-text
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    skip_health_check: true
    inference_parameters:
      temperature: 0.8
      top_p: 1.0
      max_tokens: 1200

columns:
  - name: traveler_segment
    type: category
    values:
      - frequent_flyer
      - business_traveler
      - family_with_children
      - first_time_international
      - elite_loyalty_member
      - leisure_couple

  - name: inquiry_type
    type: category
    values:
      - rebooking
      - baggage_issue
      - refund_request
      - loyalty_status
      - fare_rules
      - flight_status

  - name: channel
    type: category
    values: [chat, phone, app]

  - name: user_query
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      You are role-playing a {{ traveler_segment }} contacting Greenteme Airlines
      via {{ channel }} about a {{ inquiry_type }}. The scenario is:
      "{{ scenario }}"

      Write the customer's first message. Keep it natural, 1-3 sentences.
      Do not reference any real airline name, real flight number, or real
      loyalty program.

  - name: assistant_response
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      You are a customer-service agent at Greenteme Airlines, a fictional airline.
      Reply to this customer message:

      "{{ user_query }}"

      Provide a concise, professional, compliant response, 2-4 sentences. Stay
      realistic and grounded in standard airline policy. Do not invent real
      airline names, real flight numbers, real PNR codes, or real loyalty
      program details. No markdown.

output_projection:
  type: openai_messages
  user_field: user_query
  assistant_field: assistant_response
  metadata_fields: [traveler_segment, inquiry_type, channel, scenario]

The key differences from the default pipeline:

The variation for traveler segment, inquiry type, and channel are all provided by category-type columns.
The variation for the scenarios is provided by the seed JSONL file from the next step.
The system-style instruction lives at the top of each prompt rather than as a separate field. The LLM text columns take a single prompt that includes the role for the LLM to assume.
The output_projection field includes the new metadata fields.

Create a seed file, src/nemotron/steps/sdg/data_designer/data/greenteme_inquiry_seeds.jsonl, (download) like the following example:

{"scenario": "Connecting flight cancelled due to weather; customer needs to arrive at destination by tomorrow morning for a wedding."}
{"scenario": "Checked baggage missing on arrival; flight landed two hours ago and the bag did not appear at the carousel."}
{"scenario": "Customer wants a refund on a non-refundable ticket due to a documented medical emergency."}
{"scenario": "Customer is unsure why their loyalty status was downgraded this year and wants to understand the qualifying criteria."}
{"scenario": "Customer wants to change a fare class on an existing booking and needs to know the fare difference and any change fees."}
{"scenario": "Flight is showing a four-hour delay and the customer wants to know whether they will make their connection."}
{"scenario": "Customer was double-charged for a seat upgrade and wants the duplicate charge reversed."}
{"scenario": "Customer needs to add a service animal to an upcoming international flight and wants to know what documentation is required."}
{"scenario": "Bag damaged in transit; customer needs to file a claim and wants the timeline and required documentation."}
{"scenario": "Customer rebooked through self-service and is now seated apart from a travel companion; they want to be reseated together."}
{"scenario": "Customer wants to use a travel credit from a previous cancellation but cannot find the credit number in their account."}
{"scenario": "Customer's payment method was declined when trying to complete a booking and they want to know what to do."}

Run a preview by specifying preview=true num_records=2 to verify the pipeline before scaling:

$ nemotron steps run sdg/data_designer -c greenteme preview=true num_records=2

Generate the dataset by raising num_records after the preview output looks correct:
```
$ nemotron steps run sdg/data_designer -c greenteme num_records=100
```

Going Further#

Locale-aware persona profiles. The current YAML schema supports category, seed, and LLM column types. To replace the static traveler_segment category with Census-grounded persona profiles using Data Designer’s person sampler, you can include locale, age range, and synthetic-personas integration.

Multi-turn conversations. The example shows a single user and assistant exchange. For multi-turn dialogue, follow the customer_support_tools.yaml pattern: ask one llm_text column to return a JSON object with messages and optional tools, then use the structured_messages output projection to write training-ready JSONL.

Dispatch to a cluster. Generation runs locally against the NVIDIA-hosted endpoint by default. To run on Lepton or Slurm, see Dispatch SDG to a Cluster — env.toml profiles, container images, and the gotchas that bite first-time cluster runs.

Schema and Downstream Use#

The openai_messages projection emits records with a messages array plus the metadata fields you list. These flow directly into:

data_prep/sft_packing for Megatron-Bridge-style training, or
AutoModel SFT, which consumes the chat format directly.

For a full reference of available projection shapes, see Output Projections.

Next Steps#

Generate preference pairs for DPO: Generate Preference Data for DPO — the rl_pref.yaml pattern.
Generate tool-calling SFT data: Generate Tool-Calling Data for SFT — multi-turn messages and tools with structured_messages.
CLI flags and overrides: CLI Reference.
Config schema: Config Schema — full reference for column types, samplers, and projections.
Pipeline overview: About Synthetic Data Generation.