Create a Domain Dataset for Airline Customer Service

What You’ll Build: A domain-adapted SFT chat dataset modeled on customer-service conversations for a fictional airline.

In this how-to guide, you will:

  1. Create an airline-domain pipeline config.

  2. Create a seed file of airline inquiry scenarios.

  3. Swap the category columns for three airline-relevant dimensions.

  4. Rewrite the LLM prompts for the airline domain.

  5. Update the output projection and output path.

  6. Run a preview to verify, then generate 100 records.

This guide takes 20 to 30 minutes to complete.

Sample Prompt

Adapt the default SDG pipeline for Greenteme Airlines customer service with three category dimensions, run a 2-record preview, then generate 100 records and show me one output record.

Prerequisites

  • ✅ Completed Generate Your First Synthetic Dataset — at least one successful preview and full run of default.yaml so you know the pipeline works end-to-end.

  • NVIDIA_API_KEY set in your environment.

How This Differs From the Default Pipeline

The default pipeline combines a single category dimension, persona, with seed topics. This example replaces persona with three category dimensions (traveler_segment, inquiry_type, and channel) and layers them on top of seed scenarios, so that diversity comes from explicit, controllable values.
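
To make "explicit, controllable values" concrete, the following minimal sketch counts how many distinct contexts the three category dimensions and the seed scenarios can produce. The values are copied from the greenteme.yaml example and the 12-line seed file in the procedure. It assumes each dimension and each seed scenario is drawn independently, which the preview output in the procedure is consistent with but which this guide does not state outright.

```python
from itertools import product

# Category values copied from the greenteme.yaml example in this guide.
traveler_segments = [
    "frequent_flyer", "business_traveler", "family_with_children",
    "first_time_international", "elite_loyalty_member", "leisure_couple",
]
inquiry_types = [
    "rebooking", "baggage_issue", "refund_request",
    "loyalty_status", "fare_rules", "flight_status",
]
channels = ["chat", "phone", "app"]
NUM_SEED_SCENARIOS = 12  # lines in the example seed file

# Every (segment, inquiry, channel) triple the category columns can emit.
category_combinations = list(product(traveler_segments, inquiry_types, channels))

# Assumption: seed scenarios are paired with category values independently.
total_contexts = len(category_combinations) * NUM_SEED_SCENARIOS

print(len(category_combinations))  # 108
print(total_contexts)              # 1296
```

With 100 records drawn from 1,296 possible contexts, most combinations will not appear, so adding or removing a category value changes the coverage of the dataset in a predictable way.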

Procedure

  1. Create the file src/nemotron/steps/sdg/data_designer/config/greenteme.yaml with contents like the following example:

    output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
    output_path: ${output_dir}/greenteme_sft.jsonl
    num_records: 100
    
    seed_dataset:
      path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/greenteme_inquiry_seeds.jsonl
      strategy: shuffle
      fields: [scenario]
    
    models:
      - alias: nvidia-text
        model: nvidia/nemotron-3-nano-30b-a3b
        provider: nvidia
        skip_health_check: true
        inference_parameters:
          temperature: 0.8
          top_p: 1.0
          max_tokens: 1200
    
    columns:
      - name: traveler_segment
        type: category
        values:
          - frequent_flyer
          - business_traveler
          - family_with_children
          - first_time_international
          - elite_loyalty_member
          - leisure_couple
    
      - name: inquiry_type
        type: category
        values:
          - rebooking
          - baggage_issue
          - refund_request
          - loyalty_status
          - fare_rules
          - flight_status
    
      - name: channel
        type: category
        values: [chat, phone, app]
    
      - name: user_query
        type: llm_text
        model_alias: nvidia-text
        prompt: |
          You are role-playing a {{ traveler_segment }} contacting Greenteme Airlines
          via {{ channel }} about a {{ inquiry_type }}. The scenario is:
          "{{ scenario }}"
    
          Write the customer's first message. Keep it natural, 1-3 sentences.
          Do not reference any real airline name, real flight number, or real
          loyalty program.
    
      - name: assistant_response
        type: llm_text
        model_alias: nvidia-text
        prompt: |
          You are a customer-service agent at Greenteme Airlines, a fictional airline.
          Reply to this customer message:
    
          "{{ user_query }}"
    
          Provide a concise, professional, compliant response, 2-4 sentences. Stay
          realistic and grounded in standard airline policy. Do not invent real
          airline names, real flight numbers, real PNR codes, or real loyalty
          program details. No markdown.
    
    output_projection:
      type: openai_messages
      user_field: user_query
      assistant_field: assistant_response
      metadata_fields: [traveler_segment, inquiry_type, channel, scenario]
    

    The key differences from the default pipeline:

    • Variation in traveler segment, inquiry type, and channel is provided by category-type columns.

    • Variation in scenarios is provided by the seed JSONL file that you create in the next step.

    • The system-style instruction lives at the top of each prompt rather than as a separate field. The LLM text columns take a single prompt that includes the role for the LLM to assume.

    • The output_projection field includes the new metadata fields.

  2. Create the seed file src/nemotron/steps/sdg/data_designer/data/greenteme_inquiry_seeds.jsonl with contents like the following example:

    {"scenario": "Connecting flight cancelled due to weather; customer needs to arrive at destination by tomorrow morning for a wedding."}
    {"scenario": "Checked baggage missing on arrival; flight landed two hours ago and the bag did not appear at the carousel."}
    {"scenario": "Customer wants a refund on a non-refundable ticket due to a documented medical emergency."}
    {"scenario": "Customer is unsure why their loyalty status was downgraded this year and wants to understand the qualifying criteria."}
    {"scenario": "Customer wants to change a fare class on an existing booking and needs to know the fare difference and any change fees."}
    {"scenario": "Flight is showing a four-hour delay and the customer wants to know whether they will make their connection."}
    {"scenario": "Customer was double-charged for a seat upgrade and wants the duplicate charge reversed."}
    {"scenario": "Customer needs to add a service animal to an upcoming international flight and wants to know what documentation is required."}
    {"scenario": "Bag damaged in transit; customer needs to file a claim and wants the timeline and required documentation."}
    {"scenario": "Customer rebooked through self-service and is now seated apart from a travel companion; they want to be reseated together."}
    {"scenario": "Customer wants to use a travel credit from a previous cancellation but cannot find the credit number in their account."}
    {"scenario": "Customer's payment method was declined when trying to complete a booking and they want to know what to do."}
    
  3. Run a preview with preview=true and num_records=2 to verify the pipeline before scaling up:

    $ nemotron steps run sdg/data_designer -c greenteme preview=true num_records=2
    
    Example Output
    {
      "messages": [
        {
          "role": "user",
          "content": "Hi! I'm traveling internationally for the first time soon and need to add a service animal to my booking. Could you let me know what documentation I need to provide?"
        },
        {
          "role": "assistant",
          "content": "Thank you for contacting Greenteme Airlines. Please provide a valid government-issued health certificate and a signed service animal relief form for your destination country. You can upload these documents through the Manage Booking section of our website at least 48 hours before departure."
        }
      ],
      "traveler_segment": "first_time_international",
      "inquiry_type": "rebooking",
      "channel": "app",
      "scenario": "Customer needs to add a service animal to an upcoming international flight and wants to know what documentation is required."
    }
    {
      "messages": [
        {
          "role": "user",
          "content": "\"Hi, I just rebooked my flight online and realized my companion and I are now seated in different rows. Is there any way you can help us get seats together?\""
        },
        {
          "role": "assistant",
          "content": "Hello, thank you for reaching out to Greenteme Airlines. Please provide your booking reference, and we will check for available adjacent seats to move you and your companion together. Note that depending on your fare class, a seat selection fee may apply."
        }
      ],
      "traveler_segment": "first_time_international",
      "inquiry_type": "refund_request",
      "channel": "phone",
      "scenario": "Customer rebooked through self-service and is now seated apart from a travel companion; they want to be reseated together."
    }
    
  4. After the preview output looks correct, generate the full dataset by raising num_records:

    $ nemotron steps run sdg/data_designer -c greenteme num_records=100
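
The config from step 1 and the seed file from step 2 can be sanity-checked offline before any model call is made. The sketch below is not part of the nemotron CLI; both helper functions are hypothetical. It checks that every `{{ name }}` placeholder in a prompt refers to a category column, a seed field, or an earlier LLM column, and that every seed line is a JSON object with a non-empty scenario string. It assumes that a prompt may only reference names defined before it; the actual tool may resolve dependencies differently.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def undefined_placeholders(prompts, base_names):
    """Map each prompt column to placeholders that nothing earlier defines."""
    available = set(base_names)
    problems = {}
    for column, prompt in prompts.items():
        missing = [name for name in PLACEHOLDER.findall(prompt) if name not in available]
        if missing:
            problems[column] = missing
        available.add(column)  # later prompts may reference this column
    return problems


def invalid_seed_lines(lines, required_field="scenario"):
    """Return (line_number, problem) pairs; an empty list means every line passes."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        value = record.get(required_field) if isinstance(record, dict) else None
        if not isinstance(value, str) or not value.strip():
            problems.append((number, f"missing or empty '{required_field}'"))
    return problems


# Category columns and seed field from greenteme.yaml; prompts are abbreviated.
base_names = ["traveler_segment", "inquiry_type", "channel", "scenario"]
prompts = {
    "user_query": "a {{ traveler_segment }} via {{ channel }} about a "
                  "{{ inquiry_type }}: {{ scenario }}",
    "assistant_response": "Reply to this customer message: {{ user_query }}",
}
print(undefined_placeholders(prompts, base_names))  # {}

# The second and third lines are deliberately broken to show a failure.
sample_seed_lines = [
    '{"scenario": "Customer was double-charged for a seat upgrade and wants the duplicate charge reversed."}',
    '{"topic": "wrong field name"}',
    '{"scenario": "unterminated',
]
print(invalid_seed_lines(sample_seed_lines))
# [(2, "missing or empty 'scenario'"), (3, 'invalid JSON')]
```

To check the real seed file, pass an open file handle instead of the sample list, for example `invalid_seed_lines(open(path, encoding="utf-8"))`, where `path` points at greenteme_inquiry_seeds.jsonl.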
    

Going Further

Locale-aware persona profiles. The current YAML schema supports category, seed, and LLM column types. To replace the static traveler_segment category with Census-grounded persona profiles using Data Designer’s person sampler, you can include locale, age range, and synthetic-personas integration.

Multi-turn conversations. The example shows a single user and assistant exchange. For multi-turn dialogue, follow the customer_support_tools.yaml pattern: have one llm_text column return a JSON object with messages and optional tools, then use the structured_messages output projection to write training-ready JSONL.

Dispatch to a cluster. Generation runs locally against the NVIDIA-hosted endpoint by default. To run on Lepton or Slurm, see Dispatch SDG to a Cluster, which covers env.toml profiles, container images, and the gotchas that bite first-time cluster runs.

Schema and Downstream Use

The openai_messages projection emits records with a messages array plus the metadata fields you list. These flow directly into:

  • data_prep/sft_packing for Megatron-Bridge-style training, or

  • AutoModel SFT, which consumes the chat format directly.

For a full reference of available projection shapes, see Output Projections.
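
Before handing the JSONL to a training step, it can be worth confirming that each record has the shape described above. The following sketch uses a hypothetical helper, `record_problems`, that checks for one user and one assistant message plus the four metadata fields listed in greenteme.yaml. It also flags message content wrapped in literal double quotes, which happens in the second preview record in the procedure. The sample records are abbreviated versions of that preview output.

```python
from collections import Counter

# Metadata fields listed under output_projection in greenteme.yaml.
METADATA_FIELDS = ["traveler_segment", "inquiry_type", "channel", "scenario"]


def record_problems(record):
    """Return a list of human-readable problems found in one projected record."""
    problems = []
    messages = record.get("messages", [])
    roles = [message.get("role") for message in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"unexpected roles: {roles}")
    for field in METADATA_FIELDS:
        if field not in record:
            problems.append(f"missing metadata field: {field}")
    for message in messages:
        content = message.get("content", "")
        # The model sometimes echoes the quotes that surround a prompt variable.
        if len(content) >= 2 and content.startswith('"') and content.endswith('"'):
            problems.append(f"{message.get('role')} content is wrapped in quotes")
    return problems


# Abbreviated versions of the two preview records from the procedure.
records = [
    {
        "messages": [
            {"role": "user", "content": "Hi! I need to add a service animal to my booking."},
            {"role": "assistant", "content": "Thank you for contacting Greenteme Airlines."},
        ],
        "traveler_segment": "first_time_international",
        "inquiry_type": "rebooking",
        "channel": "app",
        "scenario": "Customer needs to add a service animal to an upcoming international flight.",
    },
    {
        "messages": [
            {"role": "user", "content": "\"Hi, I just rebooked my flight online.\""},
            {"role": "assistant", "content": "Hello, thank you for reaching out."},
        ],
        "traveler_segment": "first_time_international",
        "inquiry_type": "refund_request",
        "channel": "phone",
        "scenario": "Customer rebooked through self-service and is now seated apart.",
    },
]

for index, record in enumerate(records):
    print(index, record_problems(record))

# Tally one metadata field to see how evenly the category values were drawn.
channel_counts = Counter(record["channel"] for record in records)
print(channel_counts)
```

For a full run, load the records with `json.loads` on each line of the file at output_path and apply the same checks. Tallying traveler_segment and inquiry_type in the same way shows whether 100 records cover the category values evenly enough for your purposes.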

Next Steps