Create a Domain Dataset for Airlines Customer Service#
What You’ll Build: A domain-adapted SFT chat dataset modeled on customer-service conversations for a fictional airline.
In this how-to guide, you will:
Create an airline-domain pipeline config.
Create a seed file of airline inquiry scenarios.
Swap the category columns for three airline-relevant dimensions.
Rewrite the LLM prompts for the airline domain.
Update the output projection and output path.
Run a preview to verify, then generate 100 records.
This guide requires between 20 and 30 minutes to complete.
Sample Prompt
Adapt the default SDG pipeline for Greenteme Airlines customer service with three category dimensions, run a 2-record preview, then generate 100 records and show me one output record.
Prerequisites#
✅ Completed Generate Your First Synthetic Dataset: at least one successful preview and full run of default.yaml, so you know the pipeline works end-to-end.
✅ NVIDIA_API_KEY set in your environment.
How This Differs From the Default Pipeline#
The default pipeline mixes a single category dimension, persona, with seed topics.
This example replaces it with three category dimensions, traveler_segment, inquiry_type, and channel, layered on top of seed scenarios so that diversity comes from explicit, controllable values.
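To make "explicit, controllable" concrete, the following minimal sketch counts how many distinct combinations the three category columns can produce. The value lists are copied from the config in the Procedure below. The seed count of 12 matches the example seed file. How the pipeline actually pairs categories with seeds is not shown here; this is only arithmetic.

```python
from itertools import product

# Values copied from the category columns in this guide's config.
traveler_segments = [
    "frequent_flyer", "business_traveler", "family_with_children",
    "first_time_international", "elite_loyalty_member", "leisure_couple",
]
inquiry_types = [
    "rebooking", "baggage_issue", "refund_request",
    "loyalty_status", "fare_rules", "flight_status",
]
channels = ["chat", "phone", "app"]

# Every (segment, inquiry, channel) triple the category columns can emit.
combinations = list(product(traveler_segments, inquiry_types, channels))
num_seed_scenarios = 12  # lines in the example seed file

print(len(combinations))                       # 108
print(len(combinations) * num_seed_scenarios)  # 1296
```

With 100 records drawn from up to 1,296 combinations, exact repeats should be uncommon, and adding or removing a value in any list changes the space in a predictable way.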
Procedure#
Create a src/nemotron/steps/sdg/data_designer/config/greenteme.yaml file (download) like the following example:

    output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
    output_path: ${output_dir}/greenteme_sft.jsonl
    num_records: 100

    seed_dataset:
      path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/greenteme_inquiry_seeds.jsonl
      strategy: shuffle
      fields: [scenario]

    models:
      - alias: nvidia-text
        model: nvidia/nemotron-3-nano-30b-a3b
        provider: nvidia
        skip_health_check: true
        inference_parameters:
          temperature: 0.8
          top_p: 1.0
          max_tokens: 1200

    columns:
      - name: traveler_segment
        type: category
        values:
          - frequent_flyer
          - business_traveler
          - family_with_children
          - first_time_international
          - elite_loyalty_member
          - leisure_couple

      - name: inquiry_type
        type: category
        values:
          - rebooking
          - baggage_issue
          - refund_request
          - loyalty_status
          - fare_rules
          - flight_status

      - name: channel
        type: category
        values: [chat, phone, app]

      - name: user_query
        type: llm_text
        model_alias: nvidia-text
        prompt: |
          You are role-playing a {{ traveler_segment }} contacting Greenteme Airlines
          via {{ channel }} about a {{ inquiry_type }}.
          The scenario is: "{{ scenario }}"
          Write the customer's first message. Keep it natural, 1-3 sentences.
          Do not reference any real airline name, real flight number, or real loyalty program.

      - name: assistant_response
        type: llm_text
        model_alias: nvidia-text
        prompt: |
          You are a customer-service agent at Greenteme Airlines, a fictional airline.
          Reply to this customer message: "{{ user_query }}"
          Provide a concise, professional, compliant response, 2-4 sentences.
          Stay realistic and grounded in standard airline policy.
          Do not invent real airline names, real flight numbers, real PNR codes,
          or real loyalty program details. No markdown.

    output_projection:
      type: openai_messages
      user_field: user_query
      assistant_field: assistant_response
      metadata_fields: [traveler_segment, inquiry_type, channel, scenario]
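The nested output_dir interpolation can be hard to read at a glance. The sketch below spells out the fallback chain in plain Python, assuming ${oc.env:VAR,default} behaves like OmegaConf's oc.env resolver (use the variable if it is set, otherwise the default). The function name resolve_output_path is hypothetical and is not part of the pipeline.

```python
from typing import Mapping


def resolve_output_path(env: Mapping[str, str]) -> str:
    """Plain-Python reading of the output_dir / output_path interpolations.

    Assumption: ${oc.env:VAR,default} means "VAR if set, else default".
    """
    if "SDG_OUTPUT_DIR" in env:
        # Outermost variable wins when it is set.
        output_dir = env["SDG_OUTPUT_DIR"]
    else:
        # Otherwise use <NEMO_RUN_DIR>/sdg, or <PWD>/output/sdg as a last resort.
        run_dir = env["NEMO_RUN_DIR"] if "NEMO_RUN_DIR" in env else f"{env['PWD']}/output"
        output_dir = f"{run_dir}/sdg"
    return f"{output_dir}/greenteme_sft.jsonl"


print(resolve_output_path({"PWD": "/work"}))
# /work/output/sdg/greenteme_sft.jsonl
print(resolve_output_path({"PWD": "/work", "NEMO_RUN_DIR": "/runs/42"}))
# /runs/42/sdg/greenteme_sft.jsonl
print(resolve_output_path({"PWD": "/work", "SDG_OUTPUT_DIR": "/data/sdg"}))
# /data/sdg/greenteme_sft.jsonl
```

Passing os.environ instead of a literal dict would show where a run on your machine is expected to write, under the same assumption.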
The key differences from the default pipeline:
The variation in traveler segment, inquiry type, and channel is provided by category-type columns.
The variation in scenarios is provided by the seed JSONL file from the next step.
The system-style instruction lives at the top of each prompt rather than in a separate field. Each LLM text column takes a single prompt that includes the role for the LLM to assume.
The output_projection field includes the new metadata fields.
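To see how one sampled row turns into the text sent to the model, the sketch below fills the {{ name }} placeholders of the user_query prompt with a regular expression. This is only a stand-in for whatever templating the pipeline uses internally (the placeholder syntax looks Jinja-style, but that is an assumption), and the render helper is hypothetical.

```python
import re

# First two lines of the user_query prompt from the config.
PROMPT = (
    "You are role-playing a {{ traveler_segment }} contacting Greenteme Airlines "
    "via {{ channel }} about a {{ inquiry_type }}.\n"
    'The scenario is: "{{ scenario }}"'
)


def render(template: str, row: dict) -> str:
    """Replace each {{ name }} placeholder with row[name] (illustrative only)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), template)


# One hypothetical row: three category draws plus one seed scenario.
row = {
    "traveler_segment": "business_traveler",
    "channel": "chat",
    "inquiry_type": "rebooking",
    "scenario": "Flight is showing a four-hour delay and the customer wants "
                "to know whether they will make their connection.",
}

rendered = render(PROMPT, row)
print(rendered)
```

Because categories and seed scenarios vary independently in this sketch, a row can pair an inquiry_type with a scenario about a different topic, so the prompt wording has to tolerate that.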
Create a seed file, src/nemotron/steps/sdg/data_designer/data/greenteme_inquiry_seeds.jsonl (download), like the following example:

    {"scenario": "Connecting flight cancelled due to weather; customer needs to arrive at destination by tomorrow morning for a wedding."}
    {"scenario": "Checked baggage missing on arrival; flight landed two hours ago and the bag did not appear at the carousel."}
    {"scenario": "Customer wants a refund on a non-refundable ticket due to a documented medical emergency."}
    {"scenario": "Customer is unsure why their loyalty status was downgraded this year and wants to understand the qualifying criteria."}
    {"scenario": "Customer wants to change a fare class on an existing booking and needs to know the fare difference and any change fees."}
    {"scenario": "Flight is showing a four-hour delay and the customer wants to know whether they will make their connection."}
    {"scenario": "Customer was double-charged for a seat upgrade and wants the duplicate charge reversed."}
    {"scenario": "Customer needs to add a service animal to an upcoming international flight and wants to know what documentation is required."}
    {"scenario": "Bag damaged in transit; customer needs to file a claim and wants the timeline and required documentation."}
    {"scenario": "Customer rebooked through self-service and is now seated apart from a travel companion; they want to be reseated together."}
    {"scenario": "Customer wants to use a travel credit from a previous cancellation but cannot find the credit number in their account."}
    {"scenario": "Customer's payment method was declined when trying to complete a booking and they want to know what to do."}
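A malformed seed line is cheaper to catch before a run than during one. The sketch below checks that every line parses as JSON and carries a non-empty scenario string, which is the only field the config reads from the seed file. The helper name load_seed_scenarios is hypothetical; the pipeline may perform its own validation.

```python
import json


def load_seed_scenarios(lines):
    """Parse JSONL seed lines and return the scenario strings, failing loudly on bad rows."""
    scenarios = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        scenario = record.get("scenario")
        if not isinstance(scenario, str) or not scenario.strip():
            raise ValueError(f"line {number}: missing or empty 'scenario' field")
        scenarios.append(scenario)
    return scenarios


sample = [
    '{"scenario": "Customer was double-charged for a seat upgrade and wants the duplicate charge reversed."}',
    "",
    '{"scenario": "Bag damaged in transit; customer needs to file a claim and wants the timeline and required documentation."}',
]
scenarios = load_seed_scenarios(sample)
print(len(scenarios))  # 2
```

To check the real file, open it and pass the file object in place of sample, since iterating a text file yields its lines.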
Run a preview by specifying preview=true num_records=2 to verify the pipeline before scaling:

    $ nemotron steps run sdg/data_designer -c greenteme preview=true num_records=2
Example Output

    {
      "messages": [
        {
          "role": "user",
          "content": "Hi! I'm traveling internationally for the first time soon and need to add a service animal to my booking. Could you let me know what documentation I need to provide?"
        },
        {
          "role": "assistant",
          "content": "Thank you for contacting Greenteme Airlines. Please provide a valid government-issued health certificate and a signed service animal relief form for your destination country. You can upload these documents through the Manage Booking section of our website at least 48 hours before departure."
        }
      ],
      "traveler_segment": "first_time_international",
      "inquiry_type": "rebooking",
      "channel": "app",
      "scenario": "Customer needs to add a service animal to an upcoming international flight and wants to know what documentation is required."
    }
    {
      "messages": [
        {
          "role": "user",
          "content": "\"Hi, I just rebooked my flight online and realized my companion and I are now seated in different rows. Is there any way you can help us get seats together?\""
        },
        {
          "role": "assistant",
          "content": "Hello, thank you for reaching out to Greenteme Airlines. Please provide your booking reference, and we will check for available adjacent seats to move you and your companion together. Note that depending on your fare class, a seat selection fee may apply."
        }
      ],
      "traveler_segment": "first_time_international",
      "inquiry_type": "refund_request",
      "channel": "phone",
      "scenario": "Customer rebooked through self-service and is now seated apart from a travel companion; they want to be reseated together."
    }
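When reviewing preview records, a few mechanical checks can complement reading them by eye. The sketch below assumes the record shape shown in the example output (a messages list with one user and one assistant turn, plus the four metadata fields) and also flags content wrapped in literal quotes, as in the second example's user message. The check_record helper is hypothetical.

```python
EXPECTED_METADATA = ("traveler_segment", "inquiry_type", "channel", "scenario")


def check_record(record: dict) -> list:
    """Return a list of problems found in one projected record (empty means OK)."""
    problems = []
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        content = str(m.get("content", "")).strip()
        if not content:
            problems.append(f"empty content for role {m.get('role')}")
        elif len(content) > 1 and content.startswith('"') and content.endswith('"'):
            problems.append(f"content wrapped in literal quotes for role {m.get('role')}")
    for field in EXPECTED_METADATA:
        if field not in record:
            problems.append(f"missing metadata field: {field}")
    return problems


good = {
    "messages": [
        {"role": "user", "content": "Hi! I need to add a service animal to my booking."},
        {"role": "assistant", "content": "Thank you for contacting Greenteme Airlines."},
    ],
    "traveler_segment": "first_time_international",
    "inquiry_type": "rebooking",
    "channel": "app",
    "scenario": "Customer needs to add a service animal to an upcoming international flight.",
}
bad = {
    "messages": [
        {"role": "user", "content": '"Hi, I just rebooked my flight online."'},
        {"role": "assistant", "content": "Hello, thank you for reaching out."},
    ],
    "traveler_segment": "first_time_international",
    "inquiry_type": "refund_request",
    "scenario": "Customer rebooked through self-service.",
}

print(check_record(good))  # []
print(check_record(bad))
```

If quoted user messages show up often in a preview, tightening the user_query prompt (for example, asking for the message text only) is one option to try before generating the full set.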
Generate the dataset by raising num_records after the preview output looks correct:

    $ nemotron steps run sdg/data_designer -c greenteme num_records=100
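After the full run, it can be useful to confirm that the category values are reasonably spread across the 100 records. The sketch below tallies one metadata field, assuming the output file is JSONL with one record per line, as the .jsonl extension of output_path suggests. The tally helper is hypothetical, and the sample lines are trimmed stand-ins for real records.

```python
import json
from collections import Counter


def tally(lines, field):
    """Count how often each value of a metadata field appears in JSONL output lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts[json.loads(line)[field]] += 1
    return counts


# Trimmed stand-ins for lines of greenteme_sft.jsonl.
sample = [
    '{"messages": [], "inquiry_type": "rebooking", "channel": "app"}',
    '{"messages": [], "inquiry_type": "refund_request", "channel": "phone"}',
    '{"messages": [], "inquiry_type": "rebooking", "channel": "chat"}',
]
counts = tally(sample, "inquiry_type")
print(counts.most_common())  # [('rebooking', 2), ('refund_request', 1)]
```

Running the same tally for traveler_segment and channel on the real file gives a quick picture of coverage; a value that never appears may point to a typo in the config's values list.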
Going Further#
Locale-aware persona profiles. The current YAML schema supports category, seed, and LLM column types. To replace the static traveler_segment category with Census-grounded persona profiles using Data Designer’s person sampler, you can include locale, age range, and synthetic-personas integration.
Multi-turn conversations. The example shows a single user and assistant exchange.
For multi-turn dialogue, follow the customer_support_tools.yaml pattern: ask one llm_text column to return a JSON object with messages and optional tools, then use the structured_messages output projection to write training-ready JSONL.
Dispatch to a cluster. Generation runs locally against the NVIDIA-hosted endpoint by default. To run on Lepton or Slurm, see Dispatch SDG to a Cluster — env.toml profiles, container images, and the gotchas that bite first-time cluster runs.
Schema and Downstream Use#
The openai_messages projection emits records with a messages array plus the metadata fields you list. These flow directly into:
data_prep/sft_packing for Megatron-Bridge-style training, or
AutoModel SFT, which consumes the chat format directly.
For a full reference of available projection shapes, see Output Projections.
Next Steps#
Generate preference pairs for DPO: Generate Preference Data for DPO, the rl_pref.yaml pattern.
Generate tool-calling SFT data: Generate Tool-Calling Data for SFT, multi-turn messages and tools with structured_messages.
CLI flags and overrides: CLI Reference.
Config schema: Config Schema, a full reference for column types, samplers, and projections.
Pipeline overview: About Synthetic Data Generation.