# Generate Multilingual Q&A Data

This guide shows how to generate synthetic Q&A pairs across multiple languages using NeMo Curator’s `QAMultilingualSyntheticStage`. You’ll learn to configure an LLM client, create a generation pipeline, and optionally filter the output.
## What You’ll Build

A pipeline that:

- Generates Q&A pairs in multiple languages using an LLM
- Optionally filters results by language
- Writes output to JSONL format
## Prerequisites

- **NVIDIA API Key**: obtain one from NVIDIA Build
- **NeMo Curator**: installed with the text extras
## Quick Start
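The whole flow fits in a few commands. The extra name `text` and the `NVIDIA_API_KEY` variable name are assumptions; match them to your environment:

```shell
# Install NeMo Curator with the text extras (extra name assumed)
pip install "nemo-curator[text]"

# Export your NVIDIA Build API key (variable name assumed by the examples below)
export NVIDIA_API_KEY="nvapi-..."
```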
## Step-by-Step Guide
### Step 1: Configure the LLM Client

The `AsyncOpenAIClient` enables concurrent API requests for efficient batch generation:
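A sketch of client setup, assuming `AsyncOpenAIClient` wraps an `openai.AsyncOpenAI` instance pointed at NVIDIA Build's OpenAI-compatible endpoint; the wrapper's import path and constructor may differ in your NeMo Curator version:

```python
import os

from openai import AsyncOpenAI
# Import path assumed; check your NeMo Curator version's docs.
from nemo_curator.services import AsyncOpenAIClient

# Async client pointed at NVIDIA Build's OpenAI-compatible endpoint
openai_client = AsyncOpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

# NeMo Curator wrapper that issues concurrent requests during generation
client = AsyncOpenAIClient(openai_client)
```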
### Step 2: Define the Prompt Template

The prompt template must include a `{language}` placeholder; the stage randomly selects a language for each sample:
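For example, a template of this shape (the wording is illustrative; only the `{language}` placeholder is required):

```python
# The {language} placeholder is filled in by the stage for each sample.
PROMPT_TEMPLATE = (
    "Write a question and a detailed answer in {language} about a general "
    "knowledge topic. Prefix the response with the language name in brackets, "
    "e.g. [{language}]."
)

# The stage substitutes a randomly selected language per sample;
# the substitution itself is plain str.format:
prompt = PROMPT_TEMPLATE.format(language="German")
```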
### Step 3: Create the Pipeline
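A sketch of pipeline construction. The import paths and the stage's constructor parameters shown here (`client`, `model`, `prompt_template`, `languages`, `num_samples`) are assumptions; consult the NeMo Curator API reference for the exact signature:

```python
# Import paths and constructor parameters are assumptions; verify against
# your NeMo Curator version before running.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.synthetic import QAMultilingualSyntheticStage

pipeline = Pipeline(name="multilingual_qa")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        client=client,                       # AsyncOpenAIClient from Step 1
        model="meta/llama-3.1-8b-instruct",  # any NVIDIA Build chat model
        prompt_template=PROMPT_TEMPLATE,     # template from Step 2
        languages=["English", "German", "Spanish", "Japanese"],
        num_samples=100,
    )
)
```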
### Step 4: Add Language Filtering (Optional)

If your prompt includes language prefixes, you can filter to keep only specific languages:
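Assuming the prompt asks the model to prefix each response with its language in brackets (e.g. `[Spanish]`), a simple post-hoc filter might look like this; the prefix convention is an assumption of this sketch, not a NeMo Curator API:

```python
def keep_languages(docs, allowed):
    """Keep documents whose text starts with an allowed [Language] prefix."""
    prefixes = tuple(f"[{lang}]" for lang in allowed)
    return [doc for doc in docs if doc["text"].startswith(prefixes)]

docs = [
    {"text": "[Spanish] ¿Qué es la fotosíntesis? ..."},
    {"text": "[German] Was ist Photosynthese? ..."},
]
spanish_only = keep_languages(docs, ["Spanish"])
```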
### Step 5: Configure Output

Write results to JSONL or Parquet format:
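NeMo Curator ships its own writers; if you ever need to write JSONL by hand, the format is simply one JSON object per line (a plain-Python sketch, not the library's writer):

```python
import json

def write_jsonl(docs, path):
    """Write one JSON object per line, preserving non-ASCII text."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

write_jsonl([{"text": "[French] Qu'est-ce que l'ADN ? ..."}], "qa_output.jsonl")
```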
### Step 6: Run the Pipeline
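Executing the pipeline is then a single call. The exact `run()` signature (for example, whether it takes an executor argument) depends on your NeMo Curator version, so treat this as a sketch:

```python
# Runs all stages and returns the generated documents.
# The run() signature is an assumption; check the API reference.
results = pipeline.run()
print(f"Generated {len(results)} documents")
```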
## CLI Usage

The tutorial script supports command-line arguments:
### Available Arguments
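The tutorial script's actual flags are not reproduced here; as a sketch, an argument set covering the knobs used in the steps above could be defined with `argparse` (all flag names and defaults are illustrative):

```python
import argparse

def build_parser():
    # Hypothetical flags mirroring the pipeline's configuration knobs.
    p = argparse.ArgumentParser(description="Generate multilingual Q&A data")
    p.add_argument("--model", default="meta/llama-3.1-8b-instruct")
    p.add_argument("--num-samples", type=int, default=100)
    p.add_argument("--temperature", type=float, default=0.9)
    p.add_argument("--languages", nargs="+", default=["English"])
    p.add_argument("--output", default="qa_output.jsonl")
    return p

args = build_parser().parse_args(
    ["--languages", "English", "German", "--num-samples", "10"]
)
```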
## Sample Output

Generated documents contain a `text` field with the LLM response:
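An illustrative JSONL record (the content is made up; only the presence of the `text` field follows from the stage's output):

```json
{"text": "[German] Frage: Was ist maschinelles Lernen? Antwort: Maschinelles Lernen ist ..."}
```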
## Tips for Diverse Output

- Use a higher temperature (0.7–1.0) for more varied outputs
- Avoid fixed seeds when you want non-deterministic generation
- Include clear instructions in the prompt for consistent formatting
- Filter post-generation to enforce quality standards
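The first two tips translate into the sampling parameters passed to the model. In OpenAI-compatible terms they look like this; whether the stage forwards these parameters is an assumption, so check how your client passes generation options:

```python
# Sampling settings for diverse output: higher temperature, no fixed seed.
generation_kwargs = {
    "temperature": 0.9,  # 0.7-1.0 favors varied phrasing
    "top_p": 0.95,
    # deliberately no "seed": a fixed seed would make outputs repeat
}
```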
## Next Steps

- **LLM client**: advanced client configuration and performance tuning
- **Nemotron-CC**: advanced pipelines for text transformation and knowledge extraction