Generate Multilingual Q&A Data
This guide shows how to generate synthetic Q&A pairs across multiple languages using NeMo Curator’s QAMultilingualSyntheticStage. You’ll learn to configure an LLM client, create a generation pipeline, and optionally filter the output.
A pipeline that:
The AsyncOpenAIClient enables concurrent API requests for efficient batch generation:
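To illustrate bounded concurrent requests in isolation, here is a minimal sketch using only the standard library's asyncio. It is not the AsyncOpenAIClient implementation and does not call NeMo Curator: `fake_llm_call`, `generate_batch`, `run_batch`, and `max_concurrent` are hypothetical names introduced for this example, and the network request is simulated with a short sleep.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for one network request to an LLM endpoint."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to: {prompt}"


async def generate_batch(prompts, max_concurrent=4):
    """Send all prompts concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_call(prompt: str) -> str:
        # The semaphore caps how many requests run at the same time.
        async with semaphore:
            return await fake_llm_call(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(bounded_call(p) for p in prompts))


def run_batch(prompts, max_concurrent=4):
    """Synchronous entry point for scripts that are not already async."""
    return asyncio.run(generate_batch(prompts, max_concurrent))
```

Because the requests overlap while each one waits on the network, total wall-clock time grows with the number of prompts divided by the concurrency limit rather than with the number of prompts alone. The limit is worth keeping modest so the endpoint's rate limits are not exceeded.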
The prompt template must include a {language} placeholder. The stage randomly selects a language for each sample:
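The placeholder mechanism can be sketched in plain Python as follows. This shows only the general idea of filling a `{language}` placeholder with a randomly chosen language; it is not the stage's actual code. The language list, the template wording, and the helper name `build_prompt` are assumptions made for illustration.

```python
import random

# Illustrative values, not taken from the tutorial.
LANGUAGES = ["English", "French", "German", "Spanish"]
PROMPT_TEMPLATE = (
    "Generate a short question and answer pair in {language}. "
    "Begin the response with the language name in square brackets."
)


def build_prompt(template, languages, rng):
    """Pick one language at random and substitute it into the template."""
    if "{language}" not in template:
        raise ValueError("prompt template must contain a {language} placeholder")
    language = rng.choice(languages)
    return language, template.format(language=language)
```

Passing a seeded `random.Random` instance as `rng` makes the language choice reproducible across runs, which helps when debugging a prompt.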
If your prompt includes language prefixes, you can filter to keep only specific languages:
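As a rough sketch of prefix-based filtering, the function below keeps only records whose `text` field starts with one of the allowed prefixes. The bracketed prefix format (for example `[EN]`) and the helper name `keep_languages` are assumptions for this example; the actual prefix depends on what your prompt asks the model to emit, and this is not NeMo Curator's filter implementation.

```python
def keep_languages(records, allowed_prefixes):
    """Return records whose text begins with one of the allowed prefixes."""
    prefixes = tuple(allowed_prefixes)  # str.startswith accepts a tuple
    return [
        record
        for record in records
        if record["text"].lstrip().startswith(prefixes)
    ]
```

A call such as `keep_languages(records, ["[EN]", "[FR]"])` would drop every record that carries a different prefix or no prefix at all, so responses where the model ignored the prefix instruction are filtered out as well.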
Write results to JSONL or Parquet format:
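For reference, the JSONL layout itself is simple: one JSON object per line. The sketch below writes and reads such a file with the standard library only, assuming each record is a dictionary with a `text` field. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not the writer used by the pipeline; Parquet output is not covered here because it needs a third-party library.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False stores multilingual text as-is, not as \u escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load every non-empty line of a JSONL file as a dictionary."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Opening the file with an explicit UTF-8 encoding matters for multilingual output, since the platform default encoding may not handle every script.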
The tutorial script supports command-line arguments:
Generated documents contain a text field with the LLM response: