# Generate Multilingual Q&A Data

This guide shows how to generate synthetic Q&A pairs across multiple languages using NeMo Curator's `QAMultilingualSyntheticStage`. You'll learn to configure an LLM client, create a generation pipeline, and optionally filter the output.

## What You'll Build

A pipeline that:

- Generates Q&A pairs in multiple languages using an LLM
- Optionally filters results by language
- Writes output to JSONL format

## Prerequisites

- **NVIDIA API Key**: Obtain from NVIDIA Build
- **NeMo Curator**: Installed with text extras

```bash
export NVIDIA_API_KEY="nvapi-..."
```
## Quick Start

```python
import os

from nemo_curator.core.client import RayClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

# Initialize Ray
client = RayClient(include_dashboard=False)
client.start()

# Create LLM client
llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)

# Create pipeline
pipeline = Pipeline(name="multilingual_qa")

# Add synthetic generation stage
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German", "Spanish"],
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=50,
        generation_config=GenerationConfig(temperature=0.9),
    )
)

# Write output
pipeline.add_stage(JsonlWriter(path="./synthetic_qa/"))

# Run pipeline
results = pipeline.run()

client.stop()
```
## Step-by-Step Guide

### Step 1: Configure the LLM Client

The `AsyncOpenAIClient` enables concurrent API requests for efficient batch generation:

```python
import os

from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,  # Adjust based on rate limits
    max_retries=3,              # Retry on transient failures
    base_delay=1.0,             # Backoff delay in seconds
)

# Configure generation parameters
generation_config = GenerationConfig(
    temperature=0.9,  # Higher for more diverse outputs (default: 0.0)
    top_p=0.95,
    max_tokens=2048,
    seed=None,        # None for non-deterministic output (default: 0 for reproducibility)
)
```
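The `max_retries` and `base_delay` options describe a retry-with-exponential-backoff policy. The sketch below illustrates the general pattern those settings imply; it is a generic example, not NeMo Curator's actual retry implementation:

```python
import time


def call_with_backoff(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying failed calls with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Delay doubles on each retry: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))
```

With `max_retries=3` and `base_delay=1.0`, a request that keeps failing would be retried after roughly 1 s, 2 s, and 4 s before the error is raised.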
### Step 2: Define the Prompt Template

The prompt template must include a `{language}` placeholder. The stage randomly selects a language for each sample:

```python
# Simple Q&A prompt
prompt = "Generate a Q&A pair about science in {language}."

# Structured prompt with language prefixes
prompt = """
Generate a short question and a short answer in the general science domain in {language}.
Begin with the language name using the 2-letter code in square brackets,
for example, [EN] for English, [FR] for French, [DE] for German.
"""
```
### Step 3: Create the Pipeline

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

pipeline = Pipeline(
    name="multilingual_qa_generation",
    description="Generate synthetic Q&A pairs in multiple languages",
)

pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt=prompt,
        languages=["English", "French", "German", "Spanish", "Italian"],
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=generation_config,
    )
)
```
### Step 4: Add Language Filtering (Optional)

If your prompt includes language prefixes, you can filter to keep only specific languages:

```python
from nemo_curator.stages.text.filters.doc_filter import DocumentFilter
from nemo_curator.stages.text.modules.score_filter import ScoreFilter


class BeginsWithLanguageFilter(DocumentFilter):
    """Filter documents based on language prefix codes."""

    def __init__(self, languages: list[str]):
        super().__init__()
        self._name = "begins_with_language_filter"
        self.languages = languages

    def score_document(self, text: str) -> float:
        if not self.languages:
            return 1.0
        return 1.0 if text.startswith(tuple(self.languages)) else 0.0

    def keep_document(self, score: float) -> bool:
        return score == 1.0


# Add filter to keep only English outputs
pipeline.add_stage(
    ScoreFilter(
        BeginsWithLanguageFilter(languages=["[EN]"]),
        text_field="text",
    ),
)
```
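The scoring logic above is plain prefix matching. Stripped of the `DocumentFilter` machinery, it reduces to this standalone sketch, which you can use to sanity-check the behavior:

```python
def score_document(text: str, languages: list[str]) -> float:
    """Return 1.0 if text starts with any allowed prefix (or if no prefixes are set)."""
    if not languages:
        return 1.0
    return 1.0 if text.startswith(tuple(languages)) else 0.0


print(score_document("[EN] Question: ...", ["[EN]"]))  # 1.0 (kept)
print(score_document("[FR] Question: ...", ["[EN]"]))  # 0.0 (dropped)
```

Note that an empty `languages` list keeps every document, which makes the filter a no-op rather than dropping everything.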
### Step 5: Configure Output

Write results to JSONL or Parquet format:

```python
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter
from nemo_curator.stages.text.io.writer.parquet import ParquetWriter

# JSONL output
pipeline.add_stage(JsonlWriter(path="./output/synthetic_qa/"))

# Or Parquet output
# pipeline.add_stage(ParquetWriter(path="./output/synthetic_qa/"))
```
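After the pipeline runs, the JSONL output can be loaded back with the standard library alone. A small sketch (the exact file names depend on the writer, so this simply reads every `.jsonl` file in the directory):

```python
import json
from pathlib import Path


def read_jsonl_dir(path: str) -> list[dict]:
    """Load every record from all .jsonl files under a directory."""
    records = []
    for file in sorted(Path(path).glob("*.jsonl")):
        with file.open() as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records
```

For example, `read_jsonl_dir("./output/synthetic_qa/")` would return a list of dicts, each with the `text` field described under Sample Output.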
### Step 6: Run the Pipeline

```python
from nemo_curator.core.client import RayClient

# Initialize Ray
client = RayClient(include_dashboard=False)
client.start()

# Execute pipeline
print(pipeline.describe())
results = pipeline.run()

# Print results summary
if results:
    for result in results:
        if hasattr(result, "data") and result.data:
            for file_path in result.data:
                print(f"Generated: {file_path}")

client.stop()
```
## CLI Usage

The tutorial script supports command-line arguments:

```bash
cd tutorials/synthetic

# Basic usage
python synthetic_data_generation_example.py --num-samples 50

# Custom languages and model
python synthetic_data_generation_example.py \
    --num-samples 100 \
    --languages English French German \
    --model-name meta/llama-3.3-70b-instruct \
    --temperature 0.9

# Skip language filtering
python synthetic_data_generation_example.py \
    --num-samples 50 \
    --no-filter-languages
```
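For reference, flags like these map onto a conventional `argparse` setup. The sketch below is a hedged illustration of such a parser, not the tutorial script itself; it covers only the flags shown in the examples above:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """A minimal parser mirroring the CLI flags used in the examples."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-samples", type=int, default=100)
    parser.add_argument("--languages", nargs="+",
                        default=["English", "French", "German", "Spanish", "Italian"])
    parser.add_argument("--model-name", default="meta/llama-3.3-70b-instruct")
    parser.add_argument("--temperature", type=float, default=0.9)
    parser.add_argument("--no-filter-languages", action="store_true")
    return parser


args = build_parser().parse_args(["--num-samples", "50", "--languages", "English", "French"])
print(args.num_samples)  # 50
```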
### Available Arguments

| Argument | Default | Description |
|---|---|---|
| `--api-key` | env var | NVIDIA API key (or set `NVIDIA_API_KEY`) |
| `--base-url` | NVIDIA API | Base URL for the API endpoint |
| `--model-name` | `meta/llama-3.3-70b-instruct` | Model to use for generation |
| `--languages` | English, French, German, Spanish, Italian | Languages to generate Q&A pairs for (use full names) |
| `--num-samples` | 100 | Number of samples to generate |
| `--temperature` | 0.9 | Sampling temperature |
| `--output-dir` | `./synthetic_output` | Output directory |
| `--no-filter-languages` | False | Disable language filtering |
## Sample Output

Generated documents contain a `text` field with the LLM response:

```json
{"text": "[EN] Question: What causes ocean tides? Answer: Ocean tides are primarily caused by the gravitational pull of the Moon and Sun on Earth's water bodies."}
{"text": "[FR] Question: Qu'est-ce que la photosynthèse? Answer: La photosynthèse est le processus par lequel les plantes convertissent la lumière du soleil en énergie."}
{"text": "[DE] Question: Was ist der größte Planet in unserem Sonnensystem? Answer: Jupiter ist der größte Planet in unserem Sonnensystem."}
```
## Tips for Diverse Output

- Use a higher temperature (0.7-1.0) for more varied outputs
- Avoid fixed seeds for non-deterministic generation
- Include clear instructions in the prompt for consistent formatting
- Filter post-generation to enforce quality standards

## Next Steps

- **LLM Client Configuration**: Advanced client configuration and performance tuning
- **Nemotron-CC Pipelines**: Advanced pipelines for text transformation and knowledge extraction