---
description: >-
  Generate multilingual Q&A pairs using LLMs with NeMo Curator's synthetic
  data pipeline
categories:
  - tutorials
tags:
  - multilingual
  - qa-generation
  - synthetic-data
  - quickstart
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: how-to
modality: text-only
---

# Generate Multilingual Q&A Data

This guide shows how to generate synthetic Q&A pairs across multiple languages using NeMo Curator's `QAMultilingualSyntheticStage`. You'll learn to configure an LLM client, create a generation pipeline, and optionally filter the output.

## What You'll Build

A pipeline that:

1. Generates Q&A pairs in multiple languages using an LLM
2. Optionally filters results by language
3. Writes output to JSONL format

## Prerequisites

* **NVIDIA API Key**: Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys)
* **NeMo Curator**: Installed with text extras

```bash
export NVIDIA_API_KEY="nvapi-..."
```

## Quick Start

```python
import os

from nemo_curator.core.client import RayClient
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

# Initialize Ray
client = RayClient(include_dashboard=False)
client.start()

# Create LLM client
llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)

# Create pipeline
pipeline = Pipeline(name="multilingual_qa")

# Add synthetic generation stage
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German", "Spanish"],
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=50,
        generation_config=GenerationConfig(temperature=0.9),
    )
)

# Write output
pipeline.add_stage(JsonlWriter(path="./synthetic_qa/"))

# Run pipeline
results = pipeline.run()

client.stop()
```

## Step-by-Step Guide

### Step 1: Configure the LLM Client

The `AsyncOpenAIClient` enables concurrent API requests for efficient batch generation:

```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig

llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,  # Adjust based on rate limits
    max_retries=3,              # Retry on transient failures
    base_delay=1.0,             # Backoff delay in seconds
)

# Configure generation parameters
generation_config = GenerationConfig(
    temperature=0.9,  # Higher for more diverse outputs (default: 0.0)
    top_p=0.95,
    max_tokens=2048,
    seed=None,        # Set to None for non-deterministic output (default: 0 for reproducibility)
)
```

### Step 2: Define the Prompt Template

The prompt template must include a `{language}` placeholder. The stage randomly selects a language for each sample:

```python
# Simple Q&A prompt
prompt = "Generate a Q&A pair about science in {language}."

# Structured prompt with language prefixes
prompt = """
Generate a short question and a short answer in the general science domain
in {language}. Begin with the language name using the 2-letter code in
square brackets, for example, [EN] for English, [FR] for French,
[DE] for German.
""" ``` ### Step 3: Create the Pipeline ```python from nemo_curator.pipeline import Pipeline from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage pipeline = Pipeline( name="multilingual_qa_generation", description="Generate synthetic Q&A pairs in multiple languages", ) pipeline.add_stage( QAMultilingualSyntheticStage( prompt=prompt, languages=["English", "French", "German", "Spanish", "Italian"], client=llm_client, model_name="meta/llama-3.3-70b-instruct", num_samples=100, generation_config=generation_config, ) ) ``` ### Step 4: Add Language Filtering (Optional) If your prompt includes language prefixes, you can filter to keep only specific languages: ```python from nemo_curator.stages.text.filters.doc_filter import DocumentFilter from nemo_curator.stages.text.modules.score_filter import ScoreFilter class BeginsWithLanguageFilter(DocumentFilter): """Filter documents based on language prefix codes.""" def __init__(self, languages: list[str]): super().__init__() self._name = "begins_with_language_filter" self.languages = languages def score_document(self, text: str) -> float: if not self.languages: return 1.0 return 1.0 if text.startswith(tuple(self.languages)) else 0.0 def keep_document(self, score: float) -> bool: return score == 1.0 # Add filter to keep only English outputs pipeline.add_stage( ScoreFilter( BeginsWithLanguageFilter(languages=["[EN]"]), text_field="text", ), ) ``` ### Step 5: Configure Output Write results to JSONL or Parquet format: ```python from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter from nemo_curator.stages.text.io.writer.parquet import ParquetWriter # JSONL output pipeline.add_stage(JsonlWriter(path="./output/synthetic_qa/")) # Or Parquet output # pipeline.add_stage(ParquetWriter(path="./output/synthetic_qa/")) ``` ### Step 6: Run the Pipeline ```python from nemo_curator.core.client import RayClient # Initialize Ray client = RayClient(include_dashboard=False) client.start() # 
print(pipeline.describe())
results = pipeline.run()

# Print results summary
if results:
    for result in results:
        if hasattr(result, "data") and result.data:
            for file_path in result.data:
                print(f"Generated: {file_path}")

client.stop()
```

## CLI Usage

The tutorial script supports command-line arguments:

```bash
cd tutorials/synthetic

# Basic usage
python synthetic_data_generation_example.py --num-samples 50

# Custom languages and model
python synthetic_data_generation_example.py \
  --num-samples 100 \
  --languages English French German \
  --model-name meta/llama-3.3-70b-instruct \
  --temperature 0.9

# Skip language filtering
python synthetic_data_generation_example.py \
  --num-samples 50 \
  --no-filter-languages
```

### Available Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--api-key` | env var | NVIDIA API key (or set `NVIDIA_API_KEY`) |
| `--base-url` | NVIDIA API | Base URL for the API endpoint |
| `--model-name` | `meta/llama-3.3-70b-instruct` | Model to use for generation |
| `--languages` | English, French, German, Spanish, Italian | Languages to generate Q&A pairs for (use full names) |
| `--num-samples` | 100 | Number of samples to generate |
| `--temperature` | 0.9 | Sampling temperature |
| `--output-path` | `./synthetic_output` | Output directory |
| `--no-filter-languages` | False | Disable language filtering |

## Sample Output

Generated documents contain a `text` field with the LLM response:

```json
{"text": "[EN] Question: What causes ocean tides? Answer: Ocean tides are primarily caused by the gravitational pull of the Moon and Sun on Earth's water bodies."}
{"text": "[FR] Question: Qu'est-ce que la photosynthèse? Answer: La photosynthèse est le processus par lequel les plantes convertissent la lumière du soleil en énergie."}
{"text": "[DE] Question: Was ist der größte Planet in unserem Sonnensystem? Answer: Jupiter ist der größte Planet in unserem Sonnensystem."}
```

## Tips for Diverse Output

1. **Use a higher temperature** (0.7–1.0) for more varied outputs
2. **Avoid fixed seeds** for non-deterministic generation
3. **Include clear instructions** in the prompt for consistent formatting
4. **Filter post-generation** to meet quality standards

---

## Next Steps

* [LLM client](/curate-text/synthetic/llm-client): Advanced client configuration and performance tuning
* [Nemotron-CC](/curate-text/synthetic/nemotron-cc): Advanced pipelines for text transformation and knowledge extraction