Generate Multilingual Q&A Data

This guide shows how to generate synthetic Q&A pairs across multiple languages using NeMo Curator's `QAMultilingualSyntheticStage`. You'll learn to configure an LLM client, create a generation pipeline, and optionally filter the output.

What You’ll Build

A pipeline that:

  1. Generates Q&A pairs in multiple languages using an LLM
  2. Optionally filters results by language
  3. Writes output to JSONL format

Prerequisites

  • NVIDIA API Key: Obtain from NVIDIA Build
  • NeMo Curator: Installed with text extras

```bash
export NVIDIA_API_KEY="nvapi-..."
```

Quick Start

```python
import os

from nemo_curator.core.client import RayClient
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

# Initialize Ray
client = RayClient(include_dashboard=False)
client.start()

# Create LLM client
llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)

# Create pipeline
pipeline = Pipeline(name="multilingual_qa")

# Add synthetic generation stage
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German", "Spanish"],
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=50,
        generation_config=GenerationConfig(temperature=0.9),
    )
)

# Write output
pipeline.add_stage(JsonlWriter(path="./synthetic_qa/"))

# Run pipeline
results = pipeline.run()

client.stop()
```

Step-by-Step Guide

Step 1: Configure the LLM Client

The `AsyncOpenAIClient` enables concurrent API requests for efficient batch generation:

```python
import os

from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig

llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,  # Adjust based on rate limits
    max_retries=3,              # Retry on transient failures
    base_delay=1.0,             # Backoff delay in seconds
)

# Configure generation parameters
generation_config = GenerationConfig(
    temperature=0.9,  # Higher for more diverse outputs (default: 0.0)
    top_p=0.95,
    max_tokens=2048,
    seed=None,  # None for non-deterministic output (default: 0 for reproducibility)
)
```
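
The NVIDIA endpoint is OpenAI-compatible, so you can smoke-test your key and model outside of Curator with the plain `openai` SDK. The sketch below is for illustration only and assumes you have the `openai` package installed; it is not part of the NeMo Curator API:

```python
# Hypothetical smoke test using the plain `openai` SDK, not NeMo Curator.
# Assumes `pip install openai` and NVIDIA_API_KEY set in the environment.
import os

from openai import OpenAI

raw_client = OpenAI(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
response = raw_client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Generate a Q&A pair about science in French."}],
    temperature=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```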

Step 2: Define the Prompt Template

The prompt template must include a `{language}` placeholder. The stage randomly selects a language for each sample:

```python
# Simple Q&A prompt
prompt = "Generate a Q&A pair about science in {language}."

# Structured prompt with language prefixes
prompt = """
Generate a short question and a short answer in the general science domain in {language}.
Begin with the language name using the 2-letter code in square brackets,
for example, [EN] for English, [FR] for French, [DE] for German.
"""
```

Step 3: Create the Pipeline

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

pipeline = Pipeline(
    name="multilingual_qa_generation",
    description="Generate synthetic Q&A pairs in multiple languages",
)

pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt=prompt,
        languages=["English", "French", "German", "Spanish", "Italian"],
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=generation_config,
    )
)
```

Step 4: Add Language Filtering (Optional)

If your prompt includes language prefixes, you can filter to keep only specific languages:

```python
from nemo_curator.stages.text.filters.doc_filter import DocumentFilter
from nemo_curator.stages.text.modules.score_filter import ScoreFilter


class BeginsWithLanguageFilter(DocumentFilter):
    """Filter documents based on language prefix codes."""

    def __init__(self, languages: list[str]):
        super().__init__()
        self._name = "begins_with_language_filter"
        self.languages = languages

    def score_document(self, text: str) -> float:
        if not self.languages:
            return 1.0
        return 1.0 if text.startswith(tuple(self.languages)) else 0.0

    def keep_document(self, score: float) -> bool:
        return score == 1.0


# Add filter to keep only English outputs
pipeline.add_stage(
    ScoreFilter(
        BeginsWithLanguageFilter(languages=["[EN]"]),
        text_field="text",
    ),
)
```
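
Before wiring the filter into the pipeline, you can sanity-check its logic on a couple of strings:

```python
# Quick standalone check of the filter defined above.
f = BeginsWithLanguageFilter(languages=["[EN]"])

print(f.keep_document(f.score_document("[EN] Question: What is DNA? ...")))  # True
print(f.keep_document(f.score_document("[FR] Question: Qu'est-ce que l'ADN ? ...")))  # False
```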

Step 5: Configure Output

Write results to JSONL or Parquet format:

```python
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter
from nemo_curator.stages.text.io.writer.parquet import ParquetWriter

# JSONL output
pipeline.add_stage(JsonlWriter(path="./output/synthetic_qa/"))

# Or Parquet output
# pipeline.add_stage(ParquetWriter(path="./output/synthetic_qa/"))
```

Step 6: Run the Pipeline

```python
from nemo_curator.core.client import RayClient

# Initialize Ray
client = RayClient(include_dashboard=False)
client.start()

# Execute pipeline
print(pipeline.describe())
results = pipeline.run()

# Print results summary
if results:
    for result in results:
        if hasattr(result, "data") and result.data:
            for file_path in result.data:
                print(f"Generated: {file_path}")

client.stop()
```

CLI Usage

The tutorial script supports command-line arguments:

```bash
cd tutorials/synthetic

# Basic usage
python synthetic_data_generation_example.py --num-samples 50

# Custom languages and model
python synthetic_data_generation_example.py \
    --num-samples 100 \
    --languages English French German \
    --model-name meta/llama-3.3-70b-instruct \
    --temperature 0.9

# Skip language filtering
python synthetic_data_generation_example.py \
    --num-samples 50 \
    --no-filter-languages
```

Available Arguments

| Argument | Default | Description |
|---|---|---|
| `--api-key` | env var | NVIDIA API key (or set `NVIDIA_API_KEY`) |
| `--base-url` | NVIDIA API | Base URL for the API endpoint |
| `--model-name` | `meta/llama-3.3-70b-instruct` | Model to use for generation |
| `--languages` | English, French, German, Spanish, Italian | Languages to generate Q&A pairs for (use full names) |
| `--num-samples` | 100 | Number of samples to generate |
| `--temperature` | 0.9 | Sampling temperature |
| `--output-path` | `./synthetic_output` | Output directory |
| `--no-filter-languages` | False | Disable language filtering |
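
For orientation, these flags map naturally onto a standard argparse setup. The sketch below is a plausible reconstruction for reference, not the tutorial script's actual source:

```python
# Hypothetical reconstruction of the tutorial script's CLI; the real
# synthetic_data_generation_example.py may differ in its details.
import argparse
import os

parser = argparse.ArgumentParser(description="Generate multilingual synthetic Q&A data.")
parser.add_argument("--api-key", default=os.environ.get("NVIDIA_API_KEY"))
parser.add_argument("--base-url", default="https://integrate.api.nvidia.com/v1")
parser.add_argument("--model-name", default="meta/llama-3.3-70b-instruct")
parser.add_argument("--languages", nargs="+",
                    default=["English", "French", "German", "Spanish", "Italian"])
parser.add_argument("--num-samples", type=int, default=100)
parser.add_argument("--temperature", type=float, default=0.9)
parser.add_argument("--output-path", default="./synthetic_output")
parser.add_argument("--no-filter-languages", action="store_true")
args = parser.parse_args()
```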

Sample Output

Generated documents contain a `text` field with the LLM response:

1{"text": "[EN] Question: What causes ocean tides? Answer: Ocean tides are primarily caused by the gravitational pull of the Moon and Sun on Earth's water bodies."}
2{"text": "[FR] Question: Qu'est-ce que la photosynthèse? Answer: La photosynthèse est le processus par lequel les plantes convertissent la lumière du soleil en énergie."}
3{"text": "[DE] Question: Was ist der größte Planet in unserem Sonnensystem? Answer: Jupiter ist der größte Planet in unserem Sonnensystem."}

Tips for Diverse Output

  1. Use a higher temperature (0.7-1.0) for more varied outputs
  2. Avoid fixed seeds for non-deterministic generation (see the sketch after this list)
  3. Include clear formatting instructions in the prompt for consistent output
  4. Filter post-generation to enforce quality standards
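
Putting the first two tips together, a diversity-oriented configuration might look like the following (the values are illustrative):

```python
# Diversity-oriented settings: higher temperature, no fixed seed.
from nemo_curator.models.client.llm_client import GenerationConfig

diverse_config = GenerationConfig(
    temperature=0.9,  # tip 1: higher temperature for varied outputs
    top_p=0.95,
    seed=None,        # tip 2: no fixed seed, so repeated runs differ
)
```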

Next Steps

  • LLM client: Advanced client configuration and performance tuning
  • Nemotron-CC: Advanced pipelines for text transformation and knowledge extraction