Generate Multilingual Q&A Data

This guide shows how to generate synthetic Q&A pairs across multiple languages using NeMo Curator's `QAMultilingualSyntheticStage`. You'll learn to configure an LLM client, create a generation pipeline, and optionally filter the output.

What You’ll Build

A pipeline that:

  1. Generates Q&A pairs in multiple languages using an LLM
  2. Optionally filters results by language
  3. Writes output to JSONL format

Prerequisites

  • NVIDIA API Key: Obtain from NVIDIA Build
  • NeMo Curator: Installed with text extras

```bash
export NVIDIA_API_KEY="nvapi-..."
```

Quick Start

```python
import os

from nemo_curator.core.client import RayClient
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

# Initialize Ray
client = RayClient(include_dashboard=False)
client.start()

# Create LLM client
llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)

# Create pipeline
pipeline = Pipeline(name="multilingual_qa")

# Add synthetic generation stage
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German", "Spanish"],
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=50,
        generation_config=GenerationConfig(temperature=0.9),
    )
)

# Write output
pipeline.add_stage(JsonlWriter(path="./synthetic_qa/"))

# Run pipeline
results = pipeline.run()

client.stop()
```

Step-by-Step Guide

Step 1: Configure the LLM Client

The `AsyncOpenAIClient` enables concurrent API requests for efficient batch generation:

```python
import os

from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig

llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,  # Adjust based on rate limits
    max_retries=3,              # Retry on transient failures
    base_delay=1.0,             # Backoff delay in seconds
)

# Configure generation parameters
generation_config = GenerationConfig(
    temperature=0.9,  # Higher for more diverse outputs (default: 0.0)
    top_p=0.95,
    max_tokens=2048,
    seed=None,  # None for non-deterministic output (default: 0 for reproducibility)
)
```
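
The NVIDIA endpoint is OpenAI-compatible, so you can smoke-test your key and model outside of Curator with the plain `openai` SDK. The sketch below is for illustration only and assumes you have the `openai` package installed; it is not part of the NeMo Curator API:

```python
# Hypothetical smoke test using the plain `openai` SDK, not NeMo Curator.
# Assumes `pip install openai` and NVIDIA_API_KEY set in the environment.
import os

from openai import OpenAI

raw_client = OpenAI(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
response = raw_client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Generate a Q&A pair about science in French."}],
    temperature=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```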

Step 2: Define the Prompt Template

The prompt template must include a `{language}` placeholder. The stage randomly selects a language for each sample:

```python
# Simple Q&A prompt
prompt = "Generate a Q&A pair about science in {language}."

# Structured prompt with language prefixes
prompt = """
Generate a short question and a short answer in the general science domain in {language}.
Begin with the language name using the 2-letter code in square brackets,
for example, [EN] for English, [FR] for French, [DE] for German.
"""
```

Step 3: Create the Pipeline

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

pipeline = Pipeline(
    name="multilingual_qa_generation",
    description="Generate synthetic Q&A pairs in multiple languages",
)

pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt=prompt,
        languages=["English", "French", "German", "Spanish", "Italian"],
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=generation_config,
    )
)
```

Step 4: Add Language Filtering (Optional)

If your prompt includes language prefixes, you can filter to keep only specific languages:

```python
from nemo_curator.stages.text.filters.doc_filter import DocumentFilter
from nemo_curator.stages.text.modules.score_filter import ScoreFilter


class BeginsWithLanguageFilter(DocumentFilter):
    """Filter documents based on language prefix codes."""

    def __init__(self, languages: list[str]):
        super().__init__()
        self._name = "begins_with_language_filter"
        self.languages = languages

    def score_document(self, text: str) -> float:
        if not self.languages:
            return 1.0
        return 1.0 if text.startswith(tuple(self.languages)) else 0.0

    def keep_document(self, score: float) -> bool:
        return score == 1.0


# Add filter to keep only English outputs
pipeline.add_stage(
    ScoreFilter(
        BeginsWithLanguageFilter(languages=["[EN]"]),
        text_field="text",
    ),
)
```
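
Before wiring the filter into the pipeline, you can sanity-check its logic on a couple of strings:

```python
# Quick standalone check of the filter defined above.
f = BeginsWithLanguageFilter(languages=["[EN]"])

print(f.keep_document(f.score_document("[EN] Question: What is DNA? ...")))  # True
print(f.keep_document(f.score_document("[FR] Question: Qu'est-ce que l'ADN ? ...")))  # False
```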

Step 5: Configure Output

Write results to JSONL or Parquet format:

```python
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter
from nemo_curator.stages.text.io.writer.parquet import ParquetWriter

# JSONL output
pipeline.add_stage(JsonlWriter(path="./output/synthetic_qa/"))

# Or Parquet output
# pipeline.add_stage(ParquetWriter(path="./output/synthetic_qa/"))
```

Step 6: Run the Pipeline

```python
from nemo_curator.core.client import RayClient

# Initialize Ray
client = RayClient(include_dashboard=False)
client.start()

# Execute pipeline
print(pipeline.describe())
results = pipeline.run()

# Print results summary
if results:
    for result in results:
        if hasattr(result, "data") and result.data:
            for file_path in result.data:
                print(f"Generated: {file_path}")

client.stop()
```

CLI Usage

The tutorial script supports command-line arguments:

```bash
cd tutorials/synthetic

# Basic usage
python synthetic_data_generation_example.py --num-samples 50

# Custom languages and model
python synthetic_data_generation_example.py \
    --num-samples 100 \
    --languages English French German \
    --model-name meta/llama-3.3-70b-instruct \
    --temperature 0.9

# Skip language filtering
python synthetic_data_generation_example.py \
    --num-samples 50 \
    --no-filter-languages
```

Available Arguments

| Argument | Default | Description |
|---|---|---|
| `--api-key` | env var | NVIDIA API key (or set `NVIDIA_API_KEY`) |
| `--base-url` | NVIDIA API | Base URL for the API endpoint |
| `--model-name` | `meta/llama-3.3-70b-instruct` | Model to use for generation |
| `--languages` | English, French, German, Spanish, Italian | Languages to generate Q&A pairs for (use full names) |
| `--num-samples` | 100 | Number of samples to generate |
| `--temperature` | 0.9 | Sampling temperature |
| `--output-path` | `./synthetic_output` | Output directory |
| `--no-filter-languages` | False | Disable language filtering |
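
For orientation, these flags map naturally onto a standard argparse setup. The sketch below is a plausible reconstruction for reference, not the tutorial script's actual source:

```python
# Hypothetical reconstruction of the tutorial script's CLI; the real
# synthetic_data_generation_example.py may differ in its details.
import argparse
import os

parser = argparse.ArgumentParser(description="Generate multilingual synthetic Q&A data.")
parser.add_argument("--api-key", default=os.environ.get("NVIDIA_API_KEY"))
parser.add_argument("--base-url", default="https://integrate.api.nvidia.com/v1")
parser.add_argument("--model-name", default="meta/llama-3.3-70b-instruct")
parser.add_argument("--languages", nargs="+",
                    default=["English", "French", "German", "Spanish", "Italian"])
parser.add_argument("--num-samples", type=int, default=100)
parser.add_argument("--temperature", type=float, default=0.9)
parser.add_argument("--output-path", default="./synthetic_output")
parser.add_argument("--no-filter-languages", action="store_true")
args = parser.parse_args()
```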

Sample Output

Generated documents contain a `text` field with the LLM response:

1{"text": "[EN] Question: What causes ocean tides? Answer: Ocean tides are primarily caused by the gravitational pull of the Moon and Sun on Earth's water bodies."}
2{"text": "[FR] Question: Qu'est-ce que la photosynthèse? Answer: La photosynthèse est le processus par lequel les plantes convertissent la lumière du soleil en énergie."}
3{"text": "[DE] Question: Was ist der größte Planet in unserem Sonnensystem? Answer: Jupiter ist der größte Planet in unserem Sonnensystem."}

Tips for Diverse Output

  1. Use a higher temperature (0.7-1.0) for more varied outputs
  2. Avoid fixed seeds for non-deterministic generation (see the sketch after this list)
  3. Include clear formatting instructions in the prompt for consistent output
  4. Filter post-generation to enforce quality standards
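
Putting the first two tips together, a diversity-oriented configuration might look like the following (the values are illustrative):

```python
# Diversity-oriented settings: higher temperature, no fixed seed.
from nemo_curator.models.client.llm_client import GenerationConfig

diverse_config = GenerationConfig(
    temperature=0.9,  # tip 1: higher temperature for varied outputs
    top_p=0.95,
    seed=None,        # tip 2: no fixed seed, so repeated runs differ
)
```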

Next Steps

  • LLM client: Advanced client configuration and performance tuning
  • Nemotron-CC: Advanced pipelines for text transformation and knowledge extraction