
LLM Client Configuration


NeMo Curator’s synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.

Overview

Two client types are available:

  • AsyncOpenAIClient: Recommended for high-throughput batch processing with concurrent requests
  • OpenAIClient: Synchronous client for simpler use cases or debugging

For most SDG workloads, use AsyncOpenAIClient to maximize throughput.
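The throughput gain from the async client comes from overlapping request latency. A minimal, self-contained sketch of the effect, using simulated requests rather than the real client (the timings are illustrative, not NeMo Curator behavior):

```python
import asyncio
import time

async def fake_request(i: int, latency: float = 0.05) -> str:
    """Stand-in for one LLM API call with fixed network latency."""
    await asyncio.sleep(latency)
    return f"response-{i}"

async def sequential(n: int) -> list[str]:
    # One request at a time: total time is roughly n * latency
    return [await fake_request(i) for i in range(n)]

async def concurrent(n: int) -> list[str]:
    # All requests in flight at once: total time is roughly 1 * latency
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

start = time.perf_counter()
asyncio.run(sequential(10))
seq_time = time.perf_counter() - start

start = time.perf_counter()
results = asyncio.run(concurrent(10))
conc_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

For batch SDG, where hundreds of independent prompts are issued, this overlap is why the async client dominates.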

Basic Configuration

NVIDIA API Endpoints

from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Or use NVIDIA_API_KEY env var
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)

Environment Variables

Set your API key as an environment variable to avoid hardcoding credentials:

export NVIDIA_API_KEY="nvapi-..."

The underlying OpenAI client automatically uses the OPENAI_API_KEY environment variable if no api_key is provided. For NVIDIA APIs, explicitly pass the key:

import os

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)

Generation Parameters

Configure LLM generation behavior using GenerationConfig:

from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens | int | 2048 | Maximum tokens to generate per request |
| temperature | float | 0.0 | Sampling temperature (0.0-2.0). Higher values increase randomness |
| top_p | float | 0.95 | Nucleus sampling parameter (0.0-1.0) |
| top_k | int | None | Top-k sampling (if supported by the endpoint) |
| seed | int | 0 | Random seed for reproducibility |
| stop | str/list | None | Stop sequences to end generation |
| stream | bool | False | Enable streaming (not recommended for batch processing) |
| n | int | 1 | Number of completions to generate per request |
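To build intuition for what top_p controls, here is a toy re-implementation of the filtering step of nucleus sampling. This is a sketch of the idea only; the actual filtering happens server-side in the inference engine:

```python
def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize the survivors."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

dist = {"the": 0.5, "a": 0.3, "an": 0.15, "xyzzy": 0.05}
print(nucleus_filter(dist, top_p=0.9))  # the low-probability tail token is dropped
```

Lowering top_p trims more of the low-probability tail, making output more conservative; temperature reshapes the distribution before this cut is applied.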

Performance Tuning

Concurrency vs. Parallelism

The max_concurrent_requests parameter controls how many API requests the client can have in-flight simultaneously. This interacts with Ray’s distributed workers:

  • Client-level concurrency: max_concurrent_requests limits concurrent API calls per worker
  • Worker-level parallelism: Ray distributes tasks across multiple workers

# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
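The total load the endpoint sees is roughly num_workers × max_concurrent_requests, so size the per-client limit with the worker count in mind. The cap itself behaves like a semaphore around each request, which the following self-contained sketch illustrates with simulated requests (not the real client):

```python
import asyncio

async def limited_request(sem: asyncio.Semaphore, tracker: dict) -> None:
    """One simulated API call, gated by the shared semaphore."""
    async with sem:
        tracker["in_flight"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["in_flight"])
        await asyncio.sleep(0.01)  # simulated API latency
        tracker["in_flight"] -= 1

async def run_worker(max_concurrent: int, n_tasks: int) -> int:
    """One worker's view: the semaphore caps in-flight requests the way
    max_concurrent_requests does inside the client."""
    sem = asyncio.Semaphore(max_concurrent)
    tracker = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(limited_request(sem, tracker) for _ in range(n_tasks)))
    return tracker["peak"]

peak = asyncio.run(run_worker(max_concurrent=3, n_tasks=20))
print(f"peak in-flight requests: {peak}")
```

With, say, 4 Ray workers each configured at max_concurrent_requests=3, the endpoint can see up to 12 requests in flight at once.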

Retry Configuration

The client includes automatic retry with exponential backoff for transient errors:

client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,   # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,     # Request timeout in seconds
)

The retry logic handles:

  • Rate limit errors (429): Automatic backoff with jitter
  • Connection errors: Retry with exponential delay
  • Transient failures: Configurable retry attempts
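The backoff pattern described above can be sketched as follows. This is illustrative only; the client's actual retry internals may differ:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 60.0, jitter: float = 0.1) -> float:
    """Delay before retry `attempt` (0-indexed): base_delay * 2^attempt,
    capped at max_delay, plus up to `jitter` fraction of random noise
    so that many clients don't retry in lockstep."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay * (1.0 + random.uniform(0.0, jitter))

for attempt in range(4):
    print(f"attempt {attempt}: wait ~{backoff_delay(attempt):.1f}s")
```

Raising base_delay stretches the whole schedule, which is the main lever when an endpoint's rate limits are strict.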

Using Other OpenAI-Compatible Endpoints

The AsyncOpenAIClient works with any OpenAI-compatible API endpoint. Simply configure the base_url and api_key parameters:

# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)

Complete Example

import os
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ.get("NVIDIA_API_KEY"),
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)

Troubleshooting

Rate Limit Errors

If you encounter frequent 429 errors:

  1. Reduce max_concurrent_requests
  2. Increase base_delay for longer backoff
  3. Consider using a local deployment for high-volume workloads
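If reducing concurrency is not enough, a client-side token bucket can pace requests before they ever reach the API. This is a hypothetical sketch, not part of NeMo Curator:

```python
class TokenBucket:
    """Client-side rate limiter: allow at most `rate` requests per
    `period` seconds, smoothing bursts before they hit the endpoint."""

    def __init__(self, rate: int, period: float):
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_per_sec = rate / period
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, then spend one token if available.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=5, period=1.0)  # 5 requests per second
allowed = [bucket.allow(now=0.0) for _ in range(8)]
print(allowed)  # first 5 pass, the rest are throttled until tokens refill
```

Gating each request on `allow()` (sleeping briefly when it returns False) keeps a burst of tasks from tripping the server's 429 responses in the first place.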

Connection Timeouts

For slow networks or high-latency endpoints:

client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from the default 120 seconds
)
