# LLM Client Configuration
NeMo Curator’s synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.
## Overview
Two client types are available:

- `AsyncOpenAIClient`: Recommended for high-throughput batch processing with concurrent requests
- `OpenAIClient`: Synchronous client for simpler use cases or debugging

For most SDG workloads, use `AsyncOpenAIClient` to maximize throughput.
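The throughput difference is easy to see with a toy simulation. The sketch below uses plain `asyncio` with a stand-in `fake_request` coroutine (not part of NeMo Curator) to show why concurrent requests finish in roughly one request's latency rather than the sum of all latencies:

```python
import asyncio
import time

async def fake_request(i: int) -> str:
    """Stand-in for one LLM API call with ~50 ms of simulated latency."""
    await asyncio.sleep(0.05)
    return f"response-{i}"

async def run_concurrent(n: int) -> list[str]:
    # All n requests are in flight at once; wall time is ~one latency.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

async def run_sequential(n: int) -> list[str]:
    # One request at a time; wall time grows linearly with n.
    return [await fake_request(i) for i in range(n)]

start = time.perf_counter()
results = asyncio.run(run_concurrent(5))
concurrent_s = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(run_sequential(5))
sequential_s = time.perf_counter() - start
```

With real network latency the gap is even larger, which is why the async client is preferred for batch SDG workloads.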
## Basic Configuration
### NVIDIA API Endpoints
```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Or use the NVIDIA_API_KEY env var
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)
```
### Environment Variables
Set your API key as an environment variable to avoid hardcoding credentials:
```bash
export NVIDIA_API_KEY="nvapi-..."
```
The underlying OpenAI client automatically uses the `OPENAI_API_KEY` environment variable if no `api_key` is provided. For NVIDIA APIs, pass the key explicitly:
```python
import os

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
```
## Generation Parameters
Configure LLM generation behavior using `GenerationConfig`:
```python
from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)
```
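For intuition on the `temperature` parameter: lower values sharpen the token distribution toward greedy decoding, higher values flatten it toward uniform. Here is a minimal, generic softmax sketch (illustrative only, not NeMo Curator or inference-server code):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.2)  # near-greedy
flat = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform
```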
| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | int | 2048 | Maximum tokens to generate per request |
| `temperature` | float | 0.0 | Sampling temperature (0.0-2.0). Higher values increase randomness |
| `top_p` | float | 0.95 | Nucleus sampling parameter (0.0-1.0) |
| `top_k` | int | None | Top-k sampling (if supported by the endpoint) |
| `seed` | int | 0 | Random seed for reproducibility |
| `stop` | str/list | None | Stop sequences to end generation |
| `stream` | bool | False | Enable streaming (not recommended for batch processing) |
| `n` | int | 1 | Number of completions to generate per request |
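To build intuition for `top_p`, the sketch below implements the nucleus-sampling filter in isolation (a generic illustration, not NeMo Curator code): it keeps the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, and sampling then happens only within that set.

```python
def nucleus_filter(probs: list[float], top_p: float) -> list[int]:
    """Return the indices kept by nucleus (top-p) sampling."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # nucleus reached; drop the remaining tail
            break
    return kept

probs = [0.5, 0.25, 0.125, 0.125]  # token probabilities, most likely first
kept = nucleus_filter(probs, top_p=0.75)  # keeps indices [0, 1]
```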
## Performance Tuning
### Concurrency vs. Parallelism
The `max_concurrent_requests` parameter controls how many API requests the client can have in flight simultaneously. This interacts with Ray’s distributed workers:

- **Client-level concurrency**: `max_concurrent_requests` limits concurrent API calls per worker
- **Worker-level parallelism**: Ray distributes tasks across multiple workers
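One way to picture the client-level cap is a semaphore gating in-flight calls. The sketch below shows the general pattern with `asyncio.Semaphore` (an illustration of the mechanism, not the client's actual internals):

```python
import asyncio

async def limited_call(semaphore: asyncio.Semaphore, state: dict, i: int) -> int:
    """Stand-in for one API call, gated by a shared concurrency limit."""
    async with semaphore:  # blocks once the in-flight limit is reached
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated request latency
        state["active"] -= 1
    return i

async def run_all(num_tasks: int, max_concurrent_requests: int) -> dict:
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    state = {"active": 0, "peak": 0}
    state["results"] = await asyncio.gather(
        *(limited_call(semaphore, state, i) for i in range(num_tasks))
    )
    return state

state = asyncio.run(run_all(num_tasks=10, max_concurrent_requests=3))
# state["peak"] never exceeds 3, even though all 10 tasks were submitted at once
```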
```python
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
```
### Retry Configuration
The client includes automatic retry with exponential backoff for transient errors:
```python
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,   # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,     # Request timeout in seconds
)
```
The retry logic handles:

- **Rate limit errors (429)**: Automatic backoff with jitter
- **Connection errors**: Retry with exponential delay
- **Transient failures**: Configurable retry attempts
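As a generic illustration of the backoff pattern (not the client's exact formula), exponential backoff with full jitter doubles a base delay on each attempt, caps it, and then waits a uniformly random fraction of that value so retrying clients don't synchronize:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Exponential backoff with full jitter (generic sketch)."""
    capped = min(max_delay, base_delay * (2 ** attempt))  # 1s, 2s, 4s, ... up to the cap
    return random.uniform(0.0, capped)  # jitter: spread retries out randomly

delays = [backoff_delay(a) for a in range(4)]  # one delay per retry attempt
```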
## Using Other OpenAI-Compatible Endpoints
The `AsyncOpenAIClient` works with any OpenAI-compatible API endpoint. Configure the `base_url` and `api_key` parameters:
```python
# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set the OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)
```
## Complete Example
```python
import os

from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ.get("NVIDIA_API_KEY"),
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
```
## Troubleshooting
### Rate Limit Errors
If you encounter frequent 429 errors:

- Reduce `max_concurrent_requests`
- Increase `base_delay` for longer backoff
- Consider using a local deployment for high-volume workloads
### Connection Timeouts
For slow networks or high-latency endpoints:
```python
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from the default 120 seconds
)
```
## Next Steps
- **Generate Multilingual Q&A Data**: Generate multilingual Q&A pairs
- **Nemotron-CC Pipelines**: Advanced text transformation pipelines