
LLM Client Configuration


NeMo Curator’s synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.

Overview

Two client types are available:

  • AsyncOpenAIClient: Recommended for high-throughput batch processing with concurrent requests
  • OpenAIClient: Synchronous client for simpler use cases or debugging

For most SDG workloads, use AsyncOpenAIClient to maximize throughput.
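The throughput gain from the async client comes from overlapping request latency. A minimal, self-contained sketch of the effect, using simulated requests rather than the real client (the timings are illustrative, not NeMo Curator behavior):

```python
import asyncio
import time

async def fake_request(i: int, latency: float = 0.05) -> str:
    """Stand-in for one LLM API call with fixed network latency."""
    await asyncio.sleep(latency)
    return f"response-{i}"

async def sequential(n: int) -> list[str]:
    # One request at a time: total time is roughly n * latency
    return [await fake_request(i) for i in range(n)]

async def concurrent(n: int) -> list[str]:
    # All requests in flight at once: total time is roughly 1 * latency
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

start = time.perf_counter()
asyncio.run(sequential(10))
seq_time = time.perf_counter() - start

start = time.perf_counter()
results = asyncio.run(concurrent(10))
conc_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

For batch SDG, where hundreds of independent prompts are issued, this overlap is why the async client dominates.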

Basic Configuration

NVIDIA API Endpoints

from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Or use NVIDIA_API_KEY env var
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)

Environment Variables

Set your API key as an environment variable to avoid hardcoding credentials:

export NVIDIA_API_KEY="nvapi-..."

The underlying OpenAI client automatically uses the OPENAI_API_KEY environment variable if no api_key is provided. For NVIDIA APIs, explicitly pass the key:

import os

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)

Generation Parameters

Configure LLM generation behavior using GenerationConfig:

from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens | int | 2048 | Maximum tokens to generate per request |
| temperature | float | 0.0 | Sampling temperature (0.0-2.0). Higher values increase randomness |
| top_p | float | 0.95 | Nucleus sampling parameter (0.0-1.0) |
| top_k | int | None | Top-k sampling (if supported by the endpoint) |
| seed | int | 0 | Random seed for reproducibility |
| stop | str/list | None | Stop sequences to end generation |
| stream | bool | False | Enable streaming (not recommended for batch processing) |
| n | int | 1 | Number of completions to generate per request |
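To build intuition for what top_p controls, here is a toy re-implementation of the filtering step of nucleus sampling. This is a sketch of the idea only; the actual filtering happens server-side in the inference engine:

```python
def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize the survivors."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

dist = {"the": 0.5, "a": 0.3, "an": 0.15, "xyzzy": 0.05}
print(nucleus_filter(dist, top_p=0.9))  # the low-probability tail token is dropped
```

Lowering top_p trims more of the low-probability tail, making output more conservative; temperature reshapes the distribution before this cut is applied.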

Performance Tuning

Concurrency vs. Parallelism

The max_concurrent_requests parameter controls how many API requests the client can have in-flight simultaneously. This interacts with Ray’s distributed workers:

  • Client-level concurrency: max_concurrent_requests limits concurrent API calls per worker
  • Worker-level parallelism: Ray distributes tasks across multiple workers

# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
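The total load the endpoint sees is roughly num_workers × max_concurrent_requests, so size the per-client limit with the worker count in mind. The cap itself behaves like a semaphore around each request, which the following self-contained sketch illustrates with simulated requests (not the real client):

```python
import asyncio

async def limited_request(sem: asyncio.Semaphore, tracker: dict) -> None:
    """One simulated API call, gated by the shared semaphore."""
    async with sem:
        tracker["in_flight"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["in_flight"])
        await asyncio.sleep(0.01)  # simulated API latency
        tracker["in_flight"] -= 1

async def run_worker(max_concurrent: int, n_tasks: int) -> int:
    """One worker's view: the semaphore caps in-flight requests the way
    max_concurrent_requests does inside the client."""
    sem = asyncio.Semaphore(max_concurrent)
    tracker = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(limited_request(sem, tracker) for _ in range(n_tasks)))
    return tracker["peak"]

peak = asyncio.run(run_worker(max_concurrent=3, n_tasks=20))
print(f"peak in-flight requests: {peak}")
```

With, say, 4 Ray workers each configured at max_concurrent_requests=3, the endpoint can see up to 12 requests in flight at once.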

Retry Configuration

The client includes automatic retry with exponential backoff for transient errors:

client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,   # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,     # Request timeout in seconds
)

The retry logic handles:

  • Rate limit errors (429): Automatic backoff with jitter
  • Connection errors: Retry with exponential delay
  • Transient failures: Configurable retry attempts
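The backoff pattern described above can be sketched as follows. This is illustrative only; the client's actual retry internals may differ:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 60.0, jitter: float = 0.1) -> float:
    """Delay before retry `attempt` (0-indexed): base_delay * 2^attempt,
    capped at max_delay, plus up to `jitter` fraction of random noise
    so that many clients don't retry in lockstep."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay * (1.0 + random.uniform(0.0, jitter))

for attempt in range(4):
    print(f"attempt {attempt}: wait ~{backoff_delay(attempt):.1f}s")
```

Raising base_delay stretches the whole schedule, which is the main lever when an endpoint's rate limits are strict.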

Using Other OpenAI-Compatible Endpoints

The AsyncOpenAIClient works with any OpenAI-compatible API endpoint. Simply configure the base_url and api_key parameters:

# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)

Complete Example

import os
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ.get("NVIDIA_API_KEY"),
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)

Troubleshooting

Rate Limit Errors

If you encounter frequent 429 errors:

  1. Reduce max_concurrent_requests
  2. Increase base_delay for longer backoff
  3. Consider using a local deployment for high-volume workloads
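If reducing concurrency is not enough, a client-side token bucket can pace requests before they ever reach the API. This is a hypothetical sketch, not part of NeMo Curator:

```python
class TokenBucket:
    """Client-side rate limiter: allow at most `rate` requests per
    `period` seconds, smoothing bursts before they hit the endpoint."""

    def __init__(self, rate: int, period: float):
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_per_sec = rate / period
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, then spend one token if available.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=5, period=1.0)  # 5 requests per second
allowed = [bucket.allow(now=0.0) for _ in range(8)]
print(allowed)  # first 5 pass, the rest are throttled until tokens refill
```

Gating each request on `allow()` (sleeping briefly when it returns False) keeps a burst of tasks from tripping the server's 429 responses in the first place.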

Connection Timeouts

For slow networks or high-latency endpoints:

client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from the default 120 seconds
)
