---
description: >-
  Configure LLM clients for synthetic data generation with NVIDIA APIs or custom
  endpoints
categories:
  - how-to-guides
tags:
  - llm-client
  - openai
  - nvidia-api
  - configuration
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: how-to
modality: text-only
---

# LLM Client Configuration

NeMo Curator's synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.

## Overview

Two client types are available:

* **`AsyncOpenAIClient`**: Recommended for high-throughput batch processing with concurrent requests
* **`OpenAIClient`**: Synchronous client for simpler use cases or debugging

For most SDG workloads, use `AsyncOpenAIClient` to maximize throughput.

## Basic Configuration

### NVIDIA API Endpoints

```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Prefer os.environ["NVIDIA_API_KEY"] over a literal key
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)
```

### Environment Variables

Set your API key as an environment variable to avoid hardcoding credentials:

```bash
export NVIDIA_API_KEY="nvapi-..."
```

The underlying OpenAI client automatically uses the `OPENAI_API_KEY` environment variable if no `api_key` is provided. It does not read `NVIDIA_API_KEY`, so for NVIDIA APIs pass the key explicitly:

```python
import os

from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
```
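
Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier lookup, using only the standard library, is shown below. The helper name `require_env` is hypothetical and not part of NeMo Curator.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Usage (assumes the variable has been exported in the shell):
# api_key = require_env("NVIDIA_API_KEY")
```

The returned string can then be passed as `api_key` when constructing the client.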

## Generation Parameters

Configure LLM generation behavior using `GenerationConfig`:

```python
from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)
```

| Parameter     | Type     | Default | Description                                                       |
| ------------- | -------- | ------- | ----------------------------------------------------------------- |
| `max_tokens`  | int      | 2048    | Maximum tokens to generate per request                            |
| `temperature` | float    | 0.0     | Sampling temperature (0.0-2.0). Higher values increase randomness |
| `top_p`       | float    | 0.95    | Nucleus sampling parameter (0.0-1.0)                              |
| `top_k`       | int      | None    | Top-k sampling (if supported by the endpoint)                     |
| `seed`        | int      | 0       | Random seed for reproducibility                                   |
| `stop`        | str/list | None    | Stop sequences to end generation                                  |
| `stream`      | bool     | False   | Enable streaming (not recommended for batch processing)           |
| `n`           | int      | 1       | Number of completions to generate per request                     |
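
Out-of-range sampling values usually surface only as an error from the endpoint, after a request has been sent. As a rough sketch, the ranges in the table can be checked up front with a small helper. `check_sampling_params` is a hypothetical function, not part of NeMo Curator, and the requirement that `max_tokens` be positive is an assumption.

```python
def check_sampling_params(temperature: float, top_p: float, max_tokens: int) -> None:
    """Raise ValueError if a value falls outside the ranges listed in the table."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in [0.0, 2.0], got {temperature}")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p must be in [0.0, 1.0], got {top_p}")
    if max_tokens <= 0:  # Assumption: a non-positive token budget is never useful
        raise ValueError(f"max_tokens must be positive, got {max_tokens}")


# Values from the GenerationConfig example above pass silently.
check_sampling_params(temperature=0.7, top_p=0.95, max_tokens=2048)
```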

## Performance Tuning

### Concurrency vs. Parallelism

The `max_concurrent_requests` parameter controls how many API requests a client can have in flight at once. Because the limit applies per worker, the total across Ray's distributed workers can reach `max_concurrent_requests` times the number of workers:

* **Client-level concurrency**: `max_concurrent_requests` limits concurrent API calls per worker
* **Worker-level parallelism**: Ray distributes tasks across multiple workers

```python
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
```
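
To make the per-worker cap concrete, the sketch below limits fake requests with an `asyncio.Semaphore` and records the peak number in flight. This is one common way to implement such a cap. Whether `AsyncOpenAIClient` does exactly this internally is an assumption, and `run_with_limit` and `num_workers` are hypothetical names used only for illustration.

```python
import asyncio


async def run_with_limit(num_requests: int, max_concurrent_requests: int) -> int:
    """Run fake requests under a semaphore and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # Stand-in for network latency
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(num_requests)))
    return peak


limit = 3
peak = asyncio.run(run_with_limit(num_requests=20, max_concurrent_requests=limit))
print(f"peak in-flight requests on one worker: {peak}")

num_workers = 4  # Hypothetical number of Ray workers
print(f"upper bound across all workers: {num_workers * limit}")
```

When an endpoint enforces an account-wide rate limit, it is the product of the two numbers, not the per-client value alone, that needs to stay under that limit.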

### Retry Configuration

The client automatically retries transient errors with exponential backoff:

```python
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,        # Number of retry attempts
    base_delay=1.0,       # Base delay in seconds
    timeout=120,          # Request timeout in seconds
)
```

The retry logic handles:

* **Rate limit errors (429)**: Automatic backoff with jitter
* **Connection errors**: Retry with exponential delay
* **Transient failures**: Configurable retry attempts
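
For intuition about how `max_retries` and `base_delay` interact, the sketch below computes a backoff schedule. It assumes the delay doubles on each attempt and adds up to 10% random jitter. The exact formula the client uses is not stated here, so `backoff_delays` and its `jitter` parameter are hypothetical.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    base_delay: float,
    jitter: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = base_delay * (2 ** attempt)  # 1x, 2x, 4x, ... of base_delay
        delay += rng.uniform(0, jitter * delay)  # Jitter spreads out simultaneous retries
        delays.append(delay)
    return delays


# With jitter disabled, the schedule is purely exponential.
print(backoff_delays(max_retries=3, base_delay=1.0, jitter=0.0))
```

Under this assumed formula, doubling `base_delay` doubles every wait in the schedule, while raising `max_retries` appends longer waits at the end.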

## Using Other OpenAI-Compatible Endpoints

The `AsyncOpenAIClient` works with any OpenAI-compatible API endpoint. Set the `base_url` and `api_key` parameters for the target service:

```python
# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)
```

## Complete Example

```python
import os
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],  # Fails fast if the variable is unset
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
```

## Troubleshooting

### Rate Limit Errors

If you encounter frequent 429 errors:

1. Reduce `max_concurrent_requests`
2. Increase `base_delay` for longer backoff
3. Consider using a local deployment for high-volume workloads
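
A starting value for `max_concurrent_requests` can be estimated with Little's law: the average number of requests in flight equals the request rate multiplied by the average time each request takes. The sketch below applies that relation. The rate limit and latency figures are hypothetical, so substitute the values that apply to your account and model, and `suggested_concurrency` is not a NeMo Curator function.

```python
import math


def suggested_concurrency(requests_per_minute: float, avg_latency_seconds: float) -> int:
    """Estimate total in-flight requests that stay within a rate limit (Little's law)."""
    in_flight = requests_per_minute * avg_latency_seconds / 60.0
    return max(1, math.floor(in_flight))  # Always allow at least one request


# Hypothetical figures: a 40 requests/minute limit and 6 s average latency.
total = suggested_concurrency(requests_per_minute=40, avg_latency_seconds=6.0)
print(f"total in-flight budget: {total}")
```

The result is a budget for the whole job. When several workers share one API key, divide it by the number of workers to get a per-client setting.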

### Connection Timeouts

For slow networks or high-latency endpoints, increase `timeout`:

```python
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from default 120 seconds
)
```

***

## Next Steps

* [Multilingual Q\&A](/curate-text/synthetic/multilingual-qa): Generate multilingual Q\&A pairs
* [Nemotron-CC](/curate-text/synthetic/nemotron-cc): Advanced text transformation pipelines