NeMo Curator’s synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.
Two client types are available:
AsyncOpenAIClient: Recommended for high-throughput batch processing with concurrent requestsOpenAIClient: Synchronous client for simpler use cases or debuggingFor most SDG workloads, use AsyncOpenAIClient to maximize throughput.
Set your API key as an environment variable to avoid hardcoding credentials:
The underlying OpenAI client automatically uses the OPENAI_API_KEY environment variable if no api_key is provided. For NVIDIA APIs, explicitly pass the key:
Configure LLM generation behavior using GenerationConfig:
The max_concurrent_requests parameter controls how many API requests the client can have in-flight simultaneously. This interacts with Ray’s distributed workers:
max_concurrent_requests limits concurrent API calls per workerThe client includes automatic retry with exponential backoff for transient errors:
The retry logic handles:
The AsyncOpenAIClient works with any OpenAI-compatible API endpoint. Simply configure the base_url and api_key parameters:
To serve models locally and connect them to AsyncOpenAIClient, use NeMo Curator’s built-in Inference Server (Ray Serve + vLLM):
If you encounter frequent 429 errors:
max_concurrent_requestsbase_delay for longer backoffFor slow networks or high-latency endpoints: