NeMo Curator can serve LLMs locally using Ray Serve and vLLM, providing an OpenAI-compatible endpoint without external inference infrastructure. This is useful for synthetic data generation workflows where you co-locate model serving with your curation pipeline on the same GPU cluster.
Install the inference server dependencies:
This installs Ray Serve, vLLM, and supporting libraries. You need an NVIDIA GPU with sufficient VRAM for the model you intend to serve.
The InferenceServer deploys models onto the Ray cluster and exposes an OpenAI-compatible API at http://localhost:<port>/v1. When used as a context manager, it automatically starts and stops the server.
Each model you want to serve is described by an InferenceModelConfig:
Use deployment_config to control replica count and autoscaling:
You can use InferenceServer as a context manager or call start() and stop() manually:
Deploy multiple models in a single server. Clients select a model by name in the API request:
The /v1/models endpoint lists all available models.
Point NeMo Curator’s AsyncOpenAIClient at the inference server endpoint:
When an InferenceServer is active, Pipeline.run() automatically detects potential GPU contention:
RuntimeError if the pipeline has GPU stages. Xenna manages GPU assignment independently and would conflict with served models.If your pipeline has only CPU stages, either executor works.
By default (verbose=False), InferenceServer suppresses per-request logs from vLLM and Ray Serve access logs to reduce noise. Ray Serve logs still go to files under the Ray session log directory. Set verbose=True to restore full logging output for debugging.