
Inference Server


NeMo Curator can serve LLMs locally using Ray Serve and vLLM, providing an OpenAI-compatible endpoint without external inference infrastructure. This is useful for synthetic data generation workflows where you co-locate model serving with your curation pipeline on the same GPU cluster.

Prerequisites

Install the inference server dependencies:

```bash
uv pip install nemo-curator[inference_server]
```

This installs Ray Serve, vLLM, and supporting libraries. You need an NVIDIA GPU with sufficient VRAM for the model you intend to serve.
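If you are unsure what the node provides, a quick check of GPU names and memory before picking a model can save a failed deployment. A minimal sketch using PyTorch, which is installed as a vLLM dependency:

```python
import torch

# Rough VRAM check before choosing a model and tensor_parallel_size.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```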

Quick Start

```python
from openai import OpenAI
from nemo_curator.core.client import RayClient
from nemo_curator.core.serve import InferenceModelConfig, InferenceServer

# 1. Start Ray cluster
client = RayClient()
client.start()

# 2. Configure and serve a model
config = InferenceModelConfig(
    model_identifier="google/gemma-3-27b-it",
    engine_kwargs={"tensor_parallel_size": 4},
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
        },
    },
)

with InferenceServer(models=[config]) as server:
    # 3. Query via OpenAI SDK
    oai = OpenAI(base_url=server.endpoint, api_key="unused")
    response = oai.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
```

The InferenceServer deploys models onto the Ray cluster and exposes an OpenAI-compatible API at http://localhost:<port>/v1. When used as a context manager, it automatically starts and stops the server.

InferenceModelConfig

Each model you want to serve is described by an InferenceModelConfig:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_identifier` | `str` | Required | Hugging Face model ID or local path |
| `model_name` | `str` | `None` | API-facing model name clients use in requests. Defaults to `model_identifier` |
| `deployment_config` | `dict` | `{}` | Ray Serve deployment configuration (autoscaling, replicas) |
| `engine_kwargs` | `dict` | `{}` | vLLM engine keyword arguments (`tensor_parallel_size`, etc.) |
| `runtime_env` | `dict` | `{}` | Ray runtime environment (pip packages, env vars, working directory) |
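For reference, a config that sets every field might look like the sketch below. The `model_name` value, extra pip package, and environment variable are illustrative placeholders, not requirements; the `runtime_env` keys follow Ray's runtime environment schema.

```python
config = InferenceModelConfig(
    model_identifier="google/gemma-3-27b-it",   # Hugging Face ID or local path
    model_name="gemma-27b",                     # name clients pass in API requests
    engine_kwargs={"tensor_parallel_size": 4},  # forwarded to the vLLM engine
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
    runtime_env={
        # Illustrative only: extra pip packages and env vars for replica processes
        "pip": ["sentencepiece"],
        "env_vars": {"HF_TOKEN": "<your-token>"},
    },
)
```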

Common Engine Arguments

```python
config = InferenceModelConfig(
    model_identifier="meta-llama/Llama-3-8B-Instruct",
    engine_kwargs={
        "tensor_parallel_size": 2,  # Split model across 2 GPUs
    },
)
```
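Other vLLM engine arguments pass through `engine_kwargs` the same way. The keys below are standard vLLM engine arguments (check your installed vLLM version for the exact set and defaults); the values are illustrative:

```python
config = InferenceModelConfig(
    model_identifier="meta-llama/Llama-3-8B-Instruct",
    engine_kwargs={
        "tensor_parallel_size": 2,       # split model weights across 2 GPUs
        "max_model_len": 8192,           # cap context length to save KV-cache memory
        "gpu_memory_utilization": 0.90,  # fraction of each GPU reserved for the engine
        "dtype": "bfloat16",             # weight/activation precision
    },
)
```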

Autoscaling

Use deployment_config to control replica count and autoscaling:

```python
config = InferenceModelConfig(
    model_identifier="meta-llama/Llama-3-8B-Instruct",
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 4,
        },
    },
)
```
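Other fields accepted by Ray Serve's autoscaling API can be included alongside the replica bounds. The sketch below adds Ray Serve's `target_ongoing_requests` setting as an example; it is a Ray Serve option, not something specific to NeMo Curator:

```python
config = InferenceModelConfig(
    model_identifier="meta-llama/Llama-3-8B-Instruct",
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 4,
            # Ray Serve scales replicas to hold roughly this many in-flight
            # requests per replica (older Ray releases call this
            # "target_num_ongoing_requests_per_replica").
            "target_ongoing_requests": 8,
        },
    },
)
```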

InferenceServer

| Parameter | Type | Default | Description |
|---|---|---|---|
| `models` | `list[InferenceModelConfig]` | Required | Models to deploy |
| `name` | `str` | `"default"` | Ray Serve application name |
| `port` | `int` | `8000` | HTTP port for the OpenAI-compatible endpoint |
| `health_check_timeout_s` | `int` | `300` | Seconds to wait for models to become healthy |
| `verbose` | `bool` | `False` | If `True`, keep Ray Serve and vLLM logging at default levels |
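Putting the constructor parameters together, a server with non-default settings might be created like this (the application name and port are arbitrary examples):

```python
server = InferenceServer(
    models=[config],
    name="sdg-llm",              # Ray Serve application name
    port=9000,                   # endpoint served at http://localhost:9000/v1
    health_check_timeout_s=600,  # allow longer model download/startup
    verbose=True,                # keep full Ray Serve / vLLM logging
)
```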

Start and Stop

You can use InferenceServer as a context manager or call start() and stop() manually:

```python
# Context manager (recommended)
with InferenceServer(models=[config]) as server:
    # server.endpoint is available here
    pass  # Server stops automatically

# Manual lifecycle
server = InferenceServer(models=[config])
server.start()
# ... use server.endpoint ...
server.stop()
```
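With the manual lifecycle, it is worth guarding against exceptions so the served models and their GPUs are always released. A minimal sketch, where `run_generation` is a placeholder for your own workload:

```python
server = InferenceServer(models=[config])
server.start()
try:
    run_generation(server.endpoint)  # placeholder: your generation workload
finally:
    server.stop()  # always tear down the deployment, even on failure
```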

Multi-Model Serving

Deploy multiple models in a single server. Clients select a model by name in the API request:

```python
models = [
    InferenceModelConfig(
        model_identifier="meta-llama/Llama-3-8B-Instruct",
        model_name="llama-8b",
        engine_kwargs={"tensor_parallel_size": 1},
    ),
    InferenceModelConfig(
        model_identifier="google/gemma-3-27b-it",
        model_name="gemma-27b",
        engine_kwargs={"tensor_parallel_size": 4},
    ),
]

with InferenceServer(models=models) as server:
    oai = OpenAI(base_url=server.endpoint, api_key="unused")
    # Select model by name
    response = oai.chat.completions.create(
        model="llama-8b",
        messages=[{"role": "user", "content": "Hello!"}],
    )
```

The /v1/models endpoint lists all available models.
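Because the endpoint is OpenAI-compatible, the same SDK can enumerate the served models. A small sketch continuing the example above:

```python
with InferenceServer(models=models) as server:
    oai = OpenAI(base_url=server.endpoint, api_key="unused")
    # Issues GET /v1/models against the server
    for model in oai.models.list():
        print(model.id)  # e.g. "llama-8b", "gemma-27b"
```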

Use with NeMo Curator Pipelines

With AsyncOpenAIClient

Point NeMo Curator’s AsyncOpenAIClient at the inference server endpoint:

```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.core.serve import InferenceModelConfig, InferenceServer

config = InferenceModelConfig(
    model_identifier="meta-llama/Llama-3-8B-Instruct",
    engine_kwargs={"tensor_parallel_size": 2},
)

with InferenceServer(models=[config]) as server:
    client = AsyncOpenAIClient(
        base_url=server.endpoint,
        api_key="unused",
        max_concurrent_requests=10,
    )
    # Use client in SDG pipeline stages
```

GPU Contention

When an InferenceServer is active, Pipeline.run() automatically detects potential GPU contention:

  • RayDataExecutor: Allowed. Ray’s resource scheduler coordinates GPU allocation between served models and pipeline stages.
  • XennaExecutor: Raises RuntimeError if the pipeline has GPU stages. Xenna manages GPU assignment independently and would conflict with served models.

If your pipeline has only CPU stages, either executor works.

Logging

By default (verbose=False), InferenceServer suppresses per-request logs from vLLM and Ray Serve access logs to reduce noise. Ray Serve logs still go to files under the Ray session log directory. Set verbose=True to restore full logging output for debugging.


Next Steps