nemo_curator.core.serve
API
Configuration for a single model to be served via Ray Serve.
Parameters:
HuggingFace model ID or local path (maps to model_source in LLMConfig).
API-facing model name clients use in requests. Defaults to model_identifier.
Ray Serve deployment configuration (autoscaling, replicas, etc.). Passed directly to LLMConfig.deployment_config.
vLLM engine keyword arguments (tensor_parallel_size, etc.). Passed directly to LLMConfig.engine_kwargs.
Ray runtime environment configuration (pip packages, env_vars, working_dir, etc.).
Merged with quiet logging overrides when verbose=False on the InferenceServer.
Merge two runtime_env dicts, with special handling for env_vars.
Top-level keys from override win, except env_vars, which is merged key-by-key (override env vars take precedence over base).
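The merge rule above can be sketched as follows. The function name merge_runtime_envs is illustrative, not the module's actual API:

```python
def merge_runtime_envs(base: dict, override: dict) -> dict:
    """Merge two runtime_env dicts; env_vars is merged key-by-key."""
    # Top-level keys from override win outright...
    merged = {**base, **override}
    # ...except env_vars, where individual keys are merged and
    # override's env vars take precedence over base's.
    env_vars = {**base.get("env_vars", {}), **override.get("env_vars", {})}
    if env_vars:
        merged["env_vars"] = env_vars
    return merged
```

This preserves user-provided keys such as pip or working_dir while letting the server layer in its own environment variables.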
Convert to a Ray Serve LLMConfig.
Parameters:
Optional runtime environment with quiet/logging
overrides. Merged on top of self.runtime_env so that
quiet env vars take precedence while preserving user-provided
keys (e.g. pip, working_dir).
Serve one or more models via Ray Serve with an OpenAI-compatible endpoint.
Requires a running Ray cluster (e.g. via RayClient or RAY_ADDRESS env var).
Example::
    from nemo_curator.core.serve import InferenceModelConfig, InferenceServer

    config = InferenceModelConfig(
        model_identifier="google/gemma-3-27b-it",
        engine_kwargs={"tensor_parallel_size": 4},
        deployment_config={
            "autoscaling_config": {
                "min_replicas": 1,
                "max_replicas": 1,
            },
        },
    )

    with InferenceServer(models=[config]) as server:
        print(server.endpoint)  # http://localhost:8000/v1
Use the endpoint with NeMo Curator’s OpenAIClient or AsyncOpenAIClient to make requests.
Parameters:
List of InferenceModelConfig instances to deploy.
Ray Serve application name (default "default").
HTTP port for the OpenAI-compatible endpoint.
Seconds to wait for models to become healthy.
If True, keep Ray Serve and vLLM logging at default levels.
If False (default), suppress per-request logs from both vLLM
(VLLM_LOGGING_LEVEL=WARNING) and Ray Serve access logs
(RAY_SERVE_LOG_TO_STDERR=0). Serve logs still go to
files under the Ray session log directory.
OpenAI-compatible base URL for the served models.
When multiple models are deployed, clients select a model by passing
model="<model_name>" in the request body (standard OpenAI API
convention). The /v1/models endpoint lists all available models.
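The model-selection convention above can be sketched as a plain request-body builder (the helper name chat_request_body is illustrative; any OpenAI-compatible client works the same way):

```python
import json


def chat_request_body(model_name: str, prompt: str) -> str:
    # Standard OpenAI API convention: the "model" field in the request
    # body selects which deployed model handles the request. The set of
    # valid names is whatever GET /v1/models reports.
    return json.dumps({
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
    })
```

A client POSTs this body to `<endpoint>/chat/completions`; with multiple models deployed, only the "model" field changes between requests.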
Best-effort cleanup after a failed deploy (e.g. health check timeout).
Shuts down Ray Serve so that GPU memory and other resources held by partially-deployed replicas are released.
Deploy models onto the connected Ray cluster (internal).
Must be called while a Ray connection is active.
Return a runtime_env dict that suppresses per-request logs.
Works around two upstream bugs in Ray Serve (as of Ray 2.44+):
- vLLM request logs (`Added request chatcmpl-...`): `_start_async_llm_engine` creates `AsyncLLM()` without passing `log_requests`, so it defaults to `True`. Workaround: `VLLM_LOGGING_LEVEL=WARNING`. TODO: Once we upgrade past Ray 2.54 (see ray-project/ray#60824), pass `"enable_log_requests": False` in `engine_kwargs` instead and remove the `VLLM_LOGGING_LEVEL` env var workaround.
- Ray Serve access logs (`POST /v1/... 200 Xms`): `configure_component_logger()` only adds the access-log filter to the file handler, not the stderr stream handler, so `LoggingConfig(enable_access_log=False)` has no effect on console output. Workaround: `RAY_SERVE_LOG_TO_STDERR=0` (logs still go to files under the Ray session log directory). TODO: Ray might fix this in the future.
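Both workarounds are plain environment variables, so the returned runtime_env reduces to an env_vars dict. A minimal sketch (the function name quiet_runtime_env is illustrative):

```python
def quiet_runtime_env() -> dict:
    # Values taken from the workarounds described above:
    # - VLLM_LOGGING_LEVEL=WARNING suppresses per-request vLLM logs.
    # - RAY_SERVE_LOG_TO_STDERR=0 keeps Serve access logs off stderr;
    #   they still go to files under the Ray session log directory.
    return {
        "env_vars": {
            "VLLM_LOGGING_LEVEL": "WARNING",
            "RAY_SERVE_LOG_TO_STDERR": "0",
        }
    }
```

This dict is then merged on top of the user's runtime_env, so user-provided keys such as pip or working_dir survive while the quiet env vars take precedence.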
Reset Ray Serve’s cached controller client.
Ray Serve caches the controller actor handle in a module-level
_global_client. This handle becomes stale when the driver
disconnects and reconnects (e.g. via with ray.init()). The
built-in staleness check only catches RayActorError, not the
“different cluster” exception that occurs across driver sessions.
Resetting forces the next Serve API call to look up the controller by its well-known actor name, producing a fresh handle.
TODO: Remove this method once https://github.com/ray-project/ray/issues/61608 is fixed.
Poll the /v1/models endpoint until all models are ready.
Uses wall-clock time to enforce the timeout accurately, regardless of how long individual HTTP requests take.
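The wall-clock polling described above can be sketched as follows. This is an illustrative stand-in, assuming an OpenAI-style `/v1/models` listing; the name wait_for_models is not the module's actual API:

```python
import json
import time
import urllib.request


def wait_for_models(base_url: str, expected: set, timeout_s: float, poll_s: float = 2.0) -> bool:
    """Poll {base_url}/models until every expected model id is listed."""
    # A fixed wall-clock deadline enforces the timeout accurately even
    # when individual HTTP requests are slow or hang up to their own timeout.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                data = json.load(resp)
            ready = {m["id"] for m in data.get("data", [])}
            if expected <= ready:
                return True
        except OSError:
            pass  # endpoint not up yet; keep polling
        time.sleep(poll_s)
    return False
```

Comparing against a deadline rather than counting iterations means a single slow request cannot stretch the total wait far past the configured timeout.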
Deploy all models and wait for them to become healthy.
The driver connects to the Ray cluster only for the duration of
deployment. Once models are healthy the driver disconnects, so that
the next ray.init() (e.g. from a pipeline executor) becomes the
first driver-level init and its runtime_env takes effect on
workers. Serve actors are detached and survive the disconnect.
Raises:
RuntimeError: If another InferenceServer is already active in this process. Only one InferenceServer can run at a time because Ray Serve uses a single HTTP proxy per cluster, and all models are deployed as a single application sharing the same /v1 routes. You can deploy multiple models in one InferenceServer (via the models list); clients select a model by passing model="<model_name>" in the API request body. Stop the existing server before starting a new one.
Shut down Ray Serve (all applications, controller, and HTTP proxy).
Reconnects to the Ray cluster to tear down Serve actors and release
GPU memory, then disconnects. If the cluster is already gone (e.g.
RayClient was stopped first), the shutdown is skipped silently.
Check whether any InferenceServer is currently running in this process.