---
layout: overview
slug: nemo-curator/nemo_curator/core/serve
title: nemo_curator.core.serve
---

## Module Contents

### Classes

| Name | Description |
| --- | --- |
| [`InferenceModelConfig`](#nemo_curator-core-serve-InferenceModelConfig) | Configuration for a single model to be served via Ray Serve. |
| [`InferenceServer`](#nemo_curator-core-serve-InferenceServer) | Serve one or more models via Ray Serve with an OpenAI-compatible endpoint. |

### Functions

| Name | Description |
| --- | --- |
| [`is_ray_serve_active`](#nemo_curator-core-serve-is_ray_serve_active) | Check whether any InferenceServer is currently running in this process. |

### Data

[`_active_servers`](#nemo_curator-core-serve-_active_servers)

### API

```python
class nemo_curator.core.serve.InferenceModelConfig(
    model_identifier: str,
    model_name: str | None = None,
    deployment_config: dict[str, typing.Any] = dict(),
    engine_kwargs: dict[str, typing.Any] = dict(),
    runtime_env: dict[str, typing.Any] = dict()
)
```

Dataclass

Configuration for a single model to be served via Ray Serve.

**Parameters:**

* `model_identifier`: HuggingFace model ID or local path (maps to `model_source` in LLMConfig).
* `model_name`: API-facing model name clients use in requests. Defaults to `model_identifier`.
* `deployment_config`: Ray Serve deployment configuration (autoscaling, replicas, etc.). Passed directly to `LLMConfig.deployment_config`.
* `engine_kwargs`: vLLM engine keyword arguments (`tensor_parallel_size`, etc.). Passed directly to `LLMConfig.engine_kwargs`.
* `runtime_env`: Ray runtime environment configuration (pip packages, `env_vars`, `working_dir`, etc.). Merged with quiet logging overrides when `verbose=False` on the InferenceServer.
```python
nemo_curator.core.serve.InferenceModelConfig._merge_runtime_envs(
    base: dict[str, typing.Any],
    override: dict[str, typing.Any] | None
) -> dict[str, typing.Any]
```

staticmethod

Merge two `runtime_env` dicts, with special handling for `env_vars`. Top-level keys from *override* win, except `env_vars`, which is merged key by key (override env vars take precedence over base).

```python
nemo_curator.core.serve.InferenceModelConfig.to_llm_config(
    quiet_runtime_env: dict[str, typing.Any] | None = None
) -> ray.serve.llm.LLMConfig
```

Convert to a Ray Serve LLMConfig.

**Parameters:**

* `quiet_runtime_env`: Optional runtime environment with quiet/logging overrides. Merged on top of `self.runtime_env` so that quiet env vars take precedence while preserving user-provided keys (e.g. `pip`, `working_dir`).

```python
class nemo_curator.core.serve.InferenceServer(
    models: list[nemo_curator.core.serve.InferenceModelConfig],
    name: str = 'default',
    port: int = DEFAULT_SERVE_PORT,
    health_check_timeout_s: int = DEFAULT_SERVE_HEALTH_TIMEOUT_S,
    verbose: bool = False
)
```

Dataclass

Serve one or more models via Ray Serve with an OpenAI-compatible endpoint.

Requires a running Ray cluster (e.g. via RayClient or the `RAY_ADDRESS` env var).

Example:

```python
from nemo_curator.core.serve import InferenceModelConfig, InferenceServer

config = InferenceModelConfig(
    model_identifier="google/gemma-3-27b-it",
    engine_kwargs={"tensor_parallel_size": 4},
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
        },
    },
)

with InferenceServer(models=[config]) as server:
    print(server.endpoint)  # http://localhost:8000/v1
    # Use with NeMo Curator's OpenAIClient or AsyncOpenAIClient
```

**Parameters:**

* `models`: List of InferenceModelConfig instances to deploy.
* `name`: Ray Serve application name (default `"default"`).
* `port`: HTTP port for the OpenAI-compatible endpoint.
* `health_check_timeout_s`: Seconds to wait for models to become healthy.
* `verbose`: If True, keep Ray Serve and vLLM logging at default levels.
If False (default), suppress per-request logs from both vLLM (`VLLM_LOGGING_LEVEL=WARNING`) and Ray Serve access logs (`RAY_SERVE_LOG_TO_STDERR=0`). Serve logs still go to files under the Ray session log directory.

**`endpoint`:** OpenAI-compatible base URL for the served models. When multiple models are deployed, clients select a model by passing `model="<model_name>"` in the request body (standard OpenAI API convention). The `/v1/models` endpoint lists all available models.

```python
nemo_curator.core.serve.InferenceServer.__enter__()
```

```python
nemo_curator.core.serve.InferenceServer.__exit__(
    exc = ()
)
```

```python
nemo_curator.core.serve.InferenceServer.__post_init__() -> None
```

```python
nemo_curator.core.serve.InferenceServer._cleanup_failed_deploy() -> None
```

Best-effort cleanup after a failed deploy (e.g. a health-check timeout). Shuts down Ray Serve so that GPU memory and other resources held by partially deployed replicas are released.

```python
nemo_curator.core.serve.InferenceServer._deploy() -> None
```

Deploy models onto the connected Ray cluster (internal). Must be called while a Ray connection is active.

```python
nemo_curator.core.serve.InferenceServer._quiet_runtime_env() -> dict[str, typing.Any]
```

staticmethod

Return a `runtime_env` dict that suppresses per-request logs.

Works around two upstream bugs in Ray Serve (as of Ray 2.44+):

1. **vLLM request logs** (`Added request chatcmpl-...`): `_start_async_llm_engine` creates `AsyncLLM()` without passing `log_requests`, so it defaults to `True`. Workaround: `VLLM_LOGGING_LEVEL=WARNING`. TODO: Once we upgrade past Ray 2.54 (see ray-project/ray#60824), pass `"enable_log_requests": False` in `engine_kwargs` instead and remove the `VLLM_LOGGING_LEVEL` env var workaround.
2. **Ray Serve access logs** (`POST /v1/... 200 Xms`): `configure_component_logger()` only adds the access-log filter to the *file* handler, not the stderr stream handler, so `LoggingConfig(enable_access_log=False)` has no effect on console output. Workaround: `RAY_SERVE_LOG_TO_STDERR=0` (logs still go to files under the Ray session log directory). TODO: Ray might fix this in the future.

```python
nemo_curator.core.serve.InferenceServer._reset_serve_client_cache() -> None
```

staticmethod

Reset Ray Serve's cached controller client.

Ray Serve caches the controller actor handle in a module-level `_global_client`. This handle becomes stale when the driver disconnects and reconnects (e.g. via `with ray.init()`). The built-in staleness check only catches `RayActorError`, not the "different cluster" exception that occurs across driver sessions. Resetting forces the next Serve API call to look up the controller by its well-known actor name, producing a fresh handle. TODO: Remove this method once [https://github.com/ray-project/ray/issues/61608](https://github.com/ray-project/ray/issues/61608) is fixed.

```python
nemo_curator.core.serve.InferenceServer._wait_for_healthy() -> None
```

Poll the `/v1/models` endpoint until all models are ready. Uses wall-clock time to enforce the timeout accurately, regardless of how long individual HTTP requests take.

```python
nemo_curator.core.serve.InferenceServer.start() -> None
```

Deploy all models and wait for them to become healthy.

The driver connects to the Ray cluster only for the duration of deployment. Once models are healthy the driver disconnects, so that the next `ray.init()` (e.g. from a pipeline executor) becomes the first driver-level init and its `runtime_env` takes effect on workers. Serve actors are detached and survive the disconnect.

**Raises:**

* `RuntimeError`: If another InferenceServer is already active in this process.
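The wall-clock timeout behavior described for `_wait_for_healthy` can be sketched generically. The deadline is computed once from `time.monotonic()`, so a slow individual request cannot stretch the overall timeout. This is an illustrative sketch, not the library's code; `check`, `wait_until_healthy`, and the toy parameters are hypothetical:

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool], timeout_s: float,
                       poll_interval_s: float = 0.01) -> None:
    """Poll `check` until it returns True, enforcing a wall-clock deadline."""
    deadline = time.monotonic() + timeout_s  # fixed once, up front
    while True:
        if check():  # e.g. GET /v1/models and compare against expected model names
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"models not healthy after {timeout_s}s")
        time.sleep(poll_interval_s)


# Toy check that becomes healthy on the third poll.
calls = {"n": 0}

def fake_check() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

wait_until_healthy(fake_check, timeout_s=1.0)
print(calls["n"])  # 3
```

Checking the deadline against `time.monotonic()` rather than counting iterations is what makes the timeout accurate when each HTTP request itself may block for seconds.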
Only one InferenceServer can run at a time because Ray Serve uses a single HTTP proxy per cluster, and all models are deployed as a single application sharing the same `/v1` routes. You can deploy multiple models in one InferenceServer (via the `models` list); clients select a model by passing `model="<model_name>"` in the API request body. Stop the existing server before starting a new one.

```python
nemo_curator.core.serve.InferenceServer.stop() -> None
```

Shut down Ray Serve (all applications, controller, and HTTP proxy).

Reconnects to the Ray cluster to tear down Serve actors and release GPU memory, then disconnects. If the cluster is already gone (e.g. `RayClient` was stopped first), the shutdown is skipped silently.

```python
nemo_curator.core.serve.is_ray_serve_active() -> bool
```

Check whether any InferenceServer is currently running in this process.

```python
nemo_curator.core.serve._active_servers: set[str] = set()
```
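The process-level single-server guard built on `_active_servers` can be sketched as a module-level registry. This is a simplified stand-in for illustration; the `Server` class below is hypothetical and only mirrors the start/stop bookkeeping described above, not the real deployment logic:

```python
# Module-level registry of running servers, keyed by application name.
_active_servers: set[str] = set()


def is_ray_serve_active() -> bool:
    """True if any server in this process has started and not yet stopped."""
    return bool(_active_servers)


class Server:
    """Minimal stand-in for InferenceServer's start/stop bookkeeping."""

    def __init__(self, name: str = "default") -> None:
        self.name = name

    def start(self) -> None:
        # Enforce the one-server-at-a-time rule documented for start().
        if _active_servers:
            raise RuntimeError("another InferenceServer is already active in this process")
        _active_servers.add(self.name)

    def stop(self) -> None:
        # discard() keeps stop() idempotent if it is called twice.
        _active_servers.discard(self.name)


s = Server()
s.start()
print(is_ray_serve_active())  # True
s.stop()
print(is_ray_serve_active())  # False
```

A second `start()` while another server is registered raises `RuntimeError`, matching the constraint that the cluster has a single HTTP proxy and one `/v1` application.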