nemo_deploy.deploy_ray#

Module Contents#

Classes#

DeployRay

A class for managing Ray deployment and serving of models.

Functions#

get_available_cpus

Get the total number of available CPUs in the system.

Data#

API#

nemo_deploy.deploy_ray.LOGGER = getLogger(...)#
nemo_deploy.deploy_ray.get_available_cpus()#

Get the total number of available CPUs in the system.
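
A minimal usage sketch, assuming the nemo_deploy package is importable; the returned value can help size num_cpus_per_replica before deployment:

.. code-block:: python

    from nemo_deploy.deploy_ray import get_available_cpus

    # Total CPU count on the host that will run the Ray cluster.
    total_cpus = get_available_cpus()
    print(f"Available CPUs: {total_cpus}")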

class nemo_deploy.deploy_ray.DeployRay(
address: str = 'auto',
num_cpus: Optional[int] = None,
num_gpus: int = 1,
include_dashboard: bool = False,
ignore_reinit_error: bool = True,
runtime_env: dict = None,
host: str = '0.0.0.0',
port: Optional[int] = None,
)#

A class for managing Ray deployment and serving of models.

This class provides functionality to initialize Ray, start Ray Serve, deploy models, and manage the lifecycle of the Ray cluster. It supports NeMo inframework models, Hugging Face models, and TensorRT-LLM models.

.. attribute:: address

The address of the Ray cluster to connect to.

Type:

str

.. attribute:: num_cpus

Number of CPUs to allocate for the Ray cluster.

Type:

int

.. attribute:: num_gpus

Number of GPUs to allocate for the Ray cluster.

Type:

int

.. attribute:: include_dashboard

Whether to include the Ray dashboard.

Type:

bool

.. attribute:: ignore_reinit_error

Whether to ignore errors when reinitializing Ray.

Type:

bool

.. attribute:: runtime_env

Runtime environment configuration for Ray.

Type:

dict

.. attribute:: host

Host address to bind the server to.

Type:

str

.. attribute:: port

Port number for the server.

Type:

int

.. method:: deploy_inframework_model

Deploy a NeMo inframework model using Ray Serve.

.. method:: deploy_huggingface_model

Deploy a Hugging Face model using Ray Serve.

.. method:: deploy_tensorrt_llm_model

Deploy a TensorRT-LLM model using Ray Serve.

Initialization

Initialize the DeployRay instance and set up the Ray cluster.

Parameters:
  • address (str, optional) – Address of the Ray cluster. Defaults to "auto".

  • num_cpus (int, optional) – Number of CPUs to allocate. If None, uses all available. Defaults to None.

  • num_gpus (int, optional) – Number of GPUs to allocate. Defaults to 1.

  • include_dashboard (bool, optional) – Whether to include the dashboard. Defaults to False.

  • ignore_reinit_error (bool, optional) – Whether to ignore reinit errors. Defaults to True.

  • runtime_env (dict, optional) – Runtime environment configuration. Defaults to None.

  • host (str, optional) – Host address to bind the server to. Defaults to "0.0.0.0".

  • port (int, optional) – Port number for the server. If None, an available port will be found. Defaults to None.

Raises:

Exception – If Ray is not installed.
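
A minimal construction sketch, assuming Ray is installed; the port value is only illustrative (pass None to have an available port chosen automatically):

.. code-block:: python

    from nemo_deploy.deploy_ray import DeployRay

    # Connect to (or start) a local Ray cluster with a single GPU.
    ray_deployer = DeployRay(
        address="auto",
        num_gpus=1,
        include_dashboard=False,
        host="0.0.0.0",
        port=8000,  # illustrative; if None, an available port is found automatically
    )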

_signal_handler(signum, frame)#

Handle signal interrupts and gracefully shut down the deployer.

_start()#

Start Ray Serve with the configured host and port.

Uses the host and port specified during DeployRay initialization. If port is None, an available port will be found automatically.

_stop()#

Stop the Ray Serve deployment and shut down the Ray cluster.

This method attempts to gracefully shut down both Ray Serve and the Ray cluster. If any errors occur during shutdown, they are logged as warnings.

deploy_inframework_model(
nemo_checkpoint: str,
num_gpus: int = 1,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
context_parallel_size: int = 1,
model_id: str = 'nemo-model',
num_cpus_per_replica: float = 8,
num_replicas: int = 1,
enable_cuda_graphs: bool = False,
enable_flash_decode: bool = False,
legacy_ckpt: bool = False,
max_batch_size: int = 32,
random_seed: Optional[int] = None,
test_mode: bool = False,
megatron_checkpoint_filepath: str = None,
model_type: str = 'gpt',
model_format: str = 'nemo',
micro_batch_size: Optional[int] = None,
**model_config_kwargs,
)#

Deploy an inframework NeMo/Megatron model using Ray Serve.

This method handles the complete deployment lifecycle including:

  • Starting Ray Serve

  • Creating and deploying the MegatronRayDeployable

  • Setting up signal handlers for graceful shutdown

  • Keeping the deployment running until interrupted

Parameters:
  • nemo_checkpoint (str) – Path to the .nemo checkpoint file.

  • num_gpus (int, optional) – Number of GPUs per node. Defaults to 1.

  • tensor_model_parallel_size (int, optional) – Tensor model parallel size. Defaults to 1.

  • pipeline_model_parallel_size (int, optional) – Pipeline model parallel size. Defaults to 1.

  • expert_model_parallel_size (int, optional) – Expert model parallel size. Defaults to 1.

  • context_parallel_size (int, optional) – Context parallel size. Defaults to 1.

  • model_id (str, optional) – Model identifier for API responses. Defaults to "nemo-model".

  • num_cpus_per_replica (float, optional) – CPUs per model replica. Defaults to 8.

  • num_replicas (int, optional) – Number of replicas for deployment. Defaults to 1.

  • enable_cuda_graphs (bool, optional) – Enable CUDA graphs. Defaults to False.

  • enable_flash_decode (bool, optional) – Enable Flash Attention decode. Defaults to False.

  • legacy_ckpt (bool, optional) – Use legacy checkpoint format. Defaults to False.

  • max_batch_size (int, optional) – Maximum batch size for inference. Defaults to 32.

  • random_seed (Optional[int], optional) – Random seed for generation. Defaults to None.

  • test_mode (bool, optional) – Enable test mode. Defaults to False.

  • megatron_checkpoint_filepath (str, optional) – Path to the Megatron checkpoint file. Defaults to None.

  • model_type (str, optional) – Type of model to load. Defaults to "gpt".

  • model_format (str, optional) – Format of model to load. Defaults to "nemo".

  • micro_batch_size (Optional[int], optional) – Micro batch size for model execution. Defaults to None.

Raises:
  • SystemExit – If parallelism configuration is invalid.

  • Exception – If deployment fails.
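
A usage sketch; the checkpoint path below is a placeholder, and the parallelism settings must match the checkpoint and the available hardware:

.. code-block:: python

    from nemo_deploy.deploy_ray import DeployRay

    ray_deployer = DeployRay(num_gpus=2)

    # Blocks and serves the model until interrupted; Ctrl+C triggers the
    # signal handler, which shuts down Ray Serve and the Ray cluster.
    ray_deployer.deploy_inframework_model(
        nemo_checkpoint="/path/to/model.nemo",  # placeholder path
        num_gpus=2,
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
        model_id="nemo-model",
        num_replicas=1,
        max_batch_size=32,
    )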

deploy_huggingface_model(
hf_model_id_path: str,
task: str = 'text-generation',
trust_remote_code: bool = True,
device_map: Optional[str] = None,
max_memory: Optional[str] = None,
model_id: str = 'hf-model',
num_replicas: int = 1,
num_cpus_per_replica: float = 8,
num_gpus_per_replica: int = 1,
max_ongoing_requests: int = 10,
use_vllm_backend: bool = False,
test_mode: bool = False,
)#

Deploy a Hugging Face model using Ray Serve.

This method handles the complete deployment lifecycle including:

  • Starting Ray Serve

  • Creating and deploying the HFRayDeployable

  • Setting up signal handlers for graceful shutdown

  • Keeping the deployment running until interrupted

Parameters:
  • hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub.

  • task (str, optional) – HuggingFace task type. Defaults to "text-generation".

  • trust_remote_code (bool, optional) – Whether to trust remote code when loading the model. Defaults to True.

  • device_map (str, optional) – Device mapping strategy for model placement (e.g., "auto"). Defaults to None.

  • max_memory (str, optional) – Maximum memory allocation when using balanced device map. Defaults to None.

  • model_id (str, optional) – Model identifier for API responses. Defaults to "hf-model".

  • num_replicas (int, optional) – Number of replicas for deployment. Defaults to 1.

  • num_cpus_per_replica (float, optional) – CPUs per model replica. Defaults to 8.

  • num_gpus_per_replica (int, optional) – GPUs per model replica. Defaults to 1.

  • max_ongoing_requests (int, optional) – Maximum number of ongoing requests per replica. Defaults to 10.

  • use_vllm_backend (bool, optional) – Whether to use the vLLM backend for deployment. If True, exports the HF checkpoint to vLLM format and uses the vLLM backend for inference. Defaults to False.

  • test_mode (bool, optional) – Enable test mode. Defaults to False.

Raises:

Exception – If Ray is not installed or deployment fails.
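
A usage sketch; the Hugging Face model identifier below is only an example:

.. code-block:: python

    from nemo_deploy.deploy_ray import DeployRay

    ray_deployer = DeployRay(num_gpus=1)

    # Blocks and serves the Hugging Face model until interrupted.
    ray_deployer.deploy_huggingface_model(
        hf_model_id_path="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
        task="text-generation",
        model_id="hf-model",
        num_gpus_per_replica=1,
        num_replicas=1,
        max_ongoing_requests=10,
    )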

deploy_tensorrt_llm_model(
trt_llm_path: str,
model_id: str = 'tensorrt-llm-model',
use_python_runtime: bool = True,
multi_block_mode: bool = False,
lora_ckpt_list: Optional[list] = None,
enable_chunked_context: bool = False,
max_tokens_in_paged_kv_cache: Optional[int] = None,
num_replicas: int = 1,
num_cpus_per_replica: float = 8,
num_gpus_per_replica: int = 1,
max_ongoing_requests: int = 10,
test_mode: bool = False,
)#

Deploy a TensorRT-LLM model using Ray Serve.

This method handles the complete deployment lifecycle including:

  • Starting Ray Serve

  • Creating and deploying the TensorRTLLMRayDeployable

  • Setting up signal handlers for graceful shutdown

  • Keeping the deployment running until interrupted

Note: This method assumes the model is already converted to TensorRT-LLM format. The conversion should be done before calling this API.

Parameters:
  • trt_llm_path (str) – Path to the TensorRT-LLM model directory with pre-built engines.

  • model_id (str, optional) – Model identifier for API responses. Defaults to "tensorrt-llm-model".

  • use_python_runtime (bool, optional) – Whether to use Python runtime (vs C++ runtime). Defaults to True.

  • multi_block_mode (bool, optional) – Whether to enable multi-block mode. Defaults to False.

  • lora_ckpt_list (list, optional) – List of LoRA checkpoint paths. Defaults to None.

  • enable_chunked_context (bool, optional) – Whether to enable chunked context (C++ runtime only). Defaults to False.

  • max_tokens_in_paged_kv_cache (int, optional) – Maximum tokens in paged KV cache (C++ runtime only). Defaults to None.

  • num_replicas (int, optional) – Number of replicas for deployment. Defaults to 1.

  • num_cpus_per_replica (float, optional) – CPUs per model replica. Defaults to 8.

  • num_gpus_per_replica (int, optional) – GPUs per model replica. Defaults to 1.

  • max_ongoing_requests (int, optional) – Maximum number of ongoing requests per replica. Defaults to 10.

  • test_mode (bool, optional) – Enable test mode. Defaults to False.

Raises:
  • Exception – If Ray is not installed or deployment fails.

  • ValueError – If C++ runtime specific options are used with Python runtime.
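
A usage sketch, assuming the TensorRT-LLM engines under the placeholder directory have already been built:

.. code-block:: python

    from nemo_deploy.deploy_ray import DeployRay

    ray_deployer = DeployRay(num_gpus=1)

    # The engines must already exist; this call only serves them until interrupted.
    ray_deployer.deploy_tensorrt_llm_model(
        trt_llm_path="/path/to/trt_llm_engine_dir",  # placeholder engine directory
        model_id="tensorrt-llm-model",
        use_python_runtime=True,
        num_gpus_per_replica=1,
        num_replicas=1,
    )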