nemo_deploy.deploy_ray#
Module Contents#
Classes#
DeployRay | A class for managing Ray deployment and serving of models.
Functions#
get_available_cpus | Get the total number of available CPUs in the system.
Data#
LOGGER
API#
- nemo_deploy.deploy_ray.LOGGER = 'getLogger(...)'#
- nemo_deploy.deploy_ray.get_available_cpus()#
Get the total number of available CPUs in the system.
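A minimal usage sketch, assuming nemo_deploy and Ray are installed; it sizes the Ray cluster from the host's CPU count using the documented num_cpus parameter:
```python
from nemo_deploy.deploy_ray import DeployRay, get_available_cpus

cpu_count = get_available_cpus()          # total CPUs visible on this host
deployer = DeployRay(num_cpus=cpu_count)  # dedicate all of them to the Ray cluster
```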
- class nemo_deploy.deploy_ray.DeployRay(
- address: str = 'auto',
- num_cpus: Optional[int] = None,
- num_gpus: int = 1,
- include_dashboard: bool = False,
- ignore_reinit_error: bool = True,
- runtime_env: dict = None,
- host: str = '0.0.0.0',
- port: Optional[int] = None,
)#
A class for managing Ray deployment and serving of models.
This class provides functionality to initialize Ray, start Ray Serve, deploy models, and manage the lifecycle of the Ray cluster. It supports NeMo inframework models, Hugging Face models, and TensorRT-LLM models.
.. attribute:: address
The address of the Ray cluster to connect to.
- Type:
str
.. attribute:: num_cpus
Number of CPUs to allocate for the Ray cluster.
- Type:
int
.. attribute:: num_gpus
Number of GPUs to allocate for the Ray cluster.
- Type:
int
.. attribute:: include_dashboard
Whether to include the Ray dashboard.
- Type:
bool
.. attribute:: ignore_reinit_error
Whether to ignore errors when reinitializing Ray.
- Type:
bool
.. attribute:: runtime_env
Runtime environment configuration for Ray.
- Type:
dict
.. attribute:: host
Host address to bind the server to.
- Type:
str
.. attribute:: port
Port number for the server.
- Type:
int
.. method:: deploy_inframework_model
Deploy a NeMo inframework model using Ray Serve.
.. method:: deploy_huggingface_model
Deploy a Hugging Face model using Ray Serve.
.. method:: deploy_tensorrt_llm_model
Deploy a TensorRT-LLM model using Ray Serve.
Initialization
Initialize the DeployRay instance and set up the Ray cluster.
- Parameters:
address (str, optional) – Address of the Ray cluster. Defaults to "auto".
num_cpus (int, optional) – Number of CPUs to allocate. If None, uses all available. Defaults to None.
num_gpus (int, optional) – Number of GPUs to allocate. Defaults to 1.
include_dashboard (bool, optional) – Whether to include the dashboard. Defaults to False.
ignore_reinit_error (bool, optional) – Whether to ignore reinit errors. Defaults to True.
runtime_env (dict, optional) – Runtime environment configuration. Defaults to None.
host (str, optional) – Host address to bind the server to. Defaults to "0.0.0.0".
port (int, optional) – Port number for the server. If None, an available port will be found. Defaults to None.
- Raises:
Exception – If Ray is not installed.
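A minimal construction sketch, assuming nemo_deploy and Ray are installed and one GPU is available; the port value is illustrative:
```python
from nemo_deploy.deploy_ray import DeployRay

# Start (or connect to) a Ray cluster with one GPU and no dashboard.
deployer = DeployRay(
    address="auto",          # connect to an existing cluster if one is running
    num_gpus=1,              # GPUs to allocate for the cluster
    include_dashboard=False,
    host="0.0.0.0",
    port=8000,               # illustrative; with None a free port is chosen
)
```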
- _signal_handler(signum, frame)#
Handle signal interrupts and gracefully shutdown the deployer.
- _start()#
Start Ray Serve with the configured host and port.
Uses the host and port specified during DeployRay initialization. If port is None, an available port will be found automatically.
- _stop()#
Stop the Ray Serve deployment and shutdown the Ray cluster.
This method attempts to gracefully shutdown both Ray Serve and the Ray cluster. If any errors occur during shutdown, they are logged as warnings.
- deploy_inframework_model(
- nemo_checkpoint: str,
- num_gpus: int = 1,
- tensor_model_parallel_size: int = 1,
- pipeline_model_parallel_size: int = 1,
- expert_model_parallel_size: int = 1,
- context_parallel_size: int = 1,
- model_id: str = 'nemo-model',
- num_cpus_per_replica: float = 8,
- num_replicas: int = 1,
- enable_cuda_graphs: bool = False,
- enable_flash_decode: bool = False,
- legacy_ckpt: bool = False,
- max_batch_size: int = 32,
- random_seed: Optional[int] = None,
- test_mode: bool = False,
- megatron_checkpoint_filepath: str = None,
- model_type: str = 'gpt',
- model_format: str = 'nemo',
- micro_batch_size: Optional[int] = None,
- **model_config_kwargs,
)#
Deploy an inframework NeMo/Megatron model using Ray Serve.
This method handles the complete deployment lifecycle including:
- Starting Ray Serve
- Creating and deploying the MegatronRayDeployable
- Setting up signal handlers for graceful shutdown
- Keeping the deployment running until interrupted
- Parameters:
nemo_checkpoint (str) – Path to the .nemo checkpoint file.
num_gpus (int, optional) – Number of GPUs per node. Defaults to 1.
tensor_model_parallel_size (int, optional) – Tensor model parallel size. Defaults to 1.
pipeline_model_parallel_size (int, optional) – Pipeline model parallel size. Defaults to 1.
expert_model_parallel_size (int, optional) – Expert model parallel size. Defaults to 1.
context_parallel_size (int, optional) – Context parallel size. Defaults to 1.
model_id (str, optional) – Model identifier for API responses. Defaults to "nemo-model".
num_cpus_per_replica (float, optional) – CPUs per model replica. Defaults to 8.
num_replicas (int, optional) – Number of replicas for deployment. Defaults to 1.
enable_cuda_graphs (bool, optional) – Enable CUDA graphs. Defaults to False.
enable_flash_decode (bool, optional) – Enable Flash Attention decode. Defaults to False.
legacy_ckpt (bool, optional) – Use legacy checkpoint format. Defaults to False.
max_batch_size (int, optional) – Maximum inference batch size. Defaults to 32.
random_seed (Optional[int], optional) – Random seed for generation. Defaults to None.
test_mode (bool, optional) – Enable test mode. Defaults to False.
megatron_checkpoint_filepath (str, optional) – Path to the Megatron checkpoint file. Defaults to None.
model_type (str, optional) – Type of model to load. Defaults to "gpt".
model_format (str, optional) – Format of model to load. Defaults to "nemo".
micro_batch_size (Optional[int], optional) – Micro batch size for model execution. Defaults to None.
**model_config_kwargs – Additional keyword arguments forwarded to the model configuration.
- Raises:
SystemExit – If parallelism configuration is invalid.
Exception – If deployment fails.
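A minimal sketch of deploying a NeMo checkpoint across two GPUs with tensor parallelism; the checkpoint path is a placeholder and the parallelism values must match your model:
```python
from nemo_deploy.deploy_ray import DeployRay

deployer = DeployRay(num_gpus=2)
# Blocks until interrupted (e.g. Ctrl+C), then shuts down Ray Serve and Ray.
deployer.deploy_inframework_model(
    nemo_checkpoint="/path/to/model.nemo",  # placeholder path
    num_gpus=2,
    tensor_model_parallel_size=2,
    model_id="nemo-model",
    num_replicas=1,
)
```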
- deploy_huggingface_model(
- hf_model_id_path: str,
- task: str = 'text-generation',
- trust_remote_code: bool = True,
- device_map: Optional[str] = None,
- max_memory: Optional[str] = None,
- model_id: str = 'hf-model',
- num_replicas: int = 1,
- num_cpus_per_replica: float = 8,
- num_gpus_per_replica: int = 1,
- max_ongoing_requests: int = 10,
- use_vllm_backend: bool = False,
- test_mode: bool = False,
)#
Deploy a Hugging Face model using Ray Serve.
This method handles the complete deployment lifecycle including:
- Starting Ray Serve
- Creating and deploying the HFRayDeployable
- Setting up signal handlers for graceful shutdown
- Keeping the deployment running until interrupted
- Parameters:
hf_model_id_path (str) – Path to the Hugging Face model or model identifier. Can be a local path or a model ID from the Hugging Face Hub.
task (str, optional) – Hugging Face task type. Defaults to "text-generation".
trust_remote_code (bool, optional) – Whether to trust remote code when loading the model. Defaults to True.
device_map (str, optional) – Device mapping strategy for model placement. Defaults to None.
max_memory (str, optional) – Maximum memory allocation when using a balanced device map. Defaults to None.
model_id (str, optional) – Model identifier for API responses. Defaults to "hf-model".
num_replicas (int, optional) – Number of replicas for deployment. Defaults to 1.
num_cpus_per_replica (float, optional) – CPUs per model replica. Defaults to 8.
num_gpus_per_replica (int, optional) – GPUs per model replica. Defaults to 1.
max_ongoing_requests (int, optional) – Maximum number of ongoing requests per replica. Defaults to 10.
use_vllm_backend (bool, optional) – Whether to use the vLLM backend for deployment. If True, exports the HF checkpoint to vLLM format and uses the vLLM backend for inference. Defaults to False.
test_mode (bool, optional) – Enable test mode. Defaults to False.
- Raises:
Exception – If Ray is not installed or deployment fails.
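A minimal sketch of serving a Hugging Face text-generation model on a single GPU; the model identifier is illustrative, and any local path or Hub ID can be used:
```python
from nemo_deploy.deploy_ray import DeployRay

deployer = DeployRay(num_gpus=1)
# Blocks until interrupted, then shuts down Ray Serve and Ray.
deployer.deploy_huggingface_model(
    hf_model_id_path="meta-llama/Llama-3.2-1B",  # illustrative model ID
    task="text-generation",
    model_id="hf-model",
    num_replicas=1,
    num_gpus_per_replica=1,
)
```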
- deploy_tensorrt_llm_model(
- trt_llm_path: str,
- model_id: str = 'tensorrt-llm-model',
- use_python_runtime: bool = True,
- multi_block_mode: bool = False,
- lora_ckpt_list: Optional[list] = None,
- enable_chunked_context: bool = False,
- max_tokens_in_paged_kv_cache: Optional[int] = None,
- num_replicas: int = 1,
- num_cpus_per_replica: float = 8,
- num_gpus_per_replica: int = 1,
- max_ongoing_requests: int = 10,
- test_mode: bool = False,
)#
Deploy a TensorRT-LLM model using Ray Serve.
This method handles the complete deployment lifecycle including:
- Starting Ray Serve
- Creating and deploying the TensorRTLLMRayDeployable
- Setting up signal handlers for graceful shutdown
- Keeping the deployment running until interrupted
Note: This method assumes the model is already converted to TensorRT-LLM format. The conversion should be done before calling this API.
- Parameters:
trt_llm_path (str) – Path to the TensorRT-LLM model directory with pre-built engines.
model_id (str, optional) – Model identifier for API responses. Defaults to "tensorrt-llm-model".
use_python_runtime (bool, optional) – Whether to use the Python runtime (vs. the C++ runtime). Defaults to True.
multi_block_mode (bool, optional) – Whether to enable multi-block mode. Defaults to False.
lora_ckpt_list (list, optional) – List of LoRA checkpoint paths. Defaults to None.
enable_chunked_context (bool, optional) – Whether to enable chunked context (C++ runtime only). Defaults to False.
max_tokens_in_paged_kv_cache (int, optional) – Maximum tokens in the paged KV cache (C++ runtime only). Defaults to None.
num_replicas (int, optional) – Number of replicas for deployment. Defaults to 1.
num_cpus_per_replica (float, optional) – CPUs per model replica. Defaults to 8.
num_gpus_per_replica (int, optional) – GPUs per model replica. Defaults to 1.
max_ongoing_requests (int, optional) – Maximum number of ongoing requests per replica. Defaults to 10.
test_mode (bool, optional) – Enable test mode. Defaults to False.
- Raises:
Exception – If Ray is not installed or deployment fails.
ValueError – If C++ runtime specific options are used with the Python runtime.
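A minimal sketch of serving a pre-built TensorRT-LLM engine directory; the engine path is a placeholder, and the engines must be built before calling this method:
```python
from nemo_deploy.deploy_ray import DeployRay

deployer = DeployRay(num_gpus=1)
# Blocks until interrupted, then shuts down Ray Serve and Ray.
deployer.deploy_tensorrt_llm_model(
    trt_llm_path="/path/to/trtllm_engines",  # placeholder engine directory
    model_id="tensorrt-llm-model",
    use_python_runtime=True,
    num_replicas=1,
    num_gpus_per_replica=1,
)
```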