nemo_eval.api#

Module Contents#

Functions#

deploy

Deploys a NeMo model on either a PyTriton server or Ray Serve.

Data#

API#

nemo_eval.api.AnyPath#

None

nemo_eval.api.logger#

‘getLogger(…)’

nemo_eval.api.deploy(
nemo_checkpoint: Optional[nemo_eval.api.AnyPath] = None,
hf_model_id_path: Optional[nemo_eval.api.AnyPath] = None,
serving_backend: str = 'pytriton',
model_name: str = 'megatron_model',
server_port: int = 8080,
server_address: str = '0.0.0.0',
triton_address: str = '0.0.0.0',
triton_port: int = 8000,
num_gpus: int = 1,
num_nodes: int = 1,
tensor_parallelism_size: int = 1,
pipeline_parallelism_size: int = 1,
context_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
max_input_len: int = 4096,
max_batch_size: int = 8,
enable_flash_decode: bool = True,
enable_cuda_graphs: bool = True,
legacy_ckpt: bool = False,
use_vllm_backend: bool = True,
num_replicas: int = 1,
num_cpus: Optional[int] = None,
include_dashboard: bool = True,
model_config_kwargs: dict = None,
)[source]#

Deploys a NeMo model on either a PyTriton server or Ray Serve.

Parameters:
  • nemo_checkpoint (Path) – Path to the NeMo checkpoint.

  • hf_model_id_path (Path) – Huggingface model id or local path to the model. Supported only for Ray backend.

  • serving_backend (str) – Backend to use for serving (“pytriton” or “ray”). Default: “pytriton”.

  • model_name (str) – Name for the model that gets deployed on PyTriton or Ray.

  • server_port (int) – HTTP port for the FastAPI or Ray server. Default: 8080.

  • server_address (str) – HTTP address for the FastAPI or Ray server. Default: “0.0.0.0”.

  • triton_address (str) – HTTP address for Triton server. Default: “0.0.0.0”.

  • triton_port (int) – Port for Triton server. Default: 8000.

  • num_gpus (int) – Number of GPUs per node. Default: 1.

  • num_nodes (int) – Number of nodes. Default: 1.

  • tensor_parallelism_size (int) – Tensor parallelism size. Default: 1.

  • pipeline_parallelism_size (int) – Pipeline parallelism size. Default: 1.

  • context_parallel_size (int) – Context parallelism size. Default: 1.

  • expert_model_parallel_size (int) – Expert parallelism size. Default: 1.

  • max_input_len (int) – Max input length of the model. Default: 4096.

  • max_batch_size (int) – Max batch size of the model. Default: 8.

  • enable_flash_decode (bool) – If True, runs inference with flash decode enabled. Default: True. Applicable only for nemo checkpoint.

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs for inference. Default: True. Applicable only for nemo checkpoint.

  • legacy_ckpt (bool) – Indicates whether the checkpoint is in the legacy format. Default: False. Applicable only for nemo checkpoint.

  • use_vllm_backend (bool) – Whether to use the vLLM backend. Default: True. Applicable only for huggingface checkpoint.

  Ray deployment specific args:

  • num_replicas (int) – Number of model replicas for Ray deployment. Default: 1. Only applicable for Ray backend.

  • num_cpus (int) – Number of CPUs to allocate for the Ray cluster. If None, uses all available CPUs. Default: None.

  • include_dashboard (bool) – Whether to include Ray dashboard. Default: True.

  • model_config_kwargs (dict) – Additional keyword arguments for Megatron model config.
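As a usage sketch, the call below assembles arguments for the default PyTriton backend; the checkpoint path is a placeholder, and a Ray deployment would instead pass `serving_backend="ray"` (optionally with `num_replicas`). The `deploy` call itself is commented out because it starts a long-running server:

```python
# Hypothetical invocation of nemo_eval.api.deploy; the checkpoint path
# and model name below are placeholders, not real artifacts.
deploy_kwargs = dict(
    nemo_checkpoint="/checkpoints/my_model.nemo",  # placeholder path
    serving_backend="pytriton",     # or "ray" for Ray Serve
    model_name="megatron_model",
    server_port=8080,               # FastAPI/Ray HTTP port
    triton_port=8000,               # Triton server port (PyTriton backend)
    num_gpus=1,
    tensor_parallelism_size=1,
    max_input_len=4096,
    max_batch_size=8,
)

# In a real session with nemo_eval installed, this blocks and serves the model:
# from nemo_eval.api import deploy
# deploy(**deploy_kwargs)
```

Note that `hf_model_id_path` (and `use_vllm_backend`) apply only when serving a Hugging Face model through the Ray backend, while `triton_address`/`triton_port` are relevant only for the PyTriton backend.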