nemo_eval.utils.ray_deploy
Module Contents

Functions

- Get the number of available CPUs.
- signal_handler – Handle termination signals.
- deploy_with_ray – Deploy the model using Ray Serve.
Data

API

- nemo_eval.utils.ray_deploy.logger = getLogger(…)
- nemo_eval.utils.ray_deploy.signal_handler(signum, frame, ray_deployer: nemo_deploy.deploy_ray.DeployRay)
Handle termination signals.
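Since the handler takes a ray_deployer argument in addition to the standard (signum, frame) pair, it cannot be passed to Python's signal module directly. A minimal registration sketch follows, assuming the DeployRay instance is bound with functools.partial; the DeployRay constructor arguments are placeholders, not part of this module's documented API:

```python
import signal
from functools import partial

from nemo_deploy.deploy_ray import DeployRay
from nemo_eval.utils.ray_deploy import signal_handler

# Illustrative: real DeployRay constructor arguments may differ.
ray_deployer = DeployRay()

# Bind the deployer so the callback matches the (signum, frame)
# shape that signal.signal expects.
handler = partial(signal_handler, ray_deployer=ray_deployer)
signal.signal(signal.SIGINT, handler)
signal.signal(signal.SIGTERM, handler)
```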
- nemo_eval.utils.ray_deploy.deploy_with_ray(
      nemo_checkpoint: str,
      num_gpus: int,
      num_nodes: int,
      tensor_model_parallel_size: int,
      pipeline_model_parallel_size: int,
      context_parallel_size: int,
      expert_model_parallel_size: int,
      num_replicas: int,
      num_cpus_per_replica: Optional[int] = None,
      host: str = '0.0.0.0',
      port: int = 8000,
      model_id: str = 'megatron_model',
      enable_cuda_graphs: bool = False,
      enable_flash_decode: bool = True,
      legacy_ckpt: bool = False,
      include_dashboard: bool = True,
  )
Deploy the model using Ray Serve.
- Parameters:
nemo_checkpoint – Path to the NeMo checkpoint
num_gpus – Number of GPUs per node
num_nodes – Number of nodes
tensor_model_parallel_size – Tensor parallelism size
pipeline_model_parallel_size – Pipeline parallelism size
context_parallel_size – Context parallelism size
expert_model_parallel_size – Expert parallelism size
num_replicas – Number of model replicas to deploy
num_cpus_per_replica – Number of CPUs per replica
host – Host address to serve on
port – Port to serve on
model_id – Model identifier
enable_cuda_graphs – Whether to enable CUDA graphs
enable_flash_decode – Whether to enable flash decode
legacy_ckpt – Whether the checkpoint uses the legacy format
include_dashboard – Whether to include Ray dashboard
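As a usage sketch, a call might look like the following; the checkpoint path and all sizing values are illustrative placeholders, not recommendations:

```python
from nemo_eval.utils.ray_deploy import deploy_with_ray

# Placeholder checkpoint path and parallelism sizes for illustration only.
deploy_with_ray(
    nemo_checkpoint="/checkpoints/my_model.nemo",
    num_gpus=8,                        # GPUs per node
    num_nodes=1,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=1,
    context_parallel_size=1,
    expert_model_parallel_size=1,
    num_replicas=2,
    host="0.0.0.0",
    port=8000,
    model_id="megatron_model",
)
```

Assuming the usual Megatron convention that one replica spans tensor × pipeline × context parallel GPUs (an assumption, not stated in this module), each replica above occupies 4 GPUs, so two replicas fill the single 8-GPU node.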