nemo_eval.utils.ray_deploy#

Module Contents#

Functions#

get_available_cpus

Get the number of available CPUs.

signal_handler

Handle termination signals.

deploy_with_ray

Deploy the model using Ray Serve.

Data#

logger

API#

nemo_eval.utils.ray_deploy.logger#

‘getLogger(…)’

nemo_eval.utils.ray_deploy.get_available_cpus() → int[source]#

Get the number of available CPUs.
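The docstring does not specify how availability is determined. A minimal sketch of the common approach, assuming the count should respect CPU affinity masks (e.g. set by Slurm or taskset) rather than report the raw core count; the actual implementation may differ:

```python
import os

def available_cpus_sketch() -> int:
    # sched_getaffinity(0) returns the set of CPUs this process may run on,
    # which honors affinity masks set by schedulers such as Slurm.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # macOS and Windows lack sched_getaffinity; fall back to the core count.
        return os.cpu_count() or 1
```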

nemo_eval.utils.ray_deploy.signal_handler(
signum,
frame,
ray_deployer: nemo_deploy.deploy_ray.DeployRay,
)[source]#

Handle termination signals.
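Because the handler takes a DeployRay instance in addition to the standard (signum, frame) pair, it has to be partially applied before registration with the signal module. A hypothetical wiring; the DeployRay construction shown here is an assumption, so consult nemo_deploy for its real arguments:

```python
import signal
from functools import partial

from nemo_deploy.deploy_ray import DeployRay
from nemo_eval.utils.ray_deploy import signal_handler

ray_deployer = DeployRay()  # assumed default construction; check DeployRay's signature

# Bind the deployer so the callable matches the (signum, frame) signature
# that signal.signal expects.
handler = partial(signal_handler, ray_deployer=ray_deployer)

# Shut the Ray deployment down cleanly on Ctrl-C and on SIGTERM.
signal.signal(signal.SIGINT, handler)
signal.signal(signal.SIGTERM, handler)
```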

nemo_eval.utils.ray_deploy.deploy_with_ray(
nemo_checkpoint: str,
num_gpus: int,
num_nodes: int,
tensor_model_parallel_size: int,
pipeline_model_parallel_size: int,
context_parallel_size: int,
expert_model_parallel_size: int,
num_replicas: int,
num_cpus_per_replica: Optional[int] = None,
host: str = '0.0.0.0',
port: int = 8000,
model_id: str = 'megatron_model',
enable_cuda_graphs: bool = False,
enable_flash_decode: bool = True,
legacy_ckpt: bool = False,
include_dashboard: bool = True,
) → None[source]#

Deploy the model using Ray Serve. A usage sketch follows the parameter list below.

Parameters:
  • nemo_checkpoint – Path to the NeMo checkpoint

  • num_gpus – Number of GPUs per node

  • num_nodes – Number of nodes

  • tensor_model_parallel_size – Tensor parallelism size

  • pipeline_model_parallel_size – Pipeline parallelism size

  • context_parallel_size – Context parallelism size

  • expert_model_parallel_size – Expert parallelism size

  • num_replicas – Number of model replicas to deploy

  • num_cpus_per_replica – Number of CPUs per replica

  • host – Host address to serve on

  • port – Port to serve on

  • model_id – Model identifier

  • enable_cuda_graphs – Whether to enable CUDA graphs

  • enable_flash_decode – Whether to enable flash decode

  • legacy_ckpt – Whether the checkpoint is in the legacy format

  • include_dashboard – Whether to include Ray dashboard
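A hypothetical invocation for a single 8-GPU node, using only the parameters documented above. The checkpoint path and parallelism split are illustrative; the GPU budget presumably has to cover num_replicas × tensor_model_parallel_size × pipeline_model_parallel_size:

```python
from nemo_eval.utils.ray_deploy import deploy_with_ray

deploy_with_ray(
    nemo_checkpoint="/checkpoints/model.nemo",  # illustrative path
    num_gpus=8,                      # GPUs per node
    num_nodes=1,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=1,
    context_parallel_size=1,
    expert_model_parallel_size=1,
    num_replicas=2,                  # 2 replicas x TP=4 fills the 8 GPUs
    num_cpus_per_replica=None,       # let the deployer choose a CPU budget
    host="0.0.0.0",
    port=8000,
    model_id="megatron_model",
)
```

Once deployed, the Ray Serve endpoint listens on the given host and port and serves the model under the given model_id.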