nemo_eval.api#

Module Contents#

Functions#

deploy

Deploys a NeMo model on either a PyTriton server or Ray Serve.

evaluate

Evaluates a NeMo model deployed on a PyTriton server using nvidia-lm-eval.

Data#

API#

nemo_eval.api.AnyPath#

None

nemo_eval.api.logger#

‘getLogger(…)’

nemo_eval.api.deploy(
nemo_checkpoint: Optional[nemo_eval.api.AnyPath] = None,
serving_backend: str = 'pytriton',
model_name: str = 'megatron_model',
server_port: int = 8080,
server_address: str = '0.0.0.0',
triton_address: str = '0.0.0.0',
triton_port: int = 8000,
num_gpus: int = 1,
num_nodes: int = 1,
tensor_parallelism_size: int = 1,
pipeline_parallelism_size: int = 1,
context_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
max_input_len: int = 4096,
max_batch_size: int = 8,
enable_flash_decode: bool = True,
enable_cuda_graphs: bool = True,
num_replicas: int = 1,
num_cpus_per_replica: Optional[int] = None,
include_dashboard: bool = True,
legacy_ckpt: bool = False,
)[source]#

Deploys a NeMo model on either a PyTriton server or Ray Serve.

Parameters:
  • nemo_checkpoint (Path) – Path to the NeMo checkpoint.

  • serving_backend (str) – Backend to use for serving (“pytriton” or “ray”). Default: “pytriton”.

  • model_name (str) – Name for the model that gets deployed on PyTriton or Ray.

  • server_port (int) – HTTP port for the FastAPI or Ray server. Default: 8080.

  • server_address (str) – HTTP address for the FastAPI or Ray server. Default: “0.0.0.0”.

  • triton_address (str) – HTTP address for Triton server. Default: “0.0.0.0”.

  • triton_port (int) – Port for Triton server. Default: 8000.

  • num_gpus (int) – Number of GPUs per node. Default: 1.

  • num_nodes (int) – Number of nodes. Default: 1.

  • tensor_parallelism_size (int) – Tensor parallelism size. Default: 1.

  • pipeline_parallelism_size (int) – Pipeline parallelism size. Default: 1.

  • context_parallel_size (int) – Context parallelism size. Default: 1.

  • expert_model_parallel_size (int) – Expert parallelism size. Default: 1.

  • max_input_len (int) – Max input length of the model. Default: 4096.

  • max_batch_size (int) – Max batch size of the model. Default: 8.

  • enable_flash_decode (bool) – If True, runs inference with flash decode enabled. Default: True.

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs for inference. Default: True.

  • legacy_ckpt (bool) – Indicates whether the checkpoint is in the legacy format. Default: False.

  Ray deployment-specific arguments:

  • num_replicas (int) – Number of model replicas for Ray deployment. Default: 1. Only applicable for Ray backend.

  • num_cpus_per_replica (Optional[int]) – Number of CPUs per replica for Ray deployment. Default: None.

  • include_dashboard (bool) – Whether to include Ray dashboard. Default: True.

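As a usage sketch, the snippet below assembles keyword arguments for `deploy()` following the defaults documented above. The checkpoint path and model name are placeholders, not real assets, and the call itself is left commented out because it starts a live server.

```python
# Hypothetical usage sketch for nemo_eval.api.deploy(); the checkpoint
# path below is a placeholder, not a real asset.
deploy_kwargs = dict(
    nemo_checkpoint="/checkpoints/my_model.nemo",  # placeholder path
    serving_backend="pytriton",                    # or "ray"
    model_name="megatron_model",                   # name used by clients later
    server_port=8080,                              # FastAPI/Ray HTTP port
    triton_port=8000,                              # Triton port (pytriton backend)
    num_gpus=1,
    tensor_parallelism_size=1,
    max_input_len=4096,
    max_batch_size=8,
)

# Deploying is a side effect (it launches a server), so the call is
# commented out here; uncomment with nemo_eval installed:
# from nemo_eval.api import deploy
# deploy(**deploy_kwargs)
```

Parallelism sizes multiply together, so `tensor_parallelism_size * pipeline_parallelism_size * context_parallel_size` should not exceed `num_gpus * num_nodes`.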

nemo_eval.api.evaluate(
target_cfg: nemo_eval.utils.api.EvaluationTarget,
eval_cfg: nemo_eval.utils.api.EvaluationConfig = EvaluationConfig(type='gsm8k'),
adapter_cfg: nemo_eval.utils.api.AdapterConfig | None = None,
) → dict[source]#

Evaluates a NeMo model deployed on a PyTriton server using nvidia-lm-eval.

Parameters:
  • target_cfg (EvaluationTarget) – Target of the evaluation. model_id and url must be provided in EvaluationTarget.api_endpoint to run evaluations.

  • eval_cfg (EvaluationConfig) – Configuration for the evaluation. Default type (task): gsm8k.

  • adapter_cfg (AdapterConfig) – Configuration for the adapter, the layer between the benchmark and the endpoint. Default: None.
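A minimal sketch of the inputs `evaluate()` expects, assuming the field names from the parameter list above (an `api_endpoint` carrying `model_id` and `url`). The URL points at a hypothetical local deployment, and the `evaluate()` call is commented out because it needs a running server.

```python
# Hypothetical evaluation inputs; field names follow the parameter docs
# above, and the model_id/url values are placeholders for a local deployment.
target_fields = {
    "api_endpoint": {
        "model_id": "megatron_model",   # must match the deployed model name
        "url": "http://0.0.0.0:8080",   # FastAPI/Ray server address and port
    }
}
eval_fields = {"type": "gsm8k"}  # default benchmark task

# With nemo_eval installed, the dicts would be wrapped in the config classes:
# from nemo_eval.utils.api import EvaluationTarget, EvaluationConfig
# from nemo_eval.api import evaluate
# results = evaluate(
#     target_cfg=EvaluationTarget(**target_fields),
#     eval_cfg=EvaluationConfig(**eval_fields),
# )
```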