CloudAI Benchmark Framework v1.5.0

AI Dynamo

This workload (test_template_name is AIDynamo) runs AI inference benchmarks using the Dynamo framework with distributed prefill and decode workers.

Prepare cluster

Before running the AI Dynamo workload on a Kubernetes cluster, ensure that the cluster is set up according to the instructions in the official documentation. Below is a short summary of the required steps:

export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.7.0 # replace with the desired release version
helm upgrade -n default -i dynamo-crds https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm upgrade -n default -i dynamo-platform https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz

# The following components are required for multi node only.
# Versions should be aligned with Dynamo version.
helm upgrade -n default -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:v0.0.0-gd462e65
helm upgrade -n default -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler:0.0.0-4c29820

Launch and Monitor the Job

Note

Both CloudAI and Dynamo access the HuggingFace Hub. To avoid 429 Too Many Requests errors and to access models that require authentication, it is recommended to define the HF_TOKEN environment variable before invoking CloudAI. Once it is set, run uv run hf auth login to authenticate.
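
A pre-flight check for the token can be sketched in a few lines of Python. This is only an illustration: the helper name hf_token_status is hypothetical and is not part of CloudAI or the HuggingFace tooling; it simply inspects the HF_TOKEN variable named in the note.

```python
import os
from typing import Mapping, Optional


def hf_token_status(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "set" if HF_TOKEN is defined and non-blank, else "missing".

    Hypothetical helper for illustration; it only reads the environment
    and does not contact the HuggingFace Hub or validate the token.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return "set" if token else "missing"


if __name__ == "__main__":
    # Report the state of the current shell environment.
    print(f"HF_TOKEN is {hf_token_status()}")
```

Running such a check before uv run cloudai run makes a missing token visible early, rather than as rate-limit errors in the middle of a job.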

uv run cloudai run --system-config <k8s system toml> \
    --tests-dir conf/experimental/ai_dynamo/test \
    --test-scenario conf/experimental/ai_dynamo/test_scenario/vllm_k8s.toml

Node Configuration for AI Dynamo

AI Dynamo jobs use three distinct types of nodes:

  • Frontend node: Hosts the coordination services (etcd, nats), the frontend server, the request generator (genai-perf), and the first decode worker

  • Prefill node(s): Handle the prefill stage of inference

  • Decode node(s): Handle the decode stage of inference (optional, depending on model and setup)

The total number of nodes required is:

num_prefill_nodes + num_decode_nodes

If the number of nodes in the test schema does not match the number in the test scenario, CloudAI uses the value from the test schema and ignores the one in the test scenario.

All node role assignments and orchestration are automatically managed by CloudAI.
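
The node-count rule can be illustrated with a short Python sketch. It assumes that "the number of nodes specified in the test schema" means num_prefill_nodes + num_decode_nodes; the function name resolve_node_count and the scenario_num_nodes parameter are hypothetical and do not reflect CloudAI's internal code.

```python
from typing import Optional, Tuple


def resolve_node_count(
    num_prefill_nodes: int,
    num_decode_nodes: int,
    scenario_num_nodes: Optional[int] = None,
) -> Tuple[int, bool]:
    """Return (nodes_to_use, mismatch_detected).

    Illustration only: the schema total always wins, and the flag reports
    whether the test scenario asked for a different number of nodes.
    """
    if num_prefill_nodes < 0 or num_decode_nodes < 0:
        raise ValueError("node counts must be non-negative")
    schema_total = num_prefill_nodes + num_decode_nodes
    mismatch = scenario_num_nodes is not None and scenario_num_nodes != schema_total
    return schema_total, mismatch


if __name__ == "__main__":
    # Two prefill nodes and one decode node, while the scenario asks for four.
    total, mismatch = resolve_node_count(2, 1, scenario_num_nodes=4)
    print(total, mismatch)
```

In this example the scenario value of 4 is ignored and 3 nodes are used, which matches the behavior described above.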

Launch and Monitor the Job

To run the job:

uv run cloudai run --system-config <slurm system toml> \
    --tests-dir conf/experimental/ai_dynamo/test \
    --test-scenario conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml
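
When the launch is scripted, the same command can be assembled as an argument list. The sketch below uses only the flags and paths shown in the command above; the function name build_cloudai_run_cmd and the file name my_slurm_system.toml are hypothetical placeholders.

```python
import shlex
from typing import List


def build_cloudai_run_cmd(
    system_config: str,
    test_scenario: str,
    tests_dir: str = "conf/experimental/ai_dynamo/test",
) -> List[str]:
    """Assemble the `uv run cloudai run` invocation as an argv list."""
    return [
        "uv", "run", "cloudai", "run",
        "--system-config", system_config,
        "--tests-dir", tests_dir,
        "--test-scenario", test_scenario,
    ]


if __name__ == "__main__":
    cmd = build_cloudai_run_cmd(
        # Placeholder path: substitute your own Slurm system TOML.
        system_config="my_slurm_system.toml",
        test_scenario="conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml",
    )
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems when paths contain spaces.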

Monitor job progress with either of the following commands:

watch squeue --me

watch tail -n 4 ./results/<scenario name>/*.txt

The frontend node will initially wait to allow weight loading on all nodes. Once ready, it will launch genai-perf, which begins generating requests to the frontend server. All servers cooperate to complete inference, and the output will appear in stdout.txt.

After job completion, CloudAI will place the output logs and result files in the designated results directory. To analyze performance metrics and validate inference outcomes:

  • Navigate to the results directory (e.g., ./results/...)

  • Most importantly, open the profile_genai_perf.csv file to examine the final benchmarking results

This CSV file includes detailed metrics collected by genai-perf, such as request latency, throughput, and system utilization statistics. Use this data to evaluate the model’s performance and identify potential bottlenecks or optimization opportunities.

Metric,avg,min,max,p99,p95,p90,p75,p50,p25
Time To First Token (ms),"1,146.31",249.48,"3,485.23","3,457.97","3,349.56","3,215.06","1,330.93",640.07,286.52
Time To Second Token (ms),26.05,0.00,133.51,96.12,36.56,34.88,34.35,33.55,1.78
Request Latency (ms),"6,406.20","5,371.47","9,608.72","9,436.13","9,046.58","9,028.16","6,549.60","5,690.23","5,493.63"
Inter Token Latency (ms),30.35,27.59,35.60,35.23,33.88,32.53,31.05,30.13,29.04
Output Sequence Length (tokens),174.45,164.00,187.00,186.22,183.10,180.10,177.00,174.00,171.75
Input Sequence Length (tokens),"3,000.05","2,999.00","3,001.00","3,001.00","3,001.00","3,000.00","3,000.00","3,000.00","3,000.00"

Metric,Value
Output Token Throughput (per sec),261.25
Request Throughput (per sec),1.50
Request Count (count),40.00
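
Because the file holds two tables with different headers, and larger values are quoted with thousands separators, a naive single-header CSV load will not work. The following Python sketch shows one way to read it with the standard library. It assumes the layout matches the sample above (each table starts with a row whose first cell is Metric, and all other cells are numeric); the function name parse_genai_perf_csv is hypothetical.

```python
import csv
import io
from typing import Dict, List, Optional

# A few rows from each table of the sample output shown above.
SAMPLE = '''Metric,avg,min,max,p99,p95,p90,p75,p50,p25
Time To First Token (ms),"1,146.31",249.48,"3,485.23","3,457.97","3,349.56","3,215.06","1,330.93",640.07,286.52
Inter Token Latency (ms),30.35,27.59,35.60,35.23,33.88,32.53,31.05,30.13,29.04
Metric,Value
Output Token Throughput (per sec),261.25
Request Count (count),40.00
'''


def parse_genai_perf_csv(text: str) -> Dict[str, Dict[str, float]]:
    """Map each metric name to a {column name: value} dictionary."""
    metrics: Dict[str, Dict[str, float]] = {}
    header: Optional[List[str]] = None
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            continue  # skip blank separator lines between tables
        if row[0] == "Metric":
            header = row[1:]  # a new table starts; remember its columns
            continue
        if header is None:
            continue  # data before any header row is ignored
        # Strip thousands separators such as "1,146.31" before converting.
        values = [float(cell.replace(",", "")) for cell in row[1:]]
        metrics[row[0]] = dict(zip(header, values))
    return metrics


if __name__ == "__main__":
    parsed = parse_genai_perf_csv(SAMPLE)
    print(parsed["Time To First Token (ms)"]["p99"])
    print(parsed["Output Token Throughput (per sec)"]["Value"])
```

To use it on a real run, read profile_genai_perf.csv from the results directory and pass its contents to the function. If the file contains non-numeric cells, the float conversion will raise and would need extra handling.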

Command Arguments

class cloudai.workloads.ai_dynamo.ai_dynamo.AIDynamoCmdArgs(*, docker_image_url: str, huggingface_home_container_path: Path = PosixPath('/root/.cache/huggingface'), dynamo: AIDynamoArgs, genai_perf: GenAIPerfArgs, run_script: str = '', **extra_data: Any)

Bases: CmdArgs

Arguments for AI Dynamo.

docker_image_url: str
huggingface_home_container_path: Path
dynamo: AIDynamoArgs
genai_perf: GenAIPerfArgs
run_script: str
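
The signature above shows which arguments are required and which have defaults. The stand-in below mirrors that shape without importing CloudAI. It is not the real class: the name CmdArgsSketch is hypothetical, dynamo and genai_perf are modeled as plain dictionaries because the fields of AIDynamoArgs and GenAIPerfArgs are not listed in this section, and the image URL is a placeholder.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any, Dict


@dataclass
class CmdArgsSketch:
    """Stand-in mirroring the documented AIDynamoCmdArgs fields (not the real class)."""

    # Required arguments (no default in the documented signature).
    docker_image_url: str
    dynamo: Dict[str, Any]       # stands in for AIDynamoArgs
    genai_perf: Dict[str, Any]   # stands in for GenAIPerfArgs
    # Optional arguments with the documented defaults.
    huggingface_home_container_path: PurePosixPath = PurePosixPath("/root/.cache/huggingface")
    run_script: str = ""


if __name__ == "__main__":
    args = CmdArgsSketch(
        docker_image_url="registry.example/dynamo:tag",  # placeholder value
        dynamo={},      # nested fields omitted: not documented here
        genai_perf={},  # nested fields omitted: not documented here
    )
    print(args.huggingface_home_container_path)
```

Only docker_image_url, dynamo, and genai_perf must be supplied; the HuggingFace cache path and run_script fall back to the defaults shown in the signature.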

Test Definition

class cloudai.workloads.ai_dynamo.ai_dynamo.AIDynamoTestDefinition(*, name: str, description: str, test_template_name: str, cmd_args: cloudai.workloads.ai_dynamo.ai_dynamo.AIDynamoCmdArgs, extra_env_vars: dict[str, str | typing.List[str]] = {}, extra_cmd_args: dict[str, str] = {}, extra_container_mounts: list[str] = [], git_repos: list[cloudai._core.installables.GitRepo] = [], nsys: cloudai.models.workload.NsysConfiguration | None = None, predictor: cloudai.models.workload.PredictorConfig | None = None, agent: str = 'grid_search', agent_steps: int = 1, agent_metrics: list[str] = ['default'], agent_reward_function: str = 'inverse', script: cloudai._core.installables.File = File(src=PosixPath('/home/runner/work/cloudai/cloudai/src/cloudai/workloads/ai_dynamo/ai_dynamo.sh')), dynamo_repo: cloudai._core.installables.GitRepo = GitRepo(url=https://github.com/ai-dynamo/dynamo.git, commit=f7e468c7e8ff0d1426db987564e60572167e8464))

Bases: TestDefinition

Test definition for AI Dynamo.

cmd_args: AIDynamoCmdArgs
script: File
dynamo_repo: GitRepo
property docker_image: DockerImage
property hf_model: HFModel
property installables: list[Installable]
was_run_successful(tr: TestRun) → JobStatusResult
© Copyright 2026, NVIDIA CORPORATION & AFFILIATES. Last updated on Mar 3, 2026