CloudAI Benchmark Framework v1.6.1

vLLM

The vLLM workload (test_template_name is vllm) lets users run vLLM benchmarks within the CloudAI framework.

vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Test and Scenario Examples

test.toml (test definition)


    name = "vllm_test"
    description = "Example vLLM test"
    test_template_name = "vllm"

    [cmd_args]
    docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
    model = "Qwen/Qwen3-0.6B"

    [bench_cmd_args]
    random_input_len = 16
    random_output_len = 128
    max_concurrency = 16
    num_prompts = 30


scenario.toml (scenario with one test)


    name = "vllm-benchmark"

    [[Tests]]
    id = "vllm.1"
    num_nodes = 1
    time_limit = "00:10:00"
    test_name = "vllm_test"


Test-in-Scenario example

scenario.toml (separate test toml is not needed)


    name = "vllm-benchmark"

    [[Tests]]
    id = "vllm.1"
    num_nodes = 1
    time_limit = "00:10:00"
    name = "vllm_test"
    description = "Example vLLM test"
    test_template_name = "vllm"

    [Tests.cmd_args]
    docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
    model = "Qwen/Qwen3-0.6B"

    [Tests.bench_cmd_args]
    random_input_len = 16
    random_output_len = 128
    max_concurrency = 16
    num_prompts = 30


The number of GPUs can be controlled using the options below, listed from lowest to highest priority:

1. gpus_per_node system property (scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
3. gpu_ids command argument for prefill and decode configurations (comma-separated list of GPU IDs). If disaggregated mode is used (prefill is set), either both prefill and decode should define gpu_ids, or neither should.
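
The priority order above can be illustrated with a minimal Python sketch. The function resolve_gpu_ids and its parameters are hypothetical and not part of CloudAI. The sketch assumes that a higher-priority source, when set, fully replaces the lower-priority ones, and that gpus_per_node expands to IDs 0..N-1.

```python
from typing import List, Optional


def _parse_ids(value: str) -> List[str]:
    """Parse a comma-separated GPU list such as "0,1,2,3"."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resolve_gpu_ids(
    gpus_per_node: Optional[int] = None,
    cuda_visible_devices: Optional[str] = None,
    gpu_ids: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: pick GPU IDs following the documented priority."""
    # Highest priority: explicit gpu_ids from the prefill/decode configuration.
    if gpu_ids:
        return _parse_ids(gpu_ids)
    # Next: the CUDA_VISIBLE_DEVICES environment variable.
    if cuda_visible_devices:
        return _parse_ids(cuda_visible_devices)
    # Lowest priority: gpus_per_node, expanded to 0..N-1 (an assumption).
    if gpus_per_node:
        return [str(index) for index in range(gpus_per_node)]
    return []
```

For example, resolve_gpu_ids(gpus_per_node=8, cuda_visible_devices="0,1,2,3") yields the four IDs from the environment variable, because it outranks the system property.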

By default, vLLM runs without disaggregation, as a single process. To enable disaggregation, set the prefill configuration:

test.toml (disaggregated prefill/decode)


    [cmd_args]
    docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
    model = "Qwen/Qwen3-0.6B"

    [cmd_args.prefill]

    [extra_env_vars]
    CUDA_VISIBLE_DEVICES = "0,1,2,3"


The config above will automatically split the GPUs specified in CUDA_VISIBLE_DEVICES into two halves:

- The first half will be used for prefill
- The second half will be used for decode
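
The split can be pictured with the short Python sketch below. The function name split_visible_devices is hypothetical. The source does not say how an odd number of GPUs is handled, so giving the extra GPU to decode is an assumption of this sketch only.

```python
from typing import List, Tuple


def split_visible_devices(cuda_visible_devices: str) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split a GPU list into (prefill, decode) halves."""
    ids = [item.strip() for item in cuda_visible_devices.split(",") if item.strip()]
    half = len(ids) // 2
    # With an odd count, this sketch gives the extra GPU to decode (assumption).
    return ids[:half], ids[half:]


prefill_gpus, decode_gpus = split_visible_devices("0,1,2,3")
print(prefill_gpus, decode_gpus)  # ['0', '1'] ['2', '3']
```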

For more control, users can specify the GPU IDs explicitly in prefill and decode configurations:

test.toml (disaggregated prefill/decode)


    [cmd_args.prefill]
    gpu_ids = "0,1"

    [cmd_args.decode]
    gpu_ids = "2,3"


In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.
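
Because explicit IDs must be given for both prefill and decode or for neither, a configuration can be sanity-checked before launch. The following is a minimal sketch of such a check; the function name check_gpu_ids is hypothetical, and CloudAI's own validation may differ.

```python
from typing import Optional


def check_gpu_ids(prefill_gpu_ids: Optional[str], decode_gpu_ids: Optional[str]) -> bool:
    """Hypothetical check for disaggregated mode.

    Returns True when explicit gpu_ids are set on both sides, False when
    neither side sets them, and raises ValueError when only one side does.
    """
    if (prefill_gpu_ids is None) != (decode_gpu_ids is None):
        raise ValueError("Set gpu_ids for both prefill and decode, or for neither.")
    return prefill_gpu_ids is not None
```

A True result corresponds to the case described above, where CUDA_VISIBLE_DEVICES is ignored; False corresponds to the automatic split.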

proxy_script is used to proxy requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode. The default value is listed under Command Arguments below.

The default can be overridden by setting proxy_script, for example to use the latest version of the script from the vLLM repository:

test_scenario.toml (override proxy_script)


    [[Tests.git_repos]]
    url = "https://github.com/vllm-project/vllm.git"
    commit = "main"
    mount_as = "/vllm_repo"

    [Tests.cmd_args]
    docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
    proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"


In this case the proxy script will be mounted from the vLLM repository (cloned locally) as /vllm_repo and used for the test.

vLLM Serve Arguments

pydantic model cloudai.workloads.vllm.vllm.VllmArgs

Base command arguments for vLLM instances.

field nixl_threads: int | list[int] | None = None

Set kv_connector_extra_config.num_threads for --kv-transfer-config CLI argument.
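
To show where num_threads ends up, the sketch below builds a JSON value of the shape that vLLM's --kv-transfer-config option accepts. The helper name build_kv_transfer_config is hypothetical, and the kv_connector and kv_role values are assumptions chosen for illustration. The source only states that nixl_threads sets kv_connector_extra_config.num_threads.

```python
import json
from typing import Any, Dict, Optional


def build_kv_transfer_config(nixl_threads: Optional[int] = None) -> str:
    """Hypothetical helper: return a JSON string for --kv-transfer-config."""
    config: Dict[str, Any] = {
        "kv_connector": "NixlConnector",  # assumed connector for this sketch
        "kv_role": "kv_both",  # assumed role for this sketch
    }
    # Only add the extra config when a thread count was requested.
    if nixl_threads is not None:
        config["kv_connector_extra_config"] = {"num_threads": nixl_threads}
    return json.dumps(config)


print(build_kv_transfer_config(nixl_threads=4))
```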

property serve_args_exclude: set[str]

Fields consumed internally and excluded from generic serve args.

serialize_serve_arg(key: str, value: object) → list[str]

Serialize a single serve argument to CLI tokens.

property serve_args: list[str]

field gpu_ids: str | list[str] | None = None

Comma-separated GPU IDs. If not set, all available GPUs will be used.

Command Arguments

class cloudai.workloads.vllm.vllm.VllmCmdArgs(
    *,
    docker_image_url: str,
    model: str = 'Qwen/Qwen3-0.6B',
    port: Annotated[int, Ge(ge=1), Le(le=65535)] = 8300,
    host: str = '0.0.0.0',
    bench_host: str | None = None,
    healthcheck: str = '/healthcheck',
    serve_wait_seconds: int = 300,
    prefill: VllmArgs | None = None,
    decode: VllmArgs = <factory>,
    proxy_script: str = '/opt/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py'
)

Bases: LLMServingCmdArgs[VllmArgs]

vLLM serve command arguments.

Benchmark Command Arguments

class cloudai.workloads.vllm.vllm.VllmBenchCmdArgs(*, random_input_len: int = 16, random_output_len: int = 128, max_concurrency: int = 16, num_prompts: int = 30, **extra_data: Any)

Bases: CmdArgs

vLLM bench serve command arguments.
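
As a rough illustration of how these fields relate to the vllm bench serve command line, the sketch below turns the default values into CLI tokens by replacing underscores with dashes. The helper bench_args_to_cli is hypothetical. How CloudAI actually serializes these arguments is not shown in this page, so treat the mapping as an assumption.

```python
from typing import Dict, List


def bench_args_to_cli(bench_cmd_args: Dict[str, object]) -> List[str]:
    """Hypothetical helper: map snake_case fields to --kebab-case CLI tokens."""
    tokens: List[str] = []
    for key, value in bench_cmd_args.items():
        tokens.extend(["--" + key.replace("_", "-"), str(value)])
    return tokens


# Defaults taken from the VllmBenchCmdArgs signature above.
defaults = {
    "random_input_len": 16,
    "random_output_len": 128,
    "max_concurrency": 16,
    "num_prompts": 30,
}
print(" ".join(["vllm", "bench", "serve", *bench_args_to_cli(defaults)]))
# vllm bench serve --random-input-len 16 --random-output-len 128 --max-concurrency 16 --num-prompts 30
```

Since the class accepts **extra_data, additional keys placed under bench_cmd_args would pass through the same kind of mapping in this sketch.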

Test Definition

class cloudai.workloads.vllm.vllm.VllmTestDefinition(
    *,
    name: str,
    description: str,
    test_template_name: str,
    cmd_args: VllmCmdArgs,
    extra_env_vars: dict[str, str | List[str]] = {},
    extra_cmd_args: dict[str, str] = {},
    extra_container_mounts: list[str] = [],
    git_repos: list[GitRepo] = [],
    nsys: NsysConfiguration | None = None,
    predictor: PredictorConfig | None = None,
    agent: str = 'grid_search',
    agent_steps: int = 1,
    agent_metrics: list[str] = ['default'],
    agent_reward_function: str = 'inverse',
    agent_config: dict[str, Any] | None = None,
    bench_cmd_args: VllmBenchCmdArgs = VllmBenchCmdArgs(random_input_len=16, random_output_len=128, max_concurrency=16, num_prompts=30),
    proxy_script_repo: GitRepo | None = None
)

Bases: LLMServingTestDefinition[VllmCmdArgs]

Test object for vLLM.

© Copyright 2026, NVIDIA CORPORATION & AFFILIATES. Last updated on Jun 3, 2026