vLLM - NVIDIA Docs

The vLLM workload (test_template_name is vllm) allows users to execute vLLM benchmarks within the CloudAI framework.

vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Usage Examples

Test and Scenario Examples

test.toml (test definition)

name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"

[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30

scenario.toml (scenario with one test)

name = "vllm-benchmark"

[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "vllm_test"

Test-in-Scenario example

scenario.toml (separate test toml is not needed)

name = "vllm-benchmark"

[[Tests]]
id = "vllm.1"
num_nodes = 1
time_limit = "00:10:00"

name = "vllm_test"
description = "Example vLLM test"
test_template_name = "vllm"

[Tests.cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[Tests.bench_cmd_args]
random_input_len = 16
random_output_len = 128
max_concurrency = 16
num_prompts = 30

Controlling the Number of GPUs

The number of GPUs can be controlled using the options below, listed from lowest to highest priority:

1. gpus_per_node system property (scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
3. gpu_ids command argument for prefill and decode configurations (comma-separated list of GPU IDs)

If disaggregated mode is used (prefill is set), prefill and decode should either both define gpu_ids or both leave it unset.
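The priority order above can be sketched as a small resolver. This is a minimal illustration, not CloudAI's implementation: `resolve_gpu_ids` and `check_gpu_ids_consistent` are hypothetical helpers, and expanding `gpus_per_node` into IDs `0..N-1` is an assumption made for the example. For brevity it handles only the string form of gpu_ids.

```python
from typing import Optional


def parse_ids(value: str) -> list[str]:
    """Split a comma-separated GPU ID string, dropping empty entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


def resolve_gpu_ids(
    gpus_per_node: Optional[int] = None,
    cuda_visible_devices: Optional[str] = None,
    gpu_ids: Optional[str] = None,
) -> list[str]:
    """Return GPU IDs from the highest-priority source that is set (hypothetical helper)."""
    if gpu_ids:  # highest priority: explicit command argument
        return parse_ids(gpu_ids)
    if cuda_visible_devices:  # middle priority: environment variable
        return parse_ids(cuda_visible_devices)
    if gpus_per_node:  # lowest priority; mapping the count to IDs 0..N-1 is an assumption
        return [str(i) for i in range(gpus_per_node)]
    return []


def check_gpu_ids_consistent(prefill_gpu_ids: Optional[str], decode_gpu_ids: Optional[str]) -> None:
    """In disaggregated mode, gpu_ids must be set on both sides or on neither."""
    if (prefill_gpu_ids is None) != (decode_gpu_ids is None):
        raise ValueError("set gpu_ids for both prefill and decode, or for neither")


print(resolve_gpu_ids(gpus_per_node=8, cuda_visible_devices="0,1,2,3", gpu_ids="2,3"))
print(resolve_gpu_ids(gpus_per_node=8, cuda_visible_devices="0,1,2,3"))
print(resolve_gpu_ids(gpus_per_node=2))
```

Checking the sources from highest to lowest priority keeps the function short: the first one that is set wins and the rest are never consulted.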

Controlling Disaggregation

By default, vLLM runs without disaggregation as a single process. To enable disaggregation, set the prefill configuration:

test.toml (disaggregated prefill/decode)

[cmd_args]
docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
model = "Qwen/Qwen3-0.6B"

[cmd_args.prefill]

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

The config above automatically splits the GPUs specified in CUDA_VISIBLE_DEVICES into two halves:

- The first half is used for prefill
- The second half is used for decode
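The even split described above can be illustrated with a few lines of Python. `split_gpus` is a hypothetical helper written for this example, not a CloudAI function. The source does not say what happens with an odd number of GPUs; this sketch assigns the extra GPU to decode, which is an assumption.

```python
def split_gpus(cuda_visible_devices: str) -> tuple[list[str], list[str]]:
    """Split a comma-separated GPU list into (prefill, decode) halves (hypothetical helper)."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    mid = len(ids) // 2  # with an odd count, decode gets the extra GPU (assumption)
    return ids[:mid], ids[mid:]


prefill, decode = split_gpus("0,1,2,3")
print("prefill:", prefill)  # prefill: ['0', '1']
print("decode:", decode)    # decode: ['2', '3']
```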

For more control, users can specify the GPU IDs explicitly in prefill and decode configurations:

test.toml (disaggregated prefill/decode)

[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"

In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.

Controlling proxy_script

proxy_script is the script that proxies requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode. The default value is listed in the VllmCmdArgs signature under Command Arguments.

To override it, set proxy_script to another path, for example the latest version of the script from the vLLM repository:

test_scenario.toml (override proxy_script)

[[Tests.git_repos]]
url = "https://github.com/vllm-project/vllm.git"
commit = "main"
mount_as = "/vllm_repo"

[Tests.cmd_args]
docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"

In this case, the vLLM repository is cloned locally and mounted as /vllm_repo, and the proxy script from that checkout is used for the test.

API Documentation

vLLM Serve Arguments

pydantic model cloudai.workloads.vllm.vllm.VllmArgs

Base command arguments for vLLM instances.

field nixl_threads: int | list[int] | None = None: Set kv_connector_extra_config.num_threads for --kv-transfer-config CLI argument.

property serve_args_exclude: set[str]: Fields consumed internally and excluded from generic serve args.

serialize_serve_arg(key: str, value: object) → list[str]: Serialize a single serve argument to CLI tokens.

property serve_args: list[str]

field gpu_ids: str | list[str] | None = None: Comma-separated GPU IDs. If not set, all available GPUs will be used.

Command Arguments

class cloudai.workloads.vllm.vllm.VllmCmdArgs(*, docker_image_url: str, model: str = 'Qwen/Qwen3-0.6B', port: Annotated[int, Ge(ge=1), Le(le=65535)] = 8300, host: str = '0.0.0.0', bench_host: str | None = None, healthcheck: str = '/healthcheck', serve_wait_seconds: int = 300, prefill: VllmArgs | None = None, decode: VllmArgs = <factory>, proxy_script: str = '/opt/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py')

Bases: LLMServingCmdArgs[VllmArgs]

vLLM serve command arguments.

Benchmark Command Arguments

class cloudai.workloads.vllm.vllm.VllmBenchCmdArgs(*, random_input_len: int = 16, random_output_len: int = 128, max_concurrency: int = 16, num_prompts: int = 30, **extra_data: Any)

Bases: CmdArgs

vLLM bench serve command arguments.
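As a rough illustration of how these fields relate to the `vllm bench serve` command line, the sketch below turns the default values into CLI tokens. It assumes each snake_case field maps to a kebab-case flag of the same name; `bench_args_to_cli` is a hypothetical helper, and the way CloudAI actually assembles the command is not shown in this page.

```python
import shlex


def bench_args_to_cli(args: dict[str, object]) -> list[str]:
    """Convert snake_case fields to '--kebab-case value' tokens (assumed mapping)."""
    tokens: list[str] = []
    for key, value in args.items():
        tokens += [f"--{key.replace('_', '-')}", str(value)]
    return tokens


# Default values taken from the VllmBenchCmdArgs signature above.
defaults = {
    "random_input_len": 16,
    "random_output_len": 128,
    "max_concurrency": 16,
    "num_prompts": 30,
}

command = shlex.join(["vllm", "bench", "serve", *bench_args_to_cli(defaults)])
print(command)
# vllm bench serve --random-input-len 16 --random-output-len 128 --max-concurrency 16 --num-prompts 30
```

Because the class accepts `**extra_data`, additional keys under bench_cmd_args would pass through the same kind of mapping in this sketch; whether CloudAI treats them identically is an assumption.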

Test Definition

class cloudai.workloads.vllm.vllm.VllmTestDefinition(*, name: str, description: str, test_template_name: str, cmd_args: VllmCmdArgs, extra_env_vars: dict[str, str | List[str]] = {}, extra_cmd_args: dict[str, str] = {}, extra_container_mounts: list[str] = [], git_repos: list[GitRepo] = [], nsys: NsysConfiguration | None = None, predictor: PredictorConfig | None = None, agent: str = 'grid_search', agent_steps: int = 1, agent_metrics: list[str] = ['default'], agent_reward_function: str = 'inverse', agent_config: dict[str, Any] | None = None, bench_cmd_args: VllmBenchCmdArgs = VllmBenchCmdArgs(random_input_len=16, random_output_len=128, max_concurrency=16, num_prompts=30), proxy_script_repo: GitRepo | None = None)

Bases: LLMServingTestDefinition[VllmCmdArgs]

Test object for vLLM.