CloudAI Benchmark Framework v1.6.1

SGLang

This workload (test_template_name is sglang) allows users to execute SGLang benchmarks within the CloudAI framework.

SGLang is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Test + Scenario example

test.toml (test definition)


    name = "sglang_test"
    description = "Example SGLang benchmark"
    test_template_name = "sglang"

    [cmd_args]
    docker_image_url = "lmsysorg/sglang:dev-cu13"
    model = "Qwen/Qwen3-8B"

    [bench_cmd_args]
    random_input = 16
    random_output = 128
    max_concurrency = 16
    num_prompts = 30


scenario.toml (scenario with one test)


    name = "sglang-benchmark"

    [[Tests]]
    id = "sglang.1"
    num_nodes = 1
    time_limit = "00:10:00"
    test_name = "sglang_test"


Test-in-Scenario example

scenario.toml (separate test toml is not needed)


    name = "sglang-benchmark"

    [[Tests]]
    id = "sglang.1"
    num_nodes = 1
    time_limit = "00:10:00"

    name = "sglang_test"
    description = "Example SGLang benchmark"
    test_template_name = "sglang"

    [Tests.cmd_args]
    docker_image_url = "lmsysorg/sglang:dev-cu13"
    model = "Qwen/Qwen3-8B"

    [Tests.bench_cmd_args]
    random_input = 16
    random_output = 128
    max_concurrency = 16
    num_prompts = 30


The number of GPUs can be controlled using the options below, listed from lowest to highest priority:

1. gpus_per_node system property (scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
3. gpu_ids command argument for prefill and decode configurations (comma-separated list of GPU IDs). If disaggregated mode is used (prefill is set), either both prefill and decode should define gpu_ids, or neither should.
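The priority order above can be sketched as a small resolver. This is an illustration only, not CloudAI's implementation: the function name resolve_gpu_ids is hypothetical, and reading a scalar gpus_per_node of N as GPU IDs 0..N-1 is an assumption.

```python
from typing import Optional


def resolve_gpu_ids(
    gpus_per_node: Optional[int] = None,
    cuda_visible_devices: Optional[str] = None,
    gpu_ids: Optional[str] = None,
) -> list[str]:
    """Return GPU IDs from the highest-priority source that is set."""
    if gpu_ids:  # 3. highest priority: explicit gpu_ids command argument
        return [g.strip() for g in gpu_ids.split(",") if g.strip()]
    if cuda_visible_devices:  # 2. CUDA_VISIBLE_DEVICES environment variable
        return [g.strip() for g in cuda_visible_devices.split(",") if g.strip()]
    if gpus_per_node:  # 1. lowest priority: scalar system property
        # Assumption: a scalar N is read as GPU IDs 0..N-1.
        return [str(i) for i in range(gpus_per_node)]
    return []


# CUDA_VISIBLE_DEVICES overrides gpus_per_node.
print(resolve_gpu_ids(gpus_per_node=8, cuda_visible_devices="0,1,2,3"))
# gpu_ids overrides both.
print(resolve_gpu_ids(gpus_per_node=8, cuda_visible_devices="0,1,2,3", gpu_ids="2,3"))
```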

By default, SGLang runs as a single process without disaggregation. To enable disaggregation, set the prefill configuration:

test.toml (disaggregated prefill/decode)


    [cmd_args]
    docker_image_url = "lmsysorg/sglang:dev-cu13"
    model = "Qwen/Qwen3-8B"

    [cmd_args.prefill]

    [extra_env_vars]
    CUDA_VISIBLE_DEVICES = "0,1,2,3"


The config above automatically splits the GPUs listed in CUDA_VISIBLE_DEVICES into two halves: the first half is used for prefill and the second half for decode.
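The half-and-half split can be pictured with the following sketch. The helper name split_for_disaggregation is hypothetical, and the handling of an odd number of GPUs (the extra GPU goes to decode here) is an assumption, since the behavior for odd counts is not specified.

```python
def split_for_disaggregation(cuda_visible_devices: str) -> tuple[list[str], list[str]]:
    """Split a comma-separated GPU list into (prefill, decode) halves."""
    ids = [g.strip() for g in cuda_visible_devices.split(",") if g.strip()]
    mid = len(ids) // 2
    # First half goes to prefill, second half to decode.
    return ids[:mid], ids[mid:]


prefill_gpus, decode_gpus = split_for_disaggregation("0,1,2,3")
print("prefill:", prefill_gpus)
print("decode:", decode_gpus)
```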

For more control, specify the GPU IDs explicitly in the prefill and decode configurations:

test.toml (disaggregated prefill/decode)


    [cmd_args.prefill]
    gpu_ids = "0,1"

    [cmd_args.decode]
    gpu_ids = "2,3"


In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.
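The rule that prefill and decode either both define gpu_ids or both leave it unset can be expressed as a simple check. The sketch below is illustrative: check_gpu_ids is a hypothetical helper, and how CloudAI itself reports a violation is not described here.

```python
from typing import Optional


def check_gpu_ids(prefill_gpu_ids: Optional[str], decode_gpu_ids: Optional[str]) -> None:
    """Raise ValueError if only one of prefill/decode defines gpu_ids."""
    if (prefill_gpu_ids is None) != (decode_gpu_ids is None):
        raise ValueError(
            "gpu_ids must be set for both prefill and decode, or for neither"
        )


check_gpu_ids("0,1", "2,3")  # both set: accepted
check_gpu_ids(None, None)    # neither set: accepted (GPUs are split automatically)

try:
    check_gpu_ids("0,1", None)  # only prefill set: rejected
except ValueError as err:
    print("invalid config:", err)
```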

SGLang Serve Arguments

pydantic model cloudai.workloads.sglang.sglang.SglangArgs

Base command arguments for SGLang instances.

field disaggregation_transfer_backend: str | list[str] | None = 'nixl'

Transfer backend used in disaggregated mode. It is consumed by command generation and not emitted as a generic serve argument.

property serve_args_exclude: set[str]

Fields consumed internally and excluded from generic serve args.

serialize_serve_arg(key: str, value: Any) → list[str]

Serialize a single serve argument to CLI tokens.

property serve_args: list[str]
field gpu_ids: str | list[str] | None = None

Comma-separated GPU IDs. If not set, all available GPUs will be used.
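To give a feel for what serializing a single serve argument to CLI tokens could look like, here is a hedged sketch. It is not the actual serialize_serve_arg implementation: the function name serialize_arg_sketch and every conversion rule (underscores to dashes, True as a bare flag, None and False omitted, lists as space-separated values) are assumptions made for illustration.

```python
from typing import Any


def serialize_arg_sketch(key: str, value: Any) -> list[str]:
    """Turn one (key, value) pair into CLI tokens under the assumed rules."""
    flag = "--" + key.replace("_", "-")
    if value is None or value is False:
        return []  # assumed: unset or disabled options are omitted
    if value is True:
        return [flag]  # assumed: booleans become bare flags
    if isinstance(value, (list, tuple)):
        return [flag, *(str(v) for v in value)]
    return [flag, str(value)]


tokens: list[str] = []
for key, value in {"tp_size": 2, "trust_remote_code": True, "gpu_ids": None}.items():
    tokens += serialize_arg_sketch(key, value)
print(tokens)
```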

Command Arguments

pydantic model cloudai.workloads.sglang.sglang.SglangCmdArgs

Bases: LLMServingCmdArgs[SglangArgs]

SGLang serve command arguments.

field model: str = 'Qwen/Qwen3-8B'
field serve_module: str = 'sglang.launch_server'
field router_module: str = 'sglang_router.launch_router'
field bench_module: str = 'sglang.bench_serving'
field healthcheck: str = '/v1/models'

Health check router endpoint.

field prefill: SglangArgs | None = None

Prefill instance arguments. If not set, a single instance without disaggregation is used.

field decode: SglangArgs [Optional]

Decode instance arguments.

field docker_image_url: str [Required]
field port: int = 8300
Constraints:
  • ge = 1
  • le = 65535

field host: str = '0.0.0.0'

Host/interface for serve or router processes to bind to.

field bench_host: str | None = None

Hostname used by the benchmark client. Defaults to the allocated node hostname.

field serve_wait_seconds: int = 300
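The healthcheck, port, and serve_wait_seconds fields suggest a readiness wait before benchmarking starts. The sketch below shows one way such a wait could work; it assumes that serve_wait_seconds acts as a polling timeout, which is not stated above, and the names http_ok and wait_until_ready plus the 5-second polling interval are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_ready(
    url: str,
    wait_seconds: float = 300,  # mirrors the serve_wait_seconds default
    interval: float = 5.0,      # assumed polling interval
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll url until probe succeeds or wait_seconds have elapsed."""
    deadline = time.monotonic() + wait_seconds
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# URL built from the documented defaults; "localhost" stands in for bench_host.
health_url = f"http://localhost:{8300}/v1/models"
print(health_url)
```

Calling wait_until_ready(health_url) would then block until the endpoint responds or the timeout expires. The probe parameter makes the loop testable without a running server.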

Benchmark Command Arguments

pydantic model cloudai.workloads.sglang.sglang.SglangBenchCmdArgs

Bases: CmdArgs

SGLang bench_serving command arguments.

field backend: str = 'sglang'
field dataset_name: str = 'random'
field num_prompts: int = 30
field max_concurrency: int = 16
field random_input: int = 16
field random_output: int = 128
field warmup_requests: int = 2
field random_range_ratio: float = 1.0
field output_details: bool = True
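Because every benchmark field has a default, a [bench_cmd_args] table only needs the keys that differ. The sketch below mirrors the documented defaults in a plain dataclass (BenchArgsSketch is a stand-in, not the real pydantic model) and applies a partial override; the override values are arbitrary examples.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BenchArgsSketch:
    """Stand-in mirroring the documented SglangBenchCmdArgs defaults."""

    backend: str = "sglang"
    dataset_name: str = "random"
    num_prompts: int = 30
    max_concurrency: int = 16
    random_input: int = 16
    random_output: int = 128
    warmup_requests: int = 2
    random_range_ratio: float = 1.0
    output_details: bool = True


# What a [bench_cmd_args] table with two keys would look like once parsed.
overrides = {"num_prompts": 200, "max_concurrency": 64}

# Fields not named in the override keep their defaults.
effective = replace(BenchArgsSketch(), **overrides)
print(asdict(effective))
```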

Test Definition

pydantic model cloudai.workloads.sglang.sglang.SglangTestDefinition

Bases: LLMServingTestDefinition[SglangCmdArgs]

Test object for SGLang.

field bench_cmd_args: SglangBenchCmdArgs = SglangBenchCmdArgs(backend='sglang', dataset_name='random', num_prompts=30, max_concurrency=16, random_input=16, random_output=128, warmup_requests=2, random_range_ratio=1.0, output_details=True)
was_run_successful(tr: TestRun) → JobStatusResult
property cmd_args_dict: Dict[str, str | List[str]]
constraint_check(tr: TestRun, system: System | None) → bool
property docker_image: DockerImage
property extra_args_str: str
property extra_installables: list[Installable]
property hf_model: HFModel
property installables: list[Installable]
property is_dse_job: bool
field cmd_args: LLMServingCmdArgsT [Required]
field name: str [Required]
field description: str [Required]
field test_template_name: str [Required]
field extra_env_vars: dict[str, str | List[str]] = {}
field extra_cmd_args: dict[str, str] = {}
field extra_container_mounts: list[str] = []
field git_repos: list[GitRepo] = []
field nsys: NsysConfiguration | None = None
field predictor: PredictorConfig | None = None
field agent: str = 'grid_search'
field agent_steps: int = 1
field agent_metrics: list[str] = ['default']
field agent_reward_function: str = 'inverse'
field agent_config: dict[str, Any] | None = None

Agent configuration.
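The default agent is 'grid_search', and cmd_args_dict allows list values. Assuming that list-valued arguments mark the dimensions of a sweep (an assumption, not stated above), a grid search can be pictured as the Cartesian product of those lists. The helper expand_grid and the parameter values are hypothetical.

```python
from itertools import product
from typing import Any


def expand_grid(params: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand list-valued parameters into one dict per combination."""
    keys = list(params)
    # Scalars are treated as single-value axes.
    axes = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(keys, combo)) for combo in product(*axes)]


grid = expand_grid(
    {"max_concurrency": [8, 16], "num_prompts": 30, "random_input": [16, 64]}
)
for point in grid:
    print(point)
```

With two values for max_concurrency and two for random_input, the sketch yields four combinations, each keeping num_prompts at 30.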

© Copyright 2026, NVIDIA CORPORATION & AFFILIATES. Last updated on Jun 3, 2026