SGLang - NVIDIA Docs

This workload (test_template_name is sglang) allows users to execute SGLang benchmarks within the CloudAI framework.

SGLang is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Usage Examples

Test + Scenario example

test.toml (test definition)

name = "sglang_test"
description = "Example SGLang benchmark"
test_template_name = "sglang"

[cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[bench_cmd_args]
random_input = 16
random_output = 128
max_concurrency = 16
num_prompts = 30

scenario.toml (scenario with one test)

name = "sglang-benchmark"

[[Tests]]
id = "sglang.1"
num_nodes = 1
time_limit = "00:10:00"
test_name = "sglang_test"

Test-in-Scenario example

scenario.toml (separate test toml is not needed)

name = "sglang-benchmark"

[[Tests]]
id = "sglang.1"
num_nodes = 1
time_limit = "00:10:00"

name = "sglang_test"
description = "Example SGLang benchmark"
test_template_name = "sglang"

[Tests.cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[Tests.bench_cmd_args]
random_input = 16
random_output = 128
max_concurrency = 16
num_prompts = 30

The number of GPUs can be controlled using the options below, listed from lowest to highest priority:

1. gpus_per_node system property (scalar value)
2. CUDA_VISIBLE_DEVICES environment variable (comma-separated list of GPU IDs)
3. gpu_ids command argument for prefill and decode configurations (comma-separated list of GPU IDs)

If disaggregated mode is used (prefill is set), either both prefill and decode must define gpu_ids, or neither should.
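The priority order can be illustrated with a small Python sketch. This is not CloudAI's implementation: `resolve_gpu_ids` and `parse_ids` are hypothetical helpers, and mapping a scalar gpus_per_node value N to IDs 0..N-1 is an assumption made only for illustration.

```python
from typing import Optional


def parse_ids(value: str) -> list[int]:
    """Turn a comma-separated string such as "0,1,2" into [0, 1, 2]."""
    return [int(part) for part in value.split(",") if part.strip()]


def resolve_gpu_ids(
    gpus_per_node: Optional[int] = None,
    cuda_visible_devices: Optional[str] = None,
    gpu_ids: Optional[str] = None,
) -> list[int]:
    """Hypothetical helper: check the highest-priority source first."""
    if gpu_ids:  # 3. explicit gpu_ids command argument (highest priority)
        return parse_ids(gpu_ids)
    if cuda_visible_devices:  # 2. CUDA_VISIBLE_DEVICES environment variable
        return parse_ids(cuda_visible_devices)
    if gpus_per_node:  # 1. gpus_per_node system property (lowest priority)
        # Assumption: a scalar N means GPU IDs 0..N-1.
        return list(range(gpus_per_node))
    return []
```

For example, `resolve_gpu_ids(8, "0,1,2,3", "2,3")` picks the explicit gpu_ids value because it has the highest priority in this sketch.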

Control disaggregation

By default, SGLang runs without disaggregation, as a single process. To enable disaggregation, set the prefill configuration:

test.toml (disaggregated prefill/decode)

[cmd_args]
docker_image_url = "lmsysorg/sglang:dev-cu13"
model = "Qwen/Qwen3-8B"

[cmd_args.prefill]

[extra_env_vars]
CUDA_VISIBLE_DEVICES = "0,1,2,3"

The config above automatically splits the GPUs specified in CUDA_VISIBLE_DEVICES into two halves: the first half is used for prefill and the second half for decode.
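A minimal Python sketch of such a split is shown here. The function name `split_for_disaggregation` is hypothetical, and the handling of an odd number of GPUs (the extra GPU goes to decode) is an assumption; this page does not say how CloudAI treats that case.

```python
def split_for_disaggregation(cuda_visible_devices: str) -> tuple[list[str], list[str]]:
    """Hypothetical helper: split a comma-separated GPU list into two halves."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    half = len(ids) // 2
    # First half is for prefill, second half is for decode.
    # Assumption: with an odd count, decode receives the extra GPU.
    return ids[:half], ids[half:]


prefill_ids, decode_ids = split_for_disaggregation("0,1,2,3")
print(prefill_ids, decode_ids)
```

With the value "0,1,2,3" from the config above, prefill would get GPUs 0 and 1, and decode would get GPUs 2 and 3.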

For more control, one can specify the GPU IDs explicitly in prefill and decode configurations:

test.toml (disaggregated prefill/decode with explicit GPU IDs)

[cmd_args.prefill]
gpu_ids = "0,1"

[cmd_args.decode]
gpu_ids = "2,3"

In this case, CUDA_VISIBLE_DEVICES is ignored and only the GPUs specified in gpu_ids are used.
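Because gpu_ids must be set on both prefill and decode or on neither, a configuration can be checked before launching. The sketch below is illustrative only: `check_gpu_ids` is a hypothetical helper, not part of the CloudAI API, and the error message is invented.

```python
from typing import Optional


def check_gpu_ids(prefill_gpu_ids: Optional[str], decode_gpu_ids: Optional[str]) -> None:
    """Hypothetical check: gpu_ids is set on both sides or on neither."""
    # The two "is None" results differ only when exactly one side is set.
    if (prefill_gpu_ids is None) != (decode_gpu_ids is None):
        raise ValueError(
            "gpu_ids must be set for both prefill and decode, or for neither"
        )


check_gpu_ids("0,1", "2,3")  # both set: accepted
check_gpu_ids(None, None)    # neither set: accepted
```

Calling `check_gpu_ids("0,1", None)` raises ValueError in this sketch, since only the prefill side defines gpu_ids.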

API Documentation

SGLang Serve Arguments

pydantic model cloudai.workloads.sglang.sglang.SglangArgs

Base command arguments for SGLang instances.

field disaggregation_transfer_backend: str | list[str] | None = 'nixl': Transfer backend used in disaggregated mode. It is consumed by command generation and not emitted as a generic serve argument.

property serve_args_exclude: set[str]: Fields consumed internally and excluded from generic serve args.

serialize_serve_arg(key: str, value: Any) → list[str]: Serialize a single serve argument to CLI tokens.

property serve_args: list[str]

field gpu_ids: str | list[str] | None = None: Comma-separated GPU IDs. If not set, all available GPUs will be used.

Command Arguments

pydantic model cloudai.workloads.sglang.sglang.SglangCmdArgs

Bases: LLMServingCmdArgs[SglangArgs]

SGLang serve command arguments.

field model: str = 'Qwen/Qwen3-8B'

field serve_module: str = 'sglang.launch_server'

field router_module: str = 'sglang_router.launch_router'

field bench_module: str = 'sglang.bench_serving'

field healthcheck: str = '/v1/models': Health check router endpoint.

field prefill: SglangArgs | None = None: Prefill instance arguments. If not set, a single instance without disaggregation is used.

field decode: SglangArgs [Optional]: Decode instance arguments.

field docker_image_url: str [Required]

field port: int = 8300

Constraints:

ge = 1
le = 65535

field host: str = '0.0.0.0': Host/interface for serve or router processes to bind to.

field bench_host: str | None = None: Hostname used by the benchmark client. Defaults to the allocated node hostname.

field serve_wait_seconds: int = 300

Benchmark Command Arguments

pydantic model cloudai.workloads.sglang.sglang.SglangBenchCmdArgs

Bases: CmdArgs

SGLang bench_serving command arguments.

field backend: str = 'sglang'

field dataset_name: str = 'random'

field num_prompts: int = 30

field max_concurrency: int = 16

field random_input: int = 16

field random_output: int = 128

field warmup_requests: int = 2

field random_range_ratio: float = 1.0

field output_details: bool = True

Test Definition

pydantic model cloudai.workloads.sglang.sglang.SglangTestDefinition

Bases: LLMServingTestDefinition[SglangCmdArgs]

Test object for SGLang.

field bench_cmd_args: SglangBenchCmdArgs = SglangBenchCmdArgs(backend='sglang', dataset_name='random', num_prompts=30, max_concurrency=16, random_input=16, random_output=128, warmup_requests=2, random_range_ratio=1.0, output_details=True)

was_run_successful(tr: TestRun) → JobStatusResult

property cmd_args_dict: Dict[str, str | List[str]]

constraint_check(tr: TestRun, system: System | None) → bool

property docker_image: DockerImage

property extra_args_str: str

property extra_installables: list[Installable]

property hf_model: HFModel

property installables: list[Installable]

property is_dse_job: bool

field cmd_args: LLMServingCmdArgsT [Required]

field name: str [Required]

field description: str [Required]

field test_template_name: str [Required]

field extra_env_vars: dict[str, str | List[str]] = {}

field extra_cmd_args: dict[str, str] = {}

field extra_container_mounts: list[str] = []

field git_repos: list[GitRepo] = []

field nsys: NsysConfiguration | None = None

field predictor: PredictorConfig | None = None

field agent: str = 'grid_search'

field agent_steps: int = 1

field agent_metrics: list[str] = ['default']

field agent_reward_function: str = 'inverse'

field agent_config: dict[str, Any] | None = None: Agent configuration.