CloudAI Benchmark Framework v1.6.1

NIXL EP

This workload (test_template_name is NixlEP) runs the NIXL Elastic EP benchmark through a Slurm-managed multi-node elastic launcher flow.

The Slurm launch model is:

  • one elastic.py process per node, started in sequence as the plan progresses

  • the master node starts first and exposes a TCPStore for rank coordination

  • follower nodes connect via --tcp-server $master_ip once the master is ready

  • the benchmark runtime comes from the container image

  • each run serializes its plan JSON into the output directory
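
The launch model above can be sketched as a small command builder. This is only an illustration, not CloudAI's actual command generation: the function name `build_elastic_command`, the assumption that node index 0 is the master, and the address `10.0.0.1` are all hypothetical. Only the `--tcp-server` flag and the default script path come from this page; any other elastic.py arguments are left out because they are not listed here.

```python
from __future__ import annotations

# Default entrypoint path as documented for the elastic_script field.
DEFAULT_ELASTIC_SCRIPT = "/workspace/nixl/examples/device/ep/tests/elastic/elastic.py"


def build_elastic_command(
    node_index: int,
    master_ip: str,
    python_executable: str = "python3",
    elastic_script: str = DEFAULT_ELASTIC_SCRIPT,
) -> list[str]:
    """Return an illustrative argv for the elastic.py process on one node.

    Assumption: node 0 is the master. Other elastic.py flags are omitted.
    """
    command = [python_executable, elastic_script]
    if node_index > 0:
        # Followers connect to the master's TCPStore; the master gets no such flag.
        command += ["--tcp-server", master_ip]
    return command


master_cmd = build_elastic_command(0, "10.0.0.1")    # hypothetical address
follower_cmd = build_elastic_command(1, "10.0.0.1")
print(" ".join(master_cmd))
print(" ".join(follower_cmd))
```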

The plan field is a JSON-encoded list of phases. Each phase is a list of rank indices passed directly to the benchmark. CloudAI uses the following convention to drive the elastic launcher:

  • Non-negative rank index: the rank is active. A rank that is new relative to the previous phase causes CloudAI to fire an additional srun for that worker.

  • Negative rank index (e.g. -6): signals a contraction. The benchmark takes the absolute value as the rank and treats that rank as temporarily removed. No new srun is launched for negative indices.

  • Omitted rank: a rank present in an earlier phase but absent from the current phase list is not relaunched. The benchmark’s own phase logic handles its inactivity.

Example:

    [[0, 1, 2, 3],              # phase 0: ranks 0–3 start
     [0, 1, 2, 3, 4, 5, 6, 7],  # phase 1: ranks 4–7 join (expansion)
     [0, 1, 2, 3, 4, -6, 7],    # phase 2: rank 6 contracted (no new launch)
     [0, 1, 2, 3, 4, 5, 6, 7]]  # phase 3: rank 6 rejoins (new launch for rank 6)
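
To make the launch convention concrete, here is a minimal sketch that derives which ranks would need a new srun in each phase. It is a literal reading of the rules above (a rank is launched when it is active in the current phase but was not active in the previous one) and is not taken from CloudAI's source. The function name `new_launches_per_phase` and the small four-rank plan are hypothetical.

```python
from __future__ import annotations


def new_launches_per_phase(plan: list[list[int]]) -> list[list[int]]:
    """For each phase, list the ranks that are active now but were not in the previous phase."""
    launches: list[list[int]] = []
    previously_active: set[int] = set()
    for phase in plan:
        # Negative entries mark contracted ranks, so they never count as active.
        active = {rank for rank in phase if rank >= 0}
        launches.append(sorted(active - previously_active))
        previously_active = active
    return launches


# Hypothetical plan: start with 2 ranks, expand to 4, contract rank 3, rejoin rank 3.
example_plan = [[0, 1], [0, 1, 2, 3], [0, 1, 2, -3], [0, 1, 2, 3]]
print(new_launches_per_phase(example_plan))  # [[0, 1], [2, 3], [], [3]]
```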

Phase completion is detected by polling the primary log for -> end phase N markers.
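
A minimal sketch of this kind of log polling is shown below. It assumes the marker appears literally as `-> end phase N` somewhere in a line. The helper names, the timeout, and the polling interval are hypothetical and do not reflect CloudAI's internals.

```python
from __future__ import annotations

import re
import time
from pathlib import Path

# Matches the "-> end phase N" marker and captures the phase index N.
END_PHASE_PATTERN = re.compile(r"-> end phase (\d+)")


def completed_phases(log_text: str) -> set[int]:
    """Return the set of phase indices whose end marker appears in the log text."""
    return {int(match.group(1)) for match in END_PHASE_PATTERN.finditer(log_text)}


def wait_for_phase(
    log_path: Path,
    phase: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Poll the log until the end marker for `phase` shows up or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if log_path.exists():
            text = log_path.read_text(errors="replace")
            if phase in completed_phases(text):
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

Re-reading the whole file on every poll keeps the sketch simple; a long-running monitor would more likely remember its file offset and read only the new bytes.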

Test TOML example:

    name = "nixl-ep-expansion-contraction"
    description = "NIXL Elastic EP expansion/contraction benchmark"
    test_template_name = "NixlEP"

    [cmd_args]
    docker_image_url = "<docker container url here>"
    elastic_script = "/workspace/nixl/examples/device/ep/tests/elastic/elastic.py"
    plan = "[[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, -6, 7], [0, 1, 2, 3, 4, 5, 6, 7]]"
    num_processes_per_node = 4
    num_tokens = 256
    num_experts_per_rank = 4
    hidden_dim = 8192
    num_topk = 6
    disable_ll_nvlink = true

Test-in-Scenario example:

    name = "nixl-ep-expansion-contraction"

    [[Tests]]
    id = "nixl_ep.expansion_contraction"
    num_nodes = 3
    time_limit = "00:30:00"

    name = "nixl-ep-expansion-contraction"
    description = "NIXL Elastic EP expansion/contraction benchmark"
    test_template_name = "NixlEP"

      [Tests.cmd_args]
      docker_image_url = "<docker container url here>"
      elastic_script = "/workspace/nixl/examples/device/ep/tests/elastic/elastic.py"
      plan = "[[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, -6, 7], [0, 1, 2, 3, 4, 5, 6, 7]]"
      num_processes_per_node = 4
      num_tokens = 256
      num_experts_per_rank = 4
      hidden_dim = 8192
      num_topk = 6
      disable_ll_nvlink = true

After a run completes, CloudAI prints a single table with one row per (node, rank) measurement. The Phases column shows each phase index colour-coded green (passed) or red (failed). Bandwidth columns report dispatch+combine throughput and timing per rank.

The reported metric, named default, is the mean dispatch+combine bandwidth across all ranks, in GB/s.
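
As a small worked example of that aggregation, the sketch below averages per-rank bandwidth values. The numbers are made up purely for illustration and are not measurements, and the variable names are hypothetical.

```python
from statistics import fmean

# Made-up per-rank dispatch+combine bandwidth in GB/s (illustration only, not real results).
per_rank_bandwidth_gb_s = {0: 40.0, 1: 42.0, 2: 44.0, 3: 42.0}

# The metric is the plain arithmetic mean over all ranks.
default_metric = fmean(per_rank_bandwidth_gb_s.values())
print(f"{default_metric:.1f} GB/s")  # 42.0 GB/s
```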

Command Arguments

pydantic model cloudai.workloads.nixl_ep.nixl_ep.NixlEPCmdArgs

Command line arguments for the NIXL Elastic EP benchmark.

field docker_image_url: str [Required]

URL of the Docker image that contains the NIXL EP benchmark.

field elastic_script: str = '/workspace/nixl/examples/device/ep/tests/elastic/elastic.py'

Path to the benchmark entrypoint, relative to the container’s NIXL runtime root or absolute in the container.

field python_executable: str = 'python3'

Python executable to use inside the container.

field plan: str | list[str] [Required]

Serialized phase plan to write into a per-run JSON file. Use a single string such as "[[0, 1], [0, 1, 2, 3]]" for a single run, or a list of such strings to enable DSE mode (one run per plan).

field num_processes_per_node: int | list[int] [Required]

Number of local worker processes to spawn on each allocated node.
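
The sketch below illustrates how list-valued fields such as plan and num_processes_per_node could expand into individual runs. It assumes a full Cartesian product over every list-valued field, which is an assumption made for illustration; this page only states that a list of plans yields one run per plan. The function `expand_runs` is hypothetical and is not CloudAI's API.

```python
from __future__ import annotations

from itertools import product
from typing import Any


def expand_runs(cmd_args: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand list-valued fields into one concrete argument set per combination.

    Assumption: every list-valued field is swept and combinations form a full grid.
    """
    swept = {key: value for key, value in cmd_args.items() if isinstance(value, list)}
    fixed = {key: value for key, value in cmd_args.items() if not isinstance(value, list)}
    names = list(swept)
    # With no swept fields, product() yields a single empty combination, i.e. one run.
    return [
        {**fixed, **dict(zip(names, combination))}
        for combination in product(*(swept[name] for name in names))
    ]


runs = expand_runs(
    {
        "plan": ["[[0, 1], [0, 1, 2, 3]]", "[[0, 1, 2, 3], [0, 1, -2, 3]]"],
        "num_processes_per_node": [2, 4],
        "num_tokens": 256,
    }
)
print(len(runs))  # 4
```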

field service_startup_timeout_seconds: int = 60

Seconds to wait for the master node’s TCPStore to accept connections.

Constraints:
  • ge = 1

field store_port: int = 9999

TCPStore port used by the benchmark.

Constraints:
  • ge = 1

  • le = 65535
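
The startup wait described by service_startup_timeout_seconds and store_port can be pictured as a bounded TCP connect loop. The following is a minimal sketch using only the Python standard library. The function `wait_for_tcp_port` and the polling interval are hypothetical, and CloudAI may implement the wait differently.

```python
from __future__ import annotations

import socket
import time


def wait_for_tcp_port(
    host: str,
    port: int = 9999,
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
) -> bool:
    """Return True once host:port accepts a TCP connection, False if the timeout expires."""
    if not 1 <= port <= 65535:
        # Mirrors the documented ge = 1 / le = 65535 constraints on store_port.
        raise ValueError(f"port must be in 1..65535, got {port}")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=poll_interval_s):
                return True
        except OSError:
            pass  # Not listening yet (refused, unreachable, or timed out); retry below.
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A successful connect only shows that something is listening on the port; it does not prove the listener is the benchmark's TCPStore.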

parse_plan() → list[list[int]]
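
The body of parse_plan() is not shown on this page. Given the documented return type and the JSON-encoded plan string, a plausible sketch is a json.loads call followed by shape validation. The function `parse_plan_sketch` below is hypothetical and its validation rules are assumptions, not the library's behaviour.

```python
from __future__ import annotations

import json


def parse_plan_sketch(plan: str) -> list[list[int]]:
    """Decode a JSON-encoded plan string into a list of phases, each a list of rank indices."""
    phases = json.loads(plan)
    if not isinstance(phases, list) or not phases:
        raise ValueError("plan must be a non-empty JSON list of phases")
    for index, phase in enumerate(phases):
        # type(...) is int rejects booleans, which JSON would otherwise let through as ints.
        if not isinstance(phase, list) or not all(type(rank) is int for rank in phase):
            raise ValueError(f"phase {index} must be a list of integers")
    return phases


print(parse_plan_sketch("[[0, 1], [0, 1, 2, 3]]"))  # [[0, 1], [0, 1, 2, 3]]
```

Because the plan string must be valid JSON, it cannot carry `#` comments.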

Test Definition

class cloudai.workloads.nixl_ep.nixl_ep.NixlEPTestDefinition(*, name: str, description: str, test_template_name: str, cmd_args: NixlEPCmdArgs, extra_env_vars: dict[str, str | List[str]] = {}, extra_cmd_args: dict[str, str] = {}, extra_container_mounts: list[str] = [], git_repos: list[GitRepo] = [], nsys: NsysConfiguration | None = None, predictor: PredictorConfig | None = None, agent: str = 'grid_search', agent_steps: int = 1, agent_metrics: list[str] = ['default'], agent_reward_function: str = 'inverse', agent_config: dict[str, Any] | None = None)

Bases: TestDefinition

Test definition for the NIXL Elastic EP benchmark.

© Copyright 2026, NVIDIA CORPORATION & AFFILIATES. Last updated on Jun 3, 2026