NIXL EP
This workload (test_template_name is NixlEP) runs the NIXL Elastic EP benchmark through a Slurm-managed multi-node elastic launcher flow.
The Slurm launch model is:
- one elastic.py process per node, started in sequence as the plan progresses
- the master node starts first and exposes a TCPStore for rank coordination
- follower nodes connect via --tcp-server $master_ip once the master is ready
- the benchmark runtime comes from the container image
- each run serializes its plan JSON into the output directory
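The "once the master is ready" step can be pictured as a readiness probe against the TCPStore port. The following is a minimal sketch, not CloudAI's actual implementation: it assumes readiness means the port accepts a TCP connection, and the helper name wait_for_port is hypothetical. The 60-second timeout and port 9999 mentioned in the comments are the documented defaults of service_startup_timeout_seconds and store_port.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_seconds: float, interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper: assumes "ready" means the TCPStore port is accepting
    connections. CloudAI's real readiness check may differ.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the documented defaults, a follower would wait roughly as in wait_for_port(master_ip, 9999, 60), where master_ip is whatever address the launcher resolved for the master node.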
The plan field is a JSON-encoded list of phases. Each phase is a list of rank indices passed directly to the benchmark. CloudAI uses the following convention to drive the elastic launcher:
- Positive rank index: the rank is active. A rank that is new relative to the previous phase causes CloudAI to fire an additional srun for that worker.
- Negative rank index (e.g. -6): signals a contraction. The benchmark sees the absolute value and treats it as temporarily removed. No new srun is launched for negative indices.
- Omitted rank: a rank present in an earlier phase but absent from the current phase list is not relaunched. The benchmark’s own phase logic handles its inactivity.
Example:
[[0, 1, 2, 3], # phase 0: ranks 0–3 start
[0, 1, 2, 3, 4, 5, 6, 7], # phase 1: ranks 4–7 join (expansion)
[0, 1, 2, 3, 4, -6, 7], # phase 2: rank 6 contracted (no new launch)
[0, 1, 2, 3, 4, 5, 6, 7]] # phase 3: rank 6 rejoins (new launch for rank 6)
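To make the convention concrete, the sketch below derives which ranks would need a fresh launch in each phase. It is one reading of the rules that reproduces the comments in the example, not CloudAI's code. It assumes that a negative index stops the worker, that an omitted rank keeps its existing worker, and that a positive index triggers a launch only when no worker for that rank is currently live. The function name launches_per_phase is hypothetical.

```python
import json


def launches_per_phase(plan: list[list[int]]) -> list[list[int]]:
    """For each phase, list the ranks that need a new worker launch.

    Assumed semantics (illustrative only):
      - negative index: the worker for abs(index) is stopped
      - omitted rank: its worker, if any, is left as-is
      - positive index: launch only if no worker for that rank is live
    """
    live: set[int] = set()
    launches: list[list[int]] = []
    for phase in plan:
        new_ranks: list[int] = []
        for idx in phase:
            if idx < 0:
                live.discard(-idx)  # contraction: no launch
            elif idx not in live:
                live.add(idx)
                new_ranks.append(idx)
        launches.append(new_ranks)
    return launches


plan = json.loads("[[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, -6, 7], [0, 1, 2, 3, 4, 5, 6, 7]]")
print(launches_per_phase(plan))
```

Under these assumptions, phase 2 triggers no launch and phase 3 relaunches only rank 6, in line with the comments in the example.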
Phase completion is detected by polling the primary log for -> end phase N markers.
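A minimal sketch of that marker scan follows. It assumes the marker appears literally as "-> end phase N" with a decimal phase number; the exact log format and the helper name completed_phases are assumptions, and the sample log lines are made up for illustration.

```python
import re

# Assumed marker shape: "-> end phase <number>".
END_PHASE_RE = re.compile(r"->\s*end phase\s+(\d+)")


def completed_phases(log_text: str) -> set[int]:
    """Return the set of phase indices whose end marker appears in the log."""
    return {int(match.group(1)) for match in END_PHASE_RE.finditer(log_text)}


# Made-up log excerpt, for illustration only.
sample_log = "rank 0 ready\n-> end phase 0\nworking\n-> end phase 1\n"
print(sorted(completed_phases(sample_log)))
```

A poller could call this on the primary log at a fixed interval and treat phase N as done once N appears in the returned set.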
Test TOML example:
name = "nixl-ep-expansion-contraction"
description = "NIXL Elastic EP expansion/contraction benchmark"
test_template_name = "NixlEP"
[cmd_args]
docker_image_url = "<docker container url here>"
elastic_script = "/workspace/nixl/examples/device/ep/tests/elastic/elastic.py"
plan = "[[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, -6, 7], [0, 1, 2, 3, 4, 5, 6, 7]]"
num_processes_per_node = 4
num_tokens = 256
num_experts_per_rank = 4
hidden_dim = 8192
num_topk = 6
disable_ll_nvlink = true
Test-in-Scenario example:
name = "nixl-ep-expansion-contraction"
[[Tests]]
id = "nixl_ep.expansion_contraction"
num_nodes = 3
time_limit = "00:30:00"
name = "nixl-ep-expansion-contraction"
description = "NIXL Elastic EP expansion/contraction benchmark"
test_template_name = "NixlEP"
[Tests.cmd_args]
docker_image_url = "<docker container url here>"
elastic_script = "/workspace/nixl/examples/device/ep/tests/elastic/elastic.py"
plan = "[[0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, -6, 7], [0, 1, 2, 3, 4, 5, 6, 7]]"
num_processes_per_node = 4
num_tokens = 256
num_experts_per_rank = 4
hidden_dim = 8192
num_topk = 6
disable_ll_nvlink = true
After a run completes, CloudAI prints a single table with one row per (node, rank) measurement. The Phases column shows each phase index colour-coded green (passed) or red (failed). Bandwidth columns report dispatch+combine throughput and timing per rank.
The reported metric (default) is the mean dispatch+combine bandwidth in GB/s across all ranks.
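As a worked illustration of that aggregation, the sketch below averages per-rank bandwidth values. The function name and the input mapping are hypothetical, and the numbers are invented rather than measured results.

```python
from statistics import fmean


def mean_bandwidth_gbps(per_rank_gbps: dict[int, float]) -> float:
    """Mean dispatch+combine bandwidth in GB/s across all ranks."""
    if not per_rank_gbps:
        raise ValueError("no per-rank measurements available")
    return fmean(list(per_rank_gbps.values()))


# Invented example values (GB/s), keyed by rank.
example = {0: 10.0, 1: 20.0, 2: 30.0, 3: 40.0}
print(mean_bandwidth_gbps(example))
```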
Command Arguments
- pydantic model cloudai.workloads.nixl_ep.nixl_ep.NixlEPCmdArgs
Command line arguments for the NIXL Elastic EP benchmark.
- field docker_image_url: str [Required]
URL of the Docker image that contains the NIXL EP benchmark.
- field elastic_script: str = '/workspace/nixl/examples/device/ep/tests/elastic/elastic.py'
Path to the benchmark entrypoint, relative to the container’s NIXL runtime root or absolute in the container.
- field python_executable: str = 'python3'
Python executable to use inside the container.
- field plan: str | list[str] [Required]
Serialized phase plan to write into a per-run JSON file. Use a single string such as "[[0, 1], [0, 1, 2, 3]]" for a single run, or a list of such strings to enable DSE mode (one run per plan).
- field num_processes_per_node: int | list[int] [Required]
Number of local worker processes to spawn on each allocated node.
- field service_startup_timeout_seconds: int = 60
Seconds to wait for the master node’s TCPStore to accept connections.
- Constraints:
ge = 1
- field store_port: int = 9999
TCPStore port used by the benchmark.
- Constraints:
ge = 1
le = 65535
- parse_plan() → list[list[int]]
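The body of parse_plan() is not shown here. Given its signature and the description of the plan field, it plausibly decodes the JSON string into a list of integer lists. The sketch below illustrates that idea under that assumption; the name parse_plan_sketch and the validation rules are hypothetical and not taken from the CloudAI source.

```python
import json


def parse_plan_sketch(plan: str) -> list[list[int]]:
    """Decode a plan string such as "[[0, 1], [0, 1, 2, 3]]" into phases.

    Illustrative only: the validation here is an assumption, not CloudAI's.
    """
    phases = json.loads(plan)
    if not isinstance(phases, list) or not phases:
        raise ValueError("plan must be a non-empty JSON list of phases")
    for phase in phases:
        if not isinstance(phase, list):
            raise ValueError("each phase must be a list of rank indices")
        for idx in phase:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(idx, bool) or not isinstance(idx, int):
                raise ValueError(f"rank index must be an integer, got {idx!r}")
    return phases


print(parse_plan_sketch("[[0, 1], [0, 1, 2, 3]]"))
```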
Test Definition
- class cloudai.workloads.nixl_ep.nixl_ep.NixlEPTestDefinition(*, name: str, description: str, test_template_name: str, cmd_args: NixlEPCmdArgs, extra_env_vars: dict[str, str | List[str]] = {}, extra_cmd_args: dict[str, str] = {}, extra_container_mounts: list[str] = [], git_repos: list[GitRepo] = [], nsys: NsysConfiguration | None = None, predictor: PredictorConfig | None = None, agent: str = 'grid_search', agent_steps: int = 1, agent_metrics: list[str] = ['default'], agent_reward_function: str = 'inverse', agent_config: dict[str, Any] | None = None)
Bases: TestDefinition
Test definition for the NIXL Elastic EP benchmark.