Topology Aware Scheduling

Topology Aware Scheduling (TAS) lets you control where Dynamo places inference workload pods relative to the cluster’s network topology. By packing related pods within the same rack, block, or other topology domain, you reduce inter-node latency and improve throughput — especially for disaggregated serving where prefill, decode, and routing components communicate frequently.

TAS is opt-in. Existing deployments without topology constraints continue to work unchanged.

Prerequisites

| Requirement | Details |
| --- | --- |
| Grove | Installed on the cluster. See the Grove Installation Guide. |
| ClusterTopology CR | A cluster-scoped ClusterTopology resource configured by the cluster admin, mapping topology domain names to node labels. See Grove documentation for setup instructions. |
| KAI Scheduler | Required by Grove for topology-aware pod placement. |
| Dynamo operator | The latest Dynamo operator Helm chart includes read-only RBAC for clustertopologies.grove.io via a dedicated ClusterRole. This works for both cluster-wide and namespace-restricted operator deployments; no extra configuration is needed. |

Topology Domains

Topology domains are free-form identifiers defined by the cluster admin in the ClusterTopology CR. Common examples include region, zone, datacenter, block, rack, host, and numa, but any name matching the pattern ^[a-z0-9]([a-z0-9-]*[a-z0-9])?$ is valid (no leading or trailing hyphens).

Domain names must match exactly what is configured in the ClusterTopology CR referenced by topologyProfile. During DGD creation, the Dynamo webhook validates that every packDomain exists in the referenced ClusterTopology.
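To make the mapping concrete, a ClusterTopology CR might look roughly like the sketch below. This is illustrative only: the field names shown here (levels, domain, nodeLabel) are assumptions, so consult the Grove documentation for the authoritative schema. The ordering of levels matters, since levels listed earlier are broader and levels listed later are narrower.

```yaml
# Illustrative sketch only; field names are assumptions.
# See the Grove documentation for the real ClusterTopology schema.
apiVersion: grove.io/v1alpha1
kind: ClusterTopology
metadata:
  name: my-cluster-topology
spec:
  levels:                  # broadest first; later entries are narrower
    - domain: zone
      nodeLabel: topology.kubernetes.io/zone
    - domain: block
      nodeLabel: network.example.com/block
    - domain: rack
      nodeLabel: network.example.com/rack
    - domain: host
      nodeLabel: kubernetes.io/hostname
```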

When you specify a packDomain, the scheduler packs all replicas of the constrained component within a single instance of that domain. For example, packDomain: rack means “place all pods within the same rack.”

Topology Profile

Every DGD that uses topology constraints must reference a ClusterTopology CR by name via the topologyProfile field. This field is set at spec.topologyConstraint (the deployment level) and is inherited by all services — services must not set topologyProfile themselves.

The topologyProfile tells the Dynamo operator and the underlying framework which topology hierarchy to use for scheduling and validation.

Enabling TAS on a DGD

Add a topologyConstraint field to your DynamoGraphDeployment at the deployment level, at the service level, or both. The deployment level must include a topologyProfile. Each constraint specifies a packDomain.

Example 1: Deployment-Level Constraint (Services Inherit)

All services inherit the deployment-level constraint. This is the simplest configuration when you want uniform topology packing.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      dynamoNamespace: my-llm
      componentType: worker
      replicas: 2
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```

Example 2: Service-Level Constraint Only

Only the specified service gets topology packing. Other services are scheduled without topology constraints. The deployment level must still set topologyProfile.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
  services:
    VllmWorker:
      dynamoNamespace: my-llm
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: rack
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```

Example 3: Mixed (Deployment-Level Default + Per-Service Override)

Set a broad constraint at the deployment level and a narrower override on specific services. Service-level constraints must be equal to or narrower than the deployment-level constraint (determined by the ordering in the ClusterTopology CR).

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      dynamoNamespace: my-llm
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: block # narrower than zone — valid
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      # inherits zone from spec.topologyConstraint
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```

Hierarchy Rules

When both a deployment-level and a service-level topologyConstraint are set, the service’s packDomain must be equal to or narrower than the deployment-level packDomain. “Narrower” is determined by the ordering of levels in the referenced ClusterTopology CR — levels appearing later in the spec.levels array are considered narrower.

The Dynamo webhook rejects the DGD at creation time if a service constraint is broader than the deployment constraint (when validating against a ClusterTopology CR).

When only one level is set (deployment-level only or service-level only), no hierarchy check applies.

| Configuration | Behavior |
| --- | --- |
| spec.topologyConstraint set, service has none | Service inherits the deployment-level constraint |
| spec.topologyConstraint set, service also set | Both applied; service must be narrower or equal |
| spec.topologyConstraint.topologyProfile set, no packDomain at spec | Profile is provided for service-level constraints only |
| Neither set | No topology constraints (default) |
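As a concrete counterexample, assuming a profile in which zone precedes rack in spec.levels (so zone is broader), the webhook would reject a fragment like the following:

```yaml
# Rejected: the service-level packDomain (zone) is broader than the
# deployment-level packDomain (rack) in this hypothetical profile.
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: rack
  services:
    VllmWorker:
      topologyConstraint:
        packDomain: zone   # broader than rack; webhook rejects the DGD
```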

Field Reference

| Field | Level | Required | Description |
| --- | --- | --- | --- |
| topologyProfile | spec.topologyConstraint | Yes (when any constraint is set) | Name of the ClusterTopology CR defining the topology hierarchy. |
| topologyProfile | service-level topologyConstraint | N/A (not in schema) | Inherited from spec.topologyConstraint. The service-level type does not include this field. |
| packDomain | spec.topologyConstraint | Optional | Default pack domain for services that don’t specify their own. |
| packDomain | service-level topologyConstraint | Required | Pack domain for this service. Must match a level in the ClusterTopology CR. |

Multinode Considerations

For multinode services (services with a multinode section), the topology constraint is applied at the scaling group level rather than on individual worker pods. This is important because a multinode service spawns replicas × nodeCount pods — for example, 2 replicas with nodeCount: 4 produces 8 pods across 8 nodes. Applying the constraint at the scaling group level means the scheduler packs each replica’s set of nodes within the requested domain, without over-constraining individual pods to a single host.

For example, with this configuration:

```yaml
VllmWorker:
  replicas: 2
  multinode:
    nodeCount: 4
  topologyConstraint:
    packDomain: rack
```

Each replica’s 4 nodes are packed within a single rack. The two replicas may land in different racks (the constraint applies per-replica, not across all replicas).

Recommendation: For multinode services, use rack or block as the packDomain to keep workers within a high-bandwidth domain while still allowing the scheduler to spread them across hosts within that domain. Avoid host for multinode services, as packing multiple nodes onto one host is not meaningful.

Immutability

Topology constraints cannot be changed after the DGD is created. This includes:

  • Adding a topology constraint to a DGD or service that did not have one
  • Removing an existing topology constraint
  • Changing the topologyProfile value
  • Changing the packDomain value

To change topology constraints, delete and recreate the DGD. This matches the behavior of the underlying framework, which enforces immutability on topology constraints for generated resources.

Monitoring Topology Enforcement

When any topology constraint is set, the DGD status includes a TopologyLevelsAvailable condition that reports whether the topology levels referenced by your constraints still exist in the cluster topology.

Healthy state:

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "True"
      reason: AllTopologyLevelsAvailable
      message: "All required topology levels are available in the cluster topology"
```

Degraded state (e.g., an admin removed a topology level from the ClusterTopology CR after deployment):

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "False"
      reason: TopologyLevelsUnavailable
      message: "Topology level 'rack' is no longer available in the cluster topology"
```

When topology levels become unavailable, Dynamo emits a Warning event on the DGD. The deployment may still appear Ready because the underlying framework keeps pods running, but topology placement is no longer guaranteed.

Troubleshooting

DGD rejected: “ClusterTopology not found”

The Dynamo webhook validates that the ClusterTopology CR referenced by topologyProfile exists when any topology constraint is set. If it cannot read the ClusterTopology CR:

  • Verify that the cluster admin has created the ClusterTopology resource named in topologyProfile. See the Grove documentation for setup.
  • Verify that the Dynamo operator has RBAC to read clustertopologies.grove.io (included in the default Helm chart).
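The default Helm chart already ships this access, but if you manage RBAC manually, the required permissions look roughly like the following standard Kubernetes ClusterRole (the resource name here is illustrative; the chart installs an equivalent rule):

```yaml
# Illustrative ClusterRole granting the operator read access to
# ClusterTopology resources. The name is an assumption, not the
# name used by the Helm chart.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynamo-operator-clustertopology-reader
rules:
  - apiGroups: ["grove.io"]
    resources: ["clustertopologies"]
    verbs: ["get", "list", "watch"]
```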

DGD rejected: “packDomain not found in cluster topology”

The specified packDomain does not exist as a level in the referenced ClusterTopology CR. Check which domains are defined:

```bash
kubectl get clustertopology <topology-profile-name> -o yaml
```

Ensure the domain you are requesting (e.g., rack) is configured in the ClusterTopology with a corresponding node label.

DGD rejected: “topologyProfile is required”

Any DGD that has a topology constraint (at the spec or service level) must set spec.topologyConstraint.topologyProfile to the name of a ClusterTopology CR. Add the topologyProfile field to spec.topologyConstraint.
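For example, the minimal addition looks like this (substituting the name of your own ClusterTopology CR):

```yaml
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology   # name of an existing ClusterTopology CR
```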

Pods stuck in Pending

The scheduler cannot satisfy the topology constraint. Common causes:

  • Not enough nodes within a single instance of the requested domain (e.g., requesting 8 GPUs packed in one rack, but no rack has 8 available GPUs).
  • Node labels do not match the ClusterTopology configuration.

Inspect scheduler events for details:

```bash
kubectl describe pod <pod-name> -n <namespace>
```

TopologyLevelsAvailable is False

The DGD was deployed successfully, but the topology definition has since changed. The underlying framework detected that one or more required topology levels are no longer available.

  • Check the condition message for specifics.
  • Inspect the ClusterTopology CR to see if a domain was removed or renamed.
  • If the topology was intentionally changed, delete and recreate the DGD to pick up the new topology.

DGD rejected: hierarchy violation

A service-level packDomain is broader than the deployment-level packDomain. “Broader” and “narrower” are determined by the order of levels in the ClusterTopology CR — levels appearing earlier in spec.levels are broader.

Ensure service-level constraints are equal to or narrower than the deployment-level constraint.