# Topograph Slinky Engine


## Overview

The Slinky engine is Topograph's engine for SLURM clusters running on Kubernetes. It is designed to work with the Slinky project, an open-source set of integration tools from SchedMD that brings SLURM capabilities into Kubernetes environments.

While the Slinky project provides comprehensive SLURM-on-Kubernetes orchestration (operators, schedulers, exporters, etc.), Topograph’s slinky engine complements this ecosystem by providing topology discovery and configuration management for SLURM clusters running in Kubernetes.

The Slinky engine bridges the gap between Kubernetes infrastructure and SLURM workload management by updating SLURM topology configurations stored in Kubernetes ConfigMaps.

## How It Works

  1. Node Discovery: Queries Kubernetes nodes and SLURM pods to build a topology map
  2. Topology Generation: Creates SLURM topology configuration (tree or block format)
  3. ConfigMap Management: Updates the specified ConfigMap with new topology data including metadata annotations for tracking and debugging
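Steps 1 and 2 can be sketched in miniature: once node discovery has produced a node-to-switch map, tree-format generation boils down to emitting `SwitchName=` lines like those SLURM expects in `topology.conf`. The function below is an illustrative simplification, not Topograph's actual code.

```python
# Illustrative sketch of steps 1-2: turning a discovered node->switch map
# into SLURM tree-topology lines (simplified; not Topograph's real logic).

from collections import defaultdict

def tree_topology(node_to_switch: dict[str, str], root: str = "sw1") -> str:
    """Emit one SwitchName line per leaf switch, plus a root tying them together."""
    leaves = defaultdict(list)
    for node, switch in sorted(node_to_switch.items()):
        leaves[switch].append(node)
    lines = [f"SwitchName={root} Switches={','.join(sorted(leaves))}"]
    for switch in sorted(leaves):
        lines.append(f"SwitchName={switch} Nodes={','.join(leaves[switch])}")
    return "\n".join(lines)

print(tree_topology({"node1": "sw2", "node2": "sw2", "node5": "sw3"}))
```

Real output additionally compresses node names into ranges (e.g. `node[1-4]`), as shown in the ConfigMap example later in this page.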

## Design

### Configuration

Topograph is deployed as a standard Kubernetes application using a Helm chart. It is configured through a configuration file stored in a ConfigMap and mounted into the Topograph container at /etc/topograph/topograph-config.yaml. In addition, each topology request can carry extra parameters in its payload. The parameters for both the configuration file and the topology request are defined in the global section of the Helm values file, as shown below:

Shared with the Kubernetes engine: the Topograph API server runs as a Kubernetes workload regardless of the engine, so everything about the chart's deployment surface is shared with the Kubernetes engine. This includes values-schema validation, helm test hooks, access patterns (ClusterIP port-forward, Ingress, Gateway API HTTPRoute), the Prometheus ServiceMonitor, NetworkPolicy guidance, and the chart's README.md. These topics are documented authoritatively in engines/k8s.md and engines/k8s.md#exposing-the-topograph-api, and those sections apply equally to Slinky deployments.

```yaml
global:
  # provider – name of the cloud provider or on-prem environment.
  # Supported values: "aws", "gcp", "oci", "nebius", "netq", "infiniband-k8s", "dra".
  provider: aws
  engine: slinky
  engineParams:
    namespace: ns-slinky        # Namespace where Slinky is running
    podSelector:                # Label selector for pods running SLURM nodes
      matchLabels:
        app.kubernetes.io/component: compute
    plugin: topology/block      # Name of the topology plugin
    blockSizes: [4]             # (Optional) Block sizes for the block topology plugin
    topologyConfigmapName: slurm-config  # Name of the ConfigMap containing the topology config
    topologyConfigPath: topology.conf    # Key in the ConfigMap for the topology config
```
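To make the `blockSizes` parameter concrete, the sketch below shows one plausible way block-format output is shaped: nodes are chunked into base blocks of the smallest configured size, and the full size list is emitted as `BlockSizes`. This is an illustrative simplification, not Topograph's actual algorithm.

```python
# Sketch of how blockSizes shapes topology/block output (illustrative only;
# not Topograph's real implementation). Nodes are chunked into base blocks
# of the smallest configured size.

def block_topology(nodes: list[str], block_sizes: list[int]) -> str:
    base = block_sizes[0]  # smallest size defines the base block
    lines = []
    for i in range(0, len(nodes), base):
        chunk = nodes[i:i + base]
        lines.append(f"BlockName=block{i // base + 1} Nodes={','.join(chunk)}")
    lines.append("BlockSizes=" + ",".join(str(s) for s in block_sizes))
    return "\n".join(lines)

print(block_topology([f"node{i}" for i in range(1, 9)], [4]))
```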

### Per-partition topologies

When per-partition topologies are configured, each entry may declare how its node membership is resolved:

| Field | Behavior |
| --- | --- |
| `nodes` | Explicit SLURM node list. Takes precedence over `podSelector`. |
| `podSelector` | Kubernetes LabelSelector matching the slurmd pods in the partition. The engine lists pods in the engine's namespace, filters to Ready pods, and reads each pod's SLURM name from the `slurm.node.name` label (falling back to `pod.spec.hostname`). |
| (neither) | The engine falls back to running `scontrol show partition <name>` inside a login pod (legacy behavior). |

`nodes` and `podSelector` are mutually exclusive on the same entry; configuring both returns a validation error at engine load time.

```yaml
global:
  engineParams:
    namespace: ns-slinky
    podSelector:
      matchLabels:
        app.kubernetes.io/component: compute
    topologies:
      gpu-partition:
        plugin: topology/block
        blockSizes: [8, 16]
        podSelector:            # partition membership by pod labels
          matchLabels:
            app.kubernetes.io/component: compute
            slurm.partition: gpu
      cpu-partition:
        plugin: topology/tree
        nodes: ["cpu-[001-032]"]  # explicit list
      default:
        plugin: topology/flat
        clusterDefault: true      # no podSelector, no nodes → scontrol fallback
```
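The resolution precedence described above (`nodes`, then `podSelector`, then the `scontrol` fallback) can be sketched as follows. The pod dictionaries and field names here are illustrative, not the actual engine's data structures.

```python
# Sketch of per-partition membership resolution (illustrative; not the
# actual engine code). Precedence: nodes > podSelector > scontrol fallback.

def resolve_members(entry: dict, pods: list[dict]) -> list[str]:
    """Return the partition's SLURM node names for one topology entry."""
    if "nodes" in entry and "podSelector" in entry:
        raise ValueError("nodes and podSelector are mutually exclusive")
    if "nodes" in entry:            # explicit SLURM node list wins
        return entry["nodes"]
    if "podSelector" in entry:      # match Ready pods by label
        want = entry["podSelector"]["matchLabels"]
        return [
            p.get("slurm_node_name") or p["hostname"]  # label, else hostname
            for p in pods
            if p["ready"] and all(p["labels"].get(k) == v for k, v in want.items())
        ]
    return ["<scontrol show partition fallback>"]  # legacy path

pods = [
    {"labels": {"slurm.partition": "gpu"}, "ready": True,
     "slurm_node_name": "gpu-001", "hostname": "pod-a"},
    {"labels": {"slurm.partition": "cpu"}, "ready": True,
     "slurm_node_name": "cpu-001", "hostname": "pod-b"},
]
print(resolve_members({"podSelector": {"matchLabels": {"slurm.partition": "gpu"}}}, pods))
```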

### ConfigMap Annotations

The Slinky engine automatically adds metadata annotations to managed ConfigMaps for improved observability:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: slurm-config
  annotations:
    # Topograph metadata
    topograph.nvidia.com/engine: "slinky"
    topograph.nvidia.com/topology-managed-by: "topograph"
    topograph.nvidia.com/last-updated: "2024-01-01T10:11:00Z"
    topograph.nvidia.com/slurm-namespace: "slurm"
    topograph.nvidia.com/plugin: "topology/tree"
    topograph.nvidia.com/block-sizes: "8,16,32"

    # Original annotations preserved
    meta.helm.sh/release-name: slurm
    meta.helm.sh/release-namespace: slurm
data:
  topology.conf: |
    SwitchName=sw1 Switches=sw[2-3]
    SwitchName=sw2 Nodes=node[1-4]
    SwitchName=sw3 Nodes=node[5-8]
```
#### Annotation Reference

| Annotation | Description |
| --- | --- |
| `topograph.nvidia.com/engine` | Engine that manages this ConfigMap |
| `topograph.nvidia.com/topology-managed-by` | Indicates topograph manages the topology data |
| `topograph.nvidia.com/last-updated` | RFC 3339 timestamp of the last update |
| `topograph.nvidia.com/slurm-namespace` | SLURM cluster namespace |
| `topograph.nvidia.com/plugin` | Topology plugin used (tree/block) |
| `topograph.nvidia.com/block-sizes` | Block sizes for block topology |
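The key property of the update, as the example above shows, is that Topograph's metadata keys are merged into the ConfigMap's existing annotations rather than replacing them. A minimal sketch of that merge (illustrative; not Topograph's actual code):

```python
# Sketch of the annotation update: merge Topograph's metadata keys into the
# ConfigMap's existing annotations, preserving originals such as
# meta.helm.sh/* (illustrative; not Topograph's actual code).

from datetime import datetime, timezone

PREFIX = "topograph.nvidia.com/"

def merged_annotations(existing: dict[str, str], engine: str, plugin: str) -> dict[str, str]:
    updated = dict(existing)  # keep all pre-existing annotations
    updated.update({
        PREFIX + "engine": engine,
        PREFIX + "topology-managed-by": "topograph",
        PREFIX + "plugin": plugin,
        PREFIX + "last-updated": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    return updated

anns = merged_annotations({"meta.helm.sh/release-name": "slurm"}, "slinky", "topology/tree")
print(anns["meta.helm.sh/release-name"])  # original Helm annotation survives
```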

## Usage Examples

Topograph runs autonomously in Kubernetes environments, including Slinky. When the Node Observer detects that a node has been added or removed, it sends topology requests to the Topograph API server, which then triggers an update to the network topology information within the cluster. However, if you want to manually trigger network topology discovery, you can send HTTP requests to the API server, as shown below.

### Topology Configuration in the Tree Format

```shell
curl -X POST -H "Content-Type: application/json" \
  -d '{
        "provider": {"name": "aws"},
        "engine": {
          "name": "slinky",
          "params": {
            "namespace": "ns-slinky",
            "podSelector": {
              "matchLabels": {
                "app.kubernetes.io/component": "compute"
              }
            },
            "topologyConfigPath": "topology.conf",
            "topologyConfigmapName": "slurm-config"
          }
        }
      }' \
  http://localhost:49021/v1/generate
```

### Topology Configuration in the Block Format

```shell
curl -X POST -H "Content-Type: application/json" \
  -d '{
        "provider": {"name": "aws"},
        "engine": {
          "name": "slinky",
          "params": {
            "namespace": "ns-slinky",
            "podSelector": {
              "matchLabels": {
                "app.kubernetes.io/component": "compute"
              }
            },
            "topologyConfigPath": "topology.conf",
            "topologyConfigmapName": "slurm-config",
            "plugin": "topology/block",
            "blockSizes": [8,16,32]
          }
        }
      }' \
  http://localhost:49021/v1/generate
```