Modeling and API Simulation

Topograph models are YAML files used to simulate discovered topology without querying a real cloud API, NetQ instance, InfiniBand fabric, or Kubernetes cluster. They are primarily used by tests and local development, but they are also useful when validating a scheduler integration against known topology shapes.

A model describes the same canonical topology that real providers eventually produce:

  • A switch tree, used for Slurm topology/tree output and Kubernetes leaf / spine / core labels
  • Node membership in accelerated domains, used for block topology and accelerator labels
  • Optional per-node attributes used by provider simulations

Model loading lives in pkg/models. Model fixtures live under tests/models/.

Where Models Are Used

Models are consumed in two different simulation flows.

Test Provider

The test provider simulates the Topograph API lifecycle itself. It can return successful topology output, delayed completion, malformed-request failures, provider failures, or a request that remains pending.

Use it when testing clients that call:

  • POST /v1/generate
  • GET /v1/topology?uid=<request-id>

For the complete API status-code simulation behavior, see Test Mode and Test Provider.

Provider Simulations

Several providers also have simulation variants, such as:

  • aws-sim
  • gcp-sim
  • oci-sim
  • nebius-sim
  • lambdai-sim
  • dsx-sim

These providers load a model file and then simulate that provider’s API responses. This is useful when you want to exercise the normal provider translation logic without real provider credentials or infrastructure.

Simulation providers share these common parameters:

| Parameter | Required | Description |
| --- | --- | --- |
| modelFileName | Yes | Model file to load. A basename such as medium.yaml is loaded from tests/models/; absolute and relative paths are also supported. |
| api_error | No | Provider-specific test hook used by unit tests to simulate API failures. |
| trimTiers | No | Number of topology tiers to trim where supported by the simulated provider. |

Example request:

    {
      "provider": {
        "name": "aws-sim",
        "params": {
          "modelFileName": "medium.yaml"
        }
      },
      "engine": {
        "name": "slurm",
        "params": {
          "plugin": "topology/block"
        }
      }
    }

Model File Shape

A model has three top-level sections:

    switches:
      ...
    nodes:
      ...
    capacity_blocks:
      ...

All three sections are maps. nodes and capacity_blocks are flexible: you can specify node membership in either section, and Topograph completes the missing side during model loading.

Switches

The switches map describes the network hierarchy. Each key is the switch ID. Each value may contain:

| Field | Description |
| --- | --- |
| metadata | Key-value metadata inherited by descendant nodes. Common keys are region, availability_zone, and group. |
| switches | Child switch IDs. |
| nodes | Compute node names attached to this switch. Compact node ranges are supported. |

Example:

    switches:
      core:
        metadata:
          region: us-west
        switches: [spine]
      spine:
        metadata:
          availability_zone: zone1
        switches: [leaf1, leaf2]
      leaf1:
        metadata:
          group: cb1
        nodes: ["n[1-2]"]
      leaf2:
        metadata:
          group: cb2
        nodes: [n3]

Switch rules:

  • A switch can have at most one parent switch.
  • A node can be attached to at most one switch.
  • If a switch references a node, that node must either exist in nodes or be generated from capacity_blocks.
  • Entries in a switch's nodes list are expanded using the same compact range syntax used elsewhere.
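
The switch rules above can be checked mechanically. The following is a minimal sketch, not Topograph's loader: the function name validate_switches and the error wording are hypothetical, and it assumes the model has already been parsed into plain dictionaries with compact ranges expanded.

```python
def validate_switches(switches, known_nodes):
    """Return a list of rule violations; an empty list means no rule was broken.

    `switches` maps switch ID -> {"switches": [...], "nodes": [...]}.
    `known_nodes` is the set of node names defined in `nodes` or generated
    from `capacity_blocks` (computed elsewhere; an assumption of this sketch).
    """
    errors = []
    switch_parent = {}  # child switch ID -> first parent seen
    node_switch = {}    # node name -> first switch seen

    for switch_id, spec in switches.items():
        spec = spec or {}
        for child in spec.get("switches", []):
            if child in switch_parent:
                errors.append(
                    f"switch {child!r} has two parents: "
                    f"{switch_parent[child]!r} and {switch_id!r}"
                )
            else:
                switch_parent[child] = switch_id
        for node in spec.get("nodes", []):
            if node in node_switch:
                errors.append(
                    f"node {node!r} is attached to two switches: "
                    f"{node_switch[node]!r} and {switch_id!r}"
                )
            else:
                node_switch[node] = switch_id
            if node not in known_nodes:
                errors.append(f"node {node!r} under {switch_id!r} is not defined")
    return errors


# The example tree from this section, with "n[1-2]" already expanded.
example = {
    "core": {"switches": ["spine"]},
    "spine": {"switches": ["leaf1", "leaf2"]},
    "leaf1": {"nodes": ["n1", "n2"]},
    "leaf2": {"nodes": ["n3"]},
}
print(validate_switches(example, {"n1", "n2", "n3"}))  # prints []
```

Collecting violations in a list rather than raising on the first one makes it easier to fix a hand-written fixture in a single pass.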

Nodes

The nodes map describes compute nodes directly. Each key is the node name. The value may contain:

| Field | Description |
| --- | --- |
| name | Optional. If set, it must match the map key. Usually omitted. |
| capacity_block_id | Optional accelerated domain ID. If set and capacity_blocks is omitted, Topograph creates the corresponding capacity block entry. |
| attributes.nvlink | Optional accelerated-domain / NVLink identifier. Used by block topology simulation paths. |
| attributes.status | Optional node status metadata. |
| attributes.timestamp | Optional timestamp metadata. |
| attributes.gpus | Optional GPU inventory details. |

Example:

    nodes:
      n1:
        capacity_block_id: cb1
        attributes:
          nvlink: nvl1
      n2:
        attributes:
          nvlink: nvl1

Node rules:

  • capacity_block_id is optional.
  • Nodes without capacity_block_id are still valid compute nodes.
  • If capacity_block_id is set and capacity_blocks is omitted, Topograph creates the capacity block and adds the node to it.
  • If a node is listed under capacity_blocks.<id>.nodes, Topograph fills in the node’s missing capacity_block_id.
  • If both sides specify different capacity block IDs for the same node, model loading fails.

Capacity Blocks

The capacity_blocks map describes accelerated domains. Each key is the capacity block ID. The value may contain:

| Field | Description |
| --- | --- |
| nodes | Optional list of node names in this capacity block. Compact ranges are supported. |
| attributes.nvlink | Optional NVLink / accelerator domain identifier applied to nodes generated from this capacity block, and to listed top-level nodes when provided. |

Example:

    capacity_blocks:
      cb1:
        nodes: ["n[1-2]"]
        attributes:
          nvlink: nvl1
      cb2: {}

Capacity block rules:

  • The entire capacity_blocks section may be omitted.
  • Individual capacity block entries may omit nodes.
  • Capacity block entries with no corresponding nodes are allowed and preserved.
  • If top-level nodes is omitted, capacity_blocks.<id>.nodes creates node entries automatically.
  • If top-level nodes is present, capacity_blocks.<id>.nodes must reference nodes in the top-level nodes map.
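
The node and capacity block rules describe a two-way reconciliation of membership. The sketch below illustrates that idea under stated assumptions: reconcile is a hypothetical name, only membership is handled (attributes such as nvlink are ignored), ranges are already expanded, and a capacity block named by a node is created even when capacity_blocks is present but lacks that ID. It is not Topograph's implementation.

```python
from typing import Optional


def reconcile(nodes: Optional[dict], capacity_blocks: Optional[dict]):
    """Complete both sides of capacity block membership.

    Pass None for a section that is omitted from the model.
    Returns (nodes, capacity_blocks) as new dictionaries.
    """
    nodes_declared = nodes is not None
    out_nodes = {name: dict(spec or {}) for name, spec in (nodes or {}).items()}
    out_blocks = {
        block_id: {"nodes": list((spec or {}).get("nodes", []))}
        for block_id, spec in (capacity_blocks or {}).items()
    }

    # Capacity block side -> node side.
    for block_id, block in out_blocks.items():
        for name in block["nodes"]:
            if name not in out_nodes:
                if nodes_declared:
                    raise ValueError(
                        f"capacity block {block_id!r} references unknown node {name!r}"
                    )
                out_nodes[name] = {}  # node generated from the capacity block
            current = out_nodes[name].setdefault("capacity_block_id", block_id)
            if current != block_id:
                raise ValueError(
                    f"node {name!r} is in {block_id!r} but declares {current!r}"
                )

    # Node side -> capacity block side.
    for name, spec in out_nodes.items():
        block_id = spec.get("capacity_block_id")
        if block_id is None:
            continue  # nodes without a capacity block remain valid
        block = out_blocks.setdefault(block_id, {"nodes": []})
        if name not in block["nodes"]:
            block["nodes"].append(name)

    return out_nodes, out_blocks


# A block with no nodes (cb2) is preserved; cb1 gains n1 from the node side.
_, blocks = reconcile({"n1": {"capacity_block_id": "cb1"}}, {"cb1": {}, "cb2": {}})
print(blocks)  # prints {'cb1': {'nodes': ['n1']}, 'cb2': {'nodes': []}}
```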

Compact Ranges

Model node lists support compact ranges:

    nodes: ["n[1-4]", "gpu[001-004]", node9]

These expand to:

n1, n2, n3, n4, gpu001, gpu002, gpu003, gpu004, node9

Ranges are accepted in:

  • switches.<switch>.nodes
  • capacity_blocks.<id>.nodes
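
To make the expansion concrete, here is a small sketch that reproduces the example above. It assumes a single bracketed numeric range per entry and that zero padding follows the width of the range start; the function names are hypothetical, and Topograph's own parser may accept more forms (for example comma-separated lists inside the brackets).

```python
import re

# One bracketed numeric range, e.g. "gpu[001-004]"; anything else is a plain name.
_RANGE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)$"
)


def expand_entry(entry):
    """Expand one list entry into node names; plain names pass through."""
    match = _RANGE.match(entry)
    if match is None:
        return [entry]
    start, end = match["start"], match["end"]
    if int(start) > int(end):
        raise ValueError(f"descending range in {entry!r}")
    # Pad to the width of the range start: "001" gives 3 digits, "1" gives none.
    width = len(start)
    return [
        f"{match['prefix']}{str(number).zfill(width)}{match['suffix']}"
        for number in range(int(start), int(end) + 1)
    ]


def expand_nodes(entries):
    """Expand every entry of a nodes list, preserving order."""
    names = []
    for entry in entries:
        names.extend(expand_entry(entry))
    return names


print(expand_nodes(["n[1-4]", "gpu[001-004]", "node9"]))
# prints ['n1', 'n2', 'n3', 'n4', 'gpu001', 'gpu002', 'gpu003', 'gpu004', 'node9']
```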

Derived Data

After YAML parsing, Topograph completes the model before simulation uses it:

  • Switch node ranges are expanded.
  • Capacity block node ranges are expanded.
  • Node names are copied from their map keys.
  • Switch names are copied from their map keys.
  • Missing nodes can be created from capacity_blocks.<id>.nodes.
  • Missing capacity block entries can be created from node capacity_block_id values.
  • Node NetLayers is derived from the switch path from leaf to root.
  • Node Metadata is built by merging switch metadata along the same path.
  • Instances is derived from node names and grouped by metadata.region; nodes without a region use none.

These derived fields are not written in YAML.
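
The last three derivations can be sketched as follows. This is an illustration only: derive and the dictionary keys net_layers and metadata are hypothetical stand-ins for the NetLayers and Metadata fields, Instances is modeled as a simple region-to-names mapping, the switch tree is assumed to be acyclic, and letting switches closer to the node win on conflicting metadata keys is an assumption rather than documented behavior.

```python
def derive(switches, node_names):
    """Return (per-node derived fields, region -> node names)."""
    parent = {}    # child switch ID -> parent switch ID
    attached = {}  # node name -> switch it is attached to
    for switch_id, spec in switches.items():
        spec = spec or {}
        for child in spec.get("switches", []):
            parent[child] = switch_id
        for node in spec.get("nodes", []):
            attached[node] = switch_id

    derived, instances = {}, {}
    for name in node_names:
        # Walk upward from the attached switch to the root (leaf to root order).
        layers = []
        switch_id = attached.get(name)
        while switch_id is not None:
            layers.append(switch_id)
            switch_id = parent.get(switch_id)

        # Merge metadata root-first so switches closer to the node win on
        # key conflicts (precedence is an assumption of this sketch).
        metadata = {}
        for switch_id in reversed(layers):
            metadata.update((switches.get(switch_id) or {}).get("metadata", {}))

        derived[name] = {"net_layers": layers, "metadata": metadata}
        # Nodes without a region are grouped under "none".
        instances.setdefault(metadata.get("region", "none"), []).append(name)
    return derived, instances


tree = {
    "core": {"metadata": {"region": "us-west"}, "switches": ["spine"]},
    "spine": {"metadata": {"availability_zone": "zone1"}, "switches": ["leaf1", "leaf2"]},
    "leaf1": {"metadata": {"group": "cb1"}, "nodes": ["n1", "n2"]},
    "leaf2": {"metadata": {"group": "cb2"}, "nodes": ["n3"]},
}
derived, instances = derive(tree, ["n1", "n2", "n3", "n4"])
print(derived["n1"]["net_layers"])  # prints ['leaf1', 'spine', 'core']
print(instances)  # prints {'us-west': ['n1', 'n2', 'n3'], 'none': ['n4']}
```

Here n4 is not attached to any switch, so it has no network layers and falls into the none group.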

Complete Examples

Nodes From Capacity Blocks

This compact model omits the nodes section. Nodes are created from capacity block membership.

    switches:
      core:
        switches: [leaf]
      leaf:
        nodes: ["n[1-2]", n3]

    capacity_blocks:
      cb1:
        nodes: ["n[1-2]"]
        attributes:
          nvlink: nvl1
      cb2:
        nodes: [n3]
        attributes:
          nvlink: nvl2

After loading:

  • n1 and n2 belong to cb1 and have attributes.nvlink: nvl1
  • n3 belongs to cb2 and has attributes.nvlink: nvl2
  • All three nodes have network layers [leaf, core]

Capacity Blocks From Nodes

This model omits capacity_blocks. Topograph creates cb1 from n1.capacity_block_id.

    nodes:
      n1:
        capacity_block_id: cb1
        attributes:
          nvlink: nvl1
      n2:
        attributes:
          nvlink: nvl2

After loading:

  • cb1.nodes contains n1
  • cb1.attributes.nvlink is populated from n1.attributes.nvlink
  • n2 remains a valid node without capacity block membership

Orphan Capacity Block

This model is valid. It declares a capacity block, cb2, that currently has no nodes.

    nodes:
      n1:
        capacity_block_id: cb1

    capacity_blocks:
      cb1: {}
      cb2: {}

After loading:

  • cb1.nodes contains n1
  • cb2 remains present with no nodes

Simulating the API

To simulate the Topograph API lifecycle, configure the test provider:

    http:
      port: 49021
      ssl: false

    provider: test
    engine: slurm

    requestAggregationDelay: 2s

Then submit a request that names a model:

    {
      "provider": {
        "name": "test",
        "params": {
          "generateResponseCode": 202,
          "topologyResponseCode": 200,
          "modelFileName": "small-tree.yaml"
        }
      },
      "engine": {
        "name": "slurm"
      }
    }

Expected flow:

  1. POST /v1/generate returns 202 Accepted and a request ID.
  2. GET /v1/topology?uid=<request-id> returns 202 Accepted while the request is queued or processing.
  3. When processing completes, /v1/topology returns 200 OK with the selected engine output.
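
A client exercising this flow might look like the sketch below. It uses only the Python standard library. The function names, the retry limits, and above all the assumption that the POST response body is the request ID as plain text are illustrative and not confirmed by this page; adjust the parsing to whatever your Topograph version returns.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request


def http_call(method, url, body=None):
    """Send a JSON request and return (status code, response text)."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode()
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; report them as statuses.
        return err.code, err.read().decode()


def generate_and_wait(base_url, payload, call=http_call,
                      interval=1.0, attempts=30, sleep=time.sleep):
    """Submit a topology request, then poll until it completes or fails."""
    status, body = call("POST", f"{base_url}/v1/generate", payload)
    if status != 202:
        raise RuntimeError(f"generate failed: {status} {body}")
    # Assumption: the response body is the request ID as plain text.
    request_id = urllib.parse.quote(body.strip(), safe="")

    for _ in range(attempts):
        status, body = call("GET", f"{base_url}/v1/topology?uid={request_id}")
        if status == 200:
            return body   # engine output
        if status != 202:
            raise RuntimeError(f"topology failed: {status} {body}")
        sleep(interval)   # still queued or processing
    raise TimeoutError(f"request {request_id} still pending after {attempts} polls")
```

With the configuration above, a call such as generate_and_wait("http://localhost:49021", payload) would submit the request and return the engine output once it is ready. The call and sleep parameters exist so that tests can substitute a fake transport and skip real delays.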

To simulate API failures, set generateResponseCode, topologyResponseCode, and errorMessage in provider.params. For example:

    {
      "provider": {
        "name": "test",
        "params": {
          "generateResponseCode": 202,
          "topologyResponseCode": 500,
          "errorMessage": "simulated provider failure"
        }
      },
      "engine": {
        "name": "slurm"
      }
    }

Choosing the Right Simulation Path

Use the test provider when you want to validate API-client behavior:

  • Request IDs
  • Polling
  • Pending responses
  • Error status codes
  • Retry behavior

Use a *-sim provider when you want to validate provider-specific topology translation:

  • AWS, GCP, OCI, Nebius, Lambda AI, or DSX topology paths
  • Pagination behavior in simulated provider APIs
  • Engine output generated from provider-shaped data
  • Tree and block topology output from the same model

Validation Checklist

Before using a new model in a regression test:

  • Confirm no switch is listed as a child of more than one parent.
  • Confirm every node attached to a switch is defined in nodes or generated from capacity_blocks.
  • Confirm no node appears under two switches.
  • Confirm capacity block membership does not conflict with node capacity_block_id.
  • Run the relevant provider simulation test or API flow with the target engine.