Node Operations

DGX Cloud Lepton provides node-level operations to help you manage your nodes.

Actions

Pause Schedule

This action pauses scheduling on a specific node. Once the node is paused, no new workloads are scheduled to it.

Resume Schedule

This action resumes scheduling on a specific node. Once the node is resumed, it is available for new workloads again.

Force Reboot

This action allows you to force reboot a specific node.

Note

The node may remain unhealthy temporarily during the reboot process.

Drain

We provide two drain policies for you to choose from.

  • Graceful Drain (recommended): Migrates current workloads to other nodes, then terminates all workloads on the node.
  • Drain: Evicts all pods from the node and terminates any other active workloads.

Change Node Stage

You may sometimes need to change a node's stage manually. The following actions are available:

  • Recall Node: Puts the node in the Recalled stage. After you have drained all workloads from the node, Lepton automatically moves it to the Ready to Repair stage. Once the node is in this stage, you can safely begin the repair.
  • Repair Node: Available only when the node is in the Ready to Repair stage. This action puts the node in the Repairing stage, and the node is unavailable during the repair. Once the repair is complete, you can set the node to the Production stage.
  • Set to Production: After the node is repaired, set it to the Production stage so that workloads can be scheduled to it again. Make sure the repair is complete before doing so.
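
The stage changes described above form a small state machine. The following Python sketch is only an illustration of that flow, not Lepton's implementation or API: the Stage enum, the action names, and the apply_action helper are all hypothetical, and it assumes that a node is recalled from the Production stage, which the list above does not state explicitly.

```python
from enum import Enum


class Stage(Enum):
    """Node stages named in this document (hypothetical model)."""
    PRODUCTION = "Production"
    RECALLED = "Recalled"
    READY_TO_REPAIR = "Ready to Repair"
    REPAIRING = "Repairing"


# Allowed transitions as read from the list above.
# Keys are (current stage, action); values are the resulting stage.
# The action names are illustrative, not real API identifiers.
TRANSITIONS = {
    # Assumption: Recall Node is taken from the Production stage.
    (Stage.PRODUCTION, "recall"): Stage.RECALLED,
    # Automatic: happens once all workloads on the node are drained.
    (Stage.RECALLED, "workloads_drained"): Stage.READY_TO_REPAIR,
    (Stage.READY_TO_REPAIR, "repair"): Stage.REPAIRING,
    (Stage.REPAIRING, "set_to_production"): Stage.PRODUCTION,
}


def apply_action(stage: Stage, action: str) -> Stage:
    """Return the next stage, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[(stage, action)]
    except KeyError:
        raise ValueError(
            f"{action!r} is not allowed in stage {stage.value!r}"
        ) from None


if __name__ == "__main__":
    stage = Stage.PRODUCTION
    for action in ("recall", "workloads_drained", "repair", "set_to_production"):
        stage = apply_action(stage, action)
        print(f"{action} -> {stage.value}")
```

In this model, calling apply_action(Stage.RECALLED, "repair") raises ValueError, which mirrors the rule that a node can be repaired only from the Ready to Repair stage.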

Release Protection

This feature protects a node from being released. Once it is enabled, the node can't be released through the dashboard or the API.

Release

We provide two release policies for you to choose from.

  • Graceful Release (recommended): Migrates workloads to an idle node, then terminates all workloads on the node and releases it.
  • Release: Evicts all pods from the node, terminates any other active workloads, and then releases the node.
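
To show how release protection and the two release policies interact, here is a minimal toy model in Python. It is a sketch under stated assumptions, not the Lepton API: the Node class, its fields, and the release function are hypothetical, and workloads are represented as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a node."""
    name: str
    release_protection: bool = False
    workloads: List[str] = field(default_factory=list)
    released: bool = False


def release(node: Node, graceful: bool = True,
            idle_node: Optional[Node] = None) -> None:
    """Release a node, honoring release protection.

    graceful=True models Graceful Release: workloads are first migrated
    to an idle node (if one is given), then the node is cleared.
    graceful=False models Release: workloads are simply terminated.
    """
    if node.release_protection:
        # A protected node cannot be released by any route.
        raise PermissionError(f"{node.name} has release protection enabled")
    if graceful and idle_node is not None:
        idle_node.workloads.extend(node.workloads)  # migrate first
    node.workloads.clear()  # terminate whatever is left on the node
    node.released = True


if __name__ == "__main__":
    busy = Node("node-a", workloads=["job-1", "job-2"])
    spare = Node("node-b")
    release(busy, graceful=True, idle_node=spare)
    print(busy.released, spare.workloads)
```

The protection check comes first on purpose: in this model, disabling release protection is a separate, explicit step that must happen before either release policy can run.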

Health Check

DGX Cloud Lepton provides health checks that let you monitor the health of every node across the following dimensions:

  • GPU: Including GPU temperature, utilization, power usage, etc.
  • General Hardware: Including CPU, memory, disk, network, PCI, and so on.
  • System: Including FUSE, OS, library status, and so on.
  • Others: Including Tailscale, kubelet, etc.

Trigger Health Check

Health checks run periodically, but you can also manually trigger any individual health check for a specific node.

Each health check card has a trigger button with a play icon. Hover over the icon to see a tooltip showing when the check last ran.

Press the button to trigger the health check.

Copyright @ 2025, NVIDIA Corporation.