Node Operations

DGX Cloud Lepton provides node-level operations that let you manage individual nodes and monitor their health.

Actions

Pause Schedule

This action pauses scheduling on a specific node. Once the node is paused, no new workloads are scheduled to it.

Resume Schedule

This action resumes scheduling on a specific node. Once the node is resumed, it is available for new workloads again.
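The effect of the pause and resume actions on scheduling can be pictured with a small model. This is only an illustrative sketch: the `Node` class, the `schedulable` flag, and the helper functions are hypothetical names chosen for the example, not part of the DGX Cloud Lepton API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a node; not a DGX Cloud Lepton type."""
    name: str
    schedulable: bool = True  # assumed flag toggled by pause/resume


def pause_schedule(node: Node) -> None:
    # After pausing, the node no longer accepts new workloads.
    node.schedulable = False


def resume_schedule(node: Node) -> None:
    # After resuming, the node accepts new workloads again.
    node.schedulable = True


def eligible_nodes(nodes: List[Node]) -> List[str]:
    # A scheduler in this model only considers nodes that are not paused.
    return [n.name for n in nodes if n.schedulable]
```

In this model, pausing a node only removes it from the list returned by `eligible_nodes`; the sketch says nothing about workloads already running on the node.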

Force Reboot

This action force-reboots a specific node.

Note

The node may remain unhealthy temporarily during the reboot process.

Drain

Two drain policies are available:

Graceful Drain (recommended): Migrates current workloads to other nodes, and then terminates all workloads on the node.
Drain: Evicts all pods from the node and terminates any other active workloads.
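The difference between the two drain policies can be sketched as follows. This is a simplified model under stated assumptions, not the platform's implementation: `DrainPolicy`, `drain_node`, and the `spare_capacity` parameter (how many workloads other nodes can take) are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import List, Tuple


class DrainPolicy(Enum):
    GRACEFUL = "graceful"  # migrate first, then terminate what remains
    DRAIN = "drain"        # evict and terminate everything immediately


def drain_node(
    workloads: List[str],
    policy: DrainPolicy,
    spare_capacity: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (migrated, terminated) workloads for the drained node.

    Assumption: a graceful drain migrates as many workloads as other
    nodes can accept, then terminates whatever is still on the node.
    """
    if policy is DrainPolicy.GRACEFUL:
        capacity = max(0, spare_capacity)
        migrated = list(workloads[:capacity])
        terminated = list(workloads[capacity:])
    else:
        migrated = []
        terminated = list(workloads)
    return migrated, terminated
```

Either way, nothing is left on the node afterwards; the policies differ only in whether workloads get a chance to move before they are terminated.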

Release Protection

This feature protects a node from being released. Once it is enabled, the node can't be released via the dashboard or the API.
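Release protection behaves like a guard in front of the release action. The sketch below models that guard; `Node`, `release_protection`, `release_node`, and `ReleaseProtectedError` are hypothetical names for illustration and do not come from the DGX Cloud Lepton API.

```python
from dataclasses import dataclass


class ReleaseProtectedError(Exception):
    """Raised in this model when a protected node is asked to release."""


@dataclass
class Node:
    """Hypothetical stand-in for a node; not a DGX Cloud Lepton type."""
    name: str
    release_protection: bool = False
    released: bool = False


def release_node(node: Node) -> None:
    # The guard applies regardless of where the request came from
    # (dashboard or API), so it lives in the release path itself.
    if node.release_protection:
        raise ReleaseProtectedError(
            f"{node.name} is protected; disable release protection first"
        )
    node.released = True
```

In this model, a protected node must have protection disabled before `release_node` succeeds.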

Release

Two release policies are available:

Graceful Release (recommended): Migrates workloads to an idle node, then terminates all workloads on the node and releases it.
Release: Evicts all pods from the node, terminates any other active workloads, and then releases the node.

Health Check

DGX Cloud Lepton provides health checks that monitor every node across the following dimensions:

GPU: Including GPU temperature, utilization, power usage, etc.
General Hardware: Including CPU, memory, disk, network, PCI, and so on.
System: Including FUSE, OS, library status, and so on.
Others: Including Tailscale, kubelet, etc.

Trigger Health Check

Health checks run periodically, but you can also trigger an individual health check manually for a specific node.

Each health check card has a trigger button with a play icon. Hover over the icon to see a tooltip showing when the check last ran.

Press the button to trigger the health check.
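The relationship between a manual trigger and the "last checked" tooltip can be modeled with a short sketch. Everything here is an assumption made for illustration: `HealthCheck`, `probe`, `trigger`, and `tooltip` are hypothetical names, and the tooltip wording is invented rather than copied from the dashboard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class HealthCheck:
    """Hypothetical model of one health check card."""
    name: str
    probe: Callable[[], bool]            # returns True when healthy
    last_checked: Optional[datetime] = None
    last_result: Optional[bool] = None

    def trigger(self, now: Optional[datetime] = None) -> bool:
        # A periodic run and a manual trigger both end up here, so
        # either one refreshes the "last checked" timestamp.
        self.last_result = self.probe()
        self.last_checked = now or datetime.now(timezone.utc)
        return self.last_result

    def tooltip(self) -> str:
        if self.last_checked is None:
            return "Never checked"
        return f"Last checked: {self.last_checked.isoformat()}"
```

Passing `now` explicitly keeps the model easy to test; a real scheduler would simply call `trigger()` on its own interval.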
