dpsctl gpu-policy

dpsctl gpu-policy Usage Guide

Set GPU power policies on nodes of an active resource group.

Note: Currently, when GPU power policies are applied, their values are summed into a single node-level GPU policy. As a result, the limit actually set on each GPU can differ slightly from the value the user specified.

Usage

dpsctl [global options] gpu-policy [options]

Flags

Includes global dpsctl options.

   --node value  (can be multiple) nodeName=gpu0,gpu1,gpu2,...
   --help, -h    show help
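
Because --node can be given more than once, it can help to build the argument list programmatically. The following is a minimal sketch, not part of dpsctl: the helper name build_gpu_policy_args is hypothetical, the node002 values are made up for illustration, and it assumes each value is a per-GPU limit in watts listed in GPU index order, as in the Basic Usage example.

```python
import shlex


def build_gpu_policy_args(node_limits):
    """Build a dpsctl gpu-policy argument list (hypothetical helper).

    node_limits maps a node name to its per-GPU limits, assumed to be
    watts in GPU index order (gpu0, gpu1, ...).
    """
    args = ["dpsctl", "gpu-policy"]
    for node, watts in node_limits.items():
        # One --node flag per node, in the form nodeName=gpu0,gpu1,...
        args += ["--node", f"{node}=" + ",".join(str(w) for w in watts)]
    return args


limits = {
    "node001": [500, 550, 600, 700, 650, 700, 550, 600],
    "node002": [600, 600, 600, 600, 600, 600, 600, 600],  # illustrative values
}
command = build_gpu_policy_args(limits)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues.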

Examples

Basic Usage

The GPU policy values, together with the COMP_CPU and COMP_MEMORY components of the resource group default or entity policy, must sum to no less than the capability minimum of the default or entity node policy.

In this example, we have previously created and activated a resource group, example1, with a default Node-High policy, and three nodes: node001, node002, and node003. We want to set the GPU policy of node001 to the following values:

  • GPU0: 500W
  • GPU1: 550W
  • GPU2: 600W
  • GPU3: 700W
  • GPU4: 650W
  • GPU5: 700W
  • GPU6: 550W
  • GPU7: 600W

By looking at the Node-High policy (enforced on node001), we can see that the minimum node power budget is set to 5,600W, with a maximum of 10,200W. Additionally, our limits for COMP_CPU and COMP_MEMORY are 1,530W and 1,020W, respectively.

By summing our limits together, we can see that:

500W + 550W + 600W + 700W + 650W + 700W + 550W + 600W + 1,530W + 1,020W = 7,400W

which is within our node budget and is therefore a valid power configuration for node001.
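
The same check can be sketched in a few lines of Python before running dpsctl. This is an illustrative helper, not something dpsctl provides: the function name check_node_budget is hypothetical, and the upper-bound comparison is an assumption based on the "within our node budget" wording above (the stated rule only names the minimum).

```python
def check_node_budget(gpu_watts, comp_cpu, comp_memory, node_min, node_max):
    """Return (total, ok), where ok means node_min <= total <= node_max.

    Hypothetical helper; the max check is an assumption, see lead-in.
    """
    total = sum(gpu_watts) + comp_cpu + comp_memory
    return total, node_min <= total <= node_max


# Values taken from the node001 example and the Node-High policy above.
gpu_watts = [500, 550, 600, 700, 650, 700, 550, 600]
total, ok = check_node_budget(
    gpu_watts, comp_cpu=1530, comp_memory=1020, node_min=5600, node_max=10200
)
print(total, ok)  # prints: 7400 True
```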

Now we can apply our updated GPU policy with dpsctl:

$ dpsctl gpu-policy --node node001=500,550,600,700,650,700,550,600
{
  "results": [
    {
      "resource_name": "node001",
      "gpu_id": 0,
      "ok": true,
      "set_limit": 500.0,
      "diag_msg": "Success"
    },
    {
      "resource_name": "node001",
      "gpu_id": 1,
      "ok": true,
      "set_limit": 550.0,
      "diag_msg": "Success"
    },
    {
      "resource_name": "node001",
      "gpu_id": 2,
      "ok": true,
      "set_limit": 600.0,
      "diag_msg": "Success"
    },
    {
      "resource_name": "node001",
      "gpu_id": 3,
      "ok": true,
      "set_limit": 700.0,
      "diag_msg": "Success"
    },
    {
      "resource_name": "node001",
      "gpu_id": 4,
      "ok": true,
      "set_limit": 650.0,
      "diag_msg": "Success"
    },
    {
      "resource_name": "node001",
      "gpu_id": 5,
      "ok": true,
      "set_limit": 700.0,
      "diag_msg": "Success"
    },
    {
      "resource_name": "node001",
      "gpu_id": 6,
      "ok": true,
      "set_limit": 550.0,
      "diag_msg": "Success"
    },
    {
      "resource_name": "node001",
      "gpu_id": 7,
      "ok": true,
      "set_limit": 600.0,
      "diag_msg": "Success"
    }
  ],
  "status": {
    "ok": true,
    "diag_msg": "Success"
  }
}
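
When scripting around dpsctl, the JSON response can be scanned for GPUs whose policy was not applied. The sketch below is a hedged example that relies only on the field names visible in the output above (results, resource_name, gpu_id, ok, diag_msg); the helper name failed_gpus is hypothetical, and the sample is a shortened copy of that output.

```python
import json


def failed_gpus(response_text):
    """Return (resource_name, gpu_id, diag_msg) for each result where ok is not true."""
    response = json.loads(response_text)
    return [
        (r["resource_name"], r["gpu_id"], r["diag_msg"])
        for r in response.get("results", [])
        if not r.get("ok", False)
    ]


# Shortened copy of the response shown above (first two GPUs only).
sample = """
{
  "results": [
    {"resource_name": "node001", "gpu_id": 0, "ok": true,
     "set_limit": 500.0, "diag_msg": "Success"},
    {"resource_name": "node001", "gpu_id": 1, "ok": true,
     "set_limit": 550.0, "diag_msg": "Success"}
  ],
  "status": {"ok": true, "diag_msg": "Success"}
}
"""
print(failed_gpus(sample))  # prints: []
```

An empty list means every GPU in the response reported ok: true.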