Redfish API

Overview

Redfish is a standard RESTful API specification developed by the Distributed Management Task Force (DMTF) for managing and monitoring server hardware, storage, networking, and converged infrastructure. It provides a standardized way to interact with Baseboard Management Controllers (BMCs) and other hardware management interfaces.

DPS uses Redfish as its primary interface for communicating with compute nodes and power distribution equipment in the datacenter. Through Redfish, DPS can interact with:

  • Power Management - Set and monitor power limits for nodes, GPUs, CPUs, and memory
  • System Monitoring - Collect power consumption and performance metrics
  • Hardware Control - Manage system states and configurations

Example Endpoints by System

# H100 GPU power limit
GET /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_SXM_0/EnvironmentMetrics

# B200 GPU power limit
GET /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_0/EnvironmentMetrics

# GB200 processor module metrics
GET /redfish/v1/Chassis/HGX_ProcessorModule_0/EnvironmentMetrics
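
The example endpoints above differ only in how each platform names its processors. A minimal sketch of that mapping follows; the helper names and the `GPU_ID_PATTERNS` table are hypothetical and only encode the three example paths shown here, not a complete platform list.

```python
# Hypothetical lookup table: processor ID pattern per platform, taken from
# the example endpoints in this section.
GPU_ID_PATTERNS = {
    "H100": "GPU_SXM_{index}",
    "B200": "GPU_{index}",
}


def gpu_environment_metrics_path(platform: str, index: int) -> str:
    """Return the EnvironmentMetrics path for one GPU on the given platform."""
    try:
        gpu_id = GPU_ID_PATTERNS[platform].format(index=index)
    except KeyError:
        raise ValueError(f"no GPU path pattern known for {platform!r}") from None
    return (
        "/redfish/v1/Systems/HGX_Baseboard_0/Processors/"
        f"{gpu_id}/EnvironmentMetrics"
    )


def processor_module_metrics_path(index: int) -> str:
    """Return the GB200 processor module EnvironmentMetrics path."""
    return f"/redfish/v1/Chassis/HGX_ProcessorModule_{index}/EnvironmentMetrics"


print(gpu_environment_metrics_path("H100", 0))
print(gpu_environment_metrics_path("B200", 0))
print(processor_module_metrics_path(0))
```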

Component-Level Control

Redfish enables granular power management:

  • Node Power - Overall system power limits
  • GPU Power - Per-GPU power control and monitoring
  • CPU Power - Processor power management
  • Memory Power - DRAM power consumption control

Power Domains

Redfish supports power domain management for complex power distribution:

{
  "PowerDomains": [
    {
      "@odata.id": "/redfish/v1/PowerDomains/PD-A",
      "Id": "PD-A",
      "Name": "Power Domain A",
      "PowerCapacityWatts": 1150000,
      "PowerConsumedWatts": 850000
    }
  ]
}
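
As an illustration of how a client might read this document, the sketch below computes the remaining headroom and utilization for each domain. The function name `domain_headroom` is hypothetical; the field names and values are the ones in the example above.

```python
import json

# The example power domain document from above.
doc = json.loads("""
{
  "PowerDomains": [
    {
      "@odata.id": "/redfish/v1/PowerDomains/PD-A",
      "Id": "PD-A",
      "Name": "Power Domain A",
      "PowerCapacityWatts": 1150000,
      "PowerConsumedWatts": 850000
    }
  ]
}
""")


def domain_headroom(domain: dict) -> tuple:
    """Return (free watts, fraction of capacity in use) for one domain."""
    capacity = domain["PowerCapacityWatts"]
    consumed = domain["PowerConsumedWatts"]
    return capacity - consumed, consumed / capacity


for domain in doc["PowerDomains"]:
    free_watts, utilization = domain_headroom(domain)
    print(f"{domain['Id']}: {free_watts} W free, {utilization:.1%} used")
# prints "PD-A: 300000 W free, 73.9% used"
```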

Workload Power Profile Support (WPPS)

NVIDIA DGX systems expose Workload Power Profile Support (WPPS) through Redfish OEM extensions. WPPS allows dynamic switching between power profiles based on workload characteristics.

WPPS Endpoints

# Enable workload profiles
POST /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.EnableProfiles

# Disable workload profiles
POST /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.DisableProfiles

WPPS Request Example

{
  "ProfileMask": "0x7"
}

WPPS Response Example

{
  "ProfileStatus": {
    "Training": "Enabled",
    "Inference": "Enabled",
    "Development": "Enabled"
  },
  "Message": "Workload profiles enabled successfully"
}
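
A client will typically want to confirm which profiles ended up enabled. The sketch below reads the ProfileStatus object from the response above and renders an integer mask in the hex-string form used by the request body. Both helper names are hypothetical, and the sketch deliberately does not assume which bit of ProfileMask corresponds to which profile, since that mapping is not given here.

```python
def enabled_profiles(response: dict) -> list:
    """Return the sorted names of profiles reported as "Enabled"."""
    status = response.get("ProfileStatus", {})
    return sorted(name for name, state in status.items() if state == "Enabled")


def format_profile_mask(mask: int) -> str:
    """Render an integer bit mask as a hex string such as "0x7"."""
    if mask < 0:
        raise ValueError("profile mask must be non-negative")
    return hex(mask)


# The example response from above.
response = {
    "ProfileStatus": {
        "Training": "Enabled",
        "Inference": "Enabled",
        "Development": "Enabled",
    },
    "Message": "Workload profiles enabled successfully",
}

print(enabled_profiles(response))
print(format_profile_mask(0b111))
```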

Supported Workload Profiles

  • Training - Optimized for machine learning training workloads
  • Inference - Optimized for inference and serving workloads
  • Development - Balanced profile for development and testing
  • Custom - User-defined power profiles

DPS Redfish Integration

DPS configures data center hardware and applies power configurations primarily through the Redfish API. The endpoints it uses depend on the hardware platform and the type of operation.

Power Policy Application

DPS applies power policies through different Redfish endpoints depending on the hardware platform:

H100 Systems

Uses Redfish Node Manager Policy Service:

  • Endpoint: /redfish/v1/Managers/BMC/NodeManager/Domains/{domain_id}
  • Method: PATCH
  • Payload: Domain capabilities and policies
{
  "Capabilities": {
    "Min": 5000,
    "Max": 10000
  },
  "Policies": {
    "Members": [
      {
        "ComponentId": "gpu",
        "Limit": 4000,
        "PercentageOfDomainBudget": 80
      }
    ]
  }
}
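
For illustration, the PATCH above could be prepared with the Python standard library as sketched below. The BMC address `https://bmc.example.com`, the domain ID `0`, and the function name are placeholders, and the request is only built, not sent; a real BMC would also require authentication and TLS configuration. This is not DPS's own implementation.

```python
import json
import urllib.request


def build_domain_patch(bmc_url: str, domain_id: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a PATCH for a Node Manager domain."""
    url = (
        f"{bmc_url.rstrip('/')}"
        f"/redfish/v1/Managers/BMC/NodeManager/Domains/{domain_id}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


# The example payload from above.
payload = {
    "Capabilities": {"Min": 5000, "Max": 10000},
    "Policies": {
        "Members": [
            {"ComponentId": "gpu", "Limit": 4000, "PercentageOfDomainBudget": 80}
        ]
    },
}

request = build_domain_patch("https://bmc.example.com", "0", payload)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would perform the call against a real BMC.
```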

B200/GB200 Systems

Use Redfish EnvironmentMetrics endpoints for granular control:

GPU Power Limits:

  • Endpoint: /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{index}/EnvironmentMetrics
  • Method: PATCH
  • Payload:
{
  "PowerLimitWatts": {
    "SetPoint": 1200
  }
}

CPU Power Limits:

  • Endpoint: /redfish/v1/Systems/HGX_Baseboard_0/Processors/CPU_{index}/EnvironmentMetrics
  • Method: PATCH
  • Payload:
{
  "PowerLimitWatts": {
    "SetPoint": 420
  }
}

Multi-Module Systems (Bianca):

  • Endpoint: /redfish/v1/Chassis/HGX_ProcessorModule_{index}/EnvironmentMetrics
  • Method: PATCH
  • Payload:
{
  "PowerLimitWatts": {
    "SetPoint": 2820
  }
}
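
The three cases above share one payload shape and differ only in the target path. The sketch below turns a list of requested limits into (method, path, body) tuples; the `TARGETS` table, the kind labels `gpu`, `cpu`, and `module`, and the function name are hypothetical conveniences, not part of the Redfish API.

```python
# Hypothetical lookup table built from the endpoints listed above.
TARGETS = {
    "gpu": "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{index}/EnvironmentMetrics",
    "cpu": "/redfish/v1/Systems/HGX_Baseboard_0/Processors/CPU_{index}/EnvironmentMetrics",
    "module": "/redfish/v1/Chassis/HGX_ProcessorModule_{index}/EnvironmentMetrics",
}


def plan_power_limits(limits):
    """Turn (kind, index, watts) entries into (method, path, body) tuples."""
    plan = []
    for kind, index, watts in limits:
        if kind not in TARGETS:
            raise ValueError(f"unknown component kind {kind!r}")
        if watts <= 0:
            raise ValueError("power limit must be positive")
        body = {"PowerLimitWatts": {"SetPoint": watts}}
        plan.append(("PATCH", TARGETS[kind].format(index=index), body))
    return plan


# Same values as the three example payloads above.
plan = plan_power_limits([("gpu", 0, 1200), ("cpu", 0, 420), ("module", 0, 2820)])
for method, path, body in plan:
    print(method, path, body)
```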

Metrics Collection

DPS collects power and performance metrics through different Redfish endpoints based on hardware platform:

H100/B200

Uses Redfish TelemetryService MetricReports:

  • Endpoint: /redfish/v1/TelemetryService/MetricReports/NvidiaNMMetrics_0
  • Method: GET
  • Response:
{
  "@odata.id": "/redfish/v1/TelemetryService/MetricReports/NvidiaNMMetrics_0",
  "Id": "NvidiaNMMetrics_0",
  "MetricValues": [
    {
      "MetricId": "dcPlatformPower_avg",
      "MetricValue": "2097.00",
      "Timestamp": "2024-10-15T15:07:35+00:00"
    },
    {
      "MetricId": "gpuPower_avg_0",
      "MetricValue": "158.657",
      "Timestamp": "2024-10-15T15:07:35+00:00"
    },
    {
      "MetricId": "cpuPackagePower_avg_0",
      "MetricValue": "195.00",
      "Timestamp": "2024-10-15T15:07:35+00:00"
    }
  ]
}
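
Note that MetricValue arrives as a string in this report. A minimal sketch of converting the report into numeric values keyed by MetricId follows; the function name is hypothetical and the input is the example response above.

```python
def metric_values_to_floats(report: dict) -> dict:
    """Map each MetricId in a metric report to its value as a float."""
    return {
        entry["MetricId"]: float(entry["MetricValue"])
        for entry in report.get("MetricValues", [])
    }


# The example report from above.
report = {
    "@odata.id": "/redfish/v1/TelemetryService/MetricReports/NvidiaNMMetrics_0",
    "Id": "NvidiaNMMetrics_0",
    "MetricValues": [
        {"MetricId": "dcPlatformPower_avg", "MetricValue": "2097.00",
         "Timestamp": "2024-10-15T15:07:35+00:00"},
        {"MetricId": "gpuPower_avg_0", "MetricValue": "158.657",
         "Timestamp": "2024-10-15T15:07:35+00:00"},
        {"MetricId": "cpuPackagePower_avg_0", "MetricValue": "195.00",
         "Timestamp": "2024-10-15T15:07:35+00:00"},
    ],
}

metrics = metric_values_to_floats(report)
print(metrics["dcPlatformPower_avg"])  # prints 2097.0
```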

GB200

Uses EnvironmentMetrics endpoints for component-level monitoring:

Processor Module Metrics:

  • Endpoint: /redfish/v1/Chassis/HGX_ProcessorModule_{index}/EnvironmentMetrics
  • Method: GET
  • Response:
{
  "@odata.id": "/redfish/v1/Chassis/HGX_ProcessorModule_0/EnvironmentMetrics",
  "Name": "Processor Module 0 Environment Metrics",
  "PowerWatts": {
    "Reading": 158.657,
    "SetPoint": 1200,
    "AllowableMin": 200,
    "AllowableMax": 1200
  },
  "TemperatureCelsius": {
    "Reading": 45.2
  },
  "EnergykWh": {
    "Reading": 0.042
  }
}

Individual GPU/CPU Metrics:

  • Endpoint: /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{index}/EnvironmentMetrics
  • Method: GET
  • Response:
{
  "@odata.id": "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_0/EnvironmentMetrics",
  "Name": "GPU 0 Environment Metrics",
  "PowerWatts": {
    "Reading": 158.657,
    "SetPoint": 1200,
    "AllowableMin": 200,
    "AllowableMax": 1200
  },
  "PowerLimitWatts": {
    "Reading": 1200,
    "SetPoint": 1200,
    "AllowableMin": 200,
    "AllowableMax": 1200,
    "ControlMode": "Automatic"
  }
}
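
The AllowableMin and AllowableMax fields in this response bound the set points a client can request. One possible use, sketched below with a hypothetical `clamp_set_point` helper, is to clamp a requested limit into that range before sending a PATCH. Whether to clamp or reject an out-of-range value is a policy choice this sketch does not settle.

```python
def clamp_set_point(metrics: dict, requested_watts: float) -> float:
    """Clamp a requested power limit to the range the resource reports."""
    limit = metrics["PowerLimitWatts"]
    low, high = limit["AllowableMin"], limit["AllowableMax"]
    return max(low, min(high, requested_watts))


# The PowerLimitWatts object from the example response above.
gpu_metrics = {
    "PowerLimitWatts": {
        "Reading": 1200,
        "SetPoint": 1200,
        "AllowableMin": 200,
        "AllowableMax": 1200,
        "ControlMode": "Automatic",
    }
}

print(clamp_set_point(gpu_metrics, 1500))  # prints 1200
print(clamp_set_point(gpu_metrics, 800))   # prints 800
print(clamp_set_point(gpu_metrics, 100))   # prints 200
```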

Workload Power Profile Support (WPPS)

For B200 systems that support WPPS, DPS uses Redfish OEM extensions:

Enable Workload Profiles:

  • Endpoint: /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.EnableProfiles
  • Method: POST
  • Payload:
{
  "ProfileMask": "0x7"
}

Disable Workload Profiles:

  • Endpoint: /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.DisableProfiles
  • Method: POST
  • Payload:
{
  "ProfileMask": "0x0"
}
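
A rough sketch of preparing these two action calls with the Python standard library follows. The BMC address `https://bmc.example.com` and the function name are placeholders, the requests are built but not sent, and authentication is omitted.

```python
import json
import urllib.request

# Action path template taken from the endpoints above.
ACTION_PATH = (
    "/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{gpu}"
    "/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.{action}"
)


def build_wpps_request(bmc_url: str, gpu: int, enable: bool, profile_mask: str) -> urllib.request.Request:
    """Build (but do not send) an EnableProfiles or DisableProfiles POST."""
    action = "EnableProfiles" if enable else "DisableProfiles"
    url = bmc_url.rstrip("/") + ACTION_PATH.format(gpu=gpu, action=action)
    body = json.dumps({"ProfileMask": profile_mask}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same masks as the two example payloads above.
enable_request = build_wpps_request("https://bmc.example.com", 0, True, "0x7")
disable_request = build_wpps_request("https://bmc.example.com", 0, False, "0x0")
print(enable_request.full_url)
print(disable_request.full_url)
```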

The following endpoints are relevant to DPS operations:

  • Core Service Root
    • /redfish/v1/: Service root entry point
  • Manager Resources
    • /redfish/v1/Managers: Manager collection
    • /redfish/v1/Managers/BMC: BMC manager
  • Power Domain Management
    • H100 / B200:
      • /redfish/v1/Managers/BMC/NodeManager: Node management
      • /redfish/v1/Managers/BMC/NodeManager/Domains: Power domains collection
      • /redfish/v1/Managers/BMC/NodeManager/Domains/{id}: Specific domain (0, 1, 2…)
      • /redfish/v1/Managers/BMC/NodeManager/Domains/{id}/Policies: Domain policies collection
      • /redfish/v1/Managers/BMC/NodeManager/Domains/{id}/Policies/{id}: Specific policy
      • /redfish/v1/Managers/BMC/NodeManager/PSUPolicies: PSU policies collection
      • /redfish/v1/Managers/BMC/NodeManager/PSUPolicies/{id}: Specific PSU policy
  • System Management
    • /redfish/v1/Systems: Systems collection
    • /redfish/v1/Systems/HGX_Baseboard_0: HGX system baseboard
  • Processor Management
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors: Processors collection
    • B200 / GB200:
      • /redfish/v1/Systems/HGX_Baseboard_0/Processors/CPU_{id}: CPU processors (CPU_0)
      • /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{id}: GPU processors (GPU_0, GPU_3)
    • H100:
      • /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_SXM_{id}: SXM GPU processors (GPU_SXM_1, GPU_SXM_2)
      • /redfish/v1/Systems/HGX_Baseboard_0/Processors/FPGA_{id}: FPGA processors (FPGA_0)
  • Environment Metrics
    • GB200:
      • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{processor_id}/EnvironmentMetrics
  • Chassis-level Metrics
    • /redfish/v1/Chassis/Chassis_0/EnvironmentMetrics: Enhanced in GB200
    • /redfish/v1/Chassis/HGX_Chassis_0/EnvironmentMetrics
    • GB200:
      • /redfish/v1/Chassis/HGX_ProcessorModule_{id}/EnvironmentMetrics: Processor modules
  • OEM Extensions
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem: OEM extensions
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia: NVIDIA-specific
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia/WorkloadPowerProfile: GPU workload profiles - WPPS support
  • Workload Power Profile Actions
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.EnableProfiles
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia/WorkloadPowerProfile/Actions/NvidiaWorkloadPower.DisableProfiles
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia/WorkloadPowerProfile/EnableProfilesActionInfo
    • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/Oem/Nvidia/WorkloadPowerProfile/DisableProfilesActionInfo
  • Chassis Resources
    • /redfish/v1/Chassis: Chassis collection
    • /redfish/v1/Chassis/HGX_Chassis_0: HGX GPU chassis - All devices
    • /redfish/v1/Chassis/Chassis_0: Main chassis - Enhanced monitoring in GB200
    • GB200:
      • /redfish/v1/Chassis/HGX_ProcessorModule_{id}: HGX processor modules
  • Telemetry Services
    • /redfish/v1/TelemetryService: Telemetry service root
    • /redfish/v1/TelemetryService/MetricReports: Metric reports collection
    • /redfish/v1/TelemetryService/MetricReports/NvidiaNMMetrics_0: NVIDIA NodeManager metrics
    • /redfish/v1/TelemetryService/MetricReportDefinitions/NvidiaNMMetrics_0: Metrics definitions
  • Session Management
    • /redfish/v1/SessionService: Session service
    • /redfish/v1/SessionService/Sessions: Sessions collection
    • /redfish/v1/SessionService/Sessions/{session_id}: Individual sessions
  • Environment Metrics Actions
    • NVIDIA-specific environment actions:
      • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/EnvironmentMetrics/Actions/Oem/NvidiaEnvironmentMetrics.ClearOOBSetPoint
      • /redfish/v1/Systems/HGX_Baseboard_0/Processors/{gpu_id}/EnvironmentMetrics/Actions/Oem/NvidiaEnvironmentMetrics.ResetEDPp
    • /redfish/v1/{objectPath}/EnvironmentMetrics: Flexible pattern for any object path
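
To show how the Session Management endpoints listed above fit with the others, the sketch below builds a standard Redfish session login (a POST of UserName and Password to the Sessions collection) and an authenticated GET that carries the resulting token in the X-Auth-Token header. The BMC address, credentials, token value, and function names are placeholders; nothing is sent, and this is not a description of how DPS itself authenticates.

```python
import json
import urllib.request

SESSIONS_PATH = "/redfish/v1/SessionService/Sessions"


def build_login_request(bmc_url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a Redfish session creation request."""
    body = json.dumps({"UserName": username, "Password": password}).encode("utf-8")
    return urllib.request.Request(
        bmc_url.rstrip("/") + SESSIONS_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_authenticated_get(bmc_url: str, path: str, token: str) -> urllib.request.Request:
    """Build a GET that passes a session token in the X-Auth-Token header."""
    return urllib.request.Request(
        bmc_url.rstrip("/") + path,
        headers={"X-Auth-Token": token},
        method="GET",
    )


login = build_login_request("https://bmc.example.com", "admin", "placeholder")
# After sending the login, the service returns the token in the response's
# X-Auth-Token header; "token-from-login" stands in for that value here.
report = build_authenticated_get(
    "https://bmc.example.com",
    "/redfish/v1/TelemetryService/MetricReports/NvidiaNMMetrics_0",
    "token-from-login",
)
print(login.get_method(), login.full_url)
print(report.get_method(), report.full_url)
```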

References