Deployment Guide

Overview

This guide provides a step-by-step process for deploying DPS (NVIDIA Domain Power Service) to a Kubernetes cluster.

Prerequisites

  • Kubernetes Cluster >= 1.31.x
  • Helm 3.x
  • kubectl context configured for the target cluster
  • Network access to the NGC Helm repository (for air-gapped environments, see Air-Gapped Deployment)

Configuration Reference

The Helm chart’s values.yaml file contains the complete list of configuration options with detailed comments. Always refer to this file for the latest options and defaults.

To download and view the values file:

# Add the NGC Helm repository (if not already added)
helm repo add ngc https://helm.ngc.nvidia.com/nvidia
helm repo update ngc

# Download and extract the chart
helm pull ngc/dps --untar

# View the values file
cat dps/values.yaml

Quickstart

For evaluation and testing, you can deploy DPS with default settings:

helm repo add ngc https://helm.ngc.nvidia.com/nvidia
helm repo update ngc
helm install dps -n dps ngc/dps --create-namespace
kubectl get pods -n dps

Note: The Quickstart deployment uses default settings suitable for evaluation only:

  • Built-in PostgreSQL (not production-ready)
  • No LDAP authentication
  • Authentication in warning-only mode (accepts any credentials)
  • Default hostnames: api.dps and ui.dps

For production deployments, follow the Production Deployment section below.

Production Deployment

1. Configure External PostgreSQL

The built-in PostgreSQL is not production-ready. For production, use your own PostgreSQL instance.

In your values.yaml, disable the built-in PostgreSQL and configure your external database:

postgresql:
  enabled: false

global:
  postgresql:
    host: "your-postgres.example.com"
    auth:
      username: "dps"
      database: "dps"
      existingSecret: "dps-postgres-credentials"
      secretKeys:
        passwordKey: "password"

See the chart’s values.yaml for all available PostgreSQL options including SSL mode and port configuration.
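
The values above reference a Secret named dps-postgres-credentials that holds the database password under the key password. The following is a minimal sketch of one way to create it, assuming the dps namespace already exists (it is created in step 2). The helper name create_pg_secret and the DPS_PG_PASSWORD environment variable are illustrative and not part of the chart.

```shell
# Sketch: create the Secret referenced by global.postgresql.auth.existingSecret.
# create_pg_secret and DPS_PG_PASSWORD are hypothetical names; the Secret name
# and the "password" key mirror the values shown above.
create_pg_secret() {
  local namespace="${1:-dps}"
  if [ -z "${DPS_PG_PASSWORD:-}" ]; then
    echo "DPS_PG_PASSWORD is not set" >&2
    return 1
  fi
  kubectl create secret generic dps-postgres-credentials \
    --namespace "$namespace" \
    --from-literal=password="$DPS_PG_PASSWORD"
}
```

Reading the password from an environment variable keeps it out of the function's arguments. Call create_pg_secret with no arguments to target the dps namespace.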

2. Create BMC Credentials

Create secrets for each BMC (Baseboard Management Controller) you want to manage. BMC credentials are required for bare-metal compute nodes (such as DGX systems or servers with Redfish/IPMI interfaces) that DPS will monitor and control. These are not Kubernetes node credentials. See Supported Platforms for a complete list of supported hardware.

First, create the namespace:

kubectl create namespace dps

Then create secrets for your BMC nodes:

kubectl apply -f - <<EOF
---
apiVersion: v1
kind: Secret
metadata:
  name: node001
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "admin",
      "password": "your-bmc-password"
    }
---
apiVersion: v1
kind: Secret
metadata:
  name: node002
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "admin",
      "password": "your-bmc-password"
    }
EOF

Note: The secret names are referenced in your topology configuration via the SecretName field in the Redfish configuration.
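
With more than a few nodes, repeating the manifest by hand becomes error-prone. The sketch below prints one Secret per node name, using the same layout as the example above. It assumes every node shares one set of BMC credentials, supplied through the hypothetical BMC_USERNAME and BMC_PASSWORD environment variables. The helper name render_bmc_secrets is also illustrative.

```shell
# Sketch: print one BMC Secret manifest per node name given as an argument.
# render_bmc_secrets, BMC_USERNAME and BMC_PASSWORD are hypothetical names.
# Values are inserted verbatim, so credentials containing double quotes or
# backslashes would need JSON escaping first.
render_bmc_secrets() {
  local node
  if [ -z "${BMC_USERNAME:-}" ] || [ -z "${BMC_PASSWORD:-}" ]; then
    echo "set BMC_USERNAME and BMC_PASSWORD first" >&2
    return 1
  fi
  for node in "$@"; do
    cat <<EOF
---
apiVersion: v1
kind: Secret
metadata:
  name: ${node}
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "${BMC_USERNAME}",
      "password": "${BMC_PASSWORD}"
    }
EOF
  done
}
```

To apply the result, pipe it to kubectl, for example: render_bmc_secrets node001 node002 | kubectl apply -f -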

3. Create Values File

Create a values.yaml file with your configuration, including the PostgreSQL settings from step 1. At minimum, configure your ingress hostnames and mount your BMC secrets:

dps:
  ingress:
    hostname: "api.dps.your-domain.com"
  secrets:
    - name: node001
      secretKey: bmc
      mountPath: /home/nonroot/secrets/baremetal/node001/bmc
    - name: node002
      secretKey: bmc
      mountPath: /home/nonroot/secrets/baremetal/node002/bmc

ui:
  ingress:
    hostname: "ui.dps.your-domain.com"

See the chart’s values.yaml for additional options including TLS configuration, ingress annotations, and resource limits.
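
Each BMC Secret needs a matching entry under dps.secrets, and the entries follow a fixed pattern. As a rough sketch, the hypothetical helper below prints those entries for a list of node names, copying the field names and mount path pattern from the example above. Its output is meant to be pasted under dps.secrets in your values.yaml.

```shell
# Sketch: print the dps.secrets list entries for the given node names.
# render_secret_mounts is a hypothetical helper; indentation matches a list
# nested under "dps:" -> "secrets:".
render_secret_mounts() {
  local node
  for node in "$@"; do
    printf '    - name: %s\n' "$node"
    printf '      secretKey: bmc\n'
    printf '      mountPath: /home/nonroot/secrets/baremetal/%s/bmc\n' "$node"
  done
}
```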

4. Install the Chart

helm install dps ngc/dps \
  --namespace dps \
  --values values.yaml \
  --wait

Optional Configuration

These options can be added to your values.yaml during initial installation or applied later via helm upgrade.

Certificate Management

DPS can work with cert-manager for automatic TLS certificate management:

dps:
  ingress:
    grpcAnnotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    httpAnnotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"

See dps.ingress.* and ui.ingress.* in the values file for all available annotation fields.

LDAP Integration

To enable LDAP authentication:

dps:
  ldap:
    enabled: true

Configure your LDAP server connection via dps.ldap.* settings including serverUrl, bindDn, and TLS certificate paths. See the chart’s values.yaml for all LDAP options and volume mount examples.

Enabling Telemetry Provider

Scope: This section covers DPS-side configuration only. It assumes Zapp is already deployed and operational. Zapp configuration and deployment are not covered here.

Telemetry Providers enable live streaming of sensor data from data center entities.

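To enable Telemetry Provider, set the following options (parent keys are omitted here; see the chart’s values.yaml for their exact location):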

    provider:
      enabled: true
      # -- Telemetry provider gRPC URL
      url: "grpc://telemetry-provider.dps.svc.cluster.local:1234"
      # -- Telemetry provider buffer interval in milliseconds
      bufferMs: 1000

When Telemetry Provider is disabled, DPS continues to operate normally: topology management, resource group policies, and power allocation all function as expected. The only difference is that DPS does not receive real-time power data, so features that depend on it (such as excursion detection and mitigation) are not active.

Alert Manager

Excursion Alerts require an Alertmanager instance. Without it, excursions are still logged but no alerts are fired. To send alerts, enable monitoring and set the Alertmanager URL:

dps:
  monitoring:
    enabled: true
    alertmanager:
      # Alert Manager URL
      url: "http://monitoring-kube-prometheus-alertmanager.dps.svc.cluster.local:9093"
      timeout: "5s"   # optional

Excursion Mitigation

The DPS Power Excursion Mitigation capability is disabled by default. Set dps.excursionMitigation.strategy to enable it:

dps:
  excursionMitigation:
    # -- Excursion mitigation strategy. Valid values: "" (disabled), "simple"
    strategy: "simple"
    # Supported keys for "simple": interval, max_iterations, oversteal_factor, initial_delay
    # Example: "initial_delay=10s,interval=10s,max_iterations=10,oversteal_factor=2.0"
    params: ""   # optional overrides
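
A typo in the params string is easy to miss because the value is a single comma-separated string. The sketch below checks the key names against the supported keys listed in the comment above before you install or upgrade. The helper name check_mitigation_params is hypothetical. It checks only key names and the key=value shape, not the values, and how DPS itself treats unknown keys is not covered here.

```shell
# Sketch: verify that every entry in a "k=v,k=v" params string uses one of the
# keys listed as supported for the "simple" strategy.
# check_mitigation_params is a hypothetical helper name.
check_mitigation_params() {
  local params="$1" pair key rc=0
  [ -z "$params" ] && return 0   # empty string: no overrides to check
  while IFS= read -r pair; do
    key="${pair%%=*}"
    case "$pair" in
      *=?*) ;;   # entry has the key=value shape
      *) echo "malformed entry: '$pair'" >&2; rc=1; continue ;;
    esac
    case "$key" in
      interval|max_iterations|oversteal_factor|initial_delay) ;;
      *) echo "unsupported key: '$key'" >&2; rc=1 ;;
    esac
  done <<EOF
$(printf '%s' "$params" | tr ',' '\n')
EOF
  return "$rc"
}
```

For example, check_mitigation_params "initial_delay=10s,interval=10s" returns success, while a string containing an unlisted key prints a message to stderr and returns a non-zero status.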

Nautobot Integration

Enable the Nautobot integration by setting dps.nautobot.enabled: true, pointing dps.nautobot.url at your existing Nautobot instance, and providing a read-capable API token via dps.nautobot.tokenSecret. If dps.nautobot.tokenSecret.existingSecret is set, the chart does not create a token secret.

dps:
  nautobot:
    enabled: true
    url: "https://nautobot.company.com"
    tokenSecret:
      # If set, the chart will NOT create any token secret.
      existingSecret: ""
      # -- Name of the Secret to create when existingSecret is empty.
      secretName: "nautobot-dps-token"
      key: "token"
      mountPath: "/home/nonroot/secrets/nautobot-token"
      value: "api-token-123"

Verifying the Installation

1. Check Pod Status

kubectl get pods -n dps

Expected output (pod name suffixes and ages will differ):

NAME                          READY   STATUS    RESTARTS   AGE
dps-server-7d94d5744b-rf8ql   1/1     Running   0          4m
dps-ui-7d94d5744b-rf8ql       1/1     Running   0          4m
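
Rather than reading the table by eye, you can filter it for pods that are not healthy. The sketch below reads the output of kubectl get pods --no-headers on stdin. It reports any pod whose STATUS is not Running or whose READY counts do not match. The helper name report_unready_pods is hypothetical, and pods that legitimately finish (STATUS Completed) would also be reported.

```shell
# Sketch: flag pods that are not Running with all containers ready.
# Input: lines from `kubectl get pods --no-headers` (NAME READY STATUS ...).
# report_unready_pods is a hypothetical helper name.
report_unready_pods() {
  awk '
    BEGIN { bad = 0 }
    {
      split($2, ready, "/")   # READY column looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) {
        print "not ready: " $1 " (" $2 ", " $3 ")"
        bad = 1
      }
    }
    END { exit bad }
  '
}
```

Usage: kubectl get pods -n dps --no-headers | report_unready_pods. The function returns a non-zero status if any pod is reported.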

2. Check Services

kubectl get services -n dps

3. Test the API with dpsctl

Note: If you haven’t installed dpsctl yet, see Installing dpsctl for download and setup instructions.

Note: These commands require that DNS resolves DPS_HOST to your cluster’s ingress IP and ingress is configured. Consult your cluster administrator if the hostname is not reachable.

# Set environment variables
export DPS_HOST="api.dps.your-domain.com"
export DPS_PORT="443"

# Log in and verify
dpsctl login
dpsctl verify
dpsctl server-version

4. Access the UI

Note: This requires that DNS resolves the UI hostname to your cluster’s ingress IP and ingress is configured. Consult your cluster administrator if the hostname is not reachable.

Open your browser and navigate to https://ui.dps.your-domain.com

Login Credentials:

  • Evaluation/Quickstart deployments: Authentication is in warning-only mode by default, so any credentials will work (e.g., abcd for both username and password).
  • Production deployments: Use your LDAP credentials as configured in dps.ldap.*.

Upgrading DPS

helm repo update ngc
helm upgrade dps ngc/dps \
  --namespace dps \
  --values values.yaml
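
If you want a failed upgrade to roll back on its own, Helm 3 provides the --atomic flag, which also waits for resources to become ready. The wrapper below is a sketch, not a required procedure. The name upgrade_dps is hypothetical and the 10-minute timeout is an arbitrary choice.

```shell
# Sketch: upgrade the dps release and let Helm roll back automatically if the
# upgrade fails. upgrade_dps is a hypothetical wrapper; --atomic and --timeout
# are standard Helm 3 flags.
upgrade_dps() {
  local values_file="${1:-values.yaml}"
  helm repo update ngc &&
    helm upgrade dps ngc/dps \
      --namespace dps \
      --values "$values_file" \
      --atomic \
      --timeout 10m
}
```

To roll back manually instead, helm history dps -n dps lists the release revisions, and helm rollback dps <revision> -n dps returns to one of them.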

Troubleshooting

Common Issues

  1. Ingress Issues: Verify your ingress controller is properly configured
  2. BMC Connection Failures: Check BMC credentials and network connectivity to the BMC from dps-server
  3. Database Issues: Verify PostgreSQL is running and accessible
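
As a starting point for any of the issues above, the following sketch collects read-only state from the namespace using standard kubectl commands. The helper name dps_diagnostics is hypothetical. It does not test BMC or database connectivity directly; those failures normally show up in the server logs described under Getting Logs.

```shell
# Sketch: print ingress, pod, and recent event information for a namespace.
# dps_diagnostics is a hypothetical helper; it only reads cluster state.
dps_diagnostics() {
  local ns="${1:-dps}"
  echo "== Ingress resources =="
  kubectl get ingress -n "$ns" -o wide
  echo "== Pods =="
  kubectl get pods -n "$ns" -o wide
  echo "== Recent events (oldest first) =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```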

Getting Logs

# DPS server logs
kubectl logs -n dps -l app=dps-server --all-containers

# UI logs
kubectl logs -n dps -l app=dps-ui --all-containers

# PRS logs
kubectl logs -n dps -l app=prs --all-containers

# PostgreSQL logs (if using built-in)
kubectl logs -n dps -l app.kubernetes.io/name=postgresql --all-containers

Next Steps