Deployment Guide

Overview

This guide provides a step-by-step process for deploying DPS (NVIDIA Domain Power Service) to a Kubernetes cluster.

Prerequisites

  • Kubernetes cluster, version 1.31 or later
  • Helm 3.x
  • kubectl configured to access your cluster
  • Access to DPS on NGC (NVIDIA GPU Cloud)
  • Access to the DPS Artifactory Repositories
  • Access to the managed hosts and their BMC (Baseboard Management Controller)
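
The version requirement above can be checked before you start. The following is a minimal sketch, not part of DPS: k8s_version_ok is a hypothetical helper name, and it only compares the major and minor components of a version string such as "v1.31.2".

```shell
# Return success when a Kubernetes version string such as "v1.31.2"
# is at least the required major.minor version (default: 1.31).
# k8s_version_ok is a hypothetical helper, not a DPS or kubectl command.
k8s_version_ok() {
  local have="${1#v}" want="${2:-1.31}"
  local have_major="${have%%.*}" rest="${have#*.}"
  local have_minor="${rest%%.*}"
  # Strip suffixes such as the "+" in "31+" reported by some clusters.
  have_minor="${have_minor%%[!0-9]*}"
  local want_major="${want%%.*}" want_minor="${want#*.}"
  [ -n "$have_major" ] && [ -n "$have_minor" ] || return 2
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "$have_minor" -ge "$want_minor" ]
  fi
}
```

One way to use it, assuming jq is installed, is to pass in the server version reported by kubectl: k8s_version_ok "$(kubectl version -o json | jq -r '.serverVersion.gitVersion')".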

Downloading DPS Artifacts

1. Download the Helm Chart

Download the DPS Helm chart from NGC. The available versions are listed on the DPS Helm Charts page on NGC.

# Set your version (e.g., 0.7.0)
VERSION="0.7.0"

# Download the chart
helm fetch https://helm.ngc.nvidia.com/nvidia/charts/dps-${VERSION}.tgz

# Extract the chart
tar -xzf dps-${VERSION}.tgz
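
Before writing your own values file, it can help to save the chart's default values for reference. The sketch below assumes the archive downloaded above is in the current directory. save_default_values and the HELM override variable are hypothetical names introduced here; helm show values is a standard Helm command.

```shell
# Write the chart's default values to a file for reference.
# Arguments: <chart-version> [output-file]
# HELM can be overridden (for example HELM=echo) to preview the command.
save_default_values() {
  local version="$1" out="${2:-default-values.yaml}"
  [ -n "$version" ] || {
    echo "usage: save_default_values <version> [file]" >&2
    return 1
  }
  "${HELM:-helm}" show values "dps-${version}.tgz" > "$out"
}
```

For example, save_default_values "${VERSION}" writes default-values.yaml next to the archive.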

Preparing Your Kubernetes Cluster

1. Create the DPS Namespace

kubectl create namespace dps

2. Configure Image Pull Secrets (if needed)

If your cluster requires authentication to pull images from NVIDIA’s NGC registry:

kubectl create secret docker-registry nvcr-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<your-ngc-api-key> \
  --namespace=dps
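
To avoid typing the API key on the command line, the same command can read it from an environment variable. This is a sketch under assumptions: NGC_API_KEY, KUBECTL, and create_ngc_pull_secret are names chosen here for illustration. The key is still passed to kubectl as an argument, so it may be visible in the process list while the command runs.

```shell
# Create the nvcr-secret image pull secret from the NGC_API_KEY variable.
# Arguments: [namespace] (default: dps)
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
create_ngc_pull_secret() {
  local namespace="${1:-dps}"
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
  "${KUBECTL:-kubectl}" create secret docker-registry nvcr-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="${NGC_API_KEY}" \
    --namespace="${namespace}"
}
```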

3. Configure BMC Credentials

Create one secret for each BMC you want to manage, using the following template:

---
apiVersion: v1
kind: Secret
metadata:
  name: <node-secret-name>
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "your-bmc-username",
      "password": "your-bmc-password"
    }

Example for multiple nodes:

# Create secrets for your BMC nodes
kubectl apply -f - <<EOF
---
apiVersion: v1
kind: Secret
metadata:
  name: dgxh100
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "admin",
      "password": "admin"
    }
---
apiVersion: v1
kind: Secret
metadata:
  name: viking592
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "your-username",
      "password": "your-password"
    }
EOF

Note: The secret names are referenced in your topology configuration via the SecretName field in the Redfish configuration.
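
With many nodes, the manifests can be generated rather than written by hand. The sketch below prints one manifest per call in the same shape as the template above. render_bmc_secret is a hypothetical helper, and it assumes the username and password contain no double quotes or backslashes, since it performs no JSON escaping.

```shell
# Print a BMC Secret manifest for one node.
# Arguments: <secret-name> <bmc-username> <bmc-password>
# Assumption: the credentials contain no double quotes or backslashes,
# because no JSON escaping is performed here.
render_bmc_secret() {
  local name="$1" user="$2" pass="$3"
  cat <<EOF
---
apiVersion: v1
kind: Secret
metadata:
  name: ${name}
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "${user}",
      "password": "${pass}"
    }
EOF
}
```

The output can be piped straight to the cluster, for example: render_bmc_secret dgxh100 "$BMC_USER" "$BMC_PASS" | kubectl apply -f - (BMC_USER and BMC_PASS are placeholder variable names).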

Installing DPS

1. Create a Values File

Create a values.yaml file with your configuration, for example:

# Basic DPS configuration
global:
  imagePullSecrets:
    - name: nvcr-secret

dps:
  ingress:
    hostname: "api.dps.your-domain.com"
    tls:
      - hosts:
          - "api.dps.your-domain.com"
        secretName: dps-api-tls

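  # Mount BMC secrets. Each name must match an existing Secret in the dps
  # namespace (create one for viking593 as well; the earlier example omits it).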
  secrets:
    - name: dgxh100
      secretKey: bmc
      mountPath: /home/nonroot/secrets/baremetal/dgxh100/bmc
    - name: viking592
      secretKey: bmc
      mountPath: /home/nonroot/secrets/baremetal/viking592/bmc
    - name: viking593
      secretKey: bmc
      mountPath: /home/nonroot/secrets/baremetal/viking593/bmc

# Enable UI
ui:
  ingress:
    hostname: "ui.dps.your-domain.com"
    tls:
      - hosts:
          - "ui.dps.your-domain.com"
        secretName: dps-ui-tls

# Enable documentation
docs:
  ingress:
    hostname: "docs.dps.your-domain.com"
    tls:
      - hosts:
          - "docs.dps.your-domain.com"
        secretName: dps-docs-tls
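
The entries under dps.secrets in the values file above follow a fixed pattern per node, so they can be generated from a list of secret names. This is a minimal sketch: render_secret_mounts is a hypothetical helper, and the indentation it prints assumes the entries are pasted under dps.secrets as shown above.

```shell
# Print one dps.secrets list entry per secret name, following the
# mountPath pattern used in the values file above.
# Arguments: <secret-name>...
render_secret_mounts() {
  local name
  for name in "$@"; do
    printf '    - name: %s\n' "$name"
    printf '      secretKey: bmc\n'
    printf '      mountPath: /home/nonroot/secrets/baremetal/%s/bmc\n' "$name"
  done
}
```

For example, render_secret_mounts dgxh100 viking592 prints the first two entries of the list above.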

2. Install the Chart

# Install DPS from the downloaded chart archive
helm install dps ./dps-${VERSION}.tgz \
  --namespace dps \
  --values values.yaml \
  --wait

# Verify installation
kubectl get pods -n dps
kubectl get services -n dps

Dependencies and Optional Components

Certificate Management

DPS can work with cert-manager for automatic TLS certificate management:

# In your values.yaml
dps:
  ingress:
    grpcAnnotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    httpAnnotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"

LDAP Integration

To authenticate users against an external LDAP server, configure the ldap section of your values file, for example:

dps:
  ldap:
    enabled: true
    serverUrl: "ldaps://your-ldap-server.company.com:636"
    bindDn: "cn=dps-service,ou=services,dc=company,dc=com"
    bindPassword: "your-service-account-password"
    defaultRole: "admin"
    groupRoleMapping: "dcpower-admins=admin,dcpower-users=user,dcpower-readonly=readonly"
    certSource: "file"
    tlsCaCert: "/home/nonroot/secrets/ldap/ca.crt"
    tlsClientCert: "/home/nonroot/secrets/ldap/client.crt"
    tlsClientKey: "/home/nonroot/secrets/ldap/client.key"
  extraVolumeMounts:
    - name: ldap-certs
      mountPath: "/home/nonroot/secrets/ldap"
      readOnly: true
  extraVolumes:
    - name: ldap-certs
      secret:
        secretName: "external-ldap-certs"
        items:
          - key: "ca.crt"
            path: "ca.crt"
            mode: 0444
          - key: "client.crt"
            path: "client.crt"
            mode: 0444
          - key: "client.key"
            path: "client.key"
            mode: 0400

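Change the values above to match your environment. The external-ldap-certs secret referenced under extraVolumes must already exist in the dps namespace and contain the ca.crt, client.crt, and client.key keys.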
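
The LDAP configuration above mounts a secret named external-ldap-certs. One way to create it from local certificate files is sketched below. create_ldap_cert_secret and the KUBECTL override are hypothetical names, and the sketch assumes the three files sit in one directory under the same names as the secret keys.

```shell
# Create the external-ldap-certs secret from a directory that contains
# ca.crt, client.crt, and client.key.
# Arguments: <certificate-directory>
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
create_ldap_cert_secret() {
  local dir="$1" f
  for f in ca.crt client.crt client.key; do
    [ -r "${dir}/${f}" ] || { echo "missing ${dir}/${f}" >&2; return 1; }
  done
  "${KUBECTL:-kubectl}" create secret generic external-ldap-certs \
    --namespace=dps \
    --from-file=ca.crt="${dir}/ca.crt" \
    --from-file=client.crt="${dir}/client.crt" \
    --from-file=client.key="${dir}/client.key"
}
```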

Verifying the Installation

1. Check Pod Status

kubectl get pods -n dps

Expected output:

NAME                          READY   STATUS    RESTARTS        AGE
dps-postgresql-0              1/1     Running   0               4m22s
dps-server-7d94d5744b-rf8ql   1/1     Running   1 (2m18s ago)   4m22s
dps-ui-7d94d5744b-rf8ql       1/1     Running   0               4m22s
dps-docs-7d94d5744b-rf8ql     1/1     Running   0               4m22s
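
In scripts, the pod listing can be filtered down to the pods that still need attention. The sketch below reads kubectl get pods --no-headers output and prints every pod that is not fully ready or not in the Running state. not_ready_pods is a hypothetical helper; note that it would also list pods that finished successfully (status Completed), if the namespace contains any.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the name
# of every pod whose READY count is incomplete or whose STATUS is not
# Running.
not_ready_pods() {
  awk '{
    split($2, ready, "/")
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}
```

Usage: kubectl get pods -n dps --no-headers | not_ready_pods. Empty output means every pod is ready.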

2. Check Services

kubectl get services -n dps

3. Test the API with dpsctl

# Set environment variables for dpsctl defaults
export DPS_HOST="api.dps.your-domain.com"
export DPS_PORT="443"

# Log in to the DPS server
dpsctl login

# Test server functionality
dpsctl verify

# Print the server version
dpsctl server-version

# Or test the gRPC endpoint directly

# Get a token, then export the access token from the response as ACCESS_TOKEN
grpcurl -d '{
  "passwordCredential": {
    "username": "<username>",
    "password": "<password>"
  }
}' \
  ${DPS_HOST}:${DPS_PORT} \
  nvidia.dcpower.v1.AuthService/Token

# List gRPC endpoints
grpcurl \
  -H "authorization: Bearer ${ACCESS_TOKEN}" \
  ${DPS_HOST}:${DPS_PORT} \
  list
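
The second grpcurl call above expects ACCESS_TOKEN to hold the token returned by the first. A minimal sketch for pulling a string field out of the JSON response follows. json_string_field is a hypothetical helper, and the field name accessToken used in the usage comment is an assumption; check the actual response of AuthService/Token for the real field name. For anything beyond a simple flat response, a JSON parser such as jq is a better fit.

```shell
# Extract the first string value of a named field from JSON on stdin.
# Suitable only for simple responses without escaped quotes in the value.
# Arguments: <field-name>
json_string_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example usage (accessToken is an assumed field name):
#   export ACCESS_TOKEN="$(grpcurl -d '{...}' \
#     "${DPS_HOST}:${DPS_PORT}" nvidia.dcpower.v1.AuthService/Token \
#     | json_string_field accessToken)"
```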

4. Access the UI

Open your browser and navigate to https://ui.dps.your-domain.com

Troubleshooting

Common Issues

  1. Image Pull Errors: Ensure your image pull secrets are configured correctly
  2. Ingress Issues: Verify your ingress controller is properly configured
  3. BMC Connection Failures: Check BMC credentials and network connectivity
  4. Database Issues: Verify PostgreSQL is running and accessible
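
The four checks above map onto a handful of standard kubectl queries. The sketch below bundles them into one function. dps_diagnose and the KUBECTL override are hypothetical names; the app=bmc-secret label and the dps-postgresql-0 pod name are taken from the examples earlier in this guide and may differ in your deployment.

```shell
# Run one read-only query per common issue listed above.
# Arguments: [namespace] (default: dps)
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
dps_diagnose() {
  local kubectl="${KUBECTL:-kubectl}" ns="${1:-dps}"
  # Image pull errors: failed pulls appear in the recent events.
  "$kubectl" get events --namespace "$ns" --sort-by=.lastTimestamp
  # Ingress issues: confirm the ingress resources exist and have an address.
  "$kubectl" get ingress --namespace "$ns"
  # BMC connection failures: confirm the labeled BMC secrets exist.
  "$kubectl" get secrets --namespace "$ns" --selector app=bmc-secret
  # Database issues: confirm the PostgreSQL pod is running.
  "$kubectl" get pod dps-postgresql-0 --namespace "$ns"
}
```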

Getting Logs

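# DPS server logs (use deployment/dps-server instead if the server runs as a
# Deployment, as the pod names in the expected output above suggest)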
kubectl logs -n dps statefulset/dps-server

# UI logs
kubectl logs -n dps deployment/dps-ui

# PostgreSQL logs
kubectl logs -n dps statefulset/dps-postgresql

Upgrading DPS

# Using Helm (replace ./dps-new-version with the path to the new chart)
helm upgrade dps ./dps-new-version \
  --namespace dps \
  --values values.yaml
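
For a more cautious upgrade, Helm 3 can roll back automatically when the upgrade fails, and a release can also be rolled back by hand. The following is a sketch, not a DPS-specific procedure: upgrade_dps, rollback_dps, and the HELM override are hypothetical names, while --atomic, --wait, and helm rollback are standard Helm 3 features.

```shell
# Upgrade the dps release and roll back automatically if the upgrade fails.
# Arguments: <chart-path> [values-file]
# HELM can be overridden (for example HELM=echo) to preview the command.
upgrade_dps() {
  local chart="$1" values="${2:-values.yaml}"
  [ -n "$chart" ] || {
    echo "usage: upgrade_dps <chart-path> [values-file]" >&2
    return 1
  }
  "${HELM:-helm}" upgrade dps "$chart" \
    --namespace dps \
    --values "$values" \
    --atomic \
    --wait
}

# Roll the dps release back to an earlier revision.
# Arguments: <revision> (see `helm history dps --namespace dps`)
rollback_dps() {
  [ -n "$1" ] || { echo "usage: rollback_dps <revision>" >&2; return 1; }
  "${HELM:-helm}" rollback dps "$1" --namespace dps --wait
}
```

Running helm history dps --namespace dps lists the revisions that rollback_dps accepts.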

Next Steps