DPS Architecture

Overview

The Domain Power Service (DPS) is a power management solution designed as a cloud-native Kubernetes service. This document describes the overall architecture, core components, deployment model, and system integrations that make up the DPS ecosystem.

DPS provides centralized power management for datacenter infrastructure through a microservices architecture deployed on Kubernetes. The system manages power policies, monitors consumption, and controls hardware through standardized protocols.

Key Architectural Principles

  • Cloud-Native Design: Built for Kubernetes deployment with containerized microservices
  • API-First Architecture: All functionality is exposed through gRPC APIs, with an HTTP/REST gateway in front of them
  • Protocol Standards: Uses the industry-standard Redfish protocol for hardware control
  • Scalable Design: Horizontally scalable components with persistent storage
  • Security-First: Comprehensive authentication and authorization with LDAP integration

Core Components

DPS Server (dps-server)

The core orchestration engine of the DPS system.

Functionality:

  • gRPC API Server: Primary interface for all DPS operations
  • Topology Management: Manages power distribution network topology
  • Policy Engine: Implements and enforces power policies
  • Resource Group Management: Handles dynamic resource allocation
  • BMC Integration: Communicates with Baseboard Management Controllers via Redfish

Key Responsibilities:

  1. API Management: Serves the gRPC API for client interactions
  2. Topology Validation: Ensures power topology constraints are satisfied
  3. Hardware Control: Manages BMC connections and Redfish communications
  4. Policy Enforcement: Applies and monitors power policies across resources
  5. Database Operations: Manages persistent state in PostgreSQL
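
As an illustration of what topology validation can involve, here is a minimal sketch that checks one plausible constraint: the power committed beneath a node must not exceed that node's capacity. The `PowerNode` class, its fields, and the constraint itself are assumptions made for this example; they are not the DPS schema or the actual rules that dps-server enforces.

```python
from dataclasses import dataclass, field


@dataclass
class PowerNode:
    """Hypothetical node in a power distribution tree (not the DPS schema)."""
    name: str
    capacity_watts: float
    children: list["PowerNode"] = field(default_factory=list)
    # Leaf nodes carry the power limit assigned to them by a policy.
    limit_watts: float = 0.0


def allocated_watts(node: PowerNode) -> float:
    """Power committed under a node: its own limit for a leaf, else the sum over children."""
    if not node.children:
        return node.limit_watts
    return sum(allocated_watts(child) for child in node.children)


def validate(node: PowerNode) -> list[str]:
    """Return one message for every node whose committed power exceeds its capacity."""
    errors = []
    total = allocated_watts(node)
    if total > node.capacity_watts:
        errors.append(
            f"{node.name}: {total:.0f} W allocated exceeds {node.capacity_watts:.0f} W capacity"
        )
    for child in node.children:
        errors.extend(validate(child))
    return errors


# Two servers limited to 10.2 kW each under a rack rated for 20 kW.
rack = PowerNode("rack-1", 20000, [
    PowerNode("node-a", 10200, limit_watts=10200),
    PowerNode("node-b", 10200, limit_watts=10200),
])
print(validate(rack))
```

Returning a list of messages rather than raising on the first violation lets a caller report every overcommitted node in one pass.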

DPS Control Utility (dpsctl)

Command-line interface for DPS operations.

Architecture:

  • gRPC Client: Communicates with dps-server over gRPC
  • Command Parser: Processes CLI commands and arguments
  • Output Formatter: Provides human-readable and machine-parseable output
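
The split between command parsing and output formatting can be sketched as follows. This is not dpsctl's real syntax: the program name `example-ctl`, the `list-policies` subcommand, the `--output` flag, and the sample rows are all hypothetical, and the gRPC call is replaced by in-memory data.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Program name, subcommand, and flag are illustrative, not real dpsctl syntax.
    parser = argparse.ArgumentParser(prog="example-ctl")
    parser.add_argument("--output", choices=["table", "json"], default="table")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list-policies")
    return parser


def format_rows(rows: list[dict], output: str) -> str:
    """Render rows as JSON (machine-parseable) or as an aligned table (human-readable)."""
    if output == "json":
        return json.dumps(rows, indent=2)
    headers = list(rows[0]) if rows else []
    widths = [max(len(h), *(len(str(r[h])) for r in rows)) for h in headers]
    lines = ["  ".join(h.upper().ljust(w) for h, w in zip(headers, widths))]
    for row in rows:
        lines.append("  ".join(str(row[h]).ljust(w) for h, w in zip(headers, widths)))
    return "\n".join(line.rstrip() for line in lines)


def main(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    # A real client would call the server here; this sketch uses made-up rows.
    rows = [{"name": "cap-a", "watts": 500}]
    return format_rows(rows, args.output)


print(main(["list-policies"]))
print(main(["--output", "json", "list-policies"]))
```

Keeping the formatter independent of the parser means the same command result can feed both an operator's terminal and a script.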

DPS Web UI (dps-ui)

Web-based user interface for DPS management.

Functionality:

  • Topology Visualization: Interactive topology creation and editing
  • Policy Management: Web-based policy configuration interface
  • Telemetry Dashboard: Real-time power consumption monitoring
  • Import/Export: Topology and configuration management tools

Integration:

  • DPS Server Communication: RESTful API calls to dps-server HTTP gateway
  • Authentication: Integrates with DPS authentication system

DPS Documentation (dps-docs)

Documentation service providing guides and API references.

Features:

  • Hugo-Generated: Static documentation site built with the Hugo static site generator
  • API Documentation: Auto-generated API references
  • Interactive Guides: Step-by-step operational procedures
  • SDK Documentation: Developer guides and examples

Data Storage

PostgreSQL Database

Primary persistent storage for DPS state and configuration.

Schema Components:

  • Topologies: Power distribution network definitions
  • Policies: Power management policy configurations
  • Resource Groups: Dynamic resource allocations
  • Entities: Hardware component specifications

Authentication & Authorization

OpenLDAP Integration

Centralized authentication provider for the DPS ecosystem.

Components:

  • OpenLDAP Server: Directory service for user authentication
  • TLS Security: Encrypted LDAP communications
  • Group-Based Authorization: Role mapping through LDAP groups

Authentication Flow:

  1. The user provides credentials to a DPS component
  2. DPS validates the credentials against OpenLDAP
  3. LDAP returns the user's groups and attributes
  4. DPS maps the groups to internal roles
  5. DPS issues a JWT for subsequent requests
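
The flow above can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the in-memory `FAKE_DIRECTORY` stands in for an LDAP bind and group lookup, the group names, role names, claim names, and signing key are made up, and HS256 is just one common JWT signing choice, not necessarily what DPS uses.

```python
import base64
import hashlib
import hmac
import json
import time

# Stand-in for the directory: username -> (password, LDAP groups). All values are made up.
FAKE_DIRECTORY = {"alice": ("s3cret", ["cn=power-admins", "cn=staff"])}
# Hypothetical mapping from LDAP groups to internal roles.
GROUP_ROLES = {"cn=power-admins": "admin", "cn=power-viewers": "viewer"}
SIGNING_KEY = b"demo-key-do-not-use"


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def issue_token(username, password, ttl_seconds=3600, now=None):
    # Steps 1-2: validate credentials (a real deployment would bind against LDAP).
    entry = FAKE_DIRECTORY.get(username)
    if entry is None or not hmac.compare_digest(entry[0].encode(), password.encode()):
        raise PermissionError("invalid credentials")
    # Steps 3-4: take the returned groups and map the known ones to roles.
    roles = sorted({GROUP_ROLES[g] for g in entry[1] if g in GROUP_ROLES})
    # Step 5: issue an HS256-signed JWT with an expiration claim.
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "roles": roles, "iat": issued, "exp": issued + ttl_seconds}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    signature = hmac.new(SIGNING_KEY, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def decode_payload(token):
    """Decode the payload segment without verifying the signature (inspection only)."""
    body = token.split(".")[1]
    return json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))


token = issue_token("alice", "s3cret", now=1000)
print(decode_payload(token))
```

Because the roles and expiration travel inside the signed token, later requests can be authorized without another directory lookup until the token expires.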

Hardware Integration

BMC Communication

DPS controls hardware through Baseboard Management Controllers using the Redfish protocol.

Protocol Stack:

DPS Server
    ↓
Redfish API (HTTPS)
    ↓
BMCs
    ↓
Hardware Components
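
To make the top of this stack concrete, the sketch below builds (but does not send) a Redfish request that sets a chassis power limit. It assumes the BMC exposes the DMTF `Power` resource with a `PowerControl[].PowerLimit.LimitInWatts` property and accepts HTTP Basic authentication; a given BMC may expose power limits through a different resource. The host name, chassis ID, and credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_power_limit_request(bmc_host, chassis_id, limit_watts, username, password):
    """Build an HTTPS PATCH request that sets a power limit on a Redfish chassis."""
    # Assumed resource: the DMTF Power schema under the chassis collection.
    url = f"https://{bmc_host}/redfish/v1/Chassis/{chassis_id}/Power"
    body = json.dumps(
        {"PowerControl": [{"PowerLimit": {"LimitInWatts": limit_watts}}]}
    ).encode()
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
    )


# Placeholder host, chassis ID, and credentials; nothing is sent over the network.
request = build_power_limit_request("bmc.example.invalid", "1", 500, "admin", "password")
print(request.get_method(), request.full_url)
print(request.data.decode())
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without a BMC.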

Supported Hardware:

  • Real BMCs: Production server BMCs supporting Redfish
  • BMC Simulators: Development and testing environments

Connection Management:

  • Credential Storage: BMC credentials stored in Kubernetes secrets
  • Connection Pooling: Efficient BMC connection management
  • Health Monitoring: BMC availability and status tracking
  • Error Handling: Robust error recovery and retry logic
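
Retry logic of the kind listed above is often implemented as exponential backoff. The following is a generic sketch of that pattern, not the DPS implementation: the function name, the attempt count, the delays, and the choice to retry only on `ConnectionError` are all assumptions.

```python
import time


def call_with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...


# Demo: a fake BMC call that fails twice before succeeding.
calls = {"count": 0}


def flaky_bmc_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("BMC unreachable")
    return "ok"


delays = []
# Record the delays instead of sleeping so the demo finishes immediately.
result = call_with_retry(flaky_bmc_call, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` as a parameter keeps the backoff schedule testable without waiting in real time.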

BMC Simulation System

Development and testing infrastructure simulating NVIDIA DGX systems.

Simulated Systems:

  • H100 Systems: NVIDIA H100 GPU-based servers
  • B200 Systems: NVIDIA B200 GPU-based servers
  • B300 Systems: NVIDIA B300 GPU-based servers
  • GB200 Systems: NVIDIA GB200 Grace Blackwell systems

Simulator Features:

  • Redfish Compliance: Partial Redfish API implementation
  • Realistic Behavior: Accurate power consumption modeling
  • Dynamic Response: Real-time power limit changes
  • Monitoring Integration: Prometheus metrics export

Python API Integration

Python API for programmatic DPS interaction.

API Features:

  • Complete Coverage: All DPS functionality accessible via Python
  • Type Safety: Strongly typed interfaces with validation
  • Async Support: Operations can be invoked asynchronously
  • Error Handling: Failures are reported as Python exceptions

Deployment Architecture

Kubernetes Deployment Model

DPS is deployed as a collection of Kubernetes services using Helm charts.

Namespace Organization:

dps (namespace)
├── dps-server (deployment + service)
├── dps-ui (deployment + service)
├── dps-docs (deployment + service)
├── postgresql (statefulset + service)
├── openldap (deployment + service)
└── bmc-simulators (deployment + service)

Security Architecture

Network Security

Service Isolation:

  • Network Policies: Kubernetes network policies restrict inter-pod communication
  • TLS Encryption: All external communications encrypted with TLS
  • Internal Communication: Secure internal service mesh

Authentication Security

Multi-Layer Authentication:

  1. LDAP Integration: Enterprise directory service integration
  2. JWT Tokens: Stateless authentication with configurable expiration
  3. API Keys: Service-to-service authentication
  4. BMC Credentials: Hardware credentials stored in Kubernetes secrets

Monitoring & Observability

Metrics Collection

Prometheus Integration:

  • Service Monitors: Automatic service discovery for metrics
  • Custom Metrics: DPS-specific power consumption metrics
  • Hardware Metrics: BMC-sourced telemetry data
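
Prometheus scrapes metrics in a plain-text exposition format. As a rough illustration of what a power consumption gauge looks like on the wire, the sketch below renders one by hand. The metric name `example_power_watts` and its `system` label are hypothetical, not actual DPS metric names, and label values are assumed to contain no quotes, backslashes, or newlines (which would need escaping).

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, where labels is a dict.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and label values.
text = render_gauge(
    "example_power_watts",
    "Current power draw in watts.",
    [({"system": "h100-0"}, 6500.0), ({"system": "gb200-0"}, 9800.0)],
)
print(text, end="")
```

In practice a client library would produce this output; writing it out by hand only shows the shape of the data a service monitor collects.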

Logging Architecture

Structured Logging:

  • JSON Format: Machine-parseable log formats
  • Correlation IDs: Request tracing across components
  • Log Levels: Configurable verbosity levels
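
The three properties above fit together as in this minimal sketch built on Python's standard `logging` module. The logger name, the JSON field names, and the `correlation_id` attribute are assumptions for illustration; the actual DPS log schema is not described here.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not supply one.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def make_logger(stream, level=logging.INFO):
    """Create a logger that writes JSON lines to stream at the given verbosity."""
    logger = logging.getLogger("dps.example")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info("power limit applied", extra={"correlation_id": "req-123"})
log.debug("not emitted at INFO level")
print(buffer.getvalue(), end="")
```

Searching the collected logs for one correlation ID then yields every line a single request produced, across whichever components logged it.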

Tracing Integration

Distributed Tracing:

  • Tempo Integration: Request tracing across microservices
  • Span Collection: Detailed operation timing
  • Performance Analysis: Bottleneck identification