Base Command Manager

Overview

NVIDIA Base Command Manager (BCM) is cluster management software for AI and HPC workloads. It centralizes the management of NVIDIA DGX systems, GPU clusters, and heterogeneous computing environments, and it simplifies data center operations through automated provisioning, monitoring, and management of large-scale AI infrastructure.

Key Concepts

Cluster Management

BCM provides unified management of:

  • Compute nodes - DGX systems, GPU servers, and traditional compute nodes
  • Storage systems - Network-attached storage and distributed file systems
  • Network infrastructure - High-speed interconnects and network switches
  • User management - Authentication, authorization, and resource quotas
  • Job scheduling - Integration with SLURM, Kubernetes, and other schedulers

Resource Orchestration

BCM orchestrates resources through:

  • Automated provisioning - Bare metal and container-based deployment
  • Configuration management - Centralized system configuration and updates
  • Monitoring and alerting - Real-time health monitoring and proactive maintenance
  • Backup and recovery - Automated backup strategies and disaster recovery

BCM Architecture

Core Components

BCM Head Node:

  • Central management server running BCM software
  • Web-based management interface
  • REST API for programmatic access
  • Database for configuration and monitoring data

BCM Compute Nodes:

  • Managed compute resources (DGX systems, GPU servers)
  • BCM agent software for communication with head node
  • Automated configuration and monitoring capabilities

BCM Storage:

  • Centralized configuration and user data storage
  • Backup and recovery management
  • Shared file systems and data management

DPS Integration

DPS (Domain Power Service) is integrated into BCM as a plugin to provide power management and optimization capabilities for the infrastructure BCM manages.

Entity Generation

DPS entities can be generated automatically from the BCM cluster inventory with the dpsctl bcm import command:

# Generate DPS entities from the BCM cluster inventory.
# The URL and credentials below are placeholders; substitute your own.
dpsctl bcm import \
  --url https://bcm-headnode:8443 \
  --username admin \
  --password secret123
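
For scripted use, the same invocation can be assembled programmatically so that the password does not have to be hard-coded in a script. The following is a minimal Python sketch and is not part of DPS or BCM: the helper name build_import_command and the environment variable BCM_PASSWORD are hypothetical, and only the three flags shown above (--url, --username, --password) are assumed to exist.

```python
import os
import shlex


def build_import_command(url, username, password):
    """Return the argv list for the `dpsctl bcm import` call shown above."""
    for label, value in (("url", url), ("username", username), ("password", password)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return [
        "dpsctl", "bcm", "import",
        "--url", url,
        "--username", username,
        "--password", password,
    ]


# BCM_PASSWORD is a hypothetical variable name chosen for this sketch;
# "changeme" is only a fallback so the example runs without it.
password = os.environ.get("BCM_PASSWORD", "changeme")
argv = build_import_command("https://bcm-headnode:8443", "admin", password)

# Show the command with the password masked. To execute it, pass `argv`
# to subprocess.run(argv, check=True) on a host where dpsctl is installed.
masked = argv[:-1] + ["********"]
print(shlex.join(masked))
```

Building the command as a list rather than a single string avoids shell quoting problems if a credential contains spaces or special characters. The password is still passed as a process argument, so it remains visible to other users on the host who can list processes.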

Generated Entity Example:

{
  "entities": [
    {
      "name": "dgx001",
      "type": "ComputerSystem",
      "model": "DGX_H100",
      "redfish": {
        "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
        "@odata.id": "/dgx001",
        "id": "dgx001",
        "url": "https://dgx001-bmc.example.com",
        "secret_name": "dgx001"
      }
    }
  ]
}
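
A generated file can be sanity-checked before it is used. The Python sketch below parses a document of the shape shown above and reports each entity's name, model, and BMC URL. The function name summarize_entities and the set of fields treated as required are assumptions inferred from this single example, not from a published schema.

```python
import json

# Sample document matching the generated entity example above.
SAMPLE = """
{
  "entities": [
    {
      "name": "dgx001",
      "type": "ComputerSystem",
      "model": "DGX_H100",
      "redfish": {
        "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
        "@odata.id": "/dgx001",
        "id": "dgx001",
        "url": "https://dgx001-bmc.example.com",
        "secret_name": "dgx001"
      }
    }
  ]
}
"""

# Fields assumed to be required, inferred from the example only.
REQUIRED = ("name", "type", "model", "redfish")


def summarize_entities(text):
    """Return (name, model, bmc_url) tuples; raise ValueError on missing fields."""
    doc = json.loads(text)
    rows = []
    for entity in doc.get("entities", []):
        missing = [key for key in REQUIRED if key not in entity]
        if missing:
            raise ValueError(f"entity {entity.get('name', '?')} is missing {missing}")
        rows.append((entity["name"], entity["model"], entity["redfish"].get("url")))
    return rows


for name, model, url in summarize_entities(SAMPLE):
    print(f"{name}\t{model}\t{url}")
```

To check a real file, read its contents and pass the text to summarize_entities in place of SAMPLE.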

Further Reading