Deployment Summary Validation Checklist#

This checklist provides end-to-end validation for the NVIDIA Mission Control Management Plane and GB200 NVL72 rack deployment. Use it after completing all deployment phases to confirm that the entire system is properly configured and operational.

Note

This checklist consolidates validation points from all phases of the deployment process. Complete all sections systematically to ensure full deployment validation.

Prerequisites and Hardware Validation#

Physical Infrastructure#

Rack and Power Validation

  • [ ] All GB200 racks properly installed and positioned

  • [ ] Power infrastructure correctly configured (sufficient power capacity)

  • [ ] Cooling infrastructure adequate for GB200 NVL72 systems

  • [ ] Network infrastructure deployed per reference architecture

  • [ ] All cable connections verified (power, network, InfiniBand)

Hardware Inventory Verification

  • [ ] All head nodes present and correctly configured (primary and secondary for HA)

  • [ ] All control plane nodes present (slogin, k8s-admin, k8s-user)

  • [ ] All GB200 compute trays accounted for (18 per rack)

  • [ ] All NVLink switch trays present (9 per rack)

  • [ ] All power shelves installed (8 per rack)

  • [ ] Hardware matches documented specifications and BOMs

Network Infrastructure

  • [ ] All TOR switches configured and operational

  • [ ] OOB management network connectivity verified

  • [ ] InfiniBand fabric connectivity confirmed

  • [ ] Ethernet fabric connectivity confirmed

  • [ ] IPMI/BMC networks accessible from head nodes

BCM Software Installation Validation#

Head Node Configuration#

BCM Installation Verification

  • [ ] BCM 11 software successfully installed on primary head node

  • [ ] Correct BCM version installed and verified (Ubuntu 24.04 base)

  • [ ] Mixed architecture support configured for ARM/aarch64 and x86

  • [ ] BCM license successfully installed and activated

  • [ ] NVIDIA Mission Control license enabled

  • [ ] Head node network interfaces correctly configured

  • [ ] Head node storage configuration validated (RAID 1 recommended)

Post-Installation System Health

  • [ ] BCM services running correctly (systemctl status cmd)

  • [ ] Head node accessible via SSH

  • [ ] CMDaemon operational and responding

  • [ ] System logs free of critical errors

  • [ ] Adequate disk space available for operations

  • [ ] Time synchronization configured (NTP)

  • [ ] DNS resolution working correctly
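
To spot-check the items above from the primary head node (a minimal sketch, assuming systemd and standard DNS/time tooling):

# Confirm CMDaemon is active
systemctl status cmd

# Confirm disk space, time synchronization, and DNS resolution
df -h /
timedatectl status
host $(hostname -f)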

Network Configuration Validation#

Network Definitions#

Core Networks Configured

  • [ ] internalnet - Control plane provisioning network

  • [ ] dgxnet - GB200 compute node network(s)

  • [ ] ipminet - Out-of-band management network(s)

  • [ ] computenet - InfiniBand/East-West network

  • [ ] storagenet - Storage/converged Ethernet network

  • [ ] globalnet - Global network settings

Network Settings Verification

  • [ ] All networks have correct IP ranges and CIDR assignments

  • [ ] DHCP ranges configured appropriately for each network

  • [ ] Gateway settings configured correctly

  • [ ] Network booting enabled for provisioning networks

  • [ ] Management allowed settings configured appropriately

  • [ ] MTU settings configured (9000 for high-speed networks)
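
These settings can be reviewed from the head node with cmsh (network names must match your site plan):

# List all defined networks with base addresses and netmasks
cmsh -c "network; list"

# Inspect one network in detail (gateway, DHCP range, MTU, node booting)
cmsh -c "network; use internalnet; show"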

Head Node Network Interfaces

  • [ ] Provisioning bond (bond0) configured and operational

  • [ ] IPMI network bond (bond1) configured for OOB access

  • [ ] All physical interfaces properly assigned to bonds

  • [ ] Link aggregation (LACP mode 4) functioning correctly

  • [ ] Network interface naming conventions followed
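
Bond health can be verified directly on the head node (assuming the Linux bonding driver):

# Check bond mode (802.3ad for LACP mode 4), members, and link state
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1

# Quick overview of all interfaces
ip -br link show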

Software Image and Category Validation#

Mixed Architecture Setup#

Image Availability Verification

  • [ ] Default images available for all required architectures

  • [ ] ARM/aarch64 images properly configured

  • [ ] x86_64 images properly configured

  • [ ] Node-installer images for cross-architecture support

  • [ ] CM-shared images for all architectures

  • [ ] DGX OS 7 image available for GB200 nodes
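
A quick way to review the registered images (image names vary by site):

# List all software images known to BCM
cmsh -c "softwareimage; list"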

Category Configuration

  • [ ] slogin category configured with appropriate software image

  • [ ] k8s-admin category configured (x86 required for NMX-M)

  • [ ] k8s-user category configured with appropriate software image

  • [ ] dgx-gb200 category configured with DGX OS 7

  • [ ] All categories have correct network assignments

  • [ ] BMC settings configured at category level

  • [ ] Boot and installation options configured
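
Category assignments can be confirmed with cmsh:

# List all node categories
cmsh -c "category; list"

# Inspect a category's software image, networks, and BMC settings
cmsh -c "category; use dgx-gb200; show"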

Software Image Customization

  • [ ] Custom packages installed in appropriate images

  • [ ] GPU drivers and CUDA libraries in GB200 images

  • [ ] Kubernetes tools in k8s categories

  • [ ] Development tools in slogin images

  • [ ] All image customizations documented and tested

Control Plane Node Configuration Validation#

Head Node Configuration#

Primary Head Node

  • [ ] BMC interface (rf0/ipmi0) configured and accessible

  • [ ] Provisioning bond (bond0) operational with correct IP

  • [ ] IPMI network bond (bond1) configured for OOB access

  • [ ] All MAC addresses correctly assigned

  • [ ] Power control functional via BMC

  • [ ] BMC credentials properly configured

Secondary Head Node (if HA configured)

  • [ ] Secondary head node defined in BCM

  • [ ] Network interfaces properly configured

  • [ ] Cloned from primary head node successfully

  • [ ] Ready for HA failover setup

SLURM Login Nodes#

Golden Node Configuration

  • [ ] Physical node created with slogin category

  • [ ] ARM/C2 architecture confirmed

  • [ ] BMC interface (rf0) configured and accessible

  • [ ] Provisioning bond operational on internalnet

  • [ ] Storage network interfaces configured (/31 IPs)

  • [ ] MAC addresses correctly assigned to all interfaces

Node Cloning and Scaling

  • [ ] Additional slogin nodes cloned successfully

  • [ ] Hostname conventions followed: <RACK>-<RU>-P[1-16]-SLOGIN-0[1-X]

  • [ ] IP addresses incremented correctly

  • [ ] All cloned nodes have unique MAC addresses
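
A sketch of the cmsh clone flow for additional slogin nodes (hostnames are illustrative; adjust IPs and MACs after cloning):

# Clone a new slogin node from the golden node, then commit
cmsh -c "device; clone <golden_slogin> <new_slogin>; commit"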

K8s-Admin Nodes#

Configuration Requirements

  • [ ] Physical nodes created with k8s-admin category

  • [ ] x86 architecture confirmed (required for NMX-M compatibility)

  • [ ] Four network interfaces configured per node

  • [ ] Provisioning bond (bond0) operational

  • [ ] NVLink COMe network bond (bond1) configured

  • [ ] Total of 3 nodes configured (odd number for quorum)

K8s-User Nodes#

User Space Configuration

  • [ ] Physical nodes created with k8s-user category

  • [ ] Architecture documented (ARM or x86 supported)

  • [ ] Provisioning bond operational

  • [ ] Storage network interfaces configured

  • [ ] Total of 3 nodes configured (odd number for quorum)

Network Verification for All Control Plane Nodes

  • [ ] All interfaces show “always” for Start if

  • [ ] Bond interfaces show correct member interfaces

  • [ ] IP addresses match planning documentation

  • [ ] BMC connectivity verified for all nodes

  • [ ] Power control functional for all nodes
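
These checks follow the same cmsh patterns used elsewhere in this document:

# Review interfaces for each control plane category
cmsh -c "device; foreach -c slogin (interfaces; list)"
cmsh -c "device; foreach -c k8s-admin (interfaces; list)"
cmsh -c "device; foreach -c k8s-user (interfaces; list)"

# Confirm BMC power control responds
cmsh -c "device; power status -c slogin"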

GB200 Rack Configuration Validation#

Rack Import Verification#

Automated Import Process (if used)

  • [ ] Rack inventory file obtained from factory

  • [ ] Point-to-Point (P2P) documentation available

  • [ ] bcm-netautogen tool executed successfully

  • [ ] All .json files generated for rack components

  • [ ] bcm-post-install automation completed successfully

  • [ ] All devices imported into BCM without errors

Manual Import Process (if used)

  • [ ] Individual .json files created for all components

  • [ ] 18 GB200 compute tray entries

  • [ ] 9 NVLink switch tray entries

  • [ ] 8 power shelf entries

  • [ ] All .json files imported successfully

  • [ ] No import errors in BCM logs

Device Inventory Validation

  • [ ] Total device count matches expected:

      - [ ] 18 GB200 compute trays per rack

      - [ ] 9 NVLink switches per rack

      - [ ] 8 power shelves per rack

  • [ ] All devices appear in BCM device list

  • [ ] All devices show proper status in BCM
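
Device counts can be spot-checked from the head node (subtract any header lines from the output):

# Expect 18 compute trays and 9 NVLink switches per rack
cmsh -c "device; list -c dgx-gb200" | wc -l
cmsh -c "device; list -t switch | grep nvsw" | wc -l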

Naming Convention Compliance

  • [ ] GB200 compute trays: <RACK>-<RU>-P[1-16]-<ROLE>-0[1-8]-C0[1-18]

  • [ ] NVLink switches: <RACK>-<RU>-P[1-16]-<switch_role>-0[1-9]

  • [ ] Power shelves: proper naming convention followed

  • [ ] All hostnames consistent with site naming standards

GB200 Compute Tray Configuration#

Golden Node Configuration

  • [ ] Physical node created with dgx-gb200 category

  • [ ] BMC interface (rf0) configured and accessible

  • [ ] Network interfaces properly configured:

      - [ ] M1 and M2 (BlueField management ports)

      - [ ] S1 and S2 (storage network ports)

      - [ ] InfiniBand interfaces (if configured)

  • [ ] Provisioning bond (bond0) operational

  • [ ] System MAC set for initial boot

Node Cloning Verification

  • [ ] 18 compute tray entries cloned from golden node

  • [ ] All cloned nodes have incremental IPs

  • [ ] All cloned nodes have unique MAC addresses

  • [ ] Rack positions set correctly for all nodes

High Availability Configuration Validation#

HA Setup Verification#

Primary Head Node HA Configuration

  • [ ] cmha-setup wizard completed successfully

  • [ ] Virtual IP (VIP) addresses configured for internal and external networks

  • [ ] Secondary head node entry created in BCM

  • [ ] Failover network configuration completed

  • [ ] License updated with MACs from both head nodes

Secondary Head Node Deployment

  • [ ] Secondary head node PXE booted successfully

  • [ ] Rescue environment accessed

  • [ ] cm-clone-install --failover completed successfully

  • [ ] Secondary head node rebooted and accessible

  • [ ] Database synchronization completed

HA Functionality Testing

  • [ ] cmha status shows both nodes operational

  • [ ] Manual failover testing successful (cmha makeactive)

  • [ ] Both directions of failover tested

  • [ ] Primary and secondary head nodes function correctly as active

  • [ ] All HA tests show [ OK ] status
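
The HA commands referenced above:

# Confirm both head nodes are up and all HA tests show [ OK ]
cmha status

# Trigger a manual failover (run on the head node that should become active)
cmha makeactive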

NFS and Shared Storage#

NFS Infrastructure

  • [ ] NFS appliance/server configured and accessible

  • [ ] NFS exports configured with recommended options

  • [ ] Mount points created for /home and /cm/shared

  • [ ] Mixed architecture support configured (separate /cm/shared tree for each architecture)

  • [ ] NFSv3 configured (default for DGX SuperPOD)

Shared Storage Setup

  • [ ] NAS configuration completed in cmha-setup

  • [ ] /cm/shared directories copied to NFS

  • [ ] /home directory copied to NFS

  • [ ] All head nodes mount shared storage correctly

  • [ ] Compute nodes configured to use shared storage

Mixed Architecture Fsmounts

  • [ ] Fsmounts configured for each architecture

  • [ ] Incorrect default mounts removed

  • [ ] Architecture-specific /cm/shared mounts configured

  • [ ] Mount retry counts increased (retry=15)

  • [ ] All categories have correct fsmounts
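
Fsmounts can be reviewed per category with cmsh (a sketch; the dgx-gb200 category name follows this document's examples):

# List filesystem mounts configured for a category
cmsh -c "category; use dgx-gb200; fsmounts; list"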

Rack Power-On and Provisioning Validation#

GB200 Compute Tray Bring-Up#

Power Control Verification

  • [ ] OOB power control configured for all compute trays

  • [ ] Power status commands return proper responses

  • [ ] BMC connectivity verified for all trays

  • [ ] Power control settings configured (rf0 preferred)

  • [ ] Power reset delay configured (5s recommended)

Provisioning Process

  • [ ] Test node powered on and provisioned successfully

  • [ ] Node installer logs reviewed for errors

  • [ ] System logs reviewed for issues

  • [ ] Node appears in “UP” state in BCM

  • [ ] All compute trays provisioned successfully

  • [ ] Network bonds operational on all nodes

  • [ ] North-south networking connectivity verified
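
Node state and installer output can be checked from the head node (log location assumes BCM defaults):

# Confirm nodes show "UP" in the status column
cmsh -c "device; list -c dgx-gb200"

# Review node-installer output if a node fails to provision
less /var/log/node-installer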

Network Interface Validation

  • [ ] All BlueField and CX7 interfaces operational

  • [ ] Bond0 interfaces up and configured

  • [ ] Storage network interfaces configured

  • [ ] InfiniBand interfaces functional (if configured)

  • [ ] Link status verified for all connections
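
On a compute tray, link and bond state can be verified directly (ibstat requires the infiniband-diags package):

# Interface, address, and bond status
ip -br addr show
cat /proc/net/bonding/bond0

# InfiniBand port state, if configured
ibstat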

Power Shelf Validation#

Power Infrastructure

  • [ ] All power shelves reporting as operational

  • [ ] Power shelf status verified in BCM

  • [ ] Environmental monitoring functional (if applicable)

  • [ ] Power capacity adequate for full rack operation

Firmware Update Validation#

Compute Tray Firmware#

Firmware Management Setup

  • [ ] Firmware packages placed in correct BCM directories

  • [ ] BMC firmware management mode set to “gb200”

  • [ ] Current firmware versions documented

  • [ ] Target firmware versions identified
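
Current BMC firmware can be confirmed out-of-band before updating (credentials and BMC IPs are site-specific):

# Query BMC firmware version via IPMI
ipmitool -I lanplus -H <bmc_ip> -U <user> -P <password> mc info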

Firmware Update Process

  • [ ] BMC firmware update completed successfully

  • [ ] Compute tray firmware update completed

  • [ ] AC power cycles (AUX power) performed after updates

  • [ ] New firmware versions verified and activated

  • [ ] All components show expected firmware versions

Update Verification

  • [ ] All BMC, GPU, CPU, and FPGA firmware up to date

  • [ ] No failed firmware update attempts

  • [ ] System stability verified post-update

  • [ ] Performance validated after firmware updates

BlueField and CX7 Firmware#

Network Card Firmware

  • [ ] MFT tools installed and functional

  • [ ] Current firmware versions identified

  • [ ] Firmware updates applied successfully

  • [ ] Network interfaces functional post-update

  • [ ] Performance validated after updates
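
Typical MFT commands for the query step (assuming MFT is installed on the node):

# Start the Mellanox Software Tools service, then query NIC firmware
mst start
mlxfwmanager --query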

Power Shelf Firmware#

Power Management Firmware

  • [ ] PMC firmware updated to latest version

  • [ ] PSU firmware updated successfully

  • [ ] Power shelf functionality verified

  • [ ] No power interruptions during updates

System Integration and Final Validation#

Network Connectivity#

End-to-End Network Validation

  • [ ] All management network connectivity verified

  • [ ] Provisioning network fully operational

  • [ ] Storage network performance validated

  • [ ] InfiniBand fabric connectivity confirmed

  • [ ] Inter-rack connectivity verified (if multiple racks)

Network Performance Testing

  • [ ] Bandwidth testing completed on all networks

  • [ ] Latency measurements within acceptable ranges

  • [ ] No network errors or dropped packets

  • [ ] Load balancing functional across bonds

  • [ ] Failover testing completed for redundant paths
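
A simple bandwidth spot-check between two nodes (illustrative; iperf3 must be installed on both):

# On the server node
iperf3 -s

# On the client node: 8 parallel streams against the server
iperf3 -c <server_ip> -P 8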

Compute Workload Validation#

System Readiness for Workloads

  • [ ] All GB200 compute trays accessible and operational

  • [ ] GPU functionality verified on all nodes

  • [ ] CUDA and driver installations validated

  • [ ] NVLink fabric operational across all nodes

  • [ ] Memory and storage accessible on all nodes
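
GPU and NVLink state can be spot-checked on any compute tray:

# GPU health plus driver and CUDA versions
nvidia-smi

# NVLink/NVSwitch topology between GPUs
nvidia-smi topo -m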

Control Plane Functionality

  • [ ] SLURM login nodes accessible and functional

  • [ ] Kubernetes admin nodes operational

  • [ ] Kubernetes user space nodes functional

  • [ ] Workload orchestration systems ready
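
Quick orchestration checks (run from the appropriate control plane nodes):

# SLURM: confirm partitions and nodes are up (from an slogin node)
sinfo

# Kubernetes: confirm all nodes report Ready (from a k8s-admin node)
kubectl get nodes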

Monitoring and Management#

System Monitoring

  • [ ] All devices reporting health status to BCM

  • [ ] Environmental monitoring functional

  • [ ] Performance monitoring operational

  • [ ] Log aggregation functional

  • [ ] Alerting systems configured

Management Interface Validation

  • [ ] BCM web interface accessible and functional

  • [ ] Command line management (cmsh) operational

  • [ ] Remote management capabilities verified

  • [ ] Backup and recovery procedures tested

Security and Access Control#

Access Control Validation

  • [ ] User authentication systems functional

  • [ ] Network security policies enforced

  • [ ] BMC access restricted and monitored

  • [ ] SSH access controlled and logged

  • [ ] Service accounts properly configured

Security Hardening

  • [ ] Unnecessary services disabled

  • [ ] Firewall rules configured appropriately

  • [ ] Security patches applied

  • [ ] Vulnerability scanning completed

Documentation and Handover#

System Documentation#

Configuration Documentation

  • [ ] All IP address assignments documented

  • [ ] Network topology documented with diagrams

  • [ ] Hardware inventory and serial numbers recorded

  • [ ] Configuration files backed up and documented

  • [ ] Custom configurations and modifications noted

Operational Documentation

  • [ ] Standard operating procedures documented

  • [ ] Troubleshooting guides prepared

  • [ ] Emergency procedures documented

  • [ ] Contact information for support escalation

  • [ ] System administrator training completed

Change Management

  • [ ] Change control procedures established

  • [ ] Backup and recovery procedures validated

  • [ ] Update and maintenance schedules defined

  • [ ] System lifecycle management plan created

Performance Validation#

System Performance Benchmarks#

Compute Performance

  • [ ] GPU performance benchmarks completed

  • [ ] CPU performance validated

  • [ ] Memory bandwidth tested

  • [ ] Storage performance validated

  • [ ] Network throughput measured

System Load Testing

  • [ ] Full system load testing completed

  • [ ] Stress testing passed on all components

  • [ ] Thermal management validated under load

  • [ ] Power consumption measurements recorded

  • [ ] System stability verified over extended periods

Final System Acceptance#

Deployment Sign-Off#

Technical Validation Complete

  • [ ] All technical validation items completed

  • [ ] Performance requirements met

  • [ ] Functional requirements satisfied

  • [ ] Non-functional requirements validated

  • [ ] System ready for production workloads

Stakeholder Approval

  • [ ] System administrator sign-off obtained

  • [ ] Operations team acceptance confirmed

  • [ ] End user validation completed

  • [ ] Management approval received

  • [ ] System officially handed over for production use

Post-Deployment Activities

  • [ ] Monitoring baselines established

  • [ ] Operational procedures transferred

  • [ ] Support escalation procedures activated

  • [ ] Training completion verified

  • [ ] System transition to production support complete

Commands for Quick Final Validation#

Use these commands for rapid final system validation:

System Status Overview

# Check overall cluster status
cmsh -c "device; list"
cmsh -c "device; list -c dgx-gb200"
cmsh -c "device; list -t switch | grep nvsw"

# Verify HA status (if configured)
cmha status

# Check all rack devices
cmsh -c "rack; list"
cmsh -c "rack; use <rack_name>; rackoverview"

Network Connectivity Validation

# Verify interface status
cmsh -c "device; foreach -c dgx-gb200 (interfaces; list)"

# Test network connectivity
# From the head node: ping the gateway of each network; failures usually
# point to network configuration or firewall rules
ping -c 3 <gateway_ip>

# From a compute node: confirm the head node is reachable
ping -c 3 <headnode_ip>

# From the head node: confirm SSH access to an NVLink switch
ssh <nvlink_switch_hostname>

Power and BMC Validation

# Check power status across all devices
cmsh -c "device; power status -c dgx-gb200"
cmsh -c "device; power status -t switch | grep nvsw"

# Verify BMC connectivity
cmsh -c "device; foreach -c dgx-gb200 (bmcsettings; show)"

Health and Monitoring Validation

# Check device health
cmsh -c "device; foreach -t switch | grep -invsw (latesthealthdata)"
cmsh -c "device; foreach -c dgx-gb200 (latesthealthdata)"

# Monitor system logs
tail -f /var/log/cmdaemon
tail -f /var/log/syslog

Performance Quick Check

# Basic network performance test
<TBD>

# Check system resources
<TBD>
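
Until the site-specific commands above are defined, generic checks such as these can serve as placeholders (illustrative only; iperf3 must be installed on both nodes):

# Network throughput between two nodes
iperf3 -s                      # on one node
iperf3 -c <server_ip> -P 4     # on another node

# Memory, disk, and load on a node
free -h
df -h
uptime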

Critical Success Criteria#

The deployment is considered successful when all of the following criteria are met:

Infrastructure Criteria

  • [ ] All physical hardware operational and accessible

  • [ ] All network connectivity functional with expected performance

  • [ ] All power and cooling systems operational within specifications

  • [ ] All firmware at required versions with no critical vulnerabilities

Software Criteria

  • [ ] BCM 11 operational with all required licenses

  • [ ] All compute nodes provisioned and accessible

  • [ ] All control plane services functional

  • [ ] High availability functional (if configured)

Integration Criteria

  • [ ] End-to-end system integration validated

  • [ ] All monitoring and management systems operational

  • [ ] Security policies enforced and validated

  • [ ] Performance meets or exceeds requirements

Operational Criteria

  • [ ] System ready for production workloads

  • [ ] Operations team trained and ready

  • [ ] Documentation complete and accessible

  • [ ] Support procedures established and tested

Note

This comprehensive validation checklist should be completed systematically. Any failed validation items must be addressed before considering the deployment complete. Document any deviations from standard configurations and ensure they are approved through proper change management processes.