Deployment Summary Validation Checklist#
This comprehensive checklist provides end-to-end validation for the entire NVIDIA Mission Control Management Plane and GB200 NVL72 rack deployment process. Use this checklist after completing all deployment phases to ensure the entire system is properly configured and operational.
Note
This checklist consolidates validation points from all phases of the deployment process. Complete all sections systematically to ensure full deployment validation.
Prerequisites and Hardware Validation#
Physical Infrastructure#
Rack and Power Validation
[ ] All GB200 racks properly installed and positioned
[ ] Power infrastructure correctly configured (sufficient power capacity)
[ ] Cooling infrastructure adequate for GB200 NVL72 systems
[ ] Network infrastructure deployed per reference architecture
[ ] All cable connections verified (power, network, InfiniBand)
Hardware Inventory Verification
[ ] All head nodes present and correctly configured (primary and secondary for HA)
[ ] All control plane nodes present (slogin, k8s-admin, k8s-user)
[ ] All GB200 compute trays accounted for (18 per rack)
[ ] All NVLink switch trays present (9 per rack)
[ ] All power shelves installed (8 per rack)
[ ] Hardware matches documented specifications and BOMs
Network Infrastructure
[ ] All TOR switches configured and operational
[ ] OOB management network connectivity verified
[ ] InfiniBand fabric connectivity confirmed
[ ] Ethernet fabric connectivity confirmed
[ ] IPMI/BMC networks accessible from head nodes
BCM Software Installation Validation#
Head Node Configuration#
BCM Installation Verification
[ ] BCM 11 software successfully installed on primary head node
[ ] Correct BCM version installed and verified (Ubuntu 24.04 base)
[ ] Mixed architecture support configured for ARM/aarch64 and x86
[ ] BCM license successfully installed and activated
[ ] NVIDIA Mission Control license enabled
[ ] Head node network interfaces correctly configured
[ ] Head node storage configuration validated (RAID 1 recommended)
Post-Installation System Health
[ ] BCM services running correctly (systemctl status cmd)
[ ] Head node accessible via SSH
[ ] CMDaemon operational and responding
[ ] System logs free of critical errors
[ ] Adequate disk space available for operations
[ ] Time synchronization configured (NTP)
[ ] DNS resolution working correctly
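The health items above can be spot-checked from a head-node shell. This is a minimal sketch using standard Linux tools; only `cmd` as the CMDaemon service name comes from this checklist, the rest are generic system checks:

```shell
# Run on the head node after installation
systemctl status cmd --no-pager      # BCM management daemon (CMDaemon)
timedatectl status                   # NTP synchronization state
df -h /                              # disk space available for operations
host "$(hostname -f)"                # DNS resolution of the head node
journalctl -p err -n 20 --no-pager   # recent error-level log entries
```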
Network Configuration Validation#
Network Definitions#
Core Networks Configured
[ ] internalnet - Control plane provisioning network
[ ] dgxnet - GB200 compute node network(s)
[ ] ipminet - Out-of-band management network(s)
[ ] computenet - InfiniBand/East-West network
[ ] storagenet - Storage/converged Ethernet network
[ ] globalnet - Global network settings
Network Settings Verification
[ ] All networks have correct IP ranges and CIDR assignments
[ ] DHCP ranges configured appropriately for each network
[ ] Gateway settings configured correctly
[ ] Network booting enabled for provisioning networks
[ ] Management allowed settings configured appropriately
[ ] MTU settings configured (9000 for high-speed networks)
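Assuming BCM's standard cmsh network mode, the definitions above can be reviewed non-destructively; the network names are the ones listed earlier in this section:

```shell
# List all defined networks with their base addresses and netmasks
cmsh -c "network; list"
# Show one network's full settings (gateway, DHCP range, MTU, node booting)
cmsh -c "network; use internalnet; show"
```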
Head Node Network Interfaces
[ ] Provisioning bond (bond0) configured and operational
[ ] IPMI network bond (bond1) configured for OOB access
[ ] All physical interfaces properly assigned to bonds
[ ] Link aggregation (LACP mode 4) functioning correctly
[ ] Network interface naming conventions followed
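Bond membership and LACP state can be confirmed directly on the head node with standard Linux tooling, following the bond0/bond1 layout described above:

```shell
# 802.3ad (LACP, mode 4) status and member interfaces
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1
# One-line state summary for every interface
ip -br link
```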
Software Image and Category Validation#
Mixed Architecture Setup#
Image Availability Verification
[ ] Default images available for all required architectures
[ ] ARM/aarch64 images properly configured
[ ] x86_64 images properly configured
[ ] Node-installer images for cross-architecture support
[ ] CM-shared images for all architectures
[ ] DGX OS 7 image available for GB200 nodes
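A quick way to confirm image availability, assuming the standard cmsh softwareimage mode:

```shell
# List software images; verify ARM/aarch64, x86_64, and DGX OS 7 entries exist
cmsh -c "softwareimage; list"
```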
Category Configuration
[ ] slogin category configured with appropriate software image
[ ] k8s-admin category configured (x86 required for NMX-M)
[ ] k8s-user category configured with appropriate software image
[ ] dgx-gb200 category configured with DGX OS 7
[ ] All categories have correct network assignments
[ ] BMC settings configured at category level
[ ] Boot and installation options configured
Software Image Customization
[ ] Custom packages installed in appropriate images
[ ] GPU drivers and CUDA libraries in GB200 images
[ ] Kubernetes tools in k8s categories
[ ] Development tools in slogin images
[ ] All image customizations documented and tested
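Category assignments can be audited from cmsh; the category names are the ones defined above:

```shell
# List all categories and their software images
cmsh -c "category; list"
# Inspect one category's image, networks, BMC, and boot settings
cmsh -c "category; use dgx-gb200; show"
```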
Control Plane Node Configuration Validation#
Head Node Configuration#
Primary Head Node
[ ] BMC interface (rf0/ipmi0) configured and accessible
[ ] Provisioning bond (bond0) operational with correct IP
[ ] IPMI network bond (bond1) configured for OOB access
[ ] All MAC addresses correctly assigned
[ ] Power control functional via BMC
[ ] BMC credentials properly configured
Secondary Head Node (if HA configured)
[ ] Secondary head node defined in BCM
[ ] Network interfaces properly configured
[ ] Cloned from primary head node successfully
[ ] Ready for HA failover setup
SLURM Login Nodes#
Golden Node Configuration
[ ] Physical node created with slogin category
[ ] ARM/C2 architecture confirmed
[ ] BMC interface (rf0) configured and accessible
[ ] Provisioning bond operational on internalnet
[ ] Storage network interfaces configured (/31 IPs)
[ ] MAC addresses correctly assigned to all interfaces
Node Cloning and Scaling
[ ] Additional slogin nodes cloned successfully
[ ] Hostname conventions followed: <RACK>-<RU>-P[1-16]-SLOGIN-0[1-X]
[ ] IP addresses incremented correctly
[ ] All cloned nodes have unique MAC addresses
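The hostname convention above can be exercised with a small loop when preparing clone names; the rack, RU, and partition values below are illustrative, not taken from any deployment:

```shell
# Generate slogin hostnames per the documented convention (example values only)
RACK="A01"; RU="U10"; PART="P1"
for i in $(seq 1 4); do
  printf '%s-%s-%s-SLOGIN-%02d\n' "$RACK" "$RU" "$PART" "$i"
done
# prints A01-U10-P1-SLOGIN-01 through A01-U10-P1-SLOGIN-04
```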
K8s-Admin Nodes#
Configuration Requirements
[ ] Physical nodes created with k8s-admin category
[ ] x86 architecture confirmed (required for NMX-M compatibility)
[ ] Four network interfaces configured per node
[ ] Provisioning bond (bond0) operational
[ ] NVLink COMe network bond (bond1) configured
[ ] Total of 3 nodes configured (odd number for quorum)
K8s-User Nodes#
User Space Configuration
[ ] Physical nodes created with k8s-user category
[ ] Architecture documented (ARM or x86 supported)
[ ] Provisioning bond operational
[ ] Storage network interfaces configured
[ ] Total of 3 nodes configured (odd number for quorum)
Network Verification for All Control Plane Nodes
[ ] All interfaces have “Start if” set to “always”
[ ] Bond interfaces show correct member interfaces
[ ] IP addresses match planning documentation
[ ] BMC connectivity verified for all nodes
[ ] Power control functional for all nodes
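Per-node interface assignments can be checked against the planning documentation from cmsh; the `<hostname>` placeholder is site-specific:

```shell
# List configured interfaces (bonds, members, IPs) for one control plane node
cmsh -c "device; use <hostname>; interfaces; list"
# Confirm BMC reachability and power control for the same node
cmsh -c "device; use <hostname>; power status"
```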
GB200 Rack Configuration Validation#
Rack Import Verification#
Automated Import Process (if used)
[ ] Rack inventory file obtained from factory
[ ] Point-to-Point (P2P) documentation available
[ ] bcm-netautogen tool executed successfully
[ ] All .json files generated for rack components
[ ] bcm-post-install automation completed successfully
[ ] All devices imported into BCM without errors
Manual Import Process (if used)
[ ] Individual .json files created for all components
[ ] 18 GB200 compute tray entries
[ ] 9 NVLink switch tray entries
[ ] 8 power shelf entries
[ ] All .json files imported successfully
[ ] No import errors in BCM logs
Device Inventory Validation
[ ] Total device count matches expected:
    [ ] 18 GB200 compute trays per rack
    [ ] 9 NVLink switches per rack
    [ ] 8 power shelves per rack
[ ] All devices appear in BCM device list
[ ] All devices show proper status in BCM
Naming Convention Compliance
[ ] GB200 compute trays: <RACK>-<RU>-P[1-16]-<ROLE>-0[1-8]-C0[1-18]
[ ] NVLink switches: <RACK>-<RU>-P[1-16]-<switch_role>-0[1-9]
[ ] Power shelves: proper naming convention followed
[ ] All hostnames consistent with site naming standards
GB200 Compute Tray Configuration#
Golden Node Configuration
[ ] Physical node created with dgx-gb200 category
[ ] BMC interface (rf0) configured and accessible
[ ] Network interfaces properly configured:
    [ ] M1 and M2 (BlueField management ports)
    [ ] S1 and S2 (storage network ports)
    [ ] InfiniBand interfaces (if configured)
[ ] Provisioning bond (bond0) operational
[ ] System MAC set for initial boot
Node Cloning Verification
[ ] 18 compute tray entries cloned from golden node
[ ] All cloned nodes have incremental IPs
[ ] All cloned nodes have unique MAC addresses
[ ] Rack positions set correctly for all nodes
NVLink Switch Configuration#
Basic Switch Configuration
[ ] Switch entries created in BCM with proper hostnames
[ ] MAC addresses configured correctly
[ ] cm-lite-daemon enabled (hasclientdaemon = yes)
[ ] Switch kind set to nvlink
[ ] SNMP disabled (disablesnmp = yes)
Network Interface Configuration
[ ] eth0/COMe0 interface configured with correct MAC
[ ] Management network IP assigned
[ ] SSH access configured and tested
[ ] REST port set to 443
ZTP Configuration (if applicable)
[ ] ZTP settings configured with proper templates
[ ] API enabled on switches
[ ] Switch-specific directories created
[ ] NVOS firmware images available
[ ] Startup configuration files configured
Switch Cloning and Registration
[ ] 9 NVLink switch entries cloned from golden switch
[ ] All switches have incremental IPs
[ ] All switches have unique MAC addresses
[ ] cm-lite-daemon installed on all switches
[ ] All switches show “UP” status in BCM
High Availability Configuration Validation#
HA Setup Verification#
Primary Head Node HA Configuration
[ ] cmha-setup wizard completed successfully
[ ] Virtual IP (VIP) addresses configured for internal and external networks
[ ] Secondary head node entry created in BCM
[ ] Failover network configuration completed
[ ] License updated with MACs from both head nodes
Secondary Head Node Deployment
[ ] Secondary head node PXE booted successfully
[ ] Rescue environment accessed
[ ] cm-clone-install --failover completed successfully
[ ] Secondary head node rebooted and accessible
[ ] Database synchronization completed
HA Functionality Testing
[ ] cmha status shows both nodes operational
[ ] Manual failover testing successful (cmha makeactive)
[ ] Both directions of failover tested
[ ] Primary and secondary head nodes function correctly as active
[ ] All HA tests show [ OK ] status
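The HA checks above map to BCM's cmha tool; both commands appear elsewhere in this checklist, and a full test exercises failover in each direction:

```shell
# Both head nodes should report OK, with one active and one passive
cmha status
# From the passive head node, take over the active role to test failover
cmha makeactive
# Repeat in the other direction, then re-run the status check
cmha status
```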
Rack Power-On and Provisioning Validation#
GB200 Compute Tray Bring-Up#
Power Control Verification
[ ] OOB power control configured for all compute trays
[ ] Power status commands return proper responses
[ ] BMC connectivity verified for all trays
[ ] Power control settings configured (rf0 preferred)
[ ] Power reset delay configured (5s recommended)
Provisioning Process
[ ] Test node powered on and provisioned successfully
[ ] Node installer logs reviewed for errors
[ ] System logs reviewed for issues
[ ] Node appears in “UP” state in BCM
[ ] All compute trays provisioned successfully
[ ] Network bonds operational on all nodes
[ ] North-south networking connectivity verified
Network Interface Validation
[ ] All BlueField and CX-7 interfaces operational
[ ] Bond0 interfaces up and configured
[ ] Storage network interfaces configured
[ ] InfiniBand interfaces functional (if configured)
[ ] Link status verified for all connections
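On an individual compute tray, the interface and link checks above can be sketched with standard tools; ibstat requires the infiniband-diags package:

```shell
# Interface and address overview
ip -br addr
# Provisioning bond link state
ethtool bond0 | grep "Link detected"
# InfiniBand port state and rate (if IB is configured)
ibstat | grep -E "State|Rate"
```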
NVLink Switch Bring-Up#
Basic Connectivity
[ ] SSH access to all NVLink switches verified
[ ] admin user access functional with correct passwords
[ ] COMe0 and COMe1 network interfaces operational
[ ] BMC access verified for all switches
[ ] Switch status shows “UP” in BCM
cm-lite-daemon Installation
[ ] cm-lite-daemon downloaded and prepared
[ ] Installation completed on all switches
[ ] Registration successful with BCM
[ ] Switch monitoring and health data functional
[ ] No installation errors in logs
NMX-C Configuration
[ ] NMX-C leader configured (typically NVSW-01)
[ ] Cluster apps enabled on master switch
[ ] fm_config.cfg file generated and installed
[ ] NMX-C controller status shows “ok”
[ ] Fabric manager configuration operational
NMX-Telemetry Configuration
[ ] Telemetry service started successfully
[ ] nmx-telemetry status shows “ok”
[ ] Telemetry data collection functional
[ ] Monitoring integration operational
Power Shelf Validation#
Power Infrastructure
[ ] All power shelves reporting as operational
[ ] Power shelf status verified in BCM
[ ] Environmental monitoring functional (if applicable)
[ ] Power capacity adequate for full rack operation
Firmware Update Validation#
Compute Tray Firmware#
Firmware Management Setup
[ ] Firmware packages placed in correct BCM directories
[ ] BMC firmware management mode set to “gb200”
[ ] Current firmware versions documented
[ ] Target firmware versions identified
Firmware Update Process
[ ] BMC firmware update completed successfully
[ ] Compute tray firmware update completed
[ ] AC power cycles (AUX power) performed after updates
[ ] New firmware versions verified and activated
[ ] All components show expected firmware versions
Update Verification
[ ] All BMC, GPU, CPU, and FPGA firmware up to date
[ ] No failed firmware update attempts
[ ] System stability verified post-update
[ ] Performance validated after firmware updates
NVLink Switch Firmware#
Switch Firmware Management
[ ] Firmware packages available in BCM directories
[ ] Switch firmware management mode set to “gb200sw”
[ ] Current firmware versions documented
Firmware Update Sequence
[ ] BMC+FPGA+ERoT firmware updated first
[ ] CPLD firmware updated second
[ ] SBIOS+ERoT firmware updated third
[ ] NVOS software updated
[ ] Power cycles performed between major updates
Update Validation
[ ] All switch firmware components at target versions
[ ] Switch functionality verified post-update
[ ] cm-lite-daemon operational after updates
[ ] NMX-C and NMX-T services functional
BlueField and CX7 Firmware#
Network Card Firmware
[ ] MFT tools installed and functional
[ ] Current firmware versions identified
[ ] Firmware updates applied successfully
[ ] Network interfaces functional post-update
[ ] Performance validated after updates
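Firmware queries on BlueField and CX-7 cards typically use NVIDIA's MFT tools; the device path below is only an example and varies per host:

```shell
# Start MST and enumerate devices
mst start
mst status
# Query current firmware on one device (example path; check mst status output)
flint -d /dev/mst/mt41692_pciconf0 query
```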
Power Shelf Firmware#
Power Management Firmware
[ ] PMC firmware updated to latest version
[ ] PSU firmware updated successfully
[ ] Power shelf functionality verified
[ ] No power interruptions during updates
System Integration and Final Validation#
Network Connectivity#
End-to-End Network Validation
[ ] All management network connectivity verified
[ ] Provisioning network fully operational
[ ] Storage network performance validated
[ ] InfiniBand fabric connectivity confirmed
[ ] Inter-rack connectivity verified (if multiple racks)
Network Performance Testing
[ ] Bandwidth testing completed on all networks
[ ] Latency measurements within acceptable ranges
[ ] No network errors or dropped packets
[ ] Load balancing functional across bonds
[ ] Failover testing completed for redundant paths
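Bandwidth and latency figures for the tests above can be gathered with common tools such as iperf3 and ping; hosts and durations below are placeholders:

```shell
# On one node, start an iperf3 server
iperf3 -s
# On a second node, run a 60-second test with 8 parallel streams
iperf3 -c <server_ip> -P 8 -t 60
# Latency spot check
ping -c 20 <server_ip>
```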
Compute Workload Validation#
System Readiness for Workloads
[ ] All GB200 compute trays accessible and operational
[ ] GPU functionality verified on all nodes
[ ] CUDA and driver installations validated
[ ] NVLink fabric operational across all nodes
[ ] Memory and storage accessible on all nodes
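GPU and NVLink health on each tray can be spot-checked with nvidia-smi, which ships with the driver install validated above:

```shell
# Driver version, GPU inventory, and utilization
nvidia-smi
# GPU-to-GPU topology, including NVLink connectivity
nvidia-smi topo -m
# Per-link NVLink status
nvidia-smi nvlink --status
```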
Control Plane Functionality
[ ] SLURM login nodes accessible and functional
[ ] Kubernetes admin nodes operational
[ ] Kubernetes user space nodes functional
[ ] Workload orchestration systems ready
Monitoring and Management#
System Monitoring
[ ] All devices reporting health status to BCM
[ ] Environmental monitoring functional
[ ] Performance monitoring operational
[ ] Log aggregation functional
[ ] Alerting systems configured
Management Interface Validation
[ ] BCM web interface accessible and functional
[ ] Command line management (cmsh) operational
[ ] Remote management capabilities verified
[ ] Backup and recovery procedures tested
Security and Access Control#
Access Control Validation
[ ] User authentication systems functional
[ ] Network security policies enforced
[ ] BMC access restricted and monitored
[ ] SSH access controlled and logged
[ ] Service accounts properly configured
Security Hardening
[ ] Unnecessary services disabled
[ ] Firewall rules configured appropriately
[ ] Security patches applied
[ ] Vulnerability scanning completed
Documentation and Handover#
System Documentation#
Configuration Documentation
[ ] All IP address assignments documented
[ ] Network topology documented with diagrams
[ ] Hardware inventory and serial numbers recorded
[ ] Configuration files backed up and documented
[ ] Custom configurations and modifications noted
Operational Documentation
[ ] Standard operating procedures documented
[ ] Troubleshooting guides prepared
[ ] Emergency procedures documented
[ ] Contact information for support escalation
[ ] System administrator training completed
Change Management
[ ] Change control procedures established
[ ] Backup and recovery procedures validated
[ ] Update and maintenance schedules defined
[ ] System lifecycle management plan created
Performance Validation#
System Performance Benchmarks#
Compute Performance
[ ] GPU performance benchmarks completed
[ ] CPU performance validated
[ ] Memory bandwidth tested
[ ] Storage performance validated
[ ] Network throughput measured
System Load Testing
[ ] Full system load testing completed
[ ] Stress testing passed on all components
[ ] Thermal management validated under load
[ ] Power consumption measurements recorded
[ ] System stability verified over extended periods
Final System Acceptance#
Deployment Sign-Off#
Technical Validation Complete
[ ] All technical validation items completed
[ ] Performance requirements met
[ ] Functional requirements satisfied
[ ] Non-functional requirements validated
[ ] System ready for production workloads
Stakeholder Approval
[ ] System administrator sign-off obtained
[ ] Operations team acceptance confirmed
[ ] End user validation completed
[ ] Management approval received
[ ] System officially handed over for production use
Post-Deployment Activities
[ ] Monitoring baselines established
[ ] Operational procedures transferred
[ ] Support escalation procedures activated
[ ] Training completion verified
[ ] System transition to production support complete
Commands for Quick Final Validation#
Use these commands for rapid final system validation:
System Status Overview
# Check overall cluster status
cmsh -c "device; list"
cmsh -c "device; list -c dgx-gb200"
cmsh -c "device; list -t switch | grep nvsw"
# Verify HA status (if configured)
cmha status
# Check all rack devices
cmsh -c "rack; list"
cmsh -c "rack; use <rack_name>; rackoverview"
Network Connectivity Validation
# Verify interface status
cmsh -c "device; foreach -c dgx-gb200 (interfaces; list)"
# Test network connectivity
# Ping the gateway of each network from the head node
ping -c 3 <network_gateway>
# Ping the head node from a compute node
ssh <compute_node> ping -c 3 <head_node_ip>
# SSH from the head node to an NVLink switch
ssh admin@<nvlink_switch>
# If any check fails, review the network configuration and firewall rules.
Power and BMC Validation
# Check power status across all devices
cmsh -c "device; power status -c dgx-gb200"
cmsh -c "device; power status -t switch | grep nvsw"
# Verify BMC connectivity
cmsh -c "device; foreach -c dgx-gb200 (bmcsettings; show)"
Health and Monitoring Validation
# Check device health
cmsh -c "device; foreach -t switch (latesthealthdata)"
cmsh -c "device; foreach -c dgx-gb200 (latesthealthdata)"
# Monitor system logs
tail -f /var/log/cmdaemon
tail -f /var/log/syslog
Performance Quick Check
# Basic network performance test
<TBD>
# Check system resources
<TBD>
Critical Success Criteria#
The deployment is considered successful when all of the following criteria are met:
Infrastructure Criteria
[ ] All physical hardware operational and accessible
[ ] All network connectivity functional with expected performance
[ ] All power and cooling systems operational within specifications
[ ] All firmware at required versions with no critical vulnerabilities
Software Criteria
[ ] BCM 11 operational with all required licenses
[ ] All compute nodes provisioned and accessible
[ ] All control plane services functional
[ ] High availability functional (if configured)
Integration Criteria
[ ] End-to-end system integration validated
[ ] All monitoring and management systems operational
[ ] Security policies enforced and validated
[ ] Performance meets or exceeds requirements
Operational Criteria
[ ] System ready for production workloads
[ ] Operations team trained and ready
[ ] Documentation complete and accessible
[ ] Support procedures established and tested
Note
This comprehensive validation checklist should be completed systematically. Any failed validation items must be addressed before considering the deployment complete. Document any deviations from standard configurations and ensure they are approved through proper change management processes.