Deployment Summary Validation Checklist#
This checklist provides end-to-end validation for the NVIDIA Mission Control Management Plane and GB200/GB300 NVL72 rack deployment process. Use it after completing all deployment phases to confirm that the entire system is properly configured and operational.
Note
This checklist consolidates validation points from all phases of the deployment process. Complete all sections systematically to ensure full deployment validation.
Prerequisites and Hardware Validation#
Physical Infrastructure#
Rack and Power Validation
[ ] All GB200/GB300 racks properly installed and positioned
[ ] Power infrastructure correctly configured (sufficient power capacity)
[ ] Cooling infrastructure adequate for GB200/GB300 NVL72 systems
[ ] Network infrastructure deployed per reference architecture
[ ] All cable connections verified (power, network, InfiniBand)
Hardware Inventory Verification
[ ] All head nodes present and correctly configured (primary and secondary for HA)
[ ] All control plane nodes present (slogin, k8s-admin, k8s-user)
[ ] All GB200/GB300 compute trays accounted for (18 per rack)
[ ] All NVLink switch trays present (9 per rack)
[ ] All power shelves installed (8 per rack)
[ ] Hardware matches documented specifications and BOMs
Network Infrastructure
[ ] All TOR switches configured and operational
[ ] OOB management network connectivity verified
[ ] InfiniBand fabric connectivity confirmed
[ ] Ethernet fabric connectivity confirmed
[ ] IPMI/BMC networks accessible from head nodes
BCM Software Installation Validation#
This section provides a checklist for the BCM software installation and configuration that is completed on the head node.
Head Node Configuration#
BCM Installation Verification
[ ] BCM 11 software successfully installed on primary head node
[ ] Correct BCM version installed and verified (Ubuntu 24.04 base)
[ ] Mixed architecture support configured for ARM/aarch64 and x86
[ ] BCM license successfully installed and activated
[ ] NVIDIA Mission Control license enabled
[ ] Head node network interfaces correctly configured
[ ] Head node storage configuration validated (RAID 1 recommended)
Post-Installation System Health
[ ] BCM services running correctly (systemctl status cmd)
[ ] Head node accessible through SSH
[ ] CMDaemon operational and responding
[ ] System logs free of critical errors
[ ] Adequate disk space available for operations
[ ] Time synchronization configured (NTP)
[ ] DNS resolution working correctly
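The disk-space item above can be spot-checked with a small helper. The sketch below flags filesystems above a usage threshold by parsing `df -P` output; the 80% default is an assumed ceiling, not a documented limit.

```shell
# Flag filesystems above a usage percentage threshold (default 80, an assumed value).
# Reads `df -P`-style output on stdin, so it can also run against saved output.
disk_check() {
  awk -v t="${1:-80}" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > t) print "WARN:", $6, $5 "%" }'
}

# Usage on the head node:
#   df -P | disk_check 80
```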
Finalize Head Node Setup
[ ] Bond priority and gateway metrics configured correctly for network reachability
[ ] internalnet gateway metric set to 5
[ ] ipminet0 gateway metric set to 10
[ ] Partition external network changed from managementnet to internalnet
[ ] Head node rebooted after partition changes
[ ] fsexports configured for all management networks
[ ] Node installer directories exported on all provisioning networks
[ ] /home and /cm/shared exported on required networks
Network Configuration Validation#
This section provides a checklist for the network definitions and settings that are configured for the control plane and GB200/GB300 racks.
Network Definitions#
Core Networks Configured
[ ] internalnet - Control plane provisioning network
[ ] dgxnet - GB200/GB300 compute node network(s)
[ ] ipminet - Out-of-band management network(s)
[ ] computenet - InfiniBand/East-West network
[ ] storagenet - Storage/converged Ethernet network
[ ] globalnet - Global network settings
Network Settings Verification
[ ] All networks have correct IP ranges and CIDR assignments
[ ] DHCP ranges configured appropriately for each network
[ ] Gateway settings configured correctly
[ ] Network booting enabled for provisioning networks
[ ] Management allowed settings configured appropriately
[ ] MTU settings configured (9000 for high-speed networks)
Head Node Network Interfaces
[ ] Provisioning bond (bond0) configured and operational
[ ] IPMI network bond (bond1) configured for OOB access
[ ] All physical interfaces properly assigned to bonds
[ ] Link aggregation (LACP mode 4) functioning correctly
[ ] Network interface naming conventions followed
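The LACP item above can be verified against the kernel's bonding status file. This is a minimal sketch assuming the standard `/proc/net/bonding/<bond>` format; `bond0` is the bond name used in this guide.

```shell
# Print the bonding mode of a bond interface; LACP (mode 4) reports as
# "IEEE 802.3ad Dynamic link aggregation" in the kernel bonding status file.
bond_mode() {
  awk -F': ' '/^Bonding Mode/ { print $2 }' "$1"
}

# Usage on the head node:
#   bond_mode /proc/net/bonding/bond0
```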
Software Image and Category Validation#
This section provides a checklist for the software images and categories that are created and assigned to the nodes.
Mixed Architecture Setup#
Image Availability Verification
[ ] Default images available for all required architectures
[ ] ARM/aarch64 images properly configured
[ ] x86_64 images properly configured
[ ] Node-installer images for cross-architecture support
[ ] CM-shared images for all architectures
[ ] DGX OS 7 image available for GB200/GB300 nodes
Category Configuration
[ ] slogin category configured with appropriate software image
[ ] k8s-system-admin category configured (x86 required for NMX-M)
[ ] k8s-system-user category configured with appropriate software image
[ ] dgx-gb200 or dgx-gb300 category configured with DGX OS 7
[ ] All categories have correct network assignments
[ ] BMC settings configured at category level
[ ] Boot and installation options configured
[ ] disksetup XML files created and assigned to each category
[ ] Bootloader set to GRUB for all ARM/aarch64 categories
[ ] GB200/GB300 category firmware management mode set to “gb200”
[ ] GB200/GB300 category BMC credentials configured (username, password, privilege)
[ ] Management network set correctly for each category (internalnet or dgxnet)
Software Image Customization
[ ] Custom packages installed in appropriate images using cm-chroot-sw-image
[ ] GPU drivers and CUDA libraries in GB200/GB300 images
[ ] Kubernetes tools in k8s categories
[ ] Development tools in slogin images
[ ] All image customizations documented and tested
GB200/GB300 Image SBOM Compliance
[ ] DGX OS 7 image created or imported successfully
[ ] GB200/GB300 images meet current SBOM requirements (see Appendix section Check the DGX OS packages to meet current SBOM)
[ ] DOCA stack version matches SBOM specification
[ ] DCGM version validated against SBOM
[ ] NVSM version validated against SBOM
[ ] NVIDIA driver version matches SBOM
[ ] NVIDIA Fabric Manager version validated
[ ] NVIDIA-IMEX version validated
[ ] Kernel parameters configured for GB200/GB300 (nouveau.modeset=0, iommu.passthrough=1, etc.)
[ ] NVMe multipath disabled in kernel parameters (nvme_core.multipath=n)
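The kernel parameter items above can be checked in one pass. The helper below is a minimal sketch that tests a kernel command line string for the parameters named in this checklist; run it against `/proc/cmdline` on a booted tray or against the category's configured kernel parameters.

```shell
# Check that the GB200/GB300 kernel parameters from this checklist are present in
# a kernel command line string; returns nonzero if any parameter is missing.
check_kparams() {
  local cmdline=" $1 " missing=0 p
  for p in nouveau.modeset=0 iommu.passthrough=1 nvme_core.multipath=n; do
    case "$cmdline" in
      *" $p "*) echo "OK   $p" ;;
      *)        echo "MISS $p"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Usage on a provisioned compute tray:
#   check_kparams "$(cat /proc/cmdline)"
```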
Control Plane Node Configuration Validation#
This section provides a checklist for the configuration of the control plane nodes: the head node, SLURM login nodes, Kubernetes system user space nodes, and Kubernetes system admin space nodes.
Head Node Configuration#
Primary Head Node
[ ] BMC interface (rf0/ipmi0) configured and accessible
[ ] Provisioning bond (bond0) operational with correct IP
[ ] IPMI network bond (bond1) configured for OOB access
[ ] All MAC addresses correctly assigned
[ ] Power control functional via BMC
[ ] BMC credentials properly configured
Secondary Head Node (if HA configured)
[ ] Secondary head node defined in BCM
[ ] Network interfaces properly configured
[ ] Cloned from primary head node successfully
[ ] Ready for HA failover setup
SLURM Login Nodes#
Golden Node Configuration
[ ] Physical node created with slogin category
[ ] ARM/C2 architecture confirmed
[ ] BMC interface (rf0) configured and accessible
[ ] Provisioning bond operational on internalnet
[ ] Storage network interfaces configured (/31 IPs)
[ ] MAC addresses correctly assigned to all interfaces
Node Cloning and Scaling
[ ] Additional slogin nodes cloned successfully
[ ] Hostname conventions followed: <RACK>-<RU>-P[1-16]-SLOGIN-0[1-X]
[ ] IP addresses incremented correctly
[ ] All cloned nodes have unique MAC addresses
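To check the hostname convention above, the expected names can be generated and diffed against the BCM device list. This is a sketch; the RACK, RU, and pod values in the usage line are placeholders.

```shell
# Generate the expected slogin hostnames (<RACK>-<RU>-P<pod>-SLOGIN-0<n>) for
# comparison with `cmsh -c "device; list"` output.
gen_slogin_names() {  # $1=rack  $2=ru  $3=pod  $4=count
  local i
  for i in $(seq 1 "$4"); do
    printf '%s-%s-P%d-SLOGIN-%02d\n' "$1" "$2" "$3" "$i"
  done
}

# Usage (placeholder values):
#   gen_slogin_names R01 U10 1 4
```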
K8s-system-admin Nodes#
Configuration Requirements
[ ] Physical nodes created with k8s-system-admin category
[ ] x86 architecture confirmed (required for NMX-M compatibility)
[ ] Four network interfaces configured per node
[ ] Provisioning bond (bond0) operational
[ ] NVLink COMe network bond (bond1) configured
[ ] Total of 3 nodes configured (odd number for quorum)
K8s-system-user Nodes#
User Space Configuration
[ ] Physical nodes created with k8s-system-user category
[ ] Architecture documented (ARM or x86 supported)
[ ] Provisioning bond operational
[ ] Storage network interfaces configured
[ ] Total of 3 nodes configured (odd number for quorum)
Network Verification for All Control Plane Nodes
[ ] All interfaces show “always” for “Start if”
[ ] Bond interfaces show correct member interfaces
[ ] IP addresses match planning documentation
[ ] BMC connectivity verified for all nodes
[ ] Power control functional for all nodes
UFM Pre-Setup Validation#
This section provides a checklist for the UFM pre-setup validation that is completed on the head node.
UFM Appliance Configuration
[ ] UFM iDRAC MAC addresses documented
[ ] UFM iDRAC entries added to /etc/dhcpd.ipminet<X>.include.conf
[ ] UFM iDRAC IP addresses assigned correctly
[ ] UFM appliances accessible using iDRAC credentials
[ ] UFM hostname conventions followed: <RACK>-P<POD_NUMBER>-UFM-01
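Adding a UFM iDRAC reservation is a plain dhcpd host entry appended to the pod's ipminet include file. The helper below is a sketch; the host label, MAC, and IP in the usage line are placeholders, and the include file name must match the actual ipminet network.

```shell
# Append a dhcpd host reservation for a UFM iDRAC to an include file.
add_ufm_idrac() {  # $1=include file  $2=host label  $3=MAC  $4=IP
  cat >> "$1" <<EOF
host $2 {
  hardware ethernet $3;
  fixed-address $4;
}
EOF
}

# Usage on the head node (all values are examples):
#   add_ufm_idrac /etc/dhcpd.ipminet0.include.conf R01-P1-UFM-01 aa:bb:cc:dd:ee:ff 10.148.0.50
```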
Note
UFM configuration and setup should be completed after the cluster is operational in BCM. For more information, see the UFM Enterprise Appliance Software User Manual.
GB200/GB300 Rack Configuration Validation#
This section provides a checklist for the GB200/GB300 rack configuration, covering the compute trays, NVLink switches, and power shelves.
Rack Import Verification#
Automated Import Process (if used)
[ ] Rack inventory file obtained from factory
[ ] Point-to-Point (P2P) documentation available
[ ] bcm-netautogen tool executed successfully
[ ] All .json files generated for rack components
[ ] bcm-post-install automation completed successfully
[ ] All devices imported into BCM without errors
Manual Import Process (if used)
[ ] Individual .json files created for all components
[ ] 18 GB200/GB300 compute tray entries
[ ] 9 NVLink switch tray entries
[ ] 8 power shelf entries
[ ] All .json files imported successfully
[ ] No import errors in BCM logs
Device Inventory Validation
- [ ] Total device count matches expected:
[ ] 18 GB200/GB300 compute trays per rack
[ ] 9 NVLink switches per rack
[ ] 8 power shelves per rack
[ ] All devices appear in BCM device list
[ ] All devices show proper status in BCM
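The count items above can be checked against a saved device list. The helper below is a sketch; the name patterns used at the call site are examples and must be adjusted to the site naming convention.

```shell
# Compare device counts in a saved device list against expected per-rack totals.
# Reads the list (e.g. output of `cmsh -c "device; list"`) on stdin.
expect_count() {  # $1 = grep -E pattern, $2 = expected count
  local n
  n=$(grep -c -E "$1" || true)
  if [ "$n" -eq "$2" ]; then
    echo "OK   $1 = $n"
  else
    echo "FAIL $1 = $n (want $2)"
  fi
}

# Usage (pattern is an example):
#   cmsh -c "device; list" | expect_count 'NVSW' 9
```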
Naming Convention Compliance
[ ] GB200/GB300 compute trays: <RACK>-<RU>-P[1-16]-<ROLE>-0[1-8]-C0[1-18]
[ ] NVLink switches: <RACK>-<RU>-P[1-16]-<SWITCH_ROLE>-0[1-9]
[ ] Power shelves: proper naming convention followed
[ ] All host names consistent with site naming conventions
GB200/GB300 Compute Tray Configuration#
Golden Node Configuration (GB200)
[ ] Physical node created with dgx-gb200 category
[ ] BMC interface (rf0) configured and accessible
- [ ] Network interfaces properly configured:
[ ] M1 and M2 (Bluefield management ports)
[ ] S1 and S2 (storage network ports)
[ ] InfiniBand interfaces (if configured)
[ ] Provisioning bond (bond0) operational (M1 + M2)
[ ] System MAC set for initial boot
Golden Node Configuration (GB300)
[ ] Physical node created with dgx-gb300 category
[ ] BMC interface (rf0) configured and accessible
- [ ] Network interfaces properly configured:
[ ] M1 (Bluefield management port)
[ ] S1 (storage network port)
[ ] Provisioning interface (enP22p3s0f0np0) operational (M1 only)
[ ] System MAC set for initial boot
Node Cloning Verification
[ ] 18 compute tray entries cloned from golden node
[ ] All cloned nodes have incremental IPs
[ ] All cloned nodes have unique MAC addresses
[ ] Rack positions set correctly for all nodes
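The incremental-IP item above can be cross-checked by generating the expected addresses from the golden node's base IP. This sketch uses a placeholder base address and assumes all clones stay within the last octet (no rollover handling).

```shell
# Generate the incremental IPs expected for cloned compute trays from a base
# IPv4 address.
next_ips() {  # $1 = base IPv4 address, $2 = count
  local a b c d i
  IFS=. read -r a b c d <<EOF
$1
EOF
  for i in $(seq 0 $(( $2 - 1 ))); do
    echo "$a.$b.$c.$(( d + i ))"
  done
}

# Usage (placeholder base address):
#   next_ips 10.150.1.10 18
```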
NVLink Switch Configuration#
Basic Switch Configuration
[ ] Switch entries created in BCM with proper hostnames
[ ] MAC addresses configured correctly
[ ] cm-lite-daemon enabled (hasclientdaemon = yes)
[ ] Switch kind set to nvlink
[ ] SNMP disabled (disablesnmp = yes)
Network Interface Configuration
[ ] eth0/COMe0 interface configured with correct MAC
[ ] Management network IP assigned
[ ] SSH access configured and tested
[ ] REST port set to 443
ZTP Configuration (if applicable)
[ ] ZTP settings configured with proper templates
[ ] API enabled on switches
[ ] Switch-specific directories created: /cm/local/apps/cmd/etc/htdocs/switch/<switch_name>/
[ ] NVOS firmware images copied to: /cm/local/apps/cmd/etc/htdocs/switch/image/
[ ] Startup configuration files copied to switch-specific directories
[ ] Startup configuration modified with hashed password field
[ ] Configuration mode set to “file”
[ ] Startup YAML file specified correctly
[ ] JSON template configured
[ ] Image name specified for updates
[ ] “checkimageonboot” enabled
[ ] “installlitedaemon” set to “yes”
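The ZTP directory items above can be staged with a small helper. The directory paths come from this guide; the switch hostnames in the usage line are examples.

```shell
# Create the image directory and per-switch ZTP directories under the CMDaemon
# htdocs switch path used in this guide.
stage_ztp_dirs() {  # $1 = base switch htdocs dir, remaining args = switch hostnames
  local base="$1" sw
  shift
  mkdir -p "$base/image"
  for sw in "$@"; do
    mkdir -p "$base/$sw"
  done
}

# Usage on the head node (switch hostnames are examples):
#   stage_ztp_dirs /cm/local/apps/cmd/etc/htdocs/switch R01-U20-P1-NVSW-01 R01-U20-P1-NVSW-02
```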
Switch Cloning and Registration
[ ] 9 NVLink switch entries cloned from golden switch
[ ] All switches have incremental IPs
[ ] All switches have unique MAC addresses
[ ] cm-lite-daemon installed on all switches
[ ] All switches show “UP” status in BCM
High Availability Configuration Validation#
This section provides a checklist for the High Availability configuration that is completed on the head nodes, including validation of NFS and shared storage.
HA Setup Verification#
Primary Head Node HA Configuration
[ ] cmha-setup wizard completed successfully
[ ] Virtual IP (VIP) addresses configured for internal and external networks
[ ] Secondary head node entry created in BCM
[ ] Failover network configuration completed
[ ] License updated with MACs from both head nodes
Secondary Head Node Deployment
[ ] Secondary head node PXE booted successfully
[ ] Rescue environment accessed
[ ] cm-clone-install --failover completed successfully
[ ] Secondary head node rebooted and accessible
[ ] Database synchronization completed
HA Functionality Testing
[ ] cmha status shows both nodes operational
[ ] Manual failover testing successful (cmha makeactive)
[ ] Both directions of failover tested
[ ] Primary and secondary head nodes function correctly as active
- [ ] All HA tests show [ OK ] status:
[ ] mysql [ OK ]
[ ] ping [ OK ]
[ ] status [ OK ]
[ ] Compute nodes powered off during HA configuration (required)
[ ] License updated with MAC addresses from both head nodes
[ ] Both head nodes can reach all IPMI networks through bond1
Rack Power-On and Provisioning Validation#
This section provides a checklist for rack power-on and provisioning of the GB200/GB300 compute trays, and for validation of the NVLink switch trays and power shelves.
GB200/GB300 Compute Tray Bring-Up#
Power Control Verification
[ ] OOB power control configured for all compute trays
[ ] Power status commands return proper responses
[ ] BMC connectivity verified for all trays
[ ] Power control settings configured (rf0 preferred)
[ ] Power reset delay configured (5s recommended)
Provisioning Process
[ ] Test node powered on and provisioned successfully
[ ] Node installer logs reviewed for errors
[ ] System logs reviewed for issues
[ ] Node appears in “UP” state in BCM
[ ] All compute trays provisioned successfully
[ ] Network bonds operational on all nodes
[ ] North-south networking connectivity verified
Network Interface Validation
[ ] All Bluefield and CX-7 interfaces operational
[ ] Bond0 interfaces up and configured
[ ] Storage network interfaces configured
[ ] InfiniBand interfaces functional (if configured)
[ ] Link status verified for all connections
NVLink Switch Bring-Up#
Basic Connectivity
[ ] SSH access to all NVLink switches verified
[ ] admin user access functional with correct passwords
[ ] COMe0 and COMe1 network interfaces operational
[ ] BMC access verified for all switches
[ ] Switch status shows “UP” in BCM
cm-lite-daemon Installation
[ ] cm-lite-daemon downloaded and prepared
[ ] Installation completed on all switches
[ ] Registration successful with BCM
[ ] Switch monitoring and health data functional
[ ] No installation errors in logs
NMX-C Configuration
[ ] NMX-C leader configured (typically NVSW-01)
[ ] Cluster apps enabled on master switch
[ ] /fm_config.cfg file generated and installed
[ ] NMX-C controller status shows “ok”
[ ] Fabric manager configuration operational
NMX-Telemetry Configuration
[ ] Telemetry service started successfully
[ ] nmx-telemetry status shows “ok”
[ ] Telemetry data collection functional
[ ] Monitoring integration operational
Power Shelf Validation#
Power Infrastructure
[ ] All power shelves reporting as operational
[ ] Power shelf status verified in BCM
[ ] Environmental monitoring functional (if applicable)
[ ] Power capacity adequate for full rack operation
Firmware Update Validation#
This section provides a checklist for the firmware update validation that is completed on the GB200/GB300 compute trays, NVLink switch tray, and power shelf.
Compute Tray Firmware#
Compute tray firmware checklist items are covered in the subsections below.
Firmware Management Setup#
[ ] Firmware packages placed in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
[ ] BMC firmware management mode set to gb200 at category level
[ ] BMC username admin enabled and credentials configured
[ ] Current firmware versions documented using firmware status
[ ] Target firmware versions identified from SBOM
- [ ] Firmware packages available:
[ ] Compute BMC bundle (nvfw_DGX-GBX00_*.fwpkg)
[ ] Compute HMC bundle (nvfw_HGX-GBX00_*.fwpkg)
[ ] BlueField 3 firmware (fw-BlueField-3-rel-*.bin)
[ ] ConnectX-7/8 firmware (fw-ConnectX7-rel-*.bin or fw-ConnectX8-rel-*.bin)
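The package-availability items above can be checked with a directory sweep. The sketch below matches the bundle name patterns listed in this checklist against the gb200 firmware drop directory; the ConnectX pattern covers CX-7 only and should be extended for CX-8 sites.

```shell
# Report which of the expected compute firmware packages are present in the
# firmware drop directory.
check_fw_pkgs() {  # $1 = firmware directory
  local dir="$1" pat
  for pat in 'nvfw_DGX-GBX00_*.fwpkg' 'nvfw_HGX-GBX00_*.fwpkg' \
             'fw-BlueField-3-rel-*.bin' 'fw-ConnectX7-rel-*.bin'; do
    if ls "$dir"/$pat > /dev/null 2>&1; then
      echo "FOUND   $pat"
    else
      echo "MISSING $pat"
    fi
  done
}

# Usage on the head node (path from this guide):
#   check_fw_pkgs /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200
```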
Firmware Update Process
[ ] Dry-run performed before actual updates (firmware flash --dry-run)
If the dry-run fails, typical causes include:
Missing or improperly named firmware packages in the required directory.
Firmware management mode not set to gb200 at the category level.
BMC credentials not configured correctly, preventing access to nodes.
Mismatches between the firmware bundle and the hardware revision.
[ ] BMC firmware (DGX-GBX00) updated first
If the update fails at this step, it is often due to:
Network communication failure to BMC
Outdated or incompatible firmware bundle used
Node is powered off or unreachable
[ ] AC power cycle performed after BMC update
Errors may arise if the AC power cycle is not completed, leaving the BMC in an inconsistent state.
[ ] Compute tray firmware (HGX-GBX00) updated second
Failure here typically means the BMC update was not successful or the compute tray firmware bundle does not match the system hardware.
[ ] AC power cycle performed after compute tray update
[ ] Firmware update status monitored to completion
Errors are generated in monitoring if previous updates failed or node connectivity is lost.
[ ] New firmware versions verified and activated
Errors may indicate that the versions have not changed due to failed flashing or skipped AC cycles.
Component Verification
[ ] BMC firmware version validated
[ ] GPU firmware validated (all GPUs)
[ ] CPU firmware validated (Grace CPUs)
[ ] FPGA firmware validated
[ ] ERoT (Embedded Root of Trust) firmware validated
[ ] CPLD firmware validated
[ ] NIC firmware validated (CX-7, CX-8, BF3)
[ ] UEFI/BIOS version validated
Errors in validation steps above are often due to version mismatches, update failures, or skipped reboots and/or power cycles.
Update Verification
[ ] All components show expected firmware versions per SBOM
If not, double-check SBOM target versions, hardware compatibility, and confirm each update step completed successfully.
[ ] No failed firmware update attempts in logs
Common failures in logs include permission errors, connectivity issues, or improper firmware file selection.
[ ] System stability verified post-update
[ ] All compute trays show consistent firmware versions
[ ] Performance validated after firmware updates
NVLink Switch Firmware#
Switch Firmware Management
[ ] Firmware packages placed in /cm/local/apps/cmd/etc/htdocs/bios/firmware/gb200sw
[ ] Switch firmware management mode set to gb200sw at device level
[ ] Current firmware versions documented
- [ ] Firmware packages available:
[ ] Switch BMC bundle (nvfw_GB200-P4978_0004*.fwpkg)
[ ] Switch BIOS bundle (nvfw_GB200-P4978_0006*.fwpkg)
[ ] Switch CPLD bundle (nvfw_GB200-P4978_0007*.fwpkg)
[ ] NVOS image (nvos-amd64-*.bin)
Firmware Update Sequence
[ ] Dry-run performed before actual updates
[ ] BMC+FPGA+ERoT firmware (0004 bundle) updated first
[ ] Switch rebooted after BMC update
[ ] CPLD1-4 firmware (0007 bundle) updated second
[ ] Switch rebooted after CPLD update
[ ] SBIOS+ERoT firmware (0006 bundle) updated third
[ ] Switch rebooted after BIOS update
[ ] NVOS software updated (manually or using ZTP)
[ ] All firmware updates verified with firmware status
Update Validation
[ ] All switch firmware components at target versions per SBOM
[ ] Switch functionality verified post-update
[ ] cm-lite-daemon operational after updates
[ ] NMX-C and NMX-T services functional
[ ] Switch accessible via SSH post-update
[ ] All 9 switches updated consistently
BlueField and CX7/CX8 Firmware#
Network Card Firmware
[ ] MFT tools installed and functional
[ ] Current firmware versions identified
[ ] Firmware updates applied successfully
[ ] Network interfaces functional post-update
[ ] Performance validated after updates
Power Shelf Firmware#
Power Management Firmware
- [ ] Firmware packages available:
[ ] PMC firmware (“common-pmc-3.*.tar”)
[ ] PSU firmware (“NVIDIA_5500_APP_*.tar”)
[ ] PMC firmware updated to latest version per SBOM
[ ] PSU firmware updated successfully
[ ] Power shelf functionality verified
[ ] No power interruptions during updates
[ ] All 8 power shelves per rack updated consistently
[ ] Power shelf status reporting operational in BCM
System Integration and Final Validation#
This section provides a checklist for the system integration and final validation that is completed for network connectivity, compute workload, control plane functionality, monitoring and management, and security and access control.
Network Connectivity#
End-to-End Network Validation
[ ] All management network connectivity verified
[ ] Provisioning network fully operational
[ ] Storage network performance validated
[ ] InfiniBand fabric connectivity confirmed
[ ] Inter-rack connectivity verified (if multiple racks)
Network Performance Testing
[ ] Bandwidth testing completed on all networks
[ ] Latency measurements within acceptable ranges
[ ] No network errors or dropped packets
[ ] Load balancing functional across bonds
[ ] Failover testing completed for redundant paths
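The latency items above can be quantified with ordinary `ping`; the helper below pulls the average round-trip time out of the summary line for logging or threshold checks.

```shell
# Extract the average round-trip time in ms from a `ping` summary line; handles
# both the Linux "rtt" and BSD "round-trip" summary formats.
avg_rtt() {
  awk -F'/' '/rtt|round-trip/ { print $5 }'
}

# Usage (gateway address is a placeholder):
#   ping -c 10 <gateway-ip> | avg_rtt
```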
Compute Workload Validation#
System Readiness for Workloads
[ ] All GB200/GB300 compute trays accessible and operational
[ ] GPU functionality verified on all nodes
[ ] CUDA and driver installations validated
[ ] NVLink fabric operational across all nodes
[ ] Memory and storage accessible on all nodes
Control Plane Functionality
[ ] SLURM login nodes accessible and functional
[ ] Kubernetes admin nodes operational
[ ] Kubernetes user space nodes functional
[ ] Workload orchestration systems ready
Monitoring and Management#
System Monitoring
[ ] All devices reporting health status to BCM
[ ] Environmental monitoring functional
[ ] Performance monitoring operational
[ ] Log aggregation functional
[ ] Alerting systems configured
Management Interface Validation
[ ] BCM web interface accessible and functional
[ ] Command line management (cmsh) operational
[ ] Remote management capabilities verified
[ ] Backup and recovery procedures tested
Security and Access Control#
Access Control Validation
[ ] User authentication systems functional
[ ] Network security policies enforced
[ ] BMC access restricted and monitored
[ ] SSH access controlled and logged
[ ] Service accounts properly configured
Security Hardening
[ ] Unnecessary services disabled
[ ] Firewall rules configured appropriately
[ ] Security patches applied
[ ] Vulnerability scanning completed
Documentation and Handover#
This section provides a checklist for the documentation and handover of the system.
System Documentation#
Configuration Documentation
[ ] All IP address assignments documented
[ ] Network topology documented with diagrams
[ ] Hardware inventory and serial numbers recorded
[ ] Configuration files backed up and documented
[ ] Custom configurations and modifications noted
Operational Documentation
[ ] Standard operating procedures documented
[ ] Troubleshooting guides prepared
[ ] Emergency procedures documented
[ ] Contact information for support escalation
[ ] System administrator training completed
Change Management
[ ] Change control procedures established
[ ] Backup and recovery procedures validated
[ ] Update and maintenance schedules defined
[ ] System lifecycle management plan created
Performance Validation#
This section provides a checklist for the performance validation of the system.
System Performance Benchmarks#
Compute Performance
[ ] GPU performance benchmarks completed
[ ] CPU performance validated
[ ] Memory bandwidth tested
[ ] Storage performance validated
[ ] Network throughput measured
System Load Testing
[ ] Full system load testing completed
[ ] Stress testing passed on all components
[ ] Thermal management validated under load
[ ] Power consumption measurements recorded
[ ] System stability verified over extended periods
Final System Acceptance#
This section provides a checklist for the final acceptance of the system.
Deployment Sign-Off#
Technical Validation Complete
[ ] All technical validation items completed
[ ] Performance requirements met
[ ] Functional requirements satisfied
[ ] Non-functional requirements validated
[ ] System ready for production workloads
Stakeholder Approval
[ ] System administrator sign-off obtained
[ ] Operations team acceptance confirmed
[ ] End user validation completed
[ ] Management approval received
[ ] System officially handed over for production use
Post-Deployment Activities
[ ] Monitoring baselines established
[ ] Operational procedures transferred
[ ] Support escalation procedures activated
[ ] Training completion verified
[ ] System transition to production support complete
Commands for Quick Final Validation#
This section provides commands for rapid final system validation. Use these commands to verify system status and configuration.
System Status Overview#
# Check overall cluster status
cmsh -c "device; list"
cmsh -c "device; list -c dgx-gb200"
cmsh -c "device; list -c dgx-gb300"
cmsh -c "device; list -t switch | grep nvsw"
# Verify HA status (if configured)
cmha status
# Check all rack devices
cmsh -c "rack; list"
cmsh -c "rack; use <rack_name>; rackoverview"
Category and Software Image Verification#
# List all categories
cmsh -c "category; list"
# Check category configurations
cmsh -c "category; use slogin; show"
cmsh -c "category; use k8s-system-admin; show"
cmsh -c "category; use k8s-system-user; show"
cmsh -c "category; use dgx-gb200; show"
# List all software images
cmsh -c "softwareimage; list"
# Check image architecture
file /cm/images/<image-name>/bin/bash
Network Configuration Validation#
# List all networks
cmsh -c "network; list"
# Check specific network settings
cmsh -c "network; use internalnet; show"
cmsh -c "network; use dgxnet1; show"
cmsh -c "network; use ipminet0; show"
# Verify fsexports
cmsh -c "device; use master; fsexports; list"
# Check bond priority/gateway metrics
cmsh -c "network; use internalnet; show" | grep -i gateway
cmsh -c "network; use ipminet0; show" | grep -i gateway
Node Interface Validation#
# Verify interface status for specific node
cmsh -c "device; use <node-name>; interfaces; list"
# Check bond configuration
cmsh -c "device; use <node-name>; interfaces; use bond0; show"
# Verify all GB200/GB300 interfaces
cmsh -c "device; foreach -c dgx-gb200 (interfaces; list)"
Network Connectivity Validation#
# Ping gateway from head node
ping -c 3 <internalnet-gateway>
ping -c 3 <dgxnet-gateway>
ping -c 3 <ipminet-gateway>
# Test SSH to compute node
ssh <compute-node-hostname>
# Test SSH to NVLink switch
ssh admin@<nvlink-switch-hostname>
# Check network reachability from compute nodes
cmsh -c "device; use <compute-node>; ping <gateway-ip>"
Power and BMC Validation#
# Check power status for category
cmsh -c "device; power status -c dgx-gb200"
cmsh -c "device; power status -c dgx-gb300"
# Check power status by rack
cmsh -c "device; power status -r <rack-number>"
# Check individual node power status
cmsh -c "device; use <node-name>; power status"
# Verify BMC settings
cmsh -c "device; use <node-name>; bmcsettings; show"
cmsh -c "category; use dgx-gb200; bmcsettings; show"
Firmware Validation#
# List available firmware packages
cmsh -c "device; firmware info"
# Check current firmware versions for node
cmsh -c "device; use <node-name>; firmware status"
# Check firmware for all compute trays
cmsh -c "device; firmware status -c dgx-gb200"
# Check firmware for all switches
cmsh -c "device; firmware status -t switch | grep nvsw"
# Monitor firmware update progress
cmsh -c "device; firmware status -n <node-name>"
Health and Monitoring Validation#
# Check device health data
cmsh -c "device; use <switch-name>; latesthealthdata"
cmsh -c "device; foreach -c dgx-gb200 (latesthealthdata)"
# Monitor system logs in real-time
tail -f /var/log/cmdaemon
tail -f /var/log/syslog
tail -f /var/log/node-installer
# Check ZTP status on switches (if configured)
ssh admin@<switch-name> "sudo ztp status"
# Verify cm-lite-daemon status
cmsh -c "device; use <switch-name>; ssh"
# Then: systemctl status cm-lite-daemon
High Availability Validation#
# Check HA status
cmha status
# Verify failover configuration
cmsh -c "partition; show" | grep -i failover
# Check VIP assignments
ip addr show bond0 | grep -i ha
ip addr show bond1 | grep -i ha
Provisioning Validation#
# Check node installer logs during provisioning
tail -f /var/log/node-installer
# Verify nodes are UP
cmsh -c "device; list" | grep -i up
# Check provisioning status
cmsh -c "device; use <node-name>; show" | grep -i status
Critical Success Criteria#
The deployment is considered successful when all of the following criteria are met:
Infrastructure Criteria#
[ ] All physical hardware operational and accessible
[ ] All network connectivity functional with expected performance
[ ] All power and cooling systems operational within specifications
[ ] All firmware at required versions with no critical vulnerabilities
Software Criteria#
[ ] BCM 11 operational with all required licenses
[ ] All compute nodes provisioned and accessible
[ ] All control plane services functional
[ ] High availability functional (if configured)
Integration Criteria#
[ ] End-to-end system integration validated
[ ] All monitoring and management systems operational
[ ] Security policies enforced and validated
[ ] Performance meets or exceeds requirements
Operational Criteria#
[ ] System ready for production workloads
[ ] Operations team trained and ready
[ ] Documentation complete and accessible
[ ] Support procedures established and tested
Note
This comprehensive validation checklist should be completed systematically. Any failed validation items must be addressed before considering the deployment complete. Document any deviations from standard configurations and ensure they are approved through proper change management processes. This checklist is not intended to be a comprehensive list of all possible validation items, but rather a guide to help ensure the system is properly configured and operational.