GB200 Rack Configuration Verification Checklist
Use this checklist to verify that all GB200 rack configurations were completed correctly after following any of the rack configuration processes. Work through it before proceeding to rack power-on and provisioning.
General Prerequisites
- [ ] Rack inventory file obtained from factory after L11 testing (for automated/manual import)
- [ ] Point-to-Point (P2P) documentation available with MAC addresses
- [ ] Site information and IP allocation plan documented
- [ ] GB200 and NVLink Switch categories configured in BCM
- [ ] Network subnets defined and available for device assignment
Automated Rack Import Process Verification
If you used the automated bcm-netautogen tool and bcm-post-install automation:
Prerequisites Check
- [ ] Rack inventory file from factory available
- [ ] Point-to-Point (P2P) file available
- [ ] siteinformation.yaml file properly configured
- [ ] bcm-netautogen tool available and configured
Process Verification
- [ ] bcm-netautogen tool executed successfully with all input files
- [ ] .json files generated for all rack components:
  - [ ] 18 GB200 compute tray .json files
  - [ ] 9 NVLink Switch tray .json files
  - [ ] 8 power shelf .json files
- [ ] bcm-post-install automation executed successfully
- [ ] All .json files imported into BCM without errors
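The file counts in this section add up to 35 .json files per rack (18 + 9 + 8). The following is a minimal sketch for checking that total before import. It assumes that all generated files for one rack sit directly in a single directory; the function names and that directory layout are illustrative assumptions, not part of bcm-netautogen.

```shell
# Count the *.json files directly inside a directory.
# Assumption: the generated files for one rack sit in a single directory.
count_json_files() {
    n=0
    for f in "$1"/*.json; do
        # The glob stays literal when nothing matches, so test for existence.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Compare the count with the 18 + 9 + 8 = 35 files expected for one rack.
check_json_total() {
    expected=$((18 + 9 + 8))
    actual=$(count_json_files "$1")
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual .json files"
    else
        echo "MISMATCH: expected $expected .json files, found $actual"
        return 1
    fi
}
```

Call `check_json_total <output_directory>` with the directory that holds the generated files. The check only confirms the total; it does not tell compute tray, switch, and power shelf files apart.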
Naming Convention Verification
- [ ] GB200 compute trays follow: <RACK>-<RU>-P[1-16]-<ROLE>-0[1-8]-C0[1-18]
- [ ] NVLink switches follow: <RACK>-<RU>-P[1-16]-<switch_role>-0[1-9]
- [ ] Power shelves follow proper naming convention
Manual Rack Import Process Verification
If you manually created and imported .json files:
File Creation Verification
- [ ] Rack inventory file reviewed (or MAC addresses collected manually)
- [ ] Individual .json files created for each component:
  - [ ] 18 GB200 compute tray .json files with proper MAC addresses
  - [ ] 9 NVLink Switch tray .json files with proper MAC addresses
  - [ ] 8 power shelf .json files
- [ ] All .json files follow proper BCM format and syntax
- [ ] IP addresses assigned correctly per network subnets
Import Verification
- [ ] All .json files successfully imported into BCM
- [ ] No import errors or warnings in BCM logs
- [ ] All devices appear in BCM device list
Naming Convention Verification
- [ ] All hostnames follow: <Rack Location>-<RU>-<POD Number>-<Tray Type>-<node number>
- [ ] Tray types properly designated: DGX, NVSW, or PWR
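The manual-import naming convention can be checked mechanically. The sketch below assumes that each of the five fields is a non-empty token containing no hyphen, which the convention above does not state explicitly; the function names and the sample hostname `A05-U20-P1-DGX-01` are hypothetical.

```shell
# Succeed if a hostname has five hyphen-separated fields and the fourth
# field (tray type) is DGX, NVSW, or PWR.
# Assumption: no individual field contains a hyphen.
check_hostname() {
    printf '%s\n' "$1" | grep -Eq '^[^-]+-[^-]+-[^-]+-(DGX|NVSW|PWR)-[^-]+$'
}

# Read hostnames from stdin (one per line) and print the ones that fail.
report_bad_hostnames() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        check_hostname "$name" || printf 'BAD: %s\n' "$name"
    done
}
```

For example, `check_hostname A05-U20-P1-DGX-01` succeeds, while a hostname with an unknown tray type or a missing field fails. The pattern checks structure only; it does not validate rack locations, rack units, or POD numbers against site records.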
Manual Addition of GB200 Rack Entries Verification
If you manually added GB200 compute trays using cmsh commands:
Rack Entry Verification
- [ ] Rack entry created with proper coordinates
- [ ] Rack number matches site documentation
GB200 Compute Tray Golden Node
- [ ] Physical node created with proper hostname
- [ ] GB200 category assigned correctly
- [ ] BMC interface configured (rf0 preferred, ipmi0 as fallback)
- [ ] BMC IP, network, and MAC address configured
Network Interface Configuration
- [ ] BlueField and CX-7 interfaces added:
  - [ ] M1 (enP6p3s0f0np0) with correct MAC
  - [ ] M2 (enP22p3s0f0np0) with correct MAC
  - [ ] S1 (enP6p3s0f1np1) with correct MAC and storage network
  - [ ] S2 (enP22p3s0f1np1) with correct MAC and storage network
Bond Configuration
- [ ] Bond0 created with M1 and M2 interfaces
- [ ] Bond mode 4 (LACP) configured
- [ ] Bond IP assigned on internal/management network
- [ ] Bond set as provisioning interface
InfiniBand Interfaces
- [ ] Four InfiniBand interfaces added if needed:
  - [ ] ibp3s0 with compute network assignment
  - [ ] ibP2p3s0 with compute network assignment
  - [ ] ibP16p3s0 with compute network assignment
  - [ ] ibP18p3s0 with compute network assignment
System Configuration
- [ ] System MAC set for initial boot (M1 or M2)
Node Cloning
- [ ] 18 compute tray entries cloned from golden node
- [ ] All cloned nodes have incremental IPs
- [ ] All cloned nodes have unique MAC addresses
- [ ] Rack positions set correctly for all nodes
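The two cloning checks (unique MAC addresses, incremental IPs) can be sketched as small text filters. The input formats here are assumptions for illustration: one "hostname mac" pair per line for the MAC check, and one IPv4 address per line in node order for the IP check. "Incremental" is interpreted as each address being exactly one higher than the previous one, which may not match every site's allocation plan.

```shell
# Print every MAC address that appears more than once.
# Assumed input on stdin: one "hostname mac" pair per line.
find_duplicate_macs() {
    awk 'NF >= 2 { print tolower($2) }' | sort | uniq -d
}

# Succeed only if each IPv4 address is exactly one higher than the previous.
# Assumed input on stdin: one address per line, in node order.
check_incremental_ips() {
    awk -F. '
        NF == 4 {
            cur = (($1 * 256 + $2) * 256 + $3) * 256 + $4
            if (seen && cur != prev + 1) { bad = 1; print "gap before " $0 }
            prev = cur; seen = 1
        }
        END { exit bad }
    '
}
```

Empty output from `find_duplicate_macs` means every MAC address in the input is unique. MAC addresses are lowercased before comparison, so the same address written in different letter case is still reported as a duplicate.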
Manual Addition of NVLink Switch Entries Verification
If you manually added NVLink switches using cmsh commands:
Basic Switch Entry Creation
Switch Configuration
- [ ] Switch entry created in BCM with proper hostname
- [ ] MAC address configured for switch
- [ ] cm-lite-daemon enabled (hasclientdaemon = yes)
- [ ] Switch kind set to nvlink
- [ ] SNMP disabled (disablesnmp = yes)
Network Interface Configuration
- [ ] eth0/COMe0 interface added with correct MAC
- [ ] eth0 IP address assigned on management network
- [ ] Management network properly configured
SSH Access Configuration
- [ ] Username and password configured for SSH access
- [ ] REST port set to 443
- [ ] SSH credentials tested and working
ZTP Settings Configuration (Optional)
If ZTP was configured for automatic NVOS updates:
ZTP Directory Structure
- [ ] ZTP settings configured with proper templates
- [ ] API enabled on switch
- [ ] initialize command executed successfully
- [ ] Switch-specific directory created: /cm/local/apps/cmd/etc/htdocs/switch/<switch_name>/
File Repository Setup
- [ ] NVOS firmware image files copied to: /cm/local/apps/cmd/etc/htdocs/switch/image/
- [ ] Startup configuration files copied to switch-specific directories
- [ ] Startup configuration modified with hashed password field
ZTP Parameters
- [ ] Configuration mode set to file
- [ ] Startup YAML file specified correctly
- [ ] JSON template configured
- [ ] Image name specified for updates
- [ ] checkimageonboot enabled
Switch Cloning
- [ ] 9 NVLink switch entries cloned from golden switch
- [ ] All cloned switches have incremental IPs
- [ ] All cloned switches have unique MAC addresses
- [ ] ZTP settings inherited by all cloned switches
ZTP Process Execution
- [ ] Switch restart/reset completed to initiate ZTP
- [ ] ZTP status monitored via cmdaemon logs
- [ ] ZTP completion verified with SUCCESS status
- [ ] CM Lite Daemon installed successfully on all switches
- [ ] Switch monitoring and health data validated
Final Rack Verification
Device Inventory Check
- [ ] Total device count in BCM matches expected:
  - [ ] 18 GB200 compute trays
  - [ ] 9 NVLink switches
  - [ ] 8 power shelves
- [ ] All devices show proper status in BCM
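The inventory check compares three observed counts against the expected 18 / 9 / 8. A minimal sketch of that comparison follows; the function name is hypothetical, and how the three counts are obtained from BCM is deliberately left open.

```shell
# Compare observed device counts with the expected rack inventory.
# Arguments: $1 = compute trays, $2 = NVLink switches, $3 = power shelves.
check_rack_counts() {
    status=0
    if [ "$1" -ne 18 ]; then
        echo "compute trays: expected 18, got $1"
        status=1
    fi
    if [ "$2" -ne 9 ]; then
        echo "NVLink switches: expected 9, got $2"
        status=1
    fi
    if [ "$3" -ne 8 ]; then
        echo "power shelves: expected 8, got $3"
        status=1
    fi
    return "$status"
}
```

`check_rack_counts 18 9 8` prints nothing and succeeds; any mismatch prints one line per device type and returns a non-zero status, so the function can gate a larger verification script.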
Network Configuration Validation
- [ ] All devices have correct IP assignments
- [ ] Network connectivity verified between management network and all devices
- [ ] BMC connectivity tested for all compute trays and switches
Rack Physical Layout
- [ ] All devices assigned to correct rack positions
- [ ] Rack layout matches physical hardware arrangement
- [ ] Device naming follows site conventions
Documentation and Naming
- [ ] All hostnames follow established naming conventions
- [ ] Device categories properly assigned
- [ ] MAC addresses documented and match physical hardware
- [ ] IP allocation documented in site records
Power and Environmental
- [ ] Power shelf configurations complete
- [ ] Environmental monitoring configured if applicable
- [ ] Rack power requirements documented
Ready for Next Steps
High Availability and Networking
- [ ] All rack components configured and ready
- [ ] Network infrastructure validated
- [ ] Power infrastructure validated
Provisioning Readiness
- [ ] All GB200 compute trays ready for provisioning
- [ ] All NVLink switches ready for configuration
- [ ] Rack management system operational
Commands for Quick Verification
Use these commands to quickly verify rack configurations:
List All Rack Devices
cmsh -c "device; list"
cmsh -c "device; list -c gb200"
cmsh -c "device; list -k nvlink"
Check Rack Layout
cmsh -c "rack; display <rack_number>"
cmsh -c "rack; list"
Verify Network Assignments
cmsh -c "device use <device_name>; interfaces; list"
cmsh -c "device use <device_name>; ping"
Check BMC Connectivity
cmsh -c "device use <device_name>; bmcsettings; show"
Verify Switch ZTP Status (if applicable)
cmsh -c "device; use <switch_name>; ssh"
sudo ztp status
Monitor Switch Health (if applicable)
cmsh -c "device; use <switch_name>; latesthealthdata"
Next Steps
Once all items in this checklist are verified:
- Proceed to High Availability if HA is required
- Begin GB200 Rack Power On and Bring Up for the rack power-on sequence
- Start the compute node provisioning process
Note
This checklist should be completed before proceeding to the rack bring-up phase. Any missing configurations should be addressed by returning to the appropriate configuration sections.