GB200 Rack Configuration Verification Checklist

Use this checklist to verify that the GB200 rack configuration is complete and correct after following any of the rack configuration processes. Work through it before proceeding to rack power-on and provisioning.

General Prerequisites

  • [ ] Rack inventory file obtained from factory after L11 testing (for automated/manual import)

  • [ ] Point-to-Point (P2P) documentation available with MAC addresses

  • [ ] Site information and IP allocation plan documented

  • [ ] GB200 and NVLink Switch categories configured in BCM

  • [ ] Network subnets defined and available for device assignment

Automated Rack Import Process Verification

If you used the automated bcm-netautogen tool and bcm-post-install automation:

Prerequisites Check

  • [ ] Rack inventory file from factory available

  • [ ] Point-to-Point (P2P) file available

  • [ ] siteinformation.yaml file properly configured

  • [ ] bcm-netautogen tool available and configured

Process Verification

  • [ ] bcm-netautogen tool executed successfully with all input files

  • [ ] .json files generated for all rack components:

      • [ ] 18 GB200 compute tray .json files

      • [ ] 9 NVLink Switch tray .json files

      • [ ] 8 power shelf .json files

  • [ ] bcm-post-install automation executed successfully

  • [ ] All .json files imported into BCM without errors

Naming Convention Verification

  • [ ] GB200 compute trays follow: <RACK>-<RU>-P[1-16]-<ROLE>-0[1-8]-C0[1-18]

  • [ ] NVLink switches follow: <RACK>-<RU>-P[1-16]-<switch_role>-0[1-9]

  • [ ] Power shelves follow proper naming convention

Manual Rack Import Process Verification

If you manually created and imported .json files:

File Creation Verification

  • [ ] Rack inventory file reviewed (or MAC addresses collected manually)

  • [ ] Individual .json files created for each component:

      • [ ] 18 GB200 compute tray .json files with proper MAC addresses

      • [ ] 9 NVLink Switch tray .json files with proper MAC addresses

      • [ ] 8 power shelf .json files

  • [ ] All .json files follow proper BCM format and syntax

  • [ ] IP addresses assigned correctly per network subnets
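The syntax item in the list above can be spot-checked before import. The following is a minimal sketch that only confirms each file parses as JSON; it does not validate the BCM schema, which this checklist does not describe. The directory name `rack_json` and the function name are hypothetical.

```python
import json
from pathlib import Path


def find_invalid_json(directory):
    """Return {filename: error message} for every .json file that fails to parse."""
    failures = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            failures[path.name] = str(exc)
    return failures


if __name__ == "__main__":
    # "rack_json" is a hypothetical directory holding the files to import.
    for name, error in find_invalid_json("rack_json").items():
        print(f"{name}: {error}")
```

An empty result means every file is syntactically valid JSON; BCM may still reject a file whose fields are wrong, so the import logs remain the authoritative check.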

Import Verification

  • [ ] All .json files successfully imported into BCM

  • [ ] No import errors or warnings in BCM logs

  • [ ] All devices appear in BCM device list

Naming Convention Verification

  • [ ] All hostnames follow: <Rack Location>-<RU>-<POD Number>-<Tray Type>-<node number>

  • [ ] Tray types properly designated: DGX, NVSW, or PWR
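As an illustration, the five-field hostname convention above could be checked with a regular expression. This is a sketch under assumptions: the checklist does not specify the shape of each field, so the pattern assumes alphanumeric rack, RU, and POD fields and a numeric node number, and the sample hostname `A05-U10-P1-DGX-03` is made up.

```python
import re

TRAY_TYPES = ("DGX", "NVSW", "PWR")  # tray types named in the checklist

# Assumed field shapes (not specified in the checklist): rack, RU, and POD
# are alphanumeric without hyphens; the node number is digits only.
HOSTNAME_RE = re.compile(
    r"^(?P<rack>[A-Za-z0-9]+)-(?P<ru>[A-Za-z0-9]+)-(?P<pod>[A-Za-z0-9]+)-"
    r"(?P<tray>" + "|".join(TRAY_TYPES) + r")-(?P<node>\d+)$"
)


def parse_hostname(hostname):
    """Return the hostname fields as a dict, or None if it does not match."""
    match = HOSTNAME_RE.match(hostname)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical hostnames; the second uses an unknown tray type.
    for name in ["A05-U10-P1-DGX-03", "A05-U10-P1-GPU-03"]:
        print(name, "->", parse_hostname(name))
```

Tighten the field patterns to match the site's actual rack and POD labels before relying on the result.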

Manual Addition of GB200 Rack Entries Verification

If you manually added GB200 compute trays using cmsh commands:

Rack Entry Verification

  • [ ] Rack entry created with proper coordinates

  • [ ] Rack number matches site documentation

GB200 Compute Tray Golden Node

  • [ ] Physical node created with proper hostname

  • [ ] GB200 category assigned correctly

  • [ ] BMC interface configured (rf0 preferred, ipmi0 as fallback)

  • [ ] BMC IP, network, and MAC address configured

Network Interface Configuration

  • [ ] BlueField and CX-7 interfaces added:

      • [ ] M1 (enP6p3s0f0np0) with correct MAC

      • [ ] M2 (enP22p3s0f0np0) with correct MAC

      • [ ] S1 (enP6p3s0f1np1) with correct MAC and storage network

      • [ ] S2 (enP22p3s0f1np1) with correct MAC and storage network

Bond Configuration

  • [ ] Bond0 created with M1 and M2 interfaces

  • [ ] Bond mode 4 (LACP) configured

  • [ ] Bond IP assigned on internal/management network

  • [ ] Bond set as provisioning interface

InfiniBand Interfaces

  • [ ] Four InfiniBand interfaces added if needed:

      • [ ] ibp3s0 with compute network assignment

      • [ ] ibP2p3s0 with compute network assignment

      • [ ] ibP16p3s0 with compute network assignment

      • [ ] ibP18p3s0 with compute network assignment

System Configuration

  • [ ] System MAC set for initial boot (M1 or M2)

Node Cloning

  • [ ] 18 compute tray entries cloned from golden node

  • [ ] All cloned nodes have incremental IPs

  • [ ] All cloned nodes have unique MAC addresses

  • [ ] Rack positions set correctly for all nodes
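The node cloning checks could be scripted once the hostname, IP, and MAC of each clone have been collected. The sketch below assumes that "incremental" means each IP is exactly one higher than the previous node's, and that the records are supplied in clone order; how the records are gathered (for example from cmsh output) is left out, and the function name and sample values are hypothetical.

```python
import ipaddress


def check_clones(nodes, expected_count=18):
    """Check cloned nodes given as (hostname, ip, mac) tuples in clone order.

    Returns a list of human-readable problems; an empty list means all
    checks passed.
    """
    problems = []
    if len(nodes) != expected_count:
        problems.append(f"expected {expected_count} nodes, found {len(nodes)}")

    # Assumption: "incremental" means each IP is the previous IP plus one.
    ips = [ipaddress.ip_address(ip) for _, ip, _ in nodes]
    for (name, _, _), prev, cur in zip(nodes[1:], ips, ips[1:]):
        if int(cur) - int(prev) != 1:
            problems.append(f"{name}: IP {cur} does not follow {prev}")

    # MAC addresses are compared case-insensitively, with '-' treated as ':'.
    seen = {}
    for name, _, mac in nodes:
        key = mac.lower().replace("-", ":")
        if key in seen:
            problems.append(f"{name}: MAC {mac} duplicates {seen[key]}")
        else:
            seen[key] = name
    return problems


if __name__ == "__main__":
    # Hypothetical sample with three nodes, so expected_count is overridden.
    sample = [
        ("node01", "10.0.0.11", "AA:BB:CC:00:00:01"),
        ("node02", "10.0.0.12", "AA:BB:CC:00:00:02"),
        ("node03", "10.0.0.13", "AA:BB:CC:00:00:03"),
    ]
    print(check_clones(sample, expected_count=3) or "all clone checks passed")
```

If the site's IP plan uses a stride other than one, adjust the difference test accordingly.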

Final Rack Verification

Device Inventory Check

  • [ ] Total device count in BCM matches expected:

      • [ ] 18 GB200 compute trays

      • [ ] 9 NVLink switches

      • [ ] 8 power shelves

  • [ ] All devices show proper status in BCM
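The expected counts above lend themselves to a simple tally. The following sketch compares a list of per-device type labels against the 18/9/8 figures from this checklist. The label strings and function name are hypothetical, and mapping BCM device entries to these labels is left to the reader.

```python
from collections import Counter

# Expected per-rack counts taken from this checklist; the label strings
# themselves are hypothetical.
EXPECTED = {"GB200 compute tray": 18, "NVLink switch": 9, "power shelf": 8}


def inventory_mismatches(device_types, expected=EXPECTED):
    """Return {type: (expected, actual)} for every type whose count differs.

    device_types is an iterable with one type label per device. Types that
    are present but not expected are reported with an expected count of 0.
    """
    actual = Counter(device_types)
    kinds = list(expected) + [k for k in actual if k not in expected]
    return {
        kind: (expected.get(kind, 0), actual[kind])
        for kind in kinds
        if expected.get(kind, 0) != actual[kind]
    }


if __name__ == "__main__":
    # Hypothetical rack that is missing one power shelf.
    devices = (
        ["GB200 compute tray"] * 18 + ["NVLink switch"] * 9 + ["power shelf"] * 7
    )
    print(inventory_mismatches(devices))
```

An empty dictionary indicates that the counts match; device status still has to be confirmed in BCM.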

Network Configuration Validation

  • [ ] All devices have correct IP assignments

  • [ ] Network connectivity verified between management network and all devices

  • [ ] BMC connectivity tested for all compute trays and switches

Rack Physical Layout

  • [ ] All devices assigned to correct rack positions

  • [ ] Rack layout matches physical hardware arrangement

  • [ ] Device naming follows site conventions

Documentation and Naming

  • [ ] All hostnames follow established naming conventions

  • [ ] Device categories properly assigned

  • [ ] MAC addresses documented and match physical hardware

  • [ ] IP allocation documented in site records

Power and Environmental

  • [ ] Power shelf configurations complete

  • [ ] Environmental monitoring configured if applicable

  • [ ] Rack power requirements documented

Ready for Next Steps

High Availability and Networking

  • [ ] All rack components configured and ready

  • [ ] Network infrastructure validated

  • [ ] Power infrastructure validated

Provisioning Readiness

  • [ ] All GB200 compute trays ready for provisioning

  • [ ] All NVLink switches ready for configuration

  • [ ] Rack management system operational

Commands for Quick Verification

Use these commands to quickly verify rack configurations:

List All Rack Devices

cmsh -c "device; list"
cmsh -c "device; list -c gb200"
cmsh -c "device; list -k nvlink"

Check Rack Layout

cmsh -c "rack; display <rack_number>"
cmsh -c "rack; list"

Verify Network Assignments

cmsh -c "device use <device_name>; interfaces; list"
cmsh -c "device use <device_name>; ping"

Check BMC Connectivity

cmsh -c "device use <device_name>; bmcsettings; show"

Verify Switch ZTP Status (if applicable)

cmsh -c "device; use <switch_name>; ssh"
sudo ztp status

Monitor Switch Health (if applicable)

cmsh -c "device; use <switch_name>; latesthealthdata"
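When many devices need the same per-device checks, the cmsh invocations listed above could be generated rather than typed by hand. The sketch below only builds and prints the command lines; it reuses the cmsh command strings from this section unchanged, and the function name and device names are hypothetical.

```python
import shlex


def per_device_checks(device_name):
    """Build the per-device cmsh invocations from this section as argv lists."""
    sessions = [
        f"device use {device_name}; interfaces; list",
        f"device use {device_name}; ping",
        f"device use {device_name}; bmcsettings; show",
    ]
    return [["cmsh", "-c", session] for session in sessions]


if __name__ == "__main__":
    # Hypothetical device names; substitute the hostnames used at your site.
    for name in ["node001", "node002"]:
        for argv in per_device_checks(name):
            # Print a copy-pasteable command line instead of executing it.
            print(shlex.join(argv))
```

To execute the commands instead of printing them, each argv list could be passed to `subprocess.run` on a host where cmsh is installed.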

Next Steps

Once all items in this checklist are verified:

  1. Proceed to High Availability if HA is required

  2. Continue with GB200 Rack Power On and Bring Up for the rack power-on sequence

  3. Start the compute node provisioning process

Note

Complete this checklist before proceeding to the rack bring-up phase. Address any missing configuration by returning to the appropriate configuration section.