NVLink Switch Firmware Upgrade
The NVLink Switch firmware upgrade workflow safely upgrades firmware on NVIDIA NVLink switches (NVLink Switch NVL and future iterations) running NV-OS, using a vendor-supplied firmware bundle.
This workflow is a component of a full rack-level automation pipeline for upgrading NVLink domain GPU systems. These systems require a coordinated upgrade of all compute and NVLink shelves.
Workflow Stages
1. Get Current State
- Purpose: Query the device to get current OS and firmware versions
- Activities:
- get_network_device: Retrieves device data from Nautobot
- get_current_os: Gets the running OS version using the NVUE API
- get_running_firmware: Gets all firmware component versions through /nvue_v1/platform/firmware
- Validation: Checks if the device platform is supported (NV-OS only)
2. Compare Versions
- Purpose: Compare running versions against desired versions from config context
- Activities:
- compare_running_desired: Compares current vs desired firmware/OS versions
- Logic:
- Extracts desired versions from device config context using firmware_bundle_version
- Identifies which components need updates
- Determines if any upgrade is needed
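The comparison logic above can be sketched as a plain dictionary diff. This is a minimal illustration, not the workflow's actual compare_running_desired activity: the function names, the component names, and the flat component-to-version shape of both inputs are assumptions.

```python
from typing import Dict, List


def components_needing_update(running: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Return desired components whose running version differs or is missing."""
    return sorted(
        name for name, want in desired.items()
        if running.get(name) != want
    )


def upgrade_needed(running: Dict[str, str], desired: Dict[str, str]) -> bool:
    """An upgrade is needed as soon as one component is out of date."""
    return bool(components_needing_update(running, desired))


# Hypothetical component names and versions, for illustration only.
running = {"ASIC": "1.0.0", "BIOS": "2.1", "CPLD": "7"}
desired = {"ASIC": "1.1.0", "BIOS": "2.1", "CPLD": "7"}
print(components_needing_update(running, desired))  # → ['ASIC']
print(upgrade_needed(running, desired))             # → True
```

Only components named in the desired set are checked, so extra components reported by the device do not trigger an upgrade in this sketch.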
3. Perform Backup
- Purpose: Create configuration backup before firmware changes
- Activities: Runs the standard BackupWorkflow as a child workflow
- Validation: Checks for configuration drift and halts if detected
4. Update Context and Validate
- Purpose: Prepare the device context and validate upgrade readiness
- Activities:
- update_device_context: Updates the device’s firmware_bundle_version and intended-firmware context
- validate_render_targets: Ensures render service generates correct firmware commands with polling
- validate_target_files: ⚠️ Temporarily disabled (MTLS ingress not enabled in utility clusters)
- Validation:
- Updates both firmware_bundle_version and intended-firmware context for template compatibility
- Polls rendered fwupdate-commands.txt to confirm the device will request the correct files
5. Execute Firmware Upgrade
- Purpose: Trigger factory reset and wait for ZTP completion
- Activities:
- execute_ztp: Triggers factory reset using the NVUE API
- poll_ztp_status: Polls ZTP status with extended 120-minute timeout for firmware upgrades
- Timeout: Extended to 120 minutes to accommodate firmware installation time
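A polling loop with a hard deadline, as this stage describes, might look like the sketch below. It is illustrative only: poll_until, its parameters, and the injectable clock and sleep functions are hypothetical and do not reflect how poll_ztp_status is implemented. Only the 120-minute figure comes from the stage description.

```python
import time
from typing import Callable

ZTP_TIMEOUT_S = 120 * 60  # 120 minutes, per the stage description


def poll_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True or timeout_s has elapsed.

    Returns True on success and False on timeout. The clock and sleep
    functions are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a check function that queries ZTP status, for example poll_until(ztp_done, ZTP_TIMEOUT_S, 60.0), where ztp_done is a placeholder for whatever status query the deployment uses.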
6. Validate Firmware Upgrade
- Purpose: Verify firmware upgrade success and handle conditional reboot
- Activities:
- get_current_os: Gets OS version after upgrade
- get_running_firmware: Gets firmware versions after upgrade
- compare_running_desired: Validates firmware matches expected versions
- reboot_device: Conditionally reboots device if firmware mismatch detected
- wait_reboot: Waits for device to come back online using uptime comparison
- Logic:
- If all firmware matches: Success
- If mismatch after upgrade: Attempt reboot and wait for device recovery (10-minute timeout)
- If still mismatched after reboot: Fail workflow (repeated reboots will not solve the problem)
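The three-way decision above can be sketched as a single function that reboots at most once. This is a minimal illustration under stated assumptions: validate_after_upgrade, FirmwareMismatchError, and the callables passed in are hypothetical stand-ins for the real activities, and firmware is assumed to be a flat component-to-version mapping.

```python
from typing import Callable, Dict


class FirmwareMismatchError(RuntimeError):
    """Raised when firmware still differs from desired versions after a reboot."""


def validate_after_upgrade(
    get_firmware: Callable[[], Dict[str, str]],
    desired: Dict[str, str],
    reboot_and_wait: Callable[[], None],
) -> str:
    """Return 'success' or 'success_after_reboot'; raise if a reboot did not help."""

    def mismatched() -> Dict[str, str]:
        running = get_firmware()
        return {k: v for k, v in desired.items() if running.get(k) != v}

    if not mismatched():
        return "success"

    # Firmware may be installed but not yet active, so reboot exactly once.
    reboot_and_wait()

    still_bad = mismatched()
    if still_bad:
        # A second reboot would not change the outcome, so fail here.
        raise FirmwareMismatchError(
            f"firmware still mismatched after reboot: {sorted(still_bad)}"
        )
    return "success_after_reboot"
```

Capping the flow at one reboot keeps the failure mode explicit: a persistent mismatch surfaces as an error instead of a reboot loop.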
Key Features
Firmware Version Mapping
- Reads firmware_bundle_version from device config context
- Maps to firmware bundle definitions in site context
- Extracts expected firmware versions for comparison
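As a rough sketch of this mapping, the lookup below resolves firmware_bundle_version against the inherited firmware_bundles definitions. The data layout is an assumption made for illustration: firmware_bundles is treated as a mapping from bundle version to a component-to-version mapping, which may not match the real config context schema. The function name expected_firmware is also hypothetical.

```python
from typing import Any, Dict, Mapping


def expected_firmware(config_context: Mapping[str, Any]) -> Dict[str, str]:
    """Resolve the device's bundle version to its expected component versions.

    Assumed (hypothetical) layout of the merged config context:
        {"firmware_bundle_version": "<bundle>",
         "firmware_bundles": {"<bundle>": {"<component>": "<version>"}}}
    """
    version = config_context.get("firmware_bundle_version")
    if version is None:
        raise KeyError("firmware_bundle_version is not set in the config context")

    bundles = config_context.get("firmware_bundles") or {}
    if version not in bundles:
        raise KeyError(f"no firmware bundle definition for version {version!r}")

    # Copy so callers cannot mutate the shared context.
    return dict(bundles[version])
```

Failing early with a clear error when either key is missing avoids comparing running firmware against an empty expectation.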
Extended Timeouts
Firmware upgrades take longer than OS upgrades:
- ZTP wait extended to 120 minutes
- Firmware polling with appropriate timeouts
- Conditional reboot with 10-minute device recovery timeout using proper uptime detection
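Uptime comparison for reboot detection can be sketched as follows: the device counts as rebooted once its reported uptime drops below the value recorded before the reboot. This is an illustrative sketch, not the wait_reboot activity itself. The function name, the get_uptime_s callable, the 15-second interval, and the use of ConnectionError to signal an unreachable device are assumptions; only the 10-minute timeout comes from the text.

```python
import time
from typing import Callable

REBOOT_TIMEOUT_S = 10 * 60  # 10-minute device recovery timeout


def wait_for_reboot(
    get_uptime_s: Callable[[], float],
    uptime_before_s: float,
    timeout_s: float = REBOOT_TIMEOUT_S,
    interval_s: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once uptime is lower than before the reboot, False on timeout."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if get_uptime_s() < uptime_before_s:
                return True  # the uptime counter reset, so the device rebooted
        except ConnectionError:
            pass  # device is still down; keep waiting
        sleep(interval_s)
    return False
```

Comparing uptime avoids mistaking a device that never went down for one that has already recovered. The check assumes the device had been up longer before the reboot than the time elapsed since.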
Comprehensive Validation
- Pre-upgrade:
- Validates rendered firmware commands are correct
- Validates target files exist on ZTP server (⚠️ temporarily disabled)
- Post-upgrade: Validates actual firmware matches expected
- Conditional reboot: Handles cases where firmware is installed but not active
Error Handling
- Platform validation: Only supports NV-OS
- Configuration drift detection: Halts workflow if drift detected
- ZTP failure handling: Extended timeouts with proper error reporting
- Persistent firmware mismatch detection: Fails after conditional reboot attempt
Configuration Requirements
Device Config Context
firmware_bundles is inherited from the site-level config context tied to the NVSwitch role.
Usage
Input
Prerequisites
- Device must be running NV-OS platform
- Device must have firmware_bundles configured in config context
- Firmware files must be available on ZTP server
- Device must be reachable using the NVUE API
Execution
You can trigger this workflow using the Config Manager Temporal API: