NVIDIA UFM Cable Validation Tool v1.7.1

High Availability (HA) Mode Support for CVT Plugin

The CVT (Cable Validation Tool) plugin supports High Availability (HA) mode to ensure continuous operation and service resilience in production environments. In an HA deployment, CVT operates in an active-passive configuration where the collector runs on only one node at a time. When a failover event occurs, the CVT collector on the active node (node1) is stopped, and the CVT collector on the standby node (node2) starts automatically. This architecture ensures that cable validation monitoring continues with minimal interruption, which is critical for maintaining network health and detecting issues in real time.

The HA implementation for CVT follows a shared-storage model where both nodes have access to common configuration and data files. When the standby node takes over, it therefore has immediate access to the same topology data, validation history, and configuration that were in use on the primary node. The shared resources include:

  • Configuration files - All CVT configuration settings, including cvt_env.conf

  • Topology data - Current and historical topology files

  • Database and persistent storage - All collected metrics and historical information

Note on Validation State: The runtime validation state (agent status, validation results, and operational data) is maintained in memory only and is not persisted to shared storage. When a failover occurs, the standby node will reload the topology from shared storage and restart validation. The validation state will be rebuilt as agents are redeployed and begin reporting data. This means there will be a brief period during failover where the validation state is being re-established, but the topology configuration and historical data remain intact.

When a failover occurs, whether due to planned maintenance, a system failure, or a network issue, the standby node resumes operations by loading the same topology and restarting validation, which minimizes the gap in validation coverage.

To enable automatic failover in HA mode, the CVT plugin requires specific configuration settings that let the standby collector resume validation as soon as it starts. Without these settings, an operator would have to reload the topology and restart validation manually after each failover, which would defeat the purpose of an HA setup. The automated recovery mechanism keeps network monitoring consistent across failover events.

For proper HA failover support, users must configure two critical parameters in the cvt_env.conf file under the [application] section:

  • STARTUP_TOPOLOGY=last - This setting instructs the collector to automatically load the last successfully loaded topology from the history when starting up. In a failover scenario, this ensures that the standby collector immediately restores the exact topology state that was active on the primary collector before the failover. Since both nodes share the same data files, the "last" topology reference points to the same topology file that was in use on the primary node. Without this setting, the collector would start with no topology loaded (the default none value), requiring manual topology loading after every failover and causing a gap in validation coverage.

  • AUTO_START_VALIDATION=true - This setting enables automatic validation startup once a topology is successfully loaded. When combined with STARTUP_TOPOLOGY=last, this creates a fully automated recovery process: upon startup, the standby collector loads the last topology from shared storage and immediately begins validation operations without requiring any user intervention. This is essential for maintaining continuous validation during failover events, as it eliminates the manual steps that would otherwise be needed to resume monitoring. The validation process will automatically deploy agents to network devices and resume cable validation exactly as it was running on the primary node.

To configure CVT for HA mode with automatic failover support, update the [application] section in cvt_env.conf:

[application]
# Topology loading at startup - specify what to load:
# none - do not load any topology file (default)
# last - load the last loaded topology from history
# <path> - load a specific topology file (supports .topo, .dot, .xlsx, .json)
# For HA mode: set to 'last' to enable automatic topology restore on failover
STARTUP_TOPOLOGY=last

# Automatically start validation if a topology file is loaded
# For HA mode: set to 'true' to enable automatic validation resume on failover
AUTO_START_VALIDATION=true
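Because a missing or mistyped value only shows up at failover time, it can help to check the file before relying on it. The following is a minimal sketch, not part of CVT: it assumes cvt_env.conf can be parsed as a standard INI file, and the function name check_ha_settings is hypothetical. The comparison is deliberately strict (exact match on last and true), since this document does not state whether other spellings are accepted.

```python
import configparser

# The two values this document requires for HA failover support.
REQUIRED_HA_SETTINGS = {
    "STARTUP_TOPOLOGY": "last",
    "AUTO_START_VALIDATION": "true",
}


def check_ha_settings(conf_text: str) -> list[str]:
    """Return a list of problems; an empty list means both settings are in place.

    Hypothetical helper: assumes the file content is valid INI syntax.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive (default lowercases them)
    parser.read_string(conf_text)

    if not parser.has_section("application"):
        return ["missing [application] section"]

    problems = []
    for key, expected in REQUIRED_HA_SETTINGS.items():
        actual = parser.get("application", key, fallback=None)
        if actual is None:
            problems.append(f"{key} is not set (expected {expected})")
        elif actual.strip() != expected:
            problems.append(f"{key}={actual.strip()} (expected {expected})")
    return problems


if __name__ == "__main__":
    sample = """
[application]
# For HA mode: restore the last topology and resume validation on failover
STARTUP_TOPOLOGY=last
AUTO_START_VALIDATION=true
"""
    print(check_ha_settings(sample) or "HA settings look correct")
```

To check a real deployment, read the file contents from wherever cvt_env.conf lives on the shared storage and pass the text to the function. Since both nodes read the same file, one check covers both.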

With the above configuration, the failover sequence operates as follows:

  1. Primary node (node1) failure - The CVT collector on node1 stops due to hardware failure, network issue, or planned maintenance

  2. HA cluster failover - The cluster management system (e.g., Pacemaker, Kubernetes, or other HA solution) detects the failure and initiates failover to node2

  3. Standby node (node2) startup - The CVT collector starts on node2

  4. Automatic topology restore - CVT reads the STARTUP_TOPOLOGY=last setting and automatically loads the last topology from the shared storage

  5. Automatic validation start - CVT reads the AUTO_START_VALIDATION=true setting and immediately begins validation operations

  6. Validation state rebuild - Agents are redeployed to network devices and begin reporting data. The runtime validation state (agent status, current validation results) is rebuilt in memory as agents come online and start collecting cable data

  7. Service resumed - Cable validation monitoring continues with minimal interruption

The entire process is fully automated, requiring no manual intervention to restore service after a failover event. While the topology configuration and historical data are immediately available from shared storage, the runtime validation state will be progressively rebuilt as agents reconnect and report their status.
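The way the two settings interact during steps 3 to 6 can be modeled in a few lines. This is an illustrative sketch only: plan_startup and its action strings are hypothetical and do not reflect CVT's actual implementation. It simply encodes the behavior described above, including the case where no topology is available and manual intervention is still needed.

```python
from typing import Optional


def plan_startup(startup_topology: str,
                 auto_start_validation: bool,
                 last_topology: Optional[str]) -> list[str]:
    """Model the collector's startup decisions as an ordered list of actions.

    startup_topology: "none", "last", or an explicit topology file path.
    last_topology: most recent entry in the shared history, or None if empty.
    """
    if startup_topology == "none":
        topology = None
    elif startup_topology == "last":
        topology = last_topology  # None when the shared history is empty
    else:
        topology = startup_topology  # treated as an explicit file path

    if topology is None:
        # Nothing to load, so validation cannot start on its own.
        return ["wait for manual topology load"]

    actions = [f"load topology {topology}"]
    if auto_start_validation:
        actions += ["start validation", "deploy agents", "rebuild runtime state"]
    else:
        actions.append("wait for manual validation start")
    return actions


if __name__ == "__main__":
    # HA configuration: fully automated recovery on the standby node.
    print(plan_startup("last", True, "fabric.topo"))
    # Default configuration: an operator must act after every failover.
    print(plan_startup("none", False, "fabric.topo"))
```

The model makes one dependency explicit: AUTO_START_VALIDATION=true has an effect only once a topology has been loaded, so setting it without STARTUP_TOPOLOGY=last (or an explicit path) does not produce an automated recovery.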

When deploying CVT in HA mode, keep the following considerations in mind:

  • Shared Storage: Ensure that the shared storage is reliable and accessible from both nodes with low latency

  • Network Configuration: Both nodes should have similar network configurations to ensure agents can communicate with the collector regardless of which node is active

  • Cluster Management: Use a proper HA cluster management solution to handle failover detection and node transitions

  • Testing: Regularly test failover scenarios to ensure the configuration is working as expected

  • Monitoring: Implement monitoring to detect failover events and verify that validation resumes successfully on the standby node
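For the testing and monitoring points, a failover drill usually comes down to measuring how long validation takes to resume on the standby node. The sketch below is a generic polling helper under stated assumptions: this document does not describe a CVT status interface, so the predicate that reports whether validation is running is left for the operator to supply, and wait_until is a hypothetical name.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout_s: float,
               interval_s: float = 1.0) -> float:
    """Poll predicate until it returns True and return the elapsed seconds.

    Raises TimeoutError if the condition is not met within timeout_s.
    """
    start = time.monotonic()
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for a real check: reports "running" on the third poll.
    polls = {"count": 0}

    def validation_is_running() -> bool:
        polls["count"] += 1
        return polls["count"] >= 3

    elapsed = wait_until(validation_is_running, timeout_s=5.0, interval_s=0.1)
    print(f"validation resumed after {elapsed:.2f} s and {polls['count']} polls")
```

In a drill, the predicate would be replaced with whatever check the deployment already uses to confirm that validation is active on node2. Recording the elapsed time across repeated drills gives a baseline for how long the runtime validation state takes to rebuild.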

Together, the STARTUP_TOPOLOGY=last and AUTO_START_VALIDATION=true configuration settings create a robust failover mechanism where the standby collector can automatically take over validation operations in an active-passive HA deployment. This configuration ensures minimal disruption to network monitoring and allows HA deployments to provide the high reliability that production environments demand, without requiring manual intervention during failover events.

© Copyright 2025, NVIDIA. Last updated on Nov 12, 2025