NVIDIA UFM Cable Validation Tool v1.7.1

Cable Validation Tool - Prometheus Metrics Endpoint

The Cable Validation Tool (CVT) now provides a Prometheus-compatible metrics endpoint that exposes real-time cable health and performance data for monitoring and alerting. This endpoint enables integration with modern monitoring stacks like Prometheus, Grafana, and other observability tools.

🎯 Key Capabilities

  • Real-time Metrics: Live cable validation data from network switches and hosts

  • Multi-format Support: Prometheus, JSON, and CSV output formats

  • Rich Labeling: Complete network topology context with peer relationships

  • High Performance: Multi-level caching optimized for frequent scraping

  • Production Ready: Handles 100K+ ports with memory-adaptive optimizations

📊 Metrics Categories

  • Power Metrics: RX/TX optical power per lane (up to 8 lanes per port)

  • BER Metrics: Effective and Raw Bit Error Rates

  • Temperature Metrics: Module temperature and thresholds

  • Counter Metrics: Transceiver reinsert/swap events

  • Validation Metrics: Port validation status with issue descriptions

  • Threshold Metrics: Power and temperature alarm thresholds

  • Timestamp Metrics: Data collection and report timestamps

Base URL

https://<cvt-server>/cablevalidation/metrics

Available Endpoints

Endpoint                      | Format     | Content-Type     | Description
/cablevalidation/metrics      | Prometheus | text/plain       | Standard Prometheus exposition format
/cablevalidation/metrics/json | JSON       | application/json | Structured JSON for programmatic access
/cablevalidation/metrics/csv  | CSV        | text/plain       | Comma-separated values for spreadsheet import


Authentication

  • No authentication required for metrics endpoints

  • HTTPS enforced for security

  • Bypasses session handling for automated scraping

Prometheus Format

# HELP effective_ber Effective Bit Error Rate

# TYPE effective_ber gauge

# HELP validation_status Port validation status with issue descriptions in labels (value = issue count)

# TYPE validation_status gauge

# HELP port_info Port information with status and validation details in labels

# TYPE port_info gauge

# Healthy port with performance metrics

effective_ber{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 1.5e-254 1759345924622

module_temperature{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 65.2 1759345924622

validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0 1759345924622

# Unplugged port with validation issue (power/temp metrics excluded due to NA values)

validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752

effective_ber{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 0.0 1759345923752

time_since_last_clear{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 320035.6 1759345923752

# Port info with detailed status

port_info{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",phy_manager_state="Disable",module_oper_status="unplugged",cable_sn="N/A",cable_pn="N/A",protocol="Ethernet",module_fw_version="N/A"} 1 1759345923752
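The sample lines above follow the Prometheus text exposition layout of metric name, label set, value, and optional timestamp. As a minimal sketch of how a client might split such a line (the function name `parse_sample` is hypothetical and not part of CVT, and escaped characters inside label values are left as-is), consider:

```python
import re

# name{labels} value [timestamp]; the label block is optional in the general format
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one sample line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        m.group('key'): m.group('val')
        for m in LABEL_RE.finditer(match.group('labels') or '')
    }
    timestamp = int(match.group('ts')) if match.group('ts') else None
    return match.group('name'), labels, float(match.group('value')), timestamp
```

For production use, an established exposition-format parser is likely a safer choice than a hand-written regular expression.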

JSON Format

{
  "ufm-host38:enp3s0f0np0": {
    "timestamp": 1757524769.645,
    "port_info": {
      "node_name": "ufm-host38",
      "port_name": "enp3s0f0np0",
      "peer_node_name": "r-ufm-sw-eth01",
      "peer_port_name": "swp2",
      "node_type": "Host",
      "su_number": "SU1",
      "data_hall": "DH1"
    },
    "port_labels": {
      "cable_sn": "ABC123",
      "cable_pn": "DEF456",
      "protocol": "400G"
    },
    "port_stats": {
      "effective_ber": 1.5e-254,
      "module_temperature": 65.2,
      "rx_power_lane_0": -2.5
    },
    "validation_data": {
      "issues_count": 1,
      "last_report_time": 1757524769.645,
      "issues": {
        "WrongNeighbor": "Check cable connection to switch2"
      }
    }
  }
}
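The JSON document is keyed by "node:port", with validation results nested under validation_data. A minimal sketch of walking that structure to list ports that report issues might look like the following; the helper name `ports_with_issues` is hypothetical, and the embedded sample is a trimmed copy of the example above rather than live output.

```python
import json

# Trimmed copy of the documented example payload
SAMPLE = json.loads("""
{
  "ufm-host38:enp3s0f0np0": {
    "timestamp": 1757524769.645,
    "port_stats": {"effective_ber": 1.5e-254, "module_temperature": 65.2},
    "validation_data": {
      "issues_count": 1,
      "last_report_time": 1757524769.645,
      "issues": {"WrongNeighbor": "Check cable connection to switch2"}
    }
  }
}
""")


def ports_with_issues(metrics):
    """Yield (port_key, issues) for every port whose issues_count is above zero."""
    for port_key, port_data in metrics.items():
        validation = port_data.get("validation_data", {})
        if validation.get("issues_count", 0) > 0:
            yield port_key, validation.get("issues", {})


for key, issues in ports_with_issues(SAMPLE):
    for syndrome, action in issues.items():
        print(f"{key}: {syndrome} -> {action}")
```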

Power Metrics

Metric           | Type  | Description                                                     | Unit
rx_power_lane_N  | gauge | RX optical power for lane N (0-7, not all lanes may be present) | dBm
tx_power_lane_N  | gauge | TX optical power for lane N (0-7, not all lanes may be present) | dBm
rx_power_high_th | gauge | RX power high threshold                                         | dBm
rx_power_low_th  | gauge | RX power low threshold                                          | dBm
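Because per-lane power and its thresholds are exposed as separate gauges, a consumer has to compare them itself. The sketch below assumes, without confirmation from this document, that the JSON port_stats object carries the threshold values under the same names as the metrics; the function name `rx_lanes_out_of_range` is hypothetical.

```python
def rx_lanes_out_of_range(port_stats):
    """Return {lane: power_dbm} for RX lanes outside the low/high thresholds."""
    low = port_stats.get("rx_power_low_th")    # assumed field name
    high = port_stats.get("rx_power_high_th")  # assumed field name
    out_of_range = {}
    for lane in range(8):  # lanes 0-7; absent lanes are skipped
        power = port_stats.get(f"rx_power_lane_{lane}")
        if power is None:
            continue
        too_low = low is not None and power < low
        too_high = high is not None and power > high
        if too_low or too_high:
            out_of_range[lane] = power
    return out_of_range
```

When neither threshold is present, the function reports nothing instead of guessing a limit.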


BER Metrics

Metric        | Type  | Description
effective_ber | gauge | Effective Bit Error Rate
raw_ber       | gauge | Raw Bit Error Rate


Temperature Metrics

Metric              | Type  | Description                | Unit
module_temperature  | gauge | Current module temperature | Celsius
temperature_high_th | gauge | Temperature high threshold | Celsius
temperature_low_th  | gauge | Temperature low threshold  | Celsius


Status Metrics

Metric           | Type  | Description             | Values
port_status      | gauge | Port plugged status     | 1=Up, 0=Down
port_oper_status | gauge | Port operational status | 1=Up, 0=Down


Counter Metrics

Metric                   | Type    | Description
transceiver_reinsert_cnt | counter | Number of transceiver reinsert events
transceiver_swap_cnt     | counter | Number of transceiver swap events
time_since_last_clear    | gauge   | Time since last counter clear (seconds)


Validation Metrics

Metric            | Type  | Description                                    | Special Features
validation_status | gauge | Port validation status with issue descriptions | Value = issue count, descriptions in labels
last_report_time  | gauge | Timestamp of last validation report            | Unix timestamp

Validation Status Labels

The validation_status metric includes dynamic labels for each type of validation issue:

  • WrongNeighbor: "Check cable connection to correct switch"

  • MediaUnplugged: "Insert; Reseat or Replace Cable/Transceiver"

  • AnomalousPort: "Temperature exceeds threshold"

  • FlappingLink: "Reseat transceiver; Check Fiber"

  • UnknownNeighbor: "Verify neighbor device connectivity"

  • WrongPort: "Check port mapping in topology"

  • ExtraCable: "Remove unexpected cable connection"

  • UnreachableDevice: "Check device connectivity and power"

  • LinkDown_NoSignal: "Check physical connection"

  • ErrDisable_Flap: "Port disabled due to flapping"

  • AdminDown: "Port administratively disabled"

  • ErrDisable_Rx: "RX error disable condition"

  • NegotiationFail: "Check autonegotiation settings"

  • NicNameMismatch: "Verify NIC provisioning"

  • ModulePnMismatch: "Replace with compatible module"

Note: Commas in descriptions are automatically converted to semicolons to maintain Prometheus label format compatibility.

Examples:

# Port with validation issues (unplugged cable)

validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1

# Port without issues (healthy connection)

validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0

# Port with multiple validation issues

validation_status{node="switch1",port="1/1",WrongNeighbor="Check cable connection to switch2",AnomalousPort="Temperature exceeds threshold"} 2
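Since the issue labels are dynamic, a consumer cannot know their names in advance. One possible approach, sketched below, treats every label that is not one of the documented topology labels as a syndrome and splits its description on the semicolons that replaced the original commas. The helper name `issue_labels` is hypothetical, and a real scrape may add further labels (such as job or instance) that would also need excluding.

```python
# Topology labels documented for all metrics
TOPOLOGY_LABELS = {
    "node", "port", "peer_node", "peer_port",
    "node_type", "su_number", "data_hall",
}


def issue_labels(labels):
    """Return {syndrome: [action, ...]} from one validation_status label set."""
    return {
        name: [part.strip() for part in text.split(";")]
        for name, text in labels.items()
        if name not in TOPOLOGY_LABELS
    }
```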

Topology Labels (All Metrics)

  • node: Switch or host name

  • port: Port identifier

  • peer_node: Connected peer node name

  • peer_port: Connected peer port identifier

  • node_type: Node type (Switch, Host, etc.)

  • su_number: Scalable Unit identifier

  • data_hall: Data hall location

Cable Labels (Status Metrics Only)

  • cable_sn: Cable serial number

  • cable_pn: Cable part number

  • protocol: Cable protocol (400G, InfiniBand, etc.)

  • port_status: Port status (Up, Down, etc.)

  • plugged: Module plugged status

Update Frequency

  • Agent Data: Updated every 10 minutes (configurable)

  • Metrics Cache: Invalidated on data changes

  • Prometheus Scraping: Recommended 15-30 second intervals

Performance Metrics

Deployment Size | Response Time | Memory Usage | Caching Strategy
< 10K ports     | < 50ms        | ~20MB        | Full caching enabled
10K-50K ports   | < 200ms       | ~100MB       | Collection cache only
50K+ ports      | < 500ms       | ~200MB       | No caching, real-time generation


Optimization Features

  • Multi-level Caching: Port, collection, and label caching

  • Memory Adaptive: Automatically adjusts for large deployments

  • Smart Change Detection: Only updates when cable/module data changes

  • Zero Value Handling: Includes all values for complete visibility

Common Issues

1. No Metrics Data

Symptoms: Empty response or no metrics

Causes:

  • CVT service not running

  • No topology loaded

  • No advanced stats collection

Solutions:

# Check service status

# Check topology loading

# Check agent connectivity

2. Missing Port Data

Symptoms: Some ports not appearing in metrics

Causes:

  • Port not in loaded topology

  • Agent not deployed on switch

  • Advanced stats not collected

Solutions:

  • Verify topology includes all expected ports

  • Deploy agents on missing switches

  • Check agent connectivity and data collection

3. Stale Timestamps

Symptoms: Old timestamps in metrics

Causes:

  • Agent not sending updates

  • Network connectivity issues

Solutions:

  • Check agent logs for errors

  • Verify network connectivity to switches

  • Restart agents if necessary
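Stale samples can also be detected on the client side by comparing a sample's timestamp with the agent update interval. The sketch below assumes the exposition timestamps are in milliseconds since the Unix epoch, as the 13-digit values in the examples suggest; the function name `is_stale` and the `max_missed` tolerance of two intervals are hypothetical choices.

```python
import time

AGENT_INTERVAL_S = 10 * 60  # documented default agent update interval (configurable)


def is_stale(sample_ts_ms, now_s=None, max_missed=2):
    """Return True when the sample is older than max_missed agent intervals."""
    if now_s is None:
        now_s = time.time()
    age_s = now_s - sample_ts_ms / 1000.0
    return age_s > max_missed * AGENT_INTERVAL_S
```

If the agent interval has been reconfigured, AGENT_INTERVAL_S should be changed to match.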

4. Missing Validation Data

Symptoms: validation_status metrics missing or always 0

Causes:

  • Validation reports not being generated

  • Agent data filtering (switch not in topology)

  • Report processing errors

Solutions:

  • Verify validation is started on agents

  • Check switch IP exists in topology

  • Review agent and collector logs for errors

5. Inconsistent Issue Counts

Symptoms: validation_status count doesn't match expected issues

Causes:

  • Issues filtered by port

  • Report data synchronization issues

  • Processing errors

Solutions:

  • Check that report data includes port-specific issues

  • Verify advanced stats and reports arrive together

  • Review validation report structure

Performance Tuning

Environment Variables

TBD: the environment variables below are planned but not supported yet.

# Adjust caching thresholds

PROMETHEUS_MAX_CACHED_PORTS=10000

# Disable detailed metrics for very large deployments

PROMETHEUS_ENABLE_DETAILED_METRICS=false

# Adjust cache TTL

PROMETHEUS_CACHE_TTL=60

Memory Optimization

For deployments > 50K ports:

  • Collection-level caching automatically disabled

  • Port-level caching automatically disabled

  • Real-time generation used (acceptable 200-500ms response time)

Access Control

  • HTTPS Required: All access must be over HTTPS

  • No Authentication: Designed for automated monitoring tools

  • Network Restrictions: Consider IP-based access control

Sensitive Data

  • Network Topology: Metrics expose network structure

  • Cable Information: Serial numbers and part numbers included

  • Performance Data: Could reveal network capacity information

Recommended Security

<Location /cablevalidation/metrics>
    Use BringupProxy
    SSLRequireSSL
    # Restrict to monitoring networks (a client matching any one range is allowed)
    <RequireAny>
        # Internal networks
        Require ip 10.0.0.0/8
        # Container networks
        Require ip 172.16.0.0/12
        # Private networks
        Require ip 192.168.0.0/16
    </RequireAny>
</Location>

Python Client Example

import requests

# Fetch CVT metrics in one of the supported formats
def get_cvt_metrics(server: str, port: int, response_format: str = 'prometheus'):
    endpoints = {
        'prometheus': '/cablevalidation/metrics',
        'json': '/cablevalidation/metrics/json',
        'csv': '/cablevalidation/metrics/csv'
    }
    url = f"https://{server}:{port}{endpoints[response_format]}"
    # verify=False skips TLS certificate verification; pass a CA bundle path instead where possible
    response = requests.get(url, verify=False, timeout=10)
    response.raise_for_status()
    if response_format == 'json':
        return response.json()
    return response.text

# Usage (port 443 is an assumption; use the port your CVT server listens on)
metrics = get_cvt_metrics('cvt-server.example.com', 443, 'json')
for port_key, port_data in metrics.items():
    # effective_ber can be absent when the value is NA
    ber = port_data['port_stats'].get('effective_ber')
    if ber is not None and ber > 1e-12:
        print(f"High BER on {port_key}: {ber}")

Validation Monitoring Examples

# Find all ports with validation issues
validation_status > 0

# Count ports per node affected by a given syndrome
# (the sample value is the port's total issue count, so count series instead of summing values)
count by (node) (validation_status{WrongNeighbor!=""})
count by (node) (validation_status{MediaUnplugged!=""})
count by (node) (validation_status{LinkDown_NoSignal!=""})

# Find specific issue types
validation_status{MediaUnplugged!=""} > 0 # Unplugged cables
validation_status{AdminDown!=""} > 0 # Administratively disabled ports
validation_status{ModulePnMismatch!=""} > 0 # Hardware compatibility issues

# Ports with multiple issue types
validation_status{WrongNeighbor!="",AnomalousPort!=""}

# Correlation with performance metrics
# (match on node and port, because validation_status carries issue labels that effective_ber lacks)
(validation_status > 0) and on(node, port) (effective_ber > 1e-12)

# Port status correlation (match on node and port so each port is compared with its own port_info)
validation_status{MediaUnplugged!=""} and on(node, port) port_info{module_oper_status="unplugged"}
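For clients that read the JSON endpoint instead of querying Prometheus, the correlation between validation issues and BER can be approximated in a few lines. This is a sketch only: the function name `ports_with_issues_and_high_ber` is hypothetical, the field names follow the JSON example in this document, and the 1e-12 threshold is the one used in the queries above.

```python
BER_THRESHOLD = 1e-12


def ports_with_issues_and_high_ber(metrics, threshold=BER_THRESHOLD):
    """Return sorted port keys that have validation issues and BER above threshold."""
    hits = []
    for port_key, port_data in metrics.items():
        issues = port_data.get("validation_data", {}).get("issues_count", 0)
        ber = port_data.get("port_stats", {}).get("effective_ber")
        # A missing BER (NA) is treated as "no evidence", not as zero
        if issues > 0 and ber is not None and ber > threshold:
            hits.append(port_key)
    return sorted(hits)
```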

Data Flow

Network Agents → CVT Collector → Advanced Stats + Report Data → Prometheus Collector → Metrics Endpoint
      ↓                ↓                      ↓                          ↓                    ↓
   (10 min)       (Real-time)           (Synchronized)          (Multi-level Cache)     (GET request)

Enhanced Data Processing

  1. Agent Data Validation: Switch IP validated against topology before processing

  2. Synchronized Processing: Advanced stats and validation reports processed together

  3. Optimized Issue Processing: Report data pre-processed to group issues by port (O(n+m) complexity)

  4. Independent Validation Cache: PortValidationStatus class with hash-based change detection

  5. Robust Syndrome Handling: Automatic fallback for unknown syndromes with developer warnings

  6. Smart Data Quality: NA values properly excluded, counters preserve semantics
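The O(n+m) figure in step 3 follows from indexing the m report issues by port once and then doing a constant-time lookup for each of the n ports. The sketch below illustrates that idea only; the record fields (node, port, syndrome, description) and the function names are hypothetical and do not reflect the actual CVT report structure.

```python
from collections import defaultdict


def group_issues_by_port(report_issues):
    """One pass over m issues: build {(node, port): {syndrome: description}}."""
    by_port = defaultdict(dict)
    for issue in report_issues:
        key = (issue["node"], issue["port"])
        by_port[key][issue["syndrome"]] = issue["description"]
    return by_port


def attach_issues(ports, report_issues):
    """One pass over n ports with O(1) lookups, for O(n + m) work in total."""
    by_port = group_issues_by_port(report_issues)
    return {key: by_port.get(key, {}) for key in ports}
```

Scanning the full issue list once per port would instead cost O(n*m), which is the case the preprocessing avoids.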

Caching Strategy

  1. Port-level Cache: Individual port metrics cached until data changes

  2. Collection-level Cache: Aggregated output cached for fast retrieval

  3. Label Cache: Stable topology/cable labels cached separately

  4. Validation Cache: Independent cache for validation status with hash-based change detection

  5. Metadata Cache: Static TYPE/HELP comments cached permanently
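Hash-based change detection, as used by the validation cache, can be sketched as follows: a digest of the issue content is stored per port, and the rendered output is regenerated only when the digest changes. The class below is a hypothetical stand-in, not the PortValidationStatus class mentioned earlier, and the choice of SHA-256 over canonical JSON is an assumption.

```python
import hashlib
import json


class ValidationCache:
    """Hypothetical illustration of content-hash cache invalidation."""

    def __init__(self):
        self._digests = {}
        self._rendered = {}

    @staticmethod
    def _digest(issues):
        # sort_keys makes the digest independent of dict insertion order
        payload = json.dumps(issues, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def update(self, port_key, issues, render):
        """Re-render only when issue content changed; return (text, changed)."""
        digest = self._digest(issues)
        if self._digests.get(port_key) == digest:
            return self._rendered[port_key], False
        self._digests[port_key] = digest
        self._rendered[port_key] = render(port_key, issues)
        return self._rendered[port_key], True
```

A repeated report with identical issues then costs one hash computation instead of a full re-render.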

Performance Optimizations

  • Push-based Updates: Metrics updated when advanced stats arrive

  • Smart Change Detection: Only cable/module changes invalidate caches

  • Memory Adaptive: Caching disabled automatically for large deployments

  • String Manipulation: Efficient JSON aggregation using string operations

  • Validation Processing: O(n+m) complexity with report preprocessing

  • Hash-based Cache: Validation cache only invalidated when issue content changes

  • Agent Data Filtering: Invalid switch IPs filtered early to prevent unnecessary processing

Prometheus Configuration

  • Scrape Interval: 15-30 seconds (agent data itself refreshes every 10 minutes by default, so scrapes between agent updates return unchanged values)

  • Timeout: 10 seconds (allows for cache generation)

  • Retention: Configure based on historical analysis needs

Alerting Guidelines

  • BER Thresholds: Alert when effective_ber > 1e-12

  • Temperature Limits: Alert when module_temperature approaches temperature_high_th

  • Validation Issues: Alert when validation_status > 0

  • Critical Issues: Alert on specific syndromes (MediaUnplugged, UnreachableDevice, ModulePnMismatch)

  • Infrastructure Issues: Alert on LinkDown_NoSignal, ErrDisable conditions

  • Counter Anomalies: Alert on rapid increases in transceiver_reinsert_cnt

Sample Alerting Rules

# Validation issues alert
- alert: PortValidationIssues
  expr: validation_status > 0
  labels:
    severity: warning
  annotations:
    summary: "Port {{ $labels.node }}:{{ $labels.port }} has validation issues"
    description: "{{ $value }} validation issues detected"

# Critical validation issues
- alert: CriticalPortIssues
  expr: validation_status{MediaUnplugged!=""} > 0 or validation_status{UnreachableDevice!=""} > 0
  labels:
    severity: critical
  annotations:
    summary: "Critical issues on {{ $labels.node }}:{{ $labels.port }}"
    description: "{{ if $labels.MediaUnplugged }}Cable unplugged: {{ $labels.MediaUnplugged }}{{ end }}{{ if $labels.UnreachableDevice }}Device unreachable: {{ $labels.UnreachableDevice }}{{ end }}"

# Infrastructure issues
- alert: InfrastructureIssues
  expr: validation_status{LinkDown_NoSignal!=""} > 0 or validation_status{ModulePnMismatch!=""} > 0
  labels:
    severity: warning
  annotations:
    summary: "Infrastructure issue on {{ $labels.node }}:{{ $labels.port }}"

# Administrative issues
- alert: AdminIssues
  expr: validation_status{AdminDown!=""} > 0 or validation_status{NicNameMismatch!=""} > 0
  labels:
    severity: info
  annotations:
    summary: "Administrative issue on {{ $labels.node }}:{{ $labels.port }}"

Release Notes

  • Version: 1.1.0

  • Release Date: October 2025

  • Compatibility: CVT 1.7.0 and later

  • Dependencies: Requires advanced stats collection enabled

New in Version 1.1.0

  • ✅ Validation Metrics Integration: Port validation status with actionable issue descriptions

  • ✅ Synchronized Data Processing: Advanced stats and validation reports processed together

  • ✅ Performance Optimizations: O(n+m) validation processing, hash-based change detection

  • ✅ Enhanced Security: Agent data validation prevents processing from unknown switches

  • ✅ Improved Data Quality: None-based initialization for gauges, proper counter semantics

  • ✅ Better Caching: Independent validation cache with content-based invalidation

  • ✅ Comprehensive Syndrome Coverage: 15+ validation issue types with fallback handling

  • ✅ Real-world Validation: Successfully tested with production data and unplugged ports

API Stability

  • Metric Names: Stable (no breaking changes planned)

  • Label Names: Stable (additions possible, no removals)

  • Output Format: Prometheus standard compliance maintained

  • Endpoint URLs: Stable API contract

Enhanced Counter Semantics

  • Gauges default to None: Missing sensor data excluded instead of showing false zeros

  • Counters preserve values: No unexpected resets when data temporarily unavailable

  • Proper NA handling: Invalid data marked as NA and excluded from metrics

  • Temperature accuracy: Fixed zero temperature issue by using actual amber timestamps
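The None-based gauge handling described above can be illustrated with a small rendering sketch: values that are None are skipped entirely, while genuine zeros are still emitted. The function name `render_gauges` is hypothetical, and label-value escaping is omitted for brevity.

```python
def render_gauges(name_to_value, labels):
    """Render gauge lines, omitting metrics whose value is None."""
    label_text = ",".join(f'{key}="{value}"' for key, value in labels.items())
    lines = []
    for name, value in name_to_value.items():
        if value is None:
            continue  # missing sensor data: omit instead of emitting a false zero
        lines.append(f"{name}{{{label_text}}} {value}")
    return lines
```

Checking for None explicitly, not for falsiness, is what keeps a real reading of 0.0 visible.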

Validation Integration Benefits

  • Synchronized processing: Performance metrics and validation issues always in sync

  • Rich context: Issue descriptions provide actionable corrective actions

  • Efficient processing: O(n+m) complexity prevents performance degradation

  • Smart caching: Validation cache independent of performance metrics cache

Troubleshooting

  1. Check CVT Service: Ensure Cable Validation service is running

  2. Verify Topology: Confirm network topology is loaded

  3. Agent Status: Check that agents are deployed and collecting data

  4. Network Connectivity: Verify switch/host accessibility

Performance Monitoring

# Check metrics endpoint response time

time curl -k https://cvt-server/cablevalidation/metrics > /dev/null

Contact Information

  • Development Team: Cable Validation Engineering

  • Documentation: [Internal Wiki Link]

  • Support: [Support Channel/Email]


This endpoint provides comprehensive cable validation metrics for modern monitoring and observability workflows, enabling proactive network health management and automated alerting.

© Copyright 2025, NVIDIA. Last updated on Nov 12, 2025