Cable Validation Tool - Prometheus Metrics Endpoint
The Cable Validation Tool (CVT) now provides a Prometheus-compatible metrics endpoint that exposes real-time cable health and performance data for monitoring and alerting. This endpoint enables integration with modern monitoring stacks like Prometheus, Grafana, and other observability tools.
🎯 Key Capabilities
Real-time Metrics: Live cable validation data from network switches and hosts
Multi-format Support: Prometheus, JSON, and CSV output formats
Rich Labeling: Complete network topology context with peer relationships
High Performance: Multi-level caching optimized for frequent scraping
Production Ready: Handles 100K+ ports with memory-adaptive optimizations
📊 Metrics Categories
Power Metrics: RX/TX optical power per lane (up to 8 lanes per port)
BER Metrics: Effective and Raw Bit Error Rates
Temperature Metrics: Module temperature and thresholds
Counter Metrics: Transceiver reinsert/swap events
Validation Metrics: Port validation status with issue descriptions
Threshold Metrics: Power and temperature alarm thresholds
Timestamp Metrics: Data collection and report timestamps
Base URL
https://<cvt-server>/cablevalidation/metrics
Available Endpoints
Endpoint | Format | Content-Type | Description |
/cablevalidation/metrics | Prometheus | | Standard Prometheus exposition format |
/cablevalidation/metrics/json | JSON | | Structured JSON for programmatic access |
/cablevalidation/metrics/csv | CSV | | Comma-separated values for spreadsheet import |
Authentication
No authentication required for metrics endpoints
HTTPS enforced for security
Bypasses session handling for automated scraping
Prometheus Format
# HELP effective_ber Effective Bit Error Rate
# TYPE effective_ber gauge
# HELP validation_status Port validation status with issue descriptions in labels (value = issue count)
# TYPE validation_status gauge
# HELP port_info Port information with status and validation details in labels
# TYPE port_info gauge
# Healthy port with performance metrics
effective_ber{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 1.5e-254 1759345924622
module_temperature{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 65.2 1759345924622
validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0 1759345924622
# Unplugged port with validation issue (power/temp metrics excluded due to NA values)
validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752
effective_ber{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 0.0 1759345923752
time_since_last_clear{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 320035.6 1759345923752
# Port info with detailed status
port_info{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",phy_manager_state="Disable",module_oper_status="unplugged",cable_sn="N/A",cable_pn="N/A",protocol="Ethernet",module_fw_version="N/A"} 1 1759345923752
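To consume the sample lines above outside Prometheus, each line can be split into a metric name, a label dictionary, a value, and an optional millisecond timestamp. The following is a minimal sketch, not part of CVT: the function name `parse_sample` is hypothetical, and the regular expressions cover only the simple line shapes shown here (escape sequences in label values are left as-is). For full exposition-format parsing, the `prometheus_client` package ships a complete parser.

```python
import re

# Shape of one sample line: metric_name{labels} value [timestamp_ms]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One label pair: key="value" (value may contain escaped characters)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one sample line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group('labels') or ''))
    timestamp = int(match.group('ts')) if match.group('ts') else None
    return match.group('name'), labels, float(match.group('value')), timestamp
```

Calling `parse_sample` on a `validation_status` line returns the issue count as the value and exposes each issue description under its syndrome name in the label dictionary.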
JSON Format
{
  "ufm-host38:enp3s0f0np0": {
    "timestamp": 1757524769.645,
    "port_info": {
      "node_name": "ufm-host38",
      "port_name": "enp3s0f0np0",
      "peer_node_name": "r-ufm-sw-eth01",
      "peer_port_name": "swp2",
      "node_type": "Host",
      "su_number": "SU1",
      "data_hall": "DH1"
    },
    "port_labels": {
      "cable_sn": "ABC123",
      "cable_pn": "DEF456",
      "protocol": "400G"
    },
    "port_stats": {
      "effective_ber": 1.5e-254,
      "module_temperature": 65.2,
      "rx_power_lane_0": -2.5
    },
    "validation_data": {
      "issues_count": 1,
      "last_report_time": 1757524769.645,
      "issues": {
        "WrongNeighbor": "Check cable connection to switch2"
      }
    }
  }
}
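Because the JSON payload is keyed by `node:port`, picking out the ports that currently report issues takes a single pass. The sketch below assumes only the field names visible in the sample above (`validation_data`, `issues_count`, `issues`); the function name `ports_with_issues` is hypothetical.

```python
def ports_with_issues(payload):
    """Return {port_key: issues} for every port reporting at least one issue."""
    flagged = {}
    for port_key, record in payload.items():
        # Tolerate records that carry no validation data at all
        validation = record.get('validation_data') or {}
        if validation.get('issues_count', 0) > 0:
            flagged[port_key] = dict(validation.get('issues', {}))
    return flagged


sample = {
    "ufm-host38:enp3s0f0np0": {
        "validation_data": {
            "issues_count": 1,
            "issues": {"WrongNeighbor": "Check cable connection to switch2"},
        }
    }
}
print(ports_with_issues(sample))
```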
Power Metrics
Metric | Type | Description | Unit |
| gauge | RX optical power for lane N (0-7, not all lanes may be present) | dBm |
| gauge | TX optical power for lane N (0-7, not all lanes may be present) | dBm |
| gauge | RX power high threshold | dBm |
| gauge | RX power low threshold | dBm |
BER Metrics
Metric | Type | Description |
| gauge | Effective Bit Error Rate |
| gauge | Raw Bit Error Rate |
Temperature Metrics
Metric | Type | Description | Unit |
| gauge | Current module temperature | Celsius |
| gauge | Temperature high threshold | Celsius |
| gauge | Temperature low threshold | Celsius |
Status Metrics
Metric | Type | Description | Values |
| gauge | Port plugged status | 1=Up, 0=Down |
| gauge | Port operational status | 1=Up, 0=Down |
Counter Metrics
Metric | Type | Description |
| counter | Number of transceiver reinsert events |
| counter | Number of transceiver swap events |
| gauge | Time since last counter clear (seconds) |
Validation Metrics
Metric | Type | Description | Special Features |
| gauge | Port validation status with issue descriptions | Value = issue count, descriptions in labels |
| gauge | Timestamp of last validation report | Unix timestamp |
Validation Status Labels
The validation_status metric includes dynamic labels for each type of validation issue:
WrongNeighbor: "Check cable connection to correct switch"
MediaUnplugged: "Insert; Reseat or Replace Cable/Transceiver"
AnomalousPort: "Temperature exceeds threshold"
FlappingLink: "Reseat transceiver; Check Fiber"
UnknownNeighbor: "Verify neighbor device connectivity"
WrongPort: "Check port mapping in topology"
ExtraCable: "Remove unexpected cable connection"
UnreachableDevice: "Check device connectivity and power"
LinkDown_NoSignal: "Check physical connection"
ErrDisable_Flap: "Port disabled due to flapping"
AdminDown: "Port administratively disabled"
ErrDisable_Rx: "RX error disable condition"
NegotiationFail: "Check autonegotiation settings"
NicNameMismatch: "Verify NIC provisioning"
ModulePnMismatch: "Replace with compatible module"
Note: Commas in descriptions are automatically converted to semicolons to maintain Prometheus label format compatibility.
Examples:
# Port with validation issues (unplugged cable)
validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1
# Port without issues (healthy connection)
validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0
# Port with multiple validation issues
validation_status{node="switch1",port="1/1",WrongNeighbor="Check cable connection to switch2",AnomalousPort="Temperature exceeds threshold"} 2
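The examples above follow a simple rule: one label per issue, commas in descriptions replaced by semicolons, and the sample value equal to the number of issues. The sketch below reproduces that rule for illustration only. It is not CVT's implementation, and the names `format_validation_status` and `_escape` are hypothetical.

```python
def _escape(value):
    """Escape a label value per the Prometheus exposition format."""
    return value.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')


def format_validation_status(topology, issues):
    """Render one validation_status sample line.

    topology: dict of topology labels (node, port, ...)
    issues:   dict mapping syndrome name -> description
    """
    labels = dict(topology)
    for syndrome, description in issues.items():
        # Commas become semicolons, as the note above describes
        labels[syndrome] = description.replace(',', ';')
    rendered = ','.join(f'{key}="{_escape(val)}"' for key, val in labels.items())
    # The sample value is the issue count
    return f'validation_status{{{rendered}}} {len(issues)}'


print(format_validation_status(
    {'node': 'switch1', 'port': '1/1'},
    {'MediaUnplugged': 'Insert, Reseat or Replace Cable/Transceiver'},
))
```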
Topology Labels (All Metrics)
node: Switch or host name
port: Port identifier
peer_node: Connected peer node name
peer_port: Connected peer port identifier
node_type: Node type (Switch, Host, etc.)
su_number: Scalable Unit identifier
data_hall: Data hall location
Cable Labels (Status Metrics Only)
cable_sn: Cable serial number
cable_pn: Cable part number
protocol: Cable protocol (400G, InfiniBand, etc.)
port_status: Port status (Up, Down, etc.)
plugged: Module plugged status
Update Frequency
Agent Data: Updated every 10 minutes (configurable)
Metrics Cache: Invalidated on data changes
Prometheus Scraping: Recommended 15-30 second intervals
Performance Metrics
Deployment Size | Response Time | Memory Usage | Caching Strategy |
< 10K ports | < 50ms | ~20MB | Full caching enabled |
10K-50K ports | < 200ms | ~100MB | Collection cache only |
50K+ ports | < 500ms | ~200MB | No caching, real-time generation |
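The tiers in the table above amount to a threshold lookup on the port count. The sketch below shows that selection under stated assumptions: the function name and the returned strategy names are hypothetical, and the handling of the exact 10K and 50K boundaries is a guess, since the table does not say which tier owns them.

```python
def caching_strategy(port_count):
    """Pick a caching tier from the deployment size (thresholds from the table)."""
    if port_count < 10_000:
        return 'full'             # port, collection, and label caches enabled
    if port_count <= 50_000:
        return 'collection_only'  # only the aggregated output is cached
    return 'none'                 # output generated on every request


for size in (2_000, 25_000, 120_000):
    print(size, caching_strategy(size))
```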
Optimization Features
Multi-level Caching: Port, collection, and label caching
Memory Adaptive: Automatically adjusts for large deployments
Smart Change Detection: Only updates when cable/module data changes
Zero Value Handling: Zero readings are emitted rather than suppressed; only NA values are excluded
Common Issues
1. No Metrics Data
Symptoms: Empty response or no metrics
Causes:
CVT service not running
No topology loaded
No advanced stats collection
Solutions:
# Check service status
# Check topology loading
# Check agent connectivity
2. Missing Port Data
Symptoms: Some ports not appearing in metrics
Causes:
Port not in loaded topology
Agent not deployed on switch
Advanced stats not collected
Solutions:
Verify topology includes all expected ports
Deploy agents on missing switches
Check agent connectivity and data collection
3. Stale Timestamps
Symptoms: Old timestamps in metrics
Causes:
Agent not sending updates
Network connectivity issues
Solutions:
Check agent logs for errors
Verify network connectivity to switches
Restart agents if necessary
4. Missing Validation Data
Symptoms: validation_status metrics missing or always 0
Causes:
Validation reports not being generated
Agent data filtering (switch not in topology)
Report processing errors
Solutions:
Verify validation is started on agents
Check switch IP exists in topology
Review agent and collector logs for errors
5. Inconsistent Issue Counts
Symptoms: validation_status count doesn't match expected issues
Causes:
Issues filtered by port
Report data synchronization issues
Processing errors
Solutions:
Check that report data includes port-specific issues
Verify advanced stats and reports arrive together
Review validation report structure
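For the stale-timestamp case in particular, a quick check is to compare each sample's timestamp against the agent update interval. The sketch below assumes millisecond timestamps, as in the Prometheus output (the JSON output uses seconds), and the default 10-minute agent interval. The function name and the `grace_factor` parameter are hypothetical.

```python
def find_stale_ports(samples, now_ms, update_interval_s=600, grace_factor=2.0):
    """Return the port keys whose last sample is older than the allowed age.

    samples: iterable of (port_key, timestamp_ms) pairs
    now_ms:  current time in milliseconds since the Unix epoch
    """
    # Allow a couple of missed update cycles before calling a port stale
    limit_ms = update_interval_s * grace_factor * 1000
    return sorted(key for key, ts in samples if now_ms - ts > limit_ms)


now = 1759345924622
samples = [
    ('ufm-host38:enp3s0f0np0', now - 60_000),     # one minute old
    ('ufm-host38:enp3s0f1np1', now - 3_600_000),  # one hour old
]
print(find_stale_ports(samples, now))
```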
Performance Tuning
Environment Variables
TBD: the variables below are planned but not supported yet.
# Adjust caching thresholds
PROMETHEUS_MAX_CACHED_PORTS=10000
# Disable detailed metrics for very large deployments
PROMETHEUS_ENABLE_DETAILED_METRICS=false
# Adjust cache TTL
PROMETHEUS_CACHE_TTL=60
Memory Optimization
For deployments > 50K ports:
Collection-level caching automatically disabled
Port-level caching automatically disabled
Real-time generation used (acceptable 200-500ms response time)
Access Control
HTTPS Required: All access must be over HTTPS
No Authentication: Designed for automated monitoring tools
Network Restrictions: Consider IP-based access control
Sensitive Data
Network Topology: Metrics expose network structure
Cable Information: Serial numbers and part numbers included
Performance Data: Could reveal network capacity information
Recommended Security
<Location /cablevalidation/metrics>
    Use BringupProxy
    SSLRequireSSL
    # Restrict to monitoring networks (a client must match any one range)
    <RequireAny>
        # Internal networks
        Require ip 10.0.0.0/8
        # Container networks
        Require ip 172.16.0.0/12
        # Private networks
        Require ip 192.168.0.0/16
    </RequireAny>
</Location>
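Before rolling out an IP restriction like the one above, it can help to confirm that the Prometheus scraper's address actually falls inside one of the listed ranges. The sketch below uses Python's standard `ipaddress` module. The network list mirrors the example configuration, and the function name `is_allowed` is hypothetical.

```python
import ipaddress

# Ranges copied from the example configuration above
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ('10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16')
]


def is_allowed(client_ip):
    """Return True if client_ip belongs to at least one allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


for ip in ('10.1.2.3', '172.31.255.1', '172.32.0.1', '8.8.8.8'):
    print(ip, is_allowed(ip))
```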
Python Client Example
import requests

# Get metrics in different formats
# (port 443 is assumed as the default because the endpoint is HTTPS-only)
def get_cvt_metrics(server: str, port: int = 443, response_format: str = 'prometheus'):
    endpoints = {
        'prometheus': '/cablevalidation/metrics',
        'json': '/cablevalidation/metrics/json',
        'csv': '/cablevalidation/metrics/csv'
    }
    url = f"https://{server}:{port}{endpoints[response_format]}"
    # verify=False skips TLS certificate checks; pass a CA bundle path in production
    response = requests.get(url, verify=False, timeout=10)
    response.raise_for_status()
    if response_format == 'json':
        return response.json()
    return response.text

# Usage
metrics = get_cvt_metrics('cvt-server.example.com', response_format='json')
for port_key, port_data in metrics.items():
    # NA values are excluded from the output, so effective_ber may be absent
    ber = port_data.get('port_stats', {}).get('effective_ber')
    if ber is not None and ber > 1e-12:
        print(f"High BER on {port_key}: {ber}")
Validation Monitoring Examples
# Find all ports with validation issues
validation_status > 0
# Count affected ports per node by syndrome type
count by (node) (validation_status{WrongNeighbor!=""})
count by (node) (validation_status{MediaUnplugged!=""})
count by (node) (validation_status{LinkDown_NoSignal!=""})
# Find specific issue types
validation_status{MediaUnplugged!=""} > 0 # Unplugged cables
validation_status{AdminDown!=""} > 0 # Administratively disabled ports
validation_status{ModulePnMismatch!=""} > 0 # Hardware compatibility issues
# Ports with multiple issue types
validation_status{WrongNeighbor!="",AnomalousPort!=""}
# Correlation with performance metrics
(validation_status > 0) and (effective_ber > 1e-12)
# Port status correlation
validation_status{MediaUnplugged!=""} and on(node, port) port_info{module_oper_status="unplugged"}
Data Flow
Network Agents → CVT Collector → Advanced Stats + Report Data → Prometheus Collector → Metrics Endpoint
      ↓                ↓                      ↓                          ↓                    ↓
  (10 min)        (Real-time)          (Synchronized)           (Multi-level Cache)     (GET request)
Enhanced Data Processing
Agent Data Validation: Switch IP validated against topology before processing
Synchronized Processing: Advanced stats and validation reports processed together
Optimized Issue Processing: Report data pre-processed to group issues by port (O(n+m) complexity)
Independent Validation Cache: PortValidationStatus class with hash-based change detection
Robust Syndrome Handling: Automatic fallback for unknown syndromes with developer warnings
Smart Data Quality: NA values properly excluded, counters preserve semantics
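The O(n+m) figure for issue processing follows from grouping the m report issues by port once, then looking each of the n ports up in the grouped map, instead of scanning the full report for every port. The sketch below illustrates that idea only. The report record layout (`node`, `port`, `syndrome`, `description`) and both function names are hypothetical, since the actual report structure is not shown in this document.

```python
from collections import defaultdict


def group_issues_by_port(report_issues):
    """One pass over the report: O(m) for m issues."""
    grouped = defaultdict(dict)
    for issue in report_issues:
        key = (issue['node'], issue['port'])
        grouped[key][issue['syndrome']] = issue['description']
    return grouped


def attach_issues(ports, report_issues):
    """Map every port to its issues in O(n + m) total."""
    grouped = group_issues_by_port(report_issues)
    # One dictionary lookup per port: O(n) for n ports
    return {port: grouped.get(port, {}) for port in ports}


ports = [('sw1', '1/1'), ('sw1', '1/2')]
report = [{
    'node': 'sw1', 'port': '1/1',
    'syndrome': 'AdminDown', 'description': 'Port administratively disabled',
}]
print(attach_issues(ports, report))
```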
Caching Strategy
Port-level Cache: Individual port metrics cached until data changes
Collection-level Cache: Aggregated output cached for fast retrieval
Label Cache: Stable topology/cable labels cached separately
Validation Cache: Independent cache for validation status with hash-based change detection
Metadata Cache: Static TYPE/HELP comments cached permanently
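Hash-based change detection, as used for the validation cache, can be pictured as follows: store a digest of each port's issue content next to the rendered output, and re-render only when the digest changes. The class below is a hypothetical sketch of that pattern, not the PortValidationStatus class itself. The SHA-256 digest over canonical JSON is an assumed choice.

```python
import hashlib
import json


class ValidationCacheSketch:
    """Illustrative cache keyed by port, invalidated by content hash."""

    def __init__(self):
        self._entries = {}  # port_key -> (digest, rendered_text)

    @staticmethod
    def _digest(issues):
        # Canonical JSON so that key order does not affect the hash
        canonical = json.dumps(issues, sort_keys=True, separators=(',', ':'))
        return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

    def get_or_render(self, port_key, issues, render):
        """Return cached text unless the issue content has changed."""
        digest = self._digest(issues)
        cached = self._entries.get(port_key)
        if cached is not None and cached[0] == digest:
            return cached[1]
        text = render(port_key, issues)
        self._entries[port_key] = (digest, text)
        return text
```

Repeated calls with identical issue content return the stored text without invoking `render`, which matches the described behavior of invalidating only when issue content changes.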
Performance Optimizations
Push-based Updates: Metrics updated when advanced stats arrive
Smart Change Detection: Only cable/module changes invalidate caches
Memory Adaptive: Caching disabled automatically for large deployments
String Manipulation: Efficient JSON aggregation using string operations
Validation Processing: O(n+m) complexity with report preprocessing
Hash-based Cache: Validation cache only invalidated when issue content changes
Agent Data Filtering: Invalid switch IPs filtered early to prevent unnecessary processing
Prometheus Configuration
Scrape Interval: 15-30 seconds (agent data itself refreshes every 10 minutes by default, so values change only when new agent data arrives)
Timeout: 10 seconds (allows for cache generation)
Retention: Configure based on historical analysis needs
Alerting Guidelines
BER Thresholds: Alert when effective_ber > 1e-12
Temperature Limits: Alert when module_temperature approaches temperature_high_th
Validation Issues: Alert when validation_status > 0
Critical Issues: Alert on specific syndromes (MediaUnplugged, UnreachableDevice, ModulePnMismatch)
Infrastructure Issues: Alert on LinkDown_NoSignal, ErrDisable conditions
Counter Anomalies: Alert on rapid increases in transceiver_reinsert_cnt
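The first three guidelines can also be applied client-side to a port record from the JSON endpoint. The sketch below is illustrative only: the function name, the 5 °C margin used for "approaches", and the presence of `temperature_high_th` inside `port_stats` are assumptions rather than documented behavior.

```python
def evaluate_port(stats, issues_count, temp_margin_c=5.0, ber_limit=1e-12):
    """Return the names of the guidelines that a port currently violates."""
    alerts = []
    ber = stats.get('effective_ber')
    if ber is not None and ber > ber_limit:
        alerts.append('high_ber')
    temp = stats.get('module_temperature')
    high = stats.get('temperature_high_th')
    # "Approaches" is interpreted as being within temp_margin_c of the threshold
    if temp is not None and high is not None and temp >= high - temp_margin_c:
        alerts.append('temperature_near_threshold')
    if issues_count > 0:
        alerts.append('validation_issues')
    return alerts


print(evaluate_port(
    {'effective_ber': 1e-9, 'module_temperature': 68.0, 'temperature_high_th': 70.0},
    issues_count=0,
))
```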
Sample Alerting Rules
# Validation issues alert
- alert: PortValidationIssues
  expr: validation_status > 0
  labels:
    severity: warning
  annotations:
    summary: "Port {{ $labels.node }}:{{ $labels.port }} has validation issues"
    description: "{{ $value }} validation issues detected"
# Critical validation issues
- alert: CriticalPortIssues
  expr: validation_status{MediaUnplugged!=""} > 0 or validation_status{UnreachableDevice!=""} > 0
  labels:
    severity: critical
  annotations:
    summary: "Critical issues on {{ $labels.node }}:{{ $labels.port }}"
    description: "{{ if $labels.MediaUnplugged }}Cable unplugged: {{ $labels.MediaUnplugged }}{{ end }}{{ if $labels.UnreachableDevice }}Device unreachable: {{ $labels.UnreachableDevice }}{{ end }}"
# Infrastructure issues
- alert: InfrastructureIssues
  expr: validation_status{LinkDown_NoSignal!=""} > 0 or validation_status{ModulePnMismatch!=""} > 0
  labels:
    severity: warning
  annotations:
    summary: "Infrastructure issue on {{ $labels.node }}:{{ $labels.port }}"
# Administrative issues
- alert: AdminIssues
  expr: validation_status{AdminDown!=""} > 0 or validation_status{NicNameMismatch!=""} > 0
  labels:
    severity: info
  annotations:
    summary: "Administrative issue on {{ $labels.node }}:{{ $labels.port }}"
Release Notes
Version: 1.1.0
Release Date: October 2025
Compatibility: CVT 1.7.0 and later
Dependencies: Requires advanced stats collection enabled
New in Version 1.1.0
✅ Validation Metrics Integration: Port validation status with actionable issue descriptions
✅ Synchronized Data Processing: Advanced stats and validation reports processed together
✅ Performance Optimizations: O(n+m) validation processing, hash-based change detection
✅ Enhanced Security: Agent data validation prevents processing from unknown switches
✅ Improved Data Quality: None-based initialization for gauges, proper counter semantics
✅ Better Caching: Independent validation cache with content-based invalidation
✅ Comprehensive Syndrome Coverage: 15+ validation issue types with fallback handling
✅ Real-world Validation: Successfully tested with production data and unplugged ports
API Stability
Metric Names: Stable (no breaking changes planned)
Label Names: Stable (additions possible, no removals)
Output Format: Prometheus standard compliance maintained
Endpoint URLs: Stable API contract
Enhanced Counter Semantics
Gauges default to None: Missing sensor data excluded instead of showing false zeros
Counters preserve values: No unexpected resets when data temporarily unavailable
Proper NA handling: Invalid data marked as NA and excluded from metrics
Temperature accuracy: Fixed zero temperature issue by using actual amber timestamps
Validation Integration Benefits
Synchronized processing: Performance metrics and validation issues always in sync
Rich context: Issue descriptions provide actionable corrective actions
Efficient processing: O(n+m) complexity prevents performance degradation
Smart caching: Validation cache independent of performance metrics cache
Troubleshooting
Check CVT Service: Ensure Cable Validation service is running
Verify Topology: Confirm network topology is loaded
Agent Status: Check that agents are deployed and collecting data
Network Connectivity: Verify switch/host accessibility
Performance Monitoring
# Check metrics endpoint response time
time curl -k https://cvt-server/cablevalidation/metrics > /dev/null
Contact Information
Development Team: Cable Validation Engineering
Documentation: [Internal Wiki Link]
Support: [Support Channel/Email]
This endpoint provides comprehensive cable validation metrics for modern monitoring and observability workflows, enabling proactive network health management and automated alerting.