On This Page
- Overview
- Features
- API Endpoints
- Sample Output
- Metrics Reference
- Labels
- Performance Characteristics
- Troubleshooting
- Security Considerations
- Integration Examples
- Architecture
- Monitoring Best Practices
- Version Information
- Data Quality Improvements
- Support
Cable Validation Tool - Prometheus Metrics Endpoint
The Cable Validation Tool (CVT) now provides a Prometheus-compatible metrics endpoint that exposes real-time cable health and performance data for monitoring and alerting. This endpoint enables integration with modern monitoring stacks like Prometheus, Grafana, and other observability tools.
🎯 Key Capabilities
Real-time Metrics: Live cable validation data from network switches and hosts
Multi-format Support: Prometheus, JSON, and CSV output formats
Rich Labeling: Complete network topology context with peer relationships
High Performance: Multi-level caching optimized for frequent scraping
Production Ready: Handles 100K+ ports with memory-adaptive optimizations
📊 Metrics Categories
Power Metrics: RX/TX optical power per lane (up to 8 lanes per port)
BER Metrics: Effective and Raw Bit Error Rates
Temperature Metrics: Module temperature and thresholds
Counter Metrics: Transceiver reinsert/swap events
Validation Metrics: Port validation status with issue descriptions
Threshold Metrics: Power and temperature alarm thresholds
Timestamp Metrics: Data collection and report timestamps
Base URL
https://<cvt-server>/cablevalidation/metrics
Available Endpoints
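| Endpoint | Format | Content-Type | Description |
| --- | --- | --- | --- |
| /cablevalidation/metrics | Prometheus | | Standard Prometheus exposition format |
| /cablevalidation/metrics/json | JSON | | Structured JSON for programmatic access |
| /cablevalidation/metrics/csv | CSV | | Comma-separated values for spreadsheet import |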
Authentication
No authentication required for metrics endpoints
HTTPS enforced for security
Bypasses session handling for automated scraping
Prometheus Format
# HELP effective_ber Effective Bit Error Rate
# TYPE effective_ber gauge
# HELP validation_status Port validation status with issue descriptions in labels (value = issue count)
# TYPE validation_status gauge
# HELP port_info Port information with status and validation details in labels
# TYPE port_info gauge
# Healthy port with performance metrics
effective_ber{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 1.5e-254 1759345924622
module_temperature{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 65.2 1759345924622
validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0 1759345924622
# Unplugged port with validation issue (power/temp metrics excluded due to NA values)
validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752
effective_ber{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 0.0 1759345923752
time_since_last_clear{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 320035.6 1759345923752
# Port info with detailed status
port_info{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",phy_manager_state="Disable",module_oper_status="unplugged",cable_sn="N/A",cable_pn="N/A",protocol="Ethernet",module_fw_version="N/A"} 1 1759345923752
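Consumers that cannot use a Prometheus client library may still need to split sample lines like the ones above into name, labels, value, and timestamp. The following is a minimal sketch using only the Python standard library. The names `Sample` and `parse_sample` are hypothetical (they are not part of CVT), and the sketch assumes label values never contain a closing brace and leaves backslash escapes undecoded.

```python
import re
from typing import Dict, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]


# metric_name{labels} value [timestamp]; the label block and timestamp are optional
_LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# key="value" pairs; the value may contain spaces, semicolons, and escaped quotes
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Optional[Sample]:
    """Parse one exposition line; return None for blank lines and # comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    ts = match.group("ts")
    return Sample(
        name=match.group("name"),
        labels=labels,
        value=float(match.group("value")),
        timestamp_ms=int(ts) if ts is not None else None,
    )
```

Applied to the unplugged-port line above, `parse_sample(...).labels["MediaUnplugged"]` would hold the issue description, and `value` would be the issue count.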
JSON Format
{
  "ufm-host38:enp3s0f0np0": {
    "timestamp": 1757524769.645,
    "port_info": {
      "node_name": "ufm-host38",
      "port_name": "enp3s0f0np0",
      "peer_node_name": "r-ufm-sw-eth01",
      "peer_port_name": "swp2",
      "node_type": "Host",
      "su_number": "SU1",
      "data_hall": "DH1"
    },
    "port_labels": {
      "cable_sn": "ABC123",
      "cable_pn": "DEF456",
      "protocol": "400G"
    },
    "port_stats": {
      "effective_ber": 1.5e-254,
      "module_temperature": 65.2,
      "rx_power_lane_0": -2.5
    },
    "validation_data": {
      "issues_count": 1,
      "last_report_time": 1757524769.645,
      "issues": {
        "WrongNeighbor": "Check cable connection to switch2"
      }
    }
  }
}
Power Metrics
| Metric | Type | Description | Unit |
| --- | --- | --- | --- |
| rx_power_lane_N | gauge | RX optical power for lane N (0-7, not all lanes may be present) | dBm |
| | gauge | TX optical power for lane N (0-7, not all lanes may be present) | dBm |
| | gauge | RX power high threshold | dBm |
| | gauge | RX power low threshold | dBm |
BER Metrics
| Metric | Type | Description |
| --- | --- | --- |
| effective_ber | gauge | Effective Bit Error Rate |
| | gauge | Raw Bit Error Rate |
Temperature Metrics
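| Metric | Type | Description | Unit |
| --- | --- | --- | --- |
| module_temperature | gauge | Current module temperature | Celsius |
| temperature_high_th | gauge | Temperature high threshold | Celsius |
| | gauge | Temperature low threshold | Celsius |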
Status Metrics
| Metric | Type | Description | Values |
| --- | --- | --- | --- |
| | gauge | Port plugged status | 1=Up, 0=Down |
| | gauge | Port operational status | 1=Up, 0=Down |
Counter Metrics
| Metric | Type | Description |
| --- | --- | --- |
| transceiver_reinsert_cnt | counter | Number of transceiver reinsert events |
| | counter | Number of transceiver swap events |
| time_since_last_clear | gauge | Time since last counter clear (seconds) |
Validation Metrics
| Metric | Type | Description | Special Features |
| --- | --- | --- | --- |
| validation_status | gauge | Port validation status with issue descriptions | Value = issue count, descriptions in labels |
| | gauge | Timestamp of last validation report | Unix timestamp |
Validation Status Labels
The validation_status metric includes dynamic labels for each type of validation issue:
WrongNeighbor: "Check cable connection to correct switch"
MediaUnplugged: "Insert; Reseat or Replace Cable/Transceiver"
AnomalousPort: "Temperature exceeds threshold"
FlappingLink: "Reseat transceiver; Check Fiber"
UnknownNeighbor: "Verify neighbor device connectivity"
WrongPort: "Check port mapping in topology"
ExtraCable: "Remove unexpected cable connection"
UnreachableDevice: "Check device connectivity and power"
LinkDown_NoSignal: "Check physical connection"
ErrDisable_Flap: "Port disabled due to flapping"
AdminDown: "Port administratively disabled"
ErrDisable_Rx: "RX error disable condition"
NegotiationFail: "Check autonegotiation settings"
NicNameMismatch: "Verify NIC provisioning"
ModulePnMismatch: "Replace with compatible module"
Note: Commas in descriptions are automatically converted to semicolons to maintain Prometheus label format compatibility.
Examples:
# Port with validation issues (unplugged cable)
validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1
# Port without issues (healthy connection)
validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0
# Port with multiple validation issues
validation_status{node="switch1",port="1/1",WrongNeighbor="Check cable connection to switch2",AnomalousPort="Temperature exceeds threshold"} 2
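The label construction shown in these examples can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CVT implementation: the function names are hypothetical, and the input description with a comma ("Insert, Reseat or ...") is assumed only to demonstrate the comma-to-semicolon conversion noted above. The remaining escapes (backslash, double quote, newline) follow the standard Prometheus text exposition rules for label values.

```python
from typing import Dict


def escape_label_value(text: str) -> str:
    """Make a description safe to use as a Prometheus label value."""
    # Per the note above, commas are converted to semicolons.
    text = text.replace(",", ";")
    # Standard exposition-format escapes: backslash first, then quote and newline.
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_validation_status(topology: Dict[str, str], issues: Dict[str, str]) -> str:
    """Render one validation_status sample: topology labels, then one label per issue.

    The sample value is the number of issues, so a healthy port yields 0.
    """
    labels = {**topology, **issues}
    body = ",".join(f'{key}="{escape_label_value(value)}"' for key, value in labels.items())
    return f"validation_status{{{body}}} {len(issues)}"


print(render_validation_status(
    {"node": "switch1", "port": "1/1"},
    {"MediaUnplugged": "Insert, Reseat or Replace Cable/Transceiver"},
))
```

Because each syndrome becomes its own label name, the label set of a series changes whenever the set of issues on a port changes, which is worth keeping in mind when writing queries that join on labels.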
Topology Labels (All Metrics)
node: Switch or host name
port: Port identifier
peer_node: Connected peer node name
peer_port: Connected peer port identifier
node_type: Node type (Switch, Host, etc.)
su_number: Scalable Unit identifier
data_hall: Data hall location
Cable Labels (Status Metrics Only)
cable_sn: Cable serial number
cable_pn: Cable part number
protocol: Cable protocol (400G, InfiniBand, etc.)
port_status: Port status (Up, Down, etc.)
plugged: Module plugged status
Update Frequency
Agent Data: Updated every 10 minutes (configurable)
Metrics Cache: Invalidated on data changes
Prometheus Scraping: Recommended 15-30 second intervals
Performance Metrics
| Deployment Size | Response Time | Memory Usage | Caching Strategy |
| --- | --- | --- | --- |
| < 10K ports | < 50ms | ~20MB | Full caching enabled |
| 10K-50K ports | < 200ms | ~100MB | Collection cache only |
| 50K+ ports | < 500ms | ~200MB | No caching, real-time generation |
Optimization Features
Multi-level Caching: Port, collection, and label caching
Memory Adaptive: Automatically adjusts for large deployments
Smart Change Detection: Only updates when cable/module data changes
Zero Value Handling: Zero-valued samples are emitted rather than dropped, for complete visibility
Common Issues
1. No Metrics Data
Symptoms: Empty response or no metrics
Causes:
CVT service not running
No topology loaded
No advanced stats collection
Solutions:
# Check service status
# Check topology loading
# Check agent connectivity
2. Missing Port Data
Symptoms: Some ports not appearing in metrics
Causes:
Port not in loaded topology
Agent not deployed on switch
Advanced stats not collected
Solutions:
Verify topology includes all expected ports
Deploy agents on missing switches
Check agent connectivity and data collection
3. Stale Timestamps
Symptoms: Old timestamps in metrics
Causes:
Agent not sending updates
Network connectivity issues
Solutions:
Check agent logs for errors
Verify network connectivity to switches
Restart agents if necessary
4. Missing Validation Data
Symptoms: validation_status metrics missing or always 0
Causes:
Validation reports not being generated
Agent data filtering (switch not in topology)
Report processing errors
Solutions:
Verify validation is started on agents
Check switch IP exists in topology
Review agent and collector logs for errors
5. Inconsistent Issue Counts
Symptoms: validation_status count doesn't match expected issues
Causes:
Issues filtered by port
Report data synchronization issues
Processing errors
Solutions:
Check that report data includes port-specific issues
Verify advanced stats and reports arrive together
Review validation report structure
Performance Tuning
Environment Variables
TBD: the environment variables below are not supported yet.
# Adjust caching thresholds
PROMETHEUS_MAX_CACHED_PORTS=10000
# Disable detailed metrics for very large deployments
PROMETHEUS_ENABLE_DETAILED_METRICS=false
# Adjust cache TTL
PROMETHEUS_CACHE_TTL=60
Memory Optimization
For deployments > 50K ports:
Collection-level caching automatically disabled
Port-level caching automatically disabled
Real-time generation used (acceptable 200-500ms response time)
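The memory-adaptive behaviour described in this section amounts to choosing a caching mode from the port count. A minimal sketch follows, assuming the 10K and 50K tier boundaries from the performance table. The names `CacheMode` and `select_cache_mode` are hypothetical, and the treatment of a deployment with exactly 50,000 ports is an assumption, since the documentation does not specify it.

```python
from enum import Enum


class CacheMode(Enum):
    FULL = "full caching"                      # port, collection, and label caches
    COLLECTION_ONLY = "collection cache only"  # port-level cache disabled
    NONE = "no caching"                        # real-time generation on every request


def select_cache_mode(port_count: int) -> CacheMode:
    """Pick a caching mode from the deployment size (tiers from the table above)."""
    if port_count < 10_000:
        return CacheMode.FULL
    if port_count < 50_000:
        return CacheMode.COLLECTION_ONLY
    # Assumption: exactly 50,000 ports falls into the largest tier.
    return CacheMode.NONE


for ports in (5_000, 20_000, 100_000):
    print(ports, select_cache_mode(ports).value)
```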
Access Control
HTTPS Required: All access must be over HTTPS
No Authentication: Designed for automated monitoring tools
Network Restrictions: Consider IP-based access control
Sensitive Data
Network Topology: Metrics expose network structure
Cable Information: Serial numbers and part numbers included
Performance Data: Could reveal network capacity information
Recommended Security
<Location /cablevalidation/metrics>
    Use BringupProxy
    SSLRequireSSL
    # Restrict to monitoring networks (a client in any one of these ranges is allowed)
    <RequireAny>
        # Internal networks
        Require ip 10.0.0.0/8
        # Container networks
        Require ip 172.16.0.0/12
        # Private networks
        Require ip 192.168.0.0/16
    </RequireAny>
</Location>
Python Client Example
import requests

# Endpoint path for each supported output format
ENDPOINTS = {
    'prometheus': '/cablevalidation/metrics',
    'json': '/cablevalidation/metrics/json',
    'csv': '/cablevalidation/metrics/csv',
}

def get_cvt_metrics(server: str, port: int = 443, response_format: str = 'prometheus', verify=True):
    """Fetch CVT metrics; returns a dict for 'json' and raw text otherwise."""
    url = f"https://{server}:{port}{ENDPOINTS[response_format]}"
    # Pass verify=False only when the server uses a self-signed certificate
    response = requests.get(url, verify=verify, timeout=10)
    response.raise_for_status()
    if response_format == 'json':
        return response.json()
    return response.text

# Usage: report ports whose effective BER exceeds 1e-12
metrics = get_cvt_metrics('cvt-server.example.com', response_format='json')
for port_key, port_data in metrics.items():
    ber = port_data.get('port_stats', {}).get('effective_ber')
    if ber is not None and ber > 1e-12:
        print(f"High BER on {port_key}: {ber}")
Validation Monitoring Examples
# Find all ports with validation issues
validation_status > 0
# Count affected ports per node for each syndrome type
count by (node) (validation_status{WrongNeighbor!=""})
count by (node) (validation_status{MediaUnplugged!=""})
count by (node) (validation_status{LinkDown_NoSignal!=""})
# Find specific issue types
validation_status{MediaUnplugged!=""} > 0 # Unplugged cables
validation_status{AdminDown!=""} > 0 # Administratively disabled ports
validation_status{ModulePnMismatch!=""} > 0 # Hardware compatibility issues
# Ports with multiple issue types
validation_status{WrongNeighbor!="",AnomalousPort!=""}
# Correlation with performance metrics (match on node and port only, because validation_status carries extra issue labels)
(validation_status > 0) and on(node, port) (effective_ber > 1e-12)
# Port status correlation (match on node and port only, because the two metrics carry different label sets)
validation_status{MediaUnplugged!=""} and on(node, port) port_info{module_oper_status="unplugged"}
Data Flow
Network Agents (10 min) → CVT Collector (Real-time) → Advanced Stats + Report Data (Synchronized) → Prometheus Collector (Multi-level Cache) → Metrics Endpoint (GET request)
Enhanced Data Processing
Agent Data Validation: Switch IP validated against topology before processing
Synchronized Processing: Advanced stats and validation reports processed together
Optimized Issue Processing: Report data pre-processed to group issues by port (O(n+m) complexity)
Independent Validation Cache: PortValidationStatus class with hash-based change detection
Robust Syndrome Handling: Automatic fallback for unknown syndromes with developer warnings
Smart Data Quality: NA values properly excluded, counters preserve semantics
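The O(n+m) issue processing mentioned above can be illustrated with a single pass that groups report issues by port before they are attached to the port list. This is a sketch under assumptions: the report is modeled as a flat list of dictionaries with `node`, `port`, `syndrome`, and `description` keys, which is a hypothetical structure chosen for illustration, and the function names are not taken from CVT.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PortKey = Tuple[str, str]  # (node, port)


def group_issues_by_port(report: Iterable[Dict[str, str]]) -> Dict[PortKey, Dict[str, str]]:
    """One pass over the m report issues: O(m)."""
    grouped: Dict[PortKey, Dict[str, str]] = defaultdict(dict)
    for issue in report:
        grouped[(issue["node"], issue["port"])][issue["syndrome"]] = issue["description"]
    return grouped


def attach_issues(ports: List[PortKey], report: Iterable[Dict[str, str]]) -> Dict[PortKey, Dict[str, str]]:
    """One pass over the n ports with constant-time lookups: O(n), so O(n+m) overall."""
    grouped = group_issues_by_port(report)
    return {key: dict(grouped.get(key, {})) for key in ports}


ports = [("ufm-host38", "enp3s0f0np0"), ("ufm-host38", "enp3s0f1np1")]
report = [{
    "node": "ufm-host38",
    "port": "enp3s0f1np1",
    "syndrome": "MediaUnplugged",
    "description": "Insert; Reseat or Replace Cable/Transceiver",
}]
for key, issues in attach_issues(ports, report).items():
    print(key, len(issues), issues)
```

Without the grouping step, each port would need its own scan of the full report, which costs O(n·m).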
Caching Strategy
Port-level Cache: Individual port metrics cached until data changes
Collection-level Cache: Aggregated output cached for fast retrieval
Label Cache: Stable topology/cable labels cached separately
Validation Cache: Independent cache for validation status with hash-based change detection
Metadata Cache: Static TYPE/HELP comments cached permanently
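Hash-based change detection, as used by the validation cache, can be sketched as follows. The class below is a hypothetical stand-in and not the actual PortValidationStatus class: it assumes a port's issues are available as a dictionary mapping syndrome to description, and that a SHA-256 digest of the sorted JSON form is an acceptable fingerprint.

```python
import hashlib
import json
from typing import Callable, Dict, Optional


class ValidationCacheEntry:
    """Caches rendered validation output for one port until the issue content changes."""

    def __init__(self) -> None:
        self._digest: Optional[str] = None
        self.rendered: Optional[str] = None

    def update(self, issues: Dict[str, str], render: Callable[[Dict[str, str]], str]) -> bool:
        """Re-render only when the issues differ from the cached ones.

        Returns True if the cache entry was refreshed, False if it was reused.
        """
        # sort_keys makes the digest independent of dictionary insertion order
        payload = json.dumps(issues, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest == self._digest:
            return False
        self._digest = digest
        self.rendered = render(issues)
        return True
```

With this shape, a report that repeats the same issues for a port costs one hash computation and no re-rendering, which is the property the "only invalidated when issue content changes" behaviour relies on.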
Performance Optimizations
Push-based Updates: Metrics updated when advanced stats arrive
Smart Change Detection: Only cable/module changes invalidate caches
Memory Adaptive: Caching disabled automatically for large deployments
String Manipulation: Efficient JSON aggregation using string operations
Validation Processing: O(n+m) complexity with report preprocessing
Hash-based Cache: Validation cache only invalidated when issue content changes
Agent Data Filtering: Invalid switch IPs filtered early to prevent unnecessary processing
Prometheus Configuration
Scrape Interval: 15-30 seconds (agent data itself refreshes every 10 minutes by default, so consecutive scrapes often return unchanged values)
Timeout: 10 seconds (allows for cache generation)
Retention: Configure based on historical analysis needs
Alerting Guidelines
BER Thresholds: Alert when effective_ber > 1e-12
Temperature Limits: Alert when module_temperature approaches temperature_high_th
Validation Issues: Alert when validation_status > 0
Critical Issues: Alert on specific syndromes (MediaUnplugged, UnreachableDevice, ModulePnMismatch)
Infrastructure Issues: Alert on LinkDown_NoSignal, ErrDisable conditions
Counter Anomalies: Alert on rapid increases in transceiver_reinsert_cnt
Sample Alerting Rules
# Validation issues alert
- alert: PortValidationIssues
  expr: validation_status > 0
  labels:
    severity: warning
  annotations:
    summary: "Port {{ $labels.node }}:{{ $labels.port }} has validation issues"
    description: "{{ $value }} validation issues detected"
# Critical validation issues
- alert: CriticalPortIssues
  expr: validation_status{MediaUnplugged!=""} > 0 or validation_status{UnreachableDevice!=""} > 0
  labels:
    severity: critical
  annotations:
    summary: "Critical issues on {{ $labels.node }}:{{ $labels.port }}"
    description: "{{ if $labels.MediaUnplugged }}Cable unplugged: {{ $labels.MediaUnplugged }}{{ end }}{{ if $labels.UnreachableDevice }}Device unreachable: {{ $labels.UnreachableDevice }}{{ end }}"
# Infrastructure issues
- alert: InfrastructureIssues
  expr: validation_status{LinkDown_NoSignal!=""} > 0 or validation_status{ModulePnMismatch!=""} > 0
  labels:
    severity: warning
  annotations:
    summary: "Infrastructure issue on {{ $labels.node }}:{{ $labels.port }}"
# Administrative issues
- alert: AdminIssues
  expr: validation_status{AdminDown!=""} > 0 or validation_status{NicNameMismatch!=""} > 0
  labels:
    severity: info
  annotations:
    summary: "Administrative issue on {{ $labels.node }}:{{ $labels.port }}"
Release Notes
Version: 1.1.0
Release Date: October 2025
Compatibility: CVT 1.7.0 and later
Dependencies: Requires advanced stats collection enabled
New in Version 1.1.0
✅ Validation Metrics Integration: Port validation status with actionable issue descriptions
✅ Synchronized Data Processing: Advanced stats and validation reports processed together
✅ Performance Optimizations: O(n+m) validation processing, hash-based change detection
✅ Enhanced Security: Agent data validation prevents processing from unknown switches
✅ Improved Data Quality: None-based initialization for gauges, proper counter semantics
✅ Better Caching: Independent validation cache with content-based invalidation
✅ Comprehensive Syndrome Coverage: 15+ validation issue types with fallback handling
✅ Real-world Validation: Successfully tested with production data and unplugged ports
API Stability
Metric Names: Stable (no breaking changes planned)
Label Names: Stable (additions possible, no removals)
Output Format: Prometheus standard compliance maintained
Endpoint URLs: Stable API contract
Enhanced Counter Semantics
Gauges default to None: Missing sensor data excluded instead of showing false zeros
Counters preserve values: No unexpected resets when data temporarily unavailable
Proper NA handling: Invalid data marked as NA and excluded from metrics
Temperature accuracy: Fixed zero temperature issue by using actual amber timestamps
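The gauge semantics described above (missing readings excluded, genuine zeros kept) can be sketched as a small rendering helper. This is an illustration only: `render_gauges` is a hypothetical name, and it assumes missing sensor data has already been normalized to `None` before rendering.

```python
from typing import Dict, List, Optional


def render_gauges(labels: str, stats: Dict[str, Optional[float]]) -> List[str]:
    """Render gauge samples, skipping metrics whose value is None.

    A real 0.0 reading is emitted; only missing (None) data is excluded,
    so an absent sensor never shows up as a false zero.
    """
    lines = []
    for name, value in stats.items():
        if value is None:
            continue
        lines.append(f"{name}{{{labels}}} {value}")
    return lines


# An unplugged port: BER reads 0.0, temperature is unavailable
print(render_gauges(
    'node="ufm-host38",port="enp3s0f1np1"',
    {"effective_ber": 0.0, "module_temperature": None},
))
```

This mirrors the unplugged-port sample earlier in the document, where effective_ber is reported as 0.0 while the power and temperature metrics are absent.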
Validation Integration Benefits
Synchronized processing: Performance metrics and validation issues always in sync
Rich context: Issue descriptions provide actionable corrective actions
Efficient processing: O(n+m) complexity prevents performance degradation
Smart caching: Validation cache independent of performance metrics cache
Troubleshooting
Check CVT Service: Ensure Cable Validation service is running
Verify Topology: Confirm network topology is loaded
Agent Status: Check that agents are deployed and collecting data
Network Connectivity: Verify switch/host accessibility
Performance Monitoring
# Check metrics endpoint response time
time curl -k https://cvt-server/cablevalidation/metrics > /dev/null
Contact Information
Development Team: Cable Validation Engineering
Documentation: [Internal Wiki Link]
Support: [Support Channel/Email]
This endpoint provides comprehensive cable validation metrics for modern monitoring and observability workflows, enabling proactive network health management and automated alerting.