Cable Validation Tool - Prometheus Metrics Endpoint

Overview

The Cable Validation Tool (CVT) now provides a Prometheus-compatible metrics endpoint that exposes real-time cable health and performance data for monitoring and alerting. The endpoint integrates with monitoring stacks such as Prometheus and Grafana, as well as other observability tools.

Features

🎯 Key Capabilities

  • Real-time Metrics: Live cable validation data from network switches and hosts

  • Multi-format Support: Prometheus, JSON, and CSV output formats

  • Rich Labeling: Complete network topology context with peer relationships

  • High Performance: Multi-level caching optimized for frequent scraping

  • Production Ready: Handles 100K+ ports with memory-adaptive optimizations

📊 Metrics Categories

  • Power Metrics: RX/TX optical power per lane (up to 8 lanes per port)

  • BER Metrics: Effective and Raw Bit Error Rates

  • Temperature Metrics: Module temperature and thresholds

  • Counter Metrics: Transceiver reinsert/swap events

  • Validation Metrics: Port validation status with issue descriptions

  • Threshold Metrics: Power and temperature alarm thresholds

  • Timestamp Metrics: Data collection and report timestamps

API Endpoints

Base URL 

https://<cvt-server>/cablevalidation/metrics

Available Endpoints

Endpoint                      | Format     | Content-Type     | Description
/cablevalidation/metrics      | Prometheus | text/plain       | Standard Prometheus exposition format
/cablevalidation/metrics/json | JSON       | application/json | Structured JSON for programmatic access
/cablevalidation/metrics/csv  | CSV        | text/plain       | Comma-separated values for spreadsheet import


Authentication

  • No authentication required for metrics endpoints

  • HTTPS enforced for security

  • Bypasses session handling for automated scraping

Sample Output

Prometheus Format 

# HELP effective_ber Effective Bit Error Rate
# TYPE effective_ber gauge
# HELP validation_status Port validation status with issue descriptions in labels (value = issue count)
# TYPE validation_status gauge
# HELP port_info Port information with status and validation details in labels
# TYPE port_info gauge
# Healthy port with performance metrics
effective_ber{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 1.5e-254 1759345924622
module_temperature{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 65.2 1759345924622
validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0 1759345924622
# Unplugged port with validation issue (power/temp metrics excluded due to NA values)
validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752
effective_ber{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 0.0 1759345923752
time_since_last_clear{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1"} 320035.6 1759345923752
# Port info with detailed status
port_info{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",phy_manager_state="Disable",module_oper_status="unplugged",cable_sn="N/A",cable_pn="N/A",protocol="Ethernet",module_fw_version="N/A"} 1 1759345923752
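The sample lines above follow the layout name{labels} value timestamp, where the trailing timestamp is in milliseconds since the Unix epoch. As a minimal sketch (not part of CVT), the following splits one such line into its parts. The parse_sample helper is hypothetical, and it assumes label values contain no escaped quotes, which holds for the samples shown here.

```python
import re

# Matches: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}'
    r'\s+(?P<value>\S+)(?:\s+(?P<ts>\d+))?$'
)
# Matches one key="value" pair (no escaped quotes inside the value)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Split one labelled sample line into (name, labels, value, timestamp_ms)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    timestamp = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), timestamp


line = ('validation_status{node="ufm-host38",port="enp3s0f1np1",'
        'MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1 1759345923752')
name, labels, value, ts = parse_sample(line)
print(name, labels["port"], value, ts)
```

For production use, an existing Prometheus text-format parser is likely a better choice than a hand-written regular expression.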

JSON Format 

{
  "ufm-host38:enp3s0f0np0": {
    "timestamp": 1757524769.645,
    "port_info": {
      "node_name": "ufm-host38",
      "port_name": "enp3s0f0np0",
      "peer_node_name": "r-ufm-sw-eth01",
      "peer_port_name": "swp2",
      "node_type": "Host",
      "su_number": "SU1",
      "data_hall": "DH1"
    },
    "port_labels": {
      "cable_sn": "ABC123",
      "cable_pn": "DEF456",
      "protocol": "400G"
    },
    "port_stats": {
      "effective_ber": 1.5e-254,
      "module_temperature": 65.2,
      "rx_power_lane_0": -2.5
    },
    "validation_data": {
      "issues_count": 1,
      "last_report_time": 1757524769.645,
      "issues": {
        "WrongNeighbor": "Check cable connection to switch2"
      }
    }
  }
}
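As a rough illustration of how a client might walk this structure, the sketch below collects the issues for every port whose issues_count is above zero. The embedded document is a trimmed copy of the sample above, and ports_with_issues is a hypothetical helper name, not a CVT function.

```python
import json

# Trimmed copy of the JSON sample above
SAMPLE = '''
{
  "ufm-host38:enp3s0f0np0": {
    "port_info": {"node_name": "ufm-host38", "port_name": "enp3s0f0np0"},
    "port_stats": {"effective_ber": 1.5e-254, "module_temperature": 65.2},
    "validation_data": {
      "issues_count": 1,
      "issues": {"WrongNeighbor": "Check cable connection to switch2"}
    }
  }
}
'''


def ports_with_issues(metrics: dict) -> dict:
    """Map each port key to its issues dict, skipping ports without issues."""
    result = {}
    for port_key, port_data in metrics.items():
        validation = port_data.get("validation_data", {})
        if validation.get("issues_count", 0) > 0:
            result[port_key] = validation.get("issues", {})
    return result


found = ports_with_issues(json.loads(SAMPLE))
for port_key, issues in found.items():
    for syndrome, action in issues.items():
        print(f"{port_key}: {syndrome} -> {action}")
```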

Metrics Reference

Power Metrics

Metric           | Type  | Description                                                     | Unit
rx_power_lane_N  | gauge | RX optical power for lane N (0-7, not all lanes may be present) | dBm
tx_power_lane_N  | gauge | TX optical power for lane N (0-7, not all lanes may be present) | dBm
rx_power_high_th | gauge | RX power high threshold                                         | dBm
rx_power_low_th  | gauge | RX power low threshold                                          | dBm


BER Metrics

Metric        | Type  | Description
effective_ber | gauge | Effective Bit Error Rate
raw_ber       | gauge | Raw Bit Error Rate


Temperature Metrics

Metric              | Type  | Description                | Unit
module_temperature  | gauge | Current module temperature | Celsius
temperature_high_th | gauge | Temperature high threshold | Celsius
temperature_low_th  | gauge | Temperature low threshold  | Celsius


Status Metrics

Metric           | Type  | Description             | Values
port_status      | gauge | Port plugged status     | 1=Up, 0=Down
port_oper_status | gauge | Port operational status | 1=Up, 0=Down


Counter Metrics

Metric                   | Type    | Description
transceiver_reinsert_cnt | counter | Number of transceiver reinsert events
transceiver_swap_cnt     | counter | Number of transceiver swap events
time_since_last_clear    | gauge   | Time since last counter clear (seconds)


Validation Metrics

Metric            | Type  | Description                                    | Special Features
validation_status | gauge | Port validation status with issue descriptions | Value = issue count, descriptions in labels
last_report_time  | gauge | Timestamp of last validation report            | Unix timestamp

Validation Status Labels

The validation_status metric includes dynamic labels for each type of validation issue:

  • WrongNeighbor: "Check cable connection to correct switch"

  • MediaUnplugged: "Insert; Reseat or Replace Cable/Transceiver"

  • AnomalousPort: "Temperature exceeds threshold"

  • FlappingLink: "Reseat transceiver; Check Fiber"

  • UnknownNeighbor: "Verify neighbor device connectivity"

  • WrongPort: "Check port mapping in topology"

  • ExtraCable: "Remove unexpected cable connection"

  • UnreachableDevice: "Check device connectivity and power"

  • LinkDown_NoSignal: "Check physical connection"

  • ErrDisable_Flap: "Port disabled due to flapping"

  • AdminDown: "Port administratively disabled"

  • ErrDisable_Rx: "RX error disable condition"

  • NegotiationFail: "Check autonegotiation settings"

  • NicNameMismatch: "Verify NIC provisioning"

  • ModulePnMismatch: "Replace with compatible module"

Note: Commas in descriptions are automatically converted to semicolons to maintain Prometheus label format compatibility.

Examples: 

# Port with validation issues (unplugged cable)
validation_status{node="ufm-host38",port="enp3s0f1np1",peer_node="r-ufm-sw-eth01",peer_port="swp3",node_type="Host",su_number="SU1",data_hall="DH1",MediaUnplugged="Insert; Reseat or Replace Cable/Transceiver"} 1
# Port without issues (healthy connection)
validation_status{node="ufm-host38",port="enp3s0f0np0",peer_node="r-ufm-sw-eth01",peer_port="swp2",node_type="Host",su_number="SU1",data_hall="DH1"} 0
# Port with multiple validation issues
validation_status{node="switch1",port="1/1",WrongNeighbor="Check cable connection to switch2",AnomalousPort="Temperature exceeds threshold"} 2
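To make the labeling scheme concrete, here is a small sketch of how a validation_status sample could be assembled from a port's topology labels and its issues. It is an illustration only: the function names are hypothetical, and a real exporter would also need to escape quotes and backslashes in label values, which this sketch does not do.

```python
def sanitize_description(text: str) -> str:
    """Replace commas with semicolons, as the note above describes."""
    return text.replace(",", ";")


def validation_status_line(topology: dict, issues: dict) -> str:
    """Render one sample: value = issue count, one extra label per syndrome."""
    labels = dict(topology)
    for syndrome, description in issues.items():
        labels[syndrome] = sanitize_description(description)
    label_text = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"validation_status{{{label_text}}} {len(issues)}"


line = validation_status_line(
    {"node": "switch1", "port": "1/1"},
    {"MediaUnplugged": "Insert, Reseat or Replace Cable/Transceiver"},
)
print(line)
```

Because each syndrome adds its own label, a port's series changes identity whenever its set of issues changes. Queries that join validation_status with other metrics therefore usually need to match on node and port only.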

Labels

Topology Labels (All Metrics)

  • node: Switch or host name

  • port: Port identifier

  • peer_node: Connected peer node name

  • peer_port: Connected peer port identifier

  • node_type: Node type (Switch, Host, etc.)

  • su_number: Scalable Unit identifier

  • data_hall: Data hall location

Cable Labels (Status Metrics Only)

  • cable_sn: Cable serial number

  • cable_pn: Cable part number

  • protocol: Cable protocol (400G, InfiniBand, etc.)

  • port_status: Port status (Up, Down, etc.)

  • plugged: Module plugged status

Performance Characteristics

Update Frequency

  • Agent Data: Updated every 10 minutes (configurable)

  • Metrics Cache: Invalidated on data changes

  • Prometheus Scraping: Recommended 15-30 second intervals

Performance Metrics

Deployment Size | Response Time | Memory Usage | Caching Strategy
< 10K ports     | < 50ms        | ~20MB        | Full caching enabled
10K-50K ports   | < 200ms       | ~100MB       | Collection cache only
50K+ ports      | < 500ms       | ~200MB       | No caching, real-time generation


Optimization Features

  • Multi-level Caching: Port, collection, and label caching

  • Memory Adaptive: Automatically adjusts for large deployments

  • Smart Change Detection: Only updates when cable/module data changes

  • Zero Value Handling: Includes all values for complete visibility
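The memory-adaptive behavior can be pictured as a simple size-based switch. The sketch below mirrors the deployment-size tiers listed under Performance Metrics. The function name and the exact boundary handling (10,000 and 50,000 ports falling into the larger tier) are assumptions, since the document does not state how CVT treats the boundaries.

```python
def caching_strategy(port_count: int) -> str:
    """Pick a caching mode from the deployment size (assumed boundaries)."""
    if port_count < 10_000:
        return "full"             # port, collection, and label caches
    if port_count < 50_000:
        return "collection-only"  # only the collection-level cache is kept
    return "none"                 # output generated on every request


for size in (2_000, 25_000, 120_000):
    print(size, caching_strategy(size))
```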

Troubleshooting

Common Issues

1. No Metrics Data

Symptoms: Empty response or no metrics

Causes:

  • CVT service not running

  • No topology loaded

  • No advanced stats collection

Solutions:

  • Check service status

  • Check topology loading

  • Check agent connectivity

2. Missing Port Data

Symptoms: Some ports not appearing in metrics

Causes:

  • Port not in loaded topology

  • Agent not deployed on switch

  • Advanced stats not collected

Solutions:

  • Verify topology includes all expected ports

  • Deploy agents on missing switches

  • Check agent connectivity and data collection

3. Stale Timestamps

Symptoms: Old timestamps in metrics

Causes:

  • Agent not sending updates

  • Network connectivity issues

Solutions:

  • Check agent logs for errors

  • Verify network connectivity to switches

  • Restart agents if necessary

4. Missing Validation Data

Symptoms: validation_status metrics missing or always 0

Causes:

  • Validation reports not being generated

  • Agent data filtering (switch not in topology)

  • Report processing errors

Solutions:

  • Verify validation is started on agents

  • Check switch IP exists in topology

  • Review agent and collector logs for errors

5. Inconsistent Issue Counts

Symptoms: validation_status count doesn't match expected issues

Causes:

  • Issues filtered by port

  • Report data synchronization issues

  • Processing errors

Solutions:

  • Check that report data includes port-specific issues

  • Verify advanced stats and reports arrive together

  • Review validation report structure

Performance Tuning

Environment Variables

Not supported yet (TBD). The variables below are listed for reference only and currently have no effect.

# Adjust caching thresholds
PROMETHEUS_MAX_CACHED_PORTS=10000
# Disable detailed metrics for very large deployments
PROMETHEUS_ENABLE_DETAILED_METRICS=false
# Adjust cache TTL
PROMETHEUS_CACHE_TTL=60

Memory Optimization

For deployments > 50K ports:

  • Collection-level caching automatically disabled

  • Port-level caching automatically disabled

  • Real-time generation used (acceptable 200-500ms response time)

Security Considerations

Access Control

  • HTTPS Required: All access must be over HTTPS

  • No Authentication: Designed for automated monitoring tools

  • Network Restrictions: Consider IP-based access control

Sensitive Data

  • Network Topology: Metrics expose network structure

  • Cable Information: Serial numbers and part numbers included

  • Performance Data: Could reveal network capacity information

Recommended Security 

<Location /cablevalidation/metrics>
    Use BringupProxy
    SSLRequireSSL

    # Restrict to monitoring networks; a client matching any one range is allowed
    <RequireAny>
        # Internal networks
        Require ip 10.0.0.0/8
        # Container networks
        Require ip 172.16.0.0/12
        # Private networks
        Require ip 192.168.0.0/16
    </RequireAny>
</Location>
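The intent of the access rules above is that a client is admitted when its address falls inside any one of the listed networks. The following sketch reproduces that check with Python's standard ipaddress module, which can be handy for testing an address list before deploying it. The is_allowed helper is hypothetical and is not part of CVT or Apache.

```python
import ipaddress

# Networks taken from the configuration above
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_allowed(client_ip: str) -> bool:
    """True when the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


for ip in ("10.1.2.3", "172.20.0.5", "8.8.8.8"):
    print(ip, is_allowed(ip))
```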

Integration Examples

Python Client Example 

import requests

# Endpoint path for each supported output format
ENDPOINTS = {
    'prometheus': '/cablevalidation/metrics',
    'json': '/cablevalidation/metrics/json',
    'csv': '/cablevalidation/metrics/csv',
}

def get_cvt_metrics(server: str, port: int, response_format: str = 'prometheus'):
    """Fetch CVT metrics; returns a dict for 'json', otherwise the raw text."""
    url = f"https://{server}:{port}{ENDPOINTS[response_format]}"
    # verify=False skips TLS certificate verification (e.g. self-signed certs);
    # pass a CA bundle path instead where possible
    response = requests.get(url, verify=False, timeout=10)
    response.raise_for_status()
    if response_format == 'json':
        return response.json()
    return response.text

# Usage: report ports whose effective BER exceeds 1e-12
metrics = get_cvt_metrics('cvt-server.example.com', 443, 'json')
for port_key, port_data in metrics.items():
    ber = port_data['port_stats'].get('effective_ber')
    if ber is not None and ber > 1e-12:
        print(f"High BER on {port_key}: {ber}")

Validation Monitoring Examples 

# Find all ports with validation issues
validation_status > 0
# Count affected ports per node by syndrome type
count by (node) (validation_status{WrongNeighbor!=""})
count by (node) (validation_status{MediaUnplugged!=""})
count by (node) (validation_status{LinkDown_NoSignal!=""})
# Find specific issue types
validation_status{MediaUnplugged!=""} > 0   # Unplugged cables
validation_status{AdminDown!=""} > 0        # Administratively disabled ports
validation_status{ModulePnMismatch!=""} > 0 # Hardware compatibility issues
# Ports with multiple issue types
validation_status{WrongNeighbor!="",AnomalousPort!=""}
# Correlation with performance metrics (match on node and port only,
# because validation_status carries extra syndrome labels)
(validation_status > 0) and on(node, port) (effective_ber > 1e-12)
# Port status correlation
validation_status{MediaUnplugged!=""} and on(node, port) port_info{module_oper_status="unplugged"}

Architecture

Data Flow 

Network Agents → CVT Collector → Advanced Stats + Report Data → Prometheus Collector → Metrics Endpoint
     ↓              ↓                      ↓                           ↓                    ↓
  (10 min)      (Real-time)         (Synchronized)              (Multi-level Cache)  (GET request)

Enhanced Data Processing

  1. Agent Data Validation: Switch IP validated against topology before processing

  2. Synchronized Processing: Advanced stats and validation reports processed together

  3. Optimized Issue Processing: Report data pre-processed to group issues by port (O(n+m) complexity)

  4. Independent Validation Cache: PortValidationStatus class with hash-based change detection

  5. Robust Syndrome Handling: Automatic fallback for unknown syndromes with developer warnings

  6. Smart Data Quality: NA values properly excluded, counters preserve semantics
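The O(n+m) figure in step 3 comes from grouping the report's issues by port once, then looking each port up in constant time, instead of rescanning the whole report for every port. A minimal sketch of that idea follows. The record layout (node, port, syndrome, description keys) and the function names are assumptions made for illustration, not CVT's actual report format.

```python
from collections import defaultdict


def group_issues_by_port(report_issues):
    """One pass over m issues: build {(node, port): {syndrome: description}}."""
    grouped = defaultdict(dict)
    for issue in report_issues:
        key = (issue["node"], issue["port"])
        grouped[key][issue["syndrome"]] = issue["description"]
    return grouped


def attach_issues(ports, report_issues):
    """One pass over n ports with O(1) lookups, so O(n + m) overall."""
    grouped = group_issues_by_port(report_issues)
    return {key: grouped.get(key, {}) for key in ports}


ports = [("switch1", "1/1"), ("switch1", "1/2")]
issues = [
    {"node": "switch1", "port": "1/1", "syndrome": "WrongNeighbor",
     "description": "Check cable connection to switch2"},
    {"node": "switch1", "port": "1/1", "syndrome": "AnomalousPort",
     "description": "Temperature exceeds threshold"},
]
result = attach_issues(ports, issues)
print({key: len(value) for key, value in result.items()})
```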

Caching Strategy

  1. Port-level Cache: Individual port metrics cached until data changes

  2. Collection-level Cache: Aggregated output cached for fast retrieval

  3. Label Cache: Stable topology/cable labels cached separately

  4. Validation Cache: Independent cache for validation status with hash-based change detection

  5. Metadata Cache: Static TYPE/HELP comments cached permanently
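Hash-based change detection, as used by the validation cache, can be sketched as follows: compute a digest of a port's issues and only treat the entry as changed when the digest differs from the stored one. The class below is a hypothetical stand-in written for this document. It is not CVT's PortValidationStatus, and the choice of SHA-256 over a JSON encoding is an assumption.

```python
import hashlib
import json


class ValidationCacheSketch:
    """Illustrative cache keyed by port with content-based invalidation."""

    def __init__(self):
        self._hashes = {}    # port key -> digest of the last issues seen
        self._rendered = {}  # port key -> cached value (here: issue count)

    @staticmethod
    def _digest(issues: dict) -> str:
        # sort_keys makes the digest independent of dict insertion order
        payload = json.dumps(issues, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def update(self, port_key: str, issues: dict) -> bool:
        """Store new issues; return True only when their content changed."""
        digest = self._digest(issues)
        if self._hashes.get(port_key) == digest:
            return False  # same content: keep the cached entry
        self._hashes[port_key] = digest
        self._rendered[port_key] = len(issues)
        return True


cache = ValidationCacheSketch()
first = cache.update("switch1:1/1", {"AdminDown": "Port administratively disabled"})
second = cache.update("switch1:1/1", {"AdminDown": "Port administratively disabled"})
third = cache.update("switch1:1/1", {})
print(first, second, third)
```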

Performance Optimizations

  • Push-based Updates: Metrics updated when advanced stats arrive

  • Smart Change Detection: Only cable/module changes invalidate caches

  • Memory Adaptive: Caching disabled automatically for large deployments

  • String Manipulation: Efficient JSON aggregation using string operations

  • Validation Processing: O(n+m) complexity with report preprocessing

  • Hash-based Cache: Validation cache only invalidated when issue content changes

  • Agent Data Filtering: Invalid switch IPs filtered early to prevent unnecessary processing

Monitoring Best Practices

Prometheus Configuration

  • Scrape Interval: 15-30 seconds (agent data itself refreshes every 10 minutes by default, so values change less often than they are scraped)

  • Timeout: 10 seconds (allows for cache generation)

  • Retention: Configure based on historical analysis needs

Alerting Guidelines

  • BER Thresholds: Alert when effective_ber > 1e-12

  • Temperature Limits: Alert when module_temperature approaches temperature_high_th

  • Validation Issues: Alert when validation_status > 0

  • Critical Issues: Alert on specific syndromes (MediaUnplugged, UnreachableDevice, ModulePnMismatch)

  • Infrastructure Issues: Alert on LinkDown_NoSignal, ErrDisable conditions

  • Counter Anomalies: Alert on rapid increases in transceiver_reinsert_cnt
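For clients that consume the JSON endpoint rather than Prometheus alert rules, the first three guidelines above can be approximated in a few lines. This is a sketch under assumptions: alert_reasons is a hypothetical helper, and the 5-degree temp_margin is an arbitrary interpretation of "approaches" that should be tuned per deployment.

```python
def alert_reasons(port_stats: dict, issues_count: int, temp_margin: float = 5.0) -> list:
    """Return the guideline checks that a single port currently fails."""
    reasons = []
    ber = port_stats.get("effective_ber")
    if ber is not None and ber > 1e-12:
        reasons.append("effective_ber above 1e-12")
    temp = port_stats.get("module_temperature")
    high = port_stats.get("temperature_high_th")
    # temp_margin (degrees Celsius) is an assumed definition of "approaches"
    if temp is not None and high is not None and temp >= high - temp_margin:
        reasons.append("module_temperature near temperature_high_th")
    if issues_count > 0:
        reasons.append("validation_status above 0")
    return reasons


print(alert_reasons(
    {"effective_ber": 3e-9, "module_temperature": 68.0, "temperature_high_th": 70.0},
    issues_count=0,
))
```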

Sample Alerting Rules 

# Validation issues alert
- alert: PortValidationIssues
  expr: validation_status > 0
  labels:
    severity: warning
  annotations:
    summary: "Port {{ $labels.node }}:{{ $labels.port }} has validation issues"
    description: "{{ $value }} validation issues detected"
# Critical validation issues
- alert: CriticalPortIssues  
  expr: validation_status{MediaUnplugged!=""} > 0 or validation_status{UnreachableDevice!=""} > 0
  labels:
    severity: critical
  annotations:
    summary: "Critical issues on {{ $labels.node }}:{{ $labels.port }}"
    description: "{{ if $labels.MediaUnplugged }}Cable unplugged: {{ $labels.MediaUnplugged }}{{ end }}{{ if $labels.UnreachableDevice }}Device unreachable: {{ $labels.UnreachableDevice }}{{ end }}"
# Infrastructure issues
- alert: InfrastructureIssues
  expr: validation_status{LinkDown_NoSignal!=""} > 0 or validation_status{ModulePnMismatch!=""} > 0
  labels:
    severity: warning
  annotations:
    summary: "Infrastructure issue on {{ $labels.node }}:{{ $labels.port }}"
# Administrative issues  
- alert: AdminIssues
  expr: validation_status{AdminDown!=""} > 0 or validation_status{NicNameMismatch!=""} > 0
  labels:
    severity: info
  annotations:
    summary: "Administrative issue on {{ $labels.node }}:{{ $labels.port }}"

Version Information

Release Notes

  • Version: 1.1.0

  • Release Date: October 2025

  • Compatibility: CVT 1.7.0 and later

  • Dependencies: Requires advanced stats collection enabled

New in Version 1.1.0

  • ✅ Validation Metrics Integration: Port validation status with actionable issue descriptions

  • ✅ Synchronized Data Processing: Advanced stats and validation reports processed together

  • ✅ Performance Optimizations: O(n+m) validation processing, hash-based change detection

  • ✅ Enhanced Security: Agent data validation prevents processing from unknown switches

  • ✅ Improved Data Quality: None-based initialization for gauges, proper counter semantics

  • ✅ Better Caching: Independent validation cache with content-based invalidation

  • ✅ Comprehensive Syndrome Coverage: 15+ validation issue types with fallback handling

  • ✅ Real-world Validation: Successfully tested with production data and unplugged ports

API Stability

  • Metric Names: Stable (no breaking changes planned)

  • Label Names: Stable (additions possible, no removals)

  • Output Format: Prometheus standard compliance maintained

  • Endpoint URLs: Stable API contract

Data Quality Improvements

Enhanced Counter Semantics

  • Gauges default to None: Missing sensor data excluded instead of showing false zeros

  • Counters preserve values: No unexpected resets when data temporarily unavailable

  • Proper NA handling: Invalid data marked as NA and excluded from metrics

  • Temperature accuracy: Fixed zero temperature issue by using actual amber timestamps
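The difference between gauge and counter handling described above can be illustrated with a short sketch: gauges whose value is missing or NA are left out of the output entirely, while a counter keeps its last known value when a new reading is unavailable. Both helper names are hypothetical, and the use of None and the string "NA" as the missing-value markers is an assumption.

```python
def render_gauges(labels: str, gauges: dict) -> list:
    """Emit one sample per gauge, skipping values that are missing or NA."""
    lines = []
    for name, value in gauges.items():
        if value is None or value == "NA":
            continue  # excluded rather than reported as a false zero
        lines.append(f"{name}{{{labels}}} {value}")
    return lines


def merge_counter(previous: int, incoming):
    """Keep the last known counter value when the new reading is unavailable."""
    return previous if incoming is None else incoming


lines = render_gauges(
    'node="ufm-host38",port="enp3s0f1np1"',
    {"effective_ber": 0.0, "module_temperature": None, "rx_power_lane_0": "NA"},
)
print(lines)
print(merge_counter(7, None), merge_counter(7, 9))
```

Note that a genuine reading of 0.0 is still emitted; only absent or NA values are dropped.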

Validation Integration Benefits

  • Synchronized processing: Performance metrics and validation issues always in sync

  • Rich context: Issue descriptions provide actionable corrective actions

  • Efficient processing: O(n+m) complexity prevents performance degradation

  • Smart caching: Validation cache independent of performance metrics cache

Support

Troubleshooting

  1. Check CVT Service: Ensure Cable Validation service is running

  2. Verify Topology: Confirm network topology is loaded

  3. Agent Status: Check that agents are deployed and collecting data

  4. Network Connectivity: Verify switch/host accessibility

Performance Monitoring 

# Check metrics endpoint response time
time curl -k https://cvt-server/cablevalidation/metrics > /dev/null

Contact Information

  • Development Team: Cable Validation Engineering

  • Documentation: [Internal Wiki Link]

  • Support: [Support Channel/Email]

This endpoint provides comprehensive cable validation metrics for modern monitoring and observability workflows, enabling proactive network health management and automated alerting.
