NMX Manager (NMX-M) Documentation v85.1.3000

Alerts Management

NMX-M continuously collects metrics from Telemetry services. Based on predefined alert rules, certain metrics may trigger alerts, which are sent to relevant stakeholders.

Some alerts provide precise event descriptions, such as a port-down alert, while others indicate trends, such as port health degradation.

The triggering conditions are predefined within the system.

Port Status Warning

The Port Status alert is triggered by a group of events affecting port operation. These alerts are based on deviations in a group of metrics: if any metric deviates from its average by more than two standard deviations (2STD), an alert is raised. The alert includes the port ID, domain, node ID, and port number, but not the exact metric that caused the alert.
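As an illustration of the 2STD rule, the following minimal Python sketch flags a value that lies more than two standard deviations from the average of earlier samples. The function name `exceeds_2std`, the use of the population standard deviation, and the handling of a constant history are assumptions made for this example. The statistical model NMX-M actually uses is not described in this document.

```python
from statistics import mean, pstdev


def exceeds_2std(history, value, k=2.0):
    """Return True if `value` is more than k standard deviations from the mean of `history`.

    Hypothetical helper for illustration only; not part of NMX-M.
    """
    if len(history) < 2:
        return False  # too few samples to estimate the spread
    mu = mean(history)
    sigma = pstdev(history)  # assumption: population standard deviation
    if sigma == 0:
        # Assumption: with a constant history, any different value is a deviation.
        return value != mu
    return abs(value - mu) > k * sigma


samples = [10, 11, 9, 10, 10, 11, 9, 10]
print(exceeds_2std(samples, 12))  # well outside the usual range
print(exceeds_2std(samples, 11))  # within the usual range
```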

Currently, there are two types of alerts in this group:

  1. Ongoing Port Validation

    1. Trigger Condition: More than 3 errors of a specific metric are detected within a 24-hour window.

    2. Behavior: The alert is sent repeatedly until the condition is resolved. This ensures that external systems receive the alert even if they were temporarily down.

    3. Alert Group: ongoing_port_validation

    4. Alert Name: OngoingPortValidation

    5. Severity: warning

  2. Anomalies Detection

    On each new metric value, NMX-M analyzes historical data over a predefined time span using a statistical model. Alerts are triggered if values deviate significantly (exceeding 2STD). The following alerts are included:

    • Port Congestion Warning

      • Alert Group: port_metrics_deviation

      • Alert Name: PortCongestion

      • Severity: warning

    • Physical Layer Retransmission Warning

      • Alert Group: port_metrics_deviation

      • Alert Name: PhysicalLayerRetransmission

      • Severity: warning

    • Port Degradation Warning

      • Alert Group: port_metrics_deviation

      • Alert Names:

        • PortDegradationHistogram1

        • PortDegradationHistogram2

        • PortDegradationHistogram3

        • PortDegradationBER

        • PortDegradationLinkErrors

      • Severity: warning

    • Packet Discard Warning

      • Alert Group: port_metrics_deviation

      • Alert Name: PacketDiscard

      • Severity: warning
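The Ongoing Port Validation trigger condition listed above (more than 3 errors of a specific metric within a 24-hour window) can be pictured as a sliding-window count. The sketch below is a minimal illustration that assumes each error is available as a timestamped event; the names `WINDOW`, `THRESHOLD`, and `should_alert` are hypothetical, and the way NMX-M counts errors internally is not documented here.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)  # window length taken from the trigger condition
THRESHOLD = 3                 # alert when MORE than 3 errors fall in the window


def should_alert(error_times, now):
    """Return True if more than THRESHOLD errors occurred in the last 24 hours.

    `error_times` is an iterable of datetime objects, one per detected error
    (an assumption about how errors are recorded).
    """
    window_start = now - WINDOW
    recent = [t for t in error_times if window_start < t <= now]
    return len(recent) > THRESHOLD


now = datetime(2025, 2, 18, 12, 0)
errors = [now - timedelta(hours=h) for h in (1, 2, 3, 30)]
print(should_alert(errors, now))  # only three errors are inside the window
```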

When an alert is triggered, NMX-M sends a notification to the configured webhook URL.

Setting the Webhook URL

You can set the webhook receiver URL during installation, when you are prompted to provide one or more webhook URLs. You may enter a single URL or multiple URLs separated by commas. All specified URLs receive the notifications simultaneously.

Example input format:


http://alert1.example.com:9093,http://alert2.example.com:9093/webhook

To skip this step and configure the URL later, simply press Enter when prompted.
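Before entering a list of URLs, it may help to check that the input is well formed. The following sketch splits a comma-separated string and verifies that each entry looks like an HTTP(S) URL. The function `parse_webhook_urls` is hypothetical and is not part of the installer; accepting `https` in addition to `http` is also an assumption, since the example above only shows `http`.

```python
from urllib.parse import urlparse


def parse_webhook_urls(raw):
    """Split a comma-separated URL string and validate each entry.

    Hypothetical pre-check; the installer's own validation is not documented here.
    """
    urls = []
    for part in raw.split(","):
        candidate = part.strip()
        if not candidate:
            continue  # ignore empty entries such as a trailing comma
        parsed = urlparse(candidate)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a valid HTTP(S) URL: {candidate!r}")
        urls.append(candidate)
    return urls


print(parse_webhook_urls(
    "http://alert1.example.com:9093,http://alert2.example.com:9093/webhook"
))
```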

Each configured URL receives HTTP POST notifications in the Prometheus Alertmanager JSON format:


{
  "version": "4",
  "groupKey": <string>,              // key identifying the group of alerts (e.g. to deduplicate)
  "truncatedAlerts": <int>,          // how many alerts have been truncated due to "max_alerts"
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,           // backlink to the internal alert manager.
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": <string>,      // identifies the entity that caused the alert
      "fingerprint": <string>        // fingerprint to identify the alert
    },
    ...
  ]
}

This format makes NMX-M notifications compatible with a wide range of existing integrations. For more information, see Prometheus Integrations – Alertmanager Webhook Receiver.

Example notification:


{
  "receiver": "webhook-receiver",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertgroup": "ongoing_port_validation",
        "alertname": "OngoingPortValidation",
        "severity": "warning"
      },
      "annotations": {
        "description": "Error counters have increased\nmultiple times in the last 24 hours for port 123.\n",
        "summary": "Repeated error counter increases for port 123"
      },
      "startsAt": "2025-02-18T13:25:16.860604419Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://c6c0e7e35a8f:8880/vmalert/alert?group_id=1036955090143761274&alert_id=1074584496268461589",
      "fingerprint": "abcf32454420d0f2"
    }
  ],
  "groupLabels": {
    "alertname": "OngoingPortValidation"
  },
  "commonLabels": {
    "alertgroup": "ongoing_port_validation",
    "alertname": "OngoingPortValidation",
    "severity": "warning"
  },
  "commonAnnotations": {
    "description": "Error counters have increased\nmultiple times in the last 24 hours for port 123.\n",
    "summary": "Repeated error counter increases for port 123"
  },
  "externalURL": "http://224f3bffc652:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"OngoingPortValidation\"}",
  "truncatedAlerts": 0
}
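A receiver for these notifications can be quite small. The sketch below uses only the Python standard library: it accepts the POST body, extracts the alert name, severity, and summary from each alert, and skips alerts whose fingerprint has already been reported as firing (useful because an ongoing alert may be sent repeatedly). The names `process_notification`, `AlertHandler`, and `serve`, the port 9093, and the in-memory fingerprint set are all assumptions for illustration. This is not an NMX-M component, and a production receiver would need persistence, authentication, and more thorough error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_fingerprints = set()  # fingerprints of alerts already reported as firing


def process_notification(payload):
    """Return one summary line per alert that has not already been reported."""
    lines = []
    for alert in payload.get("alerts", []):
        fingerprint = alert.get("fingerprint")
        status = alert.get("status")
        labels = alert.get("labels", {})
        if status == "firing":
            if fingerprint in seen_fingerprints:
                continue  # repeat of an alert that is still firing
            seen_fingerprints.add(fingerprint)
        else:
            seen_fingerprints.discard(fingerprint)  # e.g. "resolved"
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(
            f"[{status}] {labels.get('alertname')} ({labels.get('severity')}): {summary}"
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)  # body was not valid JSON
            self.end_headers()
            return
        for line in process_notification(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=9093):
    """Listen on all interfaces; the port number is an assumption."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. `process_notification` can also be called directly with a parsed payload, which makes the filtering logic easy to try without a running server.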


Updating Webhook Receiver URL

You can update the webhook URL at any time after installation.

To do so:

  1. Create a webhook.yaml file containing the new Alerts webhook receiver URL(s). You can provide a single URL or a comma-separated list. Example:


    http://alert1.example.com:9093,http://alert2.example.com:9093/webhook

  2. Run the following script as the root user:


    /opt/nvidia/nmx/scripts/alerts-webhook-url-config.sh

  3. The script displays the following menu:


    Choose an option:
    1) Update webhook receiver URLs
    2) Retrieve webhook receiver URLs and their statuses
    3) Clear webhook receiver URLs
    4) Exit
    Enter your selection: ..

  • To view the current configuration, select option 2.

  • To update the webhook URLs, select option 1.

  • To clear the URLs, select option 3.

New webhook URLs will begin receiving notifications shortly after the system automatically redeploys the relevant components.
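To confirm that a newly configured receiver is reachable, one option is to post a hand-built notification to it from another host. The sketch below builds such a request with the Python standard library. The helpers `build_test_request` and `send_test_notification`, the fingerprint value, and the summary text are hypothetical, and the payload is a reduced version of the Alertmanager format rather than something NMX-M emits.

```python
import json
import urllib.request


def build_test_request(url):
    """Build a POST request carrying a minimal Alertmanager-style payload."""
    payload = {
        "version": "4",
        "status": "firing",
        "receiver": "webhook-receiver",
        "alerts": [{
            "status": "firing",
            "labels": {
                "alertgroup": "ongoing_port_validation",
                "alertname": "OngoingPortValidation",
                "severity": "warning",
            },
            "annotations": {"summary": "Test notification (not sent by NMX-M)"},
            "fingerprint": "test-0000",  # made-up value for the test
        }],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_notification(url, timeout=5):
    """Send the test payload and return the HTTP status code of the response."""
    with urllib.request.urlopen(build_test_request(url), timeout=timeout) as resp:
        return resp.status
```

For example, `send_test_notification("http://alert1.example.com:9093")` returns the receiver's HTTP status code if the endpoint responds, and raises `urllib.error.URLError` if the endpoint cannot be reached.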

© Copyright 2025, NVIDIA. Last updated on Sep 18, 2025.