Alerts Management
NMX-M continuously collects metrics from Telemetry services. Based on predefined alert rules, certain metrics may trigger alerts, which are sent to relevant stakeholders.
Some alerts provide precise event descriptions, such as a port-down alert, while others indicate trends, such as port health degradation.
The triggering conditions are predefined within the system.
Port Status Warning
The Port Status alert is triggered by a group of events affecting port operation. These alerts are based on deviations in a group of metrics: if any metric deviates from its average by more than two standard deviations (2STD), an alert is raised. The alert includes the port ID, domain, node ID, and port number, but not the specific metric that caused it.
Currently, there are two types of alerts in this group:
Ongoing Port Validation
Trigger Condition: More than 3 errors of a specific metric are detected within a 24-hour window.
Behavior: The alert is sent repeatedly until the condition is resolved. This ensures that an external system still receives the alert even if that system was temporarily down.
Alert Group: ongoing_port_validation
Alert Name: OngoingPortValidation
Severity: warning
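The Ongoing Port Validation trigger condition can be illustrated with a sliding-window counter. This is only a sketch under stated assumptions: the `ErrorWindow` class and its method names are hypothetical, and NMX-M's internal rule evaluation is not documented here. It shows how "more than 3 errors within a 24-hour window" behaves for one metric of one port.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)  # window length from the trigger condition
MAX_ERRORS = 3                # the alert fires when the count exceeds this


class ErrorWindow:
    """Tracks error timestamps for one metric of one port (hypothetical helper)."""

    def __init__(self):
        self._events = deque()

    def record(self, when):
        """Record an error at time `when`; return True if the condition holds."""
        self._events.append(when)
        # Drop errors that are older than 24 hours relative to the newest one.
        while self._events and when - self._events[0] > WINDOW:
            self._events.popleft()
        return len(self._events) > MAX_ERRORS


start = datetime(2025, 2, 18, 0, 0)
window = ErrorWindow()
# Four errors, one hour apart: only the fourth exceeds the limit of 3.
results = [window.record(start + timedelta(hours=h)) for h in (0, 1, 2, 3)]
print(results)
```

Errors that are spaced more than 24 hours apart never accumulate in this sketch, because older entries are evicted before the count is compared.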
Anomalies Detection
On each new metric value, NMX-M analyzes historical data over a predefined time span using a statistical model. An alert is triggered if the value deviates from the historical average by more than two standard deviations (2STD). The following alerts are included:
Port Congestion Warning
Alert Group: port_metrics_deviation
Alert Name: PortCongestion
Severity: warning
Physical Layer Retransmission Warning
Alert Group: port_metrics_deviation
Alert Name: PhysicalLayerRetransmission
Severity: warning
Port Degradation Warning
Alert Group: port_metrics_deviation
Alert Names: PortDegradationHistogram1, PortDegradationHistogram2, PortDegradationHistogram3, PortDegradationBER, PortDegradationLinkErrors
Severity: warning
Packet Discard Warning
Alert Group: port_metrics_deviation
Alert Name: PacketDiscard
Severity: warning
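The 2STD rule used by the Anomalies Detection alerts can be sketched as a simple deviation check. The function name `exceeds_2std`, the use of the population standard deviation, and the handling of a perfectly flat history are assumptions made for illustration; the statistical model NMX-M actually uses is not specified here.

```python
from statistics import mean, pstdev


def exceeds_2std(history, new_value, threshold=2.0):
    """Return True if new_value lies more than `threshold` standard
    deviations away from the mean of the historical values."""
    if len(history) < 2:
        return False  # too little data to estimate the spread
    avg = mean(history)
    std = pstdev(history)  # assumption: population standard deviation
    if std == 0:
        # Assumption: with a flat history, any change counts as a deviation.
        return new_value != avg
    return abs(new_value - avg) > threshold * std


history = [10, 12, 11, 9, 10, 11, 10, 12]
print(exceeds_2std(history, 11))  # close to the average
print(exceeds_2std(history, 20))  # far outside the usual range
```

In this example the history has a mean of 10.625 and a standard deviation of about 0.99, so a value must differ from the mean by more than roughly 1.98 to be flagged.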
When an alert is triggered, NMX-M sends a notification to the configured webhook URL.
Setting the Webhook URL
You can set the webhook receiver URL during installation, when you are prompted to provide one or more webhook URLs. Enter a single URL or multiple URLs separated by commas. All specified URLs receive the notifications simultaneously.
Example input format:
http://alert1.example.com:9093,http://alert2.example.com:9093/webhook
To skip this step and configure the URL later, simply press Enter when prompted.
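Before entering a list of URLs, it can help to check that the comma-separated string is well formed. The following is a minimal sketch; `parse_webhook_urls` is a hypothetical helper, and accepting only `http` and `https` schemes is an assumption, since the validation performed by the installer itself is not described here.

```python
from urllib.parse import urlsplit


def parse_webhook_urls(raw):
    """Split a comma-separated string into a list of checked webhook URLs."""
    urls = []
    for part in raw.split(","):
        url = part.strip()
        if not url:
            continue  # skip empty entries, e.g. from a trailing comma
        parts = urlsplit(url)
        # Assumption: a usable receiver URL has an http(s) scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a valid webhook URL: {url!r}")
        urls.append(url)
    return urls


print(parse_webhook_urls(
    "http://alert1.example.com:9093,http://alert2.example.com:9093/webhook"
))
```

An empty string yields an empty list, which corresponds to pressing Enter to skip the step.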
This URL (or list of URLs) will be used to receive HTTP POST notifications in the Prometheus Alertmanager JSON format:
{
  "version": "4",
  "groupKey": <string>,              // key identifying the group of alerts (e.g. to deduplicate)
  "truncatedAlerts": <int>,          // how many alerts have been truncated due to "max_alerts"
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,           // backlink to the internal alert manager
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": <string>,      // identifies the entity that caused the alert
      "fingerprint": <string>        // fingerprint to identify the alert
    },
    ...
  ]
}
This format makes the notifications compatible with a wide range of existing integrations. For more information, see Prometheus Integrations – Alertmanager Webhook Receiver. The following is an example notification for an OngoingPortValidation alert:
{
  "receiver": "webhook-receiver",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertgroup": "ongoing_port_validation",
        "alertname": "OngoingPortValidation",
        "severity": "warning"
      },
      "annotations": {
        "description": "Error counters have increased\nmultiple times in the last 24 hours for port 123.\n",
        "summary": "Repeated error counter increases for port 123"
      },
      "startsAt": "2025-02-18T13:25:16.860604419Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://c6c0e7e35a8f:8880/vmalert/alert?group_id=1036955090143761274&alert_id=1074584496268461589",
      "fingerprint": "abcf32454420d0f2"
    }
  ],
  "groupLabels": {
    "alertname": "OngoingPortValidation"
  },
  "commonLabels": {
    "alertgroup": "ongoing_port_validation",
    "alertname": "OngoingPortValidation",
    "severity": "warning"
  },
  "commonAnnotations": {
    "description": "Error counters have increased\nmultiple times in the last 24 hours for port 123.\n",
    "summary": "Repeated error counter increases for port 123"
  },
  "externalURL": "http://224f3bffc652:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"OngoingPortValidation\"}",
  "truncatedAlerts": 0
}
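A receiver for these notifications can be quite small. The sketch below uses only the Python standard library and is illustrative rather than a reference implementation: the names `handle_payload`, `WebhookHandler`, and `serve`, the listening port, and the choice to suppress repeated notifications by `fingerprint` are all assumptions. Suppressing repeats is one possible way to cope with alerts, such as OngoingPortValidation, that are re-sent until the condition is resolved.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_fingerprints = set()  # fingerprints of firing alerts already reported


def handle_payload(payload):
    """Return one summary line per alert, skipping repeats of firing alerts."""
    lines = []
    for alert in payload.get("alerts", []):
        fingerprint = alert.get("fingerprint")
        status = alert.get("status")
        if status == "firing":
            if fingerprint in seen_fingerprints:
                continue  # repeated notification for an alert still firing
            seen_fingerprints.add(fingerprint)
        else:
            # Resolved: forget the fingerprint so a recurrence is reported.
            seen_fingerprints.discard(fingerprint)
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {labels.get('alertname')} "
                     f"({labels.get('severity')}): {summary}")
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)  # body was not valid JSON
            self.end_headers()
            return
        for line in handle_payload(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=9093):
    """Listen for notifications; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; its address would then be one of the webhook URLs configured in NMX-M. The in-memory set is lost on restart, so a production receiver would likely need persistent state.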
Updating Webhook Receiver URL
You can update the webhook URL at any time after installation.
To do so:
Create a webhook.yaml file containing the new Alerts webhook receiver URL(s). You can provide a single URL or a comma-separated list. Example:
http://alert1.example.com:9093,http://alert2.example.com:9093/webhook
Run the following script under the root user:
/opt/nvidia/nmx/scripts/alerts-webhook-url-config.sh
When prompted, you'll see the following menu:
Choose an option:
1) Update webhook receiver URLs
2) Retrieve webhook receiver URLs and their statuses
3) Clear webhook receiver URLs
4) Exit
Enter your selection: ..
To view the current configuration, select option 2.
To update the webhook URLs, select option 1.
To clear the URLs, select option 3.
New webhook URLs will begin receiving notifications shortly after the system automatically redeploys the relevant components.