Alerts Management
NMX-M continuously collects metrics from Telemetry services. Certain metrics may trigger alerts, which are sent to relevant stakeholders based on predefined alert rules.
Some alerts provide exact values or specific event descriptions, such as a port-down alert, while others indicate trends, such as port health degradation.
The conditions for triggering alerts are predefined within the system.
Port Status Warning
Port status is a collection of multiple events that influence port operation. The following categories result in an alert that carries the port ID and a description, without an exact metric value. The description includes the port number, domain, and node ID.
Each of the following alerts groups multiple metrics and is triggered if any metric collected in the group deviates by more than two standard deviations (2STD) from the average.
The alert sent for an event does not name the specific metric that triggered it; it names only the alert group and the port ID.
Currently, there are two types of alerts:
Ongoing port validation: checks whether a metric increased three or more times within a 24-hour span.
In other words, if three or more errors of a specific metric are received within 24 hours, an alert is triggered.
The notification is re-sent continuously for as long as the condition holds.
As a result, an external system that was temporarily down and missed some notifications still receives the current status.
Alert Group: ongoing_port_validation
Alert Name: OngoingPortValidation
Severity: warning
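The ongoing port validation rule described above can be sketched as follows. This is only an illustration of the "three or more increases within 24 hours" condition, not NMX-M's actual implementation; the function names, the sample layout, and the choice to compare consecutive samples are assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)  # window length stated in the text
MIN_INCREASES = 3             # "3+ times" threshold stated in the text


def count_increases(samples, now):
    """Count how often a counter rose between consecutive samples in the window.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp
    (hypothetical layout, chosen for this sketch).
    """
    recent = [(ts, v) for ts, v in samples if now - ts <= WINDOW]
    return sum(1 for (_, prev), (_, cur) in zip(recent, recent[1:]) if cur > prev)


def should_alert(samples, now):
    """True when the counter increased at least MIN_INCREASES times in the window."""
    return count_increases(samples, now) >= MIN_INCREASES


now = datetime(2025, 2, 18, 12, 0)
samples = [
    (now - timedelta(hours=30), 0),  # outside the 24h window, ignored
    (now - timedelta(hours=20), 1),
    (now - timedelta(hours=12), 2),  # increase
    (now - timedelta(hours=6), 2),   # unchanged
    (now - timedelta(hours=1), 4),   # increase
    (now, 5),                        # increase
]
print(should_alert(samples, now))
```

Because the check is re-evaluated on each run, the alert keeps firing while the window still contains enough increases, which matches the repeated notifications described above.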
Anomaly detection: for every incoming metric value, NMX-M examines the values from a predefined time span and looks for unusual trends using a predefined formula (spikes exceeding two standard deviations).
It includes the following groups:
Congestion warning
Alert Group: port_metrics_deviation
Alert Name: PortCongestion
Severity: warning
Physical Layer Retransmission warning
Alert Group: port_metrics_deviation
Alert Name: PhysicalLayerRetransmission
Severity: warning
Port Degradation warning
Alert Group: port_metrics_deviation
Alert Name: PortDegradation
Severity: warning
Packet Discard warning
Alert Group: port_metrics_deviation
Alert Name: PacketDiscard
Severity: warning
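The two-standard-deviation check behind the anomaly detection groups can be illustrated with a short sketch. The exact formula, the length of the time span, and whether a sample or population standard deviation is used are not stated in this document, so the function below (a hypothetical `is_anomalous` using the sample standard deviation) is an assumption for illustration only.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=2.0):
    """Return True if `value` deviates from the mean of `history`
    by more than `threshold` standard deviations.

    `history` holds the metric values from the look-back time span.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation (assumed)
    if sigma == 0:
        # Flat history: treat any change as a deviation (design choice of this sketch).
        return value != mu
    return abs(value - mu) > threshold * sigma


history = [10, 12, 11, 13, 12, 11, 10, 12]
print(is_anomalous(history, 12))  # close to the average
print(is_anomalous(history, 20))  # far above the average
```

A real detector would also need to decide how to handle counter resets and how long the look-back span is; both are left out here.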
When an alert is triggered, NMX-M sends a notification to a webhook URL.
Setting webhook receiver URL
You can set the URL during the installation process, when the installer prompts you for a webhook URL.
You can provide a single URL or multiple URLs separated by commas; all of them receive the notifications simultaneously.
Enter the Alerts webhook receiver URL or comma-separated list of URLs (e.g. http://alert1.example.com:9093,http://alert2.example.com:9093/webhook). To set the webhook receiver URL later, press Enter: ..
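Before entering a value at this prompt, it can help to check that the comma-separated list is well formed. The sketch below is a hypothetical helper, not part of NMX-M, and the validation rules (http or https scheme, non-empty host) are assumptions; the installer may accept or reject different input.

```python
from urllib.parse import urlsplit


def parse_webhook_urls(raw):
    """Split a comma-separated string into a list of webhook URLs.

    Raises ValueError for entries that are not http(s) URLs with a host.
    """
    urls = []
    for item in raw.split(","):
        url = item.strip()
        if not url:
            continue  # tolerate stray commas and surrounding whitespace
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            raise ValueError(f"not a valid webhook URL: {url!r}")
        urls.append(url)
    return urls


print(parse_webhook_urls(
    "http://alert1.example.com:9093,http://alert2.example.com:9093/webhook"
))
```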
This URL (or URLs) will be used to receive HTTP POST notifications in the JSON format of Prometheus Alertmanager:
{
  "version": "4",
  "groupKey": <string>,              // key identifying the group of alerts (e.g. to deduplicate)
  "truncatedAlerts": <int>,          // how many alerts have been truncated due to "max_alerts"
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,           // backlink to the internal alert manager.
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": <string>,      // identifies the entity that caused the alert
      "fingerprint": <string>        // fingerprint to identify the alert
    },
    ...
  ]
}
This format makes the notifications compatible with a variety of existing integrations; see "Alertmanager Webhook Receiver" on the Prometheus Integrations page.
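A receiver for this format only needs to accept an HTTP POST and parse the JSON body. The following is a minimal sketch using Python's standard library; the handler class, the `summarize` helper, and the port number are hypothetical choices for illustration, and a production receiver would add authentication, logging, and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Return one line per alert in an Alertmanager-style webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"[{alert.get('status')}] {labels.get('alertname')}: "
            f"{annotations.get('summary', '')}"
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Malformed or unexpected body: reject it.
            self.send_response(400)
            self.end_headers()
            return
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def run(port=9093):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `run()` starts the receiver on port 9093 (the port used in the example URLs); the address of that host is then what would be supplied as the webhook receiver URL.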
Notification example
{
  "receiver": "webhook-receiver",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertgroup": "ongoing_port_validation",
        "alertname": "OngoingPortValidation",
        "severity": "warning"
      },
      "annotations": {
        "description": "Error counters have increased\nmultiple times in the last 24 hours for port 123.\n",
        "summary": "Repeated error counter increases for port 123"
      },
      "startsAt": "2025-02-18T13:25:16.860604419Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://c6c0e7e35a8f:8880/vmalert/alert?group_id=1036955090143761274&alert_id=1074584496268461589",
      "fingerprint": "abcf32454420d0f2"
    }
  ],
  "groupLabels": {
    "alertname": "OngoingPortValidation"
  },
  "commonLabels": {
    "alertgroup": "ongoing_port_validation",
    "alertname": "OngoingPortValidation",
    "severity": "warning"
  },
  "commonAnnotations": {
    "description": "Error counters have increased\nmultiple times in the last 24 hours for port 123.\n",
    "summary": "Repeated error counter increases for port 123"
  },
  "externalURL": "http://224f3bffc652:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"OngoingPortValidation\"}",
  "truncatedAlerts": 0
}
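Because notifications for an ongoing condition are sent repeatedly, a receiver may want to act only the first time it sees a given alert. One possible approach, sketched below, is to track active alerts by their "fingerprint" field. The `apply_notification` function and the in-memory dictionary are hypothetical, and treating "resolved" as the signal to forget an alert is an assumption of this sketch.

```python
def apply_notification(active, payload):
    """Update `active` (fingerprint -> alert) from one webhook payload.

    Returns the fingerprints of alerts that were not active before.
    """
    new = []
    for alert in payload.get("alerts", []):
        fp = alert["fingerprint"]
        if alert["status"] == "firing":
            if fp not in active:
                new.append(fp)
            active[fp] = alert  # keep the most recent copy
        else:
            # "resolved": the alert is no longer active
            active.pop(fp, None)
    return new


active = {}
firing = {"alerts": [{"status": "firing", "fingerprint": "abcf32454420d0f2"}]}
resolved = {"alerts": [{"status": "resolved", "fingerprint": "abcf32454420d0f2"}]}

print(apply_notification(active, firing))    # first delivery: reported as new
print(apply_notification(active, firing))    # repeat delivery: nothing new
print(apply_notification(active, resolved))  # alert is dropped from `active`
```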
Updating webhook receiver URL
The webhook URL can be updated after the installation, using the following instructions.
First, create a webhook.yaml file and include the Alerts webhook receiver URL or a comma-separated list of URLs (e.g. http://alert1.example.com:9093,http://alert2.example.com:9093/webhook).
Then run the following script as the root user: /opt/nvidia/nmx/scripts/alerts-webhook-url-config.sh
You will be prompted with the following:
Choose an option:
1) Update webhook receiver URLs
2) Retrieve webhook receiver URLs and their statuses
3) Clear webhook receiver URLs
4) Exit
Enter your selection: ..
If needed, select option 2 to view the current configuration.
To set new URLs, choose option 1.
To clear the URLs, choose option 3.
New URLs start receiving notifications shortly afterwards, once the system has automatically redeployed the notification-related parts of the application.