Observability Software#

Base Command Manager Monitoring#

NVIDIA Mission Control utilizes NVIDIA Base Command Manager’s built-in monitoring infrastructure for comprehensive system observability. This guide assumes administrators are familiar with BCM monitoring concepts, including metrics collection, measurable definitions, trigger and alert configurations, and health checks.

For detailed coverage of observability and monitoring in Base Command Manager, administrators should refer directly to:

Base Command Manager (BCM) Administrator Guide:

  • Monitoring Cluster Devices (Section 10)

  • Metrics, Health Checks, Enummetrics, and Actions (Appendix G)

For customizing monitoring alerts, health checks, and related configurations, consult these sections before proceeding.

Prometheus and Grafana#

Validate metric collection using Grafana#

Grafana should be reachable on your head node over HTTPS at the /grafana path.

By default, the administrative account is admin with password prom-operator.

Grafana login screen

After logging in, click Explore in the left-hand menu. In the Explore view, switch the query editor to Code mode and run a query for the metric memorytotal.

Grafana explore interface showing memorytotal query

BCM provides many metrics out of the box. To list them in Grafana, open the Metric browser and select all metrics that carry the label job with the value external-bcmexporter.

Grafana metric browser showing BCM metrics

The same list can be obtained from the BCM Prometheus exporter endpoint. The following command uses curl with some light processing to print the unique metric names:

curl -sk https://localhost:8081/exporter | grep -Ev '# HELP|# TYPE' | cut -d'{' -f1 | sort -u

alertlevel
...
writetime_total
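
The same exporter output can also show how many series each metric exposes, which helps when a query returns more series than expected. The following is a minimal sketch, assuming the exporter emits standard Prometheus text format. The function name count_series is hypothetical, and the sample input is made up for illustration rather than taken from a real exporter.

```shell
# count_series: read Prometheus text-format exposition on stdin and print
# "<metric name> <number of series>", sorted by metric name.
# The function name is illustrative and not part of BCM.
count_series() {
  grep -Ev '^(#|$)' |        # drop HELP/TYPE comments and blank lines
    sed 's/[{ ].*//' |        # keep only the metric name
    sort | uniq -c |
    awk '{print $2, $1}'
}

# Made-up sample input (label names and values are assumptions):
count_series <<'EOF'
# HELP memorytotal Total memory
# TYPE memorytotal gauge
memorytotal{hostname="node001"} 1024
memorytotal{hostname="node002"} 2048
alertlevel{hostname="node001"} 0
EOF
```

Against a live head node, the curl command shown earlier can be piped straight into the function: curl -sk https://localhost:8081/exporter | count_series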

Build a Grafana dashboard#

With BCM exporting metrics, we can build dashboards on top of them. In this example, we’ll build a device state history dashboard.

  1. Select Dashboards in the left panel, then select New and New dashboard.

    Creating a new Grafana dashboard
  2. Click Add visualization.

    Adding a visualization to the dashboard
  3. Select Prometheus as your datasource.

    Selecting Prometheus as the datasource
  4. Start by setting the visualization to Status history. Query the metric devicestatus.

    Setting up status history visualization with devicestatus metric
  5. On the right pane, scroll down to Value mappings and click Add value mappings. This metric returns either 0, which maps to OK, or 1, which maps to NOK. Add a value mapping for each with a color of your choice. We’ll use green for 0 and red for 1.

    Configuring value mappings for device status
  6. Delete the threshold for the value 80.

    Deleting the default threshold
  7. Expand Options in the query window, set Legend to Custom, and enter {{hostname}} as the value.

    Setting custom legend with hostname
  8. You may get this error:

    Too many points to visualize properly.
    
  9. You can work around this by setting the time window of the dashboard from Last 6 hours to Last 15 minutes.

    Adjusting the time window to resolve visualization issues
  10. To make the panel usable over longer time ranges, set the Min step of the query to 5m. Then set the Title of the panel to Node status.

    Setting minimum step and panel title

    Click on Back to dashboard. Resize the panel to a desired size.

    Returning to dashboard view

    Set the time window to Last 1 hour and note that the dashboard now renders more status boxes.

    Resizing the panel and adjusting time window
  11. Select Save dashboard. Set the Title to BCM node status and click on Save.

    Saving the dashboard with a title
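
The steps above can also be captured as a JSON payload and uploaded through Grafana's dashboard HTTP API, which makes the dashboard reproducible across clusters. This is a sketch under several assumptions rather than a reference configuration: the panel has no explicit datasource and so relies on Prometheus being Grafana's default datasource, the field names follow the dashboard JSON model of recent Grafana releases and may differ in yours, and the file path and the GRAFANA_URL, GRAFANA_USER, and GRAFANA_PASSWORD variables are hypothetical names chosen for this example. It covers the query, legend, Min step, value mappings, and titles. Everything else is left to Grafana's defaults.

```shell
# Sketch: the dashboard from the steps above as a payload for Grafana's
# dashboard HTTP API. File name and variable names are illustrative.
payload="${TMPDIR:-/tmp}/bcm-node-status.json"

cat > "$payload" <<'EOF'
{
  "dashboard": {
    "id": null,
    "title": "BCM node status",
    "time": { "from": "now-1h", "to": "now" },
    "panels": [
      {
        "type": "status-history",
        "title": "Node status",
        "gridPos": { "h": 12, "w": 24, "x": 0, "y": 0 },
        "targets": [
          {
            "refId": "A",
            "expr": "devicestatus",
            "legendFormat": "{{hostname}}",
            "interval": "5m"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {
                "type": "value",
                "options": {
                  "0": { "text": "OK", "color": "green" },
                  "1": { "text": "NOK", "color": "red" }
                }
              }
            ]
          },
          "overrides": []
        }
      }
    ]
  },
  "overwrite": false
}
EOF

# Upload only when GRAFANA_URL is set, for example https://<head node>/grafana.
# GRAFANA_USER and GRAFANA_PASSWORD are assumed to hold valid credentials.
if [ -n "${GRAFANA_URL:-}" ]; then
  curl -sk -u "${GRAFANA_USER}:${GRAFANA_PASSWORD}" \
    -H 'Content-Type: application/json' \
    -X POST --data @"$payload" \
    "${GRAFANA_URL}/api/dashboards/db"
fi
```

Without GRAFANA_URL set, the script only writes the payload file, so the JSON can be reviewed before anything is sent to Grafana.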

Logs#

BCM provides built-in log aggregation out of the box. By default, each node that BCM deploys forwards its logs to the head nodes so that they are centralized. From a head node, you can view these logs in /var/log/syslog.
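
Because every node writes into the same file, a quick per-host line count shows which nodes are logging the most. The following is a minimal sketch, assuming the traditional syslog layout in which the fourth whitespace-separated field is the hostname. The function name lines_per_host is hypothetical, and the sample lines are made up rather than real log output.

```shell
# lines_per_host: read syslog lines on stdin and print "<host> <line count>",
# busiest host first. Assumes field 4 is the hostname; any domain suffix is
# stripped. The function name is illustrative.
lines_per_host() {
  awk '{ split($4, h, "."); n[h[1]]++ }
       END { for (k in n) print k, n[k] }' |
    sort -k2,2nr -k1,1
}

# Made-up sample lines, not real log output:
lines_per_host <<'EOF'
Jan 10 10:00:01 node001.cluster kernel: example message
Jan 10 10:00:02 node002 systemd[1]: example message
Jan 10 10:00:03 node001.cluster kernel: example message
EOF
```

On a head node, the function can read the aggregated log directly: lines_per_host < /var/log/syslog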

Counting Xid errors#

When a GPU fault occurs, it is logged to syslog as an Xid error. Xid errors can indicate user-driven exceptions, driver errors, or in some circumstances a hardware fault. It’s useful to know which Xids have occurred and how often.

Xid values and descriptions can be found in this document: Xid errors

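# Requires gawk; assumes traditional syslog timestamps ("Mon DD HH:MM:SS host ...").
grep "Xid (PCI" /var/log/syslog | gawk '
  BEGIN {
    cutoff = systime() - 12 * 3600; year = strftime("%Y")
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= n; i++) month[names[i]] = i
  }
  {
    # Timestamps carry no year: assume the current one, or the previous one if that lands in the future.
    split($3, t, ":"); when = month[$1] " " $2 " " t[1] " " t[2] " " t[3]
    ts = mktime(year " " when); if (ts > systime()) ts = mktime((year - 1) " " when)
    if (ts >= cutoff && match($0, /Xid \(PCI:[^)]+\): ([0-9]+)/, id)) {
      split($4, host, "."); print host[1], id[1]
    }
  }
' | sort | uniq -c | awk '{print $2 ":", $1, "occurrences of error ID", $3}'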

Example output:

a07-p1-dgx-03-c08: 10158 occurrences of error ID 143
b06-p1-dgx-06-c01: 4 occurrences of error ID 149
b06-p1-dgx-06-c01: 4 occurrences of error ID 154
b06-p1-dgx-06-c02: 4 occurrences of error ID 149
b06-p1-dgx-06-c02: 3 occurrences of error ID 154
b06-p1-dgx-06-c04: 4 occurrences of error ID 149
b06-p1-dgx-06-c04: 4 occurrences of error ID 154
b06-p1-dgx-06-c05: 4 occurrences of error ID 149
b06-p1-dgx-06-c05: 4 occurrences of error ID 154
b06-p1-dgx-06-c06: 4 occurrences of error ID 149
b06-p1-dgx-06-c06: 4 occurrences of error ID 154
b06-p1-dgx-06-c07: 4 occurrences of error ID 149
b06-p1-dgx-06-c07: 2 occurrences of error ID 154
b06-p1-dgx-06-c08: 4 occurrences of error ID 149
b06-p1-dgx-06-c08: 4 occurrences of error ID 154
b06-p1-dgx-06-c09: 4 occurrences of error ID 149
b06-p1-dgx-06-c09: 4 occurrences of error ID 154
b06-p1-dgx-06-c11: 4 occurrences of error ID 149
b06-p1-dgx-06-c11: 4 occurrences of error ID 154
b06-p1-dgx-06-c12: 4 occurrences of error ID 149
b06-p1-dgx-06-c12: 3 occurrences of error ID 154
b06-p1-dgx-06-c13: 4 occurrences of error ID 149
b06-p1-dgx-06-c13: 4 occurrences of error ID 154
b06-p1-dgx-06-c13: 112 occurrences of error ID 45
b06-p1-dgx-06-c14: 4 occurrences of error ID 149
b06-p1-dgx-06-c14: 4 occurrences of error ID 154
b06-p1-dgx-06-c14: 112 occurrences of error ID 45
b06-p1-dgx-06-c15: 4 occurrences of error ID 149
b06-p1-dgx-06-c15: 4 occurrences of error ID 154
b06-p1-dgx-06-c15: 112 occurrences of error ID 45
b06-p1-dgx-06-c16: 4 occurrences of error ID 149
b06-p1-dgx-06-c16: 4 occurrences of error ID 154
b06-p1-dgx-06-c16: 112 occurrences of error ID 45
b06-p1-dgx-06-c17: 4 occurrences of error ID 149
b06-p1-dgx-06-c17: 4 occurrences of error ID 154
b06-p1-dgx-06-c17: 112 occurrences of error ID 45
b06-p1-dgx-06-c18: 4 occurrences of error ID 149
b06-p1-dgx-06-c18: 4 occurrences of error ID 154
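
On a large cluster the per-host listing gets long, so it can help to roll it up per Xid. The following is a minimal sketch that reads lines in the output format shown above. The function name xid_summary is hypothetical. It counts each input line as one affected host, which holds because each host appears at most once per Xid in that output.

```shell
# xid_summary: read "<host>: <count> occurrences of error ID <xid>" lines on
# stdin and print one line per Xid with the total event count and the number
# of input lines (hosts) that reported it. The function name is illustrative.
xid_summary() {
  awk '{ total[$NF] += $2; hosts[$NF]++ }
       END { for (x in total) print "Xid " x ": " total[x] " events on " hosts[x] " host(s)" }' |
    sort -k2,2n
}

# A few lines taken from the example output above:
xid_summary <<'EOF'
a07-p1-dgx-03-c08: 10158 occurrences of error ID 143
b06-p1-dgx-06-c13: 4 occurrences of error ID 149
b06-p1-dgx-06-c13: 112 occurrences of error ID 45
b06-p1-dgx-06-c14: 4 occurrences of error ID 149
b06-p1-dgx-06-c14: 112 occurrences of error ID 45
EOF
```

For these five lines the summary reports 224 events for Xid 45 on two hosts, 10158 events for Xid 143 on one host, and 8 events for Xid 149 on two hosts. A single host with a very high count, such as the Xid 143 entry, stands out immediately in this view.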