Release Notes

Version 1.1

New Features:

  • Mute Alert: Added the ability to create and manage rules to mute alerts.

  • Notify Alert: Added the ability to create and manage notify alert rules. Supported channels are:
      - Email
      - Slack (via email)
      - Webhook

  • CVE checking: Added checks for nodes against the NVIDIA CVE database. The results are added as metrics so that they can be muted.

  • New Metrics: Added the following new metrics:
      - dcgm_fi_dev_clocks_event_reasons
      - dcgm_fi_dev_fabric_manager_status
      - dcgm_fi_dev_nvlink_count_symbol_ber_float
      - dcgm_fi_dev_nvlink_count_effective_ber_float

  • GPU Status chart: Added a GPU Status chart to the Dashboard Resource Stats, Utilization section.

  • SXID error suggested actions: Added suggested actions for SXID errors to events and error reports.

  • Summarize events: Added a summary view of repeated occurrences of the same event on a node.

  • Update Agent advice: Agent install advice now displays the latest version of the agent.

  • Display larger charts: Added a button to display larger charts in the detail and debugging pages.

  • Search by hostname: Added a search by hostname to the Inventory page.

  • Allow component telemetry: Component telemetry can now be selected in the debugging pages.

  • Agent Precheck script: Added a precheck script to the agent install.

  • Agent should not enroll without GPUs or with incorrect GPUs: The Agent no longer enrolls nodes that have no GPUs or that have incorrect GPUs.

Fixed Bugs:

  • XID display bug: Fixed a bug where accelerator-nvidia-error-sxid had no display name.

  • GPU reported up wrongly bug: Fixed GPU State incorrectly remaining “up” after XID 94 and XID 95 errors.

  • Compute Zone View stuck bug: Fixed the Compute Zone View getting stuck in a loading state on the Inventory page.

  • NVLink BER metrics bug: Fixed NVLink BER metrics showing all zeros in the debugging page.

  • Error report dialog hang: Fixed the error report modal dialog hanging after clicking Generate.

  • Metric X-axis truncation bug: Fixed truncation of the metric X-axis.

  • Machine index: The GPU index on machine details now matches the GPU Chart tooltip index.

  • Agent Liveness Check Improvement: Improved the Agent liveness check.

  • Panel Coordination bug: Events and alerts from the node detail panel now carry over to the Debug screen.