Fleet Intelligence - Release Notes

Version 1.2

New Features:

  • Fleet Intelligence API: Enable the Fleet Intelligence API for customers.

  • Migrate to V2 API [1791]: Migrate from NGC V1 API to NGC V2 API.

  • Enhance agent log [1778]: Add an event_id (UUID) to the event info in agent messages.

  • Surface DCGM version [1775, 1694, 1695, 1490]: Show the DCGM version in the Machine Details page.

  • Enable “debug” button on backend component alert [1766]: The backend work is done; surface the button for Agent Connectivity, Firmware Version, and other backend component alerts.

  • User API access [1642, 1636]: Allow users to access the API.

  • Scopes for API keys [1641]: Allow API keys to be scoped to one of three levels for API access.

  • Show available storage [1623, 1469]: The storage graphs show usage; add an available-storage line to them.

  • Power consumption as % of possible [1457]: Provide GPU power consumption as a percentage of the possible power draw.

  • Add flag to disable local metrics port [1900]: Add a flag to disable the local metrics port.
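The power consumption percentage above can be illustrated with a minimal sketch. The notes do not say how "possible power draw" is determined, so the sketch assumes it is a per-GPU maximum in watts; the function name power_percent and its parameters are hypothetical and not part of the product.

```python
from typing import Optional


def power_percent(draw_watts: float, max_watts: float) -> Optional[float]:
    """Return power draw as a percentage of the maximum, or None if undefined.

    Hypothetical helper: max_watts stands in for the "possible power draw",
    which the release notes do not define.
    """
    if max_watts <= 0:
        # No meaningful maximum is known, so avoid dividing by zero.
        return None
    return 100.0 * draw_watts / max_watts


print(power_percent(350.0, 700.0))  # 50.0
```

Returning None when no maximum is known lets a caller show "not available" instead of a misleading number.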

Fixed Bugs:

  • Clean up error messages [1801]: Fix a bad error message when a customer ID didn’t exist.

  • Clean up tooltips [1773]: When the user lacks the required permissions, tooltips display the permissions needed and the action is disabled.

  • Fix map pull [1762]: Load the map locally to enhance performance.

  • Double soft machine delete [1753]: Fix a bug where soft-deleting a machine twice without a page refresh displayed an error.

  • Fix graphs with divide by zero [1749]: Fix a bug where a graph would not display if its data caused a divide by zero. A warning message is now displayed instead.

  • Fix agent export failure [1735]: Fix a bug where the agent export would fail due to a driver hang.

  • Remove duplicate tooltips [1734, 1733, 1732]: Remove duplicate tooltips.

  • Improve Alert Detail [1731]: Add line breaks to Component/Status/Reason.

  • No NVIDIA GPU detected [1724]: When the driver is not installed, the agent now still detects the GPU and reports a driver issue.

  • Better verification checking on URL [1717]: Stricter validation of the URL passed to the --server-url flag.

  • Security enhancements [1715, 1589]: Prevent access to the agent pod from outside the pod (Helm-based installs only). Set .fleetint files to mode 0700.

  • Enhance agent docs [1713]: Add the --retention and --compact flags to the agent docs.

  • No events in alert timeline [1902]: Fix a bug where no events appeared in the alert timeline when an alert was manually muted or unmuted.

  • RTX alert suppression [1876]: Suppress IMEX alerts for RTX cards since they do not apply.

  • Security hardening [1836, 1837, 1838, 1839, 1840, 1841, 1842]: Fix several HTTPS checks and add several security checks.
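The notes do not list the checks applied to the server URL or the HTTPS fixes, so the following is only a sketch of the kind of validation involved. It assumes an HTTPS scheme, a non-empty host, and a valid port are required; validate_server_url is a hypothetical name, not the agent's actual code.

```python
from urllib.parse import urlsplit


def validate_server_url(url: str) -> str:
    """Return the normalized URL, or raise ValueError if it looks unsafe.

    Hypothetical checks: the real agent's rules are not documented here.
    """
    parts = urlsplit(url.strip())
    if parts.scheme != "https":
        raise ValueError(f"server URL must use https: {url!r}")
    if not parts.hostname:
        raise ValueError(f"server URL has no host: {url!r}")
    # Accessing .port raises ValueError for a non-numeric or out-of-range port.
    _ = parts.port
    return parts.geturl()


print(validate_server_url("https://example.com/api"))
```

Rejecting a bad URL at startup gives a clear error before the agent attempts any connection.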

Version 1.1

New Features:

  • Mute Alert: Added the ability to create and manage rules to mute alerts.

  • Notify Alert: Added the ability to create and manage notify alert rules. Supported channels are:
      - Email
      - Slack (via email)
      - Webhook

  • CVE checking: Added checks for nodes against the NVIDIA CVE database. The checks are added as metrics so that they can be muted.

  • New Metrics: Added the following new metrics:
      - dcgm_fi_dev_clocks_event_reasons
      - dcgm_fi_dev_fabric_manager_status
      - dcgm_fi_dev_nvlink_count_symbol_ber_float
      - dcgm_fi_dev_nvlink_count_effective_ber_float

  • GPU Status chart: Added a GPU Status chart to the Dashboard Resource Stats, Utilization section.

  • SXID error suggested actions: Added suggested actions for SXID errors in events and error reports.

  • Summarize events: Displays a summary of identical events on a node.

  • Update Agent advice: Agent install advice now displays the latest version of the agent.

  • Display larger charts: Added a button to display larger charts in the detail and debugging pages.

  • Search by hostname: Added a search by hostname to the Inventory page.

  • Allow component telemetry: Allow component telemetry to be selected in the debugging pages.

  • Agent Precheck script: Added a precheck script to the agent install.

  • Agent should not enroll without GPUs or with incorrect GPUs: The agent no longer enrolls nodes that have no GPUs or have incorrect GPUs.

Fixed Bugs:

  • XID display bug: Fixed a bug where accelerator-nvidia-error-sxid had no display name.

  • GPU reported up wrongly bug: Fixed a bug where the GPU state incorrectly remained “up” after XID 94 and 95 errors.

  • Compute Zone View stuck bug: Fixed the Compute Zone View getting stuck in a loading state on the Inventory page.

  • NVLink BER metrics bug: Fixed NVLink BER metrics showing all zeros on the debugging page.

  • Error report dialog hang: Fixed the error report modal dialog hanging after clicking Generate.

  • Metric X-axis truncation bug: Fixed truncation of the metric X-axis.

  • Machine index: The GPU index on the Machine Details page now matches the GPU Chart tooltip index.

  • Agent Liveness Check Improvement: Improved the agent liveness check.

  • Panel Coordination bug: An event or alert selected in the node detail panel now carries over to the Debug screen.