Fleet Intelligence - Release Notes

Version 1.4.1 and 1.4.2

Changes:

  • Agent-managed compute zones and node groups [2280, 2281]: Compute zones and node groups are now owned and managed by the Fleet Intelligence agent rather than created manually via the UI or API. The following operations have been removed: creating and renaming compute zones, creating and updating node groups, and assigning nodes to node groups. These relationships are now established automatically by the agent.

Fixed Bugs:

  • XID burst window showing identical start and end times [2262]: Fixed the XID burst analysis window displaying the same timestamp for both start and end time when multiple XID events occurred within the burst.
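
To illustrate the corrected behavior in [2262], a burst window can be derived from the earliest and latest event timestamps in the burst. This is a minimal sketch under assumptions, not the product's implementation; `XidEvent` and `burst_window` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class XidEvent:
    """Hypothetical event record: an XID code and when it was observed."""
    xid: int
    timestamp: datetime


def burst_window(events: list[XidEvent]) -> tuple[datetime, datetime]:
    """Return (start, end) spanning every event in the burst."""
    if not events:
        raise ValueError("a burst must contain at least one event")
    timestamps = [e.timestamp for e in events]
    # Start is the earliest event and end is the latest, so the two only
    # coincide when every event shares the same timestamp.
    return min(timestamps), max(timestamps)


events = [
    XidEvent(79, datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)),
    XidEvent(48, datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)),
    XidEvent(79, datetime(2024, 1, 1, 12, 0, 9, tzinfo=timezone.utc)),
]
start, end = burst_window(events)
print(start.isoformat(), end.isoformat())
```

Taking the minimum and maximum over all events, rather than reading both values from a single event, is what keeps the two ends of the window distinct.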

Version 1.4

New Features:

  • Node health history [1447, 1863]: A new health history panel in the Machine Details sidebar shows a timeline of node health status changes over time. Operators can expand individual status entries to view component-level health detail and quickly identify nodes with recurring issues.

  • XID burst analysis in alert timeline [1238, 2231]: XID burst analyzer results are now surfaced directly in alert timelines for XID component alerts. Each detected burst shows the burst category (e.g., “Off the Bus”, “DRAM(HBM)”), job-disruption status, burst time window, per-XID breakdown with mnemonic and severity, and recommended actions. Burst analysis events are visible to cloud provider (NCP) users. A per-customer allowlist controls which customers receive alert enrichment.

  • Ampere and Ada Lovelace GPU support [2098]: Fleet Intelligence now supports NVIDIA Ampere and Ada Lovelace data center GPUs. Newly supported models include the Ampere A100 (40GB and 80GB, PCIe and SXM4) and the Ada Lovelace L40 and L40S, all of which were previously rejected as unqualified.

  • Updated telemetry chart labels [1747]: Telemetry chart tooltips now display all relevant label key-value pairs in a Grafana-compatible format. Chart legends support scrolling and allow focusing on individual series.

  • Integrity Check UI updates [2080]: The Integrity Check section of the UI has been updated with a new layout and interaction model.

  • Webhook referenceId uses ncaId [2081]: Webhook alert notifications now use ncaId as the referenceId field for consistent customer identification across integrations.

  • Agent liveness via dedicated metric [1951]: Agent liveness detection now uses the dedicated fleetint_agent_up metric emitted by the agent at each export interval, enabling more reliable liveness signaling independent of node metadata updates. Older agents automatically fall back to the previous detection method.
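
The liveness change in [1951] can be pictured as "prefer the dedicated metric, otherwise fall back". The sketch below is an assumption-laden illustration, not the service's actual logic: the function name, its parameters, the 60-second export interval, and the three-interval tolerance are all hypothetical. Only the metric name `fleetint_agent_up` comes from these notes.

```python
from typing import Optional


def is_agent_alive(
    now: float,
    last_agent_up: Optional[float],
    last_metadata_update: Optional[float],
    export_interval_s: float = 60.0,
    missed_intervals: int = 3,
) -> bool:
    """Decide liveness from the preferred available signal.

    last_agent_up: time of the latest fleetint_agent_up sample, or None for
        older agents that do not emit the metric.
    last_metadata_update: time of the latest node metadata update, used only
        as the fallback signal.
    The interval and tolerance defaults are illustrative assumptions.
    """
    # Prefer the dedicated metric; fall back for agents that never sent it.
    last_seen = last_agent_up if last_agent_up is not None else last_metadata_update
    if last_seen is None:
        return False
    return (now - last_seen) <= export_interval_s * missed_intervals


print(is_agent_alive(1000.0, 900.0, None))   # metric present and recent
print(is_agent_alive(1000.0, None, 990.0))   # older agent, fallback signal
```

Note that when the metric exists, a recent metadata update does not override a stale metric sample, which reflects the stated goal of making liveness independent of node metadata updates.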

Fixed Bugs:

  • GPU states not shown for newer agent [2286]: Fixed a regression where the gpu_states chart showed no data in the node detail panel for nodes enrolled with newer agent versions.

  • Agent enrollment with partial NVML GPU visibility [2033]: Fixed a bug where the agent failed to export initial machine info when one GPU on a multi-GPU node was not visible to NVML. The agent now exports data for all visible GPUs rather than aborting the entire export.

  • Power Violation Time Y-axis formatting [2029]: Fixed Y-axis labels in the Power Violation Time telemetry chart displaying fractional minute values (e.g., “8.33 min”) instead of clean integer labels.

  • Inconsistent decimal places in GPU Power telemetry [2027]: Fixed inconsistent decimal display in GPU Power Usage and GPU Power Utilization telemetry panels.

  • GPU Memory Utilization tooltip label [2026]: Fixed the telemetry chart tooltip showing “GPU Memory Copy Utilization” instead of “GPU Memory Utilization” for the GPU Utilization (DCGM) component.

  • XID burst analyzer accuracy [2231]: Fixed several accuracy issues in XID burst analysis: the job_disruption flag now uses authoritative catalog lookup instead of a heuristic; burst duration is now correctly computed; open (in-progress) bursts are no longer written to the timeline prematurely; VBIOS version is now correctly read from the node resource schema.
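
The partial NVML visibility fix [2033] listed above amounts to handling a per-GPU failure without abandoning the whole export. The following sketch shows that pattern in isolation. It does not call NVML; `GpuNotVisibleError`, `collect_visible_gpus`, and `fake_query` are hypothetical stand-ins.

```python
from typing import Callable


class GpuNotVisibleError(Exception):
    """Hypothetical error raised when one GPU cannot be queried."""


def collect_visible_gpus(
    gpu_count: int,
    query_gpu: Callable[[int], dict],
) -> tuple[list[dict], list[int]]:
    """Query each GPU index; skip the ones that fail instead of aborting."""
    visible: list[dict] = []
    skipped: list[int] = []
    for index in range(gpu_count):
        try:
            visible.append(query_gpu(index))
        except GpuNotVisibleError:
            # Record the failure and keep going so the remaining GPUs
            # are still exported.
            skipped.append(index)
    return visible, skipped


def fake_query(index: int) -> dict:
    """Stand-in for a real device query; GPU 2 is not visible."""
    if index == 2:
        raise GpuNotVisibleError(f"GPU {index} not visible")
    return {"index": index, "name": "example-gpu"}


visible, skipped = collect_visible_gpus(4, fake_query)
print([g["index"] for g in visible], skipped)
```

Returning the skipped indices alongside the visible GPUs lets the caller report the missing device rather than silently dropping it.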

Version 1.3

New Features:

  • Cross-org alert notification rules [1740, 1744, 1745, 1746, 1792, 1831]: Cloud providers (NCPs) can now create and manage their own alert notification rules scoped to tenant resources. NCP and tenant notify rules are fully isolated: each party sees and manages only its own rules. Mute rules remain visible to both parties as before.

  • New events card on Debugging page [1651, 1856, 1868]: The events card on the Debugging page has been redesigned with a new layout and rendering model. Events and telemetry charts now both render only after submitting the filters form, and a new summarization API returns event counts per time bucket to improve readability when there are many events in the selected period.

  • Enhanced chart legend [1893]: Chart legends now handle a large number of entries with improved scrollability and layout.

  • Incident details in alert timeline [1887, 1946]: Incident details from agent state events are now stored with alerts and surfaced in the alert timeline side panel.

  • Improved telemetry chart label display [1692]: Metrics charts now display all relevant labels in the new label format on hover, with improved handling for multiple labels per series.

  • Separate machine info export endpoint [1739]: Machine inventory information (CPU, GPU, OS, driver versions) is now exported via a dedicated endpoint rather than bundled into OTLP telemetry, following standard OpenTelemetry practices. Liveness heartbeats are also now independent of the telemetry export interval.

  • Remove redundant disk info from node side panel [1915, 1936]: Disk used data has been removed from the machine detail side panel since it is already available in the metrics charts.
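
The events card item above mentions a summarization API that returns event counts per time bucket. The notes do not describe its request or response shape, so the sketch below only illustrates the bucketing idea itself. The function name and the five-minute bucket size are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_events_per_bucket(
    timestamps: list[datetime],
    start: datetime,
    bucket: timedelta,
) -> dict[datetime, int]:
    """Map each bucket start time to the number of events inside it."""
    counts: Counter[datetime] = Counter()
    for ts in timestamps:
        if ts < start:
            continue  # outside the selected period
        # Integer division finds which bucket the event falls into.
        index = (ts - start) // bucket
        counts[start + index * bucket] += 1
    return dict(sorted(counts.items()))


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
events = [start + timedelta(minutes=m) for m in (1, 4, 7, 16)]
counts = count_events_per_bucket(events, start, timedelta(minutes=5))
for bucket_start, n in counts.items():
    print(bucket_start.isoformat(), n)
```

Rendering one count per bucket instead of one marker per event is what keeps a chart readable when the selected period contains many events.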

Fixed Bugs:

  • Excessive decimal places in telemetry panels [1880, 2030]: Fixed telemetry panel values displaying four decimal places; values now display with two decimal places.

  • Duplicate Running PIDs telemetry for OS component [1882, 2028]: Fixed a bug where “Running PIDs” telemetry was erroneously included as the first entry for any component after the OS component had been selected once in the session.

  • XID analyzer parsing error [2001]: Fixed the XID burst analyzer to use the raw kernel message (extra_info.data.raw_kmsg) instead of the processed health-state message, so XID patterns are correctly matched.

  • GB200 GPU duplicate serial numbers [1879]: Fixed duplicate serial numbers reported for GB200 GPUs in the machine details inventory.

  • Page not refreshing after node deletion [1869]: Fixed a bug where the node list did not auto-refresh after a successful node deletion, leaving the deleted node visible and actionable.

  • hasEnrolledMachines missing from API response [1903]: Fixed the /v1/customers API response to include the hasEnrolledMachines field.

  • XID suggested action text [1560, 1442]: Fixed unmapped suggested action codes (INVESTIGATE_SW/USER, REPORT_ISSUE (IF SEEN >1 PER DAY)) displaying as raw codes instead of user-friendly text. Also fixed a typo: “Invetigatory” → “Investigatory”.
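
For the XID analyzer parsing fix [2001] above, the sketch below shows why matching against the raw kernel message matters: the XID number sits in the driver's `NVRM: Xid (...)` log line, which a processed summary may no longer contain. The event dictionary layout is an assumption inferred from the field path `extra_info.data.raw_kmsg`, and `parse_xid` and the regular expression are hypothetical, not the analyzer's own code.

```python
import re
from typing import Optional

# Matches kernel lines such as "NVRM: Xid (PCI:0000:3b:00): 79, ...".
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")


def parse_xid(event: dict) -> Optional[tuple[str, int]]:
    """Return (pci_address, xid) from the event's raw kernel message."""
    raw = event.get("extra_info", {}).get("data", {}).get("raw_kmsg", "")
    match = XID_PATTERN.search(raw)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


event = {
    "message": "GPU fell off the bus",  # processed text, not used for matching
    "extra_info": {
        "data": {
            "raw_kmsg": "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, "
                        "GPU has fallen off the bus."
        }
    },
}
print(parse_xid(event))
```

An event that lacks the raw message yields `None` here, so a caller can distinguish "no XID found" from a parsed result.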

Version 1.2

New Features:

  • Fleet Intelligence API: Enable Fleet Intelligence API for customers.

  • Migrate to V2 API [1791]: Migrate from NGC V1 API to NGC V2 API.

  • Enhance agent log [1778]: Add event_id (UUID) to the event info in agent messages.

  • Surface DCGM version [1775, 1694, 1695, 1490]: Show the DCGM version in the Machine Details page.

  • Enable “debug” button on backend component alert [1766]: With the backend work complete, surface the button for backend component alerts such as Agent Connectivity and Firmware Version.

  • User API access [1642, 1636]: Allow users to access the API.

  • Scopes for API keys [1641]: Allow API keys to be scoped to one of 3 levels for API access.

  • Show available storage [1623, 1469]: The storage graphs show usage; add an available storage line to the graphs.

  • Power consumption as % of possible [1457]: Provide GPU power consumption as a % of possible power draw.

  • Add flag to disable local metrics port [1900]: Add a flag to disable the local metrics port.
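
The power consumption item [1457] above describes GPU power draw expressed as a percentage of possible draw. A minimal sketch of that arithmetic follows. It assumes "possible power draw" means the GPU's power limit in watts, which the notes do not state, and `power_percent` is a hypothetical name.

```python
def power_percent(usage_watts: float, limit_watts: float) -> float:
    """Return power draw as a percentage of the maximum possible draw."""
    if limit_watts <= 0:
        # Guard against a missing or zero limit rather than dividing by it.
        raise ValueError("limit_watts must be positive")
    return round(100.0 * usage_watts / limit_watts, 2)


print(power_percent(350.0, 700.0))
```

Rounding to two decimal places matches the display precision that later releases adopted for telemetry panels.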

Fixed Bugs:

  • Clean up error messages [1801]: Fix an unhelpful error message shown when a customer ID does not exist.

  • Clean up tooltips [1773]: Tooltips display the required permissions when the user lacks them, and the action is disabled accordingly.

  • Fix map pull [1762]: Load the map locally to enhance performance.

  • Double soft machine delete [1753]: Fix a bug where soft-deleting a machine twice without a page refresh displayed an error.

  • Fix graphs with divide by zero [1749]: Fix a bug where a graph would not display if its calculation involved a divide by zero. Display a warning message instead.

  • Fix agent export failure [1735]: Fix a bug where the agent export would fail due to a driver hang.

  • Remove duplicate tooltips [1734, 1733, 1732]: Remove duplicate tooltips.

  • Improve Alert Detail [1731]: Add line breaks to Component/Status/Reason.

  • No NVIDIA GPU detected [1724]: When the driver is not installed, the agent now still detects the GPU and reports a driver issue.

  • Better verification checking on URL [1717]: Stricter validation of the URL passed to the --server-url flag.

  • Security enhancements [1715, 1589]: Prevent access to the agent pod from outside the pod (Helm-based installs only). Set permissions on .fleetint files to 0700.

  • Enhance agent docs [1713]: Add the --retention and --compact flags to the agent docs.

  • No events in alert timeline [1902]: Fix missing alert timeline events when an alert is manually muted or unmuted.

  • RTX alert suppression [1876]: Suppress IMEX alerts for RTX cards since they do not apply.

  • Security hardening [1836, 1837, 1838, 1839, 1840, 1841, 1842]: Fix several HTTPS checks and add several security checks.
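
The notes do not say which checks the URL verification in [1717] performs. As a sketch of what validating a server URL can look like, the example below uses Python's standard `urllib.parse`. The `validate_server_url` name and the specific rules (HTTPS only, a host must be present, the port must be well formed) are assumptions, not the agent's documented behavior.

```python
from urllib.parse import urlparse


def validate_server_url(url: str) -> str:
    """Return the URL unchanged if it passes the checks, else raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"server URL must use https: {url!r}")
    if not parsed.hostname:
        raise ValueError(f"server URL has no host: {url!r}")
    try:
        parsed.port  # raises ValueError for a malformed or out-of-range port
    except ValueError as exc:
        raise ValueError(f"server URL has an invalid port: {url!r}") from exc
    return url


for candidate in ("https://fleet.example.com:8443/api", "fleet.example.com"):
    try:
        print("ok:", validate_server_url(candidate))
    except ValueError as err:
        print("rejected:", err)
```

Failing early with a specific message is generally easier to debug than a connection error raised later during export.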

Version 1.1

New Features:

  • Mute Alert: Added the ability to create and manage rules to mute alerts.

  • Notify Alert: Added the ability to create and manage notify alert rules. Supported channels are:

    • Email

    • Slack (via email)

    • Webhook

  • CVE checking: Added checks of nodes against the NVIDIA CVE database. Results are added as metrics so that they can be muted.

  • New Metrics: Added the following new metrics:

    • dcgm_fi_dev_clocks_event_reasons

    • dcgm_fi_dev_fabric_manager_status

    • dcgm_fi_dev_nvlink_count_symbol_ber_float

    • dcgm_fi_dev_nvlink_count_effective_ber_float

  • GPU Status chart: Added a GPU Status chart to the Dashboard Resource Stats, Utilization section.

  • SXID error suggested actions: Added suggested actions for SXID errors in events and error reports.

  • Summarize events: Added a summary of repeated events on a node.

  • Update Agent advice: Agent install advice now displays the latest version of the agent.

  • Display larger charts: Added a button to display larger charts in the detail and debugging pages.

  • Search by hostname: Added a search by hostname to the Inventory page.

  • Allow component telemetry: Allow component telemetry to be selected in the debugging pages.

  • Agent Precheck script: Added a precheck script to the agent install.

  • Agent should not enroll without GPUs or with incorrect GPUs: The agent no longer enrolls nodes without GPUs or with incorrect GPUs.

Fixed Bugs:

  • XID display bug: Fixed a bug where accelerator-nvidia-error-sxid had no display name.

  • GPU reported up wrongly bug: Fixed GPU State incorrectly remaining “up” after XID 94 and 95 errors.

  • Compute Zone View stuck bug: Fixed Compute Zone View getting stuck in a loading state on the Inventory page.

  • NVLink BER metrics bug: Fixed NVLink BER metrics showing all zeros on the Debugging page.

  • Error report dialog hang: Fixed the error report modal dialog hanging after clicking Generate.

  • Metric X-axis truncation bug: Fixed truncation of the X-axis in metric charts.

  • Machine index: The GPU index on machine details now matches the GPU Chart tooltip index.

  • Agent Liveness Check Improvement: Improved the agent liveness check.

  • Panel Coordination bug: The event or alert selected in the node detail panel now carries over to the Debug screen.