Fleet Intelligence - Release Notes
Version 1.2
New Features:
Fleet Intelligence API: Enable Fleet Intelligence API for customers.
Migrate to V2 API [1791]: Migrate from NGC V1 API to NGC V2 API.
Enhance agent log [1778]: Add event_id(UUID) to events info in agent message.
Surface DCGM version [1775, 1694, 1695, 1490]: Show the DCGM version in the Machine Details page.
Enable “debug” button on backend component alerts [1766]: The backend work is complete; surface the button for Agent Connectivity, Firmware Version, and similar alerts.
User API access [1642, 1636]: Allow users to access the API.
Scopes for API keys [1641]: Allow API keys to be scoped to one of 3 levels for API access.
Show available storage [1623, 1469]: The storage graphs show usage. Add an available storage line to the graphs.
Power consumption as % of possible [1457]: Provide GPU power consumption as a % of possible power draw.
Add flag to disable local metrics port [1900]: Add a flag to disable the local metrics port.
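The power consumption percentage mentioned above [1457] can be read as current draw divided by the maximum possible draw. The following is only a minimal sketch of that arithmetic: the function name `power_percent` is hypothetical, and it assumes that "possible power draw" means the GPU's power limit in watts, which these notes do not specify.

```python
from typing import Optional


def power_percent(power_draw_watts: float, power_limit_watts: float) -> Optional[float]:
    """Return power draw as a percentage of the power limit.

    Hypothetical helper for illustration only; not the product's code.
    Returns None when the limit is zero or negative, so the caller can
    show a warning instead of dividing by zero.
    """
    if power_limit_watts <= 0:
        return None
    return 100.0 * power_draw_watts / power_limit_watts


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(power_percent(350.0, 700.0))
    print(power_percent(350.0, 0.0))
```

Returning `None` for an unusable limit keeps the divide-by-zero case explicit, so a chart can show a warning rather than fail to render.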
Fixed Bugs:
Clean up error messages [1801]: Fix a bad error message when a customer ID didn’t exist.
Clean up tooltips [1773]: When the user lacks the required permissions, tooltips display the permissions needed and the action is disabled.
Fix map pull [1762]: Load the map locally to enhance performance.
Double soft machine delete [1753]: Fix a bug where an error was displayed if a machine was soft deleted twice without a page refresh.
Fix graphs with divide by zero [1749]: Fix a bug where a graph would not display if it hit a divide by zero. A warning message is displayed instead.
Fix agent export failure [1735]: Fix a bug where the agent export would fail due to a driver hang.
Remove duplicate tooltips [1734, 1733, 1732]: Remove duplicate tooltips.
Improve Alert Detail [1731]: Add line breaks to Component/Status/Reason.
No NVIDIA GPU detected [1724]: When the driver is not installed, the agent should still detect the GPU and report a driver issue.
Better verification checking on URL [1717]: Improve validation of the URL passed to the --server-url flag.
Security enhancements [1715, 1589]: Prevent access to the agent pod from outside the pod; this applies only to Helm-based installs. Set .fleetint file permissions to 0700.
Enhance agent docs [1713]: Add --retention and --compact flags to the agent docs.
No events in alert timeline [1902]: Fix missing events in the alert timeline when an alert is manually muted or unmuted.
RTX alert suppression [1876]: Suppress IMEX alerts for RTX cards since they do not apply.
Security hardening [1836, 1837, 1838, 1839, 1840, 1841, 1842]: Fix several HTTPS checks and add several security checks.
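As an illustration of the kind of URL checking described for the server URL flag [1717], the sketch below validates a URL with Python's standard library. It is not the agent's implementation: the function name `is_valid_server_url` is hypothetical, and requiring the `https` scheme is an assumption made for this example.

```python
from urllib.parse import urlparse


def is_valid_server_url(url: str) -> bool:
    """Return True if url looks like a usable HTTPS server URL.

    Hypothetical helper for illustration only. Assumes the server
    must be reached over HTTPS, which these notes do not state.
    """
    try:
        parsed = urlparse(url)
        # Reading .port raises ValueError for a non-numeric or
        # out-of-range port, so touch it inside the try block.
        _ = parsed.port
    except ValueError:
        return False
    return parsed.scheme == "https" and bool(parsed.hostname)


if __name__ == "__main__":
    # Example inputs are made up for illustration.
    for candidate in ("https://fleet.example.com", "ftp://fleet.example.com", "https://"):
        print(candidate, is_valid_server_url(candidate))
```

Checking the scheme, host, and port up front lets a command-line tool reject a bad value with a clear message before it attempts any network connection.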
Version 1.1
New Features:
Mute Alert: Added the ability to create and manage rules to mute alerts.
Notify Alert: Added the ability to create and manage notify alert rules. Supported channels are:
- Email
- Slack (via email)
- Webhook
CVE checking: Added checks for nodes against the NVIDIA CVE database. The checks are exposed as metrics so that they can be muted.
New Metrics: Added the following new metrics:
- dcgm_fi_dev_clocks_event_reasons
- dcgm_fi_dev_fabric_manager_status
- dcgm_fi_dev_nvlink_count_symbol_ber_float
- dcgm_fi_dev_nvlink_count_effective_ber_float
GPU Status chart: Added a GPU Status chart to the Dashboard Resource Stats, Utilization section.
SXID error suggested actions: Suggestions in events and error reports.
Summarize events: Displays a summary of the same events on a node.
Update Agent advice: Agent install advice now displays the latest version of the agent.
Display larger charts: Added a button to display larger charts in the detail and debugging pages.
Search by hostname: Added a search by hostname to the Inventory page.
Allow component telemetry: Allow component telemetry to be selected in the debugging pages.
Agent Precheck script: Added a precheck script to the agent install.
Agent should not enroll without GPUs or with incorrect GPUs: The Agent no longer enrolls nodes without GPUs or with incorrect GPUs.
Fixed Bugs:
XID display bug: Fixed a bug where accelerator-nvidia-error-sxid had no display name.
GPU reported up wrongly bug: Fixed a bug where GPU State incorrectly remained “up” after XID 94 and 95 errors.
Compute Zone View stuck bug: Fixed Compute Zone View getting stuck in the loading state on the Inventory page.
NVLink BER metrics bug: Fixed NVLink BER metrics showing all zeros on the debugging page.
Error report dialog hang: Fixed the error report modal dialog hanging after clicking Generate.
Metric X-axis truncation bug: Metric X-axis truncation bug.
Machine index: The GPU index on machine details now matches the GPU Chart tooltip index.
Agent Liveness Check Improvement: Agent liveness check improvement.
Panel Coordination bug: An event or alert from the node detail panel now carries over to the Debug screen.