User Guide (v1.2)#

Fleet Intelligence provides organizations with inventory, integrity, and health monitoring for their NVIDIA GPU fleet. After you install a host-based agent that communicates with NVIDIA’s Fleet Intelligence service, you get a near-real-time view of all enrolled hosts with GPUs.

Accessing Fleet Intelligence#

Open the Fleet Intelligence dashboard. When the page loads, log in with NVIDIA GPU Cloud (NGC).

After you log in, you see the accounts, organizations, and teams you can access. Select the team your administrator enabled for Fleet Intelligence, then click Continue. For more information about NGC accounts, organizations, and teams, see the NGC User Guide.

If NGC Catalog opens after you log in, use the upper-left menu: click NGC Catalog, then select Fleet Intelligence.

Selecting Fleet Intelligence from NGC Catalog.

You are now on the Fleet Intelligence home screen. The following section walks through the dashboard: customizing the view, adding and grouping hosts, inspecting machine details, reviewing events, and generating reports.

The Dashboard#

The dashboard, shown in the following figure, is the central panel for Fleet Intelligence. The core features of the header include, from left to right:

  1. NVIDIA GPU Cloud tool (upper-left) — Fleet Intelligence.

  2. Fleet Intelligence tabs — Dashboard, Inventory, Debugging, Reports, Settings.

  3. Info icon — shows the current version of Fleet Intelligence.

  4. Alerts — opens active alerts for your GPU fleet.

  5. Help (?) — opens NVIDIA GPU Cloud help.

  6. NVIDIA GPU Cloud menu.

Dashboard overview.

The main panel provides an inventory overview, an alert summary, and a time-based overview of the usage and utilization of your entire GPU fleet. To switch the view to GPU power, speed, and memory usage, click the Others button in the Resource Status section.

Selecting the View All button in the All Compute Zones, All Node Groups, or All Total Machines panels opens the corresponding pages in the Inventory tab. See the Inventory section of this guide for details.

Both the View All button in the Total # of Alerts panel and the Alerts button in the header open a quick summary of all active alerts in your GPU fleet. Alerts are grouped by node, and each node can be expanded to show the types of alerts on that host.

Inventory#

Under the Inventory tab, you see three views and an Add New button. To add a node, click Add New; a new window opens with the installation wizard.

The wizard walks you through downloading the .deb or .rpm package from the Fleet Intelligence Agent GitHub page, or through instructions for a Helm chart installation on Kubernetes.

Select an installation method (for example, Ubuntu or Red Hat derivatives). The wizard then presents the commands to install the agent for that method, including, if needed, the commands that add the CUDA repository to your system for the DCGM installation. If the NVIDIA Data Center GPU Manager (DCGM) repository has been added, the Fleet Intelligence agent installation pulls in DCGM as a dependency.

In step 3, follow the link to the Fleet Intelligence enrollment token management page. Create a new token or reuse an existing, unexpired token. If you generate a new token, choose the validity period and click Next. The token is shown for you to copy.

Note

Enrollment tokens are shown only once. Copy and store them securely before exiting the wizard.
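Because a token cannot be displayed again after the wizard closes, one option is to save it to a file that only your user account can read. The following Python sketch shows one way to do that on Linux. The store_token and file_mode helpers and the choice of file location are assumptions made for illustration; Fleet Intelligence does not prescribe where tokens are kept.

```python
import os
import stat
from pathlib import Path


def store_token(token: str, path: Path) -> None:
    """Write an enrollment token to `path` with owner-only permissions.

    Hypothetical helper for this guide, not part of Fleet Intelligence.
    """
    # O_EXCL makes the call fail if the file already exists, so a
    # previously saved token is never overwritten silently.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    fd = os.open(path, flags, 0o600)  # 0o600: read/write for the owner only
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(token.strip() + "\n")


def file_mode(path: Path) -> int:
    """Return the permission bits of `path` (for example 0o600)."""
    return stat.S_IMODE(path.stat().st_mode)
```

Calling store_token a second time with the same path raises FileExistsError, which is a deliberate choice here: replacing a stored token should be an explicit step.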

Lastly, run the command shown under Enable Backend Communication to enroll the agent with the Fleet Intelligence service.

Once the agent initializes, the node information should appear in the Inventory -> Machine View list within about one to five minutes. The node's Compute Zone and Node Group appear as Unassigned. See the Standard Groupings section to create groups and assign your machines.

Machine Details#

In any of the inventory panels, when you click a machine’s host name, a side panel opens on the right with details about the state of the machine. There are four tabs on this panel. The Detail tab presents metadata about the machine, including the Fleet Intelligence agent, GPU driver, CUDA, kernel, and OS versions. Sub-tabs of the Detail tab (not shown) provide collected metadata for GPU, CPU, network, and disk.

The Status tab summarizes major component status, includes the GPU integrity check, and shows overall node status.

The third tab, Telemetry, shows graphs of the collected telemetry for utilization, temperature, and power consumption for major components such as CPU, GPU, memory, and disk.

The last tab, Alerts, shows the current alerts that cause the node to be marked as “Unhealthy”. If the node is healthy, there are no critical alerts. There can still be active alerts for non-critical issues such as low disk space, low memory, and security alerts.

Alerts can have two or three levels of detail. Select the menu icon next to an alert to mute events for the whole fleet or to create a notification rule for email, a Slack channel, or a webhook. See the Alert Configuration section for more details.

The machine Alerts panel can also use “Historical Alerts” mode, which shows alerts that occurred in the past but have since been resolved.
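The relationship between alerts, node health, and the Historical Alerts mode can be sketched as a small data model. This is only an illustration of the rules described above: the Alert class, its critical and resolved fields, and the function names are assumptions, not the service's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    name: str
    critical: bool
    resolved: bool = False


def node_status(alerts: List[Alert]) -> str:
    """A node is Unhealthy only while it has an active critical alert."""
    has_active_critical = any(a.critical and not a.resolved for a in alerts)
    return "Unhealthy" if has_active_critical else "Healthy"


def active_alerts(alerts: List[Alert]) -> List[Alert]:
    """Alerts shown in the default view: not yet resolved."""
    return [a for a in alerts if not a.resolved]


def historical_alerts(alerts: List[Alert]) -> List[Alert]:
    """Alerts shown in Historical Alerts mode: already resolved."""
    return [a for a in alerts if a.resolved]
```

Under this model, a node with only a low disk space alert stays Healthy while still listing one active alert.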

Events/Alerts#

For a fleet-wide view of alert and event status, click the alerts icon in the upper menu bar. An Alerts panel opens on the right. Unlike the machine-specific panel, it lists alerts for all machines in the fleet. Each alert has multiple levels you can drill into to see individual alerts and each reported instance.

Alerts can have two or three levels of detail. Select the menu icon next to an alert to mute events for the whole fleet or to create a notification rule for email, a Slack channel, or a webhook. See the Alert Configuration section for more details.

The Alerts panel can also use “Historical Alerts” mode, which shows alerts that occurred in the past but have since been resolved.

Standard Groupings#

Create Node Groups and Compute Zones to organize and manage large numbers of nodes. Node Groups are for machines of a similar type (for example, similar GPUs, CPUs, or hardware manufacturer), while Compute Zones are for machines in similar locations (for example, data centers, states, or countries). A node can reside in only one Node Group and one Compute Zone. Use Tags for customized groups.

Once standard groupings are created, assign nodes to Node Groups and Compute Zones by selecting one or more machines, clicking the menu icon on the right side of one of the selected machines, and then selecting Assign Machine(s).
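The grouping rules above (exactly one Node Group and one Compute Zone per node, plus any number of tags) can be pictured with a minimal Python model. The Machine class and its assign method are hypothetical, and the sketch assumes that assigning a machine to a new group replaces its previous one, which this guide does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

UNASSIGNED = "Unassigned"  # label shown for newly enrolled machines


@dataclass
class Machine:
    """Hypothetical model of a machine's grouping state."""
    hostname: str
    node_group: str = UNASSIGNED      # at most one Node Group
    compute_zone: str = UNASSIGNED    # at most one Compute Zone
    tags: Set[str] = field(default_factory=set)  # any number of tags

    def assign(self, node_group: Optional[str] = None,
               compute_zone: Optional[str] = None) -> None:
        """Set the Node Group and/or Compute Zone.

        Assumption: a new assignment replaces the old value, because a
        machine holds a single value for each grouping.
        """
        if node_group is not None:
            self.node_group = node_group
        if compute_zone is not None:
            self.compute_zone = compute_zone
```

Storing each grouping as a single field, and tags as a set, mirrors why tags are the right tool when a machine needs to belong to several custom groups at once.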

Tags#

To add or remove tags on a machine, click the machine name. In Machine Details, click the tags icon (at the bottom) to open the Machine Tags dialog, where you can add or remove tags.

Debugging - Detailed Analysis#

In the Debugging tab, use the Debugging Scope control (filter) on the left to choose a single node for detailed telemetry. You must also select a telemetry component group and a time range. Within that group you can select multiple metrics to show on one screen.

Reports#

On the Reports tab, you can run reports on your inventory and on errors reported in your fleet. When you select a report, choose a time range and, optionally, a Compute Zone and a Node Group. You can also narrow the report with the Filter Machines by Tag option.

You can download the inventory report as a CSV file using the button in the upper right. Use the Edit report button to change the report selection criteria.
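A downloaded CSV report can be summarized with a few lines of Python. The sketch below counts machines per group. The column names (hostname, node_group, compute_zone) and the sample rows are invented for illustration, since this guide does not list the report's actual columns; check the header row of your own download and adjust the column argument accordingly.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO

# Made-up sample data; a real report's columns may differ.
SAMPLE_REPORT = """hostname,node_group,compute_zone
gpu-node-01,h100-nodes,dc-east
gpu-node-02,h100-nodes,dc-west
gpu-node-03,,dc-west
"""


def machines_per_group(report: TextIO, column: str = "node_group") -> Dict[str, int]:
    """Count report rows per value of `column`.

    Empty values are counted as "Unassigned" (an assumption of this sketch).
    """
    reader = csv.DictReader(report)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in report header")
    return dict(Counter(row[column] or "Unassigned" for row in reader))


counts = machines_per_group(io.StringIO(SAMPLE_REPORT))
print(counts)
```

For a file on disk, pass an open file object instead of the sample string, for example open(path, newline="", encoding="utf-8").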

Settings#

The Settings tab lets you configure Fleet Intelligence options such as:

  • Managing agent enrollment tokens.

  • Data retention period.

Agent Enrollment Token Management#

The Agent Enrollment Token Management page lets you create new tokens, view existing token expiration dates, and revoke existing tokens.

Data Retention Period#

The Data Retention Period page lets you set how long Fleet Intelligence retains data. The default retention period is 365 days.
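To get a feel for what a retention period means for older data, the following Python sketch computes the earliest date still covered by a given setting. It assumes retention is counted in whole days back from the current date; how the service actually schedules data removal is not described in this guide, and the function name is hypothetical.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 365  # default stated in this guide


def oldest_retained_date(today: date,
                         retention_days: int = DEFAULT_RETENTION_DAYS) -> date:
    """Return the earliest date still inside the retention window.

    Assumption: the window is `retention_days` whole days ending today.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return today - timedelta(days=retention_days)
```

For example, oldest_retained_date(date.today(), 90) shows how far back data would reach if the period were shortened to 90 days.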

Alert Configuration#

The Alert Configuration page configures notification and mute settings for Fleet Intelligence. It summarizes alert rules, which you can create or delete from individual machine pages.

There are two primary tabs on the page:

  • Notify Alert: Create and manage notify alert rules for specific components and scopes.

  • Mute Alert: Create and manage mute alert rules for specific components; mute rules always apply to the entire fleet.

Each tab lists the full set of alert rules (notify or mute). The Notify tab also includes a delivery log of sent notifications.

To create an alert rule, click Create Alert Rule in the upper-right corner of the tab. The Create Alert Rule dialog opens. Use it to choose the component and scope, notification method (email, Slack, webhook), and notification frequency.

To edit or delete an existing alert rule, click the menu icon button to the right of the rule.
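If you choose the webhook notification method, you need an HTTP endpoint to receive the notifications. The following Python sketch is a minimal receiver built on the standard library. It assumes notifications arrive as HTTP POST requests with a JSON object in the body; this guide does not document the payload format, so the sketch only parses and prints whatever it receives rather than relying on specific field names. The decode_notification, NotificationHandler, and serve names, and port 8080, are illustrative choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_notification(body: bytes) -> dict:
    """Parse a request body as a JSON object (assumed payload format)."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class NotificationHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and prints the decoded payload."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = decode_notification(body)
        except ValueError:  # covers invalid UTF-8 and invalid JSON
            self.send_response(400)
            self.end_headers()
            return
        print(json.dumps(payload, indent=2, sort_keys=True))
        self.send_response(204)  # received, nothing to return
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (blocks the calling thread)."""
    HTTPServer(("0.0.0.0", port), NotificationHandler).serve_forever()
```

Call serve() to start listening. A production endpoint would also need TLS and some way to authenticate the sender, which this sketch leaves out.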

Agent Enrollment Troubleshooting#

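To follow the log output of the fleetintd service while troubleshooting enrollment, run the following command on the host:

sudo journalctl -f -u fleetintd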
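When the live log is too noisy, it can help to pull a recent window of entries and keep only the lines that look like problems. The Python sketch below does this with journalctl options for unit, time window, and plain output. The keyword list is a guess, since the agent's log format is not described in this guide, and both function names are hypothetical. Reading the journal may require sudo or membership in a group with journal access.

```python
import subprocess
from typing import Iterable, List, Sequence


def recent_agent_log(unit: str = "fleetintd",
                     since: str = "15 minutes ago") -> List[str]:
    """Return recent journal lines for `unit` (requires systemd)."""
    result = subprocess.run(
        ["journalctl", "--unit", unit, "--since", since,
         "--no-pager", "--output", "cat"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def matching_lines(lines: Iterable[str],
                   keywords: Sequence[str] = ("error", "fail", "denied")) -> List[str]:
    """Keep lines containing any keyword, ignoring case.

    The default keywords are assumptions, not known agent messages.
    """
    lowered = tuple(k.lower() for k in keywords)
    return [line for line in lines if any(k in line.lower() for k in lowered)]
```

A typical use is matching_lines(recent_agent_log()), run on the host where the agent is installed.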