1. Fleet Intelligence User Guide (EA)

Fleet Intelligence is designed to provide organizations with robust inventory, integrity, and health monitoring capabilities for their NVIDIA GPU fleet. A host-based agent communicates with NVIDIA’s Fleet Intelligence service, giving you a near-real-time view of all enrolled GPU hosts.

1.1. Accessing Fleet Intelligence

Navigate to the link provided for the Fleet Intelligence dashboard. After you arrive at the dashboard, you will be prompted to log in with NVIDIA GPU Cloud (NGC), as shown below.

Login Box in UI.

After logging in, you will see a list of Accounts/Organizations/Teams you can access. Select the Team your administrator has enabled for Fleet Intelligence, then click “Continue”.

Team selection in Login UI.

You may land on the NGC Catalog after logging in (see the label in the upper-left corner). If so, click NGC Catalog and select Fleet Intelligence.

Selecting Fleet Intelligence from NGC Catalog.

You are now at the initial screen of the Fleet Intelligence dashboard, shown below. The following sections tour the dashboard and explain how to customize it, add and group hosts, inspect machine details, assess events, and generate reports.

1.2. The Dashboard

The dashboard, shown in the following figure, is the central panel for Fleet Intelligence. From left to right, the dashboard header includes:

  1. NVIDIA GPU Cloud tool (upper-left) — “Fleet Intelligence”.

  2. Fleet Intelligence tabs — Dashboard, Inventory, Debugging, Reports.

  3. Backend indicator — shows which backend is active for Fleet Intelligence.

  4. Alerts — opens active alerts for your GPU fleet.

  5. Help (?) — opens NVIDIA GPU Cloud help.

  6. NVIDIA GPU Cloud menu.

Dashboard overview.

The main panel provides an inventory overview, alert summary, and a time-based overview of the usage and utilization of your entire GPU fleet. The main panel view can easily be switched to GPU power, speed, and memory usage by clicking the “Other” button in the “Resource Status” section.

Selecting the “View all” buttons in the Compute Zones, Node Groups or Total Machines panels will take you to the respective pages within the Inventory tab. Please see the ‘Inventory’ section of this manual for details.

The “View all” button in the Total Alerts panel and the “Alerts” button in the header both open a quick summary of all active alerts in your GPU fleet. As shown below, alerts are grouped by node in this view and can be expanded to show the type of alert for each host.

Alerts overview.

1.3. Inventory

Under the “Inventory” tab, you will see three views and an “Add New” button. We will start by adding a node.

Install wizard panel 1.

The wizard will walk you through the steps to download the .deb or .rpm package from the Fleet Intelligence Agent GitHub page.

Add new overview.

After you download the agent, the wizard presents the installation commands for a number of operating systems (for example, Ubuntu and Red Hat derivatives). If needed, these include adding the CUDA repository to your system for the NVIDIA Data Center GPU Manager (DCGM) installation. If the DCGM repository has been added, the Fleet Intelligence agent installation will include DCGM as a dependency.

Install wizard panel 2.

In step 3, create a new Fleet Intelligence service key, or reuse an existing, unexpired key. If generating a new key, choose the validity period and click Next. The key will be displayed for you to copy.

Install wizard panel 3.

The final screen will provide the commands to run in order to enroll the agent with the service.

Install wizard panel 4.

Note

Service keys are shown only once. Copy and store them securely before exiting the wizard.
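Because a service key cannot be displayed again, one option is to save it to a file that only your user can read as soon as the wizard shows it. The following is a minimal bash sketch; the function name `save_service_key` and any file location you pass to it are hypothetical and are not part of the Fleet Intelligence agent.

```shell
# Hypothetical helper: read a service key from standard input and store it
# in a file that only the current user can read or write.
save_service_key() {
  local key_file="$1"
  local key
  # Read one line from stdin so the key is not passed as a command argument.
  IFS= read -r key || [ -n "$key" ] || return 1
  (
    # Inside this subshell, new files and directories are owner-only.
    umask 077
    mkdir -p "$(dirname "$key_file")"
    printf '%s\n' "$key" > "$key_file"
  ) || return 1
  # Tighten permissions in case the file already existed with looser ones.
  chmod 600 "$key_file"
}
```

For example, running `save_service_key "$HOME/.config/fleet-intelligence/service.key"` (an illustrative path) waits for you to paste the key and press Enter. A dedicated secrets manager is a stronger choice where one is available.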

Note (Early Access): Revoking edit keys is not yet supported; this will be added for General Availability.

Once the agent initializes, it should take about one to five minutes for the node information to appear in the Inventory -> Machine View list. The Compute Zones and Node Groups will appear as Unassigned (see below). See the Standard Groupings section to create groups and assign your machines.

Machine View overview.

1.3.1. Machine Details

In any of the inventory panels, clicking a machine’s host name opens a side panel on the right that presents details about the state of the machine. There are four tabs on this panel. The Detail tab presents metadata about the machine, including the Fleet Intelligence agent, GPU driver, CUDA, kernel, and OS versions. Sub-tabs of the Detail tab (not shown) provide collected metadata about the GPU, CPU, network, and disk.

The Status tab provides a brief summary of the status of major components and provides the integrity check of the GPU along with an overall status for the node.

The third tab, Telemetry, shows graphs of the collected telemetry for utilization, temperature, and power consumption for major components such as CPU, GPU, memory, and disk.

The last tab, Alerts, shows the current events causing the node to be marked as “Unhealthy”. If the node is healthy, there will be no alerts. Alerts can have 2-3 levels of detail below them.

1.3.2. Events/Alerts

For an overall view of the fleet’s alert/event status, click the alert icon in the upper menu bar; an “Alerts” panel appears on the right. Unlike the machine-specific panel, this panel shows alerts for all machines in the fleet. Each alert has multiple levels that can be drilled into to see individual alerts and every reported instance.

Alerts panel overview.

1.3.3. Standard Groupings

Node groups and Compute Zone groups can be created to group and manage a larger number of nodes. Node groups collect machines of a similar type (e.g., similar GPUs, CPUs, or hardware manufacturer), while Compute Zone groups collect machines in similar locations (e.g., data centers, states, or countries). A node can belong to only one Node Group and one Compute Zone. Use tags for customized groups.

Once standard groupings are created, assign nodes to Node Groups and Compute Zones by selecting one or more machines, clicking the menu icon on the right side of one of the selected machines, and selecting “Assign Machine(s)”.

Assign Machine Menu. Assign Machine Wizard.

1.3.4. Tags

Tags can be added to or removed from a machine by clicking the machine name. On the Machine Details page, click the tags icon (at the bottom) to open the Machine Tags dialog, where tags can be added or removed as shown below.

Tags Dialog Box.

1.4. Debugging - Detailed Analysis

In the “Debugging” tab you can use the “Debugging Scope” selection (Filter) on the left to select a single node from which to get detailed telemetry. Furthermore, you must select an individual component group of telemetry and a time range. Within the component group you can select multiple telemetry metrics to show on a single screen.

1.5. Reports

Under the “Reports” tab you will find the reports you can run about your inventory and the errors reported within your fleet. Selecting a report lets you choose a time range and, optionally, narrow the results by Compute Zone, Node Group, or the “Filter Machines by Tag” option.

The inventory report may be downloaded as a .csv file using the button in the upper right. The edit report icon can be used to change the report selection criteria.

Diagnostics Report.
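A downloaded .csv report can be summarized from the command line. The sketch below counts rows per value of a named column using awk. The function name `count_by_column`, the file name, and the column headers are all hypothetical; check the header row of your actual export before use. It also assumes a simple CSV with no quoted fields containing commas.

```shell
# Hypothetical helper: count rows per distinct value in one CSV column.
# Assumes the first row is a header and no field contains a quoted comma.
count_by_column() {
  local csv_file="$1" column="$2"
  awk -F',' -v col="$column" '
    NR == 1 {
      # Locate the requested column by its header text.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { counts[$idx]++ }
    END { for (k in counts) print k "," counts[k] }
  ' "$csv_file" | sort
}
```

For example, `count_by_column inventory.csv "Node Group"` would print one `value,count` line per group, provided the export contains a column with that exact header.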

1.5.1. Agent Enrollment Troubleshooting

To follow the agent’s service logs in real time while troubleshooting enrollment, run:

sudo journalctl -f -u fleetintd
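Beyond following the log live, a few standard systemd commands can help narrow down an enrollment problem. The sketch below wraps them in a hypothetical function, `fleetint_diag`. It assumes the agent runs as the `fleetintd` systemd unit named in the command above and that it writes its errors to the journal. The 15-minute window is an arbitrary choice.

```shell
# Hypothetical helper: collect basic systemd diagnostics for the agent.
fleetint_diag() {
  local unit="${1:-fleetintd}"
  # Show whether the service is running; a stopped unit returns non-zero,
  # so ignore the exit status and continue.
  systemctl status "$unit" --no-pager || true
  # Print log entries from the last 15 minutes without opening a pager.
  sudo journalctl -u "$unit" --since "15 minutes ago" --no-pager
  # Print only messages at priority "err" or more severe from the current boot.
  sudo journalctl -u "$unit" -p err -b --no-pager
}
```

Run `fleetint_diag` with no arguments to use the default unit name, or pass a different unit name as the first argument.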