1. Fleet Intelligence User Guide (EA)#
Fleet Intelligence is designed to provide organizations with robust inventory, integrity, and health monitoring capabilities for their NVIDIA GPU fleet. By installing a host-based agent that communicates with NVIDIA’s Fleet Intelligence service, you gain access to a near-real-time visualization of all enrolled GPU hosts.
1.1. Accessing Fleet Intelligence#
Navigate to the link provided for the Fleet Intelligence dashboard. After you arrive at the dashboard, you will be prompted to log in with NVIDIA GPU Cloud (NGC), as shown below.
After logging in, you will see a list of Accounts/Organizations/Teams you can access. Select the Team your administrator has enabled for Fleet Intelligence, then click “Continue”.
You may land on the NGC Catalog after logging in (see the label in the upper-left corner). If so, click NGC Catalog and select Fleet Intelligence.
You are now at the initial screen of the Fleet Intelligence dashboard, as shown below. The following sections tour the dashboard and explain how to customize it, add and group hosts, inspect machine details, assess events, and generate reports.
1.2. The Dashboard#
The dashboard, shown in the following figure, is the central panel for Fleet Intelligence. The core features of the dashboard header include, from left to right:
NVIDIA GPU Cloud tool (upper-left) — “Fleet Intelligence”.
Fleet Intelligence tabs — Dashboard, Inventory, Debugging, Reports.
Backend indicator — shows which backend is active for Fleet Intelligence.
Alerts — opens active alerts for your GPU fleet.
Help (?) — opens NVIDIA GPU Cloud help.
NVIDIA GPU Cloud menu.
The main panel provides an inventory overview, alert summary, and a time-based overview of the usage and utilization of your entire GPU fleet. The main panel view can easily be switched to GPU power, speed, and memory usage by clicking the “Other” button in the “Resource Status” section.
Selecting the “View all” buttons in the Compute Zones, Node Groups or Total Machines panels will take you to the respective pages within the Inventory tab. Please see the ‘Inventory’ section of this manual for details.
The Total Alerts panel’s “View all” button and the “Alerts” button in the header both open a quick summary of all active alerts in your GPU fleet. As shown below, alerts are grouped by node in this view and can be expanded to show the type of alert for each host.
1.3. Inventory#
Under the “Inventory” tab, you will see three views and an “Add New” button. We will start by adding a node: click “Add New” to open the wizard.
The wizard will walk you through the steps to download the .deb or .rpm package from the Fleet Intelligence Agent GitHub page.
After you download the agent, the wizard presents the installation commands for a number of operating systems (for example, Ubuntu and Red Hat derivatives). Where needed, these include adding the CUDA repository to your system for the NVIDIA Data Center GPU Manager (DCGM) installation. If the DCGM repository has been added, the Fleet Intelligence agent installation will include DCGM as a dependency.
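The choice between the .deb and the .rpm package can be scripted. The following is a minimal sketch and not part of the Fleet Intelligence tooling: the function name pkg_format_from_os_release is hypothetical, and the sketch assumes the host provides a standard os-release file. Use the commands shown by the wizard for the actual installation.

```shell
# Hypothetical helper: print "deb" or "rpm" based on an os-release file.
# Usage: pkg_format_from_os_release [path]   (default: /etc/os-release)
pkg_format_from_os_release() {
    file="${1:-/etc/os-release}"
    # Source the file in a subshell so ID and ID_LIKE do not leak to the caller.
    (
        ID=""
        ID_LIKE=""
        . "$file" || exit 1
        case " $ID $ID_LIKE " in
            *" debian "*|*" ubuntu "*)          echo deb ;;
            *" rhel "*|*" fedora "*|*" centos "*) echo rpm ;;
            *)                                  echo unknown; exit 1 ;;
        esac
    )
}
```

Calling pkg_format_from_os_release with no argument inspects the local host. A result of "unknown" means the distribution is neither Debian-like nor Red Hat-like according to its ID and ID_LIKE fields.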
In step 3, create a new Fleet Intelligence service key, or reuse an existing, unexpired key. If generating a new key, choose the validity period and click Next. The key will be displayed for you to copy.
The final screen will provide the commands to run in order to enroll the agent with the service.
Note
Service keys are shown only once. Copy and store them securely before exiting the wizard.
Note (Early Access): Revoking edit keys is not yet supported; this will be added for General Availability.
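Because a service key is displayed only once, it helps to save it immediately to a file that only your user can read. The sketch below is one possible approach; the helper name store_service_key and the example path are illustrative assumptions, not something the wizard or agent requires. Reading the key from standard input keeps it out of the shell history and the process list.

```shell
# Hypothetical helper: save a key read from stdin to a file readable only by
# the current user.
# Usage: store_service_key <destination-file>
store_service_key() {
    dest="$1"
    (
        umask 077          # files created in this subshell get mode 600
        cat > "$dest"
    ) && chmod 600 "$dest" # also tighten the mode if the file already existed
}

# Example (the path is an illustrative choice); paste the key, then press Ctrl-D:
#   store_service_key "$HOME/.fleet-intelligence-key"
```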
Once the agent initializes, it should take about one to five minutes for the node information to appear in the Inventory -> Machine View list. The Compute Zones and Node Groups will appear as Unassigned (see below). Use the next section to create groups and assign your agents.
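While waiting for the node to appear, you can confirm locally that the agent service is running. The sketch below is a generic polling helper, written under the assumption that the agent runs as the systemd unit fleetintd (the unit name used in the troubleshooting command in this guide). The name wait_until is hypothetical. A successful check only shows that the service is active on the host; it does not confirm that the node has reached the Inventory list.

```shell
# Hypothetical helper: re-run a command until it succeeds or a timeout passes.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
wait_until() {
    timeout_s="$1"
    interval_s="$2"
    shift 2
    elapsed=0
    while ! "$@"; do
        # Give up once the allotted time has been used.
        [ "$elapsed" -ge "$timeout_s" ] && return 1
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}

# Example: poll every 10 seconds for up to 5 minutes.
#   wait_until 300 10 systemctl is-active --quiet fleetintd
```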
1.3.1. Machine Details#
In any of the inventory panels, upon clicking on a machine’s host name, a side panel will appear on the right and present details about the state of the machine. There are four tabs on this panel. The Detail tab presents metadata about the machine, including the Fleet Intelligence agent, GPU driver, CUDA, kernel, and OS versions. Sub-tabs of the Detail tab (not shown) provide collected metadata information about GPU, CPU, Network and Disk.
The Status tab provides a brief summary of the status of major components and provides the integrity check of the GPU along with an overall status for the node.
The third tab, Telemetry, shows graphs of the collected telemetry for utilization, temperature, and power consumption for major components such as CPU, GPU, memory, and disk.
The last tab, Alerts, shows the current events causing the node to be marked as “Unhealthy”. If the node is healthy, there will be no alerts. Alerts can have 2-3 levels of detail below them.
1.3.2. Events/Alerts#
For an overall view of the fleet alert/event status, click the “Alerts” button in the upper menu bar; an “Alerts” panel will appear on the right. Unlike the machine-specific panel, this panel shows alerts for all machines in the fleet. Each alert has multiple levels that can be drilled into to see individual alerts and every reported instance.
1.3.3. Standard Groupings#
Node Groups and Compute Zones can be created to group and manage a larger number of nodes. Node Groups collect machines of a similar type (e.g., similar GPUs, CPUs, or hardware manufacturer), while Compute Zones collect machines in similar locations (e.g., data centers, states, or countries). A node can reside in only one Node Group and one Compute Zone. Use tags for customized groups.
Once standard groupings are created, nodes may be assigned to Node Groups and Compute Zones by selecting one or more machines, clicking the menu icon on the right side of one of the machines, and selecting “Assign Machine(s)”.
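If you plan assignments for many machines before applying them in the dashboard, a quick check can catch a host listed twice, since each node belongs to a single Node Group and a single Compute Zone. The sketch below assumes a hypothetical planning file of your own with a header row and the columns host, node group, and compute zone. Neither the file nor the helper name find_duplicate_hosts is part of Fleet Intelligence.

```shell
# Hypothetical helper: print each host that appears more than once in a
# comma-separated planning file (first column = host, first row = header).
find_duplicate_hosts() {
    # Skip the header row, then print a host the second time it is seen.
    awk -F, 'NR > 1 && seen[$1]++ == 1 { print $1 }' "$1"
}

# Example:
#   find_duplicate_hosts assignments.csv
```

Empty output means every host appears exactly once in the plan.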
1.4. Debugging - Detailed Analysis#
In the “Debugging” tab you can use the “Debugging Scope” selection (Filter) on the left to select a single node from which to get detailed telemetry. Furthermore, you must select an individual component group of telemetry and a time range. Within the component group you can select multiple telemetry metrics to show on a single screen.
1.5. Reports#
Under the “Reports” tab you will find the available reports on your inventory and on errors reported within your fleet. Selecting a report lets you choose a time range and, optionally, narrow the results by Compute Zone, Node Group, or tag (using the “Filter Machines by Tag” option).
The inventory report may be downloaded as a .csv file using the button in the upper right. A separate control can be used to change the report selection criteria.
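A downloaded report can be inspected from the command line before opening it in a spreadsheet. The sketch below makes no assumption about the report's column names. It does assume a plain comma-separated file with a single header row and no commas or line breaks inside quoted fields. The helper names csv_columns and csv_row_count are hypothetical.

```shell
# Hypothetical helper: list the column names of a CSV file, one per line.
csv_columns() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n'
}

# Hypothetical helper: count the non-empty data rows (everything after the header).
csv_row_count() {
    tail -n +2 "$1" | grep -c .
}

# Example (the file name is illustrative):
#   csv_columns inventory-report.csv
#   csv_row_count inventory-report.csv
```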
1.5.1. Agent Enrollment Troubleshooting#
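To watch the agent’s log output while troubleshooting enrollment, follow the systemd journal for the fleetintd unit:
sudo journalctl -f -u fleetintd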
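On a busy host the journal can be noisy, so it may help to narrow the output. The sketch below is a simple keyword filter. The keyword list is an assumption made for illustration and does not reflect a documented fleetintd log format, and the helper name filter_agent_errors is hypothetical.

```shell
# Hypothetical helper: keep only log lines that mention common failure words.
# Reads from stdin; matching is case-insensitive.
filter_agent_errors() {
    grep -i -E --line-buffered 'error|fail|denied|timeout'
}

# Examples:
#   sudo journalctl -f -u fleetintd | filter_agent_errors
#   sudo journalctl -u fleetintd --since "15 minutes ago" --no-pager | filter_agent_errors
```

The second example reviews recent history instead of following new output, which is useful right after an enrollment attempt.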