Troubleshooting

NVIDIA Fleet Intelligence (NFI) has two components:

  1. The agent

  2. The service

Because the service is hosted by NVIDIA, most troubleshooting happens at the agent, which is installed on one or more nodes within the customer environment.

Nodes are often similar or identical, so debugging a single representative node may be enough. However, nodes can differ by network, location, data center, and so on, which can make a representative node difficult to identify.

There are two main methods to install the NFI agent:

  1. a base operating system install (for example, Red Hat or Debian/Ubuntu), and

  2. a Helm installation of the agent as a DaemonSet within Kubernetes.

The triage will be slightly different between the two installation methods.

Agent Troubleshooting

Note

For base OS installations, a CUDA repository must be enabled, because installing the Fleet Intelligence agent pulls in required dependencies on Ubuntu and Red Hat, including Data Center GPU Manager (DCGM) and corelib.

Note

For Helm installations, NFI requires DCGM to already be running. In Kubernetes installations, this is often accomplished through the GPU Operator. DCGM must be installed in Host Engine mode (i.e., shared mode). This shared DCGM mode allows multiple services to access it simultaneously. Please see the details in the Helm install documentation.

If installation of the package is successful, the agent will start and log to the journal. To check the status of the agent service, run the following command:

sudo systemctl status fleetintd

To view the logs, run the following command:

sudo journalctl -f -u fleetintd
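When many nodes need checking, it can help to reduce the status check to a one-line result per node. The following is a minimal sketch, assuming the unit is named fleetintd as in the commands above. The helper name describe_agent_state and its hint messages are hypothetical and not part of NFI.

```shell
# Hypothetical helper: map a systemd unit state to a short troubleshooting hint.
describe_agent_state() {
  case "$1" in
    active)     echo "running" ;;
    activating) echo "starting: re-check shortly" ;;
    failed)     echo "failed: inspect the journal for fleetintd" ;;
    *)          echo "not running (state: ${1:-unknown})" ;;
  esac
}

# Usage on a node (systemctl is-active prints the unit state):
#   describe_agent_state "$(systemctl is-active fleetintd)"
```

For a bounded view of recent logs instead of a live tail, journalctl also accepts --since (for example, --since "1 hour ago") and --no-pager.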

Confirm that the GPUs are present and the driver is loaded by running nvidia-smi:

nvidia-smi
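To compare the visible GPU count against what a node is expected to have, the listing from nvidia-smi -L can be counted. This is a rough sketch, assuming the usual one-line-per-GPU output that starts with "GPU <index>:". The helper name count_gpus is hypothetical.

```shell
# Hypothetical helper: count GPUs from `nvidia-smi -L` output read on stdin.
# Each GPU is listed on a line that starts with "GPU <index>:".
# `|| true` keeps the exit status zero when grep finds no matching lines.
count_gpus() {
  grep -c '^GPU [0-9]' || true
}

# Usage on a node:
#   nvidia-smi -L | count_gpus
```

A count of 0, or an error from nvidia-smi itself, suggests that the driver is not loaded or that no GPUs are visible on that node.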

Agent Enrollment Troubleshooting

Run the precheck sequence manually to confirm the configuration is correct:

fleetint precheck

Gather the machine information as the NFI agent sees it to confirm configuration:

fleetint machine-info
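Because nodes can differ, it may be useful to save the output of these checks to files so that they can be compared across nodes. The sketch below uses only the commands shown in this section. The helper names collect_diagnostics and run_or_note, and the output file names, are hypothetical. Some commands may require elevated privileges in your environment.

```shell
# Hypothetical helper: save the output of this section's diagnostic commands
# into a directory so they can be compared across nodes.
collect_diagnostics() {
  out_dir="$1"
  mkdir -p "$out_dir" || return 1

  # Run a command if it is installed; otherwise record that it is missing.
  run_or_note() {
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
      "$@" >"$out_dir/$name.txt" 2>&1 || true
    else
      echo "$1: command not found" >"$out_dir/$name.txt"
    fi
  }

  run_or_note precheck     fleetint precheck
  run_or_note machine-info fleetint machine-info
  run_or_note nvidia-smi   nvidia-smi
}

# Usage:
#   collect_diagnostics "/tmp/nfi-diag-$(hostname)"
```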

Note

Messages in logs like "msg":"failed to get UUID from dmidecode, trying to read from file" are warnings; they are expected and do not indicate a problem.

Confirm connectivity:

ping data.fleet-intelligence.nvidia.com

This confirms DNS and basic connectivity, but keep in mind that ICMP (ping) may be blocked by a firewall or network configuration.
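Where ICMP is blocked, name resolution can still be checked on its own. The following is a minimal sketch that uses getent, which queries the system resolver on most Linux distributions. The helper name resolves is hypothetical.

```shell
# Hypothetical helper: succeed if the host name resolves via the system resolver.
resolves() {
  getent hosts "$1" >/dev/null 2>&1
}

# Usage:
#   if resolves data.fleet-intelligence.nvidia.com; then
#     echo "DNS OK"
#   else
#     echo "DNS lookup failed"
#   fi
```

A successful lookup combined with a failing ping points to ICMP filtering rather than a DNS problem.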

Confirm HTTPS connectivity:

curl -I -m 30 https://data.fleet-intelligence.nvidia.com/test

This should quickly return a 404 (Not Found) response, which indicates that the connection succeeded. If the request instead hangs and times out, there is a problem with HTTPS connectivity. Check firewall rules and network configuration to confirm that the agent can access the NFI service.
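The HTTPS check can be scripted by asking curl to print only the status code. This is a sketch under stated assumptions: the helper name interpret_http_code and its messages are hypothetical. curl reports 000 when no HTTP response was received. A status other than 404 still means a response arrived, but it may come from an intermediate proxy rather than the NFI service.

```shell
# Hypothetical helper: interpret the HTTP status code from the HTTPS check.
# curl prints 000 when no HTTP response was received (timeout, DNS, TLS, ...).
interpret_http_code() {
  case "$1" in
    404)             echo "reachable (expected 404)" ;;
    000)             echo "no response: check firewall, proxy, and DNS" ;;
    [1-5][0-9][0-9]) echo "got HTTP $1: a response arrived, possibly from a proxy" ;;
    *)               echo "unexpected value: $1" ;;
  esac
}

# Usage: -s hides progress output, -o /dev/null discards the headers,
# and -w '%{http_code}' prints only the status code.
#   code="$(curl -s -o /dev/null -I -m 30 -w '%{http_code}' https://data.fleet-intelligence.nvidia.com/test)"
#   interpret_http_code "$code"
```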