Troubleshooting
NVIDIA Fleet Intelligence (NFI) has two components:
The agent
The service
Since the service is hosted by NVIDIA, most troubleshooting will occur at the agent, which is installed on one or many nodes within the customer environment.
Nodes are often similar or identical, so debugging a single representative node may be enough. However, nodes can differ by network, location, data center, and so on, which can make a truly representative node difficult to identify.
There are two main methods to install the NFI agent:
a base operating system install (for example, Red Hat or Debian/Ubuntu), and
a Helm installation of the agent as a DaemonSet within Kubernetes.
The triage will be slightly different between the two installation methods.
Agent Troubleshooting
Note
For base OS installations, a CUDA repository must be enabled, because installing the Fleet Intelligence agent pulls in required dependencies on both Ubuntu and Red Hat, including Data Center GPU Manager (DCGM) and corelib.
Note
For Helm installations, NFI requires DCGM to already be running. In Kubernetes installations, this is often accomplished through the GPU Operator. DCGM must be installed in Host Engine mode (i.e., shared mode). This shared DCGM mode allows multiple services to access it simultaneously. Please see the details in the Helm install documentation.
If installation of the package is successful, the agent will start and log to the journal. To check the status of the agent service, run the following command:
sudo systemctl status fleetintd
To view the logs, run the following command:
sudo journalctl -f -u fleetintd
Confirm that the GPUs are present and the driver is loaded by running nvidia-smi:
nvidia-smi
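When many nodes need the same checks, it can help to wrap the service and driver checks above in a small script that prints one result line per check. The following is a minimal sketch, not part of the NFI tooling: the function names `run_check` and `triage_agent` are hypothetical, and the only commands it relies on are `systemctl is-active` and `nvidia-smi`.

```shell
# Run a command quietly and print a one-line PASS/FAIL result.
# Usage: run_check "<label>" <command> [args...]
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        return 1
    fi
}

# Hypothetical wrapper: runs the service and driver checks in order
# and returns non-zero if either of them fails.
triage_agent() {
    rc=0
    run_check "fleetintd service is active" systemctl is-active --quiet fleetintd || rc=1
    run_check "nvidia-smi can reach the driver" nvidia-smi || rc=1
    return "$rc"
}

# Example invocation on a node:
#   triage_agent
```

A FAIL line only says which check to look at first; the full `systemctl status` and `journalctl` output is still needed to find the cause.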
Agent Enrollment Troubleshooting
Run the precheck sequence manually to confirm the configuration is correct:
fleetint precheck
Gather the machine information as the NFI agent sees it to confirm configuration:
fleetint machine-info
Note
Log messages such as "msg":"failed to get UUID from dmidecode, trying to read from file" are warnings. They are expected and do not indicate a problem.
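Because the dmidecode UUID warning is expected, it can be filtered out when scanning the agent logs so that other messages stand out. The sketch below assumes the warning text appears in the log exactly as quoted above; the function name `filter_expected_warnings` is hypothetical.

```shell
# Drop the known-benign dmidecode UUID warning from a log stream on stdin.
# "|| true" keeps the exit status at 0 even when every line is filtered out.
filter_expected_warnings() {
    grep -v 'failed to get UUID from dmidecode' || true
}

# Example invocation (reads the last hour of agent logs):
#   sudo journalctl -u fleetintd --no-pager --since "1 hour ago" | filter_expected_warnings
```

This matches on a fixed substring only, so any other warning or error passes through unchanged.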
Confirm connectivity:
ping data.fleet-intelligence.nvidia.com
This confirms DNS and basic connectivity, but keep in mind that ICMP (ping) may be blocked by a firewall or network configuration.
Confirm HTTPS connectivity:
curl -I -m 30 https://data.fleet-intelligence.nvidia.com/test
This should quickly return a 404 (Not Found) response, which indicates that the connection itself succeeded. If the request hangs and then times out, there is a problem with HTTPS connectivity. Check firewall rules and network configuration to confirm that the agent can reach the NFI service.
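The HTTPS check can also be scripted by asking curl for the status code alone. The sketch below relies on curl's `-w '%{http_code}'` option, which prints `000` when no HTTP response was received. The function names `classify_http_code` and `check_https` are hypothetical, and treating any real HTTP status (including 404) as "reachable" is an assumption that follows from the reasoning above.

```shell
# Interpret the status code that curl reports via -w '%{http_code}'.
# 000 means curl never received an HTTP response (DNS failure, timeout, TLS error, ...).
classify_http_code() {
    case "$1" in
        000|"") echo "unreachable" ;;
        [1-5][0-9][0-9]) echo "reachable" ;;
        *) echo "unknown" ;;
    esac
}

# Send a HEAD request with a 30-second limit and classify the result.
check_https() {
    code=$(curl -s -o /dev/null -I -m 30 -w '%{http_code}' "$1") || true
    classify_http_code "$code"
}

# Example invocation:
#   check_https "https://data.fleet-intelligence.nvidia.com/test"
```

An "unreachable" result does not say which layer failed; rerunning curl with `-v` shows whether the failure is in DNS resolution, the TCP connection, or the TLS handshake.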