Troubleshooting - NVIDIA Docs


NVIDIA Fleet Command - (Latest Version)

Fleet Command Logs

Fleet Command collects log messages from edge systems and locations. The edge system sources include Kubernetes services, running Kubernetes system pods, Fleet Command stack system pods, and user application pod output. The log messages are recorded, tagged with additional keys and values, and aggregated into a central database. You can then query this database with the web UI and the command-line interface, using the keys/values as filters.

Fleet Command allocates 400 GB of storage for logs and rotates them within that space. When usage reaches 400 GB, the oldest logs are removed.

There are two settings for logging messages from edge locations in Fleet Command:

System and location logging
Application and deployment logging

Both settings are available on the Fleet Command > Settings page.

By default, both logging options are disabled and can be enabled by an administrator.

Enabling all system and location or application and deployment logging increases log utilization on Fleet Command. Refer to Fleet Command Usage Analytics to learn how to view the log usage.

System and Location Logging

System and location logging has two options: Fleet Command Only or All Logs. Selecting Fleet Command Only will only send a subset of logs from components running on the edge location systems, while All Logs will send all logs from system components back to the Fleet Command logging service.

Logs from edge systems are categorized by component, which corresponds to the service on the system that generated the log. You can use these components in the logging screen to filter messages by the component that generated them.

When Fleet Command Only is selected, logs from the following components are sent:

Component	Description	Used for
kernel	Linux kernel messages	Hardware-level issues
kubelet.service	Kubernetes node agent
RemoteConsole	Remote Console access messages (collected from the cloud service, not the edge system)

When All Logs is selected, logs from the following components are sent in addition to the components for Fleet Command Only:

Component	Description	Used for
auditd.service	System event auditing	Hardware-level issues
sshd.service	Secure shell server
containerd.service	Container system services (was dockerd.service in older versions)	Errors/messages relating to downloading and running container images
egx-bootstrap.service	Bootstrap service
egxd-cred-proxy.service	Credential Proxy Service
egx-nlm.service	Node Lifecycle Management service	Remote Application Access
helm	Helm Operator	Errors/messages relating to fetching, installing application charts
kube-proxy	Kubernetes application proxy
kube-apiserver	Kubernetes API services
calico-kube-controllers	Kubernetes networking services
calico-node	Kubernetes networking services
kube-scheduler	Kubernetes resource scheduling
etcd	Kubernetes configuration storage
nvidia-device-plugin-ds	NVIDIA device plugin for Kubernetes
eac	Edge admission controller	Errors/messages related to allowing or denying application deployments based on requested system resources (e.g. hostPaths, etc)
fluentbit	Log forwarding
efa	Edge federation agent

The components listed in the previous table are subject to change.

All system logs from a location

You can view system logs for a location by selecting View Logs from the options button for the location.

All system logs from Fleet Command

To view system logs from Fleet Command, select View Logs from the action menu of the system.

To get more specific logs for your application, specify a search term and one or more filters.

Application and Deployment Logging

You can enable or disable application and deployment logging. When this option is enabled, edge locations send logs from your application deployments to the Fleet Command logging service. When it is disabled, application deployment logs are not sent. Existing messages remain available until they are removed under the fourteen-day deletion policy.

Logs from application deployments are categorized by deployment name. The logging messages are created by output (stdout and stderr) from containers running in the deployment. You can use the deployment name to filter messages from a particular deployment only.
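Because deployment log messages come from container stdout and stderr, an application only needs to write to those streams for its messages to be collected. The following is a minimal sketch of a container entrypoint script; `log_info` and `log_error` are hypothetical helper names used for illustration and are not part of Fleet Command.

```shell
# Hypothetical helpers for a container entrypoint script.
# log_info writes a message to stdout; log_error writes to stderr.
log_info() {
    printf 'INFO %s\n' "$*"
}

log_error() {
    printf 'ERROR %s\n' "$*" >&2
}

# Example messages, one on each stream.
log_info "application starting"
log_error "configuration file not found, using defaults"
```

Messages written to a file inside the container instead of these streams would not be covered by the behavior described above.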

Application deployment logs only contain messages from running containers in deployments and do not include any messages from system components that might be related to creating and launching a deployment.

For example, a deployment could fail because the Helm chart could not be fetched. In this case, there are no messages for the deployment name in the logging screen. However, log messages from the Helm component might describe the issue with fetching the Helm chart.

Viewing Logs

Select Fleet Command > Logs.

You can select values from the following filters to limit the number of log messages:

Location: The name of a location in the organization.
System: The name of a system associated with a location.
Deployment: The deployment name from the drop-down list.
Component: The component name from the drop-down list.

Adjust the log timeframe by choosing a preset value from the time interval menu or by specifying a date range.

A query returns at most 60,000 log messages. If a query exceeds this limit, a warning is displayed. To avoid this, provide a more specific query.

Deployment Logs - All Locations

Select Fleet Command > Deployments.
Click the actions button and select View Logs.

Deployment Logs - Specific Locations

Select Fleet Command > Deployments.
Select the deployment from the table of deployments.
On the deployment details page, click the options button for the location and select View Logs.

Troubleshooting Deployments

The Fleet Command search dashboard allows for additional keywords to be used to troubleshoot or pull fine-grained logs specific to each system, component, etc.

To see the status of all deployments for a location by viewing the helm logs:
To pull more fine-grained Helm logs for a deployment:
To see the status of all applications for a location by viewing the kubelet logs:
To pull more fine-grained logs to see if an application is running or failed:
To get application logs from stdout/stderr streams:

Downloading Logs

To download logs with the web interface, perform the following steps:

Select Fleet Command > Logs.
Select the filters to apply and click Export to download the logs to a CSV file.
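Once a CSV export is saved locally, standard shell tools can give a quick summary before you open it in a spreadsheet. The sketch below counts the lines that contain a search term. It makes no assumption about the column layout of the export; `count_matches` is a hypothetical helper and `logs.csv` is an assumed file name.

```shell
# count_matches is a hypothetical helper: it prints the number of lines in a
# file that contain a fixed search term, ignoring case.
count_matches() {
    file="$1"
    term="$2"
    grep -ciF -- "$term" "$file"
}

# Example: count lines mentioning "error" in an export saved as logs.csv.
# The file name is an assumption; use the name your browser saved.
if [ -f logs.csv ]; then
    count_matches logs.csv error
fi
```

The count is per line, so a message that spans several lines in the file may be counted more than once.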

To download logs with the NGC CLI, perform the following steps:

Run the ngc fleet-command log download command:
```
$ ngc fleet-command log download --range 30 --system demo-system-0 --location demo-location --component helm --name fc.log
```
Log messages are downloaded to the fc.log file. The file includes all log messages over the last 30 seconds from the helm component running on the system demo-system-0 in location demo-location.

Refer to the Fleet Command CLI documentation for more information.
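To collect logs from several components in one pass, the download command above can be repeated with a different --component value each time. The following sketch only prints the commands (a dry run) so that you can review them first. It uses only the flags shown in the example above; `fc_log_cmd` is a hypothetical helper, and the system, location, and component values are placeholders.

```shell
# Placeholder values taken from the example above; replace with your own.
SYSTEM="demo-system-0"
LOCATION="demo-location"
RANGE=30

# fc_log_cmd is a hypothetical helper, not part of the NGC CLI.
# It prints the download command for one component, writing to <component>.log.
fc_log_cmd() {
    component="$1"
    printf 'ngc fleet-command log download --range %s --system %s --location %s --component %s --name %s.log\n' \
        "$RANGE" "$SYSTEM" "$LOCATION" "$component" "$component"
}

# Dry run: print one command per component instead of executing it.
for component in helm kubelet.service kernel; do
    fc_log_cmd "$component"
done
```

After reviewing the printed commands, you could run them by pasting them into a shell where the NGC CLI is installed and configured.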

Debugging with the Remote Console

You can also use the remote console feature of Fleet Command to help you troubleshoot issues with your deployments. Refer to Remote Console for more information.
