Monitoring

Fleet Command provides monitoring capabilities with system logging, usage analytics, and user activity tracking.

Fleet Command Logs

Fleet Command allows you to access your logs from the Fleet Command Console.

By default, only minimal system logs are collected. To enable detailed system and application logs, go to the Settings page. Enable All Logs for System and Location, and turn on the toggles for Application and Deployment logs.

../_images/logs-04.png

Note

This will increase log storage on Fleet Command. You can view the usage in Fleet Command Usage Analytics. Logs older than 14 days are not accessible.

In the Fleet Command user interface, navigate to Logs.

../_images/logs-01.png

After you navigate to Logs, the page displays as shown below.

../_images/logs-02.png
  • You can select the following values from the drop-down to view the corresponding logs.

    • Location: The location name in the organization.

    • System: The system name that is associated with a location.

    • Deployment: The deployment name from the drop-down list.

    • Component: Select one of the following components from the drop-down.

      Service                                   Query
      ----------------------------------------  --------------------------------------------
      eac                                       component:eac
      efa                                       component:efa
      egx-bootstrap.service                     component:egx-bootstrap.service
      egxd-cred-proxy.service                   component:egxd-cred-proxy.service
      egx installer_syslog                      component:installer_syslog
        (available only if the edge system is
        a real bare metal system)
      Ext-auth agent                            component:extauth
      fluentbit                                 component:fluentbit
      helm-operator                             component:helm
      egx KRS (used for kubelet TLS Bootstrap)  component:egx-krs
      remote-management                         component:vnclog (Remote Console 1.0)
                                                component:RemoteConsole (Remote Console 2.0)
      etcd                                      component:etcd
      kube-apiserver                            component:kube-apiserver
      kube-controller-manager                   component:kube-controller-manager
      kube-proxy                                component:kube-proxy
      kube-scheduler                            component:kube-scheduler
      kubelet                                   component:kubelet.service
      calico-node                               component:calico-node
      calico-kube-controllers                   component:calico-kube-controllers
      core-dns                                  component:kube-dns
      nvidia-device-plugin                      component:nvidia-device-plugin-ds
      docker                                    component:docker.service
      kernel                                    component:kernel
      containerd                                component:containerd.service
      ssh                                       component:ssh.service

  • Adjust the log time frame by choosing one of the preset values from the drop-down, or choose the custom value and then select a specific date range below.

    ../_images/logs-03.png ../_images/logs-17.png

Note

For logs, the number of pages is restricted to 60,000. If a query exceeds this limit, the warning shown above appears. To avoid this, provide a more specific query.
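One way to make a query more specific is to pair a component filter from the component queries listed earlier with a keyword. The following is a minimal sketch, assuming the search field accepts Graylog-style `field:value` terms joined with `AND`; confirm the syntax your console accepts before relying on it. The helper name `build_log_query` and the keyword `failed` are hypothetical.

```python
# Sketch only: assumes Graylog-style search syntax (field:value terms joined by AND).
# build_log_query is a hypothetical helper, not part of Fleet Command.

def build_log_query(component, keywords=()):
    """Return a query string that filters on one component and optional keywords."""
    terms = [f"component:{component}"]
    for word in keywords:
        # Quote each keyword so multi-word phrases are matched as a unit.
        escaped = word.replace('"', '\\"')
        terms.append(f'"{escaped}"')
    return " AND ".join(terms)


if __name__ == "__main__":
    # Component value taken from the list of component queries.
    print(build_log_query("kubelet.service", ["failed"]))
```

The resulting string can be pasted into the search field; adding more keywords narrows the result set further.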

Deployment Logs - All Locations

  • Deployment logs for all locations can be viewed by clicking the ellipsis.

    ../_images/logs-05.png

Deployment Logs - Specific Locations

  • To query deployment logs for a specific location, click the ellipsis for the location under that deployment.

  • This will open a tab/window to the Graylog dashboard with the query shown below.

    ../_images/logs-06.png

Troubleshooting Deployments

The Fleet Command search dashboard accepts additional keywords, which you can use to troubleshoot or to pull fine-grained logs for a specific system, component, and so on.

  • To see the status of all deployments for a location by viewing the helm logs:

    ../_images/logs-07.png
  • To pull more fine-grained Helm logs for a deployment:

    ../_images/logs-08.png
  • To see the status of all applications for a location by viewing the kubelet logs:

    ../_images/logs-09.png
  • To pull more fine-grained logs to see if an application is running or failed:

    ../_images/logs-10.png
  • To get application logs from stdout/stderr streams:

    ../_images/logs-11.png

System Logs

All system logs from a location

System logs for a location can be viewed by clicking the ellipsis under the location.

../_images/logs-12.png

All system logs from an EGX System

To view system logs from an EGX System, click the action menu for the specific system under the location.

../_images/logs-13.png
  • To get more specific logs for your application, you can select multiple values as shown below:

    ../_images/logs-14.png

Note

Select the value from each dropdown to combine the queries to get more accurate matches.

Downloading Logs

It is also possible to export your search results as a CSV file. Navigate to the Fleet Command user interface and select Logs.

../_images/logs-15.png

Click on the Export button to download the logs as a CSV file.

../_images/logs-16.png
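Once exported, the CSV file can be processed offline. The sketch below counts log rows per distinct value of one column using Python's standard `csv` module. The column names (`timestamp`, `source`, `message`) and the sample rows are assumptions for illustration only; check the header row of your own export.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an exported CSV."""
    reader = csv.DictReader(csv_file)
    # Accessing fieldnames reads the header row.
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    # Hypothetical sample standing in for an exported file; real exports
    # may use different column names.
    sample = io.StringIO(
        "timestamp,source,message\n"
        "2024-01-01T00:00:00Z,system-a,started\n"
        "2024-01-01T00:00:05Z,system-b,started\n"
        "2024-01-01T00:00:09Z,system-a,failed\n"
    )
    for value, count in count_by_column(sample, "source").most_common():
        print(value, count)
```

For a real export, open the downloaded file with `open(path, newline="")` and pass the file object to `count_by_column` in place of the in-memory sample.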

Usage Analytics

Fleet Command usage analytics provide insight into GPU usage, registry usage, and log storage usage.

Note

Fleet Command Stack 0.4.47 and above is required for usage analytics.

To view usage, navigate to the NGC user interface and select Organization. Then, under Organization, choose Fleet Command.

../_images/usage-analytics-01.png

After selecting Fleet Command, the following page displays your organization’s Fleet Command usage.

../_images/usage-analytics-02.png

On the usage page, you will find the following KPIs:

  • System Name: Choose a name for the new system.

  • Current: Displays the current GPU usage count.

  • Max: Displays the maximum GPU usage in the current month.

  • Private Registry: Displays the storage used by the private registry.

  • Logs: Displays the current log storage usage.

Users can export the usage analytics to a CSV with the Download CSV button.

To disable or enable the view of each section, toggle the slider located in the section:

  • GPUs Under Management: Displays GPU usage over the current month.

  • Maximum GPUs Under Management: Displays the maximum GPU usage for each month of the current year.

  • Storage: Displays both Private Registry and Logs storage usage for each month of the current year.

  • GPU Inventory: Lists the locations and systems, with the GPU type and the number of GPUs attached.

The location name and managed GPUs (GPU names) can be selected under the GPU Inventory option.

User Activity

Fleet Command user activity provides insight into users’ activities within the organization.

  • To view the user activity, navigate to the NGC user interface and select Organization.

  • Next, under the Organization, select Audit.

    ../_images/user-activity-01.png
  • After selecting Audit, the following page displays your organization’s user activity report.

    ../_images/user-activity-02.png
  • Select the date range for the user activity you want to review, and then select Create a New Request. This generates a downloadable report of the user activity.

Metrics

Fleet Command allows you to view system and application-specific metrics for your deployments at edge locations. Metrics are numerical values that measure aspects of your resources at regular intervals. With metrics, you can monitor and analyze the performance of your deployed machine learning inference solutions over time. This information can help you make adjustments to improve resource consumption and the overall performance of your deployments.

The following metrics are available in Fleet Command:

  • System metrics: measurement of edge system resource utilization, including

    • Total CPU Utilization (per core)

    • Total RAM Utilization

    • Total GPU Utilization

    • Total Storage Utilization

    • Total Network Utilization

  • Application utilization metrics: measurement of application system resource utilization, including

    • App CPU Utilization

    • App RAM Utilization

    • App GPU Utilization

    • App Storage Utilization

    • App Network Utilization

Using the Fleet Command metric CLI, you can view detailed metrics under several categories (“buckets”) across organizations, view all metrics in a bucket, or view summary metrics for a particular organization within a given time period. The metrics categories include the following:

Bucket           Description
---------------  ------------------------------------------------------
app-utilization  Application usage metrics for memory, power, GPU, etc.
custom-app       Custom metrics exposed by applications.
system           Metrics for disk, memory, CPU, and network usage.

The following are examples of using the metric command:

  • To see what buckets are contained within an org:

ngc fleet-command buckets --org <org-name>
  • To retrieve a list of defined metrics within a bucket:

ngc fleet-command metric --bucket <bucket-name> --org <org-name>
  • To see a summary of information about a given metric:

ngc fleet-command metric <metric-name> --bucket <bucket-name> --from-date <from-date> --to-date <to-date>
  • A raw Flux query passthrough is also available that returns a JSON response:

ngc fleet-command metric query <query>

For more information on the metrics command and options, refer to the NGC CLI documentation.
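When these commands are run from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the subcommands and options shown above; the helper names `metric_command` and `flux_query_command` are hypothetical, the date placeholders are left unfilled because the expected date format is not specified here, and the sample Flux query assumes that the `system` bucket can be addressed by that name in Flux.

```python
import shlex


def metric_command(metric=None, bucket=None, org=None, from_date=None, to_date=None):
    """Build the argument list for `ngc fleet-command metric`.

    Only the options shown in this section are used.
    """
    args = ["ngc", "fleet-command", "metric"]
    if metric:
        args.append(metric)
    if bucket:
        args += ["--bucket", bucket]
    if org:
        args += ["--org", org]
    if from_date:
        args += ["--from-date", from_date]
    if to_date:
        args += ["--to-date", to_date]
    return args


def flux_query_command(flux):
    """Build the argument list for the raw Flux query passthrough."""
    return ["ngc", "fleet-command", "metric", "query", flux]


if __name__ == "__main__":
    # Placeholder values; substitute your own metric name and dates.
    cmd = metric_command("<metric-name>", bucket="system",
                         from_date="<from-date>", to_date="<to-date>")
    print(shlex.join(cmd))

    # Hypothetical Flux query: last hour of the "system" bucket, first 10 rows.
    flux = 'from(bucket: "system") |> range(start: -1h) |> limit(n: 10)'
    print(shlex.join(flux_query_command(flux)))
```

Building a list instead of a single string avoids shell-quoting problems with the Flux query; the list can be passed directly to `subprocess.run(cmd, check=True)` on a machine where the NGC CLI is installed and configured.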

Custom Application Metrics

You can also define and provide custom application metrics from deployed applications and access these aggregated metrics for all the deployments.

Custom application metrics are expected to be exposed as a Prometheus metrics exporter endpoint. For more information on writing custom Prometheus exporters, refer to the Prometheus development documentation.

To expose application metrics, use the following annotations on your application Pod that serves metrics.

  • prometheus.io/scrape: Enable scraping for this pod.

  • prometheus.io/scheme: Default value is http.

  • prometheus.io/path: Override the path for the metrics endpoint on the service (default: '/metrics').

  • prometheus.io/port: Override the port (default: 9102).
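To illustrate what such an endpoint serves, the sketch below exposes one gauge in the Prometheus text exposition format using only the Python standard library. It is an illustration under stated assumptions, not the recommended exporter: production applications would normally use an official Prometheus client library. The metric name `myapp_queue_depth`, the helper names, and the annotation values in `POD_ANNOTATIONS` (for example `"true"` for scraping) are hypothetical; in a Pod manifest, annotations like these are placed under `metadata.annotations`.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory gauge values; a real application would update these.
METRICS = {"myapp_queue_depth": 0.0}

# Annotation keys from the list above; the values shown are assumptions.
POD_ANNOTATIONS = {
    "prometheus.io/scrape": "true",
    "prometheus.io/scheme": "http",
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "9102",
}


def render_metrics(metrics):
    """Render name/value pairs as gauges in the Prometheus text format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve metrics only on the annotated path.
        if self.path != POD_ANNOTATIONS["prometheus.io/path"]:
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve():
    """Start the exporter on the annotated port (blocks until interrupted)."""
    port = int(POD_ANNOTATIONS["prometheus.io/port"])
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` inside the application container starts the endpoint. Keeping the path and port in one dictionary makes it easier to keep the running server and the Pod annotations in agreement.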

You can retrieve custom metrics through the CLI with the ngc fleet-command metric command by specifying the custom-app bucket.