Monitoring

Fleet Command provides monitoring through usage analytics, user activity tracking, and metrics.

Fleet Command usage analytics provide insights into GPU usage, registry usage, and log usage.

Note

Fleet Command Stack version 0.4.47 or later is required for usage analytics.

To view the usage, navigate to the NGC application and select Organization > Usage in the left navigation.

usage-analytics-01.png

On the Usage page, select Entitled Products > Fleet Command.

usage-analytics-02.png

The Overview tab displays a summary of the GPU and storage usage, as well as system types and systems in use.

  • GPU Usage Overview: Displays the current and monthly peak total GPU usage.

  • Storage Usage Overview: Displays the current logs and private registry storage usage.

  • System Types: Displays various system types in use.

  • System Inventory: Displays a list of systems in use.

The GPU Usage tab displays the following:

  • Current GPU Usage: Displays the GPU usage by type for the current month.

    usage-analytics-04.png


  • Monthly Peak: Displays the monthly GPU peak usage by GPU type. You can adjust the month under Timeframe.

  • Daily Peak: Displays the daily GPU peak usage for the selected GPU type and timeframe. You can change the GPU type to display using the dropdown.

    usage-analytics-05.png


  • Total Monthly Peak Trend: Displays the current year’s monthly GPU peak trend. Mouse over the graph to see the exact count on the timeline. You can change the date range using the Timeframe selector.

    usage-analytics-06.png


The Storage Usage tab displays both the Private Registry and Logs storage usage in GB for each month in the current year. You can select a different year using the Timeframe selector.

  • Logs: Displays the current Logs storage usage.

  • Private Registry: Displays the storage used by the private registry.

Mouse over the graph to view the specific storage size.

usage-analytics-07.png


To export the GPU and storage usage analytics as CSV, click Download CSV. The data is bundled and downloaded as a ZIP file.
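Once the ZIP file is downloaded, the CSV data can be inspected programmatically. The following is a minimal sketch, assuming only that the archive contains one or more CSV files with a header row; the file names and column names inside the archive are not documented here, so the code discovers them instead of hard-coding them. The function name `read_usage_csvs` is hypothetical.

```python
import csv
import io
import zipfile


def read_usage_csvs(zip_source):
    """Return {csv_name: list of row dicts} for every CSV file inside the ZIP.

    zip_source may be a file path or a binary file-like object.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV file
            with archive.open(name) as raw:
                # utf-8-sig tolerates an optional byte-order mark
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables
```

Calling `read_usage_csvs("usage.zip")` (a hypothetical path) returns one list of rows per CSV file, with each row keyed by the column names found in that file's header.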

Fleet Command user activity provides insights into users' activities within the organization.

  • To view the user activity, navigate to the NGC user interface and select Organization.

  • Next, under the Organization, select Audit.

    user-activity-01.png


  • After selecting Audit, the following page displays your organization’s user activity report.

    user-activity-02.png


  • Select the date range for the user activity you want to review, and then select Create a New Request. This generates a downloadable user activity report.

Fleet Command lets you view system and application-specific metrics for your deployments at edge locations. Metrics are numerical values that measure aspects of your resources at regular intervals. With metrics, you can monitor and analyze the performance of your deployed machine learning inference solutions over time. This information can help you make adjustments to improve resource consumption and the overall performance of your deployments.

The following metrics are available in Fleet Command:

  • System metrics: measurement of edge system resource utilization, including

    • Total CPU Utilization (per core)

    • Total RAM Utilization

    • Total GPU Utilization

    • Total Storage Utilization

    • Total Network Utilization

  • Application utilization metrics: measurement of application system resource utilization, including

    • App CPU Utilization

    • App RAM Utilization

    • App GPU Utilization

    • App Storage Utilization

    • App Network Utilization

Using the Fleet Command metric CLI, you can view detailed metrics under several categories (“buckets”) across organizations, view all metrics in a bucket, or view summary metrics for a particular organization within a given time period. The metrics categories include the following:

Bucket             Description
app-utilization    Application usage metrics for memory, power, GPU, etc.
custom-app         Custom metrics exposed by applications.
system             Metrics for disk, memory, CPU, and network usage.

The following are examples of using the metric command:

  • To see what buckets are contained within an org:

ngc fleet-command metric buckets --org <org-name>

  • To retrieve a list of defined metrics within a bucket:

ngc fleet-command metric list --bucket <bucket-name> --org <org-name>

  • To see a summary of information about a given metric:

ngc fleet-command metric summary <metric-name> --bucket <bucket-name> --from-date <from-date> --to-date <to-date>

  • A raw Flux query passthrough is also available that returns a JSON response:

ngc fleet-command metric query <query>
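When these commands are scripted, it can help to build the argument lists in one place. The following is a rough sketch that uses only the subcommands and options shown above; the helper `metric_command` and the organization name `my-org` are hypothetical, and the angle-bracket placeholders are kept as-is because the expected metric names and date format are not specified here.

```python
import shlex


def metric_command(action, *positional, **options):
    """Build the argument list for `ngc fleet-command metric <action>`.

    Keyword names are mapped to long options, e.g. from_date -> --from-date.
    Options whose value is None are omitted.
    """
    argv = ["ngc", "fleet-command", "metric", action, *positional]
    for key, value in options.items():
        if value is None:
            continue
        argv += ["--" + key.replace("_", "-"), str(value)]
    return argv


ORG = "my-org"  # hypothetical organization name

commands = [
    metric_command("buckets", org=ORG),
    metric_command("list", bucket="system", org=ORG),
    metric_command(
        "summary", "<metric-name>",
        bucket="system", from_date="<from-date>", to_date="<to-date>",
    ),
]

# Print each command as a shell-safe string for review before running it.
for argv in commands:
    print(shlex.join(argv))
```

On a machine where the NGC CLI is installed and configured, each argument list could then be passed to `subprocess.run(argv, check=True)`. Building a list rather than a single string avoids shell quoting problems, which matters most for the raw query passthrough.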

For more information on the metrics command and options, refer to the NGC CLI documentation.

Custom Application Metrics

You can also define and provide custom application metrics from deployed applications and access these aggregated metrics for all the deployments.

Custom application metrics are expected to be exposed as a Prometheus metrics exporter endpoint. For more information on writing custom Prometheus exporters, refer to the Prometheus development documentation.
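To illustrate what such an endpoint looks like, the following is a minimal sketch of an exporter built with only the Python standard library. It is not the recommended implementation; real applications would typically use an official Prometheus client library. The metric name `inference_queue_depth` and the `METRICS` dictionary are hypothetical, and the port and path match the defaults assumed by the annotations described in this section.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application metric; replace with values your application tracks.
METRICS = {"inference_queue_depth": 0}


def render_metrics(metrics):
    """Render metrics as gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metric values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9102):
    """Block forever, serving metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` from the application's entry point (or from a background thread) makes the metrics available at `/metrics` on port 9102.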

To expose application metrics, add the following annotations to the application Pod that serves the metrics:

  • prometheus.io/scrape: Enables scraping for this Pod.

  • prometheus.io/scheme: Sets the scheme used for scraping (default: http).

  • prometheus.io/path: Overrides the path of the metrics endpoint (default: '/metrics').

  • prometheus.io/port: Overrides the port of the metrics endpoint (default: 9102).
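As an illustration, the annotations can be generated as a dictionary and merged into the Pod metadata of a deployment manifest. This is a sketch under stated assumptions: the helper `prometheus_annotations` and the Pod name `my-inference-app` are hypothetical, and the value "true" for prometheus.io/scrape follows the common Prometheus convention rather than anything stated on this page.

```python
import json


def prometheus_annotations(scrape=True, scheme="http", path="/metrics", port=9102):
    """Return Pod annotations for metrics scraping.

    Kubernetes requires annotation values to be strings, so the boolean
    and the port number are converted explicitly.
    """
    return {
        "prometheus.io/scrape": "true" if scrape else "false",  # assumed value
        "prometheus.io/scheme": scheme,
        "prometheus.io/path": path,
        "prometheus.io/port": str(port),
    }


pod_metadata = {
    "name": "my-inference-app",  # hypothetical Pod name
    "annotations": prometheus_annotations(port=8000),
}

# Emit the metadata as JSON, which is also valid YAML for a manifest.
print(json.dumps(pod_metadata, indent=2))
```

Only the annotations that differ from the defaults strictly need to be set; the sketch writes all four so that the resulting manifest is explicit.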

You can retrieve custom metrics from the custom-app bucket using the ngc fleet-command metric command.

© Copyright 2022-2023, NVIDIA. Last updated on Oct 3, 2023.