Fleet Command User Guide
NVIDIA Fleet Command - (Latest Version)

Monitoring

Fleet Command provides monitoring capabilities with usage analytics, user activity tracking and metrics.

Fleet Command usage analytics provide insights into GPU usage, registry usage, and log usage.

  1. Select Organization > Usage.

  2. On the Usage page, select Entitled Products > Fleet Command.

The Overview tab displays a summary of the GPU and storage usage, as well as system types and systems in use.

  • GPU Usage Overview: Displays the current and monthly peak total GPU usage.

  • Storage Usage Overview: Displays the current logs and private registry storage usage.

  • System Types: Displays various system types in use.

  • System Inventory: Displays a list of systems in use.

The GPU Usage tab displays the following:

  • Current GPU Usage: Displays the GPU usage by type for the current month.

  • Monthly Peak: Displays the monthly GPU peak usage by GPU type. You can adjust the month under Timeframe.

  • Daily Peak: Displays the daily GPU peak usage for the selected GPU type and timeframe. You can change the GPU type to display using the dropdown.

  • Total Monthly Peak Trend: Displays the current year’s monthly GPU peak trend. Mouse over the graph to see the exact count on the timeline. You can change the date range using the Timeframe selector.

The Storage Usage tab displays both the Private Registry and Logs storage usage in GB for each month in the current year. You can select a different year using the Timeframe selector.

  • Logs: Displays the current Logs storage usage.

  • Private Registry: Displays the storage used by the private registry.

Mouse over the graph to view the specific storage size.

To export the GPU and storage usage analytics as CSV, click Download CSV. The data is bundled and downloaded as a ZIP file.
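
The downloaded archive can be processed with a short script. The following is a minimal sketch, assuming only that the ZIP file contains one or more CSV files with a header row; the file names and column names inside the archive are not documented here, so the function treats them generically.

```python
import csv
import io
import zipfile


def read_usage_csvs(zip_bytes: bytes) -> dict[str, list[dict[str, str]]]:
    """Return {csv_name: [row_dict, ...]} for every .csv member of the archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV file
            with archive.open(name) as raw:
                # Decode the member as UTF-8 text (an assumption about the export)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables
```

For example, `read_usage_csvs(open("usage.zip", "rb").read())` returns one list of rows per CSV file, where `usage.zip` is a hypothetical name for the downloaded file.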

Fleet Command user activity provides insights into the user’s activities within the organization.

  1. Select Organization > Audit.

  2. Select the date range for the user activity you want to review, and then select Create Report. This action generates a downloadable report that you can use to view the user activity.

Fleet Command lets you view system and application-specific metrics for your deployments at edge locations. Metrics are numerical values that measure aspects of your resources. The metrics are collected approximately every 10 seconds.

With metrics, you can monitor and analyze the performance of your deployed machine learning inference solutions over time. This information can help you make adjustments to improve resource consumption and the overall performance of your deployments.

The following metrics are available in Fleet Command:

  • System metrics: measurement of edge system resource utilization, including

    • Total CPU Utilization (per core)

    • Total RAM Utilization

    • Total GPU Utilization

    • Total Storage Utilization

    • Total Network Utilization

  • Application utilization metrics: measurement of application system resource utilization, including

    • App CPU Utilization

    • App RAM Utilization

    • App GPU Utilization

    • App Storage Utilization

    • App Network Utilization

  • Alert metrics: threshold-crossing events for CPU, memory, and disk usage.

Using the Fleet Command metric CLI, you can view detailed metrics under several categories (“buckets”) across organizations, view all metrics in a bucket, or view summary metrics for a particular organization within a given time period. The metrics categories include the following:

Bucket           Description
alert            Alerts for CPU, memory, and disk usage.
app-utilization  Application usage metrics for memory, power, GPU, and so on.
custom-app       Custom metrics exposed by applications.
system           Metrics for disk, memory, CPU, and network usage.

The following are examples of using the metric command:

  • To see what buckets are contained within an org:

    $ ngc fleet-command metric buckets --org <org-name>

  • To retrieve a list of defined metrics within a bucket:

    $ ngc fleet-command metric list --bucket <bucket-name> --org <org-name>

  • To see a summary of information about a given metric:

    $ ngc fleet-command metric summary <metric-name> --bucket <bucket-name> --from-date <from-date> --to-date <to-date>

  • To pass a Flux query through directly and receive a JSON response:

    $ ngc fleet-command metric query <query>

    Refer to Query data with Flux for information about the query syntax.

For the latest information on the metric command and options, refer to the NGC CLI documentation.
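
The query passthrough can also be driven from a script. The following is a minimal sketch, assuming that the NGC CLI is installed and configured and that the command writes only the JSON response to standard output; the function name and the injectable `runner` parameter are illustrative and not part of the NGC CLI.

```python
import json
import subprocess


def run_metric_query(flux_query: str, runner=subprocess.run):
    """Run `ngc fleet-command metric query <query>` and decode the JSON reply."""
    # Passing the query as a single list element avoids shell quoting issues.
    command = ["ngc", "fleet-command", "metric", "query", flux_query]
    completed = runner(command, capture_output=True, text=True, check=True)
    # Assumption: standard output contains nothing but the JSON document.
    return json.loads(completed.stdout)
```

The `runner` argument defaults to `subprocess.run` and exists only so that the function can be exercised without the CLI present, for example in unit tests.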

Custom Application Metrics

You can also define custom application metrics in your deployed applications and access the aggregated metrics across all deployments.

Custom application metrics are expected to be exposed as a Prometheus metrics exporter endpoint. For more information on writing custom Prometheus exporters, refer to the Prometheus development documentation.

To expose application metrics, use the following annotations on your application Pod that serves metrics.

  • prometheus.io/scrape: Enable scraping for this pod.

  • prometheus.io/scheme: The scheme used for scraping (default: http).

  • prometheus.io/path: Override the path for the metrics endpoint on the service (default: '/metrics').

  • prometheus.io/port: Override the port (default: 9102).

Using the custom-app bucket, you can retrieve custom metrics with the ngc fleet-command metric CLI command.
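
To illustrate what an exporter endpoint looks like, the following is a minimal sketch that serves gauges in the Prometheus text exposition format using only the Python standard library. It assumes the annotation defaults listed above (path /metrics, port 9102). The metric name `demo_inference_queue_depth` and the `METRICS` dictionary are hypothetical, and a production application would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical gauge values; a real application would update these as it runs.
METRICS = {"demo_inference_queue_depth": 3.0}


def render_metrics(metrics: dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9102) -> None:
    """Serve /metrics on all interfaces until the process is stopped."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` from the application's entry point starts the endpoint on the default port, so the only annotation that the Pod strictly needs in this sketch is the one that enables scraping.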

Fleet Command enables you to configure alerting rules for CPU, memory, and disk usage metrics and conveniently query the state of these alerts using the Fleet Command API or CLI. This feature provides the ability to manage and monitor alerts across Locations, all from a central place.

Every five minutes, Fleet Command evaluates the metrics collected during the preceding five minutes to determine if an alert threshold is crossed. When a threshold is crossed, Fleet Command sets the alert status and level on the observed metric to identify the location, system, and resource. You do not need to perform any action to clear an alert. When resource usage no longer triggers an alert, the metrics simply no longer include an alert status.

Fleet Command stores alert data for 30 days.

Default Alerting Rules

The following table summarizes the default alerting rules. Each rule identifies the monitored resource, the threshold level, and threshold value.

Resource  Warning Level (%)  Critical Level (%)  Measurement  Description
CPU       15                 10                  usage_idle   When CPU idle time falls below 15%, a warning-level alert is raised. When CPU idle time falls below 10%, a critical-level alert is raised.
Memory    15                 10                  free         When free memory falls below 15%, a warning-level alert is raised. When free memory falls below 10%, a critical-level alert is raised.
Disk      15                 10                  free         When free disk space falls below 15%, a warning-level alert is raised. When free disk space falls below 10%, a critical-level alert is raised.
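
The default rules all follow the same "falls below" logic, which can be sketched as follows. This is an illustration only: it classifies a single idle or free percentage, and it does not model how Fleet Command combines the samples collected during the five-minute evaluation window, which is not described here. The `DEFAULT_RULES` dictionary and the `alert_level` function are hypothetical names, keyed by the CPU, MEM, and DISK alert keys.

```python
from typing import Optional

# Default thresholds from the table above, as percentages of idle time or free space.
DEFAULT_RULES = {
    "CPU": {"measurement": "usage_idle", "warning": 15.0, "critical": 10.0},
    "MEM": {"measurement": "free", "warning": 15.0, "critical": 10.0},
    "DISK": {"measurement": "free", "warning": 15.0, "critical": 10.0},
}


def alert_level(resource: str, percent: float, rules=DEFAULT_RULES) -> Optional[str]:
    """Return "critical", "warning", or None for an idle/free percentage."""
    rule = rules[resource]
    # Check the critical threshold first because it is the lower of the two.
    if percent < rule["critical"]:
        return "critical"
    if percent < rule["warning"]:
        return "warning"
    return None
```

Because the measurements are idle time and free space, lower values are worse, which is why the critical threshold is numerically smaller than the warning threshold.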

Fleet Command monitors the disk space for all volumes on the host, except the volumes that use the following file systems:

  • aufs

  • devfs

  • devtmpfs

  • iso9660

  • overlay

  • squashfs

  • tmpfs

Configuring Alerting Rules

To configure alerting rules, use the following Fleet Command CLI command:

ngc fleet-command settings set [--alert-critical-level <alert-critical-level>] [--alert-critical-message <alert-critical-message>] [--alert-key <alert-key>] [--alert-warning-level <alert-warning-level>] [--alert-warning-message <alert-warning-message>] ...

where

--alert-key

Set the alert settings for a particular key. Options: CPU, MEM, DISK

--alert-critical-level

Set the threshold at which critical messages are logged as alerts

--alert-critical-message

Set the message for your alert when the critical level is reached

--alert-warning-level

Set the threshold at which warning messages are logged as alerts

--alert-warning-message

Set the message for your alert when the warning level is reached

The following are examples of setting alert keys with various critical and warning threshold levels for CPU, memory, and disk:

  • To set the CPU alert with the critical level at 5% idle time and a warning level at 10%:

    $ ngc fleet-command settings set --alert-key "CPU" --alert-critical-level 5 --alert-critical-message "CPU Critical" --alert-warning-level 10 --alert-warning-message "CPU Warning"

  • To set the memory alert with the critical level at 10% free memory and a warning level at 20%:

    $ ngc fleet-command settings set --alert-key "MEM" --alert-critical-level 10 --alert-critical-message "Memory Critical" --alert-warning-level 20 --alert-warning-message "Memory Warning"

  • To set the disk alert with the critical level at 15% free disk space and a warning level at 30%:

    $ ngc fleet-command settings set --alert-key "DISK" --alert-critical-level 15 --alert-critical-message "Disk Critical" --alert-warning-level 30 --alert-warning-message "Disk Warning"
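
When these settings are applied from automation, it can help to assemble the argument list in code and check the thresholds first. The following is a minimal sketch; the function name is hypothetical, and the ordering check (critical below warning, both between 0 and 100) is an assumption derived from the idle/free semantics of the thresholds, not a documented CLI rule.

```python
def build_alert_settings_command(key, critical, warning,
                                 critical_message, warning_message):
    """Build the argument list for `ngc fleet-command settings set`."""
    if key not in ("CPU", "MEM", "DISK"):
        raise ValueError(f"unknown alert key: {key!r}")
    # Assumed sanity check: thresholds are idle/free percentages, so the
    # critical level should be the lower of the two.
    if not 0 <= critical < warning <= 100:
        raise ValueError("expected 0 <= critical < warning <= 100")
    return [
        "ngc", "fleet-command", "settings", "set",
        "--alert-key", key,
        "--alert-critical-level", str(critical),
        "--alert-critical-message", critical_message,
        "--alert-warning-level", str(warning),
        "--alert-warning-message", warning_message,
    ]
```

The returned list can be passed to `subprocess.run` as is, which sidesteps shell quoting for messages that contain spaces.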

Viewing Alerting Rules

To view the alerting rules, use the following Fleet Command CLI command:

$ ngc fleet-command settings current

Example Output

----------------------------------------------
Fleet Command Settings
Remote Management: Enabled
...
----------------------------------------------
Alert Settings for MEM
Critical Level: 10
Critical Message: critical
Warning Level: 15
Warning Message: warning
Alert Settings for DISK
Critical Level: 10
Critical Message: critical
Warning Level: 15
Warning Message: warning
Alert Settings for CPU
Critical Level: 10
Critical Message: critical
Warning Level: 15
...

Querying Alert State

To query the alert state, use the Fleet Command CLI:

# To view summary statistics
ngc fleet-command metric summary <cpu|mem|disk> --bucket alert --from-date <from-date> --to-date <to-date> [--field <field-name>]

# To view alert state
ngc fleet-command metric query <query>

where

--from-date

Start of date range. (Format: yyyy-MM-dd::HH:mm:ss)

--to-date

End of date range. (Format: yyyy-MM-dd::HH:mm:ss)

--field

Valid options are:

  • For mem or disk: free, total, used

  • For cpu: usage_guest_nice, usage_idle, usage_iowait, usage_irq, usage_nice, usage_softirq, usage_system, usage_user

<query>

Specifies a Flux query.
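
The date format and the per-metric field options above are easy to get wrong by hand. The following is a minimal sketch of a helper that formats the dates and validates the field before building the summary command. The function names are hypothetical, and the time zone in which the CLI interprets the dates is not stated here, so the helper formats the datetime values exactly as given.

```python
from datetime import datetime

# Valid --field values for each alert metric, as listed above.
ALERT_FIELDS = {
    "cpu": {"usage_guest_nice", "usage_idle", "usage_iowait", "usage_irq",
            "usage_nice", "usage_softirq", "usage_system", "usage_user"},
    "mem": {"free", "total", "used"},
    "disk": {"free", "total", "used"},
}


def format_cli_date(moment: datetime) -> str:
    """Format a datetime as yyyy-MM-dd::HH:mm:ss."""
    return moment.strftime("%Y-%m-%d::%H:%M:%S")


def build_alert_summary_command(metric, start, end, field=None):
    """Build the argument list for an alert-bucket `metric summary` call."""
    if metric not in ALERT_FIELDS:
        raise ValueError(f"metric must be one of {sorted(ALERT_FIELDS)}")
    command = [
        "ngc", "fleet-command", "metric", "summary", metric,
        "--bucket", "alert",
        "--from-date", format_cli_date(start),
        "--to-date", format_cli_date(end),
    ]
    if field is not None:
        if field not in ALERT_FIELDS[metric]:
            raise ValueError(f"{field!r} is not a valid field for {metric}")
        command += ["--field", field]
    return command
```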

The following examples show how to get the alert status.

  • To get CPU alerts across all locations and systems for the last 15 minutes:

    $ ngc fleet-command metric query 'from(bucket:"alert") |> range(start:-15m) |> filter(fn: (r) => r._measurement == "cpu" and exists r.alertstatus)'

    Replace the value in the r._measurement filter to query for mem or disk alerts.

    Partial Output

    "values": {
      "_field": "usage_idle",
      "_measurement": "cpu",
      "_start": 1693383674.3168387,
      "_stop": 1693384274.3168387,
      "_time": 1693384001.0,
      "_value": 0.010593500889363162,
      "alertstatus": "critical",
      "cpu": "cpu-total",
      "host": "demo-system-1.example.com",
      "location": "demo-location",
      "result": "_result",
      "system": "demo system 1",
      "table": 1
    }

  • To get disk alerts from systems at the demo-location location for the last 5 minutes:

    $ ngc fleet-command metric query 'from(bucket:"alert") |> range(start:-5m) |> filter(fn: (r) => r._measurement == "disk" and r.location == "demo-location" and exists r.alertstatus)'
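
Alert queries like the ones above differ only in the measurement, the time range, and optional tag filters, so they can be generated. The following is a minimal sketch; the function name is hypothetical, and the tag names passed in `tags` are assumed to match the keys that appear in the query output (for example, host in the partial output shown earlier).

```python
def build_alert_query(measurement, minutes, tags=None):
    """Build a Flux query for alerts on one measurement over the last N minutes."""
    if measurement not in ("cpu", "mem", "disk"):
        raise ValueError("measurement must be cpu, mem, or disk")
    conditions = [f'r._measurement == "{measurement}"']
    for key, value in (tags or {}).items():
        if '"' in value:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError("tag values must not contain double quotes")
        conditions.append(f'r.{key} == "{value}"')
    # Only records that carry an alert status are alerts.
    conditions.append("exists r.alertstatus")
    predicate = " and ".join(conditions)
    return (f'from(bucket:"alert") |> range(start:-{minutes}m) '
            f'|> filter(fn: (r) => {predicate})')
```

For example, `build_alert_query("disk", 5, tags={"host": "demo-system-1.example.com"})` produces a query that can be passed to ngc fleet-command metric query.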

Refer to Query data with Flux for information about the query syntax. For the latest information on the metric command and options, refer to the NGC CLI documentation.
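
Once a query returns, each record can be condensed into a one-line summary for logs or notifications. The following is a minimal sketch that works on one "values" object like the one in the partial output shown earlier. It assumes that _time is a Unix timestamp in seconds and interprets it as UTC; the function name is hypothetical, and the overall shape of the JSON response around "values" is not covered here.

```python
from datetime import datetime, timezone


def summarize_alert(values: dict) -> str:
    """Condense one alert record into a single human-readable line."""
    # Assumption: _time is seconds since the Unix epoch.
    when = datetime.fromtimestamp(values["_time"], tz=timezone.utc)
    return (
        f'{when:%Y-%m-%d %H:%M:%S} UTC {values["alertstatus"]} '
        f'{values["_measurement"]}.{values["_field"]}={values["_value"]:.2f} '
        f'on {values["system"]} at {values["location"]}'
    )
```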

© Copyright 2022-2024, NVIDIA. Last updated on Jun 11, 2024.