Monitoring - NVIDIA Docs

Fleet Command provides monitoring capabilities with usage analytics, user activity tracking and metrics.

Usage Analytics

Fleet Command usage analytics provide insights into GPU usage, registry usage, and log usage.

Select Organization > Usage.
On the Usage page, select Entitled Products > Fleet Command.

The Overview tab displays a summary of the GPU and storage usage, as well as system types and systems in use.

GPU Usage Overview: Displays the current and monthly peak total GPU usage.
Storage Usage Overview: Displays the current logs and private registry storage usage.
System Types: Displays various system types in use.
System Inventory: Displays a list of systems in use.

The GPU Usage tab displays the following:

Current GPU Usage: Displays the GPU usage by type for the current month.
Monthly Peak: Displays the monthly GPU peak usage by GPU type. You can adjust the month under Timeframe.
Daily Peak: Displays the daily GPU peak usage for the selected GPU type and timeframe. You can change the GPU type to display using the dropdown.
Total Monthly Peak Trend: Displays the current year’s monthly GPU peak trend. Mouse over the graph to see the exact count on the timeline. You can change the date range using the Timeframe selector.

The Storage Usage tab displays both the Private Registry and Logs storage usage in GB for each month in the current year. You can select a different year using the Timeframe selector.

Logs: Displays the current Logs storage usage.
Private Registry: Displays the storage used by the private registry.

Mouse over the graph to view the specific storage size.

To export the GPU and storage usage analytics as CSV, click Download CSV. The data is bundled and downloaded as a ZIP file.
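If you want to process the export programmatically, the following is a minimal sketch of reading every CSV file from the downloaded ZIP archive. The file names and column names inside the archive are not documented here, so the sketch makes no assumption about them and returns whatever header row each file contains. The function name `read_usage_csvs` is illustrative, and UTF-8 encoding is assumed.

```python
import csv
import io
import zipfile


def read_usage_csvs(zip_bytes: bytes) -> dict[str, list[dict[str, str]]]:
    """Return {archive member name: list of rows} for each CSV in the ZIP."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore anything that is not a CSV file
            # Assumption: the files are UTF-8, possibly with a byte-order mark.
            text = archive.read(name).decode("utf-8-sig")
            tables[name] = list(csv.DictReader(io.StringIO(text)))
    return tables
```

For example, `read_usage_csvs(open(path, "rb").read())` returns one list of row dictionaries per CSV file, keyed by the name of the file in the archive.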

User Activity

Fleet Command user activity provides insights into users' activities within the organization.

Select Organization > Audit.
Select the date range that covers the user activity you want to review, and then click Create Report. This generates a downloadable report of the user activity.

Metrics

Fleet Command lets you view system and application-specific metrics for your deployments at edge locations. Metrics are numerical values that measure aspects of your resources. The metrics are collected approximately every 10 seconds.

With metrics, you can monitor and analyze the performance of your deployed machine learning inference solutions over time. This information can help you make adjustments to improve resource consumption and the overall performance of your deployments.

The following metrics are available in Fleet Command:

System metrics: measurement of edge system resource utilization, including
- Total CPU Utilization (per core)
- Total RAM Utilization
- Total GPU Utilization
- Total Storage Utilization
- Total Network Utilization
Application utilization metrics: measurement of application system resource utilization, including
- App CPU Utilization
- App RAM Utilization
- App GPU Utilization
- App Storage Utilization
- App Network Utilization
Alert metrics: threshold-crossing events for CPU, memory, and disk usage.

Using the Fleet Command metric CLI, you can view detailed metrics under several categories (“buckets”) across organizations, view all metrics in a bucket, or view summary metrics for a particular organization within a given time period. The metrics categories include the following:

Bucket	Description
alert	Alerts for CPU, memory, and disk usage.
app-utilization	Application usage metrics for memory, power, GPU, and so on.
custom-app	Custom metrics exposed by applications.
system	Metrics for disk, memory, CPU, and network usage.

The following are examples of using the metric command:

To see what buckets are contained within an org:

            $ ngc fleet-command metric buckets --org <org-name>

To retrieve a list of defined metrics within a bucket:

            $ ngc fleet-command metric list --bucket <bucket-name> --org <org-name>

To see a summary of information about a given metric:

            $ ngc fleet-command metric summary <metric-name> --bucket <bucket-name> --from-date <from-date> --to-date <to-date>

To submit a Flux query directly, use the query passthrough, which returns a JSON response:

```
$ ngc fleet-command metric query <query>
```
Refer to Query data with Flux for information about the query syntax.

For the latest information on the metric command and options, refer to the NGC CLI documentation.
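If you script these commands, it can help to build the argument lists in one place. The following is a minimal sketch that assembles the arguments exactly as they appear in the examples above. The helper names are illustrative and are not part of the NGC CLI, and the date values are passed through unchanged because their expected format is not described here.

```python
# Bucket names taken from the table in this section.
KNOWN_BUCKETS = ("alert", "app-utilization", "custom-app", "system")

BASE = ["ngc", "fleet-command", "metric"]


def _check_bucket(bucket: str) -> str:
    if bucket not in KNOWN_BUCKETS:
        raise ValueError(f"unknown bucket: {bucket!r}")
    return bucket


def metric_buckets_args(org: str) -> list[str]:
    """Arguments for listing the buckets in an org."""
    return BASE + ["buckets", "--org", org]


def metric_list_args(bucket: str, org: str) -> list[str]:
    """Arguments for listing the metrics defined in a bucket."""
    return BASE + ["list", "--bucket", _check_bucket(bucket), "--org", org]


def metric_summary_args(metric: str, bucket: str,
                        from_date: str, to_date: str) -> list[str]:
    """Arguments for summarizing one metric over a time range."""
    return BASE + ["summary", metric, "--bucket", _check_bucket(bucket),
                   "--from-date", from_date, "--to-date", to_date]


def metric_query_args(flux_query: str) -> list[str]:
    """Arguments for passing a Flux query through to the backend."""
    return BASE + ["query", flux_query]
```

Each list can be passed to `subprocess.run(args, check=True)`. Passing a list rather than a shell string avoids quoting problems with Flux queries, which contain quotes and pipe characters.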

Custom Application Metrics

You can also define and provide custom application metrics from deployed applications and access these aggregated metrics for all the deployments.

Custom application metrics are expected to be exposed as a Prometheus metrics exporter endpoint. For more information on writing custom Prometheus exporters, refer to the Prometheus development documentation.
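As an illustration of what such an endpoint serves, the following is a minimal sketch of an exporter that uses only the Python standard library and writes gauges in the Prometheus text exposition format. It is not the official Prometheus client library, and the metric name in the usage note is hypothetical. The path `/metrics` and port `9102` match the defaults listed with the annotations in this section.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_gauges(gauges: dict[str, tuple[str, float]]) -> str:
    """Render {name: (help text, value)} as Prometheus text exposition."""
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def make_handler(read_gauges):
    """Build a request handler that serves the current gauges on /metrics."""
    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_gauges(read_gauges()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type",
                             "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


def serve(read_gauges, port: int = 9102) -> None:
    """Serve metrics until interrupted. This call blocks."""
    HTTPServer(("", port), make_handler(read_gauges)).serve_forever()
```

A call such as `serve(lambda: {"demo_queue_depth": ("Hypothetical queue depth.", 12.5)})` would expose a single gauge. For production use, an established client library is usually a better choice than a hand-written exporter.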

To expose application metrics, use the following annotations on your application Pod that serves metrics.

prometheus.io/scrape: Enable scraping for this pod.
prometheus.io/scheme: The scheme used for scraping (default: http).
prometheus.io/path: Override the path for the metrics endpoint on the service (default: '/metrics').
prometheus.io/port: Override the port (default: 9102).
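The following sketch shows one way to generate these annotations for a Pod's metadata. The function name is illustrative, and the value `"true"` for `prometheus.io/scrape` is an assumption based on common Prometheus conventions rather than something stated here. Kubernetes annotation values must be strings, so the port is converted.

```python
def scrape_annotations(port: int = 9102, path: str = "/metrics",
                       scheme: str = "http") -> dict[str, str]:
    """Return Pod annotations that describe the metrics endpoint."""
    return {
        # Assumption: "true" is the value that enables scraping.
        "prometheus.io/scrape": "true",
        "prometheus.io/scheme": scheme,
        "prometheus.io/path": path,
        # Annotation values are strings, so the port is converted.
        "prometheus.io/port": str(port),
    }


def pod_metadata(name: str, **endpoint) -> dict:
    """Return a Pod metadata fragment carrying the scrape annotations."""
    return {"name": name, "annotations": scrape_annotations(**endpoint)}
```

For example, `pod_metadata("demo-app", port=8080)` returns a fragment that can be merged into the `metadata` of a Pod template before the template is serialized to YAML.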

To retrieve custom metrics, use the ngc fleet-command metric CLI command with the custom-app bucket.

Alerts

Fleet Command enables you to configure alerting rules for CPU, memory, and disk usage metrics and conveniently query the state of these alerts using the Fleet Command API or CLI. This feature provides the ability to manage and monitor alerts across Locations, all from a central place.

Every five minutes, Fleet Command evaluates the metrics collected during the preceding five minutes to determine whether an alert threshold has been crossed. When a threshold is crossed, Fleet Command sets the alert status and level on the observed metric to identify the location, system, and resource. You do not need to take any action to clear an alert. When resource usage no longer triggers an alert, the metrics no longer include an alert status.
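As a rough sanity check on this timing, the following sketch works out how many samples fall into one evaluation window. It assumes that the collection interval is exactly 10 seconds, although this page describes that interval as approximate. The constant and function names are illustrative.

```python
COLLECTION_INTERVAL_SECONDS = 10      # approximate, per the metrics description
EVALUATION_WINDOW_SECONDS = 5 * 60    # alerts are evaluated every five minutes


def samples_per_window(window_s: int = EVALUATION_WINDOW_SECONDS,
                       interval_s: int = COLLECTION_INTERVAL_SECONDS) -> int:
    """Number of samples per metric that one evaluation can see."""
    return window_s // interval_s


def evaluations_per_day(window_s: int = EVALUATION_WINDOW_SECONDS) -> int:
    """Number of alert evaluations in 24 hours."""
    return 24 * 60 * 60 // window_s
```

Under these assumptions, each evaluation covers about 30 samples per metric, and there are 288 evaluations per day. Because evaluation is periodic, a threshold crossing may take up to roughly five minutes to appear as an alert.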

Fleet Command stores alert data for 30 days.

Default Alerting Rules

The following table summarizes the default alerting rules. Each rule identifies the monitored resource, the threshold level, and threshold value.

Resource	Warning Level	Critical Level	Measurement	Description
CPU	15	10	`usage_idle`	When CPU idle time falls below 15%, a warning-level alert is raised. When CPU idle time falls below 10%, a critical-level alert is raised.
Memory	15	10	`free`	When free memory falls below 15%, a warning-level alert is raised. When free memory falls below 10%, a critical-level alert is raised.
Disk	15	10	`free`	When free disk space falls below 15%, a warning-level alert is raised. When free disk space falls below 10%, a critical-level alert is raised.

Fleet Command monitors the disk space for all volumes on the host, except the volumes that use the following file systems:

aufs
devfs
devtmpfs
iso9660
overlay
squashfs
tmpfs
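To predict which volumes on a host would be covered by the disk alert, you can apply the same exclusion list yourself. The following is a minimal sketch. The function name and the input shape, a sequence of (mount point, file system type) pairs, are illustrative choices.

```python
# File systems that the disk alert does not monitor, per the list above.
EXCLUDED_FILESYSTEMS = frozenset({
    "aufs", "devfs", "devtmpfs", "iso9660", "overlay", "squashfs", "tmpfs",
})


def monitored_volumes(volumes) -> list[str]:
    """Return the mount points whose file system type is not excluded."""
    return [mount for mount, fstype in volumes
            if fstype not in EXCLUDED_FILESYSTEMS]
```

For example, given `[("/", "ext4"), ("/run", "tmpfs")]`, only `/` is returned.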

Configuring Alerting Rules

To configure alerting rules, use the following Fleet Command CLI command:

            ngc fleet-command settings set [--alert-critical-level <alert-critical-level>]
                               [--alert-critical-message <alert-critical-message>]
                               [--alert-key <alert-key>]
                               [--alert-warning-level <alert-warning-level>]
                               [--alert-warning-message <alert-warning-message>]
                               ...

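where:

--alert-key: The resource that the rule applies to: CPU, MEM, or DISK.
--alert-critical-level: The threshold, as a percentage, below which a critical-level alert is raised.
--alert-critical-message: The message associated with a critical-level alert.
--alert-warning-level: The threshold, as a percentage, below which a warning-level alert is raised.
--alert-warning-message: The message associated with a warning-level alert.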

The following are examples of setting alert keys with various critical and warning threshold levels for CPU, memory, and disk:

To set the CPU alert with the critical level at 5% idle time and a warning level at 10%:

            $ ngc fleet-command settings set --alert-key "CPU" --alert-critical-level 5 --alert-critical-message "CPU Critical" --alert-warning-level 10 --alert-warning-message "CPU Warning"

To set the memory alert with the critical level at 10% free memory and a warning level at 20%:

            $ ngc fleet-command settings set --alert-key "MEM" --alert-critical-level 10 --alert-critical-message "Memory Critical" --alert-warning-level 20 --alert-warning-message "Memory Warning"

To set the disk alert with the critical level at 15% free disk space and a warning level at 30%:

            $ ngc fleet-command settings set --alert-key "DISK" --alert-critical-level 15 --alert-critical-message "Disk Critical" --alert-warning-level 30 --alert-warning-message "Disk Warning"
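When you set rules from a script, a small wrapper can catch inverted thresholds before the command runs. The following sketch builds the same argument list as the examples above. The check that the critical level is lower than the warning level is this sketch's own sanity check, based on the thresholds measuring idle or free percentages. It is not documented CLI behavior, and the function name is illustrative.

```python
ALERT_KEYS = ("CPU", "MEM", "DISK")


def alert_settings_args(key: str, critical_level: int, warning_level: int,
                        critical_message: str,
                        warning_message: str) -> list[str]:
    """Build the arguments for `ngc fleet-command settings set`."""
    if key not in ALERT_KEYS:
        raise ValueError(f"unknown alert key: {key!r}")
    # Levels are idle/free percentages, so critical must be the lower value.
    if not 0 <= critical_level < warning_level <= 100:
        raise ValueError("expected 0 <= critical < warning <= 100")
    return [
        "ngc", "fleet-command", "settings", "set",
        "--alert-key", key,
        "--alert-critical-level", str(critical_level),
        "--alert-critical-message", critical_message,
        "--alert-warning-level", str(warning_level),
        "--alert-warning-message", warning_message,
    ]
```

For example, `alert_settings_args("CPU", 5, 10, "CPU Critical", "CPU Warning")` reproduces the first example above and can be passed to `subprocess.run`.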

Viewing Alerting Rules

To view the alerting rules, use the following Fleet Command CLI command:

            $ ngc fleet-command settings current

Example Output

----------------------------------------------
Fleet Command Settings
Remote Management: Enabled
...
----------------------------------------------
Alert Settings for MEM
Critical Level: 10
Critical Message: critical
Warning Level: 15
Warning Message: warning
Alert Settings for DISK
Critical Level: 10
Critical Message: critical
Warning Level: 15
Warning Message: warning
Alert Settings for CPU
Critical Level: 10
Critical Message: critical
Warning Level: 15
...
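If you need the rules in a script, the text output can be parsed into a dictionary. The following is a minimal sketch that assumes the output keeps the layout shown in the example above, with an `Alert Settings for <KEY>` line followed by `Name: value` lines. The layout may differ between CLI versions, and the function name is illustrative.

```python
HEADER = "Alert Settings for "


def parse_alert_settings(output: str) -> dict[str, dict[str, str]]:
    """Parse `settings current` text into {alert key: {setting: value}}."""
    settings: dict[str, dict[str, str]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith(HEADER):
            current = line[len(HEADER):]
            settings[current] = {}
        elif current is not None and ": " in line:
            # Lines before the first alert section are skipped on purpose.
            name, value = line.split(": ", 1)
            settings[current][name] = value
    return settings
```

Separator lines and elided lines contain no `": "` and are ignored. Values are kept as strings, so convert the levels with `int()` before comparing them.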

Querying Alert State

To query the alert state, use the following Fleet Command CLI commands:

# To view summary statistics
ngc fleet-command metric summary <cpu|mem|disk> --bucket alert --from-date <from-date> --to-date <to-date> [--field <field-name>]

# To view alert state
ngc fleet-command metric query <query>

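where:

--from-date: The start of the time range to summarize.
--to-date: The end of the time range to summarize.
--field: Optional. The name of the field to summarize.
<query>: A Flux query, passed through as written.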

The following examples show how to get the alert status.

To get CPU alerts across all locations and systems for the last 15 minutes:

            $ ngc fleet-command metric query 'from(bucket:"alert") |> range(start:-15m) |> filter(fn: (r) => r._measurement == "cpu" and exists r.alertstatus)'

To query for memory or disk alerts instead, replace "cpu" in the r._measurement filter with "mem" or "disk".

Partial Output

"values": {
  "_field": "usage_idle",
  "_measurement": "cpu",
  "_start": 1693383674.3168387,
  "_stop": 1693384274.3168387,
  "_time": 1693384001.0,
  "_value": 0.010593500889363162,
  "alertstatus": "critical",
  "cpu": "cpu-total",
  "host": "demo-system-1.example.com",
  "location": "demo-location",
  "result": "_result",
  "system": "demo system 1",
  "table": 1
}
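Once the JSON response is parsed, each `values` object can be reduced to a one-line summary. The following is a minimal sketch that works on a single `values` dictionary such as the one above. The surrounding structure of the full response is not shown on this page, so locating the `values` objects is left to the caller. It assumes that `_time` is a Unix timestamp in seconds, and the function name is illustrative.

```python
from datetime import datetime, timezone


def describe_alert(values: dict) -> str:
    """Summarize one alert record as a single line of text."""
    # Assumption: _time is seconds since the Unix epoch, in UTC.
    when = datetime.fromtimestamp(values["_time"], tz=timezone.utc)
    return (
        f'{when.isoformat()} {values["alertstatus"]} '
        f'{values["_measurement"]}.{values["_field"]}='
        f'{values["_value"]:.2f} '
        f'at {values["location"]}/{values["system"]}'
    )
```

For the record above, this produces a line that starts with the UTC timestamp, followed by the alert status, the measurement and field with its value, and the location and system names.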

To get disk alerts from systems at the demo-location location for the last 5 minutes:

            $ ngc fleet-command metric query 'from(bucket:"alert") |> range(start:-5m) |> filter(fn: (r) => r._measurement == "disk" and r.location == "demo-location" and exists r.alertstatus)'

Refer to Query data with Flux for information about the query syntax. For the latest information on the metric command and options, refer to the NGC CLI documentation.