Monitoring
Fleet Command provides monitoring capabilities with usage analytics, user activity tracking and metrics.
Fleet Command usage analytics provide insights into GPU usage, registry usage, and log usage.
Select Organization > Usage.
On the Usage page, select Entitled Products > Fleet Command.
The Overview tab displays a summary of the GPU and storage usage, as well as system types and systems in use.
GPU Usage Overview: Displays the current and monthly peak total GPU usage.
Storage Usage Overview: Displays the current logs and private registry storage usage.
System Types: Displays various system types in use.
System Inventory: Displays a list of systems in use.
The GPU Usage tab displays the following:
Current GPU Usage: Displays the GPU usage by type for the current month.
Monthly Peak: Displays the monthly GPU peak usage by GPU type. You can adjust the month under Timeframe.
Daily Peak: Displays the daily GPU peak usage for the selected GPU type and timeframe. You can change the GPU type to display using the dropdown.
Total Monthly Peak Trend: Displays the current year’s monthly GPU peak trend. Mouse over the graph to see the exact count on the timeline. You can change the date range using the Timeframe selector.
The Storage Usage tab displays both the Private Registry and Logs storage usage in GB for each month in the current year. You can select a different year using the Timeframe selector.
Logs: Displays the current Logs storage usage.
Private Registry: Displays the storage used by the private registry.
Mouse over the graph to view the specific storage size.
To export the GPU and storage usage analytics, click Download CSV. The data is bundled into a ZIP file and downloaded.
Fleet Command user activity provides insights into the user’s activities within the organization.
Select Organization > Audit.
Select the date range that reflects the time frame of user activity and then select Create Report. This action generates a downloadable report that you can use to view the user activity.
Fleet Command lets you view system and application-specific metrics for your deployments at edge locations. Metrics are numerical values that measure aspects of your resources. The metrics are collected approximately every 10 seconds.
With metrics, you can monitor and analyze the performance of your deployed machine learning inference solutions over time. This information can help you make adjustments to improve resource consumption and the overall performance of your deployments.
The following metrics are available in Fleet Command:
System metrics: measurement of edge system resource utilization, including
Total CPU Utilization (per core)
Total RAM Utilization
Total GPU Utilization
Total Storage Utilization
Total Network Utilization
Application utilization metrics: measurement of application system resource utilization, including
App CPU Utilization
App RAM Utilization
App GPU Utilization
App Storage Utilization
App Network Utilization
Alert metrics: threshold-crossing events for CPU, memory, and disk usage.
Using the Fleet Command metric CLI, you can view detailed metrics under several categories (“buckets”) across organizations, view all metrics in a bucket, or view summary metrics for a particular organization within a given time period. The metrics categories include the following:
Bucket | Description |
---|---|
alert | Alerts for CPU, memory, and disk usage. |
app-utilization | Application usage metrics for memory, power, GPU, and so on. |
custom-app | Custom metrics exposed by applications. |
system | Metrics for disk, memory, CPU, and network usage. |
The following are examples of using the metric command:
To see what buckets are contained within an org:
$ ngc fleet-command metric buckets --org <org-name>
To retrieve a list of defined metrics within a bucket:
$ ngc fleet-command metric list --bucket <bucket-name> --org <org-name>
To see a summary of information about a given metric:
$ ngc fleet-command metric summary <metric-name> --bucket <bucket-name> --from-date <from-date> --to-date <to-date>
To submit a Flux query directly (passthrough) and receive a JSON response:
$ ngc fleet-command metric query <query>
Refer to Query data with Flux for information about the query syntax.
For the latest information on the metric command and options, refer to the NGC CLI documentation.
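As a rough illustration of combining these commands, the following sketch walks the four documented buckets and prints the corresponding `metric list` command for each. The `BUCKETS` and `DRY_RUN` variables and the `list_all_buckets` function are hypothetical helpers, not part of the NGC CLI; by default the function only prints the commands so that you can review them first.

```shell
#!/bin/sh
# Sketch: print (or run) "metric list" for each bucket named in this section.
# BUCKETS, DRY_RUN, and list_all_buckets are illustrative names only.

BUCKETS="alert app-utilization custom-app system"

list_all_buckets() (
    org=$1
    for bucket in $BUCKETS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): show the command instead of executing it.
            echo "ngc fleet-command metric list --bucket $bucket --org $org"
        else
            ngc fleet-command metric list --bucket "$bucket" --org "$org"
        fi
    done
)
```

Calling `list_all_buckets <org-name>` prints one command per bucket. Setting `DRY_RUN=0` runs the commands instead, assuming the NGC CLI is installed and configured.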
Custom Application Metrics
You can also define and provide custom application metrics from deployed applications and access these aggregated metrics for all the deployments.
Custom application metrics are expected to be exposed as a Prometheus metrics exporter endpoint. For more information on writing custom Prometheus exporters, refer to the Prometheus development documentation.
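To give a sense of what such an endpoint returns, the sketch below renders a single gauge in the Prometheus text exposition format. It only produces the response body; serving it over HTTP is left to your application, and in practice a Prometheus client library would normally generate this output. The function name `emit_gauge` and the metric name in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch: render one gauge in the Prometheus text exposition format.
# emit_gauge is an illustrative helper, not part of any Fleet Command tooling.

emit_gauge() (
    name=$1
    help=$2
    value=$3
    printf '# HELP %s %s\n' "$name" "$help"   # human-readable description
    printf '# TYPE %s gauge\n' "$name"        # metric type declaration
    printf '%s %s\n' "$name" "$value"         # the sample itself
)
```

For example, `emit_gauge demo_queue_depth "Requests waiting for inference." 7` prints a HELP line, a TYPE line, and the sample line `demo_queue_depth 7`.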
To expose application metrics, use the following annotations on your application Pod that serves metrics.
prometheus.io/scrape: Enable scraping for this pod.
prometheus.io/scheme: Default value is http.
prometheus.io/path: Override the path for the metrics endpoint on the service (default: '/metrics').
prometheus.io/port: Used to override the port (default: 9102).
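The annotations above might appear in a Pod's metadata roughly as follows. This sketch prints a YAML fragment that you could paste into a Pod template; the function `print_metrics_annotations` is a hypothetical helper, and the value `"true"` for `prometheus.io/scrape` is an assumption based on common Prometheus conventions rather than something stated in this guide. The values are quoted because Kubernetes annotation values must be strings.

```shell
#!/bin/sh
# Sketch: print a Pod metadata fragment carrying the scrape annotations.
# Defaults for scheme, path, and port follow the values listed above;
# the "true" value for prometheus.io/scrape is an assumption.

print_metrics_annotations() (
    port=${1:-9102}
    path=${2:-/metrics}
    scheme=${3:-http}
    cat <<EOF
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/scheme: "$scheme"
    prometheus.io/path: "$path"
    prometheus.io/port: "$port"
EOF
)
```

Calling `print_metrics_annotations` with no arguments uses the defaults; `print_metrics_annotations 8080 /stats` overrides the port and path.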
Using the custom-app bucket, you can retrieve custom metrics using the CLI ngc fleet-command metric command.
Fleet Command enables you to configure alerting rules for CPU, memory, and disk usage metrics and conveniently query the state of these alerts using the Fleet Command API or CLI. This feature provides the ability to manage and monitor alerts across Locations, all from a central place.
Every five minutes, Fleet Command evaluates the metrics collected during the preceding five minutes to determine if an alert threshold is crossed. When a threshold is crossed, Fleet Command sets the alert status and level on the observed metric to identify the location, system, and resource. You do not perform any action to clear an alert. When resource usage no longer triggers an alert, the metrics simply no longer include an alert status.
Fleet Command stores alert data for 30 days.
Default Alerting Rules
The following table summarizes the default alerting rules. Each rule identifies the monitored resource, the threshold level, and threshold value.
Resource | Warning Level | Critical Level | Measurement | Description |
---|---|---|---|---|
CPU | 15 | 10 | usage_idle | When CPU idle time falls below 15%, a warning-level alert is raised. When CPU idle time falls below 10%, a critical-level alert is raised. |
Memory | 15 | 10 | free | When free memory falls below 15%, a warning-level alert is raised. When free memory falls below 10%, a critical-level alert is raised. |
Disk | 15 | 10 | free | When free disk space falls below 15%, a warning-level alert is raised. When free disk space falls below 10%, a critical-level alert is raised. |
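The default rules can be read as a simple comparison: because each rule watches idle or free capacity, an alert is raised when the measured percentage falls below a threshold. The sketch below illustrates that logic; it is not how Fleet Command itself evaluates alerts. The function name `alert_level` and the `none` result are hypothetical, and treating a value exactly equal to a threshold as not alerting is an assumption drawn from the "falls below" wording.

```shell
#!/bin/sh
# Sketch: map a measured idle/free percentage to an alert level.
# Usage: alert_level <value> [warning-level] [critical-level]
# Defaults (15 and 10) are the default thresholds from the rules above.

alert_level() (
    value=$1
    warning=${2:-15}
    critical=${3:-10}
    # awk handles fractional values; +0 forces numeric comparison.
    awk -v v="$value" -v w="$warning" -v c="$critical" 'BEGIN {
        if (v + 0 < c + 0)      print "critical"
        else if (v + 0 < w + 0) print "warning"
        else                    print "none"
    }'
)
```

With the defaults, `alert_level 8` reports `critical`, `alert_level 12` reports `warning`, and `alert_level 40` reports `none`.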
Fleet Command monitors the disk space for all volumes on the host, except the volumes that use the following file systems:
aufs
devfs
devtmpfs
iso9660
overlay
squashfs
tmpfs
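If you want to predict which volumes on a host are covered by the disk rule, the exclusion list above can be expressed as a small check. This is a sketch only; `is_monitored_fs` is a hypothetical helper, and it assumes that you already know the file system type of each volume.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) when a file system type is NOT on the exclusion list.
# is_monitored_fs is an illustrative helper, not a Fleet Command command.

is_monitored_fs() {
    case "$1" in
        aufs|devfs|devtmpfs|iso9660|overlay|squashfs|tmpfs)
            return 1 ;;   # excluded from disk monitoring
        *)
            return 0 ;;   # any other type is monitored
    esac
}
```

For example, `is_monitored_fs ext4 && echo monitored` prints `monitored`, while the same check with `tmpfs` prints nothing.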
Configuring Alerting Rules
To configure alerting rules, use the following Fleet Command CLI command:
ngc fleet-command settings set [--alert-critical-level <alert-critical-level>]
[--alert-critical-message <alert-critical-message>]
[--alert-key <alert-key>]
[--alert-warning-level <alert-warning-level>]
[--alert-warning-message <alert-warning-message>]
...
where
--alert-key: Set the alert settings for a particular key. Options: CPU, MEM, DISK
--alert-critical-level: Set the threshold at which critical messages should be logged into alerts
--alert-critical-message: Set the message for your alert when the critical level is reached
--alert-warning-level: Set the threshold at which warning messages should be logged into alerts
--alert-warning-message: Set the message for your alert when the warning level is reached
The following are examples of setting alert keys with various critical and warning threshold levels for CPU, memory, and disk:
To set the CPU alert with the critical level at 5% idle time and a warning level at 10%:
$ ngc fleet-command settings set --alert-key "CPU" --alert-critical-level 5 --alert-critical-message "CPU Critical" --alert-warning-level 10 --alert-warning-message "CPU Warning"
To set the memory alert with the critical level at 10% free memory and a warning level at 20%:
$ ngc fleet-command settings set --alert-key "MEM" --alert-critical-level 10 --alert-critical-message "Memory Critical" --alert-warning-level 20 --alert-warning-message "Memory Warning"
To set the disk alert with the critical level at 15% free disk space and a warning level at 30%:
$ ngc fleet-command settings set --alert-key "DISK" --alert-critical-level 15 --alert-critical-message "Disk Critical" --alert-warning-level 30 --alert-warning-message "Disk Warning"
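When scripting these settings, it can help to validate the inputs before calling the CLI. The sketch below builds the `settings set` command as a string after checking the key and the threshold order. The function `build_alert_cmd` is hypothetical, the generated messages are placeholders, and the rule that the critical level must be lower than the warning level is an assumption inferred from the examples above, not a documented CLI constraint. Levels are assumed to be whole numbers.

```shell
#!/bin/sh
# Sketch: validate inputs, then print the matching "settings set" command.
# Usage: build_alert_cmd <CPU|MEM|DISK> <critical-level> <warning-level>

build_alert_cmd() (
    key=$1
    critical=$2
    warning=$3
    case "$key" in
        CPU|MEM|DISK) ;;
        *) echo "unknown alert key: $key" >&2; exit 2 ;;
    esac
    # Assumption: thresholds measure idle/free capacity, so the critical
    # level should be the lower of the two.
    if [ "$critical" -ge "$warning" ]; then
        echo "critical level must be below warning level" >&2
        exit 2
    fi
    printf 'ngc fleet-command settings set --alert-key "%s" --alert-critical-level %s --alert-critical-message "%s Critical" --alert-warning-level %s --alert-warning-message "%s Warning"\n' \
        "$key" "$critical" "$key" "$warning" "$key"
)
```

`build_alert_cmd CPU 5 10` prints the same command as the CPU example above. Review the printed command before running it.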
Viewing Alerting Rules
To view the alerting rules, use the following Fleet Command CLI command:
$ ngc fleet-command settings current
Example Output
----------------------------------------------
Fleet Command Settings
Remote Management: Enabled
...
----------------------------------------------
Alert Settings for MEM
Critical Level: 10
Critical Message: critical
Warning Level: 15
Warning Message: warning
Alert Settings for DISK
Critical Level: 10
Critical Message: critical
Warning Level: 15
Warning Message: warning
Alert Settings for CPU
Critical Level: 10
Critical Message: critical
Warning Level: 15
...
Querying Alert State
To query the alert state, use the Fleet Command CLI:
# To view summary statistics
ngc fleet-command metric summary <cpu|mem|disk> --bucket alert --from-date <from-date> --to-date <to-date> [--field <field-name>]
# To view alert state
ngc fleet-command metric query <query>
where
--from-date: Start of date range. (Format: yyyy-MM-dd::HH:mm:ss)
--to-date: End of date range. (Format: yyyy-MM-dd::HH:mm:ss)
--field: Valid options are:
For mem or disk: free, total, used
For cpu: usage_guest_nice, usage_idle, usage_iowait, usage_irq, usage_nice, usage_softirq, usage_system, usage_user
<query>: Specifies a Flux query.
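The date format for `--from-date` and `--to-date` is easy to get wrong by hand, so a small helper can generate it. The sketch below formats epoch seconds in the `yyyy-MM-dd::HH:mm:ss` layout and prints a flag pair for the last N minutes. Both function names are hypothetical, and the use of UTC is an assumption, because this guide does not state which time zone the CLI expects.

```shell
#!/bin/sh
# Sketch: produce --from-date/--to-date values in yyyy-MM-dd::HH:mm:ss form.
# Assumption: timestamps are expressed in UTC.

fmt_ngc_date() (
    epoch=$1
    # GNU date accepts -d "@<epoch>"; BSD/macOS date accepts -r <epoch>.
    date -u -d "@$epoch" '+%Y-%m-%d::%H:%M:%S' 2>/dev/null ||
        date -u -r "$epoch" '+%Y-%m-%d::%H:%M:%S'
)

# Print a flag pair covering the last <minutes> minutes up to now.
last_minutes_range() (
    minutes=$1
    now=$(date +%s)
    from=$((now - minutes * 60))
    printf '%s %s %s %s\n' \
        --from-date "$(fmt_ngc_date "$from")" \
        --to-date "$(fmt_ngc_date "$now")"
)
```

The output of `last_minutes_range 15` can be appended to a `metric summary` command, for example after `--bucket alert`.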
The following examples show how to get the alert status.
To get CPU alerts across all locations and systems for the last 15 minutes:
$ ngc fleet-command metric query 'from(bucket:"alert") |> range(start:-15m) |> filter(fn: (r) => r._measurement == "cpu" and exists r.alertstatus)'
Replace the value in the r._measurement filter to query for mem or disk alerts.
Partial Output
"values": { "_field": "usage_idle", "_measurement": "cpu", "_start": 1693383674.3168387, "_stop": 1693384274.3168387, "_time": 1693384001.0, "_value": 0.010593500889363162, "alertstatus": "critical", "cpu": "cpu-total", "host": "demo-system-1.example.com", "location": "demo-location", "result": "_result", "system": "demo system 1", "table": 1 }
To get disk alerts from systems at the demo-location location for the last 5 minutes:
$ ngc fleet-command metric query 'from(bucket:"alert") |> range(start:-5m) |> filter(fn: (r) => r._measurement == "disk" and r.location == "demo-location" and exists r.alertstatus)'
Refer to Query data with Flux for information about the query syntax. For the latest information on the metric command and options, refer to the NGC CLI documentation.
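Because these alert queries share one shape, they can be generated from a few parameters. The sketch below builds the Flux string for a given measurement, time window, and optional location. The function `build_alert_query` is hypothetical, and filtering on `r.location` is an assumption based on the `location` key shown in the partial output above.

```shell
#!/bin/sh
# Sketch: build a Flux query for alert records.
# Usage: build_alert_query <cpu|mem|disk> <minutes> [location]
# Assumption: the location is stored in a column named "location".

build_alert_query() (
    measurement=$1
    minutes=$2
    location=${3:-}
    filter="r._measurement == \"$measurement\""
    if [ -n "$location" ]; then
        filter="$filter and r.location == \"$location\""
    fi
    printf 'from(bucket:"alert") |> range(start:-%sm) |> filter(fn: (r) => %s and exists r.alertstatus)\n' \
        "$minutes" "$filter"
)
```

The result can be passed to the CLI as a single argument, for example `ngc fleet-command metric query "$(build_alert_query disk 5 demo-location)"`.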