sharp_am Telemetry
When used with Unified Fabric Manager (UFM), sharp_am publishes statistical data through an HTTP endpoint in CSV, Prometheus, or JSON format.
sharp_am generates this data at a fixed interval (recommended: every 60 seconds), whether or not anyone is requesting it. Polling more often than that interval simply returns the same data, so it’s advisable to retrieve it at roughly the interval configured for sharp_am data updates.
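Because repeated polls inside one update interval return identical data, a client can skip any snapshot whose generation timestamp has not advanced. The following is a minimal sketch, assuming the snapshot has already been parsed into a dictionary; the key name `timestamp` is a hypothetical placeholder for the "data generated" field and should be replaced with the actual field name in your output.

```python
from typing import Optional

# Hypothetical key for the "data generated" Unix timestamp (seconds);
# substitute the real field name from your sharp_am output.
GENERATED_KEY = "timestamp"


class SnapshotFilter:
    """Remembers the last generation timestamp seen and flags fresh snapshots."""

    def __init__(self) -> None:
        self.last_generated: Optional[int] = None

    def is_new(self, snapshot: dict) -> bool:
        generated = int(snapshot[GENERATED_KEY])
        if self.last_generated is not None and generated <= self.last_generated:
            return False  # same (or older) data than a previous poll
        self.last_generated = generated
        return True
```

A poller would call `is_new()` on each response and only store or forward the snapshot when it returns `True`.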
The published data fields include the following:
Field Name | Description |
| Hostname of the server running sharp_am. |
| Unix timestamp (in seconds) indicating when data was generated; independent of request time. |
| Unix timestamp (in milliseconds) showing when data was requested. |
| Total number of currently active SHARP jobs. |
| Active SHARP jobs specifically requesting SAT rather than just LLT. |
| Aggregation nodes (switches) in an invalid state and excluded from resource allocation. |
The data includes histogram fields, such as active_jobs_num_hcas_histogram_bucket_X, representing active jobs based on the number of HCAs each job serves. Each bucket corresponds to a range of HCAs, with the bucket labeled _infinity covering jobs with 1025 or more HCAs.
Similarly, trees_level_histogram_bucket_X fields provide a histogram of active jobs by SHARP tree level. For instance, a job using HCAs connected to the same leaf switch (requiring only one level) would be counted in trees_level_histogram_bucket_0.
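The histogram fields can be gathered into an ordered list by matching on their common prefix. This is a sketch only: it assumes the snapshot is a flat dictionary of field name to value, that finite bucket labels are integers, and that the open-ended bucket's label is `infinity`. The bucket labels in the usage note (32, 1024) are illustrative, not a statement of the actual bucket boundaries.

```python
def collect_histogram(snapshot: dict, prefix: str) -> list:
    """Return (bucket_label, count) pairs for fields that start with `prefix`.

    Numeric labels are sorted in ascending order; non-numeric labels
    (such as "infinity") are placed last.
    """
    buckets = []
    for name, value in snapshot.items():
        if name.startswith(prefix):
            label = name[len(prefix):]
            buckets.append((label, int(value)))

    def sort_key(item):
        label = item[0]
        return (0, int(label)) if label.isdigit() else (1, 0)

    return sorted(buckets, key=sort_key)
```

For example, calling `collect_histogram(snapshot, "active_jobs_num_hcas_histogram_bucket_")` on a snapshot containing buckets labeled `32`, `1024`, and `infinity` returns them in that order, ready for plotting or for summing into a total.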
Historical Data Fields
In addition to current metrics, sharp_am also provides historical statistics:
Field Name | Description |
history_starting_timestamp | Start time for historical data collection, which resets on restart or failover. |
history_denied_reservations | Count of denied reservation requests, which may indicate configuration issues. |
history_denied_jobs_by_reservations | Count of job requests denied due to mismatched reservations. |
history_denied_jobs_by_resource_limit | Count of job denials due to insufficient resources, potentially due to disconnected or invalid switches. |
history_jobs_ended_due_to_client_failure | Number of jobs that ended due to client-side failure. |
history_jobs_ended_due_to_fatal_sharp_error | Number of jobs that ended due to switch failure or link error. |
history_jobs_ended_successfully | Number of jobs completed without issues. |
history_ended_jobs_duration_in_hours_histogram_bucket_X | Job durations (in hours) of completed jobs, segmented into histogram buckets. |
For example, a job active for less than one hour would fall under history_ended_jobs_duration_in_hours_histogram_bucket_1, while one running for six days (144 hours) would be counted in history_ended_jobs_duration_in_hours_histogram_bucket_168.
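The history counters lend themselves to a quick health summary. The sketch below totals the three history_jobs_ended_* counters and derives a success ratio. It assumes the snapshot is a flat dictionary with integer-like values, and that these three counters together account for all ended jobs, which is not confirmed here.

```python
# Short names mapped to the history counters listed in the table above.
ENDED_FIELDS = {
    "client_failure": "history_jobs_ended_due_to_client_failure",
    "fatal_sharp_error": "history_jobs_ended_due_to_fatal_sharp_error",
    "success": "history_jobs_ended_successfully",
}


def ended_job_summary(snapshot: dict) -> dict:
    """Summarize ended jobs by outcome; missing counters count as zero."""
    counts = {key: int(snapshot.get(field, 0)) for key, field in ENDED_FIELDS.items()}
    total = sum(counts.values())
    summary = dict(counts)
    summary["total"] = total
    # The ratio is undefined until at least one job has ended.
    summary["success_ratio"] = counts["success"] / total if total else None
    return summary
```

Since historical data resets on restart or failover, compare summaries only between snapshots that share the same history_starting_timestamp.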
Fetching Data
To retrieve this data, use port 9002 (if configured as default) and one of the following endpoints:
Endpoint | Response Format |
| CSV format |
| JSON format |
| Prometheus format |
Example of a JSON data request:
curl --silent http://localhost:9002/json/fset/sharp_am | jq
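The same request can be made from a script. The following is a rough Python equivalent of the curl command above, using only the standard library. The host and port defaults mirror that example; the function names are illustrative, and because the layout of the returned JSON is not described here, the parsed document is returned untouched.

```python
import json
import urllib.request


def sharp_am_json_url(host: str = "localhost", port: int = 9002) -> str:
    """Build the JSON endpoint URL used in the curl example."""
    return f"http://{host}:{port}/json/fset/sharp_am"


def fetch_sharp_am_json(host: str = "localhost", port: int = 9002, timeout: float = 5.0):
    """Fetch and parse the sharp_am JSON document (performs a network request)."""
    url = sharp_am_json_url(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_sharp_am_json()` on the UFM host returns the same document that the curl command prints; a connection error is raised if telemetry has not been enabled or the port differs.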
Enabling UFM Configuration
SHARP telemetry is not enabled by default; enabling it requires configuration changes in UFM. Contact NVIDIA Support for assistance with configuring UFM to support SHARP telemetry.