sharp_am Telemetry

When using Unified Fabric Manager (UFM), sharp_am publishes statistical data, accessible through an HTTP endpoint in CSV, Prometheus, or JSON formats.

sharp_am generates this data at a fixed interval (recommended: every 60 seconds), whether or not anyone requests it. Polling more often than that interval therefore returns the same data, so retrieve it at roughly the same interval as the one configured for sharp_am data updates.
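Because the data is regenerated on a schedule, a client can skip duplicate responses by comparing the `metadata_timestamp` field (described in the table below) between polls. The following is a minimal sketch, assuming the JSON response decodes to a flat mapping of field names to values; that layout, the helper name `is_new_snapshot`, and the sample values are illustrative assumptions, not documented sharp_am behavior.

```python
from typing import Any, Mapping, Optional


def is_new_snapshot(sample: Mapping[str, Any], last_seen: Optional[float]) -> bool:
    """Return True if `sample` was generated after the last one we processed.

    Assumes `sample` is a flat mapping that contains `metadata_timestamp`
    (Unix seconds, set when sharp_am generated the data).
    """
    generated_at = sample["metadata_timestamp"]
    return last_seen is None or generated_at > last_seen


# Illustrative polls: two hit the same generation cycle, one hits the next.
polls = [
    {"metadata_timestamp": 1700000000, "active_jobs": 3},
    {"metadata_timestamp": 1700000000, "active_jobs": 3},
    {"metadata_timestamp": 1700000060, "active_jobs": 4},
]

last_seen = None
fresh = []
for sample in polls:
    if is_new_snapshot(sample, last_seen):
        fresh.append(sample)
        last_seen = sample["metadata_timestamp"]

print(len(fresh))  # 2
```

Only the first and third polls are kept, since the second carries the same generation time as the first.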

The published data fields include the following:

Field Name	Description
`metadata_host`	Hostname of the server running sharp_am.
`metadata_timestamp`	Unix timestamp (in seconds) indicating when data was generated; independent of request time.
`timestamp`	Unix timestamp (in milliseconds) showing when data was requested.
`active_jobs`	Total number of currently active SHARP jobs.
`active_sat_jobs`	Number of active SHARP jobs that requested streaming aggregation trees (SAT) rather than only low-latency trees (LLT).
`agg_nodes_in_invalid_state`	Number of aggregation nodes (switches) that are in an invalid state and therefore excluded from resource allocation.
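The two timestamp fields use different units, which is easy to overlook when computing how old a response is. The sketch below converts `timestamp` from milliseconds to seconds before subtracting `metadata_timestamp`. It assumes the decoded response is a flat mapping of field names to values; the helper name `data_age_seconds`, the hostname, and all sample values are hypothetical.

```python
from typing import Any, Mapping


def data_age_seconds(sample: Mapping[str, Any]) -> float:
    """Seconds between data generation and the request that returned it.

    `timestamp` is in milliseconds and `metadata_timestamp` is in seconds,
    so the former is converted before subtracting.
    """
    requested_at = sample["timestamp"] / 1000.0
    generated_at = sample["metadata_timestamp"]
    return requested_at - generated_at


# Illustrative values only, not real sharp_am output.
sample = {
    "metadata_host": "ufm-host-01",  # hypothetical hostname
    "metadata_timestamp": 1700000000,
    "timestamp": 1700000042500,
    "active_jobs": 3,
}

print(data_age_seconds(sample))  # 42.5
```

An age close to the configured update interval suggests the next poll will return new data.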

The data also includes histogram fields. The active_jobs_num_hcas_histogram_bucket_X fields count active jobs by the number of HCAs each job serves: each bucket corresponds to a range of HCA counts, and the bucket labeled _infinity covers jobs with 1025 or more HCAs.

Similarly, trees_level_histogram_bucket_X fields provide a histogram of active jobs by SHARP tree level. For instance, a job using HCAs connected to the same leaf switch (requiring only one level) would be counted in trees_level_histogram_bucket_0.
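Both histograms follow the same naming pattern, so one helper can gather either of them from a response. This is a rough sketch that assumes a flat mapping of field names to values; the helper name `extract_histogram` is hypothetical, and apart from `0` and `infinity` the bucket labels in the sample are made up for illustration.

```python
from typing import Any, Dict, Mapping, Tuple


def extract_histogram(sample: Mapping[str, Any], prefix: str) -> Dict[str, int]:
    """Collect `<prefix><label>` fields into an ordered {label: count} dict.

    Numeric labels are sorted ascending; non-numeric labels such as
    `infinity` are placed last.
    """
    buckets = {
        name[len(prefix):]: int(value)
        for name, value in sample.items()
        if name.startswith(prefix)
    }

    def sort_key(label: str) -> Tuple[int, int]:
        return (0, int(label)) if label.isdigit() else (1, 0)

    return {label: buckets[label] for label in sorted(buckets, key=sort_key)}


# Illustrative values only, not real sharp_am output.
sample = {
    "active_jobs": 5,
    "active_jobs_num_hcas_histogram_bucket_infinity": 1,
    "active_jobs_num_hcas_histogram_bucket_64": 1,  # hypothetical label
    "active_jobs_num_hcas_histogram_bucket_8": 3,   # hypothetical label
    "trees_level_histogram_bucket_0": 2,
    "trees_level_histogram_bucket_1": 3,            # hypothetical label
}

print(extract_histogram(sample, "active_jobs_num_hcas_histogram_bucket_"))
# {'8': 3, '64': 1, 'infinity': 1}
print(extract_histogram(sample, "trees_level_histogram_bucket_"))
# {'0': 2, '1': 3}
```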

Historical Data Fields

In addition to current metrics, sharp_am also provides historical statistics:

Field Name	Description
`history_starting_timestamp`	Start time for historical data collection, which resets on restart or failover.
`history_denied_reservations`	Count of denied reservation requests, which may indicate configuration issues.
`history_denied_jobs_by_reservations`	Count of job requests denied due to mismatched reservations.
`history_denied_jobs_by_resource_limit`	Count of job denials due to insufficient resources, potentially due to disconnected or invalid switches.
`history_jobs_ended_due_to_client_failure`	Number of jobs that ended due to client-side failure.
`history_jobs_ended_due_to_fatal_sharp_error`	Number of jobs that ended due to switch failure or link error.
`history_jobs_ended_successfully`	Number of jobs completed without issues.
`history_ended_jobs_duration_in_hours_histogram_bucket_X`	Job durations (in hours) of completed jobs, segmented into histogram buckets.

For example, a job that ran for less than one hour is counted in history_ended_jobs_duration_in_hours_histogram_bucket_1, while a job that ran for six days (144 hours) is counted in history_ended_jobs_duration_in_hours_histogram_bucket_168.
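Since the historical data collection restarts on restart or failover, a client that tracks changes between polls may want to check `history_starting_timestamp` first. The sketch below is one possible approach. It assumes that the `history_*` fields are cumulative counters since `history_starting_timestamp`, that they start again from zero when that timestamp changes, and that the response is a flat mapping; none of this is confirmed here. The name `history_delta` and the sample values are hypothetical.

```python
from typing import Any, Dict, Mapping, Optional

HISTORY_PREFIX = "history_"
START_FIELD = "history_starting_timestamp"


def history_delta(
    prev: Optional[Mapping[str, Any]], curr: Mapping[str, Any]
) -> Dict[str, int]:
    """Return the increase of each `history_*` counter between two snapshots.

    If there is no previous snapshot, or `history_starting_timestamp` changed
    (assumed to mean the counters restarted), the current values are returned.
    """
    counters = {
        name: int(value)
        for name, value in curr.items()
        if name.startswith(HISTORY_PREFIX) and name != START_FIELD
    }
    if prev is None or prev.get(START_FIELD) != curr.get(START_FIELD):
        return counters
    return {name: value - int(prev.get(name, 0)) for name, value in counters.items()}


# Illustrative values only, not real sharp_am output.
first = {START_FIELD: 1700000000, "history_jobs_ended_successfully": 10,
         "history_denied_reservations": 1}
second = {START_FIELD: 1700000000, "history_jobs_ended_successfully": 14,
          "history_denied_reservations": 1}
after_restart = {START_FIELD: 1700003600, "history_jobs_ended_successfully": 2,
                 "history_denied_reservations": 0}

print(history_delta(first, second))
# {'history_jobs_ended_successfully': 4, 'history_denied_reservations': 0}
print(history_delta(second, after_restart))
# {'history_jobs_ended_successfully': 2, 'history_denied_reservations': 0}
```

Without the start-time check, the second comparison would report a negative number of successful jobs.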

Fetching Data

To retrieve this data, send an HTTP request to port 9002 (the default, unless configured otherwise) using one of the following endpoints:

Endpoint	Response Format
`/csv/fset/sharp_am`	CSV format
`/json/fset/sharp_am`	JSON format
`/fset/sharp_am`	Prometheus format
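The endpoint table can be captured in a small lookup so that client code selects a format by name. This is only a sketch: the helper name `endpoint_url` and the format keys are hypothetical, and the `http` scheme, `localhost` host, and port 9002 defaults are taken from the example request on this page.

```python
ENDPOINT_PATHS = {
    "csv": "/csv/fset/sharp_am",
    "json": "/json/fset/sharp_am",
    "prometheus": "/fset/sharp_am",
}


def endpoint_url(fmt: str, host: str = "localhost", port: int = 9002) -> str:
    """Build the telemetry URL for one of the three response formats."""
    try:
        path = ENDPOINT_PATHS[fmt]
    except KeyError:
        raise ValueError(
            f"unknown format {fmt!r}; expected one of {sorted(ENDPOINT_PATHS)}"
        ) from None
    return f"http://{host}:{port}{path}"


print(endpoint_url("json"))        # http://localhost:9002/json/fset/sharp_am
print(endpoint_url("prometheus"))  # http://localhost:9002/fset/sharp_am
```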

Example of a JSON data request:

    curl --silent http://localhost:9002/json/fset/sharp_am | jq .
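The same request can be made from a script using only the Python standard library. This is a minimal sketch, not an official client: the function names `fetch_json` and `pretty` and the 5-second timeout are assumptions, and the default URL is the one from the curl example.

```python
import json
import urllib.request

DEFAULT_URL = "http://localhost:9002/json/fset/sharp_am"  # URL from the curl example


def fetch_json(url: str = DEFAULT_URL, timeout: float = 5.0):
    """GET `url` and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pretty(data) -> str:
    """Format decoded JSON with a 2-space indent, similar to jq's default output."""
    return json.dumps(data, indent=2)


# Usage, on a host where the telemetry endpoint is reachable:
#     print(pretty(fetch_json()))
```

If the endpoint is unreachable, `fetch_json` raises `urllib.error.URLError`, which a caller would normally catch and retry on the next polling cycle.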

Enabling UFM Configuration

SHARP telemetry is disabled by default and must be enabled through UFM configuration changes. Contact NVIDIA Support for assistance in configuring UFM to support SHARP telemetry.