NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.10.2

sharp_am Telemetry

When using Unified Fabric Manager (UFM), sharp_am publishes statistical data, accessible through an HTTP endpoint in CSV, Prometheus, or JSON formats.

sharp_am generates this data at fixed intervals (recommended: every 60 seconds), regardless of whether anyone is requesting it. Polling more often than the update interval therefore returns the same data, so retrieve the data at an interval similar to the one configured for sharp_am data updates.
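The polling guidance above can be sketched in Python as follows. This is a minimal illustration, not part of sharp_am or UFM: `poll_new_snapshots` and the caller-supplied `fetch` function are hypothetical names, and each sample is assumed to be a flat mapping that contains the `metadata_timestamp` field described in the table below. A sample is passed on only when its `metadata_timestamp` differs from that of the previous sample.

```python
import time
from typing import Callable, Iterator, Mapping, Optional


def poll_new_snapshots(
    fetch: Callable[[], Mapping[str, float]],
    interval_s: float = 60.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Mapping[str, float]]:
    """Yield a sample only when sharp_am has generated new data.

    `fetch` is a hypothetical caller-supplied function that returns one
    sample as a flat mapping (an assumption about the response shape).
    """
    last_generated: object = object()  # sentinel: the first sample is always new
    polls = 0
    while max_polls is None or polls < max_polls:
        sample = fetch()
        polls += 1
        generated = sample.get("metadata_timestamp")
        # An unchanged generation time means sharp_am has not refreshed yet.
        if generated != last_generated:
            last_generated = generated
            yield sample
        # Skip the sleep after the final poll so bounded runs return promptly.
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
```

Setting `interval_s` to the update interval configured for sharp_am keeps the number of duplicate responses low; the timestamp check only guards against the two clocks drifting apart.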

The published data fields include the following:

| Field Name | Description |
|---|---|
| metadata_host | Hostname of the server running sharp_am. |
| metadata_timestamp | Unix timestamp (in seconds) indicating when data was generated; independent of request time. |
| timestamp | Unix timestamp (in milliseconds) showing when data was requested. |
| active_jobs | Total number of currently active SHARP jobs. |
| active_sat_jobs | Active SHARP jobs specifically requesting SAT rather than just LLT. |
| agg_nodes_in_invalid_state | Aggregation nodes (switches) in an invalid state and excluded from resource allocation. |

The data also includes histogram fields. The active_jobs_num_hcas_histogram_bucket_X fields count active jobs by the number of HCAs each job serves. Each bucket corresponds to a range of HCAs, and the bucket labeled _infinity covers jobs with 1025 or more HCAs.

Similarly, trees_level_histogram_bucket_X fields provide a histogram of active jobs by SHARP tree level. For instance, a job using HCAs connected to the same leaf switch (requiring only one level) would be counted in trees_level_histogram_bucket_0.
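As an illustration of how the histogram fields might be consumed, the sketch below groups `..._histogram_bucket_X` fields into one dictionary per histogram. It assumes a sample is a flat mapping of field names to numbers. The helper name `extract_histogram`, the bucket label `8`, the exact spelling of the infinity bucket field, and all values are made up for the example and are not real sharp_am output.

```python
from typing import Dict, Mapping


def extract_histogram(sample: Mapping[str, float], prefix: str) -> Dict[str, float]:
    """Collect `<prefix>_histogram_bucket_<X>` fields into {X: count}."""
    marker = prefix + "_histogram_bucket_"
    return {
        # Strip the common prefix; tolerate an extra underscore before the label.
        name[len(marker):].lstrip("_"): value
        for name, value in sample.items()
        if name.startswith(marker)
    }


# Illustrative values only; bucket labels other than 0 and infinity are assumed.
sample = {
    "active_jobs": 3,
    "active_jobs_num_hcas_histogram_bucket_8": 2,
    "active_jobs_num_hcas_histogram_bucket_infinity": 1,
    "trees_level_histogram_bucket_0": 3,
}
hca_hist = extract_histogram(sample, "active_jobs_num_hcas")
level_hist = extract_histogram(sample, "trees_level")
```

Because both histograms are described as counting active jobs, comparing each histogram's total with `active_jobs` is one possible sanity check on a collected sample.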

Historical Data Fields

In addition to current metrics, sharp_am also provides historical statistics:

| Field Name | Description |
|---|---|
| history_starting_timestamp | Start time for historical data collection, which resets on restart or failover. |
| history_denied_reservations | Count of denied reservation requests, which may indicate configuration issues. |
| history_denied_jobs_by_reservations | Count of job requests denied due to mismatched reservations. |
| history_denied_jobs_by_resource_limit | Count of job denials due to insufficient resources, potentially due to disconnected or invalid switches. |
| history_jobs_ended_due_to_client_failure | Number of jobs that ended due to client-side failure. |
| history_jobs_ended_due_to_fatal_sharp_error | Number of jobs that ended due to switch failure or link error. |
| history_jobs_ended_successfully | Number of jobs completed without issues. |
| history_ended_jobs_duration_in_hours_histogram_bucket_X | Job durations (in hours) of completed jobs, segmented into histogram buckets. |

For example, a job that was active for less than one hour is counted in history_ended_jobs_duration_in_hours_histogram_bucket_1, while a job that ran for six days (144 hours) is counted in history_ended_jobs_duration_in_hours_histogram_bucket_168.
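Because history collection restarts on a sharp_am restart or failover, a consumer that tracks changes in the historical counters may want to check `history_starting_timestamp` before subtracting two samples. The sketch below shows one way to do that. It assumes flat samples and assumes that the counters start over whenever `history_starting_timestamp` changes, which is an inference from the field description rather than documented behavior. `history_delta` is a hypothetical helper, and the values are invented.

```python
from typing import Dict, Mapping


def history_delta(
    previous: Mapping[str, float], current: Mapping[str, float]
) -> Dict[str, float]:
    """Return the increase of each history_* counter between two samples.

    If history_starting_timestamp changed, the history is assumed to have
    restarted, so the current values are returned unchanged.
    """
    start_key = "history_starting_timestamp"
    restarted = previous.get(start_key) != current.get(start_key)
    delta: Dict[str, float] = {}
    for name, value in current.items():
        if not name.startswith("history_") or name == start_key:
            continue
        delta[name] = value if restarted else value - previous.get(name, 0)
    return delta


# Illustrative values only; not real sharp_am output.
before = {"history_starting_timestamp": 1000, "history_jobs_ended_successfully": 5}
after = {"history_starting_timestamp": 1000, "history_jobs_ended_successfully": 8}
after_restart = {"history_starting_timestamp": 2000, "history_jobs_ended_successfully": 2}
```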

Fetching Data

To retrieve this data, send an HTTP request to port 9002 (the default port, unless configured otherwise) using one of the following endpoints:

| Endpoint | Response Format |
|---|---|
| /csv/fset/sharp_am | CSV format |
| /json/fset/sharp_am | JSON format |
| /fset/sharp_am | Prometheus format |

Example of a JSON data request:

curl --silent http://localhost:9002/json/fset/sharp_am | jq
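The same request can be made from a script. The following Python sketch uses only the standard library. The endpoint paths and the port come from the table above, while `ENDPOINTS`, `build_url`, and `fetch_json` are hypothetical names introduced for the example. The structure of the decoded JSON is not described here, so the function returns whatever the endpoint sends without interpreting it.

```python
import json
import urllib.request

# Endpoint paths from the table above, keyed by response format.
ENDPOINTS = {
    "csv": "/csv/fset/sharp_am",
    "json": "/json/fset/sharp_am",
    "prometheus": "/fset/sharp_am",
}


def build_url(fmt: str, host: str = "localhost", port: int = 9002) -> str:
    """Build the telemetry URL for one of the documented formats."""
    return f"http://{host}:{port}{ENDPOINTS[fmt]}"


def fetch_json(host: str = "localhost", port: int = 9002, timeout_s: float = 5.0):
    """Fetch and decode the JSON endpoint; requires a reachable telemetry port."""
    with urllib.request.urlopen(build_url("json", host, port), timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json()` with no arguments targets the same URL as the curl command above.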

Enabling UFM Configuration

SHARP telemetry is disabled by default, and enabling it requires configuration changes. Contact NVIDIA Support for assistance with configuring UFM to support SHARP telemetry.

© Copyright 2025, NVIDIA. Last updated on Mar 16, 2025.