NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.12.0

SHARP in Public Cloud

Deploying SHARP in a public cloud environment requires special considerations to ensure fair resource usage across tenants and to prevent one tenant from disrupting another.

Two key measures should be taken:

  1. Setting a Secret AM Key

    This ensures that only sharp_am is authorized to perform configuration changes. Without this safeguard, a tenant’s application could potentially impersonate sharp_am and alter fabric settings.

  2. Using PKEYs and the SHARP Reservation API

    PKEYs are used (independently of SHARP) to isolate traffic between tenants. SHARP integrates with this mechanism via its reservation API to enforce resource separation and access control aligned with the PKEY setup.

The Secret AM Key is a 64-bit value that must be defined by the cloud administrator and set in the sharp_am configuration file.

This key must remain strictly confidential and must not be shared with or exposed to any cloud tenant.
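As a minimal sketch of how an administrator might produce such a value, the snippet below draws 64 random bits from a cryptographically secure source. The function name `generate_am_key`, the hexadecimal output format, and the choice to reject an all-zero value are assumptions made for illustration; the exact parameter name and accepted format in the sharp_am configuration file are not stated here and should be taken from the sharp_am documentation.

```python
import secrets


def generate_am_key() -> str:
    """Return a random, non-zero 64-bit value as a 0x-prefixed 16-digit hex string.

    Hypothetical helper: the output format is an assumption, not a
    documented sharp_am requirement.
    """
    key = 0
    while key == 0:  # assumption: avoid all-zero in case it is treated as "no key"
        key = secrets.randbits(64)
    return f"0x{key:016x}"


if __name__ == "__main__":
    # Store the result only where tenants cannot read it.
    print(generate_am_key())
```

Using `secrets` rather than `random` matters here because the key is a security credential and must not be predictable.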

Once configured, sharp_am programs all connected switches with this key. From that point forward, switches accept only SHARP-related MADs that include the correct Secret AM Key.

MADs originating from libsharp use a separate key known as the SHARP Job Key. This key is dynamically generated per job and distributed by sharp_am to the corresponding libsharp instance, ensuring isolation between tenants and preventing one tenant from sending MADs on behalf of another.

If a switch receives a MAD carrying an incorrect key, it silently drops the MAD and emits an AMKeyViolation trap (trap number 257) to sharp_am. Cloud administrators should monitor event logs for such traps. A high volume of these traps may indicate a brute-force attempt by a tenant to discover the key.
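One way to turn "a high volume of these traps" into an alert is a sliding-window count. The sketch below is illustrative only: it assumes trap events have already been extracted from the event log as `(timestamp_seconds, message)` pairs sorted by time, and that the message text contains the trap name. The real log layout, the function name `burst_detected`, and suitable window and threshold values are all assumptions to be adapted to the deployment.

```python
from collections import deque
from typing import Deque, Iterable, Tuple

TRAP_NAME = "AMKeyViolation"  # trap name from the text; log line layout is assumed


def burst_detected(
    events: Iterable[Tuple[float, str]], window_s: float, threshold: int
) -> bool:
    """Return True if more than `threshold` matching traps occur within any
    `window_s`-second span. `events` must be sorted by timestamp."""
    recent: Deque[float] = deque()
    for ts, message in events:
        if TRAP_NAME not in message:
            continue  # ignore unrelated log entries
        recent.append(ts)
        # Drop traps that have fallen out of the window ending at `ts`.
        while ts - recent[0] > window_s:
            recent.popleft()
        if len(recent) > threshold:
            return True
    return False
```

For example, calling `burst_detected(events, window_s=60, threshold=100)` would flag more than 100 violations inside any one-minute span.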

SHARP reservations enable logical grouping and isolation of nodes by tenant, job, application, or any custom-defined grouping. In public cloud scenarios, this mechanism is typically used to define and manage tenants.

Each reservation can be associated with a PKEY, ensuring that tenant applications are logically isolated. This prevents one tenant’s SHARP jobs from interfering with another’s, and gives administrators fine-grained control over SHARP usage.

To enable SHARP reservations, the following prerequisites must be met:

  • sharp_am must be running inside UFM.

  • In the UFM configuration file gv.cfg, set:

    enable_sharp_allocation = True
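Before restarting services, it can be useful to confirm that the setting is actually present. The following is a rough sketch that scans the text of gv.cfg for the key shown above. It deliberately ignores section headers, because the section that holds this key is not stated here. The function name `allocation_enabled`, the comment characters it skips, and the case-insensitive comparison against "true" are assumptions.

```python
def allocation_enabled(cfg_text: str) -> bool:
    """Return True if the first `enable_sharp_allocation` entry is set to true.

    Hypothetical helper: scans line by line and does not interpret sections.
    """
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as "[Name]".
        if not line or line.startswith(("#", ";", "[")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "enable_sharp_allocation":
            return value.strip().lower() == "true"
    return False  # key absent: treat reservation mode as not enabled
```

A typical call would read the file contents with `open(path).read()` and pass the result in; the path to gv.cfg depends on the UFM installation.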

Once reservation mode is enabled, compute nodes are not permitted to initiate SHARP jobs unless they belong to a defined reservation. These reservations are created and managed using the UFM REST API, which supports:

  • Creating, updating, and deleting reservations

  • Associating reservations with PKEYs

  • Setting SHARP resource limits (e.g., number of trees) per tenant

This mechanism gives fabric administrators the ability to define per-tenant SHARP entitlements and enforce strict isolation.

For full details, refer to the NVIDIA UFM Enterprise REST API Guide.
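The real request and response formats are defined in that guide. The sketch below does not call the UFM REST API; it only models the rule described above, namely that once reservation mode is enabled a node may start SHARP jobs only if it belongs to a reservation. All class, field, and method names (`Reservation`, `ReservationTable`, `max_trees`, `may_start_job`) are hypothetical, and the PKEY range check assumes a 15-bit base value between 0x0001 and 0x7FFF.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Reservation:
    """Illustrative model of one tenant reservation (field names are assumed)."""
    name: str
    pkey: int
    hosts: List[str]
    max_trees: Optional[int] = None  # None means no per-tenant tree limit


class ReservationTable:
    """In-memory stand-in for the set of reservations managed through UFM."""

    def __init__(self) -> None:
        self._by_name: Dict[str, Reservation] = {}

    def create(self, reservation: Reservation) -> None:
        if reservation.name in self._by_name:
            raise ValueError(f"reservation {reservation.name!r} already exists")
        if not 0x0001 <= reservation.pkey <= 0x7FFF:  # assumed valid PKEY range
            raise ValueError(f"invalid PKEY: {reservation.pkey:#06x}")
        if reservation.max_trees is not None and reservation.max_trees < 0:
            raise ValueError("max_trees must be non-negative")
        self._by_name[reservation.name] = reservation

    def delete(self, name: str) -> None:
        self._by_name.pop(name, None)

    def may_start_job(self, host: str) -> bool:
        """A host may start SHARP jobs only if some reservation contains it."""
        return any(host in r.hosts for r in self._by_name.values())
```

For instance, after `table.create(Reservation("tenant-a", 0x12, ["node01", "node02"], max_trees=4))`, `table.may_start_job("node01")` is true while a node outside every reservation is refused.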

SHARP Resource Limits

Enforcing SHARP resource limits per tenant is essential in multi-tenant environments to maintain fairness and avoid resource contention.

By default:

  1. A tenant can run multiple SHARP jobs concurrently.

  2. No two SHARP jobs may share the same HCA.

  3. There is no global limit on the number of jobs a tenant may launch. However, because each SHARP job requires at least two HCAs and each HCA may serve only one job, the effective job limit per tenant is approximately half the number of HCAs available to that tenant.

These default constraints ensure that in a non-blocking topology, no tenant can monopolize resources or degrade performance for others.
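The "half the number of HCAs" figure follows from simple slot counting, sketched below. With the defaults (one job per HCA, at least two HCAs per job) the bound is the HCA count divided by two, rounded down. The function name and the `jobs_per_hca` parameter are hypothetical; treating a raised jobs-per-HCA setting as a plain per-HCA slot count is an assumption, so the result should be read as an upper bound rather than a guaranteed capacity.

```python
def max_concurrent_jobs(
    num_hcas: int, jobs_per_hca: int = 1, min_hcas_per_job: int = 2
) -> int:
    """Upper bound on concurrent SHARP jobs for one tenant, by slot counting.

    Each HCA offers `jobs_per_hca` slots and each job consumes one slot on
    at least `min_hcas_per_job` distinct HCAs.
    """
    if num_hcas < min_hcas_per_job:
        return 0  # not enough HCAs to form even one job
    return (num_hcas * jobs_per_hca) // min_hcas_per_job
```

Under the defaults, a tenant with 7 HCAs can run at most 3 jobs at once, and a tenant with a single HCA can run none.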

Administrators can override defaults by:

  • Setting a global job limit for all tenants via a configuration parameter.

  • Defining per-tenant limits using the UFM REST API.

  • Adjusting the number of SHARP jobs allowed per HCA, via a global configuration parameter.

These controls provide the flexibility to tailor SHARP behavior to specific cloud tenancy models and fairness policies.

© Copyright 2025, NVIDIA. Last updated on Aug 25, 2025.