SHARP in Public Cloud
Deploying SHARP in a public cloud environment requires special considerations to ensure fair resource usage across tenants and to prevent one tenant from disrupting another.
Two key measures should be taken:
Setting a Secret AM Key
This ensures that only sharp_am is authorized to perform configuration changes. Without this safeguard, a tenant’s application could potentially impersonate sharp_am and alter fabric settings.
Using PKEYs and the SHARP Reservation API
PKEYs are used (independently of SHARP) to isolate traffic between tenants. SHARP integrates with this mechanism via its reservation API to enforce resource separation and access control aligned with the PKEY setup.
The Secret AM Key is a 64-bit value that must be defined by the cloud administrator and set in the sharp_am configuration file.
This key must remain strictly confidential and must not be shared with or exposed to any cloud tenant.
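As a minimal sketch of how an administrator might produce such a value, the snippet below draws 64 random bits from the operating system's secure random source. The hexadecimal formatting is an assumption made for illustration; the format that sharp_am expects and the name of the configuration parameter are not stated here and should be taken from the sharp_am documentation.

```python
import secrets


def generate_am_key() -> str:
    """Return a random 64-bit value as a 16-digit hexadecimal string.

    The "0x" prefix and zero padding are illustrative choices, not a
    format mandated by sharp_am.
    """
    # secrets uses the OS cryptographic random source, which is the
    # appropriate choice for key material (unlike the random module).
    value = secrets.randbits(64)
    return f"0x{value:016x}"


am_key = generate_am_key()
```

The generated value should be written only to a location readable by the cloud administrator, never to a shared log or a tenant-visible file.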
Once configured, sharp_am programs all connected switches with this key. From that point forward, switches accept only SHARP-related MADs that include the correct Secret AM Key.
MADs originating from libsharp use a separate key known as the SHARP Job Key. This key is dynamically generated per job and distributed by sharp_am to the corresponding libsharp instance, ensuring isolation between tenants and preventing one tenant from sending MADs on behalf of another.
If a MAD is received by a switch with an incorrect key, it is silently dropped, and the switch emits an AMKeyViolation trap (trap number 257) to sharp_am. Cloud administrators should monitor event logs for such traps. A high volume of these traps may indicate a brute-force attempt by a tenant to discover the key.
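One way to turn "a high volume of these traps" into an alert is a sliding-window count. The sketch below assumes the trap events have already been extracted from the event log as (timestamp in seconds, trap number) pairs; the log format, the window length, and the threshold are all hypothetical and would need tuning for a real fabric.

```python
from collections import deque

AM_KEY_VIOLATION_TRAP = 257  # trap number given in the text above


def violation_bursts(events, window_seconds=60.0, threshold=100):
    """Return timestamps at which the trailing window held >= threshold traps.

    events: iterable of (timestamp_seconds, trap_number) pairs.
    """
    # Keep only AMKeyViolation traps and process them in time order.
    times = sorted(t for t, trap in events if trap == AM_KEY_VIOLATION_TRAP)
    recent = deque()
    alerts = []
    for t in times:
        recent.append(t)
        # Drop traps that fell out of the trailing window.
        while t - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            alerts.append(t)
    return alerts


sample = [(0, 257), (1, 257), (2, 257), (3, 257), (4, 128), (100, 257), (101, 257)]
bursts = violation_bursts(sample, window_seconds=10, threshold=3)
```

With the sample data, the third and fourth traps each complete a group of at least three within ten seconds, so `bursts` contains the timestamps 2 and 3; the later pair at 100 and 101 stays below the threshold.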
SHARP reservations enable logical grouping and isolation of nodes - by tenant, job, application, or any custom-defined grouping. In public cloud scenarios, this mechanism is typically used to define and manage tenants.
Each reservation can be associated with a PKEY, ensuring that tenant applications are logically isolated. This prevents one tenant’s SHARP jobs from interfering with another’s, and gives administrators fine-grained control over SHARP usage.
To enable SHARP reservations, the following prerequisites must be met:
sharp_am must be running inside UFM.
In the UFM configuration file gv.cfg, set:
enable_sharp_allocation = True
Once reservation mode is enabled, compute nodes are not permitted to initiate SHARP jobs unless they belong to a defined reservation. These reservations are created and managed using the UFM REST API, which supports:
Creating, updating, and deleting reservations
Associating reservations with PKEYs
Setting SHARP resource limits (e.g., number of trees) per tenant
This mechanism gives fabric administrators the ability to define per-tenant SHARP entitlements and enforce strict isolation.
For full details, refer to the NVIDIA UFM Enterprise REST API Guide.
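The admission rule described above can be pictured with a small in-memory model. This is an illustration only, not the UFM REST API: the `Reservation` class, its field names, and the requirement that all nodes of a job fall in the same reservation are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reservation:
    """Hypothetical model of a SHARP reservation."""
    name: str
    pkey: int
    node_guids: set = field(default_factory=set)
    max_trees: Optional[int] = None  # None means "use the default limit"


def reservation_for_job(reservations, job_guids):
    """Return the reservation that contains every node of the job, else None."""
    job_nodes = set(job_guids)
    if not job_nodes:
        return None
    for res in reservations:
        if job_nodes <= res.node_guids:
            return res
    return None


tenant_a = Reservation("tenant-a", 0x8001, {"guid-1", "guid-2"}, max_trees=4)
tenant_b = Reservation("tenant-b", 0x8002, {"guid-3", "guid-4"})
reservations = [tenant_a, tenant_b]

allowed = reservation_for_job(reservations, ["guid-1", "guid-2"])
mixed = reservation_for_job(reservations, ["guid-1", "guid-3"])
unknown = reservation_for_job(reservations, ["guid-9"])
```

In this model a job whose nodes all belong to `tenant-a` is admitted, while a job that spans two tenants or includes a node outside any reservation is rejected.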
SHARP Resource Limits
Enforcing SHARP resource limits per tenant is essential in multi-tenant environments to maintain fairness and avoid resource contention.
By default:
A tenant can run multiple SHARP jobs concurrently.
No two SHARP jobs may share the same HCA.
There is no global limit on the number of jobs a tenant may launch. However, since each SHARP job requires at least 2 HCAs, and each HCA may only serve one job, the effective job limit per tenant is approximately half the number of available HCAs.
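The arithmetic behind this effective limit is small enough to write down. The function below is a sketch: `jobs_per_hca` and `min_hcas_per_job` are hypothetical parameter names, with defaults matching the behavior described above, and the result is an upper bound rather than a guarantee that the fabric can schedule that many jobs.

```python
def max_concurrent_jobs(num_hcas: int, jobs_per_hca: int = 1,
                        min_hcas_per_job: int = 2) -> int:
    """Upper bound on concurrent SHARP jobs for a tenant.

    Each job occupies at least min_hcas_per_job HCAs, and each HCA can
    serve at most jobs_per_hca jobs at the same time.
    """
    if num_hcas < min_hcas_per_job:
        return 0  # not enough HCAs to form even one job
    return (num_hcas * jobs_per_hca) // min_hcas_per_job


default_limit = max_concurrent_jobs(16)
odd_limit = max_concurrent_jobs(7)
```

For a tenant with 16 HCAs the default bound is 8 jobs; with 7 HCAs it is 3, since the seventh HCA cannot form a job on its own.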
These default constraints ensure that in a non-blocking topology, no tenant can monopolize resources or degrade performance for others.
Administrators can override defaults by:
Setting a global job limit for all tenants via a configuration parameter.
Defining per-tenant limits using the UFM REST API.
Adjusting the number of SHARP jobs allowed per HCA, via a global configuration parameter.
These controls provide the flexibility to tailor SHARP behavior to specific cloud tenancy models and fairness policies.
Automatic Synchronization of SHARP Reservations and PKeys
By default, the cloud administrator must use the SHARP Reservation REST API to define reservations and ensure they align with the PKeys.
Using the Reservation REST API, the administrator can control which tenants are allowed to use SHARP (for example, creating a PKey without a corresponding reservation means SHARP will not be available for the tenant) and define the resource limits for each tenant.
However, if the cloud administrator wishes to provide SHARP capability to all tenants and use the default resource limits, the use of the Reservation REST API can be avoided by enabling automatic synchronization between SHARP reservations and PKeys.
In this mode, reservations are automatically created, updated, and deleted based on the PKey definitions, eliminating the need to manually invoke the SHARP Reservation REST API and simplifying system management.
To enable automatic synchronization, update the configuration file:
conf/sharp/sharp_am.cfg
reservation_auto_by_pkeys = TRUE
A restart of sharp_am is required for the change to take effect.
Limitation note:
A compute node HCA can have full PKey membership in multiple PKeys but can belong to only one SHARP reservation.
If an HCA has full membership in multiple PKeys, SHARP will arbitrarily select one PKey to create a reservation for.
In multi-tenant cloud environments, such configurations are uncommon, as HCAs typically have full membership in only one PKey.
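Before enabling automatic synchronization, it may be useful to check whether any HCA falls into this case. The sketch below assumes the partition configuration has already been exported as (HCA GUID, PKey, is-full-member) tuples; that input shape and the function name are hypothetical, and how the data is obtained depends on the deployment.

```python
from collections import defaultdict


def hcas_with_multiple_full_pkeys(memberships):
    """Map each HCA GUID to its PKeys when it is a full member of more than one.

    memberships: iterable of (hca_guid, pkey, is_full_member) tuples.
    """
    full = defaultdict(set)
    for guid, pkey, is_full_member in memberships:
        # Limited (partial) memberships are ignored here, since the
        # limitation above concerns full membership only.
        if is_full_member:
            full[guid].add(pkey)
    return {guid: sorted(pkeys) for guid, pkeys in full.items() if len(pkeys) > 1}


sample = [
    ("guid-1", 0x8001, True),
    ("guid-1", 0x8002, True),
    ("guid-2", 0x8001, True),
    ("guid-2", 0x8002, False),
]
ambiguous = hcas_with_multiple_full_pkeys(sample)
```

In the sample, only `guid-1` is reported, because `guid-2` has full membership in a single PKey and limited membership in the other.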