SHARP in Public Cloud
Deploying SHARP in a public cloud environment requires special considerations to ensure fair resource usage across tenants and to prevent one tenant from disrupting another.
Two key measures should be taken:
Setting a Secret AM Key
This ensures that only sharp_am is authorized to perform configuration changes. Without this safeguard, a tenant’s application could potentially impersonate sharp_am and alter fabric settings.
Using PKeys
PKeys are used (independently of SHARP) to isolate traffic between tenants. SHARP automatically integrates with this mechanism to enforce resource separation and access control aligned with the PKey setup.
The Secret AM Key is a 64-bit value that must be defined by the cloud administrator and set in the sharp_am configuration file.
This key must remain strictly confidential and must not be shared with or exposed to any cloud tenant.
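As an illustration, a 64-bit key can be drawn from a cryptographically secure random source rather than chosen by hand. The following is a minimal sketch only: the function names are hypothetical, treating zero as a value to avoid is an assumption, and the name of the sharp_am configuration parameter that receives the key is not shown here and must be taken from the sharp_am documentation.

```python
import secrets


def generate_am_key() -> int:
    """Return a random, non-zero 64-bit value from the OS CSPRNG."""
    key = 0
    # Assumption: zero is avoided because an all-zero key is easy to guess.
    while key == 0:
        key = secrets.randbits(64)
    return key


def format_key(key: int) -> str:
    """Render the key as a fixed-width, 16-digit hexadecimal string."""
    return f"0x{key:016x}"
```

Calling `format_key(generate_am_key())` yields a string such as `0x` followed by 16 hexadecimal digits. The generated value should be stored only in places that tenants cannot read.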
Once configured, sharp_am programs all connected switches with this key. From that point forward, switches will only accept SHARP-related MADs that include the correct Secret AM Key.
MADs originating from libsharp use a separate key known as the SHARP Job Key. This key is dynamically generated per job and distributed by sharp_am to the corresponding libsharp instance, ensuring isolation between tenants and preventing one tenant from sending MADs on behalf of another.
If a MAD is received by a switch with an incorrect key, it is silently dropped, and the switch emits an AMKeyViolation trap (trap number 257) to sharp_am. Cloud administrators should monitor event logs for such traps. A high volume of these traps may indicate a brute-force attempt by a tenant to discover the key.
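The monitoring advice above can be sketched as a sliding-window counter over trap events. This example assumes the events have already been parsed into (timestamp in seconds, trap number) pairs sorted by time. How they are extracted from the event log depends on the deployment and is not shown. The function name, window length, and threshold are hypothetical choices, not values defined by SHARP.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

AM_KEY_VIOLATION_TRAP = 257  # trap number stated in the text above


def find_bursts(
    events: Iterable[Tuple[float, int]],
    window_seconds: float = 60.0,
    threshold: int = 10,
) -> Iterator[float]:
    """Yield each timestamp at which the number of AMKeyViolation traps
    seen within the trailing window reaches the threshold.

    `events` must be sorted by timestamp.
    """
    recent = deque()
    for timestamp, trap_number in events:
        if trap_number != AM_KEY_VIOLATION_TRAP:
            continue  # ignore unrelated traps
        recent.append(timestamp)
        # Drop traps that have fallen out of the trailing window.
        while timestamp - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            yield timestamp
```

Any timestamp this generator yields marks a point where the trap rate exceeded the chosen policy and may warrant investigation of the originating tenant.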
SHARP can be configured to automatically treat the PKey definitions as the system tenants.
This ensures that tenants can use SHARP without interfering with one another, and gives administrators fine-grained control over SHARP usage.
To enable SHARP in PKeys support mode, the following prerequisites must be met:
sharp_am must be running inside UFM.
In the UFM configuration file gv.cfg, set: enable_sharp_allocation = True
Once PKeys support mode is enabled, compute nodes are not permitted to initiate SHARP jobs unless they belong to a defined PKey.
The system administrator can control which PKeys are entitled to use SHARP. By default, every defined PKey can use SHARP. The PKey API provides an optional field that specifies whether applications belonging to the PKey may use SHARP.
PKey definitions are created and managed via the UFM GUI and REST API, which support:
Creating, updating, and deleting PKeys.
Defining whether a PKey is allowed to use SHARP.
This mechanism gives fabric administrators the ability to define per-tenant SHARP entitlements and enforce strict isolation.
For full details, refer to the NVIDIA UFM Enterprise REST API Guide.
SHARP Resource Limits
Enforcing SHARP resource limits per tenant is essential in multi-tenant environments to maintain fairness and avoid resource contention.
By default:
A tenant can run multiple SHARP jobs concurrently.
No two SHARP jobs may share the same HCA.
There is no global limit on the number of jobs a tenant may launch. However, since each SHARP job requires at least 2 HCAs, and each HCA may only serve one job, the effective job limit per tenant is approximately half the number of available HCAs.
These default constraints ensure that in a non-blocking topology, no tenant can monopolize resources or degrade performance for others.
Administrators can override defaults by:
Setting a global job limit for all tenants via a configuration parameter.
Adjusting the number of SHARP jobs allowed per HCA, via a global configuration parameter.
These controls provide the flexibility to tailor SHARP behavior to specific cloud tenancy models and fairness policies.
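The default limit and the two overrides described above reduce to simple arithmetic, sketched below. The function and parameter names are hypothetical and do not correspond to actual sharp_am configuration parameters. The result is an upper bound on concurrent jobs for a tenant, assuming every job uses the minimum of 2 HCAs.

```python
from typing import Optional


def max_concurrent_jobs(
    num_hcas: int,
    min_hcas_per_job: int = 2,
    jobs_per_hca: int = 1,
    global_job_limit: Optional[int] = None,
) -> int:
    """Upper bound on the SHARP jobs a tenant can run at the same time."""
    if num_hcas < min_hcas_per_job:
        return 0  # not enough HCAs to form even one job
    # Each HCA offers `jobs_per_hca` slots; each job consumes at least
    # `min_hcas_per_job` slots on distinct HCAs.
    limit = (num_hcas * jobs_per_hca) // min_hcas_per_job
    if global_job_limit is not None:
        limit = min(limit, global_job_limit)
    return limit
```

With the defaults, a tenant holding 16 HCAs is bounded at 8 concurrent jobs, which matches the "approximately half" rule. Allowing 2 jobs per HCA raises the bound to 16, and a global job limit of 4 caps it at 4.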
SHARP can attach a compute node HCA to only one PKey at any given time.
A compute node HCA can have full membership in multiple PKeys, but SHARP will attach it to only one of them for SHARP jobs.
If an HCA has full membership in multiple PKeys, SHARP will arbitrarily select one PKey to attach it to, and will generate warning events about it.
In multi-tenant cloud environments, such configurations are uncommon, as HCAs typically have full membership in only one PKey.
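To find such configurations before SHARP reports them, an administrator could audit the partition table for HCAs with more than one full membership. The sketch below assumes the table is already available as a mapping from an HCA identifier to its 16-bit PKey values, and relies on the InfiniBand convention that the most significant bit of a PKey marks full membership. The function names and the input format are hypothetical, and the sketch does not special-case any partition, including the default one.

```python
from typing import Dict, Iterable, List

FULL_MEMBER_BIT = 0x8000  # InfiniBand: high bit set means full membership
PKEY_BASE_MASK = 0x7FFF   # low 15 bits identify the partition


def full_memberships(pkeys: Iterable[int]) -> List[int]:
    """Return the sorted partition numbers in which the HCA is a full member."""
    return sorted({p & PKEY_BASE_MASK for p in pkeys if p & FULL_MEMBER_BIT})


def ambiguous_hcas(table: Dict[str, Iterable[int]]) -> Dict[str, List[int]]:
    """Return the HCAs that are full members of more than one partition."""
    result = {}
    for hca, pkeys in table.items():
        full = full_memberships(pkeys)
        if len(full) > 1:
            result[hca] = full
    return result
```

For example, an HCA listed with PKeys `0x8001` and `0x8002` is reported with partitions 1 and 2, while an HCA with `0x8001` and the limited membership `0x0002` is not reported.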