Running Mellanox SHARPD Daemon in Managed Mode

In managed mode, the daemon expects communication from the prolog/epilog scripts of the Job Scheduler (JS). These scripts should invoke the “sharp_job_quota” executable to communicate with Mellanox SHARP.

To run SHARPD in managed mode, use the “mgmt_mode” option (default: 0 – run in “unmanaged” mode).

The JS can set or remove an upper limit on the Mellanox SHARP resources (e.g., OSTs and groups) allowed for a particular user or job by running sharp_job_quota with the “set” and “remove” operations.

Usage 

sharp_job_quota [OPTIONS]

sharp_job_quota options

| Option | Required/Optional | Arguments | Description |
| --- | --- | --- | --- |
| -t, --operation | Required | set / remove | Sets or removes the quota. |
| -i, --allocation-id | Required | Unique numeric 64-bit ID | Scheduler ID of the job. No two jobs in the system may have the same ID at the same time. |
| -u, --uid | Optional | Numeric | UID of the user allowed to run the job. |
| -n, --user_name | Optional | String | Name of the user allowed to run the job. |
| --coll_job_quota_max_groups | Optional | Numeric value: 0..256 | Maximum number of Mellanox SHARP groups (communicators) allowed. Default value: 0. 0 means there is no limit for the job; it can request any number. |
| --coll_job_quota_max_qps_per_port | Optional | Numeric value: 0..256 | Maximum QPs per port allowed. Default value: 0. 0 means there is no limit for the job; it can request any number. |
| --coll_job_quota_max_payload_per_ost | Optional | Numeric value: 0..1024 | Maximum payload per OST allowed. Default value: 1024. |
| --coll_job_quota_max_osts | Optional | Numeric value: 0..512 | Maximum number of OSTs allowed for the job per collective operation. Default value: 0. 0 means there is no limit for the job; it can request any number. |
| --coll_job_quota_max_num_trees | Optional | Numeric value: 0..4 | Maximum number of trees allowed for the job. |
| --job_priority | Optional | Numeric value: 0..9 | Priority of the job. |
| --coll_job_quota_percentage | Optional | Numeric value: 0..100 | Percentage of resources to request for the job. |
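The numeric ranges listed for the quota options can be checked in the prolog script before sharp_job_quota is invoked. The following is a minimal sketch, not part of Mellanox SHARP: the function name quota_in_range is hypothetical, and the upper bounds are copied from the option descriptions above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Mellanox SHARP).
# Usage: quota_in_range OPTION_NAME VALUE
# Returns 0 if VALUE is a non-negative integer within the range documented
# for OPTION_NAME (given without the leading "--"), and 1 otherwise.
quota_in_range() {
    qr_opt=$1
    qr_val=$2

    # Reject empty strings and anything that is not made of digits only.
    case $qr_val in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Upper bounds taken from the options table; every lower bound is 0.
    case $qr_opt in
        coll_job_quota_max_groups|coll_job_quota_max_qps_per_port) qr_max=256 ;;
        coll_job_quota_max_payload_per_ost) qr_max=1024 ;;
        coll_job_quota_max_osts) qr_max=512 ;;
        coll_job_quota_max_num_trees) qr_max=4 ;;
        job_priority) qr_max=9 ;;
        coll_job_quota_percentage) qr_max=100 ;;
        *) return 1 ;;   # unknown option name
    esac

    [ "$qr_val" -le "$qr_max" ]
}
```

A prolog script could call, for example, quota_in_range coll_job_quota_max_groups 10 and skip or log the request when the function returns a non-zero status.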

Important Notes

  • The executable must run as the same user as the SD (root).
  • When using the “set” operation, either the uid or the user_name must be provided.
  • Regardless of the job quota set in the prolog, the AM can allocate fewer resources than requested or decline the request.
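The first two notes can be enforced by a small wrapper before the real command runs. The sketch below is illustrative only: running_as_root and build_set_cmd are hypothetical names, and the long option spelling (--allocation_id) follows the Examples section of this page. The function prints the command line rather than executing it, so the caller decides when to run it.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Mellanox SHARP).

# Returns 0 when the current process runs as root (effective UID 0).
running_as_root() {
    [ "$(id -u)" -eq 0 ]
}

# Usage: build_set_cmd ALLOCATION_ID UID USER_NAME
# Prints a sharp_job_quota "set" command line. UID or USER_NAME may be
# empty, but not both, because "set" needs at least one of them.
build_set_cmd() {
    bs_alloc=$1
    bs_uid=$2
    bs_user=$3

    if [ -z "$bs_alloc" ]; then
        echo "error: an allocation id is required" >&2
        return 1
    fi
    if [ -z "$bs_uid" ] && [ -z "$bs_user" ]; then
        echo "error: set requires a uid or a user_name" >&2
        return 1
    fi

    bs_cmd="sharp_job_quota --operation set --allocation_id $bs_alloc"
    if [ -n "$bs_uid" ]; then
        bs_cmd="$bs_cmd --uid $bs_uid"
    fi
    if [ -n "$bs_user" ]; then
        bs_cmd="$bs_cmd --user_name $bs_user"
    fi
    echo "$bs_cmd"
}
```

A prolog script might first test running_as_root and then execute the printed command. The third note still applies: a successful call does not guarantee that the AM grants the requested resources.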

Examples

# sharp_job_quota --operation set --user_name jobrunner --allocation_id 2017 --coll_job_quota_max_groups 10
# sharp_job_quota --operation remove --allocation_id 2017

SLURM Examples 

# sharp_job_quota --operation set --uid $SLURM_JOB_UID --allocation_id $SLURM_JOB_ID
# sharp_job_quota --operation remove --allocation_id $SLURM_JOB_ID
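The two SLURM commands above could be wrapped in functions shared by the prolog and epilog scripts. This is a sketch under assumptions: sharp_prolog, sharp_epilog and the SHARP_JOB_QUOTA override variable are hypothetical names introduced here, and the script relies on the scheduler exporting SLURM_JOB_ID and SLURM_JOB_UID as in the commands above.

```shell
#!/bin/sh
# Hypothetical helpers for SLURM prolog/epilog scripts.
# SHARP_JOB_QUOTA is an illustrative override, e.g. SHARP_JOB_QUOTA=echo
# for a dry run; by default the real executable is looked up in PATH.
SHARP_JOB_QUOTA=${SHARP_JOB_QUOTA:-sharp_job_quota}

# Prolog side: set the quota for the job that is about to start.
sharp_prolog() {
    if [ -z "$SLURM_JOB_ID" ] || [ -z "$SLURM_JOB_UID" ]; then
        echo "error: SLURM_JOB_ID and SLURM_JOB_UID must be set" >&2
        return 1
    fi
    "$SHARP_JOB_QUOTA" --operation set --uid "$SLURM_JOB_UID" \
        --allocation_id "$SLURM_JOB_ID"
}

# Epilog side: remove the quota once the job has finished.
sharp_epilog() {
    if [ -z "$SLURM_JOB_ID" ]; then
        echo "error: SLURM_JOB_ID must be set" >&2
        return 1
    fi
    "$SHARP_JOB_QUOTA" --operation remove --allocation_id "$SLURM_JOB_ID"
}
```

The prolog script would source this file and call sharp_prolog, and the epilog script would call sharp_epilog, so that every quota set for a job is removed under the same allocation ID.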