Appendix - UFM SLURM Integration
Simple Linux Utility for Resource Management (SLURM) is a job scheduler for Linux and Unix-like kernels.
By integrating SLURM with UFM, you can:
Assign partition keys (pkeys) to the SLURM nodes allocated to specific SLURM jobs.
Create SHARP reservations based on SLURM nodes assigned for specific SLURM jobs.
UFM 6.9.0 (or newer) installed on RedHat 7.x
Python 2.7 on SLURM controller
UFM-SLURM integration files (provided independently)
A script is provided to install the UFM-SLURM integration automatically.
On the SLURM controller, extract the UFM-SLURM integration tar file:
tar -xf ufm_slurm_integration.tar.gz
Run the installation script with root privileges:
sudo ./install.sh
To install the UFM-SLURM integration manually:
Extract the UFM-SLURM integration tar file:
tar -xf ufm_slurm_integration.tar.gz
Copy the UFM-SLURM integration files to the SLURM controller folder.
Change the permissions of the UFM-SLURM integration files to 755.
Modify the SLURM configuration file on the SLURM controller, /etc/slurm/slurm.conf, and add/modify the following two parameters:
PrologSlurmctld=/etc/slurm/ufm-prolog.sh
EpilogSlurmctld=/etc/slurm/ufm-epilog.sh
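The manual steps above (copy the files, set mode 755, update slurm.conf) could be scripted roughly as follows. This is a minimal Python 3 sketch, not part of the UFM-SLURM package: the helper names install_files and set_slurm_params are hypothetical, and the sketch assumes the extracted files sit in a single source directory and that slurm.conf holds one Key=Value entry per line.

```python
import os
import re
import shutil

# File names taken from the integration file list in this appendix.
INTEGRATION_FILES = [
    "ufm-prolog.sh", "ufm-epilog.sh", "ufm_slurm.conf",
    "ufm_slurm_prolog.py", "ufm_slurm_epilog.py", "ufm_slurm_utils.py",
]

# The two slurm.conf parameters named in the manual installation steps.
UFM_HOOKS = {
    "PrologSlurmctld": "/etc/slurm/ufm-prolog.sh",
    "EpilogSlurmctld": "/etc/slurm/ufm-epilog.sh",
}


def install_files(src_dir, dest_dir, names=INTEGRATION_FILES):
    """Copy each integration file into dest_dir and set its mode to 755."""
    for name in names:
        target = os.path.join(dest_dir, name)
        shutil.copy(os.path.join(src_dir, name), target)
        os.chmod(target, 0o755)


def set_slurm_params(conf_text, params):
    """Return conf_text with every key in params set to its value.

    An existing uncommented entry is replaced in place; a missing one is
    appended. Keys are matched case-insensitively.
    """
    lines = conf_text.splitlines()
    for key, value in params.items():
        new_line = "%s=%s" % (key, value)
        pattern = re.compile(r"^\s*%s\s*=" % re.escape(key), re.IGNORECASE)
        for i, line in enumerate(lines):
            if pattern.match(line):
                lines[i] = new_line
                break
        else:
            lines.append(new_line)
    return "\n".join(lines) + "\n"
```

To apply it, call install_files with the extraction directory and /etc/slurm, then read /etc/slurm/slurm.conf, pass its text with UFM_HOOKS to set_slurm_params, and write the result back. Both steps need root privileges on the SLURM controller.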
The integration process uses a configuration file located at /etc/slurm/ufm_slurm.conf. This file is used to configure settings and attributes for UFM-SLURM integration.
Here are the contents:
Attribute Name | Description
auth_type | Must be token_auth or basic_auth. If you select basic_auth, you must set ufm_server_user and ufm_server_pass. If you select token_auth, you must set token_auth.
ufm_server_user | Username used to connect to the UFM server if auth_type=basic_auth
ufm_server_pass | Password of the UFM server user, used if auth_type=basic_auth
token_auth | The generated token, set as token_auth=generated_token. For details on how to generate a token, see Generate token_auth.
ufm_server | IP address of the UFM server to connect to
log_file_name | Name of the integration log file
partially_aloc | Determines whether partial allocation of nodes is allowed
All of these attributes are mandatory.
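The dependency between auth_type and the credential attributes can be checked before a job is submitted. The following is a minimal Python 3 sketch, assuming that ufm_slurm.conf uses plain key=value lines with # comments (the exact file syntax is not specified here); parse_conf and check_auth are hypothetical helpers, not functions from ufm_slurm_utils.py.

```python
def parse_conf(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_auth(settings):
    """Return a list of problems with the authentication settings."""
    auth_type = settings.get("auth_type")
    if auth_type == "basic_auth":
        required = ["ufm_server_user", "ufm_server_pass"]
    elif auth_type == "token_auth":
        required = ["token_auth"]
    else:
        return ["auth_type must be token_auth or basic_auth"]
    # Report every credential attribute that is missing or empty.
    return [
        "%s is required when auth_type=%s" % (key, auth_type)
        for key in required
        if not settings.get(key)
    ]
```

An empty list means the authentication settings are consistent; the other attributes in the table (ufm_server, log_file_name, partially_aloc) would need their own checks.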
To configure UFM for NVIDIA SHARP allocation/deallocation, you must set sharp_enabled and enable_sharp_allocation to true in the gv.cfg file.
Generate token_auth
If you set auth_type=token_auth in UFM SLURM’s config file, you must generate a new token by logging into the UFM server and running the following curl command:
curl -H "X-Remote-User:admin" -XPOST http://127.0.0.1:8000/app/tokens
Then you must copy the generated token and paste it into the config file beside the token_auth parameter.
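The same request can be issued from Python if curl is not convenient. This is a minimal Python 3 sketch that mirrors the curl command above using only the standard library; the function names are hypothetical, and because the response format is not described here, the sketch returns the raw response body without parsing it.

```python
import urllib.request

# URL and header value taken from the curl command above.
TOKEN_URL = "http://127.0.0.1:8000/app/tokens"


def build_token_request(url=TOKEN_URL, user="admin"):
    """Build the POST request with the X-Remote-User header; sends nothing."""
    return urllib.request.Request(
        url, method="POST", headers={"X-Remote-User": user}
    )


def request_token_response(timeout=10):
    """Send the request and return the raw response body as text.

    Must be run on the UFM server itself, as with the curl command.
    """
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Run request_token_response() on the UFM server, then copy the token from the returned text into the token_auth parameter.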
After submitting jobs on SLURM, there are two scripts that are automatically executed:
ufm-prolog.sh – the prolog script is executed after a job is submitted and before the job itself runs. It creates the partition key (pkey) assignment and/or NVIDIA SHARP reservation and assigns the SLURM job hosts to them.
ufm-epilog.sh – the epilog script is executed when a job is complete. It removes the partition key (pkey) assignment and/or NVIDIA SHARP reservation and frees the associated SLURM job hosts.
The integration relies on scripts and configuration files that must be copied to /etc/slurm on the SLURM controller. These files are:
File Name | Description
ufm-prolog.sh | Bash file which executes UFM-related tasks before the SLURM job is executed
ufm-epilog.sh | Bash file which executes UFM-related tasks after the SLURM job is completed
ufm_slurm.conf | UFM-SLURM integration configuration file
ufm_slurm_prolog.py | Python script file which creates the partition key (pkey) assignment and/or SHARP reservation when the prolog bash script is running
ufm_slurm_epilog.py | Python script file which removes the partition key (pkey) assignment and/or SHARP reservation based on the SLURM job hosts
ufm_slurm_utils.py | Utility Python file containing functions and utilities used by the integration process
On the SLURM controller, run the following command to submit your batch job:
$ sbatch -N4 slurm_demo.sh
Submitted batch job 1
-N4 sets the number of compute nodes used to run the job to 4. slurm_demo.sh is the batch file to be run.
The output is stored in the working directory in a file named slurm-{id}.out, where {id} is the ID of the submitted job.
In the above example, after executing sbatch command, you can see that the submitted job ID is 1. Therefore, the output file would be stored in slurm-1.out.
Execute the following command to see the output:
$ cat slurm-1.out
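When jobs are submitted from a script, the output file name can be derived from the sbatch message instead of being typed by hand. This is a minimal Python 3 sketch, assuming the message has the form shown in the example above and the default slurm-{id}.out naming; both helper names are hypothetical.

```python
import re


def job_id_from_sbatch(output):
    """Extract the numeric job ID from an sbatch submission message."""
    match = re.search(r"Submitted batch job (\d+)", output)
    if match is None:
        raise ValueError("no job ID found in: %r" % output)
    return int(match.group(1))


def output_file_for(job_id):
    """Return the default output file name for a job ID."""
    return "slurm-%d.out" % job_id


# Using the message from the example above:
print(output_file_for(job_id_from_sbatch("Submitted batch job 1")))  # prints slurm-1.out
```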
On the UFM side, if a partition key (pkey) is configured in the ufm_slurm.conf file, it is assigned to all hosts allocated to the SLURM job; otherwise, the default management pkey is used.
In addition, the UFM-SLURM integration automatically creates a SHARP AM reservation if UFM SHARP and UFM SHARP Allocation are enabled in UFM.
After the SLURM job is completed, the UFM removes the job-related partition key (pkey) assignment and SHARP reservation.
From the moment a job is submitted by the SLURM server until its completion, a log file named /tmp/ufm_slurm.log logs all of the actions and errors that occurred during the execution.
The log file location can be changed by modifying the log_file_name parameter in /etc/slurm/ufm_slurm.conf.