NVIDIA UFM Enterprise User Manual v6.11.2

Appendix - UFM SLURM Integration

Simple Linux Utility for Resource Management (SLURM) is a job scheduler for Linux and Unix-like kernels.

By integrating SLURM with UFM, you can:

Assign partition keys (pkeys) to the SLURM nodes that are allocated to specific SLURM jobs.
Create SHARP reservations based on SLURM nodes assigned for specific SLURM jobs.

Prerequisites

UFM 6.9.0 (or newer) installed on a RedHat 7.x machine
Python 2.7 on SLURM controller
UFM-SLURM integration files (provided independently)

Automatic Installation

A script is provided to install the UFM-SLURM integration automatically.

On the SLURM controller, extract the UFM-SLURM integration tar file:

            tar -xf ufm_slurm_integration.tar.gz

Run the installation script using root privileges.

            sudo ./install.sh

Manual Installation

To install the UFM-SLURM integration manually:

Extract the UFM-SLURM integration tar file:

            tar -xf ufm_slurm_integration.tar.gz

Copy the UFM-SLURM integration files to the SLURM controller folder.
Change the permissions of the UFM-SLURM integration files to 755.

Modify the SLURM configuration file on the SLURM controller, /etc/slurm/slurm.conf, and add/modify the following two parameters:

            PrologSlurmctld=/etc/slurm/ufm-prolog.sh
            EpilogSlurmctld=/etc/slurm/ufm-epilog.sh
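
The manual steps above could be scripted roughly as follows. This is a minimal Python sketch, not part of the shipped integration: the function names install_files and missing_params and the src_dir argument (wherever the tar file was extracted) are hypothetical, and the file names are taken from the Integration Files list in this appendix. It avoids Python 3-only syntax so that it should also run on the Python 2.7 named in the prerequisites.

```python
import os
import shutil

# File names taken from the Integration Files list in this appendix.
INTEGRATION_FILES = (
    "ufm-prolog.sh",
    "ufm-epilog.sh",
    "ufm_slurm.conf",
    "ufm_slurm_prolog.py",
    "ufm_slurm_epilog.py",
    "ufm_slurm_utils.py",
)

# The two slurm.conf parameters this appendix asks you to add or modify.
REQUIRED_PARAMS = {
    "PrologSlurmctld": "/etc/slurm/ufm-prolog.sh",
    "EpilogSlurmctld": "/etc/slurm/ufm-epilog.sh",
}


def install_files(src_dir, dest_dir="/etc/slurm"):
    """Copy each integration file into dest_dir and set its mode to 755."""
    installed = []
    for name in INTEGRATION_FILES:
        target = os.path.join(dest_dir, name)
        shutil.copy(os.path.join(src_dir, name), target)
        os.chmod(target, 0o755)
        installed.append(target)
    return installed


def missing_params(slurm_conf_text):
    """Return the required parameters that are not set to the expected path."""
    found = {}
    for raw in slurm_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" in line:
            key, value = line.split("=", 1)
            # Keys are lowercased so that differences in case are tolerated.
            found[key.strip().lower()] = value.strip()
    return sorted(
        name for name, path in REQUIRED_PARAMS.items()
        if found.get(name.lower()) != path
    )
```

Run as root, install_files would copy the files into /etc/slurm; missing_params returns an empty list once both parameters in slurm.conf point at the prolog and epilog scripts.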

UFM SLURM Config File

The integration process uses a configuration file located at /etc/slurm/ufm_slurm.conf. This file is used to configure settings and attributes for UFM-SLURM integration.

Here are the contents:

Attribute Name	Description
auth_type	Must be `token_auth` or `basic_auth`. If you select `basic_auth`, you must set `ufm_server_user` and `ufm_server_pass`. If you select `token_auth`, you must set `token_auth`.
ufm_server_user	Username of UFM server used to connect to UFM if you set `auth_type=basic_auth`
ufm_server_pass	UFM server user password
token_auth	Set to the generated token (`token_auth=generated_token`). For details on how to generate the token, see the Generate token_auth section.
ufm_server	IP of UFM server to connect to
log_file_name	Name of integration logging file
partially_aloc	Determines whether partial allocation of nodes is allowed

Note

All of these attributes are mandatory.
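
As an illustration of the attributes above, the following sketch checks a configuration for missing values. It assumes that ufm_slurm.conf uses plain key=value lines (as suggested by `auth_type=basic_auth` in the table) with `#` comments, and it follows the per-auth-type rule stated in the auth_type row. The function names, the example IP address, the password, and the `true` value for partially_aloc are all hypothetical placeholders, not documented values.

```python
# Hypothetical helper: not part of the shipped integration files.
# Assumes ufm_slurm.conf consists of key=value lines and '#' comments.

COMMON_KEYS = ("auth_type", "ufm_server", "log_file_name", "partially_aloc")
AUTH_KEYS = {
    "basic_auth": ("ufm_server_user", "ufm_server_pass"),
    "token_auth": ("token_auth",),
}


def parse_conf(text):
    """Parse key=value lines into a dict, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return the attribute names that are absent or empty."""
    auth_type = settings.get("auth_type", "")
    if auth_type not in AUTH_KEYS:
        # Without a valid auth_type the remaining checks are meaningless.
        return ["auth_type"]
    required = COMMON_KEYS + AUTH_KEYS[auth_type]
    return [key for key in required if not settings.get(key)]


# Placeholder values only; replace them with your own settings.
EXAMPLE = """
auth_type=basic_auth
ufm_server_user=admin
ufm_server_pass=example-password
ufm_server=192.0.2.10
log_file_name=/tmp/ufm_slurm.log
partially_aloc=true
"""

if __name__ == "__main__":
    print(missing_keys(parse_conf(EXAMPLE)))
```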

Configuring UFM for NVIDIA SHARP Allocation

To configure UFM for NVIDIA SHARP allocation and deallocation, you must set sharp_enabled and enable_sharp_allocation to true in the gv.cfg file.

Generate token_auth

If you set auth_type=token_auth in UFM SLURM’s config file, you must generate a new token by logging into the UFM server and running the following curl command:

            curl -H "X-Remote-User:admin" -XPOST http://127.0.0.1:8000/app/tokens

Then you must copy the generated token and paste it into the config file beside the token_auth parameter.

Prolog and Epilog

After submitting jobs on SLURM, there are two scripts that are automatically executed:

ufm-prolog.sh – the prolog script is executed after a job is submitted and before the job itself runs. It creates the partition key (pkey) assignment and/or NVIDIA SHARP reservation and assigns the SLURM job hosts to them.
ufm-epilog.sh – the epilog script is executed when a job is complete. It removes the partition key (pkey) assignment and/or NVIDIA SHARP reservation and frees the associated SLURM job hosts.
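
To make the prolog and epilog steps more concrete, here is a hedged sketch of how a script run by slurmctld could find out which hosts belong to a job. SLURM exports SLURM_JOB_ID and SLURM_JOB_NODELIST to PrologSlurmctld and EpilogSlurmctld programs, and `scontrol show hostnames` expands a compressed node list. How the shipped ufm_slurm_prolog.py actually obtains the hosts is not documented here, so the function names below are hypothetical, and the UFM requests that create or remove the pkey assignment and SHARP reservation are left out.

```python
import os
import subprocess


def job_context(environ=None):
    """Read the job ID and compressed node list exported by slurmctld."""
    environ = os.environ if environ is None else environ
    return environ.get("SLURM_JOB_ID"), environ.get("SLURM_JOB_NODELIST")


def expand_hosts(nodelist, run=subprocess.check_output):
    """Expand a compressed node list into host names using scontrol.

    The run argument exists only so the command can be replaced in tests.
    """
    output = run(["scontrol", "show", "hostnames", nodelist])
    if isinstance(output, bytes):
        output = output.decode("utf-8")
    return [host for host in output.splitlines() if host]


if __name__ == "__main__":
    job_id, nodelist = job_context()
    if job_id and nodelist:
        print(job_id, expand_hosts(nodelist))
```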

Integration Files

The integration uses scripts and configuration files, which should be copied to /etc/slurm on the SLURM controller. Here is a list of these files:

File Name	Description
ufm-prolog.sh	Bash file which executes jobs related to UFM before the SLURM job is executed
ufm-epilog.sh	Bash file which executes jobs related to UFM after the SLURM job is completed
ufm_slurm.conf	UFM-SLURM integration configuration file
ufm_slurm_prolog.py	Python script file which creates the partition key (pkey) assignment and/or SHARP reservation when the prolog bash script is running
ufm_slurm_epilog.py	Python script file which removes partition key (pkey) assignment and/or SHARP reservation based on the SLURM job hosts.
ufm_slurm_utils.py	Utility Python file containing functions and utilities used by the integration process

Running UFM-SLURM Integration

On the SLURM controller, run the following command to submit your batch job:

            $ sbatch -N4 slurm_demo.sh
Submitted batch job 1

Note

-N4 sets the number of compute nodes used to run the job to 4. slurm_demo.sh is the job batch file to be run.

The output and result are stored in the working directory, in a file named slurm-{id}.out, where {id} is the ID of the submitted job.

In the above example, the sbatch command reports that the submitted job ID is 1. Therefore, the output file is slurm-1.out.
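
If you submit jobs from a script, the output file name can be derived from the sbatch message. The sketch below assumes the message has the form shown above ("Submitted batch job 1") and that the default output file name has not been overridden with the sbatch --output option; the function name is hypothetical.

```python
import re


def output_file_for(sbatch_output):
    """Derive the default output file name from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("no job ID found in sbatch output")
    return "slurm-{0}.out".format(match.group(1))


if __name__ == "__main__":
    print(output_file_for("Submitted batch job 1"))  # slurm-1.out
```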

Execute the following command to see the output:

            $ cat slurm-1.out

On the UFM side, the hosts allocated to the SLURM job are assigned to a partition key (pkey) if one is configured in the ufm_slurm.conf file; otherwise, the default management pkey is used.

In addition, the UFM-SLURM integration automatically creates a SHARP AM reservation if UFM SHARP and UFM SHARP allocation are enabled in UFM.

After the SLURM job is completed, the UFM removes the job-related partition key (pkey) assignment and SHARP reservation.

From the moment a job is submitted by the SLURM server until its completion, a log file named /tmp/ufm_slurm.log records all of the actions and errors that occur during the execution.

The log file can be changed by modifying the log_file_name parameter in /etc/slurm/ufm_slurm.conf.
