Appendix - UFM SLURM Integration

NVIDIA UFM Enterprise User Manual v6.11.2

Simple Linux Utility for Resource Management (SLURM) is a job scheduler for Linux and Unix-like kernels.

By integrating SLURM with UFM, you can:

  • Assign partition keys (pkeys) to the SLURM nodes allocated to specific SLURM jobs.

  • Create SHARP reservations based on SLURM nodes assigned for specific SLURM jobs.

The integration has the following prerequisites:

  • UFM 6.9.0 (or newer) installed on RedHat 7.x

  • Python 2.7 on SLURM controller

  • UFM-SLURM integration files (provided independently)

A script is provided to install the UFM-SLURM integration automatically.

  1. On the SLURM controller, extract the UFM-SLURM integration tar file:

    tar -xf ufm_slurm_integration.tar.gz

  2. Run the installation script using root privileges.

    sudo ./install.sh

To install the UFM-SLURM integration manually:

  1. Extract the UFM-SLURM integration tar file:

    tar -xf ufm_slurm_integration.tar.gz

  2. Copy the UFM-SLURM integration files to the /etc/slurm folder on the SLURM controller.

  3. Change the permissions of the UFM-SLURM integration files to 755.

  4. Modify the SLURM configuration file on the SLURM controller, /etc/slurm/slurm.conf, and add/modify the following two parameters:

    PrologSlurmctld=/etc/slurm/ufm-prolog.sh
    EpilogSlurmctld=/etc/slurm/ufm-epilog.sh
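After editing slurm.conf, it can help to confirm that both hook parameters are present. The following is a minimal sketch and not part of the integration package: check_slurm_hooks is a hypothetical helper name, and the check assumes each parameter sits on its own uncommented line.

```shell
# Sketch: report whether a SLURM config file sets both controller hooks.
# check_slurm_hooks is a hypothetical helper, not shipped with the integration.
check_slurm_hooks() {
    conf="$1"
    status=0
    for key in PrologSlurmctld EpilogSlurmctld; do
        # Match "Key=value" at the start of a line (case-insensitive),
        # which also skips lines commented out with "#".
        if grep -Eqi "^[[:space:]]*${key}[[:space:]]*=" "$conf"; then
            echo "ok: ${key} is set"
        else
            echo "missing: ${key}"
            status=1
        fi
    done
    return "$status"
}

# Usage, with the path from the step above:
# check_slurm_hooks /etc/slurm/slurm.conf
```

The function returns a non-zero status when either parameter is missing, so it can be used as a guard in a larger script.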

The integration process uses a configuration file located at /etc/slurm/ufm_slurm.conf. This file is used to configure settings and attributes for UFM-SLURM integration.

The file supports the following attributes:

Attribute Name     Description
auth_type          Authentication method: token_auth or basic_auth. With basic_auth, set ufm_server_user and ufm_server_pass. With token_auth, set token_auth.
ufm_server_user    Username for connecting to the UFM server when auth_type=basic_auth
ufm_server_pass    Password of the UFM server user
token_auth         Generated token, set as token_auth=generated_token. To generate a token, see Generate token_auth.
ufm_server         IP address of the UFM server to connect to
log_file_name      Name of the integration log file
partially_aloc     Determines whether to allow partial allocation of nodes

Note

All of these attributes are mandatory.
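Because every attribute is mandatory, a quick presence check can catch an incomplete file before a job is submitted. The sketch below assumes the file holds one key=value pair per line, which the text suggests but does not state. missing_ufm_attrs is a hypothetical helper name.

```shell
# Sketch: print each attribute from the table above that is absent from a config file.
# Assumes key=value lines; missing_ufm_attrs is a hypothetical helper.
missing_ufm_attrs() {
    conf="$1"
    for key in auth_type ufm_server_user ufm_server_pass token_auth \
               ufm_server log_file_name partially_aloc; do
        # Allow optional spaces around "=" and ignore commented-out lines.
        grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf" || echo "$key"
    done
}

# Usage:
# missing_ufm_attrs /etc/slurm/ufm_slurm.conf
```

No output means all seven attribute names were found. The check does not validate the values themselves.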

To configure UFM for NVIDIA SHARP allocation/deallocation, you must set sharp_enabled and enable_sharp_allocation to true in the gv.cfg file.

Generate token_auth

If you set auth_type=token_auth in UFM SLURM’s config file, you must generate a new token by logging into the UFM server and running the following curl command:

curl -H "X-Remote-User:admin" -XPOST http://127.0.0.1:8000/app/tokens

Then copy the generated token into the configuration file as the value of the token_auth parameter.
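Pasting the token by hand is sufficient. For repeatable setups, the update can also be scripted. The sketch below assumes a key=value configuration file and takes the token as an argument, because the format of the curl response is not described here. set_token_auth is a hypothetical helper name.

```shell
# Sketch: write a token into the token_auth line of a key=value config file.
# set_token_auth is a hypothetical helper; the token is supplied by the caller.
set_token_auth() {
    conf="$1"
    token="$2"
    tmp="$(mktemp)" || return 1
    if grep -q '^token_auth=' "$conf"; then
        # Replace the existing value and leave every other line untouched.
        awk -v t="$token" '/^token_auth=/ { print "token_auth=" t; next } { print }' \
            "$conf" > "$tmp"
    else
        # No token_auth line yet: copy the file and append one.
        cat "$conf" > "$tmp"
        echo "token_auth=${token}" >> "$tmp"
    fi
    # Overwrite the contents in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Usage:
# set_token_auth /etc/slurm/ufm_slurm.conf "<generated token>"
```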

After submitting jobs on SLURM, there are two scripts that are automatically executed:

  • ufm-prolog.sh – the prolog script is executed when a job is submitted and before running the job itself. It creates the partition key (pkey) assignment and/or NVIDIA SHARP reservation and assigns the SLURM job hosts to them.

  • ufm-epilog.sh – the epilog script is executed when a job is complete. It removes the partition key (pkey) assignment and/or NVIDIA SHARP reservation and frees the associated SLURM job hosts.

The integration relies on scripts and configuration files that must be copied to /etc/slurm on the SLURM controller. These files are:

File Name              Description
ufm-prolog.sh          Bash file that runs the UFM-related tasks before the SLURM job is executed
ufm-epilog.sh          Bash file that runs the UFM-related tasks after the SLURM job is completed
ufm_slurm.conf         UFM-SLURM integration configuration file
ufm_slurm_prolog.py    Python script that creates the partition key (pkey) assignment and/or SHARP reservation when the prolog bash script runs
ufm_slurm_epilog.py    Python script that removes the partition key (pkey) assignment and/or SHARP reservation based on the SLURM job hosts
ufm_slurm_utils.py     Python file containing functions and utilities used by the integration process
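For a manual installation, the copy and permission steps can be combined into one small function. This is a sketch under stated assumptions: copy_integration_files is a hypothetical helper, the source directory is wherever the tar file was extracted, and the file names are taken from the list above.

```shell
# Sketch of the manual copy and chmod steps for the files listed above.
# copy_integration_files is a hypothetical helper; src is the extracted directory.
copy_integration_files() {
    src="$1"
    dest="${2:-/etc/slurm}"   # default target taken from the text above
    for f in ufm-prolog.sh ufm-epilog.sh ufm_slurm.conf \
             ufm_slurm_prolog.py ufm_slurm_epilog.py ufm_slurm_utils.py; do
        # Stop at the first missing file rather than installing a partial set.
        if [ ! -f "${src}/${f}" ]; then
            echo "missing: ${src}/${f}" >&2
            return 1
        fi
        cp "${src}/${f}" "${dest}/${f}" && chmod 755 "${dest}/${f}" || return 1
    done
}

# Usage (run with sufficient privileges to write to /etc/slurm):
# copy_integration_files ./extracted_dir /etc/slurm
```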

On the SLURM controller, execute the following command to run your batch job:

$ sbatch -N4 slurm_demo.sh
Submitted batch job 1

Note

-N4 requests four compute nodes for the job. slurm_demo.sh is the batch file to run.

The output is stored in the working directory in a file named slurm-{id}.out, where {id} is the ID of the submitted job.

In the above example, the sbatch command reports job ID 1, so the output is stored in slurm-1.out.

Execute the following command to see the output:

$ cat slurm-1.out
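When jobs are submitted from a script, the output file name can be derived from the line that sbatch prints instead of being typed by hand. The following is a minimal sketch. slurm_output_file is a hypothetical helper, and it assumes the "Submitted batch job N" message and the slurm-{id}.out naming shown above.

```shell
# Sketch: derive the output file name from the line that sbatch prints.
# slurm_output_file is a hypothetical helper that reads that line on stdin.
slurm_output_file() {
    # In "Submitted batch job 1", the fourth field is the job ID.
    awk '/^Submitted batch job [0-9]+/ { print "slurm-" $4 ".out"; exit }'
}

# Usage, with the command from the example above:
# out_file="$(sbatch -N4 slurm_demo.sh | slurm_output_file)"
# cat "$out_file"
```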

On the UFM side, a partition key (PKey) is assigned to the hosts allocated to the SLURM job if one is configured in the ufm_slurm.conf file. Otherwise, the default management PKey is used.

In addition, the UFM-SLURM integration automatically creates a SHARP AM reservation if UFM SHARP and UFM SHARP Allocation are enabled in UFM.

After the SLURM job is completed, the UFM removes the job-related partition key (pkey) assignment and SHARP reservation.

From the moment a job is submitted by the SLURM server until its completion, a log file named /tmp/ufm_slurm.log logs all of the actions and errors that occurred during the execution.

The log file location can be changed by modifying the log_file_name parameter in /etc/slurm/ufm_slurm.conf.
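When troubleshooting, it is convenient to read the log path from the configuration file instead of assuming it. The sketch below assumes key=value lines and falls back to /tmp/ufm_slurm.log when the parameter is not found. Both the fallback behavior and the helper name ufm_log_path are assumptions made for illustration.

```shell
# Sketch: print the configured log path, falling back to /tmp/ufm_slurm.log.
# Assumes key=value lines; ufm_log_path is a hypothetical helper.
ufm_log_path() {
    conf="$1"
    # Strip the "log_file_name =" prefix; if the key repeats, keep the last value.
    path="$(sed -n 's/^[[:space:]]*log_file_name[[:space:]]*=[[:space:]]*//p' "$conf" | tail -n 1)"
    echo "${path:-/tmp/ufm_slurm.log}"
}

# Usage: show the most recent log entries.
# tail -n 50 "$(ufm_log_path /etc/slurm/ufm_slurm.conf)"
```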

© Copyright 2024, NVIDIA. Last updated on Jul 4, 2024.