Integrating with Slurm
Overview
Integration with HPC workload managers such as Slurm can be achieved through a scheduler prolog/epilog configuration. Each HPC job is represented in DPS as a resource group. The prolog creates the resource group with the corresponding DPS policies, such as power level and WPPS. The epilog deletes the resource group and restores any default policies. If DPS detects insufficient power to start the job at the requested settings, it may make adjustments or return a failure.
Example Slurm Integration
You can use the PrologSlurmctld and EpilogSlurmctld parameters in the slurm.conf file.
With this configuration, a job is re-queued if the DPS prolog fails (returns a non-zero exit code). While this failure can have various causes, in general it means DPS was unable to configure a corresponding resource group with the requested power settings, and the job should not run.
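Because slurmctld keys the requeue decision off the prolog's exit status, the prolog script should abort with a non-zero status as soon as any of its commands fails. A minimal sketch of the pattern (the script path and the failing command are placeholders, not part of the DPS scripts):

```shell
# Write a tiny prolog-style script; with "set -e", any failing
# command (e.g. a dpsctl call) aborts it with a non-zero status,
# which is what triggers the requeue.
cat > /tmp/demo-prolog.sh <<'EOF'
#!/bin/bash
set -e
false            # stands in for a failing dpsctl call
echo "not reached"
EOF
chmod +x /tmp/demo-prolog.sh

/tmp/demo-prolog.sh || status=$?
echo "prolog exit status: ${status}"   # -> prolog exit status: 1
```

Without `set -e`, only the last command's status would determine the script's exit code, and a failed `dpsctl` call earlier in the prolog would not requeue the job.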
#slurm.conf
PrologSlurmctld=/usr/share/dps/prolog.sh
EpilogSlurmctld=/usr/share/dps/epilog.sh

Example Prolog Script
#!/bin/bash
# SLURM Prolog - Create DPS resource group for power management.
# Exit non-zero on any failure so slurmctld re-queues the job.
set -e

JOB_NAME=${SLURM_JOB_NAME}
JOB_ID=${SLURM_JOB_ID}
NODES=${SLURM_JOB_NODELIST}
# Create resource group
dpsctl resource-group create \
--resource-group "${JOB_NAME}" \
--external-id "${JOB_ID}" \
--policy "Node-High"
# Add allocated nodes
dpsctl resource-group add \
--resource-group "${JOB_NAME}" \
--entities "${NODES}"
# Activate power policies
dpsctl resource-group activate \
--resource-group "${JOB_NAME}"

Example Epilog Script
#!/bin/bash
# SLURM Epilog - Clean up resource group
JOB_NAME=${SLURM_JOB_NAME}
# Delete resource group (automatically deactivates)
dpsctl resource-group delete \
--resource-group "${JOB_NAME}"

Troubleshooting
See the Slurm documentation for information on Prolog and Epilog.
The DPS prolog/epilog scripts log to STDOUT by default.
Ensure that dpsctl has been properly configured for authentication and authorization.
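When debugging, it can also help to exercise the dpsctl invocations outside of slurmctld. One way (a sketch; the stub and the job values are made up for illustration) is to put a stub dpsctl on PATH and export the variables Slurm would normally provide:

```shell
# Stub dpsctl so the invocation can be checked without a live
# DPS endpoint -- the stub just echoes its arguments.
stubdir=$(mktemp -d)
printf '#!/bin/bash\necho "dpsctl $*"\n' > "${stubdir}/dpsctl"
chmod +x "${stubdir}/dpsctl"
export PATH="${stubdir}:${PATH}"

# Values slurmctld would normally provide to the prolog:
export SLURM_JOB_NAME=test-job
export SLURM_JOB_ID=12345

dpsctl resource-group create \
  --resource-group "${SLURM_JOB_NAME}" \
  --external-id "${SLURM_JOB_ID}" \
  --policy "Node-High"
# -> dpsctl resource-group create --resource-group test-job --external-id 12345 --policy Node-High
```

Once the stubbed invocation looks right, run the same command with the real, authenticated dpsctl to confirm the resource group is created in DPS.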