Slurm Executor#
The Slurm executor runs evaluations on high‑performance computing (HPC) clusters managed by Slurm, an open‑source workload manager widely used in research and enterprise environments. It schedules and executes jobs across cluster nodes, enabling parallel, large‑scale evaluation runs while preserving reproducibility via containerized benchmarks.
See common concepts and commands in Executors.
Slurm can optionally host your model for the scope of an evaluation by deploying a serving container on the cluster and pointing the benchmark to that temporary endpoint. In this mode, two containers are used: one for the evaluation harness and one for the model server. The evaluation configuration includes a deployment section when this is enabled. See the examples in the examples/ directory for ready‑to‑use configurations.
If you do not require deployment on Slurm, simply omit the deployment section from your configuration and set the model’s endpoint URL directly (any OpenAI‑compatible endpoint that you host elsewhere).
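As an illustration, a configuration that skips on-cluster deployment might look like the following sketch. The `target.api_endpoint` field names here are hypothetical placeholders for wherever your configuration schema takes the endpoint URL, not guaranteed schema:

```yaml
execution:
  hostname: your-cluster-headnode
  output_dir: /shared/scratch/your_username/eval_results
  # no deployment section: the model is hosted elsewhere

# Hypothetical endpoint section: point the benchmark at an
# OpenAI-compatible server you already run outside the cluster.
target:
  api_endpoint:
    url: https://your-server.example.com/v1/chat/completions
    model_id: your-served-model-name
```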
Prerequisites#
Access to a Slurm cluster (with appropriate partitions/queues)
Pyxis SPANK plugin installed on the cluster
Configuration Overview#
Connecting to Your Slurm Cluster#
To run evaluations on Slurm, specify how to connect to your cluster:
execution:
  hostname: your-cluster-headnode  # Slurm headnode (login node)
  username: your_username          # Cluster username (defaults to $USER env var)
  account: your_allocation         # Slurm account or project name
  output_dir: /shared/scratch/your_username/eval_results  # Absolute, shared path
Note
When specifying the parameters, make sure to provide:
hostname: the Slurm headnode (login node) where you normally SSH to submit jobs.
output_dir: an absolute path on a shared filesystem (e.g., /shared/scratch/ or /home/) accessible to both the headnode and compute nodes.
Model Deployment Options#
When deploying models on Slurm, you have two options for specifying your model source:
Option 1: HuggingFace Models (Recommended - Automatic Download)#
Use a valid Hugging Face model ID for hf_model_handle (for example, meta-llama/Llama-3.1-8B-Instruct). Browse model IDs at Hugging Face Models.
deployment:
  checkpoint_path: null  # Set to null when using hf_model_handle
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct  # HuggingFace model ID
Benefits:
Model is automatically downloaded during deployment
No need to pre-download or manage model files
Works with any HuggingFace model (public or private with valid access tokens)
Requirements:
Set the HF_TOKEN environment variable if accessing gated models
Internet access from compute nodes (or the model cached locally)
Option 2: Local Model Files (Manual Setup Required)#
If you work with a checkpoint stored locally on the cluster, use checkpoint_path:
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b-instruct  # model directory accessible to compute nodes
  # Do NOT set hf_model_handle when using checkpoint_path
Note:
The directory must exist, be accessible from compute nodes, and contain the model files
Slurm does not automatically download models when using checkpoint_path
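A quick pre-flight check along these lines can catch a bad checkpoint_path before you submit. This is an illustrative sketch, not part of the executor; the path and file names (config.json, model.safetensors) are stand-ins for whatever your checkpoint actually contains:

```shell
# Illustrative pre-flight check: verify that a local checkpoint directory
# exists and contains the expected model files before pointing
# checkpoint_path at it. /tmp/demo-checkpoint stands in for a real
# shared path such as /shared/models/llama-3.1-8b-instruct.
CKPT=/tmp/demo-checkpoint
mkdir -p "$CKPT" && touch "$CKPT/config.json" "$CKPT/model.safetensors"

if [ -d "$CKPT" ] && [ -f "$CKPT/config.json" ]; then
  echo "checkpoint looks usable"
else
  echo "missing checkpoint files" >&2
fi
```

Run the same check over SSH on the headnode so you test visibility from the cluster side, not from your workstation.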
Environment Variables and Secrets#
Environment variables use the unified prefix syntax ($host:, $lit:, $runtime:) described in Environment Variables. Declare them at the top-level env_vars: section, at evaluation.env_vars, or per-task — the launcher handles writing a .secrets.env file that is uploaded alongside the batch script and sourced at runtime.
env_vars:
  HF_TOKEN: $host:HF_TOKEN            # resolved from host, never in batch script
  CACHE_DIR: $lit:/cache/huggingface  # literal path
  TRANSFORMERS_OFFLINE: $lit:1        # literal flag
Security: Secret values are never written into the generated run.sub script. They are stored in a separate .secrets.env file and sourced at runtime, preventing accidental exposure in logs or artifacts.
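The source-at-runtime pattern can be sketched as follows. This is only an illustration of the mechanism, not the launcher's actual generated files; the file path and token value are stand-ins:

```shell
# Illustrative only: a minimal version of the pattern described above.
# The real launcher writes its own .secrets.env next to the batch script.
cat > /tmp/.secrets.env <<'EOF'
export HF_TOKEN=hf_example_token
EOF

# The batch script sources the file at runtime, so the secret value
# never appears in the script text, logs, or uploaded artifacts.
source /tmp/.secrets.env
echo "secrets loaded"
```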
Mounting and Storage#
The Slurm executor lets you mount host paths into both the deployment and evaluation containers:
execution:
  mounts:
    deployment:
      /path/to/checkpoints: /checkpoint
      /path/to/cache: /cache
    evaluation:
      /path/to/data: /data
      /path/to/results: /results
    mount_home: true  # Mount user home directory
Mount types:
Deployment mounts: for model checkpoints, cache directories, and model data.
Evaluation mounts: for input data, additional artifacts, and evaluation-specific files.
Home mount: optional mounting of the user home directory (enabled by default).
Complete Configuration Example#
Here’s a complete Slurm executor configuration using HuggingFace models:
# examples/slurm_llama_3_1_8b_instruct.yaml
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

env_vars:
  HF_TOKEN: $host:HF_TOKEN  # Needed to access the gated meta-llama/Llama-3.1-8B-Instruct model

execution:
  hostname: your-cluster-headnode
  account: your_account
  output_dir: /shared/scratch/your_username/eval_results
  partition: gpu
  walltime: "04:00:00"
  gpus_per_node: 8

deployment:
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct
  checkpoint_path: null
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8

evaluation:
  tasks:
    - name: hellaswag
    - name: arc_challenge
    - name: winogrande
This configuration:
Uses the Slurm execution backend
Deploys a vLLM model server on the cluster
Requests GPU resources (8 GPUs per node, 4-hour time limit)
Runs three benchmark tasks in parallel
Saves benchmark artifacts to output_dir
Resuming#
The Slurm executor includes advanced auto-resume capabilities:
Automatic Resumption#
Timeout Handling: Jobs automatically resume after timeout
Preemption Recovery: Automatic resumption after job preemption
Node Failure Recovery: Jobs resume after node failures
Dependency Management: Uses Slurm job dependencies for resumption
How It Works#
Initial Submission: Job is submitted with auto-resume handler
Failure Detection: Script detects timeout/preemption/failure
Automatic Resubmission: New job is submitted with dependency on previous job
Progress Preservation: Evaluation continues from where it left off
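The resubmission step above relies on standard Slurm job dependencies. The sketch below only builds the command string for illustration; the handler script name (run.sub) and the exact dependency type are assumptions, though afternotok (run only if the predecessor failed, timed out, or was preempted) is the natural fit for this pattern:

```shell
# Illustrative sketch, not the executor's actual handler: chain a new
# submission onto a predecessor so it starts only if that job did NOT
# finish successfully (timeout, preemption, node failure).
build_resubmit_cmd() {
  local prev_jobid="$1"
  echo "sbatch --dependency=afternotok:${prev_jobid} run.sub"
}

build_resubmit_cmd 123456
# → sbatch --dependency=afternotok:123456 run.sub
```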
Maximum Total Walltime#
To prevent jobs from resuming indefinitely, you can configure a maximum total wall-clock time across all resumes using the max_walltime parameter:
execution:
  walltime: "04:00:00"      # Time limit per job submission
  max_walltime: "24:00:00"  # Maximum total time across all resumes (optional)
How it works:
The actual runtime of each job is tracked using Slurm's sacct command
When a job resumes, the previous job's actual elapsed time is added to the accumulated total
Before starting each resumed job, the accumulated runtime is checked against max_walltime
If the accumulated runtime exceeds max_walltime, the job chain stops with an error
This prevents runaway jobs that might otherwise resume indefinitely
Configuration:
max_walltime: maximum total runtime in HH:MM:SS format (e.g., "24:00:00" for 24 hours). Defaults to "120:00:00" (120 hours). Set to null for unlimited resuming.
Note
The max_walltime tracks actual job execution time only, excluding time spent waiting in the queue. This ensures accurate runtime accounting even when jobs are repeatedly preempted or must wait for resources.
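The accounting rule can be sketched in a few lines. This is an illustration of the described behavior, not the executor's actual implementation:

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS walltime string (hours may exceed 24) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def may_resume(elapsed_per_job, max_walltime):
    """Decide whether another resume is allowed under max_walltime.

    elapsed_per_job: elapsed times of completed submissions, as reported
    by sacct (queue wait time excluded, matching the note above).
    max_walltime of None means unlimited resuming.
    """
    if max_walltime is None:
        return True
    accumulated = sum(hms_to_seconds(e) for e in elapsed_per_job)
    return accumulated < hms_to_seconds(max_walltime)

# Three 4-hour submissions against a 24-hour cap: resuming is still allowed.
print(may_resume(["04:00:00", "04:00:00", "04:00:00"], "24:00:00"))  # True
```

Note that with a per-submission walltime of "04:00:00" and the default cap of "120:00:00", a job chain could resume up to roughly 30 times before the cap stops it.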
Monitoring and Job Management#
For monitoring jobs, checking status, and managing evaluations, see the Executors Overview section.