Slurm Executor#

The Slurm executor runs evaluations on high‑performance computing (HPC) clusters managed by Slurm, an open‑source workload manager widely used in research and enterprise environments. It schedules and executes jobs across cluster nodes, enabling parallel, large‑scale evaluation runs while preserving reproducibility via containerized benchmarks.

See common concepts and commands in Executors.

Slurm can optionally host your model for the scope of an evaluation by deploying a serving container on the cluster and pointing the benchmark to that temporary endpoint. In this mode, two containers are used: one for the evaluation harness and one for the model server. The evaluation configuration includes a deployment section when this is enabled. See the examples in the examples/ directory for ready‑to‑use configurations.

If you do not require deployment on Slurm, simply omit the deployment section from your configuration and set the model’s endpoint URL directly (any OpenAI‑compatible endpoint that you host elsewhere).
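For example, a minimal sketch of a deployment-free configuration (the `target.api_endpoint` key names below are an assumption; check your configuration schema for the exact fields):

```yaml
execution:
  hostname: your-cluster-headnode
  output_dir: /shared/scratch/your_username/eval_results

# No deployment section: the benchmark talks to an endpoint you already host.
# The target.api_endpoint.* key names are illustrative assumptions.
target:
  api_endpoint:
    url: https://your-endpoint.example.com/v1/chat/completions   # any OpenAI-compatible endpoint
    model_id: your-model-name
```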

Prerequisites#

  • Access to a Slurm cluster (with appropriate partitions/queues)

  • Pyxis SPANK plugin installed on the cluster

Configuration Overview#

Connecting to Your Slurm Cluster#

To run evaluations on Slurm, specify how to connect to your cluster:

execution:
  hostname: your-cluster-headnode      # Slurm headnode (login node)
  username: your_username            # Cluster username (defaults to $USER env var)
  account: your_allocation           # Slurm account or project name
  output_dir: /shared/scratch/your_username/eval_results  # Absolute, shared path

Note

When specifying the parameters make sure to provide:

  • hostname: Slurm headnode (login node) where you normally SSH to submit jobs.

  • output_dir: must be an absolute path on a shared filesystem (e.g., /shared/scratch/ or /home/) accessible to both the headnode and compute nodes.

Model Deployment Options#

When deploying models on Slurm, you can specify your model source in one of two ways: a HuggingFace model handle (hf_model_handle), which is downloaded automatically, or a local checkpoint directory (checkpoint_path).

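Option 1: HuggingFace Model Handle (Automatic Download)#

If the model is hosted on the HuggingFace Hub, reference it by handle with hf_model_handle and the model files are downloaded for you on the cluster:

```yaml
deployment:
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct  # downloaded automatically on the cluster
  # Do NOT set checkpoint_path when using hf_model_handle
```

For gated models (such as the Llama family), provide HF_TOKEN via env_vars as described in Environment Variables and Secrets.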
Option 2: Local Model Files (Manual Setup Required)#

If you work with a checkpoint stored locally on the cluster, use checkpoint_path:

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b-instruct  # model directory accessible to compute nodes
  # Do NOT set hf_model_handle when using checkpoint_path

Note

  • The directory must exist, be accessible from compute nodes, and contain the model files

  • Slurm does not automatically download models when using checkpoint_path

Environment Variables and Secrets#

Environment variables use the unified prefix syntax (host:, lit:, runtime:) described in Environment Variables. Declare them at the top-level env_vars: section, at evaluation.env_vars, or per-task — the launcher handles writing a .secrets.env file that is uploaded alongside the batch script and sourced at runtime.

env_vars:
  HF_TOKEN: host:HF_TOKEN              # resolved from host, never in batch script
  CACHE_DIR: lit:/cache/huggingface     # literal path
  TRANSFORMERS_OFFLINE: lit:1           # literal flag

Security: Secret values are never written into the generated run.sub script. They are stored in a separate .secrets.env file and sourced at runtime, preventing accidental exposure in logs or artifacts.
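The same prefix syntax works at the narrower scopes listed above. A sketch (the variable names are illustrative, and the per-task env_vars placement shown here is an assumption about the task schema):

```yaml
evaluation:
  env_vars:
    JUDGE_API_KEY: host:JUDGE_API_KEY   # hypothetical secret, applied to all tasks
  tasks:
    - name: hellaswag
      env_vars:
        TASK_DEBUG: lit:1               # per-task value (illustrative)
```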

Multi-Node and Multi-Instance#

Configure multi-node deployments using num_nodes and num_instances:

execution:
  num_nodes: 4              # Total Slurm nodes
  num_instances: 2          # Independent deployment instances (default: 1)

  • num_nodes: Total number of Slurm nodes to allocate

  • num_instances: Number of independent deployment instances. When > 1, HAProxy is automatically configured to load-balance across instances. num_nodes must be divisible by num_instances.

Note

You can use the vllm_ray deployment config when you need vLLM's Ray-based multi-node deployment (see the deployment docs).

Mounting and Storage#

The Slurm executor lets you mount host paths into the deployment and evaluation containers independently:

execution:
  mounts:
    deployment:
      /path/to/checkpoints: /checkpoint
      /path/to/cache: /cache
    evaluation:
      /path/to/data: /data
      /path/to/results: /results
    mount_home: true  # Mount user home directory

Mount Types:

  • Deployment Mounts: For model checkpoints, cache directories, and model data

  • Evaluation Mounts: For input data, additional artifacts, and evaluation-specific files

  • Home Mount: Optional mounting of the user home directory (enabled by default)

Complete Configuration Example#

Here’s a complete Slurm executor configuration using a HuggingFace model:

# examples/slurm_llama_3_1_8b_instruct.yaml
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

env_vars:
  HF_TOKEN: host:HF_TOKEN   # Needed to access meta-llama/Llama-3.1-8B-Instruct gated model

execution:
  hostname: your-cluster-headnode
  account: your_account
  output_dir: /shared/scratch/your_username/eval_results
  partition: gpu
  walltime: "04:00:00"
  endpoint_readiness_timeout: 1200  # wait up to 20 minutes for model server
  gpus_per_node: 8

deployment:
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct
  checkpoint_path: null
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8

evaluation:
  tasks:
    - name: hellaswag
    - name: arc_challenge
    - name: winogrande

This configuration:

  • Uses the Slurm execution backend

  • Deploys a vLLM model server on the cluster

  • Requests GPU resources (8 GPUs per node, 4-hour time limit)

  • Runs three benchmark tasks in parallel

  • Saves benchmark artifacts to output_dir

Resuming#

The Slurm executor includes advanced auto-resume capabilities:

Automatic Resumption#

  • Timeout Handling: Jobs automatically resume after timeout

  • Preemption Recovery: Automatic resumption after job preemption

  • Node Failure Recovery: Jobs resume after node failures

  • Dependency Management: Uses Slurm job dependencies for resumption

How It Works#

  1. Initial Submission: Job is submitted with auto-resume handler

  2. Failure Detection: Script detects timeout/preemption/failure

  3. Automatic Resubmission: New job is submitted with dependency on previous job

  4. Progress Preservation: Evaluation continues from where it left off

Maximum Total Walltime#

To prevent jobs from resuming indefinitely, you can configure a maximum total wall-clock time across all resumes using the max_walltime parameter:

execution:
  walltime: "04:00:00"       # Time limit per job submission
  max_walltime: "24:00:00"   # Maximum total time across all resumes (optional)

How it works:

  • The actual runtime of each job is tracked using Slurm’s sacct command

  • When a job resumes, the previous job’s actual elapsed time is added to the accumulated total

  • Before starting each resumed job, the accumulated runtime is checked against max_walltime

  • If the accumulated runtime exceeds max_walltime, the job chain stops with an error

  • This prevents runaway jobs that might otherwise resume indefinitely
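The accounting above can be sketched as follows (a simplified illustration, not the executor's actual code; it assumes elapsed times are reported by sacct in HH:MM:SS form):

```python
def to_seconds(hhmmss: str) -> int:
    """Parse a Slurm-style HH:MM:SS duration into seconds."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

def may_resume(previous_elapsed: list[str], max_walltime: str) -> bool:
    """Return True if the accumulated runtime of all previous jobs is
    still below max_walltime, i.e. another resume is allowed."""
    accumulated = sum(to_seconds(e) for e in previous_elapsed)
    return accumulated < to_seconds(max_walltime)

# Two 4-hour runs against a 24-hour cap: 8h < 24h, so resuming is allowed.
print(may_resume(["04:00:00", "04:00:00"], "24:00:00"))  # True
```

Because only actual elapsed time is summed, queue-wait time never counts against the cap.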

Configuration:

  • max_walltime: Maximum total runtime in HH:MM:SS format (e.g., "24:00:00" for 24 hours)

  • Defaults to "120:00:00" (120 hours). Set to null for unlimited resuming

Note

The max_walltime tracks actual job execution time only, excluding time spent waiting in the queue. This ensures accurate runtime accounting even when jobs are repeatedly preempted or must wait for resources.

Endpoint Readiness Timeout#

When deploying a model server on Slurm, the executor waits for the server’s health endpoint to return HTTP 200 before starting the evaluation. By default it waits up to the configured walltime. You can override this with endpoint_readiness_timeout (in seconds):

execution:
  endpoint_readiness_timeout: 1200  # wait up to 20 minutes

If the server does not become ready within the timeout, the job fails with a clear error instead of waiting until the Slurm walltime expires.

Arbitrary sbatch Flags#

For advanced cluster configurations that require sbatch options not natively exposed by the Slurm executor (such as --switches, --constraint, --mem, --reservation, etc.), use the sbatch_extra_flags dict:

execution:
  sbatch_extra_flags:
    switches: 1           # Emits: #SBATCH --switches 1
    constraint: "h100"    # Emits: #SBATCH --constraint h100
    exclusive: true       # Emits: #SBATCH --exclusive  (boolean true → flag only, no value)
    mem: "0"              # Emits: #SBATCH --mem 0

Each key-value pair in sbatch_extra_flags generates a corresponding #SBATCH --<key> <value> header in the batch script. Boolean true values emit the flag without a value (useful for bare flags like --exclusive). Values of false or null are silently skipped.
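The emission rule can be sketched as follows (an illustration of the mapping described above, not the launcher's actual implementation):

```python
def emit_sbatch_headers(extra_flags: dict) -> list[str]:
    """Map an sbatch_extra_flags dict to #SBATCH header lines:
    True -> bare flag, False/None -> skipped, anything else -> flag + value."""
    headers = []
    for key, value in extra_flags.items():
        if value is True:
            headers.append(f"#SBATCH --{key}")          # bare flag, no value
        elif value is False or value is None:
            continue                                     # silently skipped
        else:
            headers.append(f"#SBATCH --{key} {value}")   # flag with value
    return headers

flags = {"switches": 1, "constraint": "h100", "exclusive": True, "mem": "0"}
for line in emit_sbatch_headers(flags):
    print(line)
# #SBATCH --switches 1
# #SBATCH --constraint h100
# #SBATCH --exclusive
# #SBATCH --mem 0
```

Note the identity check (`value is True`): a numeric 1 still emits a flag with a value, while only a literal boolean true becomes a bare flag.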

The default config includes exclusive: true, which requests exclusive node access. To allow node sharing (useful on clusters that charge by resource usage rather than by node), override it:

execution:
  sbatch_extra_flags:
    exclusive: false   # Allow node sharing
    switches: 1        # Add any other flags you need

Tip

--switches=1 is useful for multi-node deployments. It instructs Slurm to allocate all nodes on the same network switch, which can reduce inter-node communication latency.

Monitoring and Job Management#

For monitoring jobs, checking status, and managing evaluations, see the Executors Overview section.