Slurm Executor#

The Slurm executor runs evaluations on high‑performance computing (HPC) clusters managed by Slurm, an open‑source workload manager widely used in research and enterprise environments. It schedules and executes jobs across cluster nodes, enabling parallel, large‑scale evaluation runs while preserving reproducibility via containerized benchmarks.

See common concepts and commands in Executors.

The Slurm executor can optionally host your model for the duration of an evaluation by deploying a serving container on the cluster and pointing the benchmark at that temporary endpoint. In this mode, two containers are used: one for the evaluation harness and one for the model server. When deployment is enabled, the evaluation configuration includes a deployment section. See the examples/ directory for ready-to-use configurations.

If you do not require deployment on Slurm, simply omit the deployment section from your configuration and set the model’s endpoint URL directly (any OpenAI‑compatible endpoint that you host elsewhere).
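A minimal sketch of that setup is shown below. The key names under target are an illustrative assumption, so check the ready-to-use configurations in the examples/ directory for the exact schema:

execution:
  hostname: your-cluster-headnode
  account: your_account
  output_dir: /shared/scratch/your_username/eval_results

# No deployment section: the evaluation talks to an endpoint you host elsewhere.
# The target/api_endpoint key names below are an assumption for illustration.
target:
  api_endpoint:
    url: https://your-endpoint.example.com/v1/chat/completions   # any OpenAI-compatible endpoint
    model_id: your-model-name

evaluation:
  tasks:
    - name: hellaswag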

Prerequisites#

  • Access to a Slurm cluster (with appropriate partitions/queues)

  • Pyxis SPANK plugin installed on the cluster (provides container support for Slurm jobs)

Configuration Overview#

Connecting to Your Slurm Cluster#

To run evaluations on Slurm, specify how to connect to your cluster:

execution:
  hostname: your-cluster-headnode      # Slurm headnode (login node)
  username: your_username            # Cluster username (defaults to $USER env var)
  account: your_allocation           # Slurm account or project name
  output_dir: /shared/scratch/your_username/eval_results  # Absolute, shared path

Note

When specifying these parameters, keep the following in mind:

  • hostname: Slurm headnode (login node) where you normally SSH to submit jobs.

  • output_dir: must be an absolute path on a shared filesystem (e.g., /shared/scratch/ or /home/) accessible to both the headnode and compute nodes.

Model Deployment Options#

When deploying models on Slurm, you have two options for specifying your model source:
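
Option 1: HuggingFace Model Handle (Automatic Download)#

If the model is hosted on the HuggingFace Hub, set hf_model_handle and the checkpoint is downloaded on the cluster for you; gated models such as meta-llama/Llama-3.1-8B-Instruct additionally require HF_TOKEN (see Environment Variables and Secrets below):

deployment:
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct   # downloaded automatically on the cluster
  checkpoint_path: null                               # leave unset when using hf_model_handle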

Option 2: Local Model Files (Manual Setup Required)#

If you are working with a checkpoint stored locally on the cluster, use checkpoint_path:

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b-instruct  # model directory accessible to compute nodes
  # Do NOT set hf_model_handle when using checkpoint_path

Note

  • The directory must exist, be accessible from compute nodes, and contain model files

  • The model is not downloaded automatically when using checkpoint_path; you must place the files on the shared filesystem yourself

Environment Variables and Secrets#

Environment variables use the unified prefix syntax ($host:, $lit:, $runtime:) described in Environment Variables. Declare them in the top-level env_vars: section, under evaluation.env_vars, or per task; the launcher writes a .secrets.env file that is uploaded alongside the batch script and sourced at runtime.

env_vars:
  HF_TOKEN: $host:HF_TOKEN              # resolved from host, never in batch script
  CACHE_DIR: $lit:/cache/huggingface     # literal path
  TRANSFORMERS_OFFLINE: $lit:1           # literal flag

Security: Secret values are never written into the generated run.sub script. They are stored in a separate .secrets.env file and sourced at runtime, preventing accidental exposure in logs or artifacts.
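The same prefix syntax applies at the other declaration points mentioned above. The sketch below assumes per-task variables nest under an env_vars key inside each task entry, and the variable names are illustrative:

env_vars:
  HF_TOKEN: $host:HF_TOKEN          # shared by deployment and evaluation

evaluation:
  env_vars:
    EVAL_LOG_LEVEL: $lit:DEBUG      # illustrative: applies to every evaluation task
  tasks:
    - name: hellaswag
      env_vars:
        TASK_SPECIFIC_FLAG: $lit:1  # illustrative: per-task override (assumed nesting)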

Mounting and Storage#

The Slurm executor can mount host directories into the deployment and evaluation containers:

execution:
  mounts:
    deployment:
      /path/to/checkpoints: /checkpoint
      /path/to/cache: /cache
    evaluation:
      /path/to/data: /data
      /path/to/results: /results
    mount_home: true  # Mount user home directory

Mount Types:

  • Deployment Mounts: For model checkpoints, cache directories, and model data

  • Evaluation Mounts: For input data, additional artifacts, and evaluation-specific files

  • Home Mount: Optional mounting of the user home directory (enabled by default)
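
Deployment mounts combine naturally with the env_vars shown earlier, for example to reuse a persistent HuggingFace cache between runs. The sketch below assumes the serving container honours HF_HOME; whether your deployment image reads that variable is an assumption to verify:

execution:
  mounts:
    deployment:
      /shared/scratch/your_username/hf_cache: /cache   # host cache directory -> container path

env_vars:
  HF_HOME: $lit:/cache/huggingface   # assumption: the serving container reads HF_HOME for its cache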

Complete Configuration Example#

Here’s a complete Slurm executor configuration that deploys a HuggingFace model with vLLM:

# examples/slurm_llama_3_1_8b_instruct.yaml
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

env_vars:
  HF_TOKEN: $host:HF_TOKEN   # Needed to access meta-llama/Llama-3.1-8B-Instruct gated model

execution:
  hostname: your-cluster-headnode
  account: your_account
  output_dir: /shared/scratch/your_username/eval_results
  partition: gpu
  walltime: "04:00:00"
  gpus_per_node: 8

deployment:
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct
  checkpoint_path: null
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8

evaluation:
  tasks:
    - name: hellaswag
    - name: arc_challenge
    - name: winogrande

This configuration:

  • Uses the Slurm execution backend

  • Deploys a vLLM model server on the cluster

  • Requests GPU resources (8 GPUs per node, 4-hour time limit)

  • Runs three benchmark tasks in parallel

  • Saves benchmark artifacts to output_dir

Resuming#

The Slurm executor includes advanced auto-resume capabilities:

Automatic Resumption#

  • Timeout Handling: Jobs automatically resume after timeout

  • Preemption Recovery: Automatic resumption after job preemption

  • Node Failure Recovery: Jobs resume after node failures

  • Dependency Management: Uses Slurm job dependencies for resumption

How It Works#

  1. Initial Submission: Job is submitted with auto-resume handler

  2. Failure Detection: Script detects timeout/preemption/failure

  3. Automatic Resubmission: New job is submitted with dependency on previous job

  4. Progress Preservation: Evaluation continues from where it left off

Maximum Total Walltime#

To prevent jobs from resuming indefinitely, you can configure a maximum total wall-clock time across all resumes using the max_walltime parameter:

execution:
  walltime: "04:00:00"       # Time limit per job submission
  max_walltime: "24:00:00"   # Maximum total time across all resumes (optional)

How it works:

  • The actual runtime of each job is tracked using Slurm’s sacct command

  • When a job resumes, the previous job’s actual elapsed time is added to the accumulated total

  • Before starting each resumed job, the accumulated runtime is checked against max_walltime

  • If the accumulated runtime exceeds max_walltime, the job chain stops with an error

  • This prevents runaway jobs that might otherwise resume indefinitely

Configuration:

  • max_walltime: Maximum total runtime in HH:MM:SS format (e.g., "24:00:00" for 24 hours)

  • Defaults to "120:00:00" (120 hours). Set to null for unlimited resuming
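
As an illustration of the accounting, the comments below trace a hypothetical resume chain under the settings from the earlier snippet (the elapsed times are made up):

execution:
  walltime: "04:00:00"
  max_walltime: "24:00:00"
  # Job 1: hits the 04:00:00 limit and times out  -> accumulated 04:00:00, resubmitted
  # Job 2: preempted after 01:30:00               -> accumulated 05:30:00, resubmitted
  # ...
  # Job N: accumulated runtime reaches 24:00:00   -> chain stops with an error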

Note

The max_walltime tracks actual job execution time only, excluding time spent waiting in the queue. This ensures accurate runtime accounting even when jobs are repeatedly preempted or must wait for resources.

Monitoring and Job Management#

For monitoring jobs, checking status, and managing evaluations, see the Executors Overview section.