Slurm Deployment via Launcher#
Deploy and evaluate models on HPC clusters that use the Slurm workload manager, orchestrated by the NeMo Evaluator Launcher.
Overview#
When orchestrating a deployment on Slurm, the launcher:
Submits jobs to Slurm-managed HPC clusters
Supports multi-node evaluation runs
Handles resource allocation and job scheduling
Manages model deployment lifecycle within Slurm jobs
Quick Start#
# Deploy and evaluate on Slurm cluster
nv-eval run \
--config-dir examples \
--config-name slurm_llama_3_1_8b_instruct \
-o deployment.checkpoint_path=/shared/models/llama-3.1-8b-instruct \
-o execution.partition=gpu
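Configuration values can be overridden from the command line with -o, as shown above. For example, a sketch that points the same run at a different cluster (the hostname, account, and path values below are placeholders for your environment):
# Override cluster connection and model location (placeholder values)
nv-eval run \
--config-dir examples \
--config-name slurm_llama_3_1_8b_instruct \
-o execution.hostname=login.my-cluster.example.com \
-o execution.account=my-account \
-o deployment.checkpoint_path=/shared/models/llama-3.1-8b-instruct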
vLLM Deployment#
# Slurm with vLLM deployment
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

deployment:
  type: vllm
  checkpoint_path: /shared/models/llama-3.1-8b-instruct
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8
  port: 8000

execution:
  account: my-account
  output_dir: /shared/results
  partition: gpu
  num_nodes: 1
  ntasks_per_node: 1
  gres: gpu:8
  walltime: "02:00:00"

target:
  api_endpoint:
    url: http://localhost:8000/v1/chat/completions
    model_id: meta-llama/Llama-3.1-8B-Instruct

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
    - name: mbpp
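In this example the vLLM server runs inside the same Slurm job as the evaluation, which is why the target url points at localhost on the configured deployment port. If port 8000 is unavailable on the node, both values can be changed together; the value below is only an illustrative alternative:
deployment:
  port: 8080
target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions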
Slurm Configuration#
Supported Parameters#
The following execution parameters are supported for Slurm deployments. See configs/execution/slurm/default.yaml
in the launcher package for the base configuration:
execution:
  # Required parameters
  hostname: ???              # Slurm cluster hostname
  username: ${oc.env:USER}   # SSH username (defaults to USER environment variable)
  account: ???               # Slurm account for billing
  output_dir: ???            # Results directory

  # Resource allocation
  partition: batch           # Slurm partition/queue
  num_nodes: 1               # Number of nodes
  ntasks_per_node: 1         # Tasks per node
  gres: gpu:8                # GPU resources
  walltime: "01:00:00"       # Wall time limit (HH:MM:SS)

  # Environment variables and mounts
  env_vars:
    deployment: {}           # Environment variables for deployment container
    evaluation: {}           # Environment variables for evaluation container
  mounts:
    deployment: {}           # Mount points for deployment container (source:target format)
    evaluation: {}           # Mount points for evaluation container (source:target format)
  mount_home: true           # Whether to mount home directory
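As an illustration of the env_vars and mounts structure, the sketch below forwards a token to the deployment container and adds one mount per container. The variable name and paths are placeholders, and each mapping entry is one reading of the source:target format noted above:
execution:
  env_vars:
    deployment:
      HF_TOKEN: ${oc.env:HF_TOKEN}     # placeholder: forward a token into the deployment container
  mounts:
    deployment:
      /shared/models: /shared/models   # placeholder: checkpoint directory mount
    evaluation:
      /shared/datasets: /datasets      # placeholder: evaluation data mount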
Note
The gpus_per_node parameter can be used as an alternative to gres for specifying GPU resources; however, gres is the default in the base configuration.
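For example, as a command-line override (check your base configuration before combining it with gres):
# Request GPUs per node instead of using gres
-o execution.gpus_per_node=8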
Configuration Examples#
Benchmark Suite Evaluation#
# Run multiple benchmarks on a single model
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

deployment:
  type: vllm
  checkpoint_path: /shared/models/llama-3.1-8b-instruct
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8
  port: 8000

execution:
  account: my-account
  output_dir: /shared/results
  hostname: slurm.example.com
  partition: gpu
  num_nodes: 1
  ntasks_per_node: 1
  gres: gpu:8
  walltime: "06:00:00"

target:
  api_endpoint:
    url: http://localhost:8000/v1/chat/completions
    model_id: meta-llama/Llama-3.1-8B-Instruct

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
    - name: mbpp
    - name: hellaswag
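Assuming the configuration above is saved as my_benchmark_suite.yaml in your configuration directory (both names are placeholders), it is launched the same way as the Quick Start example:
nv-eval run \
--config-dir <your_config_dir> \
--config-name my_benchmark_suite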
Job Management#
Submitting Jobs#
# Submit job with configuration
nv-eval run \
--config-dir examples \
--config-name slurm_llama_3_1_8b_instruct
# Submit with configuration overrides
nv-eval run \
--config-dir examples \
--config-name slurm_llama_3_1_8b_instruct \
-o execution.walltime="04:00:00" \
-o execution.partition=gpu-long
Monitoring Jobs#
# Check job status
nv-eval status <job_id>
# List all runs (optionally filter by executor)
nv-eval ls runs --executor slurm
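For long-running jobs it can be convenient to poll the status periodically, for example with the standard watch utility (the 60-second interval is arbitrary):
# Re-check job status every 60 seconds
watch -n 60 nv-eval status <job_id>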
Managing Jobs#
# Cancel job
nv-eval kill <job_id>
Native Slurm Commands#
You can also use native Slurm commands to manage jobs directly:
# View job details
squeue -j <slurm_job_id> -o "%.18i %.9P %.50j %.8u %.2t %.10M %.6D %R"
# Check job efficiency
seff <slurm_job_id>
# Cancel Slurm job directly
scancel <slurm_job_id>
# Hold/release job
scontrol hold <slurm_job_id>
scontrol release <slurm_job_id>
# View detailed job information
scontrol show job <slurm_job_id>
Troubleshooting#
Common Issues#
Job Pending:
# Check node availability
sinfo -p gpu
# Try different partition
-o execution.partition="gpu-shared"
Job Failed:
# Check job status
nv-eval status <job_id>
# View Slurm job details
scontrol show job <slurm_job_id>
# Check job output logs (location shown in status output)
Job Timeout:
# Increase walltime
-o execution.walltime="08:00:00"
# Check current walltime limit for partition
sinfo -p <partition_name> -o "%P %l"
Resource Allocation:
# Adjust GPU allocation via gres (keep the parallelism settings consistent with the GPU count)
-o execution.gres=gpu:4
-o deployment.tensor_parallel_size=4
-o deployment.data_parallel_size=1
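As a rule of thumb drawn from the examples above, tensor_parallel_size × data_parallel_size should equal the GPU count requested via gres. An alternative split for the same illustrative 4-GPU allocation:
# Four single-GPU replicas instead of one 4-way tensor-parallel replica
-o execution.gres=gpu:4
-o deployment.tensor_parallel_size=1
-o deployment.data_parallel_size=4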
Debugging with Slurm Commands#
# View job details
scontrol show job <slurm_job_id>
# Monitor resource usage
sstat -j <slurm_job_id> --format=AveCPU,AveRSS,MaxRSS,AveVMSize
# Job accounting information
sacct -j <slurm_job_id> --format=JobID,JobName,State,ExitCode,DerivedExitCode
# Check job efficiency after completion
seff <slurm_job_id>