bridge.utils.slurm_utils
Utilities for detecting and configuring SLURM cluster environments.
This module provides functionality to detect SLURM environments and extract distributed training configuration from SLURM environment variables.
Module Contents
Functions
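| is_slurm_job | Detect if running in a SLURM environment. |
| resolve_slurm_rank | Get the global rank from SLURM environment. |
| resolve_slurm_world_size | Get the world size from SLURM environment. |
| resolve_slurm_local_rank | Get the local rank from SLURM environment. |
| resolve_slurm_master_addr | Parse SLURM_NODELIST to get the master node address. |
| resolve_slurm_master_port | Get master port for SLURM job. |
| _parse_slurm_nodelist | Parse a SLURM nodelist string and extract the first node. |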
API
- bridge.utils.slurm_utils.is_slurm_job() -> bool
Detect if running in a SLURM environment.
- Returns:
True if SLURM job detected, False otherwise.
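The reference does not say which signal the check relies on. A minimal sketch, assuming the presence of `SLURM_JOB_ID` marks a SLURM job; `looks_like_slurm_job` and its `env` parameter are hypothetical names used for illustration, not part of this module:

```python
import os
from typing import Mapping, Optional


def looks_like_slurm_job(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: report whether SLURM_JOB_ID is set.

    Assumption: SLURM_JOB_ID is the detection signal. The real
    is_slurm_job() may inspect a different variable.
    """
    # Fall back to the process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return "SLURM_JOB_ID" in env
```

Passing the environment as a mapping keeps the check easy to exercise outside a cluster.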
- bridge.utils.slurm_utils.resolve_slurm_rank() -> int | None
Get the global rank from SLURM environment.
- Returns:
The global rank, or None if not in SLURM environment.
- bridge.utils.slurm_utils.resolve_slurm_world_size() -> int | None
Get the world size from SLURM environment.
- Returns:
The world size, or None if not in SLURM environment.
- bridge.utils.slurm_utils.resolve_slurm_local_rank() -> int | None
Get the local rank from SLURM environment.
- Returns:
The local rank, or None if not in SLURM environment.
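The three rank helpers share one contract: return an integer inside a SLURM job and None otherwise. The sketch below illustrates that contract under the assumption that the values come from `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID`, which are the variables SLURM sets for each task. The function names and the `env` parameter are hypothetical, and the real implementation may read different variables:

```python
import os
from typing import Mapping, Optional


def _int_from_env(name: str, env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Return the variable as an int, or None when it is not set."""
    env = os.environ if env is None else env
    value = env.get(name)
    return int(value) if value is not None else None


def sketch_rank(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    # Assumption: SLURM_PROCID holds the global task rank.
    return _int_from_env("SLURM_PROCID", env)


def sketch_world_size(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    # Assumption: SLURM_NTASKS holds the total number of tasks.
    return _int_from_env("SLURM_NTASKS", env)


def sketch_local_rank(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    # Assumption: SLURM_LOCALID holds the node-local task index.
    return _int_from_env("SLURM_LOCALID", env)
```

Because None signals "not in SLURM", callers can fall back to another launcher's variables when any of the three is missing.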
- bridge.utils.slurm_utils.resolve_slurm_master_addr() -> str | None
Parse SLURM_NODELIST to get the master node address.
Handles common SLURM nodelist formats:
Simple list: "node001,node002" -> "node001"
Range: "node[001-004]" -> "node001"
List in brackets: "node[001,003,005]" -> "node001"
- Returns:
The master node address, or None if not in SLURM environment.
- bridge.utils.slurm_utils.resolve_slurm_master_port() -> int | None
Get master port for SLURM job.
Uses a deterministic port based on SLURM_JOB_ID to avoid conflicts when multiple jobs run on the same nodes.
- Returns:
The master port, or None if not in SLURM environment.
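The exact formula is not documented here. One common way to derive a deterministic port from a job ID is to fold the ID into a fixed range above a base port, sketched below. `sketch_master_port`, `base_port`, and `port_range` are hypothetical names, and their default values are illustrative assumptions rather than the module's actual constants:

```python
import os
from typing import Mapping, Optional


def sketch_master_port(
    env: Optional[Mapping[str, str]] = None,
    base_port: int = 15000,   # assumed lower bound of the port window
    port_range: int = 5000,   # assumed width of the port window
) -> Optional[int]:
    """Map SLURM_JOB_ID onto a port in [base_port, base_port + port_range)."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        return None  # not running under SLURM
    # Every task of a job sees the same job ID, so all ranks agree on the port.
    return base_port + int(job_id) % port_range
```

Two jobs collide under this scheme only when their IDs are congruent modulo `port_range`, so a wider window lowers the chance of a clash but does not eliminate it.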
- bridge.utils.slurm_utils._parse_slurm_nodelist(nodelist: str) -> str
Parse a SLURM nodelist string and extract the first node.
Handles common SLURM nodelist formats:
Simple list: "node001,node002" -> "node001"
Range: "node[001-004]" -> "node001"
List in brackets: "node[001,003,005]" -> "node001"
- Parameters:
nodelist – The SLURM nodelist string to parse.
- Returns:
The hostname of the first node in the list.
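The three formats listed above can be handled with a single regular expression plus a plain comma split. The following is a minimal sketch of that idea, not the module's actual implementation; `first_node` is a hypothetical name:

```python
import re


def first_node(nodelist: str) -> str:
    """Return the first hostname from a compact SLURM nodelist string."""
    # Bracketed form at the start of the string: prefix[spec], where spec
    # is a range ("001-004") or a comma-separated list ("001,003,005").
    match = re.match(r"^([^\[,]+)\[([^\]]+)\]", nodelist)
    if match:
        prefix, spec = match.groups()
        # The first index ends at the first "," or "-" inside the brackets.
        first_index = re.split(r"[,-]", spec, maxsplit=1)[0]
        return prefix + first_index
    # Plain comma-separated hostnames: take the first entry.
    return nodelist.split(",")[0].strip()


print(first_node("node001,node002"))
print(first_node("node[001-004]"))
print(first_node("node[001,003,005]"))
```

This sketch ignores less common forms, such as a suffix after the closing bracket. On a cluster, `scontrol show hostnames` expands a nodelist into one hostname per line and copes with those cases.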