bridge.utils.slurm_utils
Utilities for detecting and configuring SLURM cluster environments.
This module detects SLURM environments and extracts distributed training configuration (master address and port) from SLURM environment variables.
Module Contents
Functions
resolve_slurm_master_addr | Parse SLURM_NODELIST to get the master node address.
resolve_slurm_master_port | Get master port for SLURM job.
_parse_slurm_nodelist | Parse a SLURM nodelist string and extract the first node.
API
- bridge.utils.slurm_utils.resolve_slurm_master_addr() -> str | None
Parse SLURM_NODELIST to get the master node address.
Handles common SLURM nodelist formats:
Simple list: “node001,node002” -> “node001”
Range: “node[001-004]” -> “node001”
List in brackets: “node[001,003,005]” -> “node001”
- Returns:
The master node address, or None if not in a SLURM environment.
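Because the function returns None outside SLURM, callers need a fallback. A minimal sketch of that pattern follows. `choose_master_addr`, the two stand-in resolvers, and the `127.0.0.1` default are hypothetical names and values used for illustration only; they are not part of this module.

```python
from typing import Callable, Optional


def choose_master_addr(
    resolve: Callable[[], Optional[str]],
    fallback: str = "127.0.0.1",
) -> str:
    """Return the resolver's address, or `fallback` when it reports None."""
    addr = resolve()
    return addr if addr is not None else fallback


# Stand-in resolvers so the sketch runs without the real module installed.
def inside_slurm() -> Optional[str]:
    return "node001"  # the kind of value a resolver yields inside a job


def outside_slurm() -> Optional[str]:
    return None  # the documented result outside a SLURM environment


print(choose_master_addr(inside_slurm))
print(choose_master_addr(outside_slurm))
```

In real use the `resolve` argument would presumably be `resolve_slurm_master_addr` itself, and the result could be exported as `MASTER_ADDR` for launchers that read that variable.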
- bridge.utils.slurm_utils.resolve_slurm_master_port() -> int | None
Get master port for SLURM job.
Uses a deterministic port based on SLURM_JOB_ID to avoid conflicts when multiple jobs run on the same nodes.
- Returns:
The master port, or None if not in a SLURM environment.
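The exact mapping from job ID to port is not documented here. One way such a deterministic port could be derived is sketched below. The function name `port_from_job_id`, the constants `PORT_BASE` and `PORT_SPAN`, and the modulo formula are all assumptions, not the module's actual implementation.

```python
from typing import Mapping, Optional

# Assumed values for illustration; the real module may use a different range.
PORT_BASE = 20000
PORT_SPAN = 20000


def port_from_job_id(env: Mapping[str, str]) -> Optional[int]:
    """Map SLURM_JOB_ID to a port in [PORT_BASE, PORT_BASE + PORT_SPAN)."""
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None or not job_id.isdigit():
        return None  # not running under SLURM, or an unexpected value
    # Same job ID always yields the same port, so every rank agrees on it.
    return PORT_BASE + int(job_id) % PORT_SPAN


print(port_from_job_id({"SLURM_JOB_ID": "123456"}))
print(port_from_job_id({}))
```

Every rank of a job sees the same `SLURM_JOB_ID`, so all ranks compute the same port without communicating. In this sketch, two jobs whose IDs differ by a multiple of `PORT_SPAN` would still collide, so a modulo scheme lowers the chance of conflicts rather than ruling them out.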
- bridge.utils.slurm_utils._parse_slurm_nodelist(nodelist: str) -> str
Parse a SLURM nodelist string and extract the first node.
Handles common SLURM nodelist formats:
Simple list: “node001,node002” -> “node001”
Range: “node[001-004]” -> “node001”
List in brackets: “node[001,003,005]” -> “node001”
Mixed entries: “nodeA,nodeB[1-3],nodeC” -> “nodeA”
- Parameters:
nodelist – The SLURM nodelist string to parse.
- Returns:
The hostname of the first node in the list.
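The four formats listed above can be handled with a bracket-aware split followed by a small regular expression. The sketch below is an independent illustration under that assumption, not the module's source. `first_node` is a hypothetical name, and entries with more than one bracket group are returned unchanged.

```python
import re


def first_node(nodelist: str) -> str:
    """Return the first hostname described by a SLURM-style nodelist."""
    # The first entry ends at the first comma that is outside brackets.
    depth = 0
    end = len(nodelist)
    for i, ch in enumerate(nodelist):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            end = i
            break
    entry = nodelist[:end].strip()

    # Split "prefix[body]suffix" where exactly one bracket group is present.
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", entry)
    if match is None:
        # Plain hostname, or a form this sketch does not expand.
        return entry
    prefix, body, suffix = match.groups()
    # First bracket item: "001-004" -> "001", "001,003,005" -> "001".
    first = body.split(",")[0].split("-")[0]
    return f"{prefix}{first}{suffix}"


for example in (
    "node001,node002",
    "node[001-004]",
    "node[001,003,005]",
    "nodeA,nodeB[1-3],nodeC",
):
    print(example, "->", first_node(example))
```

Tracking bracket depth matters because a comma inside `[...]` separates indices, not hosts. Only the bracket body is split on `-`, so hyphens in the hostname prefix (for example `gpu-node[01-04]`) are preserved.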