bridge.utils.slurm_utils#

Utilities for detecting and configuring SLURM cluster environments.

This module detects SLURM environments and extracts distributed training configuration from SLURM environment variables.

Module Contents#

Functions#

resolve_slurm_master_addr

Parse SLURM_NODELIST to get the master node address.

resolve_slurm_master_port

Get the master port for a SLURM job.

_parse_slurm_nodelist

Parse a SLURM nodelist string and extract the first node.

API#

bridge.utils.slurm_utils.resolve_slurm_master_addr() str | None#

Parse SLURM_NODELIST to get the master node address.

Handles common SLURM nodelist formats:

  • Simple list: “node001,node002” -> “node001”

  • Range: “node[001-004]” -> “node001”

  • List in brackets: “node[001,003,005]” -> “node001”

Returns:

The master node address, or None if not running in a SLURM environment.

bridge.utils.slurm_utils.resolve_slurm_master_port() int | None#

Get the master port for a SLURM job.

Uses a deterministic port based on SLURM_JOB_ID to avoid conflicts when multiple jobs run on the same nodes.

Returns:

The master port, or None if not running in a SLURM environment.
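This page does not say how the port is derived from SLURM_JOB_ID. The sketch below shows one common way to get a deterministic port: add the job ID, modulo a fixed span, to a base port. The name `port_from_job_id` and the values 15000 and 10000 are hypothetical choices for illustration, not this module's implementation.

```python
import os
from typing import Optional


def port_from_job_id(base: int = 15000, span: int = 10000) -> Optional[int]:
    """Hypothetical helper: map SLURM_JOB_ID to a port in [base, base + span)."""
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is None:
        # Not running under SLURM, so there is no job ID to derive a port from.
        return None
    # The same job ID always yields the same port, so every rank of a job agrees
    # on it. Different jobs usually get different ports, but IDs that differ by
    # a multiple of `span` will collide.
    return base + int(job_id) % span
```

Under these assumptions, job ID 123456 maps to port 18456 on every rank of that job. Because each rank computes the port from the environment, no extra coordination between ranks is needed.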

bridge.utils.slurm_utils._parse_slurm_nodelist(nodelist: str) str#

Parse a SLURM nodelist string and extract the first node.

Handles common SLURM nodelist formats:

  • Simple list: “node001,node002” -> “node001”

  • Range: “node[001-004]” -> “node001”

  • List in brackets: “node[001,003,005]” -> “node001”

  • Mixed entries: “nodeA,nodeB[1-3],nodeC” -> “nodeA”

Parameters:

nodelist – The SLURM nodelist string to parse.

Returns:

The hostname of the first node in the list.
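The sketch below shows how the first node could be extracted for the formats listed above. It is an illustrative reimplementation, not the module's code: the name `first_node` is hypothetical, and the sketch assumes at most one bracket group per entry.

```python
import re


def first_node(nodelist: str) -> str:
    """Hypothetical helper: return the first hostname in a SLURM-style nodelist."""
    # Find the first comma that is not inside brackets; it ends the first entry.
    depth = 0
    end = len(nodelist)
    for i, ch in enumerate(nodelist):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            end = i
            break
    entry = nodelist[:end].strip()

    # Expand a bracket group such as "node[001-004]" or "node[001,003,005]".
    match = re.match(r"^([^\[]*)\[([^\]]+)\](.*)$", entry)
    if match is None:
        return entry  # plain hostname, nothing to expand
    prefix, body, suffix = match.groups()
    # Take the first list item, then the start of its range if it is a range.
    first = body.split(",")[0].split("-")[0]
    return f"{prefix}{first}{suffix}"
```

Tracking bracket depth is what keeps a comma inside a group such as “node[001,003,005]” from being treated as an entry separator. The zero padding inside brackets is kept as written, so “node[001-004]” yields “node001”.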