bridge.utils.slurm_utils

Utilities for detecting and configuring SLURM cluster environments.

This module detects whether the current process is running under SLURM and derives the distributed training configuration (global rank, world size, local rank, master address, and master port) from SLURM environment variables.

Module Contents

Functions

is_slurm_job

Detect if running in a SLURM environment.

resolve_slurm_rank

Get the global rank from SLURM environment.

resolve_slurm_world_size

Get the world size from SLURM environment.

resolve_slurm_local_rank

Get the local rank from SLURM environment.

resolve_slurm_master_addr

Parse SLURM_NODELIST to get the master node address.

resolve_slurm_master_port

Get master port for SLURM job.

_parse_slurm_nodelist

Parse a SLURM nodelist string and extract the first node.

API

bridge.utils.slurm_utils.is_slurm_job() → bool

Detect if running in a SLURM environment.

Returns:

True if a SLURM job is detected, False otherwise.
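The page does not say which environment variable the check relies on. As a minimal sketch, assuming that the presence of `SLURM_JOB_ID` is what marks a SLURM job, detection could look like the following. The name `looks_like_slurm_job` and its `env` parameter are hypothetical and only illustrate the idea.

```python
import os
from typing import Mapping, Optional


def looks_like_slurm_job(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical stand-in for is_slurm_job(), not the module's code."""
    env = os.environ if env is None else env
    # Assumption: SLURM_JOB_ID is treated as the marker of a SLURM job.
    # The real function may inspect different or additional variables.
    return "SLURM_JOB_ID" in env


if __name__ == "__main__":
    print(looks_like_slurm_job({"SLURM_JOB_ID": "4242"}))
    print(looks_like_slurm_job({}))
```

Passing the environment as a mapping keeps the sketch easy to test without modifying `os.environ`.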

bridge.utils.slurm_utils.resolve_slurm_rank() → int | None

Get the global rank from SLURM environment.

Returns:

The global rank, or None if not in SLURM environment.

bridge.utils.slurm_utils.resolve_slurm_world_size() → int | None

Get the world size from SLURM environment.

Returns:

The world size, or None if not in SLURM environment.

bridge.utils.slurm_utils.resolve_slurm_local_rank() → int | None

Get the local rank from SLURM environment.

Returns:

The local rank, or None if not in SLURM environment.
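The three rank-related resolvers share one pattern: read a SLURM variable and return it as an integer, or None when it is absent. The sketch below assumes the conventional mapping of `SLURM_PROCID` to global rank, `SLURM_NTASKS` to world size, and `SLURM_LOCALID` to local rank. This page does not confirm which variables the module reads, and the `sketch_*` names are hypothetical.

```python
import os
from typing import Mapping, Optional


def _int_from_env(name: str, env: Mapping[str, str]) -> Optional[int]:
    """Return env[name] as an int, or None if the variable is not set."""
    value = env.get(name)
    return None if value is None else int(value)


def sketch_rank(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    # Assumption: SLURM_PROCID holds the global task rank.
    return _int_from_env("SLURM_PROCID", os.environ if env is None else env)


def sketch_world_size(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    # Assumption: SLURM_NTASKS holds the total number of tasks in the job.
    return _int_from_env("SLURM_NTASKS", os.environ if env is None else env)


def sketch_local_rank(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    # Assumption: SLURM_LOCALID holds the task index within its node.
    return _int_from_env("SLURM_LOCALID", os.environ if env is None else env)


if __name__ == "__main__":
    fake_env = {"SLURM_PROCID": "5", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}
    print(sketch_rank(fake_env), sketch_world_size(fake_env), sketch_local_rank(fake_env))
```

Returning None instead of raising lets a caller fall back to another launcher's variables when the process is not running under SLURM.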

bridge.utils.slurm_utils.resolve_slurm_master_addr() → str | None

Parse SLURM_NODELIST to get the master node address.

Handles common SLURM nodelist formats:

  • Simple list: "node001,node002" -> "node001"

  • Range: "node[001-004]" -> "node001"

  • List in brackets: "node[001,003,005]" -> "node001"

Returns:

The master node address, or None if not in SLURM environment.

bridge.utils.slurm_utils.resolve_slurm_master_port() → int | None

Get master port for SLURM job.

Uses a deterministic port based on SLURM_JOB_ID to avoid conflicts when multiple jobs run on the same nodes.

Returns:

The master port, or None if not in SLURM environment.
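The exact formula is not given on this page. One common way to obtain a deterministic, job-specific port is to offset a base port by the job ID modulo a fixed span, which the sketch below illustrates. The function name `sketch_master_port`, the base port 15000, and the span 5000 are all assumptions chosen for the example, not values taken from the module.

```python
import os
from typing import Mapping, Optional


def sketch_master_port(
    env: Optional[Mapping[str, str]] = None,
    base_port: int = 15000,  # assumed base, not from the module
    port_span: int = 5000,   # assumed width of the port window
) -> Optional[int]:
    """Hypothetical illustration of a job-ID-derived port."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        return None
    # Every task in the job sees the same SLURM_JOB_ID, so every task
    # computes the same port without any communication.
    return base_port + int(job_id) % port_span


if __name__ == "__main__":
    print(sketch_master_port({"SLURM_JOB_ID": "123456"}))  # 15000 + 3456 = 18456
```

Two jobs whose IDs differ by a multiple of the span would still map to the same port, so a scheme like this reduces collisions rather than ruling them out.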

bridge.utils.slurm_utils._parse_slurm_nodelist(nodelist: str) → str

Parse a SLURM nodelist string and extract the first node.

Handles common SLURM nodelist formats:

  • Simple list: "node001,node002" -> "node001"

  • Range: "node[001-004]" -> "node001"

  • List in brackets: "node[001,003,005]" -> "node001"

Parameters:

nodelist – The SLURM nodelist string to parse.

Returns:

The hostname of the first node in the list.
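To make the three documented formats concrete, here is a minimal sketch of first-node extraction using a regular expression. It is not the module's implementation. The name `first_node` is hypothetical, and the sketch only covers the formats listed above, not every nodelist form SLURM can emit (for example, multi-dimensional ranges).

```python
import re


def first_node(nodelist: str) -> str:
    """Hypothetical sketch: return the first hostname in a SLURM nodelist."""
    nodelist = nodelist.strip()
    # Bracketed form at the start: "<prefix>[<inner>]", where <inner> is a
    # range ("001-004") or a comma-separated list ("001,003,005").
    bracket = re.match(r"^([^\[,]+)\[([^\]]+)\]", nodelist)
    if bracket:
        prefix, inner = bracket.groups()
        # The first index ends at the first "," or "-" inside the brackets.
        first_index = re.split(r"[,-]", inner, maxsplit=1)[0]
        return prefix + first_index
    # Plain comma-separated list, or a single hostname.
    return nodelist.split(",", 1)[0]


if __name__ == "__main__":
    for example in ("node001,node002", "node[001-004]", "node[001,003,005]"):
        print(example, "->", first_node(example))
```

Where the `scontrol` command is available, `scontrol show hostnames` can expand a nodelist authoritatively. A pure-Python parser like this one avoids spawning a subprocess at the cost of handling fewer formats.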