bridge.utils.common_utils#
Module Contents#
Functions#
| Function | Description |
|---|---|
| get_rank_safe | Get the distributed rank safely, even if torch.distributed is not initialized. |
| get_world_size_safe | Get the distributed world size safely, even if torch.distributed is not initialized. |
| get_last_rank | Get the last rank in the distributed group. |
| get_local_rank_preinit | Get the local rank from the environment variable, intended for use before full init. |
| get_master_addr_safe | Get the master address for distributed initialization. |
| get_master_port_safe | Get the master port for distributed initialization. |
| print_rank_0 | Print a message only on global rank 0. |
| warn_rank_0 | Warn only on rank 0. |
| is_last_rank | Check if the current rank is the last rank in the default process group. |
| print_rank_last | Print a message only on the last rank of the default process group. |
| hook_hf_module_setattr_for_tp_grad_sync | Mark params for TP grad sync and hook setattr on a module and its children. |
| extract_expert_number_from_param | Extract the expert number from a parameter name. |
| resolve_path | Resolve a path to an absolute path. |
API#
- bridge.utils.common_utils.get_rank_safe() → int#
Get the distributed rank safely, even if torch.distributed is not initialized.
Fallback order:
1. torch.distributed.get_rank() (if initialized)
2. RANK environment variable (torchrun/torchelastic)
3. SLURM_PROCID environment variable (SLURM)
4. Default: 0 (with warning)
- Returns:
The current process rank.
- bridge.utils.common_utils.get_world_size_safe() → int#
Get the distributed world size safely, even if torch.distributed is not initialized.
Fallback order:
1. torch.distributed.get_world_size() (if initialized)
2. WORLD_SIZE environment variable (torchrun/torchelastic)
3. SLURM_NTASKS environment variable (SLURM)
4. Default: 1 (with warning)
- Returns:
The total number of processes in the distributed job.
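A minimal usage sketch of these two helpers, assuming the job is launched with torchrun or SLURM so the relevant environment variables are set; otherwise the documented defaults (rank 0, world size 1) apply.
```python
# Minimal sketch: query rank and world size before (or without) initializing
# torch.distributed, e.g. to gate logging or sanity-check the launch setup.
from bridge.utils.common_utils import get_rank_safe, get_world_size_safe

rank = get_rank_safe()
world_size = get_world_size_safe()

if rank == 0:
    print(f"launching with {world_size} process(es)")
```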
- bridge.utils.common_utils.get_last_rank() → int#
Get the last rank in the distributed group.
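A hypothetical sketch of using the last rank as a collective destination, assuming torch.distributed is already initialized; reducing onto the last rank is common in pipeline-parallel runs where that rank computes the loss.
```python
# Hypothetical sketch: reduce a scalar metric onto the last rank.
# Assumes torch.distributed has already been initialized.
import torch
import torch.distributed as dist

from bridge.utils.common_utils import get_last_rank

loss = torch.tensor(0.0)
dist.reduce(loss, dst=get_last_rank(), op=dist.ReduceOp.SUM)
if dist.get_rank() == get_last_rank():
    print(f"summed loss: {loss.item():.4f}")
```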
- bridge.utils.common_utils.get_local_rank_preinit() → int#
Get the local rank from the environment variable, intended for use before full init.
Fallback order:
1. LOCAL_RANK environment variable (torchrun/torchelastic)
2. SLURM_LOCALID environment variable (SLURM)
3. Default: 0 (with warning)
- Returns:
The local rank of the current process.
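A minimal sketch of the intended pre-init use: pinning each process to its local GPU before the process group is created.
```python
# Minimal sketch: select the local GPU before torch.distributed is initialized,
# using LOCAL_RANK (torchrun) or SLURM_LOCALID (SLURM), defaulting to 0.
import torch

from bridge.utils.common_utils import get_local_rank_preinit

local_rank = get_local_rank_preinit()
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
```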
- bridge.utils.common_utils.get_master_addr_safe() → str#
Get the master address for distributed initialization.
Fallback order:
1. MASTER_ADDR environment variable (torchrun/torchelastic)
2. SLURM_NODELIST parsed (SLURM)
3. Default: localhost (with warning)
- Returns:
The master node address.
- bridge.utils.common_utils.get_master_port_safe() → int#
Get the master port for distributed initialization.
Fallback order:
1. MASTER_PORT environment variable (torchrun/torchelastic)
2. SLURM job-based port (SLURM_JOB_ID derived)
3. Default: 29500 (with warning)
- Returns:
The master port.
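A hedged sketch of manual process-group initialization from the resolved rendezvous endpoint; under torchrun this is normally unnecessary, and the gloo backend below is only an illustrative choice.
```python
# Sketch: initialize torch.distributed by hand from the resolved master
# address/port, e.g. in a raw SLURM job where MASTER_ADDR/MASTER_PORT may be
# unset. The "gloo" backend is illustrative; use "nccl" for GPU training.
import torch.distributed as dist

from bridge.utils.common_utils import (
    get_master_addr_safe,
    get_master_port_safe,
    get_rank_safe,
    get_world_size_safe,
)

dist.init_process_group(
    backend="gloo",
    init_method=f"tcp://{get_master_addr_safe()}:{get_master_port_safe()}",
    rank=get_rank_safe(),
    world_size=get_world_size_safe(),
)
```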
- bridge.utils.common_utils.print_rank_0(message: str) → None#
Print a message only on global rank 0.
- Parameters:
message – The message string to print.
- bridge.utils.common_utils.warn_rank_0(message)#
Warn only on rank 0.
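A small usage sketch: emit a message or warning exactly once per job rather than once per rank.
```python
# Sketch: log and warn once per job instead of once per process.
from bridge.utils.common_utils import print_rank_0, warn_rank_0

print_rank_0("building dataloaders ...")
warn_rank_0("activation checkpointing disabled; memory use may be high")
```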
- bridge.utils.common_utils.is_last_rank() → bool#
Check if the current rank is the last rank in the default process group.
- Returns:
True if the current rank is the last one, False otherwise.
- bridge.utils.common_utils.print_rank_last(message: str) → None#
Print a message only on the last rank of the default process group.
- Parameters:
message – The message string to print.
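A minimal sketch combining the two helpers, assuming a pipeline-parallel run where the final loss is available on the last rank; the metric value is a placeholder.
```python
# Sketch: report a metric from the last rank only.
from bridge.utils.common_utils import is_last_rank, print_rank_last

val_loss = 1.234  # placeholder value
print_rank_last(f"validation loss: {val_loss:.3f}")
if is_last_rank():
    pass  # e.g. write metrics to disk exactly once
```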
- bridge.utils.common_utils.hook_hf_module_setattr_for_tp_grad_sync(module: torch.nn.Module) → torch.nn.Module#
Mark params for TP grad sync and hook setattr on a module and its children.
This ensures that all existing parameters under the provided module have the attribute average_gradients_across_tp_domain=True and that any future submodules assigned onto this module (or any of its current children) will also have their parameters marked automatically.
- Parameters:
module – The root module (typically a Hugging Face module instance).
- Returns:
The same module instance for convenience.
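A hedged sketch of hooking a Hugging Face model; the model name and the extra head added afterwards are illustrative only.
```python
# Hedged sketch: mark a Hugging Face model's parameters with
# average_gradients_across_tp_domain=True, including parameters of submodules
# assigned after the hook is installed.
import torch
from transformers import AutoModelForCausalLM

from bridge.utils.common_utils import hook_hf_module_setattr_for_tp_grad_sync

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative model
model = hook_hf_module_setattr_for_tp_grad_sync(model)

# A submodule assigned after hooking is marked automatically as well.
model.extra_head = torch.nn.Linear(model.config.hidden_size, 2)
assert all(
    getattr(p, "average_gradients_across_tp_domain", False)
    for p in model.parameters()
)
```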
- bridge.utils.common_utils.extract_expert_number_from_param(param_name: str) → int#
Extract the expert number from a parameter name.
- Parameters:
param_name – The parameter name to extract the expert number from.
- Returns:
The expert number.
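A hedged sketch; the MoE parameter-naming convention shown is an assumption, so substitute a name from your own model's state dict.
```python
# Hedged sketch: extract the expert index from a hypothetical MoE param name.
from bridge.utils.common_utils import extract_expert_number_from_param

name = "model.layers.3.mlp.experts.7.up_proj.weight"  # hypothetical name
print(extract_expert_number_from_param(name))  # expected: 7 for this layout
```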
- bridge.utils.common_utils.resolve_path(path: str) → pathlib.Path#
Resolve a path to an absolute path.
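A small sketch of the typical use: normalizing a user-supplied, possibly relative path before handing it to checkpoint or dataset loaders.
```python
# Sketch: turn a relative path into an absolute pathlib.Path.
from bridge.utils.common_utils import resolve_path

ckpt_dir = resolve_path("./checkpoints/latest")
print(ckpt_dir, ckpt_dir.is_absolute())  # absolute path, True
```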