bridge.models.hf_pretrained.safe_config_loader

Thread-safe configuration loading utilities.

This module provides utilities for safely loading HuggingFace model configurations in multi-threaded environments, preventing race conditions that can occur when multiple threads try to download and cache the same model simultaneously.

Module Contents

Functions

safe_load_config_with_retry

Thread-safe and process-safe configuration loading with retry logic.

API

bridge.models.hf_pretrained.safe_config_loader.safe_load_config_with_retry(
path: Union[str, pathlib.Path],
trust_remote_code: bool = False,
max_retries: int = 3,
base_delay: float = 1.0,
**kwargs,
) → transformers.configuration_utils.PretrainedConfig

Thread-safe and process-safe configuration loading with retry logic.

This function prevents race conditions when multiple threads or processes try to download and cache the same model configuration simultaneously. It uses file locking (if the filelock package is available) to coordinate access across processes.
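The locking pattern described above can be sketched as follows. This is a minimal illustration rather than the module's actual implementation: the helper names `config_lock` and `load_once`, the lock file naming scheme, and the fallback to a no-op context when filelock is not installed are all assumptions made for the example.

```python
import contextlib
import hashlib
from pathlib import Path

try:
    from filelock import FileLock  # optional third-party dependency
except ImportError:
    FileLock = None


def config_lock(model_id: str, lock_dir: Path):
    """Return a context manager that serializes loads of one model ID.

    Hypothetical helper: the lock file name is derived from a hash of the
    model ID so that IDs containing "/" map to a valid file name.
    """
    if FileLock is None:
        # Without filelock there is no cross-process coordination.
        return contextlib.nullcontext()
    lock_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(model_id.encode("utf-8")).hexdigest()[:16]
    return FileLock(str(lock_dir / f"config_{digest}.lock"))


def load_once(model_id: str, lock_dir: Path, loader):
    """Run loader(model_id) while holding the per-model lock."""
    with config_lock(model_id, lock_dir):
        return loader(model_id)
```

In this sketch, `loader` stands in for a call such as AutoConfig.from_pretrained. Only one process at a time runs the loader for a given model ID, so later callers typically find the cache already populated.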

Parameters:
  • path – HuggingFace model ID or path to model directory

  • trust_remote_code – Whether to trust remote code when loading config

  • max_retries – Maximum number of retry attempts (default: 3)

  • base_delay – Base delay in seconds for exponential backoff (default: 1.0)

  • **kwargs – Additional arguments passed to AutoConfig.from_pretrained
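How max_retries and base_delay interact can be illustrated with a small retry loop. This is a sketch under assumptions: the delay schedule `base_delay * 2 ** attempt`, the absence of jitter, and the helper name `retry_with_backoff` are not confirmed by this page. The injectable `sleep` argument exists only to make the example easy to test.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    load: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call load() up to max_retries times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return load()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < max_retries - 1:
                # Assumed schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * (2 ** attempt))
    raise ValueError(
        f"Failed to load config after {max_retries} attempts"
    ) from last_error
```

With the defaults shown here, a loader that keeps failing is attempted three times, with waits of 1.0 and 2.0 seconds between attempts, before ValueError is raised.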

Returns:

The loaded model configuration

Return type:

PretrainedConfig

Raises:

ValueError – If config loading fails after all retries

Environment Variables:

MEGATRON_CONFIG_LOCK_DIR – Override the directory where lock files are created. Default: ~/.cache/huggingface/. Useful for multi-node setups where a shared lock directory is needed.
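The override behavior can be pictured with a short helper. This is a hypothetical sketch: `resolve_lock_dir` is not part of the documented API, and treating an empty value as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_lock_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the lock directory: the environment override if set, else the default."""
    env = os.environ if env is None else env
    override = env.get("MEGATRON_CONFIG_LOCK_DIR")
    if override:  # assumption: an empty string counts as "not set"
        return Path(override)
    return Path.home() / ".cache" / "huggingface"
```

On a multi-node cluster, the override would point every node at a directory on a shared filesystem, so that all processes contend for the same lock file.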

Example:

config = safe_load_config_with_retry("meta-llama/Meta-Llama-3-8B")
print(config.model_type)

With custom retry settings

config = safe_load_config_with_retry(
    "gpt2",
    max_retries=5,
    base_delay=0.5,
    trust_remote_code=True,
)

Multi-node setup with shared lock directory

import os
os.environ["MEGATRON_CONFIG_LOCK_DIR"] = "/shared/locks"
config = safe_load_config_with_retry("meta-llama/Meta-Llama-3-8B")