bridge.training.utils.checkpoint_utils#

Module Contents#

Functions#

file_exists

Check if a file exists.

ensure_directory_exists

Ensure that the directory for a given filename exists.

get_checkpoint_name

Determine the directory name for a specific checkpoint.

get_checkpoint_train_state_filename

Get the filename for the train state tracker file.

get_checkpoint_run_config_filename

Get the filename for the run configuration file within a checkpoint directory.

get_checkpoint_tracker_filename

Get the filename of the tracker file, which records the latest checkpoint written during training so that a run can restart from it.

checkpoint_exists

Check if a checkpoint directory exists.

read_run_config

Read the run configuration from a YAML file (rank 0 only).

read_train_state

Read the train state metadata from a YAML file (rank 0 only).

Data#

API#

bridge.training.utils.checkpoint_utils.TRAIN_STATE_FILE#

‘train_state.pt’

bridge.training.utils.checkpoint_utils.TRACKER_PREFIX#

‘latest’

bridge.training.utils.checkpoint_utils.CONFIG_FILE#

‘run_config.yaml’

bridge.training.utils.checkpoint_utils.logger#

‘getLogger(…)’

bridge.training.utils.checkpoint_utils.file_exists(path: str) → bool#

Check if a file exists.

Parameters:

path – The path to the file. Can be a local path or an MSC URL.

Returns:

True if the file exists, False otherwise.

bridge.training.utils.checkpoint_utils.ensure_directory_exists(
filename: str,
check_parent: bool = True,
) → None#

Ensure that the directory for a given filename exists.

Parameters:
  • filename – The path whose directory should be checked/created.

  • check_parent – If True (default), checks the parent directory of the filename. If False, treats the filename itself as the directory path.
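The effect of check_parent is easiest to see on a local path. The following is a minimal sketch of the described behavior using only the standard library; ensure_directory_exists_sketch is a hypothetical stand-in, not the library implementation, and it ignores non-local paths such as MSC URLs.

```python
import os
import tempfile


def ensure_directory_exists_sketch(filename: str, check_parent: bool = True) -> None:
    """Hypothetical local-filesystem stand-in for the documented behavior."""
    # With check_parent=True the directory part of the path is the target;
    # otherwise the path itself is treated as the directory.
    dirname = os.path.dirname(filename) if check_parent else filename
    if dirname:  # an empty string means the current directory, nothing to create
        os.makedirs(dirname, exist_ok=True)


base = tempfile.mkdtemp()

# Creates <base>/ckpt, but not the file itself.
ensure_directory_exists_sketch(os.path.join(base, "ckpt", "train_state.pt"))

# Treats the whole path as a directory and creates <base>/logs.
ensure_directory_exists_sketch(os.path.join(base, "logs"), check_parent=False)
```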

bridge.training.utils.checkpoint_utils.get_checkpoint_name(
checkpoints_path: str,
iteration: int,
release: bool = False,
) → str#

Determine the directory name for a specific checkpoint.

Constructs the path based on iteration number or release flag.

Parameters:
  • checkpoints_path – Base directory where checkpoints are stored.

  • iteration – The training iteration number.

  • release – If True, uses ‘release’ as the directory name instead of iteration.

Returns:

The full path to the checkpoint directory.
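As an illustration of how the iteration and release arguments select the directory, here is a small sketch. The iter_ prefix and the 7-digit zero padding are assumptions made for the example; the text above only states that 'release' replaces the iteration-based name.

```python
import os


def get_checkpoint_name_sketch(
    checkpoints_path: str, iteration: int, release: bool = False
) -> str:
    """Hypothetical stand-in showing the iteration/release choice."""
    # Assumption: iteration directories are named "iter_" plus a zero-padded
    # 7-digit iteration number. Only the 'release' name comes from the text.
    directory = "release" if release else f"iter_{iteration:07d}"
    return os.path.join(checkpoints_path, directory)


iteration_dir = get_checkpoint_name_sketch("/ckpts", 100)
release_dir = get_checkpoint_name_sketch("/ckpts", 100, release=True)
```

When release is True the iteration argument plays no part in the resulting path.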

bridge.training.utils.checkpoint_utils.get_checkpoint_train_state_filename(
checkpoints_path: str,
prefix: Optional[str] = None,
) → str#

Get the filename for the train state tracker file.

This file typically stores metadata about the latest checkpoint, like the iteration number.

Parameters:
  • checkpoints_path – Base directory where checkpoints are stored.

  • prefix – Optional prefix (e.g., ‘latest’) to prepend to the filename.

Returns:

The full path to the train state tracker file.
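A sketch of how the optional prefix might combine with TRAIN_STATE_FILE follows. Joining the prefix and the file name with an underscore is an assumption of this example, and train_state_filename_sketch is a hypothetical name.

```python
import os
from typing import Optional

TRAIN_STATE_FILE = "train_state.pt"  # value listed for this module's constant


def train_state_filename_sketch(
    checkpoints_path: str, prefix: Optional[str] = None
) -> str:
    """Hypothetical stand-in for building the train state file path."""
    # Assumption: the prefix is joined to the file name with an underscore.
    name = f"{prefix}_{TRAIN_STATE_FILE}" if prefix else TRAIN_STATE_FILE
    return os.path.join(checkpoints_path, name)


plain = train_state_filename_sketch("/ckpts")
latest = train_state_filename_sketch("/ckpts", prefix="latest")
```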

bridge.training.utils.checkpoint_utils.get_checkpoint_run_config_filename(checkpoints_path: str) → str#

Get the filename for the run configuration file within a checkpoint directory.

Parameters:

checkpoints_path – Base directory where checkpoints are stored.

Returns:

The full path to the run configuration file (e.g., run_config.yaml).

bridge.training.utils.checkpoint_utils.get_checkpoint_tracker_filename(checkpoints_path: str) → str#

Get the filename of the tracker file, which records the latest checkpoint written during training so that a run can restart from it.

Supports checkpoints produced by Megatron-LM.

Parameters:

checkpoints_path – Base directory where checkpoints are stored.

Returns:

The full path to the checkpoint tracker file (e.g., latest_checkpointed_iteration.txt).
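To show how a tracker file of this kind can drive a restart, the sketch below reads one back. It assumes the file holds either an integer iteration or the word release, which is how Megatron-LM style trackers are commonly written; read_tracker_sketch is a hypothetical helper and not part of this module.

```python
import os
import tempfile
from typing import Tuple

# Example file name taken from the description above.
TRACKER_FILENAME = "latest_checkpointed_iteration.txt"


def read_tracker_sketch(checkpoints_path: str) -> Tuple[int, bool]:
    """Return (iteration, release) parsed from the tracker file."""
    with open(os.path.join(checkpoints_path, TRACKER_FILENAME)) as f:
        text = f.read().strip()
    # Assumption: the content is either "release" or a decimal iteration.
    if text == "release":
        return 0, True
    return int(text), False


# Write a tracker file into a temporary directory and read it back.
ckpt_dir = tempfile.mkdtemp()
with open(os.path.join(ckpt_dir, TRACKER_FILENAME), "w") as f:
    f.write("1500\n")

latest_checkpoint = read_tracker_sketch(ckpt_dir)
```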

bridge.training.utils.checkpoint_utils.checkpoint_exists(checkpoints_path: Optional[str]) → bool#

Check if a checkpoint directory exists.

Parameters:

checkpoints_path – Path to the potential checkpoint directory.

Returns:

True if the path exists, False otherwise.

bridge.training.utils.checkpoint_utils.read_run_config(run_config_filename: str) → dict[str, Any]#

Read the run configuration from a YAML file (rank 0 only).

Reads the file on rank 0 and broadcasts the result to other ranks.

Parameters:

run_config_filename – Path to the run config YAML file.

Returns:

A dictionary containing the run configuration.

Raises:

RuntimeError – If reading the config file fails on rank 0.
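The read-on-rank-0-then-broadcast pattern described above can be sketched without any distributed backend. In this example the broadcast step is an injected callable (in a real job it could be something like torch.distributed.broadcast_object_list with src=0), json.loads stands in for a YAML parser, and the config keys are invented for the demo. Raising RuntimeError on every rank after the broadcast is an assumption about how the failure is surfaced.

```python
import json
import os
import tempfile
from typing import Any, Callable, List


def read_config_on_rank0(
    path: str,
    rank: int,
    broadcast: Callable[[List[Any]], None],
    parse: Callable[[str], Any] = json.loads,  # stand-in for a YAML loader
) -> Any:
    """Hypothetical sketch: rank 0 reads the file, every rank gets the result."""
    box: List[Any] = [None]
    if rank == 0:
        try:
            with open(path) as f:
                box[0] = ("ok", parse(f.read()))
        except Exception as exc:
            # Ship the failure to the other ranks instead of raising here,
            # so no rank is left waiting in the broadcast.
            box[0] = ("error", f"{type(exc).__name__}: {exc}")
    broadcast(box)  # fills box[0] on the non-zero ranks
    status, value = box[0]
    if status == "error":
        raise RuntimeError(f"Failed to read {path} on rank 0: {value}")
    return value


def no_op_broadcast(box: List[Any]) -> None:
    """Single-process stand-in: there are no other ranks to send to."""


tmp = tempfile.mkdtemp()
config_path = os.path.join(tmp, "run_config.yaml")
with open(config_path, "w") as f:
    f.write('{"train": {"micro_batch_size": 2}}')  # hypothetical keys

cfg = read_config_on_rank0(config_path, rank=0, broadcast=no_op_broadcast)
```

Packaging the error into the broadcast payload keeps all ranks in step: each one either returns the same dictionary or raises the same exception.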

bridge.training.utils.checkpoint_utils.read_train_state(
train_state_filename: str,
) → megatron.bridge.training.state.TrainState#

Read the train state metadata from a YAML file (rank 0 only).

Reads the file on rank 0 and broadcasts the result to other ranks.

Parameters:

train_state_filename – Path to the train state YAML file.

Returns:

An initialized TrainState object.