bridge.training.utils.mlflow_utils

Module Contents

Functions

on_save_checkpoint_success

Callback executed after a checkpoint is successfully saved.

on_load_checkpoint_success

Callback executed after a checkpoint is successfully loaded.

_sanitize_mlflow_metrics

Sanitize all metric names in a dictionary for MLflow logging.

API

bridge.training.utils.mlflow_utils.on_save_checkpoint_success(
    checkpoint_path: str,
    save_dir: str,
    iteration: int,
    mlflow_logger: Optional[Any],
) → None

Callback executed after a checkpoint is successfully saved.

If an MLflow logger is provided, logs the checkpoint directory as an MLflow artifact under a structured artifact path that includes the iteration number.

Parameters:
  • checkpoint_path – The path to the specific checkpoint file/directory saved.

  • save_dir – The base directory where checkpoints are being saved.

  • iteration – The training iteration at which the checkpoint was saved.

  • mlflow_logger – The MLflow module (e.g., mlflow) with an active run. If None, this function is a no-op.
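
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the function name `log_checkpoint_sketch`, the `checkpoints/iter_…` artifact layout, and the `RecordingLogger` test double are all hypothetical, and it assumes that the directory being logged is `checkpoint_path`. The only real MLflow call relied on is `mlflow.log_artifacts(local_dir, artifact_path=...)`.

```python
from typing import Any, Optional


def log_checkpoint_sketch(
    checkpoint_path: str,
    save_dir: str,
    iteration: int,
    mlflow_logger: Optional[Any],
) -> None:
    """Illustrative stand-in for the save callback (not the library's code)."""
    if mlflow_logger is None:
        return  # no logger supplied: behave as a no-op
    # Hypothetical artifact layout; the real path format is not documented here.
    artifact_path = f"checkpoints/iter_{iteration:07d}"
    # mlflow.log_artifacts uploads every file under a local directory.
    # save_dir is kept only for signature parity and is unused in this sketch.
    mlflow_logger.log_artifacts(checkpoint_path, artifact_path=artifact_path)


class RecordingLogger:
    """Hypothetical test double that records calls instead of contacting MLflow."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, Optional[str]]] = []

    def log_artifacts(self, local_dir: str, artifact_path: Optional[str] = None) -> None:
        self.calls.append((local_dir, artifact_path))


recorder = RecordingLogger()
log_checkpoint_sketch("/tmp/ckpts/iter_0000100", "/tmp/ckpts", 100, recorder)
log_checkpoint_sketch("/tmp/ckpts/iter_0000100", "/tmp/ckpts", 100, None)  # no-op
print(recorder.calls)
```

In real use, the `mlflow` module itself would presumably be passed in place of `RecordingLogger`, inside an active run.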

bridge.training.utils.mlflow_utils.on_load_checkpoint_success(
    checkpoint_path: str,
    load_dir: str,
    mlflow_logger: Optional[Any],
) → None

Callback executed after a checkpoint is successfully loaded.

For MLflow, this emits a simple metric and a tag that record which checkpoint was loaded during the run. It does not perform artifact lookups.

Parameters:
  • checkpoint_path – The path to the specific checkpoint file/directory loaded.

  • load_dir – The base directory from which the checkpoint was loaded.

  • mlflow_logger – The MLflow module (e.g., mlflow) with an active run. If None, this function is a no-op.
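
A rough sketch of what "emits a simple metric and tag" could look like is shown below. The tag key `loaded_checkpoint`, the metric name `checkpoint_loaded`, the function name, and the `RecordingLogger` stub are assumptions made for illustration; the documentation above does not state the actual names. The MLflow calls used, `mlflow.set_tag(key, value)` and `mlflow.log_metric(key, value)`, are part of the public MLflow API.

```python
from typing import Any, Optional


def record_loaded_checkpoint_sketch(
    checkpoint_path: str,
    load_dir: str,
    mlflow_logger: Optional[Any],
) -> None:
    """Illustrative stand-in for the load callback (not the library's code)."""
    if mlflow_logger is None:
        return  # no logger supplied: behave as a no-op
    # Hypothetical tag key: records which checkpoint the run resumed from.
    mlflow_logger.set_tag("loaded_checkpoint", checkpoint_path)
    # Hypothetical metric name: a simple marker that a load happened.
    mlflow_logger.log_metric("checkpoint_loaded", 1.0)
    # No artifact lookup is performed; load_dir is unused in this sketch.


class RecordingLogger:
    """Hypothetical test double that records calls instead of contacting MLflow."""

    def __init__(self) -> None:
        self.tags: dict[str, str] = {}
        self.metrics: dict[str, float] = {}

    def set_tag(self, key: str, value: str) -> None:
        self.tags[key] = value

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


load_recorder = RecordingLogger()
record_loaded_checkpoint_sketch("/tmp/ckpts/iter_0000100", "/tmp/ckpts", load_recorder)
print(load_recorder.tags, load_recorder.metrics)
```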

bridge.training.utils.mlflow_utils._sanitize_mlflow_metrics(
    metrics: dict[str, Any],
) → dict[str, Any]

Sanitize every metric name (key) in a dictionary so that it is valid for MLflow logging, and return the resulting dictionary.
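
The exact sanitization rule is not documented here, so the following is only a sketch of one plausible approach. It assumes the goal is to satisfy MLflow's long-standing metric-name constraint (alphanumerics, underscores, dashes, periods, spaces, and slashes) by replacing every other character with an underscore. The function name `sanitize_metrics_sketch` and the replacement character are hypothetical.

```python
import re
from typing import Any

# Characters MLflow has long accepted in metric names:
# alphanumerics, "_", "-", ".", " ", and "/". Everything else is matched here.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_\-. /]")


def sanitize_metrics_sketch(metrics: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict whose keys have disallowed characters replaced by "_".

    Values are passed through unchanged. Two different names can collapse to
    the same sanitized key, in which case the later entry wins.
    """
    return {_DISALLOWED.sub("_", name): value for name, value in metrics.items()}


raw = {"lm loss@eval": 1.5, "grad-norm": 0.2, "lr[0]": 0.001}
print(sanitize_metrics_sketch(raw))
```

Returning a new dictionary rather than mutating the input keeps the caller's metrics usable by other loggers that accept the original names.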