bridge.training.utils.mlflow_utils

Module Contents

Functions

on_save_checkpoint_success

Callback executed after a checkpoint is successfully saved.

on_load_checkpoint_success

Callback executed after a checkpoint is successfully loaded.

_sanitize_mlflow_metrics

Sanitize all metric names in a dictionary for MLFlow logging.

end_active_mlflow_run

End the active MLFlow run with the given status.

install_mlflow_failure_hook

Mark the active MLFlow run as FAILED on uncaught Python exceptions.

API

bridge.training.utils.mlflow_utils.on_save_checkpoint_success(
checkpoint_path: str,
save_dir: str,
iteration: int,
mlflow_logger: Optional[Any],
) -> None

Callback executed after a checkpoint is successfully saved.

If an MLFlow logger is provided, logs the checkpoint directory as an MLFlow artifact under a structured artifact path that includes the iteration number.

Parameters:
  • checkpoint_path – The path to the specific checkpoint file/directory saved.

  • save_dir – The base directory where checkpoints are being saved.

  • iteration – The training iteration at which the checkpoint was saved.

  • mlflow_logger – The MLFlow module (e.g., mlflow) with an active run. If None, this function is a no-op.
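The behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function name `log_checkpoint_artifact_sketch`, the `RecordingLogger` stand-in, and the artifact path layout (`<save_dir basename>/iter_<iteration>`) are all assumptions made for the example. The only MLFlow call relied on is `log_artifacts(local_dir, artifact_path=...)`.

```python
import os
from typing import Any, Optional


def log_checkpoint_artifact_sketch(
    checkpoint_path: str,
    save_dir: str,
    iteration: int,
    mlflow_logger: Optional[Any],
) -> None:
    """Hypothetical stand-in for on_save_checkpoint_success."""
    if mlflow_logger is None:
        return  # no logger supplied: behave as a no-op

    # Assumed layout: "<basename of save_dir>/iter_<zero-padded iteration>".
    base = os.path.basename(os.path.normpath(save_dir))
    artifact_path = f"{base}/iter_{iteration:07d}"
    mlflow_logger.log_artifacts(checkpoint_path, artifact_path=artifact_path)


class RecordingLogger:
    """Hypothetical stand-in for the mlflow module; records calls instead of uploading."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, Optional[str]]] = []

    def log_artifacts(self, local_dir: str, artifact_path: Optional[str] = None) -> None:
        self.calls.append((local_dir, artifact_path))


recorder = RecordingLogger()
log_checkpoint_artifact_sketch("/ckpts/run1/iter_0000100", "/ckpts/run1", 100, recorder)
log_checkpoint_artifact_sketch("/ckpts/run1/iter_0000100", "/ckpts/run1", 100, None)  # no-op
```

In real use, the `mlflow` module itself (with an active run) would be passed in place of `RecordingLogger`.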

bridge.training.utils.mlflow_utils.on_load_checkpoint_success(
checkpoint_path: str,
load_dir: str,
mlflow_logger: Optional[Any],
) -> None

Callback executed after a checkpoint is successfully loaded.

For MLFlow, this emits a simple metric and tag to document which checkpoint was loaded during the run. It does not perform artifact lookups.

Parameters:
  • checkpoint_path – The path to the specific checkpoint file/directory loaded.

  • load_dir – The base directory from which the checkpoint was loaded.

  • mlflow_logger – The MLFlow module (e.g., mlflow) with an active run. If None, this function is a no-op.
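A rough sketch of what "emits a simple metric and tag" could look like is shown below. The tag key `checkpoint_loaded_from`, the metric name `checkpoint_loaded`, the function name, and the `RecordingLogger` stand-in are hypothetical; the source does not state which names the real callback uses. `set_tag(key, value)` and `log_metric(key, value)` are the MLFlow calls assumed here.

```python
from typing import Any, Optional


def record_checkpoint_load_sketch(
    checkpoint_path: str,
    load_dir: str,
    mlflow_logger: Optional[Any],
) -> None:
    """Hypothetical stand-in for on_load_checkpoint_success."""
    if mlflow_logger is None:
        return  # no logger supplied: behave as a no-op

    del load_dir  # kept only to mirror the documented signature; unused in this sketch

    # Hypothetical tag key and metric name, chosen only for illustration.
    mlflow_logger.set_tag("checkpoint_loaded_from", checkpoint_path)
    mlflow_logger.log_metric("checkpoint_loaded", 1.0)


class RecordingLogger:
    """Hypothetical stand-in for the mlflow module; records what would be logged."""

    def __init__(self) -> None:
        self.tags: dict[str, str] = {}
        self.metrics: dict[str, float] = {}

    def set_tag(self, key: str, value: str) -> None:
        self.tags[key] = value

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


recorder = RecordingLogger()
record_checkpoint_load_sketch("/ckpts/run1/iter_0000100", "/ckpts/run1", recorder)
```

Note that nothing is read back from the tracking server, which matches the statement that the callback performs no artifact lookups.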

bridge.training.utils.mlflow_utils._sanitize_mlflow_metrics(
metrics: dict[str, Any],
) -> dict[str, Any]

Sanitize all metric names in a dictionary for MLFlow logging.
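The source does not describe the sanitization rules, so the following is only a sketch under an assumption: MLFlow metric names may contain alphanumerics, underscores, dashes, periods, spaces, and slashes, and any other character is replaced with an underscore. The function name and the choice of replacement character are hypothetical.

```python
import re
from typing import Any

# Assumption: every character outside MLFlow's documented allowed set is replaced by "_".
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_\-. /]")


def sanitize_metric_names_sketch(metrics: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict whose keys are rewritten; values are passed through unchanged."""
    return {_INVALID_CHARS.sub("_", name): value for name, value in metrics.items()}


cleaned = sanitize_metric_names_sketch({"loss@step": 0.5, "lr/base": 0.001})
```

One consequence of this approach is that two distinct names can collapse to the same key after sanitization, in which case the later entry wins.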

bridge.training.utils.mlflow_utils.end_active_mlflow_run(status: str) -> None

End the active MLFlow run with the given status.

Used by the SIGTERM exit path (status="KILLED") and the failure excepthook (status="FAILED") to override MLFlow’s default FINISHED status so the UI distinguishes interrupted and crashed runs from successful ones. Clean exits rely on MLFlow’s own atexit handler, which already ends the run as FINISHED.

No-op if MLFlow is not installed or no run is active. Exceptions raised inside mlflow.end_run are caught and logged.

Parameters:
  • status – An MLFlow RunStatus string, typically "KILLED" or "FAILED".
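The no-op and error-handling behavior described above might be structured roughly as follows. This is a sketch, not the module's code: the function name `end_active_run_sketch` and the log message are assumptions. It relies only on `mlflow.active_run()` and `mlflow.end_run(status=...)`.

```python
import logging

logger = logging.getLogger(__name__)


def end_active_run_sketch(status: str) -> None:
    """Hypothetical stand-in for end_active_mlflow_run."""
    try:
        import mlflow  # imported lazily so a missing MLFlow install is a no-op
    except ImportError:
        return

    try:
        if mlflow.active_run() is None:
            return  # nothing to end
        mlflow.end_run(status=status)
    except Exception:
        # Shutdown paths should not fail because of a tracking error; log and move on.
        logger.exception("Failed to end MLFlow run with status %s", status)


end_active_run_sketch("KILLED")  # safe to call even when no run is active
```

Swallowing exceptions here is deliberate in the sketch: the function is meant to run during SIGTERM handling or from an excepthook, where raising again would obscure the original failure.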

bridge.training.utils.mlflow_utils.install_mlflow_failure_hook() -> None

Mark the active MLFlow run as FAILED on uncaught Python exceptions.

MLFlow’s own atexit handler ends the run with the default status FINISHED on process exit, which makes a crashed run indistinguishable from a clean one in the UI. This function chains a sys.excepthook, which runs before atexit handlers, so that the run is explicitly marked FAILED first; the previous excepthook is preserved so default traceback printing still happens.

Idempotent: a second call after a previous install is a no-op.
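The chaining and idempotency described above can be illustrated with a small sketch. Unlike the real function, which takes no arguments, this hypothetical version accepts an `on_failure` callback so it can be exercised without MLFlow; in practice the callback would do the equivalent of `end_active_mlflow_run("FAILED")`. The module-level `_installed` flag is likewise an assumption about how idempotency could be tracked.

```python
import sys
from typing import Callable

_installed = False  # assumed guard that makes repeated installs a no-op


def install_failure_hook_sketch(on_failure: Callable[[], None]) -> None:
    """Hypothetical stand-in for install_mlflow_failure_hook."""
    global _installed
    if _installed:
        return

    previous_hook = sys.excepthook  # preserved so the traceback is still printed

    def _hook(exc_type, exc_value, exc_tb):
        try:
            on_failure()  # mark the run as failed before atexit handlers run
        finally:
            previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = _hook
    _installed = True
```

Because `sys.excepthook` is invoked for an uncaught exception before the interpreter runs its atexit handlers, the failure status is set before MLFlow's own handler would otherwise end the run as FINISHED.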