bridge.training.utils.mlflow_utils
Module Contents
Functions
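| on_save_checkpoint_success | Callback executed after a checkpoint is successfully saved. |
| on_load_checkpoint_success | Callback executed after a checkpoint is successfully loaded. |
| _sanitize_mlflow_metrics | Sanitize all metric names in a dictionary for MLFlow logging. |
| end_active_mlflow_run | End the active MLFlow run with the given status. |
| install_mlflow_failure_hook | Mark the active MLFlow run as FAILED on uncaught Python exceptions. |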
API
- bridge.training.utils.mlflow_utils.on_save_checkpoint_success(checkpoint_path: str, save_dir: str, iteration: int, mlflow_logger: Optional[Any])
Callback executed after a checkpoint is successfully saved.
If an MLFlow logger is provided, logs the checkpoint directory as an MLFlow artifact under a structured artifact path that includes the iteration number.
- Parameters:
checkpoint_path – The path to the specific checkpoint file/directory saved.
save_dir – The base directory where checkpoints are being saved.
iteration – The training iteration at which the checkpoint was saved.
mlflow_logger – The MLFlow module (e.g., mlflow) with an active run. If None, this function is a no-op.
- bridge.training.utils.mlflow_utils.on_load_checkpoint_success(checkpoint_path: str, load_dir: str, mlflow_logger: Optional[Any])
Callback executed after a checkpoint is successfully loaded.
For MLFlow, this emits a simple metric and tag to document which checkpoint was loaded during the run. It does not perform artifact lookups.
- Parameters:
checkpoint_path – The path to the specific checkpoint file/directory loaded.
load_dir – The base directory from which the checkpoint was loaded.
mlflow_logger – The MLFlow module (e.g., mlflow) with an active run. If None, this function is a no-op.
- bridge.training.utils.mlflow_utils._sanitize_mlflow_metrics(metrics: dict[str, Any])
Sanitize all metric names in a dictionary for MLFlow logging.
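The exact sanitization rule is not stated here. As a rough sketch, assuming the goal is to satisfy MLFlow's documented restriction that metric names contain only alphanumerics, underscores, dashes, periods, spaces, and slashes, a helper could replace every other character with an underscore. The name `sanitize_metric_names` and the choice of `_` as the replacement are assumptions for illustration.

```python
import re
from typing import Any

# Characters outside this set are assumed to be rejected by MLFlow.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_\-. /]")


def sanitize_metric_names(metrics: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical sketch: return a copy of `metrics` with MLFlow-safe keys."""
    # If two names collapse to the same sanitized key, the later entry wins.
    return {_DISALLOWED.sub("_", name): value for name, value in metrics.items()}
```

For example, a key such as `loss@k=5` would become `loss_k_5`, while `train/lr` would pass through unchanged.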
- bridge.training.utils.mlflow_utils.end_active_mlflow_run(status: str) -> None
End the active MLFlow run with the given status.
Used by the SIGTERM exit path (status="KILLED") and the failure excepthook (status="FAILED") to override MLFlow’s default FINISHED status so the UI distinguishes interrupted and crashed runs from successful ones. Clean exits rely on MLFlow’s own atexit handler, which already ends the run as FINISHED.
No-op if MLFlow is not installed or no run is active. Exceptions raised inside mlflow.end_run are caught and logged.
- Parameters:
status – An MLFlow RunStatus string, typically "KILLED" or "FAILED".
- bridge.training.utils.mlflow_utils.install_mlflow_failure_hook() -> None
Mark the active MLFlow run as FAILED on uncaught Python exceptions.
MLFlow’s own atexit handler ends the run with the default status FINISHED on process exit, making a crashed run indistinguishable from a clean one in the UI. We chain a sys.excepthook that fires before atexit and explicitly sets FAILED first; the previous excepthook is preserved so default traceback printing still happens.
Idempotent: a second call after a previous install is a no-op.
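The chaining pattern described above can be illustrated with a small, generic sketch. It is not the module's implementation: the name `install_failure_hook`, the `on_failure` callback parameter, and the module-level `_installed` flag are assumptions. In the real hook, the callback would presumably end the run with status "FAILED".

```python
import sys
from typing import Callable

_installed = False  # module-level guard that makes installation idempotent


def install_failure_hook(on_failure: Callable[[], None]) -> None:
    """Hypothetical sketch: run `on_failure` before the previous excepthook."""
    global _installed
    if _installed:
        return  # a second install is a no-op

    previous_hook = sys.excepthook

    def _hook(exc_type, exc_value, exc_tb):
        try:
            on_failure()  # e.g. mark the run as FAILED
        finally:
            # Always delegate, so the default traceback is still printed.
            previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = _hook
    _installed = True
```

Because `sys.excepthook` runs when an uncaught exception reaches the top level, before interpreter shutdown triggers atexit handlers, the callback gets the chance to set the status first.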