bridge.training.utils.mlflow_utils#
Module Contents#
Functions#
Callback executed after a checkpoint is successfully saved. |
|
Callback executed after a checkpoint is successfully loaded. |
|
Sanitize all metric names in a dictionary for MLFlow logging. |
API#
- bridge.training.utils.mlflow_utils.on_save_checkpoint_success(
- checkpoint_path: str,
- save_dir: str,
- iteration: int,
- mlflow_logger: Optional[Any],
Callback executed after a checkpoint is successfully saved.
If an MLFlow logger is provided, logs the checkpoint directory as an MLFlow artifact under a structured artifact path that includes the iteration number.
- Parameters:
checkpoint_path – The path to the specific checkpoint file/directory saved.
save_dir – The base directory where checkpoints are being saved.
iteration – The training iteration at which the checkpoint was saved.
mlflow_logger – The MLFlow module (e.g.,
mlflow) with an active run. If None, this function is a no-op.
- bridge.training.utils.mlflow_utils.on_load_checkpoint_success(
- checkpoint_path: str,
- load_dir: str,
- mlflow_logger: Optional[Any],
Callback executed after a checkpoint is successfully loaded.
For MLFlow, this emits a simple metric and tag to document which checkpoint was loaded during the run. It does not perform artifact lookups.
- Parameters:
checkpoint_path – The path to the specific checkpoint file/directory loaded.
load_dir – The base directory from which the checkpoint was loaded.
mlflow_logger – The MLFlow module (e.g.,
mlflow) with an active run. If None, this function is a no-op.
- bridge.training.utils.mlflow_utils._sanitize_mlflow_metrics(
- metrics: dict[str, Any],
Sanitize all metric names in a dictionary for MLFlow logging.