nemo_automodel.components.loggers.mlflow_utils#
Module Contents#
Functions#
configure_mlflow | Configure MLflow on rank 0 and start (or resume) a run.
flatten_params_for_mlflow | Flatten nested dicts to dot-keyed strings for MLflow params.
to_float_metrics | Clean a metrics dict before passing to mlflow.log_metrics.
end_mlflow_active_run_as_killed | End the active MLflow run with status=KILLED.
_install_mlflow_failure_hook | Mark active MLflow run as FAILED on uncaught Python exceptions.
Data#
API#
- nemo_automodel.components.loggers.mlflow_utils.logger#
‘getLogger(…)’
- nemo_automodel.components.loggers.mlflow_utils.configure_mlflow(cfg: Any) → Optional[Any]#
Configure MLflow on rank 0 and start (or resume) a run.
Also installs a sys.excepthook so crashed jobs report as FAILED rather than FINISHED. After this call the recipe logs via module-level mlflow.log_params and mlflow.log_metrics directly; on non-rank-0 processes mlflow.active_run() is None so those calls become no-ops naturally.
Returns the active run on rank 0, or None when MLflow is not configured or on non-rank-0 processes.
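The rank-0 gating described above can be sketched as follows. This is only an illustration under stated assumptions: how the module actually detects the rank is not documented here, so the sketch reads the RANK environment variable (which torchrun exports), and both `start_run_on_rank_zero` and its `start_run` argument are hypothetical names.

```python
import os
from typing import Any, Callable, Optional


def start_run_on_rank_zero(start_run: Callable[[], Any]) -> Optional[Any]:
    """Call start_run only on rank 0; return None on every other rank."""
    # Assumption: the launcher exports RANK (torchrun does). Default to 0
    # so that a plain single-process run still starts a run.
    rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        return None
    return start_run()
```

In real use `start_run` would wrap whatever starts or resumes the MLflow run; passing it in as a callable keeps the sketch independent of MLflow itself.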
- nemo_automodel.components.loggers.mlflow_utils.flatten_params_for_mlflow(params: Dict[str, Any], max_depth: Optional[int] = 1, prefix: str = '', _depth: int = 0)
Flatten nested dicts to dot-keyed strings for MLflow params.
max_depth controls how many levels of dict nesting get split into individual keys; deeper nesting is stringified at that depth’s leaf:
  - 1 (default): split one level, e.g. model.text_config: "{'output_hidden_states': True}".
  - N > 1: split up to N levels deep.
  - None: fully recursive, so every leaf gets its own key, e.g. model.text_config.output_hidden_states: 'True'.
Lists and tuples are always stringified; per-element keys would add noise without helping comparison (e.g. betas: [0.9, 0.95]).
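The max_depth behaviour can be illustrated with a small stand-alone re-implementation. This is a sketch rather than the module's code: `flatten_params_sketch` is a hypothetical name, and it assumes that every leaf value is passed through str() and that empty dicts are stringified instead of vanishing.

```python
from typing import Any, Dict, Optional


def flatten_params_sketch(
    params: Dict[str, Any],
    max_depth: Optional[int] = 1,
    prefix: str = "",
    _depth: int = 0,
) -> Dict[str, str]:
    """Flatten nested dicts into dot-keyed string values (illustrative only)."""
    flat: Dict[str, str] = {}
    for key, value in params.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        # Split a nested dict only while the depth budget allows it.
        can_split = max_depth is None or _depth < max_depth
        if isinstance(value, dict) and value and can_split:
            flat.update(
                flatten_params_sketch(value, max_depth, full_key, _depth + 1)
            )
        else:
            # Lists, tuples, scalars and too-deep dicts are stringified whole.
            flat[full_key] = str(value)
    return flat


cfg = {
    "model": {"text_config": {"output_hidden_states": True}},
    "optimizer": {"betas": [0.9, 0.95]},
}
print(flatten_params_sketch(cfg))                  # one level split
print(flatten_params_sketch(cfg, max_depth=None))  # fully recursive
```

With the default, model.text_config stays a single stringified dict; with max_depth=None it becomes model.text_config.output_hidden_states. In both cases betas remains one stringified list.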
- nemo_automodel.components.loggers.mlflow_utils.to_float_metrics(metrics: Dict[str, Any])
Clean a metrics dict before passing to mlflow.log_metrics.
MetricsSample.to_dict() mixes numbers, tensors, and a string timestamp field, but mlflow.log_metrics only accepts numeric values. This function filters and coerces values so the call succeeds:
  - Non-numeric values (e.g. timestamp): dropped (otherwise mlflow raises TypeError: must be real number, not str).
  - Tensors: coerced via .item() (multi-element tensors are reduced with .mean() first).
  - Python scalars: coerced to float.
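The three rules above can be sketched without importing torch by duck-typing tensors. This is an illustration, not the module's implementation: `to_float_metrics_sketch` is a hypothetical name, and it assumes that a tensor is anything exposing item() and numel(), and that bool counts as a Python scalar.

```python
import numbers
from typing import Any, Dict


def to_float_metrics_sketch(metrics: Dict[str, Any]) -> Dict[str, float]:
    """Keep only values that can be logged as floats (illustrative only)."""
    cleaned: Dict[str, float] = {}
    for name, value in metrics.items():
        if hasattr(value, "item") and hasattr(value, "numel"):
            # Tensor-like: reduce multi-element tensors, then unwrap.
            if value.numel() > 1:
                value = value.mean()
            cleaned[name] = float(value.item())
        elif isinstance(value, numbers.Real):
            # Python scalars (int, float, bool) are coerced to float.
            cleaned[name] = float(value)
        # Anything else, such as a string timestamp, is dropped.
    return cleaned


sample = {"loss": 2, "lr": 3e-4, "timestamp": "2024-01-01T00:00:00"}
print(to_float_metrics_sketch(sample))
```

The string timestamp is silently dropped, so the result can be handed to mlflow.log_metrics without triggering the TypeError mentioned above.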
- nemo_automodel.components.loggers.mlflow_utils.end_mlflow_active_run_as_killed() → None#
End the active MLflow run with status=KILLED.
Called from the SIGTERM handler so interrupted runs show as KILLED rather than FINISHED in the MLflow UI (mlflow’s atexit handler defaults to FINISHED on graceful exit, making cancelled and clean runs look identical).
No-op if no run is active; errors from end_run are suppressed so that signal-handler reentrancy in mlflow can’t crash the SIGTERM path.
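The no-op and error-suppression behaviour can be sketched as below. To keep the example runnable without MLflow installed, the two MLflow calls are passed in as callables; `end_active_run_as_killed_sketch` and its parameter names are hypothetical, not part of this module.

```python
import contextlib
from typing import Any, Callable, Optional


def end_active_run_as_killed_sketch(
    active_run: Callable[[], Optional[Any]],
    end_run: Callable[..., None],
) -> None:
    """End the active run as KILLED, never raising (illustrative only)."""
    if active_run() is None:
        # Nothing to end: behave as a no-op.
        return
    # Swallow any error so the signal handler itself cannot crash.
    with contextlib.suppress(Exception):
        end_run(status="KILLED")
```

In real use the arguments would be mlflow.active_run and mlflow.end_run, and the function would be called from a handler registered with signal.signal(signal.SIGTERM, ...).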
- nemo_automodel.components.loggers.mlflow_utils._install_mlflow_failure_hook() → None#
Mark active MLflow run as FAILED on uncaught Python exceptions.
MLflow’s atexit handler ends the run with default status=FINISHED on process exit, making a crashed run indistinguishable from a clean one in the UI. We chain a sys.excepthook that fires before atexit and explicitly sets FAILED first; the previous excepthook is preserved so default traceback printing still happens.
This only covers Python exceptions on the main thread. SIGKILL (OOM, job cancellation) and NCCL watchdog std::terminate paths bypass it and leave the run in RUNNING until a server-side janitor times it out. Worker-thread exceptions need threading.excepthook separately.
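The excepthook chaining described above can be sketched with the standard library alone. This is a minimal illustration, not the module's code: `install_failure_hook_sketch` and the `mark_failed` callable are hypothetical stand-ins for whatever ends the MLflow run with status FAILED.

```python
import sys
from typing import Callable


def install_failure_hook_sketch(mark_failed: Callable[[], None]) -> None:
    """Chain a sys.excepthook that marks the run failed, then defers."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_tb):
        try:
            # Runs before atexit handlers, so the status is set first.
            mark_failed()
        except Exception:
            # Never let status reporting mask the original exception.
            pass
        # Preserve the earlier hook so the traceback is still printed.
        previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
```

A plausible `mark_failed` would check mlflow.active_run() and call mlflow.end_run(status="FAILED"). As noted above, a hook like this only fires for uncaught exceptions on the main thread.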