nemo_automodel.components.loggers.mlflow_utils

Module Contents

Functions

configure_mlflow

Configure MLflow on rank 0 and start (or resume) a run.

flatten_params_for_mlflow

Flatten nested dicts to dot-keyed strings for MLflow params.

to_float_metrics

Clean a metrics dict before passing to mlflow.log_metrics.

end_mlflow_active_run_as_killed

End the active MLflow run with status=KILLED.

_install_mlflow_failure_hook

Mark the active MLflow run as FAILED on uncaught Python exceptions.

Data

API

nemo_automodel.components.loggers.mlflow_utils.logger

‘getLogger(…)’

nemo_automodel.components.loggers.mlflow_utils.configure_mlflow(cfg: Any) → Optional[Any]

Configure MLflow on rank 0 and start (or resume) a run.

Also installs a sys.excepthook so crashed jobs report as FAILED rather than FINISHED. After this call the recipe logs via module-level mlflow.log_params and mlflow.log_metrics directly; on non-rank-0 processes mlflow.active_run() is None so those calls become no-ops naturally.

Returns the active run on rank 0, or None when MLflow is not configured or on non-rank-0 processes.

nemo_automodel.components.loggers.mlflow_utils.flatten_params_for_mlflow(
params: Dict[str, Any],
max_depth: Optional[int] = 1,
prefix: str = '',
_depth: int = 0,
) → Dict[str, str]

Flatten nested dicts to dot-keyed strings for MLflow params.

max_depth controls how many levels of dict nesting are split into individual keys; a dict nested deeper than that is stringified and stored as a single value:

  • 1 (default) — split one level, e.g. model.text_config: "{'output_hidden_states': True}".

  • N > 1 — split up to N levels deep.

  • None — fully recursive: every leaf gets its own key, e.g. model.text_config.output_hidden_states: 'True'.

Lists and tuples are always stringified; per-element keys would add noise without helping comparison (e.g. betas: [0.9, 0.95]).
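The depth rules above can be illustrated with a minimal sketch. This is an independent re-implementation written from the description, not the library source; the name flatten_params_sketch and the sample cfg dict are hypothetical.

```python
from typing import Any, Dict, Optional


def flatten_params_sketch(
    params: Dict[str, Any],
    max_depth: Optional[int] = 1,
    prefix: str = "",
    _depth: int = 0,
) -> Dict[str, str]:
    """Illustrative stand-in for flatten_params_for_mlflow (assumed behavior)."""
    flat: Dict[str, str] = {}
    for key, value in params.items():
        full_key = f"{prefix}{key}"
        # A dict is split only while the current depth is below max_depth;
        # max_depth=None means "always split".
        can_split = max_depth is None or _depth < max_depth
        if isinstance(value, dict) and can_split:
            flat.update(
                flatten_params_sketch(value, max_depth, f"{full_key}.", _depth + 1)
            )
        else:
            # Leaves, lists, tuples, and too-deep dicts are stringified as-is.
            flat[full_key] = str(value)
    return flat


# Hypothetical config used only for illustration.
cfg = {
    "model": {"text_config": {"output_hidden_states": True}},
    "optimizer": {"betas": [0.9, 0.95]},
}

shallow = flatten_params_sketch(cfg)                 # max_depth=1
deep = flatten_params_sketch(cfg, max_depth=None)    # fully recursive
print(shallow)
print(deep)
```

With max_depth=1 the nested text_config dict is kept as one stringified value under model.text_config; with max_depth=None it becomes model.text_config.output_hidden_states. The betas list is stringified in both cases.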

nemo_automodel.components.loggers.mlflow_utils.to_float_metrics(
metrics: Dict[str, Any],
) → Dict[str, float]

Clean a metrics dict before passing to mlflow.log_metrics.

MetricsSample.to_dict() mixes numbers, tensors, and a string timestamp field, but mlflow.log_metrics only accepts numeric values. This function filters and coerces values so the call succeeds:

  • Non-numeric values (e.g. timestamp) — dropped (otherwise mlflow raises TypeError: must be real number, not str).

  • Tensors — coerced via .item() (multi-element tensors are reduced with .mean() first).

  • Python scalars — coerced to float.
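The three rules above can be sketched as follows. This is an assumption-laden illustration, not the library source: tensors are detected by duck typing (numel, mean, item) so the snippet runs without PyTorch, and FakeTensor is a hypothetical stand-in for torch.Tensor.

```python
from typing import Any, Dict


def to_float_metrics_sketch(metrics: Dict[str, Any]) -> Dict[str, float]:
    """Illustrative stand-in for to_float_metrics (assumed behavior)."""
    cleaned: Dict[str, float] = {}
    for key, value in metrics.items():
        if hasattr(value, "numel") and hasattr(value, "item"):
            # Tensor-like: reduce multi-element tensors to their mean first.
            if value.numel() > 1:
                value = value.mean()
            cleaned[key] = float(value.item())
        elif isinstance(value, (int, float)):
            # Plain Python scalars are coerced to float.
            cleaned[key] = float(value)
        # Anything else (e.g. a string timestamp) is dropped.
    return cleaned


class FakeTensor:
    """Hypothetical minimal tensor stand-in so the example needs no torch."""

    def __init__(self, values):
        self.values = list(values)

    def numel(self) -> int:
        return len(self.values)

    def mean(self) -> "FakeTensor":
        return FakeTensor([sum(self.values) / len(self.values)])

    def item(self) -> float:
        return self.values[0]


sample = {
    "loss": FakeTensor([1.0, 3.0]),   # multi-element: mean, then item()
    "step": 7,                        # Python int: coerced to float
    "lr": 0.5,
    "timestamp": "2024-01-01T00:00",  # non-numeric: dropped
}
print(to_float_metrics_sketch(sample))
```

The result contains only loss, step, and lr, all as Python floats, so it can be passed to mlflow.log_metrics without the TypeError mentioned above.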

nemo_automodel.components.loggers.mlflow_utils.end_mlflow_active_run_as_killed() → None

End the active MLflow run with status=KILLED.

Called from the SIGTERM handler so interrupted runs show as KILLED rather than FINISHED in the MLflow UI (mlflow’s atexit handler defaults to FINISHED on graceful exit, making cancelled and clean runs look identical).

No-op if no run is active; errors from end_run are suppressed so that signal-handler reentrancy in mlflow can’t crash the SIGTERM path.
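A minimal sketch of the behavior described above, assuming only the public mlflow.active_run() and mlflow.end_run(status=...) calls. The function names and the SIGTERM handler are hypothetical and are not the library's own code; how the real handler exits the process is not specified here.

```python
import contextlib
import signal


def end_active_run_as_killed_sketch() -> None:
    """Illustrative stand-in for end_mlflow_active_run_as_killed."""
    import mlflow  # imported lazily so this module loads without MLflow

    if mlflow.active_run() is None:
        return  # no-op when nothing is running
    # Swallow any error from end_run so the signal path can never raise.
    with contextlib.suppress(Exception):
        mlflow.end_run(status="KILLED")


def install_sigterm_handler_sketch() -> None:
    """Hypothetical wiring: close the run as KILLED, then exit."""

    def _on_sigterm(signum, frame):
        end_active_run_as_killed_sketch()
        # Assumed exit convention (128 + signal number); adjust as needed.
        raise SystemExit(128 + signum)

    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, _on_sigterm)
```

Ending the run explicitly inside the handler matters because, by the time MLflow's atexit handler runs, the run is already closed and keeps its KILLED status.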

nemo_automodel.components.loggers.mlflow_utils._install_mlflow_failure_hook() → None

Mark the active MLflow run as FAILED on uncaught Python exceptions.

MLflow’s atexit handler ends the run with default status=FINISHED on process exit, making a crashed run indistinguishable from a clean one in the UI. We chain a sys.excepthook that fires before atexit and explicitly sets FAILED first; the previous excepthook is preserved so default traceback printing still happens.

This only covers Python exceptions on the main thread. SIGKILL (OOM, job cancellation) and NCCL watchdog std::terminate paths bypass it and leave the run in RUNNING until a server-side janitor times it out. Worker-thread exceptions need threading.excepthook separately.
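The excepthook chaining described above might look like the following sketch. It is not the library source: install_failure_hook_sketch is a hypothetical name, and suppressing errors raised by MLflow inside the hook is an assumption made here so the original traceback is never masked.

```python
import sys


def install_failure_hook_sketch() -> None:
    """Illustrative stand-in for _install_mlflow_failure_hook."""
    previous_hook = sys.excepthook  # keep the existing hook for chaining

    def _hook(exc_type, exc_value, exc_tb):
        try:
            import mlflow  # lazy import; only needed when a crash happens

            if mlflow.active_run() is not None:
                # Close the run as FAILED before atexit can mark it FINISHED.
                mlflow.end_run(status="FAILED")
        except Exception:
            pass  # assumption: never let MLflow errors hide the real crash
        # Delegate so the default traceback printing still happens.
        previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = _hook
```

sys.excepthook is only invoked for uncaught exceptions on the main thread, which is why worker threads need a separate threading.excepthook and why signals such as SIGKILL are not covered.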