nemo_automodel.components.loggers.mlflow_utils
Module Contents
Functions
Data
API
Mark active MLflow run as FAILED on uncaught Python exceptions.
MLflow’s atexit handler ends the run with default status=FINISHED on
process exit, making a crashed run indistinguishable from a clean one
in the UI. We chain a sys.excepthook that fires before atexit and
explicitly sets FAILED first; the previous excepthook is preserved so
default traceback printing still happens.
This only covers Python exceptions on the main thread. SIGKILL (OOM,
job cancellation) and NCCL watchdog std::terminate paths bypass it
and leave the run in RUNNING until a server-side janitor times it out.
Worker-thread exceptions need threading.excepthook separately.
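The chaining described above can be sketched as follows. This is a minimal illustration rather than the module's actual code: `install_failed_hook` is a hypothetical name, and the `end_run` callable is passed in as a parameter (with MLflow installed it would be `mlflow.end_run`, which accepts a `status` argument).

```python
import sys


def install_failed_hook(end_run):
    """Chain a sys.excepthook that marks the run FAILED, then defers to the old hook.

    `end_run` is assumed to be a callable accepting a `status` keyword,
    such as `mlflow.end_run`.
    """
    previous = sys.excepthook  # keep the old hook so the traceback still prints

    def hook(exc_type, exc_value, tb):
        try:
            end_run(status="FAILED")
        except Exception:
            pass  # a logging failure must not mask the original exception
        previous(exc_type, exc_value, tb)

    sys.excepthook = hook
    return hook
```

Because `sys.excepthook` runs before `atexit` handlers, the run status is already FAILED by the time MLflow's own exit handler fires. The same pattern would apply to `threading.excepthook` for worker threads, whose hook takes a single `args` object instead of three positional arguments.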
Configure MLflow on rank 0 and start (or resume) a run.
Also installs a sys.excepthook so crashed jobs report as FAILED rather
than FINISHED. After this call the recipe logs via module-level
mlflow.log_params and mlflow.log_metrics directly; on non-rank-0
processes mlflow.active_run() is None so those calls become no-ops
naturally.
Returns the active run on rank 0, or None when MLflow is not configured or on non-rank-0 processes.
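The rank-0 gating can be illustrated with a short sketch. Everything here is an assumption made for illustration: the names `is_rank_zero` and `maybe_start_run` are hypothetical, the rank is read from the `RANK` environment variable (as set by launchers such as `torchrun`), and `start_run` stands in for whatever callable starts or resumes the run (for example `mlflow.start_run`).

```python
import os


def is_rank_zero(env=None):
    """Return True when this process is global rank 0 (assumes a RANK env var)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def maybe_start_run(start_run, env=None):
    """Start a run on rank 0 only; return None elsewhere or when not configured.

    `start_run` is a zero-argument callable, or None when MLflow is not configured.
    """
    if start_run is None or not is_rank_zero(env):
        return None
    return start_run()
```

Returning `None` on every other rank mirrors the documented contract, so callers can treat the return value as "is there a run to log to in this process".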
End the active MLflow run with status=KILLED.
Called from the SIGTERM handler so interrupted runs show as KILLED rather than FINISHED in the MLflow UI (MLflow's atexit handler defaults to FINISHED on graceful exit, making cancelled and clean runs look identical).
No-op if no run is active; errors from end_run are suppressed so that
signal-handler reentrancy in MLflow can't crash the SIGTERM path.
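A minimal sketch of this behavior, under stated assumptions: `end_run_as_killed` is a hypothetical name, and the `active_run` and `end_run` callables are injected so the example stays self-contained (with MLflow installed they would be `mlflow.active_run` and `mlflow.end_run`).

```python
import contextlib


def end_run_as_killed(active_run, end_run):
    """End the active run with status KILLED; return False if there was no run.

    `active_run` returns the current run or None; `end_run` accepts a
    `status` keyword. Both are assumed to behave like their mlflow namesakes.
    """
    if active_run() is None:
        return False  # nothing to end
    # Swallow any error: this runs inside a signal handler, and an exception
    # here would abort the rest of the SIGTERM shutdown path.
    with contextlib.suppress(Exception):
        end_run(status="KILLED")
    return True
```

It could be registered with something like `signal.signal(signal.SIGTERM, lambda signum, frame: end_run_as_killed(mlflow.active_run, mlflow.end_run))`, followed by whatever shutdown logic the job already performs.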
Flatten nested dicts to dot-keyed strings for MLflow params.
max_depth controls how many levels of dict nesting get split into
individual keys; deeper nesting is stringified at that depth’s leaf:
- 1 (default): split one level, e.g. model.text_config: "{'output_hidden_states': True}".
- N > 1: split up to N levels deep.
- None: fully recursive; every leaf gets its own key, e.g. model.text_config.output_hidden_states: 'True'.
Lists and tuples are always stringified; per-element keys would add
noise without helping comparison (e.g. betas: [0.9, 0.95]).
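The depth rule is easiest to see in code. The sketch below is one way to implement the described semantics, not necessarily how the module does it; `flatten_params` and its private `_depth` argument are hypothetical names.

```python
def flatten_params(d, max_depth=1, prefix="", _depth=0):
    """Flatten nested dicts into dot-keyed strings.

    max_depth=None recurses fully; otherwise dicts nested deeper than
    max_depth levels are stringified. Lists and tuples are always stringified.
    """
    out = {}
    for key, value in d.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        can_split = max_depth is None or _depth < max_depth
        if isinstance(value, dict) and value and can_split:
            # Split this level into individual keys.
            out.update(flatten_params(value, max_depth, name, _depth + 1))
        else:
            # Leaves, lists, tuples, and too-deep dicts become strings.
            out[name] = str(value)
    return out


cfg = {
    "model": {"text_config": {"output_hidden_states": True}},
    "optimizer": {"betas": [0.9, 0.95]},
}
shallow = flatten_params(cfg)               # default max_depth=1
deep = flatten_params(cfg, max_depth=None)  # fully recursive
```

Here `shallow` contains `model.text_config` mapped to the stringified inner dict, while `deep` contains `model.text_config.output_hidden_states` mapped to `'True'`. In both cases `optimizer.betas` stays a single key with the value `'[0.9, 0.95]'`.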
Clean a metrics dict before passing to mlflow.log_metrics.
MetricsSample.to_dict() mixes numbers, tensors, and a string timestamp
field, but mlflow.log_metrics only accepts numeric values. This function
filters and coerces values so the call succeeds:
- Non-numeric values (e.g. timestamp): dropped (otherwise mlflow raises TypeError: must be real number, not str).
- Tensors: coerced via .item() (multi-element tensors are reduced with .mean() first).
- Python scalars: coerced to float.
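The three rules above can be sketched as a small filter. This is an illustrative version under assumptions, not the module's implementation: `sanitize_metrics` is a hypothetical name, and tensors are detected by duck typing (anything exposing `item()` and `numel()`, as `torch.Tensor` does) so the example does not need to import PyTorch.

```python
import numbers


def sanitize_metrics(metrics):
    """Return a copy of `metrics` holding only float values."""
    clean = {}
    for key, value in metrics.items():
        if hasattr(value, "item") and hasattr(value, "numel"):
            # Tensor-like: reduce multi-element tensors, then unwrap the scalar.
            if value.numel() > 1:
                value = value.mean()
            clean[key] = float(value.item())
        elif isinstance(value, numbers.Real):
            # Python scalars (int, float, bool) are coerced to float.
            clean[key] = float(value)
        # Anything else (strings such as a timestamp, None, ...) is dropped.
    return clean
```

The cleaned dict can then be handed to `mlflow.log_metrics(clean, step=...)`. Note that calling `.mean()` on an integer `torch.Tensor` raises an error in PyTorch, so a production version might cast to float before reducing.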