# nemo_automodel.components.loggers.mlflow_utils

## Module Contents

### Functions

| Name                                                                                                                 | Description                                                     |
| -------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| [`_install_mlflow_failure_hook`](#nemo_automodel-components-loggers-mlflow_utils-_install_mlflow_failure_hook)       | Mark active MLflow run as FAILED on uncaught Python exceptions. |
| [`configure_mlflow`](#nemo_automodel-components-loggers-mlflow_utils-configure_mlflow)                               | Configure MLflow on rank 0 and start (or resume) a run.         |
| [`end_mlflow_active_run_as_killed`](#nemo_automodel-components-loggers-mlflow_utils-end_mlflow_active_run_as_killed) | End the active MLflow run with status=KILLED.                   |
| [`flatten_params_for_mlflow`](#nemo_automodel-components-loggers-mlflow_utils-flatten_params_for_mlflow)             | Flatten nested dicts to dot-keyed strings for MLflow params.    |
| [`to_float_metrics`](#nemo_automodel-components-loggers-mlflow_utils-to_float_metrics)                               | Clean a metrics dict before passing to `mlflow.log_metrics`.    |

### Data

[`logger`](#nemo_automodel-components-loggers-mlflow_utils-logger)

### API

```python
nemo_automodel.components.loggers.mlflow_utils._install_mlflow_failure_hook() -> None
```

Mark active MLflow run as FAILED on uncaught Python exceptions.

MLflow's atexit handler ends the run with default status=FINISHED on
process exit, making a crashed run indistinguishable from a clean one
in the UI. We chain a `sys.excepthook` that fires before atexit and
explicitly sets FAILED first; the previous excepthook is preserved so
default traceback printing still happens.

This only covers Python exceptions on the main thread. SIGKILL (OOM,
job cancellation) and NCCL watchdog `std::terminate` paths bypass it
and leave the run in RUNNING until a server-side janitor times it out.
Worker-thread exceptions need `threading.excepthook` separately.
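
The chaining pattern described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: `install_failure_hook` and its `on_failure` callback are hypothetical names, and with MLflow the callback could be something like `lambda: mlflow.end_run(status="FAILED")`.

```python
import sys


def install_failure_hook(on_failure):
    """Chain a sys.excepthook that calls on_failure() before the previous hook.

    `on_failure` is a hypothetical callback standing in for the code that
    marks the active run as FAILED.
    """
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_traceback):
        try:
            on_failure()
        except Exception:
            # Status reporting must never mask the original exception.
            pass
        # Delegate to the previous hook so the traceback is still printed.
        previous_hook(exc_type, exc_value, exc_traceback)

    sys.excepthook = hook
    return previous_hook
```

Because `sys.excepthook` runs before interpreter shutdown, the callback gets to set the status before any `atexit` handler ends the run. Returning the previous hook makes it possible to restore it later, which is mainly useful in tests.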

```python
nemo_automodel.components.loggers.mlflow_utils.configure_mlflow(
    cfg: typing.Any
) -> typing.Optional[typing.Any]
```

Configure MLflow on rank 0 and start (or resume) a run.

Also installs a `sys.excepthook` so crashed jobs report as FAILED rather
than FINISHED. After this call the recipe logs via module-level
`mlflow.log_params` and `mlflow.log_metrics` directly; on non-rank-0
processes `mlflow.active_run()` is None so those calls become no-ops
naturally.

Returns the active run on rank 0, or None when MLflow is not configured
or on non-rank-0 processes.
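
The rank-0 gating can be pictured with the sketch below. It is an assumption-laden illustration rather than the library's code: `start_run_on_rank0` and `start_run` are hypothetical names, and reading the rank from the `RANK` environment variable (as set by `torchrun`) is an assumption about how the rank is discovered.

```python
import os


def start_run_on_rank0(start_run, rank=None):
    """Call start_run() on rank 0 only; return None on every other rank.

    `start_run` is a hypothetical stand-in for a call such as
    mlflow.start_run. Falling back to the RANK environment variable
    is an assumption made for this sketch.
    """
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        return None
    return start_run()
```

Callers should treat a `None` return as "no run on this process" rather than as an error.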

```python
nemo_automodel.components.loggers.mlflow_utils.end_mlflow_active_run_as_killed() -> None
```

End the active MLflow run with status=KILLED.

Called from the SIGTERM handler so interrupted runs show as KILLED
rather than FINISHED in the MLflow UI. MLflow's atexit handler defaults
to FINISHED on graceful exit, which makes cancelled and clean runs look
identical.

No-op if no run is active; errors from `end_run` are suppressed so that
signal-handler reentrancy in mlflow can't crash the SIGTERM path.
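
A rough sketch of this behavior, assuming injected callables in place of the real MLflow functions: `end_run_as_killed`, `install_sigterm_handler`, `active_run`, and `end_run` are hypothetical names, and exiting with `128 + signum` after ending the run is an assumption of this example, not something the module is documented to do.

```python
import contextlib
import signal


def end_run_as_killed(active_run, end_run):
    """End the run as KILLED; do nothing if no run is active.

    `active_run` and `end_run` are hypothetical stand-ins for
    mlflow.active_run and mlflow.end_run.
    """
    if active_run() is None:
        return
    # Swallow errors so a failure here cannot crash the SIGTERM path.
    with contextlib.suppress(Exception):
        end_run(status="KILLED")


def install_sigterm_handler(active_run, end_run):
    """Register a SIGTERM handler (must be called from the main thread)."""

    def handler(signum, frame):
        end_run_as_killed(active_run, end_run)
        # Assumption: exit with the conventional 128 + signal number.
        raise SystemExit(128 + signum)

    # signal.signal returns the previously installed handler.
    return signal.signal(signal.SIGTERM, handler)
```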

```python
nemo_automodel.components.loggers.mlflow_utils.flatten_params_for_mlflow(
    params: typing.Dict[str, typing.Any],
    max_depth: typing.Optional[int] = 1,
    prefix: str = '',
    _depth: int = 0
) -> typing.Dict[str, str]
```

Flatten nested dicts to dot-keyed strings for MLflow params.

`max_depth` controls how many levels of dict nesting get split into
individual keys; deeper nesting is stringified at that depth's leaf:

* `1` (default): split one level, e.g.
  `model.text_config: "{'output_hidden_states': True}"`.
* `N > 1`: split up to N levels deep.
* `None`: fully recursive, so every leaf gets its own key, e.g.
  `model.text_config.output_hidden_states: 'True'`.

Lists and tuples are always stringified; per-element keys would add
noise without helping comparison (e.g. `betas: [0.9, 0.95]`).
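
The depth rules above can be reproduced with a short sketch. `flatten_sketch` is a hypothetical re-implementation written from this description only; the real function may differ in details such as how `prefix` is joined to keys.

```python
def flatten_sketch(params, max_depth=1, prefix="", _depth=0):
    """Flatten nested dicts into dot-keyed strings (illustrative only)."""
    flat = {}
    for key, value in params.items():
        full_key = f"{prefix}{key}"
        # None means "no limit"; otherwise split while below max_depth.
        can_split = max_depth is None or _depth < max_depth
        if isinstance(value, dict) and can_split:
            flat.update(
                flatten_sketch(value, max_depth, f"{full_key}.", _depth + 1)
            )
        else:
            # Leaves, lists, tuples, and too-deep dicts are stringified.
            flat[full_key] = str(value)
    return flat


cfg = {
    "model": {"text_config": {"output_hidden_states": True}},
    "optimizer": {"betas": [0.9, 0.95]},
}

one_level = flatten_sketch(cfg)
# {'model.text_config': "{'output_hidden_states': True}",
#  'optimizer.betas': '[0.9, 0.95]'}

fully_flat = flatten_sketch(cfg, max_depth=None)
# {'model.text_config.output_hidden_states': 'True',
#  'optimizer.betas': '[0.9, 0.95]'}
```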

```python
nemo_automodel.components.loggers.mlflow_utils.to_float_metrics(
    metrics: typing.Dict[str, typing.Any]
) -> typing.Dict[str, float]
```

Clean a metrics dict before passing to `mlflow.log_metrics`.

`MetricsSample.to_dict()` mixes numbers, tensors, and a string `timestamp`
field, but `mlflow.log_metrics` only accepts numeric values. This function
filters and coerces values so the call succeeds:

* Non-numeric values (e.g. `timestamp`) — dropped (otherwise mlflow raises
  `TypeError: must be real number, not str`).
* Tensors — coerced via `.item()` (multi-element tensors are reduced with
  `.mean()` first).
* Python scalars — coerced to float.
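
The three rules can be sketched as below. `to_float_sketch` is a hypothetical illustration, not the module's code: it detects tensors by duck typing (any object with `item()` and `numel()`) so that it runs without PyTorch, and it treats `bool` as numeric, which is an assumption.

```python
import numbers


def to_float_sketch(metrics):
    """Keep only numeric metrics and coerce them to float (illustrative)."""
    cleaned = {}
    for name, value in metrics.items():
        # Tensor-like objects: reduce to one element, then unwrap it.
        if hasattr(value, "item") and hasattr(value, "numel"):
            if value.numel() > 1:
                value = value.mean()
            value = value.item()
        # Anything still non-numeric (e.g. a string timestamp) is dropped.
        if isinstance(value, numbers.Real):
            cleaned[name] = float(value)
    return cleaned


sample = {"loss": 2, "lr": 1e-4, "timestamp": "2024-01-01T00:00:00"}
clean = to_float_sketch(sample)  # "timestamp" is dropped, "loss" becomes 2.0
```

The result can then be passed to `mlflow.log_metrics`, which expects a mapping from metric names to floats.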

```python
nemo_automodel.components.loggers.mlflow_utils.logger = logging.getLogger(__name__)
```