> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# MLflow Logging in NeMo AutoModel

## Introduction

MLflow is an open-source platform for managing the machine learning lifecycle, including experiment tracking, model versioning, and deployment. NeMo AutoModel integrates with MLflow to log training metrics, parameters, and artifacts during model training.

With MLflow integration, you can:

* Track and compare experiments across multiple runs
* Log hyperparameters and training configurations
* Monitor training and validation metrics in real-time
* Store model checkpoints and artifacts
* Visualize experiment results through the MLflow UI
* Share results with team members

## Prerequisites

Before using MLflow logging in NeMo AutoModel, ensure you have:

1. **MLflow installed**: MLflow is installed with `nemo-automodel` by default. If you see an import error in your environment, install it manually:
   ```bash
   pip install mlflow
   # or:
   uv pip install mlflow
   ```

2. **MLflow tracking server** (optional): For production use, set up a tracking server to centralize experiment data. For local development, MLflow will use a local file-based store by default.

## Configuration

Enable MLflow logging by adding an `mlflow` section to your recipe YAML configuration:

```yaml
mlflow:
  experiment_name: "automodel-llm-llama3_2_1b_squad-finetune"
  run_name: ""
  tracking_uri: null
  artifact_location: null
  tags:
    task: "squad-finetune"
    model_family: "llama3.2"
    model_size: "1b"
    dataset: "squad"
    framework: "automodel"
```

### Configuration Parameters

| Parameter           | Type | Default                | Description                                                                  |
| ------------------- | ---- | ---------------------- | ---------------------------------------------------------------------------- |
| `experiment_name`   | str  | "automodel-experiment" | Name of the MLflow experiment. All runs are grouped under this experiment.   |
| `run_name`          | str  | ""                     | Optional name for the current run. If empty, MLflow generates a unique name. |
| `tracking_uri`      | str  | null                   | URI of the MLflow tracking server. If null, uses local file-based storage.   |
| `artifact_location` | str  | null                   | Location to store artifacts. If null, uses default MLflow location.          |
| `tags`              | dict | {}                     | Dictionary of tags to attach to the run for organization and filtering.      |

### Tracking URI Options

The `tracking_uri` parameter determines where MLflow stores experiment data:

* **Local file storage (default)**: `null` or `file:///path/to/mlruns`
* **Remote tracking server**: `http://your-mlflow-server:5000`
* **Database backend**: `postgresql://user:password@host:port/database`

For team collaboration, we recommend setting up a remote tracking server.

## What Gets Logged

NeMo AutoModel automatically logs the following information to MLflow:

### Metrics

* Training loss at each step
* Validation loss and metrics
* Learning rate schedule
* Gradient norms (if gradient clipping is enabled)

### Parameters

* Model configuration (architecture, size, pretrained checkpoint)
* Training hyperparameters (learning rate, batch size, optimizer settings)
* Dataset information
* Parallelism configuration (DP, TP, CP settings)

### Tags

* Custom tags from configuration
* Automatically added tags:
  * Model name from `pretrained_model_name_or_path`
  * Global and local batch sizes

### Artifacts

* Model checkpoints (if configured)
* Training configuration files

Only rank 0 in distributed training logs to MLflow to avoid duplicate entries and reduce overhead.

## Usage Example

Here's a complete example of training with MLflow logging enabled:

### Configure Your Recipe

Add the MLflow configuration to your YAML file (e.g., `llama3_2_1b_squad.yaml`):

```yaml
step_scheduler:
  global_batch_size: 64
  local_batch_size: 8
  ckpt_every_steps: 1000
  val_every_steps: 10
  num_epochs: 1

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

mlflow:
  experiment_name: "llama3-squad-finetune"
  run_name: "baseline-run-1"
  tracking_uri: null  # Uses local storage
  tags:
    task: "question-answering"
    dataset: "squad"
    model: "llama-3.2-1b"
```

### Run Training

```bash
automodel --nproc-per-node=8 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
```

During training, you'll see MLflow logging messages:

```
MLflow run started: abc123def456
View run at: file:///path/to/mlruns/#/experiments/1/runs/abc123def456
```

### View Results in MLflow UI

Launch the MLflow UI to visualize your experiments:

```bash
mlflow ui
```

By default, the UI runs at `http://localhost:5000`. Open this URL in your browser to:

* Compare metrics across runs
* View parameter configurations
* Download artifacts
* Filter and search experiments by tags

## Integration with Other Loggers

MLflow can be used alongside other logging tools like Weights & Biases (WandB). Simply enable both in your configuration:

```yaml
# Enable both MLflow and WandB
mlflow:
  experiment_name: "my-experiment"
  tags:
    framework: "automodel"

wandb:
  project: "my-project"
  entity: "my-team"
  name: "my-run"
```

Both loggers will track the same metrics independently, allowing you to leverage the strengths of each platform.

## Best Practices

### Experiment Organization

1. **Use descriptive experiment names**: Group related runs under meaningful experiment names.
   ```yaml
   experiment_name: "llama3-squad-ablation-study"
   ```

2. **Tag your runs**: Add tags for easy filtering and comparison.
   ```yaml
   tags:
     model_size: "1b"
     learning_rate: "1e-5"
     optimizer: "adam"
   ```

3. **Use run names for variants**: Differentiate runs within an experiment.
   ```yaml
   run_name: "lr-1e5-bs64"
   ```

### Remote Tracking Server

For team collaboration, set up a shared MLflow tracking server:

```yaml
mlflow:
  tracking_uri: "http://mlflow-server.example.com:5000"
  experiment_name: "team-llm-experiments"
```

### Artifact Storage

For large-scale experiments, configure a dedicated artifact location:

```yaml
mlflow:
  artifact_location: "s3://my-bucket/mlflow-artifacts"
```

Supported storage backends include S3, Azure Blob Storage, Google Cloud Storage, and network file systems.

### Performance Considerations

* MLflow logging adds minimal overhead since only rank 0 logs.
* Metrics are logged asynchronously to avoid blocking training.
* For very frequent logging (every step), consider increasing `val_every_steps` to reduce I/O.

## Troubleshooting

### MLflow Not Installed

If you see an import error:

```
ImportError: MLflow is not installed. Please install it (e.g. pip install mlflow).
```

Install MLflow:

```bash
pip install mlflow
# or:
uv pip install mlflow
```

### Connection Issues

If you can't connect to a remote tracking server:

* Verify the `tracking_uri` is correct
* Check network connectivity and firewall rules
* Ensure the tracking server is running

### Missing Metrics

If metrics aren't appearing in MLflow:

* Verify you're running on rank 0 or check rank 0 logs
* Ensure the MLflow run started successfully (check for "MLflow run started" message)
* Check that metrics are being computed during training

## References

* [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
* [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html)
* [MLflow Python API](https://mlflow.org/docs/latest/python_api/index.html)
* [NeMo AutoModel Examples](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples)