MLflow Logging in NeMo AutoModel
Introduction
MLflow is an open-source platform for managing the machine learning lifecycle, including experiment tracking, model versioning, and deployment. NeMo AutoModel integrates with MLflow to log training metrics, parameters, and artifacts during model training.
With MLflow integration, you can:
- Track and compare experiments across multiple runs
- Log hyperparameters and training configurations
- Monitor training and validation metrics in real time
- Store model checkpoints and artifacts
- Visualize experiment results through the MLflow UI
- Share results with team members
Prerequisites
Before using MLflow logging in NeMo AutoModel, ensure you have:
- MLflow installed: MLflow is installed with nemo-automodel by default. If you see an import error in your environment, install it manually.
- MLflow tracking server (optional): For production use, set up a tracking server to centralize experiment data. For local development, MLflow uses a local file-based store by default.
Configuration
Enable MLflow logging by adding an mlflow section to your recipe YAML configuration:
Configuration Parameters
Tracking URI Options
The tracking_uri parameter determines where MLflow stores experiment data:
- Local file storage (default): null or file:///path/to/mlruns
- Remote tracking server: http://your-mlflow-server:5000
- Database backend: postgresql://user:password@host:port/database
For team collaboration, we recommend setting up a remote tracking server.
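As an illustration of how the three kinds of tracking_uri values differ, the following minimal sketch classifies a URI by its scheme. The helper names describe_tracking_uri and apply_tracking_uri are hypothetical and are not part of NeMo AutoModel; the only MLflow call used is mlflow.set_tracking_uri, and the list of database schemes is an assumption based on the SQLAlchemy-style URIs MLflow accepts.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: SQLAlchemy-style schemes commonly used for database-backed stores.
DATABASE_SCHEMES = {"postgresql", "mysql", "sqlite", "mssql"}


def describe_tracking_uri(tracking_uri: Optional[str]) -> str:
    """Return which kind of store a tracking_uri value points at."""
    if tracking_uri is None:
        # YAML `null` arrives in Python as None, so MLflow uses its local default.
        return "local file storage"
    # Drop an optional driver suffix, e.g. "postgresql+psycopg2" -> "postgresql".
    scheme = urlparse(tracking_uri).scheme.split("+")[0]
    if scheme in ("", "file"):
        return "local file storage"
    if scheme in ("http", "https"):
        return "remote tracking server"
    if scheme in DATABASE_SCHEMES:
        return "database backend"
    return "other"


def apply_tracking_uri(tracking_uri: Optional[str]) -> None:
    """Point the MLflow client at the given store (requires mlflow)."""
    import mlflow  # imported lazily so the helper above works without MLflow

    if tracking_uri is not None:
        mlflow.set_tracking_uri(tracking_uri)
```

Calling describe_tracking_uri before a run is one way to confirm that a configuration value points at the store you expect.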
What Gets Logged
NeMo AutoModel automatically logs the following information to MLflow:
Metrics
- Training loss at each step
- Validation loss and metrics
- Learning rate schedule
- Gradient norms (if gradient clipping is enabled)
Parameters
- Model configuration (architecture, size, pretrained checkpoint)
- Training hyperparameters (learning rate, batch size, optimizer settings)
- Dataset information
- Parallelism configuration (DP, TP, CP settings)
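MLflow stores parameters as flat key-value pairs, so a nested training configuration is usually flattened before it is logged. The sketch below shows one common way to do that with dotted keys. It is an illustration only: flatten_config is a hypothetical helper, and the source does not state how NeMo AutoModel names its logged parameters.

```python
from collections.abc import Mapping
from typing import Any, Dict


def flatten_config(config: Mapping, prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested mapping into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested sections and carry the dotted prefix along.
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical configuration values, for illustration only.
example = {
    "model": {"pretrained_model_name_or_path": "meta-llama/Llama-3.2-1B"},
    "optimizer": {"lr": 1e-5},
}
flat_params = flatten_config(example)
```

A flattened dictionary like flat_params can then be passed to mlflow.log_params inside an active run.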
Tags
- Custom tags from configuration
- Automatically added tags:
  - Model name from pretrained_model_name_or_path
  - Global and local batch sizes
Artifacts
- Model checkpoints (if configured)
- Training configuration files
In distributed training, only rank 0 logs to MLflow, which avoids duplicate entries and reduces overhead.
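The rank-0 rule can be sketched as a small guard around any logging call. This is not NeMo AutoModel's implementation: is_rank_zero and log_on_rank_zero are hypothetical helpers, and the sketch assumes the process rank is exposed through the RANK environment variable, as it is under torchrun.

```python
import os
from typing import Callable, Dict, Mapping, Optional


def is_rank_zero(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True on the global rank 0 process (or when RANK is unset)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def log_on_rank_zero(
    log_fn: Callable[[Dict[str, float], int], None],
    metrics: Dict[str, float],
    step: int,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Call log_fn(metrics, step) only on rank 0; report whether it was called."""
    if not is_rank_zero(env):
        return False  # other ranks skip logging entirely
    log_fn(metrics, step)
    return True
```

With MLflow installed, log_fn could be a thin wrapper such as lambda m, s: mlflow.log_metrics(m, step=s), so that every other rank never touches the tracking store.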
Usage Example
Here’s a complete example of training with MLflow logging enabled:
Configure Your Recipe
Add the MLflow configuration to your YAML file (e.g., llama3_2_1b_squad.yaml):
Run Training
During training, you’ll see MLflow logging messages:
View Results in MLflow UI
Launch the MLflow UI to visualize your experiments:
By default, the UI runs at http://localhost:5000. Open this URL in your browser to:
- Compare metrics across runs
- View parameter configurations
- Download artifacts
- Filter and search experiments by tags
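The same tag-based filtering is available outside the UI through MLflow's Python API. The sketch below builds a filter string and passes it to mlflow.search_runs. The helper names are hypothetical, and the filter builder assumes simple tag keys and values without quotes or spaces; anything more complex would need quoting according to MLflow's search syntax.

```python
from typing import Mapping


def build_tag_filter(tags: Mapping[str, str]) -> str:
    """Build a filter string that matches runs carrying all of the given tags."""
    # Assumes keys and values contain no quotes, spaces, or other special characters.
    clauses = [f"tags.{key} = '{value}'" for key, value in sorted(tags.items())]
    return " AND ".join(clauses)


def search_runs_by_tags(experiment_name: str, tags: Mapping[str, str]):
    """Return the matching runs of one experiment (requires mlflow)."""
    import mlflow  # imported lazily so build_tag_filter works without MLflow

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=build_tag_filter(tags),
    )
```

By default mlflow.search_runs returns a pandas DataFrame with one row per run, which makes it convenient to compare metrics across runs in a notebook.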
Integration with Other Loggers
MLflow can be used alongside other logging tools like Weights & Biases (WandB). Simply enable both in your configuration:
Both loggers will track the same metrics independently, allowing you to leverage the strengths of each platform.
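Conceptually, running two loggers side by side is a fan-out: each metric update is forwarded to every backend, and no backend depends on another. The following sketch illustrates that idea with a hypothetical MultiLogger class. It is not how NeMo AutoModel wires its loggers internally, which the source does not describe.

```python
from typing import Callable, Dict, List

# A sink is any callable that accepts (metrics, step).
MetricSink = Callable[[Dict[str, float], int], None]


class MultiLogger:
    """Forward each metric update to every registered sink."""

    def __init__(self, sinks: List[MetricSink]) -> None:
        self._sinks = list(sinks)

    def log(self, metrics: Dict[str, float], step: int) -> None:
        for sink in self._sinks:
            # Each sink receives its own copy so one backend cannot mutate another's data.
            sink(dict(metrics), step)
```

With both packages installed, the sinks could be lambda m, s: mlflow.log_metrics(m, step=s) and lambda m, s: wandb.log(m, step=s).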
Best Practices
Experiment Organization
- Use descriptive experiment names: Group related runs under meaningful experiment names.
- Tag your runs: Add tags for easy filtering and comparison.
- Use run names for variants: Differentiate runs within an experiment.
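To make these three practices concrete, here is a minimal sketch using MLflow's Python API directly. The naming scheme in make_run_name and both helper names are hypothetical choices for illustration; mlflow.set_experiment and mlflow.start_run (with its run_name and tags arguments) are the real MLflow calls.

```python
from typing import Dict


def make_run_name(model: str, learning_rate: float, seed: int) -> str:
    """Derive a short, comparable run name from the settings that vary between runs."""
    short_model = model.split("/")[-1].lower()
    return f"{short_model}-lr{learning_rate:g}-seed{seed}"


def start_tagged_run(experiment_name: str, run_name: str, tags: Dict[str, str]):
    """Start a named, tagged run under the given experiment (requires mlflow)."""
    import mlflow  # imported lazily so make_run_name works without MLflow

    mlflow.set_experiment(experiment_name)  # groups related runs together
    return mlflow.start_run(run_name=run_name, tags=tags)
```

The object returned by start_tagged_run can be used as a context manager, so the run is closed automatically when the block exits.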
Remote Tracking Server
For team collaboration, set up a shared MLflow tracking server:
Artifact Storage
For large-scale experiments, configure a dedicated artifact location:
Supported storage backends include S3, Azure Blob Storage, Google Cloud Storage, and network file systems.
Performance Considerations
- MLflow logging adds minimal overhead since only rank 0 logs.
- Metrics are logged asynchronously to avoid blocking training.
- For very frequent logging (every step), consider increasing val_every_steps to reduce I/O.
Troubleshooting
MLflow Not Installed
If you see an import error:
Install MLflow:
Connection Issues
If you can’t connect to a remote tracking server:
- Verify the tracking_uri is correct
- Check network connectivity and firewall rules
- Ensure the tracking server is running
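A quick way to work through these checks is to probe the server from the machine that runs training. The sketch below uses only the Python standard library. It assumes the tracking server exposes a /health endpoint, which standard MLflow tracking servers do to my knowledge; a proxy or custom deployment may behave differently. The function names are hypothetical.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def health_url(tracking_uri: str) -> str:
    """Return the URL of the server's health endpoint (assumed to be /health)."""
    return tracking_uri.rstrip("/") + "/health"


def tracking_server_reachable(tracking_uri: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP(S) tracking server answers its health check."""
    if urlparse(tracking_uri).scheme not in ("http", "https"):
        return False  # file and database URIs cannot be probed over HTTP
    try:
        with urllib.request.urlopen(health_url(tracking_uri), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP error codes.
        return False
```

A False result for a URI you expect to work points at the network path or the server process rather than at the training configuration.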
Missing Metrics
If metrics aren’t appearing in MLflow:
- Verify you’re running on rank 0 or check rank 0 logs
- Ensure the MLflow run started successfully (check for “MLflow run started” message)
- Check that metrics are being computed during training