MLflow Exporter (mlflow)#
Exports accuracy metrics and artifacts to an MLflow Tracking Server.
Purpose: Centralize metrics, parameters, and artifacts in MLflow for experiment tracking.
Requirements:
mlflow package installed and a reachable MLflow tracking server.
Prerequisites: MLflow Server Setup
Before exporting results, ensure that an MLflow Tracking Server is running and reachable.
If no server is active, export attempts will fail with connection errors.
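A quick way to confirm the server is reachable before exporting is to query it with the MLflow client; the URI below is a placeholder for your own server:
import mlflow

# Point the client at your tracking server (placeholder URI)
mlflow.set_tracking_uri("http://127.0.0.1:5000")
# Any listing call raises a connection error if the server is unreachable
print(mlflow.search_experiments(max_results=1))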
Quick Start: Local Tracking Server
For local development or testing:
# Install the launcher with the MLflow exporter extra
pip install "nemo-evaluator-launcher[mlflow]"
# Start a local tracking server at http://127.0.0.1:5000
mlflow server --host 127.0.0.1 --port 5000
This starts MLflow with a local file-based backend store and artifact store under the current working directory.
Production Deployments
For production or multi-user setups:
Remote MLflow Server: Deploy MLflow on a dedicated VM or container.
Docker:
docker run -p 5000:5000 ghcr.io/mlflow/mlflow:latest \
  mlflow server --host 0.0.0.0
Cloud-Managed Services: Use hosted options such as Databricks MLflow or AWS SageMaker MLflow.
For detailed deployment and configuration options, see the official MLflow Tracking Server documentation.
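If your remote server sits behind HTTP basic authentication, the MLflow client reads the standard MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment variables; the credentials and URI in this connectivity check are placeholders:
import os
import mlflow

# Placeholder credentials for a server behind basic auth
os.environ["MLFLOW_TRACKING_USERNAME"] = "eval-bot"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "change-me"

mlflow.set_tracking_uri("http://mlflow.example.com:5000")
# Fails fast if the URI or the credentials are wrong
mlflow.search_experiments(max_results=1)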
Usage#
Export evaluation results to an MLflow Tracking Server for centralized experiment management.
Configure MLflow export to run automatically after evaluation completes. Add MLflow configuration to your run config YAML file:
execution:
  auto_export:
    destinations: ["mlflow"]
export:
  mlflow:
    tracking_uri: "http://mlflow.example.com:5000"
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
evaluation:
  tasks:
    - name: simple_evals.mmlu
Alternatively, you can use the MLFLOW_TRACKING_URI environment variable:
execution:
  auto_export:
    destinations: ["mlflow"]
  # Export-related env vars (placeholders expanded at runtime)
  env_vars:
    export:
      # you can skip export.mlflow.tracking_uri if you set this var
      MLFLOW_TRACKING_URI: MLFLOW_TRACKING_URI
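The placeholder is resolved from the environment of the process that performs the export, so make sure the variable is set before the run starts. Assuming the same fallback applies when calling the export_results API shown later on this page, a minimal programmatic sketch looks like this:
import os
from nemo_evaluator_launcher.api.functional import export_results

# Placeholder URI; with MLFLOW_TRACKING_URI set, tracking_uri can be omitted from the config
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.example.com:5000"
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={"experiment_name": "llm-evaluation"}
)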
Set optional fields to customize your export:
execution:
  auto_export:
    destinations: ["mlflow"]
export:
  mlflow:
    tracking_uri: "http://mlflow.example.com:5000"
    experiment_name: "llm-evaluation"
    description: "Llama 3.1 8B evaluation"
    log_metrics: ["mmlu_score_macro", "mmlu_score_micro"]
    tags:
      model_family: "llama"
      version: "3.1"
    extra_metadata:
      hardware: "A100"
      batch_size: 32
    log_artifacts: true
Run the evaluation with auto-export enabled:
nemo-evaluator-launcher run --config ./my_config.yaml
Export results programmatically after evaluation completes:
from nemo_evaluator_launcher.api.functional import export_results

# Basic MLflow export
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-evaluation"
    }
)

# Export with metadata and tags
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "llm-benchmarks",
        "run_name": "llama-3.1-8b-mmlu",
        "description": "Evaluation of Llama 3.1 8B on MMLU",
        "tags": {
            "model_family": "llama",
            "model_version": "3.1",
            "benchmark": "mmlu"
        },
        "log_metrics": ["accuracy"],
        "extra_metadata": {
            "hardware": "A100-80GB",
            "batch_size": 32
        }
    }
)

# Export with artifacts disabled
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-comparison",
        "log_artifacts": False
    }
)

# Skip if run already exists
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "nightly-evals",
        "skip_existing": True
    }
)
Alternatively, export results from the CLI after evaluation completes:
# Default export
nemo-evaluator-launcher export 8abcd123 --dest mlflow
# With overrides
nemo-evaluator-launcher export 8abcd123 --dest mlflow \
-o export.mlflow.tracking_uri=http://mlflow:5000 \
-o export.mlflow.experiment_name=my-exp
# With metric filtering
nemo-evaluator-launcher export 8abcd123 --dest mlflow --log-metrics accuracy pass@1
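After either export path completes, you can confirm that the run landed in MLflow by querying the tracking server; the URI and experiment name below are placeholders matching the earlier examples:
import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")
# List the runs the exporter created in the target experiment
runs = mlflow.search_runs(experiment_names=["llm-benchmarks"])
print(runs[["run_id", "status", "start_time"]])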
Configuration Parameters#
| Parameter | Type | Description | Default |
|---|---|---|---|
| tracking_uri | str | MLflow tracking server URI | Required if the MLFLOW_TRACKING_URI env var is not set |
| experiment_name | str | MLflow experiment name | |
| run_name | str | Run display name | Auto-generated |
| description | str | Run description | None |
| tags | dict[str, str] | Custom tags for the run | None |
| extra_metadata | dict | Additional parameters logged to MLflow | None |
| skip_existing | bool | Skip export if a run already exists for the invocation. Useful to avoid creating duplicate runs when re-exporting. | |
| log_metrics | list[str] | Filter metrics by substring match | All metrics |
| log_artifacts | bool | Upload evaluation artifacts | |
| | bool | Upload execution logs | |
| | bool | Copy only required artifacts | |
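The log_metrics option filters by substring match, so a single entry can select a whole family of related metrics. The metric names and values below are made up purely to illustrate the matching behavior described in the table:
# Hypothetical metric names and values, used only to illustrate substring matching
log_metrics = ["mmlu"]
all_metrics = {"mmlu_score_macro": 0.71, "mmlu_score_micro": 0.69, "gsm8k_accuracy": 0.55}
exported = {name: value for name, value in all_metrics.items()
            if any(pattern in name for pattern in log_metrics)}
print(exported)  # {'mmlu_score_macro': 0.71, 'mmlu_score_micro': 0.69}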