MLflow Exporter (mlflow)#

Exports accuracy metrics and artifacts to an MLflow Tracking Server.

  • Purpose: Centralize metrics, parameters, and artifacts in MLflow for experiment tracking

  • Requirements: the mlflow package installed and a reachable MLflow tracking server (a quick connectivity check is sketched below)
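
If you want to confirm both requirements before wiring up the exporter, a minimal check is to point the mlflow client at your server and list experiments. The tracking URI below is a placeholder for your own server.

import mlflow

# Placeholder tracking server address; replace with your own.
mlflow.set_tracking_uri("http://mlflow.example.com:5000")

# Listing experiments fails fast if the mlflow package is missing
# or the tracking server is unreachable.
print(mlflow.search_experiments())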

Usage#

Export evaluation results to an MLflow Tracking Server for centralized experiment management.

Configure MLflow export to run automatically after evaluation completes. Add MLflow configuration to your run config YAML file:

execution:
  auto_export:
    destinations: ["mlflow"]
  
  # Export-related env vars (placeholders expanded at runtime)
  env_vars:
    export:
      MLFLOW_TRACKING_URI: MLFLOW_TRACKING_URI # or set tracking_uri under export.mlflow
      PATH: "/path/to/conda/env/bin:$PATH"

export:
  mlflow:
    tracking_uri: "http://mlflow.example.com:5000"
    experiment_name: "llm-evaluation"
    description: "Llama 3.1 8B evaluation"
    log_metrics: ["accuracy", "f1"]
    tags:
      model_family: "llama"
      version: "3.1"
    extra_metadata:
      hardware: "A100"
      batch_size: 32
    log_artifacts: true

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: simple_evals.mmlu

Run the evaluation with auto-export enabled:

nemo-evaluator-launcher run --config-dir . --config-name my_config

Export results programmatically after evaluation completes:

from nemo_evaluator_launcher.api.functional import export_results

# Basic MLflow export
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-evaluation"
    }
)

# Export with metadata and tags
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "llm-benchmarks",
        "run_name": "llama-3.1-8b-mmlu",
        "description": "Evaluation of Llama 3.1 8B on MMLU",
        "tags": {
            "model_family": "llama",
            "model_version": "3.1",
            "benchmark": "mmlu"
        },
        "log_metrics": ["accuracy"],
        "extra_metadata": {
            "hardware": "A100-80GB",
            "batch_size": 32
        }
    }
)

# Export with artifacts disabled
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-comparison",
        "log_artifacts": False
    }
)

# Skip if run already exists
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "nightly-evals",
        "skip_existing": True
    }
)

Export results from the command line after evaluation completes:

# Default export
nemo-evaluator-launcher export 8abcd123 --dest mlflow

# With overrides
nemo-evaluator-launcher export 8abcd123 --dest mlflow \
  -o export.mlflow.tracking_uri=http://mlflow:5000 \
  -o export.mlflow.experiment_name=my-exp

# With metric filtering
nemo-evaluator-launcher export 8abcd123 --dest mlflow --log-metrics accuracy pass@1
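
After any of these exports, you can confirm that the run and its metrics reached the server with a few lines of the mlflow client. The tracking URI and experiment name below mirror the override example above and are assumptions to replace with your own values.

import mlflow

# Assumed values matching the override example above; adjust to your setup.
mlflow.set_tracking_uri("http://mlflow:5000")

# search_runs returns a pandas DataFrame; metric columns are named "metrics.<name>".
runs = mlflow.search_runs(experiment_names=["my-exp"])
metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id"] + metric_cols])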

Configuration Parameters#

| Parameter | Type | Description | Default |
|---|---|---|---|
| tracking_uri | str | MLflow tracking server URI | Required if the MLFLOW_TRACKING_URI env var is not set |
| experiment_name | str | MLflow experiment name | "nemo-evaluator-launcher" |
| run_name | str | Run display name | Auto-generated |
| description | str | Run description | None |
| tags | dict[str, str] | Custom tags for the run | None |
| extra_metadata | dict | Additional parameters logged to MLflow | None |
| skip_existing | bool | Skip export if a run already exists for the invocation; avoids duplicate runs when re-exporting | false |
| log_metrics | list[str] | Filter metrics by substring match | All metrics |
| log_artifacts | bool | Upload evaluation artifacts | true |
| log_logs | bool | Upload execution logs | false |
| only_required | bool | Copy only required artifacts | true |
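
As an illustration of how these parameters combine, here is a sketch of a programmatic export that also sets log_logs and only_required, which the earlier examples do not show. The invocation id, tracking URI, and run name are placeholders.

from nemo_evaluator_launcher.api.functional import export_results

# Placeholder invocation id and tracking URI; parameter values are illustrative.
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "nightly-evals",
        "run_name": "llama-3.1-8b-nightly",  # otherwise auto-generated
        "skip_existing": True,               # do not create a duplicate run
        "log_metrics": ["accuracy"],         # substring filter on metric names
        "log_artifacts": True,               # upload evaluation artifacts
        "log_logs": False,                   # keep execution logs local
        "only_required": True,               # copy only required artifacts
    },
)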