MLflow Exporter (mlflow)#

Exports accuracy metrics and artifacts to an MLflow Tracking Server.

  • Purpose: Centralize metrics, parameters, and artifacts in MLflow for experiment tracking

  • Requirements: mlflow package installed and a reachable MLflow tracking server
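
Before configuring the exporter, you can optionally confirm that the mlflow package is installed and that the tracking server is reachable. A minimal sketch, assuming MLflow 2.x and a placeholder server URL:

import mlflow
from mlflow import MlflowClient

# Point the client at the tracking server (placeholder URL).
mlflow.set_tracking_uri("http://mlflow.example.com:5000")

# Listing experiments fails fast if the server is unreachable.
client = MlflowClient()
print([e.name for e in client.search_experiments()])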

Usage#

Export evaluation results to an MLflow Tracking Server for centralized experiment management.

Configure MLflow export to run automatically after evaluation completes. Add MLflow configuration to your run config YAML file:

execution:
  auto_export:
    destinations: ["mlflow"]
    configs:
      mlflow:
        tracking_uri: "http://mlflow.example.com:5000"
        experiment_name: "llm-evaluation"
        description: "Llama 3.1 8B evaluation"
        log_metrics: ["accuracy", "f1"]
        tags:
          model_family: "llama"
          version: "3.1"
        extra_metadata:
          hardware: "A100"
          batch_size: 32
        log_artifacts: true

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: simple_evals.mmlu

Run the evaluation with auto-export enabled:

nemo-evaluator-launcher run --config-dir . --config-name my_config

Export results programmatically after evaluation completes:

from nemo_evaluator_launcher.api.functional import export_results

# Basic MLflow export
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-evaluation"
    }
)

# Export with metadata and tags
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "llm-benchmarks",
        "run_name": "llama-3.1-8b-mmlu",
        "description": "Evaluation of Llama 3.1 8B on MMLU",
        "tags": {
            "model_family": "llama",
            "model_version": "3.1",
            "benchmark": "mmlu"
        },
        "log_metrics": ["accuracy"],
        "extra_metadata": {
            "hardware": "A100-80GB",
            "batch_size": 32
        }
    }
)

# Export with artifacts disabled
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-comparison",
        "log_artifacts": False
    }
)

# Skip if run already exists
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "nightly-evals",
        "skip_existing": True
    }
)
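
Exported runs can then be inspected with the standard MLflow client. A minimal sketch, assuming the tracking URI and experiment name used above; the columns shown are part of MLflow's default pandas output:

import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")

# search_runs returns a pandas DataFrame with params, metrics, and tags
# for every run in the listed experiments.
runs = mlflow.search_runs(experiment_names=["llm-benchmarks"])
print(runs[["run_id", "status", "start_time"]].head())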

Configuration Parameters#

| Parameter | Type | Description | Default |
|---|---|---|---|
| tracking_uri | str | MLflow tracking server URI | Required |
| experiment_name | str | MLflow experiment name | "nemo-evaluator-launcher" |
| run_name | str | Run display name | Auto-generated |
| description | str | Run description | None |
| tags | dict[str, str] | Custom tags for the run | None |
| extra_metadata | dict | Additional parameters logged to MLflow | None |
| skip_existing | bool | Skip export if run exists for invocation | false |
| log_metrics | list[str] | Filter metrics by substring match | All metrics |
| log_artifacts | bool | Upload evaluation artifacts | true |
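
The log_metrics option filters by substring match against metric names. The snippet below illustrates that matching behavior; it is not the exporter's implementation, the metric names are hypothetical, and case-sensitive matching is assumed:

# Hypothetical metric names produced by an evaluation run.
metric_names = ["mmlu_accuracy", "gsm8k_accuracy", "squad_f1", "bleu"]

# Documented behavior: keep metrics whose names contain any listed substring.
log_metrics = ["accuracy", "f1"]
kept = [name for name in metric_names if any(s in name for s in log_metrics)]
print(kept)  # ['mmlu_accuracy', 'gsm8k_accuracy', 'squad_f1']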