MLflow Exporter (mlflow)#

Exports accuracy metrics and artifacts to an MLflow Tracking Server.

  • Purpose: Centralize metrics, parameters, and artifacts in MLflow for experiment tracking

  • Requirements: the mlflow package installed and a reachable MLflow tracking server (a quick connectivity check is sketched below)
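
If you want to confirm both requirements before wiring up the exporter, a minimal check is to point the mlflow client at your server and list experiments. The tracking URI below is a placeholder for your own server.

import mlflow

# Placeholder tracking server address; replace with your own.
mlflow.set_tracking_uri("http://mlflow.example.com:5000")

# Listing experiments fails fast if the mlflow package is missing
# or the tracking server is unreachable.
print(mlflow.search_experiments())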

Usage#

Export evaluation results to an MLflow Tracking Server for centralized experiment management.

Configure MLflow export to run automatically after evaluation completes. Add MLflow configuration to your run config YAML file:

execution:
  auto_export:
    destinations: ["mlflow"]
  
  # Export-related env vars (placeholders expanded at runtime)
  env_vars:
    export:
      MLFLOW_TRACKING_URI: MLFLOW_TRACKING_URI # or set tracking_uri under export.mlflow
      PATH: "/path/to/conda/env/bin:$PATH"

export:
  mlflow:
    tracking_uri: "http://mlflow.example.com:5000"
    experiment_name: "llm-evaluation"
    description: "Llama 3.1 8B evaluation"
    log_metrics: ["accuracy", "f1"]
    tags:
      model_family: "llama"
      version: "3.1"
    extra_metadata:
      hardware: "A100"
      batch_size: 32
    log_artifacts: true

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: simple_evals.mmlu

Run the evaluation with auto-export enabled:

nemo-evaluator-launcher run --config-dir . --config-name my_config

Export results programmatically after evaluation completes:

from nemo_evaluator_launcher.api.functional import export_results

# Basic MLflow export
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-evaluation"
    }
)

# Export with metadata and tags
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "llm-benchmarks",
        "run_name": "llama-3.1-8b-mmlu",
        "description": "Evaluation of Llama 3.1 8B on MMLU",
        "tags": {
            "model_family": "llama",
            "model_version": "3.1",
            "benchmark": "mmlu"
        },
        "log_metrics": ["accuracy"],
        "extra_metadata": {
            "hardware": "A100-80GB",
            "batch_size": 32
        }
    }
)

# Export with artifacts disabled
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-comparison",
        "log_artifacts": False
    }
)

# Skip if run already exists
export_results(
    invocation_ids=["8abcd123"], 
    dest="mlflow", 
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "nightly-evals",
        "skip_existing": True
    }
)

Export results from the command line after evaluation completes:

# Default export
nemo-evaluator-launcher export 8abcd123 --dest mlflow

# With overrides
nemo-evaluator-launcher export 8abcd123 --dest mlflow \
  -o export.mlflow.tracking_uri=http://mlflow:5000 \
  -o export.mlflow.experiment_name=my-exp

# With metric filtering
nemo-evaluator-launcher export 8abcd123 --dest mlflow --log-metrics accuracy pass@1
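
After any of these exports, you can confirm that the run and its metrics reached the server with a few lines of the mlflow client. The tracking URI and experiment name below mirror the override example above and are assumptions to replace with your own values.

import mlflow

# Assumed values matching the override example above; adjust to your setup.
mlflow.set_tracking_uri("http://mlflow:5000")

# search_runs returns a pandas DataFrame; metric columns are named "metrics.<name>".
runs = mlflow.search_runs(experiment_names=["my-exp"])
metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id"] + metric_cols])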

Configuration Parameters#

| Parameter | Type | Description | Default |
|---|---|---|---|
| tracking_uri | str | MLflow tracking server URI | Required if the MLFLOW_TRACKING_URI env var is not set |
| experiment_name | str | MLflow experiment name | "nemo-evaluator-launcher" |
| run_name | str | Run display name | Auto-generated |
| description | str | Run description | None |
| tags | dict[str, str] | Custom tags for the run | None |
| extra_metadata | dict | Additional parameters logged to MLflow | None |
| skip_existing | bool | Skip export if a run already exists for the invocation; avoids duplicate runs when re-exporting | false |
| log_metrics | list[str] | Filter metrics by substring match | All metrics |
| log_artifacts | bool | Upload evaluation artifacts | true |
| log_logs | bool | Upload execution logs | false |
| only_required | bool | Copy only required artifacts | true |
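
As an illustration of how these parameters combine, here is a sketch of a programmatic export that also sets log_logs and only_required, which the earlier examples do not show. The invocation id, tracking URI, and run name are placeholders.

from nemo_evaluator_launcher.api.functional import export_results

# Placeholder invocation id and tracking URI; parameter values are illustrative.
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "nightly-evals",
        "run_name": "llama-3.1-8b-nightly",  # otherwise auto-generated
        "skip_existing": True,               # do not create a duplicate run
        "log_metrics": ["accuracy"],         # substring filter on metric names
        "log_artifacts": True,               # upload evaluation artifacts
        "log_logs": False,                   # keep execution logs local
        "only_required": True,               # copy only required artifacts
    },
)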