Weights & Biases Exporter (wandb)#

Exports evaluation metrics and artifacts to W&B. Supports either per-task runs or a single multi-task run per invocation, with artifact logging and run metadata.

  • Purpose: Track runs, metrics, and artifacts in W&B

  • Requirements: wandb installed and credentials configured (see the login sketch below)
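
A minimal login sketch using the standard wandb Python client; running wandb login or exporting WANDB_API_KEY in your shell is equivalent:

import wandb

# Authenticate once per environment. wandb.login() picks up WANDB_API_KEY
# from the environment if set; otherwise it prompts for an API key.
wandb.login()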

Usage#

Export evaluation results to Weights & Biases for experiment tracking, visualization, and collaboration.

Basic export to W&B using credentials and project settings from your evaluation configuration:

# Export to W&B (uses config from evaluation run)
nemo-evaluator-launcher export 8abcd123 --dest wandb

# Filter metrics to export specific measurements
nemo-evaluator-launcher export 8abcd123 --dest wandb --log-metrics accuracy f1_score

Note

Specify W&B configuration (entity, project, tags, etc.) in your evaluation YAML configuration file under execution.auto_export.configs.wandb. The CLI export command reads these settings from the stored job configuration.

Export results programmatically with W&B configuration:

from nemo_evaluator_launcher.api.functional import export_results

# Basic W&B export
export_results(
    invocation_ids=["8abcd123"], 
    dest="wandb", 
    config={
        "entity": "myorg", 
        "project": "model-evaluations"
    }
)

# Export with metadata and organization
export_results(
    invocation_ids=["8abcd123"], 
    dest="wandb", 
    config={
        "entity": "myorg",
        "project": "llm-benchmarks",
        "name": "llama-3.1-8b-eval",
        "group": "llama-family-comparison",
        "description": "Evaluation of Llama 3.1 8B on benchmarks",
        "tags": ["llama-3.1", "8b"],
        "log_mode": "per_task",
        "log_metrics": ["accuracy"],
        "log_artifacts": True,
        "extra_metadata": {
            "hardware": "A100-80GB"
        }
    }
)

# Multi-task mode: single run for all tasks
export_results(
    invocation_ids=["8abcd123"], 
    dest="wandb", 
    config={
        "entity": "myorg",
        "project": "model-comparison",
        "log_mode": "multi_task",
        "log_artifacts": False
    }
)
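
Because extra_metadata is stored in the run config, you can read it back afterwards with the public W&B API. A minimal sketch; <run-id> is a placeholder taken from the W&B UI, and the exact key layout in the config is determined by the exporter:

import wandb

# Inspect the config of an exported run via the public W&B API.
api = wandb.Api()
run = api.run("myorg/llm-benchmarks/<run-id>")  # entity/project/run-id
print(run.config)  # includes the extra_metadata logged above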

Configure W&B export in your evaluation YAML file for automatic export on completion:

execution:
  auto_export:
    destinations: ["wandb"]
  
  # Export-related env vars (placeholders expanded at runtime)
  env_vars:
    export:
      WANDB_API_KEY: WANDB_API_KEY  # resolved from your local environment at runtime
      PATH: "/path/to/conda/env/bin:$PATH"

export:
  wandb:
    entity: "myorg"
    project: "llm-benchmarks"
    name: "llama-3.1-8b-instruct-v1"
    group: "baseline-evals"
    tags: ["llama-3.1", "baseline"]
    description: "Baseline evaluation"
    log_mode: "multi_task"
    log_metrics: ["accuracy"]
    log_artifacts: true
    log_logs: true
    only_required: false
    
    extra_metadata:
      hardware: "H100"
      checkpoint: "path/to/checkpoint"

Configuration Parameters#

| Parameter | Type | Description | Default |
|---|---|---|---|
| entity | str | W&B entity (organization or username) | Required |
| project | str | W&B project name | Required |
| log_mode | str | Logging mode: per_task creates a separate run for each evaluation task; multi_task creates a single run for all tasks | per_task |
| name | str | Run display name. If not specified, auto-generated as eval-{invocation_id}-{benchmark} (per_task) or eval-{invocation_id} (multi_task) | Auto-generated |
| group | str | Run group for organizing related runs | Invocation ID |
| tags | list[str] | Tags for categorizing the run | None |
| description | str | Run description (stored as W&B notes) | None |
| log_metrics | list[str] | Metric name patterns to filter (e.g., ["accuracy", "f1"]); logs only metrics whose names contain one of these substrings | All metrics |
| log_artifacts | bool | Whether to upload evaluation artifacts (results files, configs) to W&B | true |
| log_logs | bool | Whether to upload execution logs | false |
| only_required | bool | Copy only required artifacts | true |
| extra_metadata | dict | Additional metadata stored in the run config (e.g., hardware, hyperparameters) | {} |
| job_type | str | W&B job type classification | evaluation |
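
The log_metrics filter matches on substrings of metric names, as described above. A hypothetical sketch of that behavior (not the exporter's actual implementation):

def filter_metrics(metrics: dict, patterns: list | None) -> dict:
    """Keep metrics whose names contain any of the given substrings."""
    if not patterns:
        return metrics  # default: log all metrics
    return {k: v for k, v in metrics.items() if any(p in k for p in patterns)}

# ["accuracy", "f1"] matches "accuracy" and "f1_score", but not "latency_ms"
filter_metrics({"accuracy": 0.81, "f1_score": 0.77, "latency_ms": 120.0}, ["accuracy", "f1"])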