Weights & Biases Exporter (wandb)#
Exports accuracy metrics and artifacts to W&B. Supports either per-task runs or a single multi-task run per invocation, with artifact logging and run metadata.
Purpose: Track runs, metrics, and artifacts in W&B.
Requirements: wandb installed and credentials configured.
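If the client is not already set up, a typical one-time setup looks like this (an API key can also be supplied via the WANDB_API_KEY environment variable instead of an interactive login):
# Install the W&B client and authenticate (one-time setup)
pip install wandb
wandb login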
Usage#
Export evaluation results to Weights & Biases for experiment tracking, visualization, and collaboration.
Basic export to W&B using credentials and project settings from your evaluation configuration:
# Export to W&B (uses config from evaluation run)
nemo-evaluator-launcher export 8abcd123 --dest wandb
# Filter metrics to export specific measurements
nemo-evaluator-launcher export 8abcd123 --dest wandb --log-metrics accuracy f1_score
Note
Specify W&B configuration (entity, project, tags, etc.) in your evaluation YAML configuration file under execution.auto_export.configs.wandb. The CLI export command reads these settings from the stored job configuration.
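As a minimal sketch, the nesting named in the note would look like this (values are illustrative; the fuller example later in this section configures the exporter under a top-level export block):
execution:
  auto_export:
    destinations: ["wandb"]
    configs:
      wandb:
        entity: "myorg"
        project: "llm-benchmarks"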
Export results programmatically with W&B configuration:
from nemo_evaluator_launcher.api.functional import export_results
# Basic W&B export
export_results(
    invocation_ids=["8abcd123"],
    dest="wandb",
    config={
        "entity": "myorg",
        "project": "model-evaluations"
    }
)
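Because invocation_ids accepts a list, several evaluations can be exported in a single call; the second ID below is a hypothetical placeholder:
# Export two invocations in one call ("9ef45678" is a hypothetical ID)
export_results(
    invocation_ids=["8abcd123", "9ef45678"],
    dest="wandb",
    config={
        "entity": "myorg",
        "project": "model-evaluations"
    }
)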
# Export with metadata and organization
export_results(
    invocation_ids=["8abcd123"],
    dest="wandb",
    config={
        "entity": "myorg",
        "project": "llm-benchmarks",
        "name": "llama-3.1-8b-eval",
        "group": "llama-family-comparison",
        "description": "Evaluation of Llama 3.1 8B on benchmarks",
        "tags": ["llama-3.1", "8b"],
        "log_mode": "per_task",
        "log_metrics": ["accuracy"],
        "log_artifacts": True,
        "extra_metadata": {
            "hardware": "A100-80GB"
        }
    }
)
# Multi-task mode: single run for all tasks
export_results(
    invocation_ids=["8abcd123"],
    dest="wandb",
    config={
        "entity": "myorg",
        "project": "model-comparison",
        "log_mode": "multi_task",
        "log_artifacts": False
    }
)
Configure W&B export in your evaluation YAML file for automatic export on completion:
execution:
  auto_export:
    destinations: ["wandb"]
  # Export-related env vars (placeholders expanded at runtime)
  env_vars:
    export:
      WANDB_API_KEY: WANDB_API_KEY
      PATH: "/path/to/conda/env/bin:$PATH"

export:
  wandb:
    entity: "myorg"
    project: "llm-benchmarks"
    name: "llama-3.1-8b-instruct-v1"
    group: "baseline-evals"
    tags: ["llama-3.1", "baseline"]
    description: "Baseline evaluation"
    log_mode: "multi_task"
    log_metrics: ["accuracy"]
    log_artifacts: true
    log_logs: true
    only_required: false
    extra_metadata:
      hardware: "H100"
      checkpoint: "path/to/checkpoint"
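After an export completes, you can sanity-check what landed using the public W&B API; this sketch assumes the entity and project above, and that a metric named accuracy was among the exported metrics:
import wandb

# List exported runs and one logged summary metric via the public W&B API
api = wandb.Api()
for run in api.runs("myorg/llm-benchmarks"):
    print(run.name, run.group, run.summary.get("accuracy"))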
Configuration Parameters#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `entity` | str | W&B entity (organization or username) | Required |
| `project` | str | W&B project name | Required |
| `log_mode` | str | Logging mode: `per_task` (one run per task) or `multi_task` (one run per invocation) | |
| `name` | str | Run display name (auto-generated if not specified) | Auto-generated |
| `group` | str | Run group for organizing related runs | Invocation ID |
| `tags` | list[str] | Tags for categorizing the run | None |
| `description` | str | Run description (stored as W&B notes) | None |
| `log_metrics` | list[str] | Metric name patterns to filter (e.g., `accuracy`) | All metrics |
| `log_artifacts` | bool | Whether to upload evaluation artifacts (results files, configs) to W&B | |
| `log_logs` | bool | Whether to upload execution logs | |
| `only_required` | bool | Copy only required artifacts | |
| `extra_metadata` | dict | Additional metadata stored in run config (e.g., hardware, hyperparameters) | |
| `job_type` | str | W&B job type classification | |