Evaluation Metrics#

Metrics define how to score the outputs of your models, agents, or pipelines.

What is a metric?#

A metric is a reusable scoring definition that evaluates model or agent outputs. There are two kinds of metrics: built-in metrics and custom metrics.

  • Inputs: For custom metrics, inputs define the scoring logic in terms of dataset fields and model outputs; judge-based custom metrics additionally take judge-model inputs (for example, judge prompts, rubrics, and configuration).

  • Outputs: Row-level scores and aggregate statistics.

  • Scope: Custom metrics are workspace-scoped and referenced as workspace/name.

Note

Terminology on this page:

  • Metric definition: The reusable scoring configuration.

  • Metric type: The metric family (for example exact-match, BLEU, LLM-as-a-judge).

  • Metric score: The numeric or rubric output produced at evaluation time.

Note

Creating, updating, and deleting custom metrics requires write access to the workspace.

The Evaluation Workflow#

[1] Choose or create a metric
            |
            v
[2] Select a dataset and execution mode
            |
            v
[3] Create and run an evaluation job
            |
            v
[4] Review row-level and aggregate scores

Quick Start#

Minimal live evaluation with a built-in metric:

import os

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import DatasetRows, ExactMatchMetric

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

result = client.evaluation.metrics.evaluate(
    metric=ExactMatchMetric(type="exact-match", reference="{{item.ref}}", candidate="{{item.pred}}"),
    dataset=DatasetRows(rows=[{"ref": "Paris", "pred": "Paris"}]),
)
print(result.aggregate_scores[0].mean)
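The `{{item.ref}}` and `{{item.pred}}` strings above are Jinja2-style templates resolved against each dataset row. The platform performs this resolution server-side; the helper below is an illustrative local sketch only (its names are not part of the SDK):

```python
import re

def render(template: str, row: dict) -> str:
    """Substitute {{item.field}} placeholders with values from a dataset row."""
    return re.sub(
        r"\{\{\s*item\.(\w+)\s*\}\}",
        lambda m: str(row[m.group(1)]),
        template,
    )

def exact_match(reference_tpl: str, candidate_tpl: str, row: dict) -> float:
    """Score 1.0 when the rendered reference and candidate are identical."""
    return 1.0 if render(reference_tpl, row) == render(candidate_tpl, row) else 0.0

row = {"ref": "Paris", "pred": "Paris"}
print(exact_match("{{item.ref}}", "{{item.pred}}", row))  # 1.0
```

With the row from the Quick Start, both templates render to "Paris", so the row score is 1.0 and the aggregate mean over the one-row dataset is also 1.0.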

Execution Modes#

Metrics can be executed in two modes:

| Mode | Use Case | Response |
|------|----------|----------|
| Live Evaluation | Rapid prototyping, developing metrics, testing configurations. Limited to 10 rows. | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Async (poll for completion) |

Important

Live evaluation is limited to 10 dataset rows. Use job evaluation for larger datasets.
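Job evaluation returns immediately with a job handle that you poll until it reaches a terminal state. The generic polling pattern looks like the sketch below; the `get_status` callable stands in for whatever status-retrieval call the SDK exposes, which this page does not specify:

```python
import time
from typing import Callable

def wait_for_job(get_status: Callable[[], str],
                 poll_interval: float = 5.0,
                 timeout: float = 3600.0) -> str:
    """Poll a job's status until it reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        # Terminal states here are illustrative; check the jobs API for the real set.
        if status in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("evaluation job did not finish in time")

# Example with a stand-in status source:
statuses = iter(["running", "running", "completed"])
print(wait_for_job(lambda: next(statuses), poll_interval=0.0))  # completed
```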

Built-in vs. Custom Metrics#

  • Built-in metrics: Ready-to-use metrics provided by NeMo Platform (for example exact-match, bleu, rouge).

  • Custom metrics: Metrics you define for domain-specific evaluation needs.

To discover available built-in and system metrics, see Manage Metrics. For custom metric creation guides, start with Similarity Metrics, LLM-as-a-Judge, or Bring Your Own Metric.

Datasets#

Evaluation jobs need dataset input. You can provide data in three ways:

| Dataset Source | Description | Best For |
|----------------|-------------|----------|
| DatasetRows | Inline rows sent directly in the request | Quick testing and live evaluation |
| FilesetRef | Reference to a persisted fileset (`workspace/fileset-name`) | Production jobs and reusable datasets |
| Fileset | Inline fileset definition with storage configuration | Direct access to external storage |

Examples of FilesetRef values that select specific files or globs:

# Include all files in subdirectory
dataset = "my-workspace/my-dataset#subdir/path"

# Single file
dataset = "my-workspace/my-dataset#file.jsonl"

# Single file in a subdirectory
dataset = "my-workspace/my-dataset#subdir/path/file.jsonl"

# Glob match files
dataset = "my-workspace/my-dataset#*.jsonl"

# Glob match files in subdirectory
dataset = "my-workspace/my-dataset#subdir/path/*.jsonl"
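The reference string above is `workspace/fileset-name`, optionally followed by `#` and a path or glob. A small parser sketch, assuming only that shape (the helper names are illustrative, not part of the SDK):

```python
import fnmatch
from typing import Optional, Tuple

def parse_fileset_ref(ref: str) -> Tuple[str, str, Optional[str]]:
    """Split 'workspace/name#selector' into its three parts."""
    locator, _, selector = ref.partition("#")
    workspace, _, name = locator.partition("/")
    return workspace, name, selector or None

def matches(selector: str, path: str) -> bool:
    """Check whether a file path in the fileset matches the selector glob.

    Note: fnmatch's '*' also matches '/', so this is looser than shell globs;
    the platform's actual matching rules may differ.
    """
    return fnmatch.fnmatch(path, selector)

ws, name, sel = parse_fileset_ref("my-workspace/my-dataset#subdir/path/*.jsonl")
print(ws, name, sel)  # my-workspace my-dataset subdir/path/*.jsonl
print(matches(sel, "subdir/path/train.jsonl"))  # True
```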

Available Metric Types#

Use the metric-type pages below to create and configure custom metrics.

  • LLM-as-a-Judge: Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges. See Evaluate with LLM-as-a-Judge.

  • Agentic Metrics: Evaluate agent workflows including tool-calling accuracy, goal completion, and topic adherence. See Agentic Evaluation Metrics.

  • RAG Metrics: Evaluate RAG pipelines for retrieval quality and answer generation using RAGAS metrics. See RAG Evaluation Metrics.

  • Retriever Metrics: Evaluate document retrieval pipelines using standard IR metrics like recall@k and NDCG. See Retriever Evaluation Metrics.

  • Similarity Metrics: Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating. See Similarity Metrics.

  • Bring Your Own Metric: Integrate custom evaluation endpoints for domain-specific scoring. See Bring Your Own Metric.
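Bring Your Own Metric routes rows to an evaluation endpoint you host. As a sketch of the idea only (the actual request/response contract is documented in the Bring Your Own Metric guide, and the field names below are assumptions), a scorer might accept a batch of rows and return one score per row:

```python
from typing import Any, Dict, List

def score_rows(payload: Dict[str, Any]) -> Dict[str, List[float]]:
    """Hypothetical endpoint body: score each row by whether pred equals ref."""
    scores = [
        1.0 if row.get("pred") == row.get("ref") else 0.0
        for row in payload["rows"]
    ]
    return {"scores": scores}

print(score_rows({"rows": [{"ref": "Paris", "pred": "Paris"},
                           {"ref": "Paris", "pred": "Lyon"}]}))
# {'scores': [1.0, 0.0]}
```

In practice this function would sit behind an HTTP endpoint that the platform calls during evaluation; any domain-specific logic can replace the equality check.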

Understanding Scores#

Scores are the metric outputs produced during evaluation:

| Score Type | Meaning | Typical Use |
|------------|---------|-------------|
| Row scores | Score(s) for each dataset row | Debugging failures and error analysis |
| Aggregate scores | Statistics computed over all rows | Tracking overall quality and regressions |
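Aggregate scores are summary statistics computed over the row scores. The platform computes these for you; a minimal local sketch of the idea (the exact statistics reported may differ):

```python
from statistics import mean

def aggregate(row_scores):
    """Summarize per-row scores into simple aggregate statistics."""
    return {
        "count": len(row_scores),
        "mean": mean(row_scores),
        "min": min(row_scores),
        "max": max(row_scores),
    }

print(aggregate([1.0, 0.0, 1.0, 1.0]))
# {'count': 4, 'mean': 0.75, 'min': 0.0, 'max': 1.0}
```

A regression between two evaluation runs shows up as a drop in the aggregate mean, while the row scores identify which rows caused it.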

Manage Custom Metrics#

Create, update, and delete custom metrics that can be reused across evaluation jobs. See Manage Metrics for API documentation.