Bring Your Own Metric#

NeMo Platform offers built-in metrics that you can configure to evaluate your custom data. With remote metrics, you can also bring your own metric into the NeMo Platform ecosystem.

A remote metric integrates with your custom evaluation logic served behind a REST API, giving you full control over the evaluation that executes and the scores that are reported.

Overview#

Remote metrics support two types:

| Type | Use Case | Payload Structure |
| --- | --- | --- |
| Generic Remote (remote) | Custom endpoints with configurable body/scores | User-defined Jinja template |
| NeMo Agent Toolkit Remote (nemo-agent-toolkit-remote) | NAT evaluator endpoints | Fixed: {evaluator_name, item} |

NeMo Evaluator supports two evaluation modes:

| Mode | Use Case | Dataset Size | Response |
| --- | --- | --- | --- |
| Live Evaluation | Rapid prototyping, testing | Up to 10 rows | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Unlimited | Async (poll for completion) |

Prerequisites#

Before running remote metric evaluations:

  1. Workspace: Have a workspace created.

  2. Remote endpoint: Have your evaluation endpoint running and accessible.

  3. API key (if required): If your endpoint requires authentication, create a secret to store the API key.

  4. Initialize the SDK:

import os
from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import (
    EvaluateDatasetRowsParam,
    RemoteMetricParam,
    NeMoAgentToolkitRemoteMetricParam,
)

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Live Evaluation#

Live evaluation provides immediate results for rapid iteration when developing and testing your metrics.

Generic Remote Metric#

Use a generic remote metric when you need full control over the request payload and score extraction:

metric: RemoteMetricParam = {
    "type": "remote",
    "url": "https://my-evaluation-server.test/evaluate",
    "body": {
        "reference": "{{ item.reference }}",
        "response": "{{ item.output }}"
    },
    "scores": [
        {
            "name": "accuracy",
            "parser": {"type": "json", "json_path": "$.result.accuracy"}
        }
    ],
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

dataset: EvaluateDatasetRowsParam = {
    "rows": [
        {"reference": "The capital is Paris", "output": "Paris is the capital"},
        {"reference": "2", "output": "2"},
    ]
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset=dataset,
)

# Access results
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

Key configuration:

  • body: Jinja template for the request payload. Use {{ item.<column> }} to access dataset columns (see the rendered example below).

  • scores: List of score definitions, each with a parser object containing a JSONPath expression for extracting values from the response.
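
For example, given the first dataset row above, the body template renders to the following request payload (illustrative; the values come from your own dataset columns):

{
    "reference": "The capital is Paris",
    "response": "Paris is the capital"
}

The accuracy score is then extracted from the endpoint's JSON response using the $.result.accuracy JSONPath, for example from {"result": {"accuracy": 1.0}}.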

NeMo Agent Toolkit Remote Metric#

Use the NAT remote metric type when integrating with NeMo Agent Toolkit evaluators:

metric: NeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://localhost:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

dataset: EvaluateDatasetRowsParam = {
    "rows": [
        {
            "id": "item_1",
            "input_obj": "What is the capital of France?",
            "expected_output_obj": "The capital of France is Paris.",
            "output_obj": "Paris is the capital of France.",
            "trajectory": [],
            "expected_trajectory": [],
            "full_dataset_entry": {},
        }
    ]
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset=dataset,
)

print(f"Score: {result.aggregate_scores[0].mean}")

The NAT metric automatically:

  • Sends payload: {"evaluator_name": "<name>", "item": <row_data>}

  • Extracts score from: $.result.score


Job-Based Evaluation#

For larger datasets or production workloads, use job-based evaluation. Jobs run asynchronously and support datasets of any size.

Create a Job with Inline Metric#

from nemo_platform.types.evaluation import (
    RemoteMetricParam,
    MetricOfflineJobParam,
)

metric: RemoteMetricParam = {
    "type": "remote",
    "url": "https://my-evaluation-server.test/evaluate",
    "body": {
        "reference": "{{ item.reference }}",
        "response": "{{ item.output }}"
    },
    "scores": [
        {
            "name": "accuracy",
            "parser": {"type": "json", "json_path": "$.result.accuracy"}
        }
    ],
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric=metric,
        dataset={
            "rows": [
                {"reference": "Paris", "output": "Paris"},
                {"reference": "2", "output": "2"},
            ]
        },
    ),
)

print(f"Job created: {job.name} ({job.id})")
from nemo_platform.types.evaluation import (
    NeMoAgentToolkitRemoteMetricParam,
    MetricOfflineJobParam,
)

metric: NeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://host.docker.internal:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric=metric,
        dataset={
            "rows": [
                {
                    "id": "item_1",
                    "input_obj": "What is the capital of France?",
                    "expected_output_obj": "The capital of France is Paris.",
                    "output_obj": "Paris is the capital.",
                    "trajectory": [],
                    "expected_trajectory": [],
                    "full_dataset_entry": {},
                }
            ]
        },
    ),
)

print(f"Job created: {job.name} ({job.id})")

Create a Stored Metric#

You can create a reusable metric and reference it by name in jobs:

# Create the metric
client.evaluation.metrics.create(
    type="remote",
    name="my-remote-metric",
    description="Custom evaluation metric for accuracy scoring",
    url="https://my-evaluation-server.test/evaluate",
    body={"reference": "{{ item.reference }}", "response": "{{ item.output }}"},
    scores=[{"name": "accuracy", "parser": {"type": "json", "json_path": "$.result.accuracy"}}],
)

# Use it in a job by reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
    spec={"metric": "default/my-remote-metric", "dataset": "default/my-dataset-fileset"},
)

Refer to Managing Metrics for more information on how to modify or delete a metric.

Monitor Job Progress#

import time

while True:
    job_status = client.evaluation.metric_jobs.retrieve(job.name)
    print(f"Status: {job_status.status}")

    if job_status.status in ["completed", "error", "cancelled"]:
        break

    time.sleep(5)

Using API Key Secrets#

If your remote endpoint requires authentication, store the API key as a secret:

Create a Secret#

client.secrets.create(
    name="my-remote-api-key",
    data="your-api-key-value"
)

Reference the Secret in Your Metric#

from nemo_platform.types.evaluation import (
    RemoteMetricParam,
    NeMoAgentToolkitRemoteMetricParam,
)

# Live evaluation with secret
metric: RemoteMetricParam = {
    "type": "remote",
    "url": "https://my-authenticated-endpoint.test/evaluate",
    "body": {"input": "{{ item.input }}"},
    "scores": [{"name": "score", "parser": {"type": "json", "json_path": "$.score"}}],
    "api_key_secret": "my-remote-api-key",
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset={"rows": [{"input": "test"}]},
)

# Job evaluation with secret
nat_metric: NeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://host.docker.internal:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "api_key_secret": "my-remote-api-key",
}

job = client.evaluation.metric_jobs.create(
    spec={"metric": nat_metric, "dataset": {"rows": [...]}},
)

The secret is automatically resolved:

  • Live evaluation: Secret is fetched from the platform’s secrets service

  • Job evaluation: Secret is injected as an environment variable into the container

The API key is sent in the Authorization: Bearer <key> header.
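
If your endpoint validates the key itself, a minimal sketch of that check (assuming FastAPI, and a hypothetical MY_EVAL_API_KEY environment variable holding the expected key) could look like this:

import os
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

# Hypothetical: the expected key is read from an environment variable you control.
EXPECTED_API_KEY = os.environ["MY_EVAL_API_KEY"]

@app.post("/evaluate")
async def evaluate(payload: dict, authorization: str = Header("")) -> dict:
    # The platform sends the key as "Authorization: Bearer <key>".
    if authorization != f"Bearer {EXPECTED_API_KEY}":
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Your evaluation logic here; this response matches the $.score parser above.
    return {"score": 1.0}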


Endpoint Requirements#

Your remote endpoint must:

  1. Accept POST requests with Content-Type: application/json

  2. Return a JSON response containing the score(s)

Example Endpoint (FastAPI)#

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvaluationRequest(BaseModel):
    reference: str
    response: str

class EvaluationResponse(BaseModel):
    result: dict

@app.post("/evaluate")
async def evaluate(request: EvaluationRequest) -> EvaluationResponse:
    # Your evaluation logic here
    accuracy = 1.0 if request.reference == request.response else 0.0
    return EvaluationResponse(result={"accuracy": accuracy})
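
To sanity-check the endpoint before pointing a metric at it, you can post a sample row directly (a sketch assuming the app above is served locally on port 8000, for example with uvicorn):

import requests

# Hypothetical local URL; adjust to wherever you serve the app above.
resp = requests.post(
    "http://localhost:8000/evaluate",
    json={"reference": "Paris", "response": "Paris"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expected: {"result": {"accuracy": 1.0}}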

NAT Endpoint Format#

NAT endpoints receive:

{
    "evaluator_name": "similarity_eval",
    "item": {
        "id": "item_1",
        "input_obj": "What is the capital of France?",
        "expected_output_obj": "The capital of France is Paris.",
        "output_obj": "Paris is the capital.",
        "trajectory": [],
        "expected_trajectory": [],
        "full_dataset_entry": {}
    }
}

And must return:

{
    "success": true,
    "result": {
        "id": "item_1",
        "score": 0.85,
        "reasoning": {"method": "cosine_similarity"}
    },
    "error": null
}
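
A minimal sketch of an endpoint that satisfies this contract (assuming FastAPI; the exact-match scoring is a placeholder for your own similarity logic):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class NATRequest(BaseModel):
    evaluator_name: str
    item: dict

@app.post("/evaluate_item")
async def evaluate_item(request: NATRequest) -> dict:
    item = request.item
    # Placeholder scoring: exact match between expected and actual output.
    score = 1.0 if item.get("expected_output_obj") == item.get("output_obj") else 0.0
    return {
        "success": True,
        "result": {
            "id": item.get("id"),
            "score": score,
            "reasoning": {"method": "exact_match"},
        },
        "error": None,
    }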

Configuration Options#

Metric Parameters#

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | "remote" or "nemo-agent-toolkit-remote" |
| url | string | Endpoint URL |
| body | dict | (Generic only) Jinja template for request payload |
| scores | list | (Generic only) List of score configuration objects (refer to Score Configuration section) |
| evaluator_name | string | (NAT only) Name of the NAT evaluator |
| api_key_secret | string | Optional secret name for API key authentication |
| timeout_seconds | float | Request timeout (default: 30.0) |
| max_retries | int | Max retry attempts (default: 3) |

Score Configuration (Generic Remote Only)#

Each score object in the scores list supports the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Score identifier (lowercase, numbers, underscores only) |
| parser | object | Yes | Parser configuration for extracting the score value (refer to Parser configuration below) |
| description | string | No | Human-readable description of the score |
| minimum | float | No | Minimum expected value for the score range (default: None = no bound) |
| maximum | float | No | Maximum expected value for the score range (default: None = no bound) |

Parser configuration:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Parser type, must be "json" |
| json_path | string | Yes | JSONPath expression to extract the score value |

Example with all fields:

{
    "scores": [
        {
            "name": "accuracy",
            "parser": {
                "type": "json",
                "json_path": "$.result.accuracy"
            },
            "description": "Measures response accuracy against reference",
            "minimum": 0.0,
            "maximum": 1.0
        }
    ]
}

Job Management#

After successfully creating a job, refer to Metrics Job Management to oversee execution and monitor progress.


Limitations#

  1. Network access: For job-based evaluation, endpoints must be accessible from the job container. Use host.docker.internal for local endpoints.

  2. Response format: Scores must be extractable using JSONPath from the response. Ensure your endpoint returns properly structured JSON.

  3. Live evaluation limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets.
