Bring Your Own Metric#

NeMo Platform offers built-in metrics that you can configure to evaluate your custom data. With remote metrics, you can also bring your own metric into the NeMo Platform ecosystem.

A remote metric integrates with your custom evaluation logic served behind a REST API, giving you full control over the evaluation that executes and the scores that are reported.

Overview#

Remote metrics support two types:

| Type | Use Case | Payload Structure |
| --- | --- | --- |
| Generic Remote (remote) | Custom endpoints with configurable body/scores | User-defined Jinja template |
| NeMo Agent Toolkit Remote (nemo-agent-toolkit-remote) | NAT evaluator endpoints | Fixed: {evaluator_name, item} |

NeMo Evaluator supports two evaluation modes:

| Mode | Use Case | Dataset Size | Response |
| --- | --- | --- | --- |
| Live Evaluation | Rapid prototyping, testing | Up to 10 rows | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Unlimited | Async (poll for completion) |

Prerequisites#

Before running remote metric evaluations:

  1. Workspace: Have a workspace created.

  2. Remote endpoint: Have your evaluation endpoint running and accessible.

  3. API key (if required): If your endpoint requires authentication, create a secret to store the API key.

  4. Initialize the SDK:

import os
from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import (
    EvaluateDatasetRowsParam,
    RemoteMetricParam,
    NeMoAgentToolkitRemoteMetricParam,
)

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Live Evaluation#

Live evaluation provides immediate results for rapid iteration when developing and testing your metrics.

Generic Remote Metric#

Use a generic remote metric when you need full control over the request payload and score extraction:

metric: RemoteMetricParam = {
    "type": "remote",
    "url": "https://my-evaluation-server.test/evaluate",
    "body": {
        "reference": "{{ item.reference }}",
        "response": "{{ item.output }}"
    },
    "scores": [
        {
            "name": "accuracy",
            "parser": {"type": "json", "json_path": "$.result.accuracy"}
        }
    ],
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

dataset: EvaluateDatasetRowsParam = {
    "rows": [
        {"reference": "The capital is Paris", "output": "Paris is the capital"},
        {"reference": "2", "output": "2"},
    ]
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset=dataset,
)

# Access results
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

Key configuration:

  • body: Jinja template for the request payload. Use {{ item.<column> }} to access dataset columns (see the rendered example below).

  • scores: List of score definitions, each with a parser object containing a JSONPath expression for extracting values from the response.
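
For example, given the first dataset row above, the body template renders to the following request payload (illustrative; the values come from your own dataset columns):

{
    "reference": "The capital is Paris",
    "response": "Paris is the capital"
}

The accuracy score is then extracted from the endpoint's JSON response using the $.result.accuracy JSONPath, for example from {"result": {"accuracy": 1.0}}.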

NeMo Agent Toolkit Remote Metric#

Use the NAT remote metric type when integrating with NeMo Agent Toolkit evaluators:

metric: NeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://localhost:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

dataset: EvaluateDatasetRowsParam = {
    "rows": [
        {
            "id": "item_1",
            "input_obj": "What is the capital of France?",
            "expected_output_obj": "The capital of France is Paris.",
            "output_obj": "Paris is the capital of France.",
            "trajectory": [],
            "expected_trajectory": [],
            "full_dataset_entry": {},
        }
    ]
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset=dataset,
)

print(f"Score: {result.aggregate_scores[0].mean}")

The NAT metric automatically:

  • Sends payload: {"evaluator_name": "<name>", "item": <row_data>}

  • Extracts score from: $.result.score


Job-Based Evaluation#

For larger datasets or production workloads, use job-based evaluation. Jobs run asynchronously and support datasets of any size.

Create a Job with Inline Metric#

from nemo_platform.types.evaluation import (
    RemoteMetricParam,
    MetricOfflineJobParam,
)

metric: RemoteMetricParam = {
    "type": "remote",
    "url": "https://my-evaluation-server.test/evaluate",
    "body": {
        "reference": "{{ item.reference }}",
        "response": "{{ item.output }}"
    },
    "scores": [
        {
            "name": "accuracy",
            "parser": {"type": "json", "json_path": "$.result.accuracy"}
        }
    ],
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric=metric,
        dataset={
            "rows": [
                {"reference": "Paris", "output": "Paris"},
                {"reference": "2", "output": "2"},
            ]
        },
    ),
)

print(f"Job created: {job.name} ({job.id})")
from nemo_platform.types.evaluation import (
    NeMoAgentToolkitRemoteMetricParam,
    MetricOfflineJobParam,
)

metric: NeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://host.docker.internal:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric=metric,
        dataset={
            "rows": [
                {
                    "id": "item_1",
                    "input_obj": "What is the capital of France?",
                    "expected_output_obj": "The capital of France is Paris.",
                    "output_obj": "Paris is the capital.",
                    "trajectory": [],
                    "expected_trajectory": [],
                    "full_dataset_entry": {},
                }
            ]
        },
    ),
)

print(f"Job created: {job.name} ({job.id})")

Create a Stored Metric#

You can create a reusable metric and reference it by name in jobs:

# Create the metric
client.evaluation.metrics.create(
    type="remote",
    name="my-remote-metric",
    description="Custom evaluation metric for accuracy scoring",
    url="https://my-evaluation-server.test/evaluate",
    body={"reference": "{{ item.reference }}", "response": "{{ item.output }}"},
    scores=[{"name": "accuracy", "parser": {"type": "json", "json_path": "$.result.accuracy"}}],
)

# Use it in a job by reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
    spec={"metric": "default/my-remote-metric", "dataset": "default/my-dataset-fileset"},
)

Refer to Managing Metrics for more information on how to modify or delete a metric.

Monitor Job Progress#

import time

while True:
    job_status = client.evaluation.metric_jobs.retrieve(job.name)
    print(f"Status: {job_status.status}")

    if job_status.status in ["completed", "error", "cancelled"]:
        break

    time.sleep(5)

Using API Key Secrets#

If your remote endpoint requires authentication, store the API key as a secret:

Create a Secret#

client.secrets.create(
    name="my-remote-api-key",
    data="your-api-key-value"
)

Reference the Secret in Your Metric#

from nemo_platform.types.evaluation import (
    RemoteMetricParam,
    NeMoAgentToolkitRemoteMetricParam,
)

# Live evaluation with secret
metric: RemoteMetricParam = {
    "type": "remote",
    "url": "https://my-authenticated-endpoint.test/evaluate",
    "body": {"input": "{{ item.input }}"},
    "scores": [{"name": "score", "parser": {"type": "json", "json_path": "$.score"}}],
    "api_key_secret": "my-remote-api-key",
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset={"rows": [{"input": "test"}]},
)

# Job evaluation with secret
nat_metric: NeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://host.docker.internal:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "api_key_secret": "my-remote-api-key",
}

job = client.evaluation.metric_jobs.create(
    spec={"metric": nat_metric, "dataset": {"rows": [...]}},
)

The secret is automatically resolved:

  • Live evaluation: Secret is fetched from the platform’s secrets service

  • Job evaluation: Secret is injected as an environment variable into the container

The API key is sent in the Authorization: Bearer <key> header.
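
If your endpoint validates the key itself, a minimal sketch of that check (assuming FastAPI, and a hypothetical MY_EVAL_API_KEY environment variable holding the expected key) could look like this:

import os
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

# Hypothetical: the expected key is read from an environment variable you control.
EXPECTED_API_KEY = os.environ["MY_EVAL_API_KEY"]

@app.post("/evaluate")
async def evaluate(payload: dict, authorization: str = Header("")) -> dict:
    # The platform sends the key as "Authorization: Bearer <key>".
    if authorization != f"Bearer {EXPECTED_API_KEY}":
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Your evaluation logic here; this response matches the $.score parser above.
    return {"score": 1.0}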


Endpoint Requirements#

Your remote endpoint must:

  1. Accept POST requests with Content-Type: application/json

  2. Return a JSON response containing the score(s)

Example Endpoint (FastAPI)#

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvaluationRequest(BaseModel):
    reference: str
    response: str

class EvaluationResponse(BaseModel):
    result: dict

@app.post("/evaluate")
async def evaluate(request: EvaluationRequest) -> EvaluationResponse:
    # Your evaluation logic here
    accuracy = 1.0 if request.reference == request.response else 0.0
    return EvaluationResponse(result={"accuracy": accuracy})
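
To sanity-check the endpoint before pointing a metric at it, you can post a sample row directly (a sketch assuming the app above is served locally on port 8000, for example with uvicorn):

import requests

# Hypothetical local URL; adjust to wherever you serve the app above.
resp = requests.post(
    "http://localhost:8000/evaluate",
    json={"reference": "Paris", "response": "Paris"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expected: {"result": {"accuracy": 1.0}}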

NAT Endpoint Format#

NAT endpoints receive:

{
    "evaluator_name": "similarity_eval",
    "item": {
        "id": "item_1",
        "input_obj": "What is the capital of France?",
        "expected_output_obj": "The capital of France is Paris.",
        "output_obj": "Paris is the capital.",
        "trajectory": [],
        "expected_trajectory": [],
        "full_dataset_entry": {}
    }
}

And must return:

{
    "success": true,
    "result": {
        "id": "item_1",
        "score": 0.85,
        "reasoning": {"method": "cosine_similarity"}
    },
    "error": null
}
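
A minimal sketch of an endpoint that satisfies this contract (assuming FastAPI; the exact-match scoring is a placeholder for your own similarity logic):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class NATRequest(BaseModel):
    evaluator_name: str
    item: dict

@app.post("/evaluate_item")
async def evaluate_item(request: NATRequest) -> dict:
    item = request.item
    # Placeholder scoring: exact match between expected and actual output.
    score = 1.0 if item.get("expected_output_obj") == item.get("output_obj") else 0.0
    return {
        "success": True,
        "result": {
            "id": item.get("id"),
            "score": score,
            "reasoning": {"method": "exact_match"},
        },
        "error": None,
    }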

Configuration Options#

Metric Parameters#

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | "remote" or "nemo-agent-toolkit-remote" |
| url | string | Endpoint URL |
| body | dict | (Generic only) Jinja template for request payload |
| scores | list | (Generic only) List of score configuration objects (refer to Score Configuration section) |
| evaluator_name | string | (NAT only) Name of the NAT evaluator |
| api_key_secret | string | Optional secret name for API key authentication |
| timeout_seconds | float | Request timeout (default: 30.0) |
| max_retries | int | Max retry attempts (default: 3) |

Score Configuration (Generic Remote Only)#

Each score object in the scores list supports the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Score identifier (lowercase, numbers, underscores only) |
| parser | object | Yes | Parser configuration for extracting the score value (refer to Parser configuration below) |
| description | string | No | Human-readable description of the score |
| minimum | float | No | Minimum expected value for the score range (default: None = no bound) |
| maximum | float | No | Maximum expected value for the score range (default: None = no bound) |

Parser configuration:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Parser type, must be "json" |
| json_path | string | Yes | JSONPath expression to extract the score value |

Example with all fields:

{
    "scores": [
        {
            "name": "accuracy",
            "parser": {
                "type": "json",
                "json_path": "$.result.accuracy"
            },
            "description": "Measures response accuracy against reference",
            "minimum": 0.0,
            "maximum": 1.0
        }
    ]
}

Job Management#

After successfully creating a job, refer to Metrics Job Management to oversee execution and monitor progress.


Limitations#

  1. Network access: For job-based evaluation, endpoints must be accessible from the job container. Use host.docker.internal for local endpoints.

  2. Response format: Scores must be extractable using JSONPath from the response. Ensure your endpoint returns properly structured JSON.

  3. Live evaluation limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets.
