Bring Your Own Metric | NVIDIA NeMo Platform

NeMo Platform offers built-in metrics that can be configured to evaluate on your custom data. Remote metrics let you bring your own metric logic into the NeMo Platform evaluation workflow by serving that logic from a REST API.

A remote metric gives you control over the evaluation logic, request payload, and reported scores while the Evaluator plugin SDK handles dataset iteration, result aggregation, retries, and job execution.

Overview

Remote metrics support two types:

Type	Use Case	Payload Structure
Generic Remote (`remote`)	Custom endpoints with configurable request body and score extraction	User-defined Jinja template
NeMo Agent Toolkit Remote (`nemo-agent-toolkit-remote`)	NeMo Agent Toolkit evaluator endpoints	Fixed: `{evaluator_name, item}`

NeMo Evaluator supports two execution modes through the Evaluator plugin SDK:

Mode	Use Case	SDK Call
Local execution	Rapid prototyping and synchronous workflows	`evaluator.run(metric=metric, dataset=dataset)`
Durable remote job	Production workloads that should run as platform jobs	`evaluator.submit(metric=metric, dataset=dataset)`

Prerequisites

Before running remote metric evaluations:

Workspace: Have a workspace created.
Remote endpoint: Have your evaluation endpoint running and accessible.
API key (if required): If your endpoint requires authentication, create a secret to store the API key.
Initialize the SDK:

import os

from nemo_evaluator.sdk import Evaluator
from nemo_evaluator_sdk import (
    NemoAgentToolkitRemoteMetric,
    RemoteMetric,
)
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = sdk.evaluator  # this object is an Evaluator resource

Local Execution

Local execution provides immediate results for rapid iteration when developing and testing your metrics.

Generic Remote Metric

Use a generic remote metric when you need full control over the request payload and score extraction:

from nemo_evaluator_sdk import JSONScoreParser, RemoteScore, RemoteMetric

metric = RemoteMetric(
    url="https://my-evaluation-server.test/evaluate",
    body={
        "reference": "{{item.reference}}",
        "response": "{{item.output}}",
    },
    scores=[
        RemoteScore(
            name="accuracy",
            parser=JSONScoreParser(json_path="$.result.accuracy"),
            description="Measures response accuracy against the reference",
            minimum=0.0,
            maximum=1.0,
        )
    ],
    timeout_seconds=30.0,
    max_retries=3,
)


result = evaluator.run(
    metric=metric,
    dataset=[
        {"reference": "The capital is Paris", "output": "Paris is the capital"},
        {"reference": "2", "output": "2"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

Key configuration:

body: Jinja template for the request payload. Use {{ item.<column> }} to access dataset columns.
scores: List of score definitions, each with a parser object that contains a JSONPath expression for extracting the score value from the response.
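
To make the two settings concrete, the sketch below mimics what happens for a single dataset row: the body template is filled from the row, and a score is read from the endpoint response with a JSONPath. It is an illustration only, not the SDK's implementation. The helpers `render_body` and `extract_score` are hypothetical, the regex is a stand-in that handles only `{{ item.<column> }}` placeholders rather than full Jinja, and the path walker handles only simple dotted paths.

```python
import re
from typing import Any

# Hypothetical stand-in for Jinja rendering: handles only {{ item.<column> }}.
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_body(body: dict[str, str], item: dict[str, Any]) -> dict[str, str]:
    """Substitute {{ item.<column> }} placeholders with values from one row."""
    return {
        key: _PLACEHOLDER.sub(lambda m: str(item[m.group(1)]), template)
        for key, template in body.items()
    }


def extract_score(response: dict[str, Any], json_path: str) -> float:
    """Resolve a simple dotted JSONPath such as '$.result.accuracy'."""
    node: Any = response
    for key in json_path.removeprefix("$.").split("."):
        node = node[key]
    return float(node)


body = {"reference": "{{item.reference}}", "response": "{{item.output}}"}
row = {"reference": "2", "output": "2"}

payload = render_body(body, row)  # what the endpoint would receive
score = extract_score({"result": {"accuracy": 1.0}}, "$.result.accuracy")
print(payload, score)
```

If a column named in the template is missing from a row, this sketch raises a KeyError; how the SDK reports that case is not covered here.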

NeMo Agent Toolkit Remote Metric

Use the NAT remote metric type when integrating with NeMo Agent Toolkit evaluator endpoints:

from nemo_evaluator_sdk import NemoAgentToolkitRemoteMetric

metric = NemoAgentToolkitRemoteMetric(
    url="http://localhost:8001/evaluate_item",
    evaluator_name="similarity_eval",
    timeout_seconds=30.0,
    max_retries=3,
)


result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "id": "item_1",
            "input_obj": "What is the capital of France?",
            "expected_output_obj": "The capital of France is Paris.",
            "output_obj": "Paris is the capital of France.",
            "trajectory": [],
            "expected_trajectory": [],
            "full_dataset_entry": {},
        }
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

The NAT metric automatically:

Sends payload: {"evaluator_name": "<name>", "item": <row_data>}.
Extracts the score from: $.result.score.
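
As a rough picture of those two steps, the sketch below builds the fixed envelope for one row and reads the score back from a response. The function names `build_nat_payload` and `extract_nat_score` are hypothetical and are not part of the SDK; only the payload shape and the `$.result.score` path come from the description above.

```python
from typing import Any


def build_nat_payload(evaluator_name: str, row: dict[str, Any]) -> dict[str, Any]:
    """Wrap one dataset row in the fixed {evaluator_name, item} envelope."""
    return {"evaluator_name": evaluator_name, "item": row}


def extract_nat_score(response: dict[str, Any]) -> float:
    """Read the value at $.result.score from a NAT evaluator response."""
    return float(response["result"]["score"])


row = {
    "id": "item_1",
    "input_obj": "What is the capital of France?",
    "expected_output_obj": "The capital of France is Paris.",
    "output_obj": "Paris is the capital of France.",
}
nat_payload = build_nat_payload("similarity_eval", row)
nat_score = extract_nat_score({"success": True, "result": {"id": "item_1", "score": 0.85}})
print(nat_payload["evaluator_name"], nat_score)
```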

Durable Remote Jobs

For production workloads, submit the same metric and dataset as a durable platform job. The returned job resource can wait for completion and download the final EvaluationResult.

Generic Remote Metric

from nemo_evaluator_sdk import RunConfig, JSONScoreParser, RemoteScore, RemoteMetric

metric = RemoteMetric(
    url="https://my-evaluation-server.test/evaluate",
    body={
        "reference": "{{item.reference}}",
        "response": "{{item.output}}",
    },
    scores=[
        RemoteScore(
            name="accuracy",
            parser=JSONScoreParser(json_path="$.result.accuracy"),
            minimum=0.0,
            maximum=1.0,
        )
    ],
    timeout_seconds=30.0,
    max_retries=3,
)


job = evaluator.submit(
    metric=metric,
    dataset=[
        {"reference": "Paris", "output": "Paris"},
        {"reference": "2", "output": "2"},
    ],
    config=RunConfig(parallelism=8),
)
print("Submitted job:", job.name)

job.wait_until_done()
result = job.get_result()

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

Using API Key Secrets

If your remote endpoint requires authentication, store the API key as a platform secret and reference it from your metric through api_key_secret. Local runs and remote submits resolve api_key_secret differently; see Model API Authentication for details.

The following metric references a secret named my-remote-api-key:

from nemo_evaluator_sdk import JSONScoreParser, RemoteScore, SecretRef, RemoteMetric

metric = RemoteMetric(
    url="https://my-authenticated-endpoint.test/evaluate",
    body={"input": "{{item.input}}"},
    scores=[RemoteScore(name="score", parser=JSONScoreParser(json_path="$.score"))],
    api_key_secret=SecretRef(root="my-remote-api-key"),
)

result = evaluator.run(metric=metric, dataset=[{"input": "test"}])

The API key is sent in the Authorization: Bearer <key> header. For local execution, the SDK resolves the secret as described in Model API Authentication. For durable remote jobs, the job runtime receives the secret securely.
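
On the endpoint side, the header has to be checked before scoring. The sketch below shows one way an endpoint might validate it, using only the standard library. The function name `is_authorized` and the way the expected key is supplied are assumptions for illustration; how you load the key on your server is up to your deployment.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_key: str) -> bool:
    """Return True when the header is 'Bearer <key>' and <key> matches."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    # HTTP authentication scheme names are case-insensitive.
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(token.encode(), expected_key.encode())


print(is_authorized("Bearer my-key", "my-key"))
```

An endpoint would typically answer with HTTP 401 when this check fails.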

Endpoint Requirements

Your remote endpoint must:

Accept POST requests with Content-Type: application/json.
Return a JSON response containing the score values.
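
Before wiring an endpoint into a metric, it can help to send it one request by hand. The sketch below builds such a request with the standard library so you can inspect the method, headers, and body. The helper `build_request` is hypothetical, and the URL and payload reuse the placeholder values from the examples on this page.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(
    url: str, payload: dict[str, Any], api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build a JSON POST request, optionally with a bearer token."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_request(
    "https://my-evaluation-server.test/evaluate",
    {"reference": "2", "response": "2"},
)
# urllib stores header names capitalized, hence "Content-type" here.
print(request.get_method(), request.get_header("Content-type"))

# To actually send it (requires a reachable endpoint):
# with urllib.request.urlopen(request, timeout=30.0) as response:
#     print(json.load(response))
```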

Example Endpoint (FastAPI)

from fastapi import FastAPI
from pydantic import BaseModel


app = FastAPI()


class MetricRequest(BaseModel):
    reference: str
    response: str


class MetricResponse(BaseModel):
    result: dict[str, float]


@app.post("/evaluate")
async def evaluate(request: MetricRequest) -> MetricResponse:
    accuracy = 1.0 if request.reference == request.response else 0.0
    return MetricResponse(result={"accuracy": accuracy})

NAT Endpoint Format

NAT endpoints receive:

{
  "evaluator_name": "similarity_eval",
  "item": {
    "id": "item_1",
    "input_obj": "What is the capital of France?",
    "expected_output_obj": "The capital of France is Paris.",
    "output_obj": "Paris is the capital.",
    "trajectory": [],
    "expected_trajectory": [],
    "full_dataset_entry": {}
  }
}

And must return:

{
  "success": true,
  "result": {
    "id": "item_1",
    "score": 0.85,
    "reasoning": {
      "method": "cosine_similarity"
    }
  },
  "error": null
}
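
A handler that honors this contract can be small. The sketch below is a framework-free function that takes the request payload and returns the response envelope shown above. It is an illustration under assumptions: `handle_evaluate_item` is a hypothetical name, the similarity measure is `difflib` string matching rather than the cosine similarity in the sample response, and the failure shape (`success` false with an `error` string) is a guess, since only the success case is documented here.

```python
from difflib import SequenceMatcher
from typing import Any


def handle_evaluate_item(payload: dict[str, Any]) -> dict[str, Any]:
    """Score one NAT item and return the {success, result, error} envelope."""
    try:
        item = payload["item"]
        item_id = item["id"]
        expected = str(item["expected_output_obj"])
        actual = str(item["output_obj"])
    except KeyError as exc:
        # Failure shape is an assumption; the source shows only the success case.
        return {"success": False, "result": None, "error": f"missing field: {exc}"}

    score = SequenceMatcher(None, expected, actual).ratio()
    return {
        "success": True,
        "result": {
            "id": item_id,
            "score": round(score, 4),
            "reasoning": {"method": "difflib_ratio"},
        },
        "error": None,
    }


nat_response = handle_evaluate_item(
    {
        "evaluator_name": "similarity_eval",
        "item": {
            "id": "item_1",
            "expected_output_obj": "Paris",
            "output_obj": "Paris",
        },
    }
)
print(nat_response["result"]["score"])
```

The function can be called from whichever web framework serves the endpoint, with the parsed JSON body as its argument.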

Configuration Options

Metric Parameters

Parameter	Type	Description
`type`	string	`"remote"` or `"nemo-agent-toolkit-remote"`
`url`	string	Endpoint URL
`body`	dict	(Generic only) Jinja template for request payload
`scores`	list	(Generic only) List of score configuration objects (refer to Score Configuration section)
`evaluator_name`	string	(NAT only) Name of the NAT evaluator
`api_key_secret`	`SecretRef`	Optional API key reference. See Model API Authentication
`timeout_seconds`	float	Request timeout (default: 30.0)
`max_retries`	int	Max retry attempts (default: 3)
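
To see what `max_retries` implies for an endpoint, the sketch below shows a generic retry loop. It assumes that `max_retries` counts attempts after the first one; the SDK's actual retry policy (which errors are retried, and whether it backs off between attempts) is not described on this page. The names `call_with_retries`, `backoff_seconds`, and `flaky_endpoint` are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request_fn: Callable[[], T], max_retries: int = 3, backoff_seconds: float = 0.0
) -> T:
    """Call request_fn, retrying up to max_retries times after the first failure."""
    attempt = 0
    while True:
        try:
            return request_fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted, surface the last error
            attempt += 1
            time.sleep(backoff_seconds * attempt)


calls = {"count": 0}


def flaky_endpoint() -> float:
    """Simulated endpoint call that times out twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return 0.85


print(call_with_retries(flaky_endpoint, max_retries=3))
```

Under this reading, the same row can reach your endpoint more than once, so scoring logic that is safe to repeat is the simpler design.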

Score Configuration (Generic Remote Only)

Each RemoteScore supports:

Field	Type	Required	Description
`name`	string	Yes	Score identifier (lowercase letters, numbers, and underscores only)
`parser`	object	Yes	Parser configuration for extracting the score value (refer to Parser configuration section)
`description`	string	No	Human-readable description of the score
`minimum`	float	No	Minimum expected value for the score range (default: None = no bound)
`maximum`	float	No	Maximum expected value for the score range (default: None = no bound)

Parser configuration:

Field	Type	Required	Description
`type`	string	Yes	Parser type, must be `"json"`
`json_path`	string	Yes	JSONPath expression to extract the score value

Example with all fields:

from nemo_evaluator_sdk import JSONScoreParser, RemoteScore

RemoteScore(
    name="accuracy",
    parser=JSONScoreParser(json_path="$.result.accuracy"),
    description="Measures response accuracy against reference",
    minimum=0.0,
    maximum=1.0,
)
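
When testing an endpoint on its own, you may want to check these constraints yourself before running an evaluation. The sketch below encodes the naming rule and the optional range from the table above as two small helpers. Both functions are hypothetical, and whether the SDK rejects, clips, or only records out-of-range values is not stated here.

```python
import re
from typing import Optional

# Lowercase letters, digits, and underscores only, per the name rule above.
_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_score_name(name: str) -> bool:
    """Check a score name against the documented character rule."""
    return _NAME_PATTERN.fullmatch(name) is not None


def in_expected_range(
    value: float, minimum: Optional[float] = None, maximum: Optional[float] = None
) -> bool:
    """Check a score against optional bounds; None means no bound on that side."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


print(is_valid_score_name("accuracy"), in_expected_range(0.85, 0.0, 1.0))
```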

Limitations

Network access: For job-based evaluation, remote metric endpoints must be reachable from the platform runtime that executes the job, not only from your local machine. Use a host or service URL that the platform can access.
Response format: Scores must be extractable using JSONPath from the response. Ensure your endpoint returns properly structured JSON.
Live evaluation limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets.

Evaluation Results - Understanding and downloading results
LLM-as-a-Judge - Use an LLM to evaluate outputs
Agentic Evaluation - Evaluate agent workflows