Similarity Metrics#

NeMo Platform offers built-in metrics that you can configure to evaluate your custom data. You can also define your own metrics within the NeMo Platform ecosystem using similarity metrics that support templating.

Template functionality provides maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. You can bring your own datasets, define your own prompts and templates using Jinja, and select or implement the metrics that matter most for your use case. This approach is ideal when:

  • You want to evaluate on tasks, data, or formats not covered by industry benchmarks or built-in metrics.

  • You need to measure model performance using custom or business-specific criteria.

  • You want to experiment with new evaluation methodologies, metrics, or workflows.

  • You need to create custom prompts and templates for specific use cases.

import os
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Template Variables#

All similarity metrics support Jinja templating with these variables:

  • {{item}} - Access dataset columns (e.g., {{item.question}}, {{item.answer}})

  • {{sample.output_text}} - The model’s generated output (default for candidate)

  • Jinja filters: lower, upper, trim, replace, etc.
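To see what a template resolves to, you can render it locally with the jinja2 package. This is only a local sketch of the same Jinja semantics; the platform renders templates server-side against each dataset row.

```python
from jinja2 import Template  # pip install jinja2

# a dataset row, as it would appear under {{item}}
row = {"expected": "  PARIS ", "question": "What is the capital of France?"}

# filters chain left to right: lowercase first, then strip surrounding whitespace
rendered = Template("{{item.expected | lower | trim}}").render(item=row)
print(rendered)  # -> paris
```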

Use Jinja filters to normalize text before comparison:

metric = client.evaluation.metrics.create(
    name="normalized-comparison",
    workspace="my-workspace",
    type="exact-match",
    reference="{{item.expected | lower | trim}}",
    candidate="{{sample.output_text | lower | trim}}",
)

Create a Metric#

You can create metrics and store them for reuse. To use a stored metric, reference it by its unique workspace and metric name (workspace/name):

from nemo_platform.types.evaluation import BleuMetricParam, MetricOfflineJob

# First, create and store the metric
metric = client.evaluation.metrics.create(
    **BleuMetricParam(
        name="my-bleu-metric",
        type="bleu",
        references=[
            "{{item.reference_1}}",
            "{{item.reference_2}}",
        ],
        candidate="{{item.model_output}}",
        description="BLEU score for translation quality",
        supported_job_types=["offline"],
    )
)
print(f"Created metric: {metric.name}")

# Then use it in a job by metric reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric="my-workspace/my-bleu-metric",
        dataset="my-workspace/dataset-name"
    )
)

You can create and store a metric first and then reference it by name (workspace/metric-name), as shown in the example above, or you can configure the metric inline within your evaluation or job. Both approaches work for live evaluations and for metric jobs, both online and offline.

BLEU Metric#

BLEU (Bilingual Evaluation Understudy) measures the similarity between machine-generated text and reference translations by comparing n-gram overlap. It’s commonly used for evaluating machine translation and text generation tasks.

Use BLEU when:

  • Evaluating machine translation quality

  • Measuring text generation similarity to references

  • Comparing multiple reference texts

Metric Output: A score between 0 and 100, where 100 indicates perfect match with references.
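The core ingredient of BLEU, clipped n-gram precision, can be sketched in a few lines. This is an illustrative approximation only; the actual metric combines precisions for n = 1 through 4 and applies a brevity penalty.

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # clip each candidate n-gram count by its count in the reference
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

print(modified_ngram_precision("the cat is on the mat", "the cat sits on the mat", 1))
```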

from nemo_platform.types.evaluation import DatasetRows, BleuMetric

result = client.evaluation.metrics.evaluate(
    metric=BleuMetric(
        type="bleu",
        references=["{{item.reference_1}}", "{{item.reference_2}}"],
        candidate="{{item.model_output}}",
    ),
    dataset=DatasetRows(rows=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ]),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    BleuMetricParam,
    DatasetRows,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=BleuMetricParam(
            type="bleu",
            references=[
                "{{item.reference_1}}",
                "{{item.reference_2}}",
            ],
            candidate="{{item.model_output}}",
            description="BLEU score for translation quality",
        ),
        dataset=DatasetRows(rows=[
            {
                "reference_1": "The cat sits on the mat.",
                "reference_2": "A cat is sitting on the mat.",
                "model_output": "The cat is on the mat.",
            },
            {
                "reference_1": "Hello world!",
                "reference_2": "Hi world!",
                "model_output": "Hello world!",
            },
        ]),
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    BleuMetricParam,
    DatasetRows,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=BleuMetricParam(
            type="bleu",
            references=[
                "{{item.reference_1}}",
                "{{item.reference_2}}",
            ],
            description="BLEU score for translation quality",
        ),
        dataset=DatasetRows(rows=[
            {
                "prompt": "Who is sitting on the mat?",
                "reference_1": "The cat sits on the mat.",
                "reference_2": "A cat is sitting on the mat.",
            },
            {
                "prompt": "Welcome!",
                "reference_1": "Hello world!",
                "reference_2": "Hi world!",
            },
        ]),
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "sentence",
      "count": 2,
      "mean": 76.86,
      "min": 53.73,
      "max": 100.00
    },
    {
      "name": "corpus",
      "count": 1,
      "mean": 53.895
    }
  ]
}

Exact Match Metric#

Exact Match compares the candidate text with the reference text for exact equality. The metric returns 1 if the strings match exactly (after normalization) and 0 otherwise.

Use Exact Match when:

  • Evaluating classification tasks with discrete labels

  • Checking for exact answer correctness

  • Validating structured output formats

Metric Output: Binary score (0 or 1).

from nemo_platform.types.evaluation import (
    DatasetRows,
    ExactMatchMetric
)

result = client.evaluation.metrics.evaluate(
    metric=ExactMatchMetric(
        type="exact-match",
        reference="{{item.correct_answer}}",
        candidate="{{item.model_answer}}",
        description="Exact match for question answering",
    ),
    dataset=DatasetRows(rows=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "42", "model_answer": "43"},
        {"correct_answer": "True", "model_answer": "true"},
    ]),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    ExactMatchMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=ExactMatchMetricParam(
            type="exact-match",
            reference="{{item.correct_answer}}",
            candidate="{{item.model_answer}}",
            description="Exact match for question answering",
        ),
        dataset="my-workspace/my-dataset",
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    ExactMatchMetricParam,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=ExactMatchMetricParam(
            type="exact-match",
            reference="{{item.correct_answer}}",
            candidate="{{item.model_answer}}",
            description="Exact match for question answering",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "exact-match",
      "count": 3,
      "mean": 0.6667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

F1 Metric#

F1 score measures token-level overlap between the candidate and reference text using precision and recall. It’s the harmonic mean of precision (fraction of candidate tokens in reference) and recall (fraction of reference tokens in candidate).

Use F1 when:

  • Evaluating question answering systems

  • Measuring partial correctness in text generation

  • Assessing information extraction tasks

Metric Output: Score between 0 and 1, where 1 indicates perfect overlap.
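As an illustration, token-level F1 over whitespace tokens can be computed like this. This is a sketch; the platform's tokenization and normalization may differ.

```python
from collections import Counter

def token_f1(candidate, reference):
    # harmonic mean of precision and recall over whitespace-separated tokens
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)  # candidate tokens found in the reference
    recall = overlap / len(ref)      # reference tokens found in the candidate
    return 2 * precision * recall / (precision + recall)

print(token_f1("The answer is 42", "42"))  # -> 0.4
```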

from nemo_platform.types.evaluation import (
    DatasetRows,
    F1Metric
)

result = client.evaluation.metrics.evaluate(
    metric=F1Metric(
        type="f1",
        reference="{{item.ground_truth}}",
        candidate="{{item.predicted}}",
        description="F1 score for question answering",
    ),
    dataset=DatasetRows(rows=[
        {
            "ground_truth": "The capital of France is Paris",
            "predicted": "Paris is the capital of France",
        },
        {
            "ground_truth": "42",
            "predicted": "The answer is 42",
        },
    ]),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    F1MetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=F1MetricParam(
            type="f1",
            reference="{{item.ground_truth}}",
            candidate="{{item.predicted}}",
            description="F1 score for question answering",
        ),
        dataset=DatasetRows(rows=[
            {
                "ground_truth": "The capital of France is Paris",
                "predicted": "Paris is the capital of France",
            },
            {
                "ground_truth": "42",
                "predicted": "The answer is 42",
            },
        ]),
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    DatasetRows,
    F1MetricParam,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=F1MetricParam(
            type="f1",
            reference="{{item.ground_truth}}",
            candidate="{{sample.output_text}}",
            description="F1 score for question answering",
        ),
        dataset=DatasetRows(rows=[
            {
                "prompt": "What is the capital of France?",
                "ground_truth": "The capital of France is Paris",
            },
            {
                "prompt": "32+10=",
                "ground_truth": "42",
            },
        ]),
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "f1_score",
      "count": 2,
      "mean": 0.75,
      "min": 0.5,
      "max": 1.0
    }
  ]
}

Number Check Metric#

Number Check performs numerical comparisons on extracted values. It supports equality, inequality, ordering comparisons, and absolute-difference checks.

Use Number Check when:

  • Validating numerical outputs (calculations, counts, scores)

  • Checking value ranges or thresholds

  • Comparing predicted vs expected numbers

Metric Output: 1 if the condition is true, 0 otherwise. For absolute difference, the condition is true when the absolute difference between the two values is at most epsilon.

Supported Operations#

  • Equality: "equals", "=="

  • Inequality: "!=", "<>", "not equals"

  • Comparisons: ">", "gt", ">=", "gte", "<", "lt", "<=", "lte"

  • Absolute difference: "absolute difference" (requires epsilon parameter)
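The operations above can be sketched as follows. This is a hypothetical reference implementation to illustrate the semantics, not the platform's code; note that templates render to strings, so values are parsed as floats first.

```python
def number_check(left, right, operation, epsilon=None):
    # templates render to strings, so parse both sides numerically
    l, r = float(left), float(right)
    if operation == "absolute difference":
        # condition holds when the values agree within the given tolerance
        return int(abs(l - r) <= float(epsilon))
    ops = {
        "equals": l == r, "==": l == r,
        "!=": l != r, "<>": l != r, "not equals": l != r,
        ">": l > r, "gt": l > r, ">=": l >= r, "gte": l >= r,
        "<": l < r, "lt": l < r, "<=": l <= r, "lte": l <= r,
    }
    return int(ops[operation])

print(number_check("42.5", "42.3", "absolute difference", epsilon="0.5"))  # -> 1
```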

from nemo_platform.types.evaluation import NumberCheckMetricParam, DatasetRows

result = client.evaluation.metrics.evaluate(
    metric=NumberCheckMetricParam(
        type="number-check",
        operation="absolute difference",
        epsilon="0.5",
        left_template="{{item.expected}}",
        right_template="{{item.predicted}}",
        description="Check if values match within tolerance",
    ),
    dataset=DatasetRows(
        rows=[
            {"expected": "100", "predicted": "100"},
            {"expected": "42.5", "predicted": "42.3"},
            {"expected": "99", "predicted": "101"},
        ]
    ),
)
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    NumberCheckMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=NumberCheckMetricParam(
            type="number-check",
            left_template="{{item.predicted}}",
            operation=">",
            right_template="0.5",
            description="Score must be greater than 0.5",
        ),
        dataset=DatasetRows(
            rows=[
                {"predicted": "1"},
                {"predicted": "0.75"},
                {"predicted": "0.5"},
                {"predicted": "0.1"},
            ]
        )
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    NumberCheckMetricParam,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=NumberCheckMetricParam(
            type="number-check",
            left_template="{{sample.output_text}}",
            operation=">",
            right_template="0.5",
            description="Score must be greater than 0.5",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "number-check",
      "count": 3,
      "mean": 0.6667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

ROUGE Metric#

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated and reference summaries. It computes ROUGE-1 (unigram), ROUGE-2 (bigram), ROUGE-3 (trigram), and ROUGE-L (longest common subsequence) scores.

Use ROUGE when:

  • Evaluating text summarization

  • Measuring content preservation in paraphrasing

  • Assessing abstractive text generation

Metric Output: Multiple scores (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L), each with precision, recall, and F1 values.
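For intuition, ROUGE-L's F1 over the longest common subsequence of tokens can be sketched as follows (illustrative only; the platform's tokenization may differ):

```python
def lcs_length(a, b):
    # standard dynamic-programming longest common subsequence over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1(
    "global weather is impacted by climate change",
    "climate change affects global weather patterns",
), 3))  # -> 0.308
```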

from nemo_platform.types.evaluation import RougeMetricParam, DatasetRows

result = client.evaluation.metrics.evaluate(
    metric=RougeMetricParam(
            type="rouge",
            reference="{{item.reference_summary | lower}}",
            candidate="{{item.generated_summary | lower}}",
            description="ROUGE with normalized text",
        ),
    dataset=DatasetRows(
        rows=[
            {
                "reference_summary": "AI is transforming healthcare.",
                "generated_summary": "Artificial intelligence transforms medical care.",
            },
            {
                "reference_summary": "Climate change affects global weather patterns.",
                "generated_summary": "Global weather is impacted by climate change.",
            },
        ]
    )
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    RougeMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=RougeMetricParam(
            type="rouge",
            reference="{{item.reference_summary | lower}}",
            candidate="{{item.generated_summary | lower}}",
            description="ROUGE with normalized text",
        ),
        dataset=DatasetRows(
            rows=[
                {
                    "reference_summary": "AI is transforming healthcare.",
                    "generated_summary": "Artificial intelligence transforms medical care.",
                },
                {
                    "reference_summary": "Climate change affects global weather patterns.",
                    "generated_summary": "Global weather is impacted by climate change.",
                },
            ]
        )
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    MetricOnlineJob,
    Model,
    RougeMetricParam,
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=RougeMetricParam(
            type="rouge",
            reference="{{item.reference_summary | lower}}",
            candidate="{{sample.output_text | lower}}",
            description="ROUGE with normalized text",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "rouge_1_score",
      "count": 2,
      "mean": 0.419,
      "min": 0.222,
      "max": 0.615
    },
    {
      "name": "rouge_2_score",
      "count": 2,
      "mean": 0.273,
      "min": 0.182,
      "max": 0.363
    },
    {
      "name": "rouge_3_score",
      "count": 2,
      "mean": 0.0,
      "min": 0.0,
      "max": 0.0
    },
    {
      "name": "rouge_L_score",
      "count": 2,
      "mean": 0.265,
      "min": 0.222,
      "max": 0.308
    }
  ]
}

String Check Metric#

String Check performs string comparisons on templated values. It supports equality, containment, and pattern-matching operations.

Use String Check when:

  • Validating text format or structure

  • Checking for keyword presence

  • Pattern matching in generated text

  • String-based classification

Metric Output: Binary score (1 if condition is true, 0 otherwise).

Supported Operations#

  • Equality: "equals", "=="

  • Inequality: "!=", "<>", "not equals"

  • Containment: "contains", "not contains"

  • Pattern: "startswith", "endswith"
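The operations above can be sketched as follows (a hypothetical reference implementation to illustrate the semantics, not the platform's code):

```python
def string_check(left, right, operation):
    # each operation maps to a plain Python string comparison
    ops = {
        "equals": left == right, "==": left == right,
        "!=": left != right, "<>": left != right, "not equals": left != right,
        "contains": right in left, "not contains": right not in left,
        "startswith": left.startswith(right), "endswith": left.endswith(right),
    }
    return int(ops[operation])

print(string_check("The answer is: 42", "answer", "contains"))  # -> 1
```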

from nemo_platform.types.evaluation import DatasetRows, StringCheckMetricParam

result = client.evaluation.metrics.evaluate(
    metric=StringCheckMetricParam(
            type="string-check",
            operation="contains",
            left_template="{{item.output | trim}}",
            right_template="{{item.must_contain}}",
        ),
    dataset=DatasetRows(
        rows=[
            {
                "output": "The answer is: 42",
                "must_contain": "answer",
            },
            {
                "output": "Result: Success",
                "must_contain": "Success",
            },
            {
                "output": "Error occurred",
                "must_contain": "Success",
            },
        ]
    )
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    StringCheckMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=StringCheckMetricParam(
            type="string-check",
            operation="contains",
            left_template="{{item.output | trim}}",
            right_template="{{item.must_contain}}",
        ),
        dataset=DatasetRows(
            rows=[
                {
                    "output": "The answer is: 42",
                    "must_contain": "answer",
                },
                {
                    "output": "Result: Success",
                    "must_contain": "Success",
                },
                {
                    "output": "Error occurred",
                    "must_contain": "Success",
                },
            ]
        )
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    MetricOnlineJob,
    Model,
    StringCheckMetricParam,
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=StringCheckMetricParam(
            type="string-check",
            operation="startswith",
            left_template="{{sample.output_text}}",
            right_template="Answer:",
            description="Check if output starts with 'Answer:'",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "string-check",
      "count": 3,
      "mean": 0.667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

Custom Dataset Format#

Custom evaluations can use datasets uploaded as a Fileset. The following file formats are supported:

  • JSON (.json): Array of objects or single object

  • JSONL (.jsonl): Newline-delimited JSON objects

  • CSV (.csv): Tabular data with headers

  • Parquet (.parquet): Apache Parquet columnar format

  • ORC (.orc): Apache ORC columnar format

  • Feather / Arrow IPC (.feather, .arrow): Apache Arrow interchange format

All formats support gzip compression (e.g., data.jsonl.gz, dataset.csv.gz).
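For example, a compressed JSONL dataset can be produced with the Python standard library (the column names here are illustrative):

```python
import gzip
import json

rows = [
    {"prompt": "What is the capital of France?", "ground_truth": "Paris"},
    {"prompt": "32+10=", "ground_truth": "42"},
]

# one JSON object per line; the .jsonl.gz suffix signals gzip-compressed JSONL
with gzip.open("data.jsonl.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```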

Note

Built-in metrics and industry benchmarks (in the system workspace) run in specialized containers that have their own format requirements. See the documentation for each built-in metric type for details.

File Selection with Patterns#

When referencing a fileset dataset, you can select specific files using a # fragment:

  • workspace/fileset – load all parsable files in the fileset

  • workspace/fileset#data.csv – load a specific file

  • workspace/fileset#*.parquet – load files matching a glob pattern

  • workspace/fileset#subdir/**/*.jsonl – recursive glob pattern

When multiple files are loaded, they are concatenated into a single dataset. Files with unsupported extensions are skipped.
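The selection behavior can be approximated with the standard-library fnmatch module. This is a hypothetical helper for illustration, not the platform's matcher; in particular, fnmatch's * also crosses /, whereas the platform distinguishes * from the recursive ** pattern.

```python
from fnmatch import fnmatch

def select_files(dataset_ref, fileset_files):
    # split "workspace/fileset#pattern"; no fragment means "take everything"
    if "#" not in dataset_ref:
        return list(fileset_files)
    _, pattern = dataset_ref.split("#", 1)
    return [f for f in fileset_files if fnmatch(f, pattern)]

files = ["data.csv", "eval/a.jsonl", "eval/deep/b.jsonl", "train.parquet"]
print(select_files("my-workspace/my-fileset#*.parquet", files))  # -> ['train.parquet']
```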