Similarity Metrics#

NeMo Platform offers built-in metrics that you can configure to evaluate your custom data. You can also define your own metrics within the NeMo Platform ecosystem using similarity metrics that support templating.

Template functionality provides maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. You can bring your own datasets, define your own prompts and templates using Jinja, and select or implement the metrics that matter most for your use case. This approach is ideal when:

  • You want to evaluate on tasks, data, or formats not covered by industry benchmarks or built-in metrics.

  • You need to measure model performance using custom or business-specific criteria.

  • You want to experiment with new evaluation methodologies, metrics, or workflows.

  • You need to create custom prompts and templates for specific use cases.

import os
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Template Variables#

All similarity metrics support Jinja templating with these variables:

  • {{item}} - Access dataset columns (e.g., {{item.question}}, {{item.answer}})

  • {{sample.output_text}} - The model’s generated output (default for candidate)

  • Jinja filters: lower, upper, trim, replace, etc.
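To see what a template resolves to, you can render it locally with the jinja2 package. This is only a local sketch of the same Jinja semantics; the platform renders templates server-side against each dataset row.

```python
from jinja2 import Template  # pip install jinja2

# a dataset row, as it would appear under {{item}}
row = {"expected": "  PARIS ", "question": "What is the capital of France?"}

# filters chain left to right: lowercase first, then strip surrounding whitespace
rendered = Template("{{item.expected | lower | trim}}").render(item=row)
print(rendered)  # -> paris
```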

Use Jinja filters to normalize text before comparison:

metric = client.evaluation.metrics.create(
    name="normalized-comparison",
    workspace="my-workspace",
    type="exact-match",
    reference="{{item.expected | lower | trim}}",
    candidate="{{sample.output_text | lower | trim}}",
)

Create a Metric#

You can create metrics and store them for reuse. To use a stored metric, reference it by its unique workspace and metric name (workspace/name):

from nemo_platform.types.evaluation import BleuMetricParam, MetricOfflineJob

# First, create and store the metric
metric = client.evaluation.metrics.create(
    **BleuMetricParam(
        name="my-bleu-metric",
        type="bleu",
        references=[
            "{{item.reference_1}}",
            "{{item.reference_2}}",
        ],
        candidate="{{item.model_output}}",
        description="BLEU score for translation quality",
        supported_job_types=["offline"],
    )
)
print(f"Created metric: {metric.name}")

# Then use it in a job by metric reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric="my-workspace/my-bleu-metric",
        dataset="my-workspace/dataset-name"
    )
)

You can create and store a metric first and then reference it by name (workspace/metric-name), as shown in the example above, or you can configure the metric inline within your evaluation or job. Both approaches work for live evaluations and for metric jobs, both online and offline.

BLEU Metric#

BLEU (Bilingual Evaluation Understudy) measures the similarity between machine-generated text and reference translations by comparing n-gram overlap. It’s commonly used for evaluating machine translation and text generation tasks.

Use BLEU when:

  • Evaluating machine translation quality

  • Measuring text generation similarity to references

  • Comparing multiple reference texts

Metric Output: A score between 0 and 100, where 100 indicates perfect match with references.
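The core ingredient of BLEU, clipped n-gram precision, can be sketched in a few lines. This is an illustrative approximation only; the actual metric combines precisions for n = 1 through 4 and applies a brevity penalty.

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # clip each candidate n-gram count by its count in the reference
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

print(modified_ngram_precision("the cat is on the mat", "the cat sits on the mat", 1))
```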

from nemo_platform.types.evaluation import DatasetRows, BleuMetric

result = client.evaluation.metrics.evaluate(
    metric=BleuMetric(
        type="bleu",
        references=["{{item.reference_1}}", "{{item.reference_2}}"],
        candidate="{{item.model_output}}",
    ),
    dataset=DatasetRows(rows=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ]),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    BleuMetricParam,
    DatasetRows,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=BleuMetricParam(
            type="bleu",
            references=[
                "{{item.reference_1}}",
                "{{item.reference_2}}",
            ],
            candidate="{{item.model_output}}",
            description="BLEU score for translation quality",
        ),
        dataset=DatasetRows(rows=[
            {
                "reference_1": "The cat sits on the mat.",
                "reference_2": "A cat is sitting on the mat.",
                "model_output": "The cat is on the mat.",
            },
            {
                "reference_1": "Hello world!",
                "reference_2": "Hi world!",
                "model_output": "Hello world!",
            },
        ]),
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    BleuMetricParam,
    DatasetRows,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=BleuMetricParam(
            type="bleu",
            references=[
                "{{item.reference_1}}",
                "{{item.reference_2}}",
            ],
            description="BLEU score for translation quality",
        ),
        dataset=DatasetRows(rows=[
            {
                "prompt": "Who is sitting on the mat?",
                "reference_1": "The cat sits on the mat.",
                "reference_2": "A cat is sitting on the mat.",
            },
            {
                "prompt": "Welcome!",
                "reference_1": "Hello world!",
                "reference_2": "Hi world!",
            },
        ]),
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "sentence",
      "count": 2,
      "mean": 76.86,
      "min": 53.73,
      "max": 100.00
    },
    {
      "name": "corpus",
      "count": 1,
      "mean": 53.895
    }
  ]
}

Exact Match Metric#

Exact Match compares the candidate text with the reference text for exact equality. The metric returns 1 if the strings match exactly (after normalization) and 0 otherwise.

Use Exact Match when:

  • Evaluating classification tasks with discrete labels

  • Checking for exact answer correctness

  • Validating structured output formats

Metric Output: Binary score (0 or 1).

from nemo_platform.types.evaluation import (
    DatasetRows,
    ExactMatchMetric
)

result = client.evaluation.metrics.evaluate(
    metric=ExactMatchMetric(
        type="exact-match",
        reference="{{item.correct_answer}}",
        candidate="{{item.model_answer}}",
        description="Exact match for question answering",
    ),
    dataset=DatasetRows(rows=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "42", "model_answer": "43"},
        {"correct_answer": "True", "model_answer": "true"},
    ]),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    ExactMatchMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=ExactMatchMetricParam(
            type="exact-match",
            reference="{{item.correct_answer}}",
            candidate="{{item.model_answer}}",
            description="Exact match for question answering",
        ),
        dataset="my-workspace/my-dataset",
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    ExactMatchMetricParam,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=ExactMatchMetricParam(
            type="exact-match",
            reference="{{item.correct_answer}}",
            candidate="{{item.model_answer}}",
            description="Exact match for question answering",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "exact-match",
      "count": 3,
      "mean": 0.6667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

F1 Metric#

F1 score measures token-level overlap between the candidate and reference text using precision and recall. It’s the harmonic mean of precision (fraction of candidate tokens in reference) and recall (fraction of reference tokens in candidate).

Use F1 when:

  • Evaluating question answering systems

  • Measuring partial correctness in text generation

  • Assessing information extraction tasks

Metric Output: Score between 0 and 1, where 1 indicates perfect overlap.
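As an illustration, token-level F1 over whitespace tokens can be computed like this. This is a sketch; the platform's tokenization and normalization may differ.

```python
from collections import Counter

def token_f1(candidate, reference):
    # harmonic mean of precision and recall over whitespace-separated tokens
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)  # candidate tokens found in the reference
    recall = overlap / len(ref)      # reference tokens found in the candidate
    return 2 * precision * recall / (precision + recall)

print(token_f1("The answer is 42", "42"))  # -> 0.4
```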

from nemo_platform.types.evaluation import (
    DatasetRows,
    F1Metric
)

result = client.evaluation.metrics.evaluate(
    metric=F1Metric(
        type="f1",
        reference="{{item.ground_truth}}",
        candidate="{{item.predicted}}",
        description="F1 score for question answering",
    ),
    dataset=DatasetRows(rows=[
        {
            "ground_truth": "The capital of France is Paris",
            "predicted": "Paris is the capital of France",
        },
        {
            "ground_truth": "42",
            "predicted": "The answer is 42",
        },
    ]),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    F1MetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=F1MetricParam(
            type="f1",
            reference="{{item.ground_truth}}",
            candidate="{{item.predicted}}",
            description="F1 score for question answering",
        ),
        dataset=DatasetRows(rows=[
            {
                "ground_truth": "The capital of France is Paris",
                "predicted": "Paris is the capital of France",
            },
            {
                "ground_truth": "42",
                "predicted": "The answer is 42",
            },
        ]),
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    DatasetRows,
    F1MetricParam,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=F1MetricParam(
            type="f1",
            reference="{{item.ground_truth}}",
            candidate="{{sample.output_text}}",
            description="F1 score for question answering",
        ),
        dataset=DatasetRows(rows=[
            {
                "prompt": "What is the capital of France?",
                "ground_truth": "The capital of France is Paris",
            },
            {
                "prompt": "32+10=",
                "ground_truth": "42",
            },
        ]),
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "f1_score",
      "count": 2,
      "mean": 0.75,
      "min": 0.5,
      "max": 1.0
    }
  ]
}

Number Check Metric#

Number Check performs numerical comparisons on extracted values. It supports equality, inequality, ordering comparisons, and absolute-difference checks.

Use Number Check when:

  • Validating numerical outputs (calculations, counts, scores)

  • Checking value ranges or thresholds

  • Comparing predicted vs expected numbers

Metric Output: 1 if the condition is true, 0 otherwise. For absolute difference, the condition is true when the absolute difference between the two values is at most epsilon.

Supported Operations#

  • Equality: "equals", "=="

  • Inequality: "!=", "<>", "not equals"

  • Comparisons: ">", "gt", ">=", "gte", "<", "lt", "<=", "lte"

  • Absolute difference: "absolute difference" (requires epsilon parameter)
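The operations above can be sketched as follows. This is a hypothetical reference implementation to illustrate the semantics, not the platform's code; note that templates render to strings, so values are parsed as floats first.

```python
def number_check(left, right, operation, epsilon=None):
    # templates render to strings, so parse both sides numerically
    l, r = float(left), float(right)
    if operation == "absolute difference":
        # condition holds when the values agree within the given tolerance
        return int(abs(l - r) <= float(epsilon))
    ops = {
        "equals": l == r, "==": l == r,
        "!=": l != r, "<>": l != r, "not equals": l != r,
        ">": l > r, "gt": l > r, ">=": l >= r, "gte": l >= r,
        "<": l < r, "lt": l < r, "<=": l <= r, "lte": l <= r,
    }
    return int(ops[operation])

print(number_check("42.5", "42.3", "absolute difference", epsilon="0.5"))  # -> 1
```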

from nemo_platform.types.evaluation import NumberCheckMetricParam, DatasetRows

result = client.evaluation.metrics.evaluate(
    metric=NumberCheckMetricParam(
        type="number-check",
        operation="absolute difference",
        epsilon="0.5",
        left_template="{{item.expected}}",
        right_template="{{item.predicted}}",
        description="Check if values match within tolerance",
    ),
    dataset=DatasetRows(
        rows=[
            {"expected": "100", "predicted": "100"},
            {"expected": "42.5", "predicted": "42.3"},
            {"expected": "99", "predicted": "101"},
        ]
    ),
)
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    NumberCheckMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=NumberCheckMetricParam(
            type="number-check",
            left_template="{{item.predicted}}",
            operation=">",
            right_template="0.5",
            description="Score must be greater than 0.5",
        ),
        dataset=DatasetRows(
            rows=[
                {"predicted": "1"},
                {"predicted": "0.75"},
                {"predicted": "0.5"},
                {"predicted": "0.1"},
            ]
        )
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    NumberCheckMetricParam,
    MetricOnlineJob,
    Model
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=NumberCheckMetricParam(
            type="number-check",
            left_template="{{sample.output_text}}",
            operation=">",
            right_template="0.5",
            description="Score must be greater than 0.5",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "number-check",
      "count": 3,
      "mean": 0.6667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

ROUGE Metric#

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated and reference summaries. It computes ROUGE-1 (unigram), ROUGE-2 (bigram), ROUGE-3 (trigram), and ROUGE-L (longest common subsequence) scores.

Use ROUGE when:

  • Evaluating text summarization

  • Measuring content preservation in paraphrasing

  • Assessing abstractive text generation

Metric Output: Multiple scores (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L), each with precision, recall, and F1 values.
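For intuition, ROUGE-L's F1 over the longest common subsequence of tokens can be sketched as follows (illustrative only; the platform's tokenization may differ):

```python
def lcs_length(a, b):
    # standard dynamic-programming longest common subsequence over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1(
    "global weather is impacted by climate change",
    "climate change affects global weather patterns",
), 3))  # -> 0.308
```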

from nemo_platform.types.evaluation import RougeMetricParam, DatasetRows

result = client.evaluation.metrics.evaluate(
    metric=RougeMetricParam(
            type="rouge",
            reference="{{item.reference_summary | lower}}",
            candidate="{{item.generated_summary | lower}}",
            description="ROUGE with normalized text",
        ),
    dataset=DatasetRows(
        rows=[
            {
                "reference_summary": "AI is transforming healthcare.",
                "generated_summary": "Artificial intelligence transforms medical care.",
            },
            {
                "reference_summary": "Climate change affects global weather patterns.",
                "generated_summary": "Global weather is impacted by climate change.",
            },
        ]
    )
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    RougeMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=RougeMetricParam(
            type="rouge",
            reference="{{item.reference_summary | lower}}",
            candidate="{{item.generated_summary | lower}}",
            description="ROUGE with normalized text",
        ),
        dataset=DatasetRows(
            rows=[
                {
                    "reference_summary": "AI is transforming healthcare.",
                    "generated_summary": "Artificial intelligence transforms medical care.",
                },
                {
                    "reference_summary": "Climate change affects global weather patterns.",
                    "generated_summary": "Global weather is impacted by climate change.",
                },
            ]
        )
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    MetricOnlineJob,
    Model,
    RougeMetricParam,
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=RougeMetricParam(
            type="rouge",
            reference="{{item.reference_summary | lower}}",
            candidate="{{sample.output_text | lower}}",
            description="ROUGE with normalized text",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "rouge_1_score",
      "count": 2,
      "mean": 0.419,
      "min": 0.222,
      "max": 0.615
    },
    {
      "name": "rouge_2_score",
      "count": 2,
      "mean": 0.273,
      "min": 0.182,
      "max": 0.363
    },
    {
      "name": "rouge_3_score",
      "count": 2,
      "mean": 0.0,
      "min": 0.0,
      "max": 0.0
    },
    {
      "name": "rouge_L_score",
      "count": 2,
      "mean": 0.265,
      "min": 0.222,
      "max": 0.308
    }
  ]
}

String Check Metric#

String Check performs string comparisons on templated values. It supports equality, containment, and pattern-matching operations.

Use String Check when:

  • Validating text format or structure

  • Checking for keyword presence

  • Pattern matching in generated text

  • String-based classification

Metric Output: Binary score (1 if condition is true, 0 otherwise).

Supported Operations#

  • Equality: "equals", "=="

  • Inequality: "!=", "<>", "not equals"

  • Containment: "contains", "not contains"

  • Pattern: "startswith", "endswith"
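The operations above can be sketched as follows (a hypothetical reference implementation to illustrate the semantics, not the platform's code):

```python
def string_check(left, right, operation):
    # each operation maps to a plain Python string comparison
    ops = {
        "equals": left == right, "==": left == right,
        "!=": left != right, "<>": left != right, "not equals": left != right,
        "contains": right in left, "not contains": right not in left,
        "startswith": left.startswith(right), "endswith": left.endswith(right),
    }
    return int(ops[operation])

print(string_check("The answer is: 42", "answer", "contains"))  # -> 1
```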

from nemo_platform.types.evaluation import DatasetRows, StringCheckMetricParam

result = client.evaluation.metrics.evaluate(
    metric=StringCheckMetricParam(
            type="string-check",
            operation="contains",
            left_template="{{item.output | trim}}",
            right_template="{{item.must_contain}}",
        ),
    dataset=DatasetRows(
        rows=[
            {
                "output": "The answer is: 42",
                "must_contain": "answer",
            },
            {
                "output": "Result: Success",
                "must_contain": "Success",
            },
            {
                "output": "Error occurred",
                "must_contain": "Success",
            },
        ]
    )
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

To run the metric as an offline job:

from nemo_platform.types.evaluation import (
    DatasetRows,
    StringCheckMetricParam,
    MetricOfflineJob
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJob(
        metric=StringCheckMetricParam(
            type="string-check",
            operation="contains",
            left_template="{{item.output | trim}}",
            right_template="{{item.must_contain}}",
        ),
        dataset=DatasetRows(
            rows=[
                {
                    "output": "The answer is: 42",
                    "must_contain": "answer",
                },
                {
                    "output": "Result: Success",
                    "must_contain": "Success",
                },
                {
                    "output": "Error occurred",
                    "must_contain": "Success",
                },
            ]
        )
    )
)

print(f"Job created: {job.name}")

To run an online job, where the platform generates model outputs before scoring:

from nemo_platform.types.evaluation import (
    MetricOnlineJob,
    Model,
    StringCheckMetricParam,
)

# Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJob(
        metric=StringCheckMetricParam(
            type="string-check",
            operation="startswith",
            left_template="{{sample.output_text}}",
            right_template="Answer:",
            description="Check if output starts with 'Answer:'",
        ),
        dataset="my-workspace/my-dataset",
        model=Model(
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            url="<inference-url>/v1/chat/completions",
            api_key_secret="<optional-inference-auth-key>",
        ),
        prompt_template={
            "messages": [{
                "role": "user",
                "content": "{{item.prompt}}"
            }]
        }
    )
)

print(f"Job created: {job.name}")

Example output:

{
  "scores": [
    {
      "name": "string-check",
      "count": 3,
      "mean": 0.667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

Custom Dataset Format#

Custom evaluations can use datasets uploaded as a Fileset. The following file formats are supported:

  • JSON (.json): Array of objects or single object

  • JSONL (.jsonl): Newline-delimited JSON objects

  • CSV (.csv): Tabular data with headers

  • Parquet (.parquet): Apache Parquet columnar format

  • ORC (.orc): Apache ORC columnar format

  • Feather / Arrow IPC (.feather, .arrow): Apache Arrow interchange format

All formats support gzip compression (e.g., data.jsonl.gz, dataset.csv.gz).
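For example, a compressed JSONL dataset can be produced with the Python standard library (the column names here are illustrative):

```python
import gzip
import json

rows = [
    {"prompt": "What is the capital of France?", "ground_truth": "Paris"},
    {"prompt": "32+10=", "ground_truth": "42"},
]

# one JSON object per line; the .jsonl.gz suffix signals gzip-compressed JSONL
with gzip.open("data.jsonl.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```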

Note

Built-in metrics and industry benchmarks (in the system workspace) run in specialized containers that have their own format requirements. See the documentation for each built-in metric type for details.

File Selection with Patterns#

When referencing a fileset dataset, you can select specific files using a # fragment:

  • workspace/fileset – load all parsable files in the fileset

  • workspace/fileset#data.csv – load a specific file

  • workspace/fileset#*.parquet – load files matching a glob pattern

  • workspace/fileset#subdir/**/*.jsonl – recursive glob pattern

When multiple files are loaded, they are concatenated into a single dataset. Files with unsupported extensions are skipped.
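The selection behavior can be approximated with the standard-library fnmatch module. This is a hypothetical helper for illustration, not the platform's matcher; in particular, fnmatch's * also crosses /, whereas the platform distinguishes * from the recursive ** pattern.

```python
from fnmatch import fnmatch

def select_files(dataset_ref, fileset_files):
    # split "workspace/fileset#pattern"; no fragment means "take everything"
    if "#" not in dataset_ref:
        return list(fileset_files)
    _, pattern = dataset_ref.split("#", 1)
    return [f for f in fileset_files if fnmatch(f, pattern)]

files = ["data.csv", "eval/a.jsonl", "eval/deep/b.jsonl", "train.parquet"]
print(select_files("my-workspace/my-fileset#*.parquet", files))  # -> ['train.parquet']
```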