Similarity Metrics Evaluation Type#

Similarity metrics evaluation lets you evaluate a model on a custom dataset by comparing the LLM-generated response with a ground-truth response. Use this evaluation type for tasks whose outputs can be compared directly to a ground truth using metrics such as BLEU, ROUGE, accuracy, exact match (EM), and F1.

The following example configuration runs a similarity metrics evaluation over a custom dataset and enables all five supported metrics:

{
    "type": "similarity_metrics",
    "name": "similarity-metrics-basic",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 200,
        "temperature": 0.7,
        "extra": {
            "top_k": 20
        }
    },
    "tasks": {
        "my-similarity-metrics-task": {
            "type": "default",
            "dataset": {
                "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
            },
            "metrics": {
                "accuracy": {"type": "accuracy"},
                "bleu": {"type": "bleu"},
                "rouge": {"type": "rouge"},
                "em": {"type": "em"},
                "f1": {"type": "f1"}
            }
        }
    }
}

Each evaluated example combines an input, a ground-truth reference, and the model's output, for example:

{
  "input": "What is the capital of France?",
  "reference": "Paris",
  "output": "Paris"
}

The evaluation results report one score for each configured metric, for example:

{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "accuracy": {"value": 1.0},
            "bleu_score": {"value": 1.0},
            "rouge_1_score": {"value": 1.0},
            "em_score": {"value": 1.0},
            "f1_score": {"value": 1.0}
          }
        }
      }
    }
  }
}
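
As a quick illustration, the snippet below walks the nested groups → metrics → scores structure of a results payload shaped like the example above and prints each score. The results.json filename is only an assumption for this sketch.

import json

# Load a results payload shaped like the example above.
# The filename "results.json" is an assumption for this sketch.
with open("results.json") as f:
    results = json.load(f)

# Walk groups -> metrics -> scores and print each metric value.
for group_name, group in results["groups"].items():
    for metric_name, metric in group["metrics"].items():
        for score_name, score in metric["scores"].items():
            print(f"{group_name}/{metric_name}/{score_name}: {score['value']}")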

Metrics#

Metric | Config Key | Description | Result Naming / Notes
--- | --- | --- | ---
Accuracy | accuracy | Fraction of predictions that exactly match the reference. | accuracy
BLEU | bleu | Bilingual Evaluation Understudy score, used for translation and text generation. | bleu_score; may also be reported at the corpus level
ROUGE | rouge | Recall-Oriented Understudy for Gisting Evaluation. Includes sub-scores (ROUGE-1, ROUGE-2, ROUGE-L). | rouge_1_score, rouge_2_score, rouge_L_score, etc.
Exact Match (EM) | em | Percentage of predictions that match the reference exactly. | em_score
F1 Score | f1 | Harmonic mean of precision and recall, often used for QA and classification. | f1_score

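To make the exact match and F1 columns concrete, here is a minimal, illustrative sketch of how these two scores are commonly computed for a single prediction/reference pair. It is not the service's actual implementation; normalization and tokenization details vary between implementations.

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not pred_tokens or not ref_tokens or not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris"))              # 1.0
print(token_f1("The capital is Paris", "Paris"))  # 0.4
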
Custom Dataset Format#

input.json#

The input.json file is a JSON file containing a list of input data dictionaries (key/value pairs).

An input data dictionary can have the following fields:

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
prompt | string | Yes | | The prompt supplied to the model for inference.
ideal_response | string | Yes | | The ideal ground truth response for this prompt.
category | string | Yes | "" | Category metadata for this input data dictionary.
source | string | No | "" | Source metadata for this input data dictionary.

Sample input.json with two entries:

[
    {
        "prompt":"prompt 1",
        "ideal_response": "ideal response 1",
        "category": "",
        "source": ""
    },
    {
        "prompt":"prompt 2",
        "ideal_response": "ideal response 2",
        "category": "",
        "source": ""
    }
]
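
If you are assembling the file programmatically, a small helper along the following lines can produce a valid input.json. The example prompts and the output path are assumptions for this sketch.

import json

# Example (prompt, ideal_response) pairs; replace with your own data.
pairs = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

entries = [
    {"prompt": p, "ideal_response": r, "category": "", "source": ""}
    for p, r in pairs
]

# Write the list of input data dictionaries as input.json.
with open("input.json", "w") as f:
    json.dump(entries, f, indent=4)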

output.json#

The output.json file is a JSON file containing a list of output data dictionaries.

Note

The llm_name field may be set to "offline" if the evaluation is run in offline mode.

An output data dictionary has the following fields:

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
input | object | Yes | | The input data dictionary, as described above.
response | string | Yes | | The response or prediction string generated by the LLM, corresponding to the input.
llm_name | string | Yes | | The name of the LLM that generated the response. This matches the model name provided in the evaluation API call.

Sample output.json with two entries:

[
    {
        "input": {
            "prompt": "prompt 1", 
            "ideal_response": "response 1",
            "category": "",
            "source": ""
        },
        "response": "generated response 1",
        "llm_name": "llm name"
    },
    {
        "input": {
            "prompt": "prompt 2", 
            "ideal_response": "response 2",
            "category": "",
            "source": ""
        },
        "response": "generated response 2",
        "llm_name": "llm name"
    }
]
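
For reference, the following sketch shows one way to produce an output.json from an existing input.json. The generate_response function and the "my-model" name are placeholders standing in for whatever inference call and model you actually use.

import json

def generate_response(prompt: str) -> str:
    # Placeholder: call your model or inference endpoint here.
    return "generated response for: " + prompt

with open("input.json") as f:
    inputs = json.load(f)

outputs = [
    {
        "input": entry,  # the full input data dictionary
        "response": generate_response(entry["prompt"]),
        "llm_name": "my-model",  # placeholder; use "offline" when running in offline mode
    }
    for entry in inputs
]

with open("output.json", "w") as f:
    json.dump(outputs, f, indent=4)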