Similarity Metrics Evaluation Type#
Similarity Metrics evaluation evaluates a model on custom datasets by comparing each LLM-generated response with a ground-truth response. Use this evaluation type for tasks where outputs can be compared directly to ground truth using metrics such as accuracy, BLEU, ROUGE, exact match (EM), and F1.
The following example configures a similarity_metrics evaluation with all five metrics:
```json
{
  "type": "similarity_metrics",
  "name": "similarity-metrics-basic",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 200,
    "temperature": 0.7,
    "extra": {
      "top_k": 20
    }
  },
  "tasks": {
    "my-similarity-metrics-task": {
      "type": "default",
      "dataset": {
        "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
      },
      "metrics": {
        "accuracy": {"type": "accuracy"},
        "bleu": {"type": "bleu"},
        "rouge": {"type": "rouge"},
        "em": {"type": "em"},
        "f1": {"type": "f1"}
      }
    }
  }
}
```
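Before submitting a configuration, it can help to sanity-check its structure. The sketch below is purely illustrative and is not part of the evaluator's API; the file name is a placeholder, and the checks simply mirror the example configuration above:

```python
import json

# Hypothetical file name; adjust to wherever you saved the config above.
CONFIG_PATH = "similarity_metrics_config.json"

def validate_config(path: str) -> None:
    """Lightweight structural check of a similarity_metrics config (illustrative only)."""
    with open(path) as f:
        cfg = json.load(f)

    # Top-level fields used in the example configuration.
    assert cfg.get("type") == "similarity_metrics", "type must be 'similarity_metrics'"
    assert cfg.get("tasks"), "at least one task is required"

    supported = {"accuracy", "bleu", "rouge", "em", "f1"}
    for task_name, task in cfg["tasks"].items():
        assert "dataset" in task and "files_url" in task["dataset"], (
            f"task '{task_name}' needs dataset.files_url"
        )
        for metric_name, metric in task.get("metrics", {}).items():
            assert metric.get("type") in supported, (
                f"unsupported metric type for '{metric_name}': {metric.get('type')}"
            )

if __name__ == "__main__":
    validate_config(CONFIG_PATH)
    print("config looks structurally valid")
```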
For example, the following record scores 1.0 on every metric, because the generated output exactly matches the reference:
```json
{
  "input": "What is the capital of France?",
  "reference": "Paris",
  "output": "Paris"
}
```
The corresponding evaluation results:
```json
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "accuracy": {"value": 1.0},
            "bleu_score": {"value": 1.0},
            "rouge_1_score": {"value": 1.0},
            "em_score": {"value": 1.0},
            "f1_score": {"value": 1.0}
          }
        }
      }
    }
  }
}
```
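If you consume results programmatically, the nesting shown above (groups → metrics → scores) can be flattened with a few loops. A minimal sketch, assuming the results are saved to a local file (the path is a placeholder):

```python
import json

# Hypothetical path; substitute the actual location of your results file.
RESULTS_PATH = "results.json"

def extract_scores(results: dict) -> dict:
    """Flatten the nested groups/metrics/scores structure into {score_name: value}."""
    flat = {}
    for group in results.get("groups", {}).values():
        for metric in group.get("metrics", {}).values():
            for name, score in metric.get("scores", {}).items():
                flat[name] = score["value"]
    return flat

if __name__ == "__main__":
    with open(RESULTS_PATH) as f:
        results = json.load(f)
    for name, value in extract_scores(results).items():
        print(f"{name}: {value:.3f}")
```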
Metrics#
| Metric | Config Key | Description | Result Naming / Notes |
|---|---|---|---|
| Accuracy | `accuracy` | Fraction of predictions that exactly match the reference. | Reported as `accuracy`. |
| BLEU | `bleu` | Bilingual Evaluation Understudy score, for translation and text generation. | Reported as `bleu_score`. |
| ROUGE | `rouge` | Recall-Oriented Understudy for Gisting Evaluation. Includes sub-scores (e.g., ROUGE-1, ROUGE-2, ROUGE-L). | Reported as `rouge_1_score` and related sub-scores. |
| Exact Match (EM) | `em` | Percentage of predictions that match the reference exactly. | Reported as `em_score`. |
| F1 Score | `f1` | Harmonic mean of precision and recall, often used for QA and classification. | Reported as `f1_score`. |
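To build intuition for how EM and F1 behave on a single prediction/reference pair, here is an illustrative sketch of the standard exact-match and token-level F1 computations. It is not necessarily the evaluator's exact implementation; normalization details (casing, punctuation, tokenization) may differ:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.strip().lower().split()
    ref_tokens = reference.strip().lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris"))               # 1.0
print(token_f1("the capital is Paris", "Paris"))   # 0.4
```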
Custom Dataset Format#
input.json#
The `input.json` file contains a list of input data dictionaries (key/value pairs).
An input data dictionary can have the following fields:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `prompt` | string | Yes | — | The prompt supplied to the model for inference. |
| `ideal_response` | string | Yes | — | The ideal ground-truth response for this prompt. |
| `category` | string | Yes | `""` | Category metadata for this input data dictionary. Defaults to an empty string. |
| `source` | string | No | `""` | Source metadata for this input data dictionary. Optional; defaults to an empty string. |
Sample `input.json` with two entries:
```json
[
  {
    "prompt": "prompt 1",
    "ideal_response": "ideal response 1",
    "category": "",
    "source": ""
  },
  {
    "prompt": "prompt 2",
    "ideal_response": "ideal response 2",
    "category": "",
    "source": ""
  }
]
```
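A file in this format can be generated with a few lines of code. The sketch below is illustrative; the records and output path are placeholders:

```python
import json

# Placeholder records; replace with your own prompts and ground-truth answers.
records = [
    {"prompt": "What is the capital of France?", "ideal_response": "Paris", "category": "", "source": ""},
    {"prompt": "What is 2 + 2?", "ideal_response": "4", "category": "", "source": ""},
]

with open("input.json", "w") as f:
    json.dump(records, f, indent=2)
```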
output.json#
The `output.json` file contains a list of output data dictionaries.
Note
The `llm_name` field may be set to `"offline"` if the evaluation is run in offline mode.
An output data dictionary has the following fields:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `input` | object | Yes | — | The input data dictionary, as described above. |
| `response` | string | Yes | — | The response or prediction string generated by the LLM, corresponding to the input. |
| `llm_name` | string | Yes | — | The name of the LLM that generated the response. This matches the model name provided in the evaluation API call. |
Sample `output.json` with two entries:
```json
[
  {
    "input": {
      "prompt": "prompt 1",
      "ideal_response": "response 1",
      "category": "",
      "source": ""
    },
    "response": "generated response 1",
    "llm_name": "llm name"
  },
  {
    "input": {
      "prompt": "prompt 2",
      "ideal_response": "response 2",
      "category": "",
      "source": ""
    },
    "response": "generated response 2",
    "llm_name": "llm name"
  }
]
```
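Before running an evaluation in offline mode, you may want to confirm that every entry in `output.json` carries the required fields and that each response lines up with its ground truth. A minimal sketch, assuming the file above is saved locally (the path is a placeholder):

```python
import json

# Placeholder path; point this at your generated output.json.
OUTPUT_PATH = "output.json"

with open(OUTPUT_PATH) as f:
    outputs = json.load(f)

matches = 0
for entry in outputs:
    # Required fields per the table above.
    assert {"input", "response", "llm_name"} <= entry.keys()
    reference = entry["input"]["ideal_response"]
    prediction = entry["response"]
    matches += int(prediction.strip().lower() == reference.strip().lower())

print(f"exact-match rate: {matches / len(outputs):.2%} over {len(outputs)} entries")
```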