Similarity Metrics Evaluation Type#
Similarity Metrics evaluation evaluates a model on custom datasets by comparing each LLM-generated response with a ground-truth response. Use this evaluation type for tasks where outputs can be compared directly to ground truth using metrics such as accuracy, BLEU, ROUGE, exact match (EM), and F1.
The following example configures a similarity_metrics evaluation with all five metrics:
```json
{
  "type": "similarity_metrics",
  "name": "similarity-metrics-basic",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 200,
    "temperature": 0.7,
    "extra": {
      "top_k": 20
    }
  },
  "tasks": {
    "my-similarity-metrics-task": {
      "type": "default",
      "dataset": {
        "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>/<my-dataset-file-path>"
      },
      "metrics": {
        "accuracy": {"type": "accuracy"},
        "bleu": {"type": "bleu"},
        "rouge": {"type": "rouge"},
        "em": {"type": "em"},
        "f1": {"type": "f1"}
      }
    }
  }
}
```
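Before submitting a configuration, it can help to sanity-check its structure. The sketch below is purely illustrative and is not part of the evaluator's API; the file name is a placeholder, and the checks simply mirror the example configuration above:

```python
import json

# Hypothetical file name; adjust to wherever you saved the config above.
CONFIG_PATH = "similarity_metrics_config.json"

def validate_config(path: str) -> None:
    """Lightweight structural check of a similarity_metrics config (illustrative only)."""
    with open(path) as f:
        cfg = json.load(f)

    # Top-level fields used in the example configuration.
    assert cfg.get("type") == "similarity_metrics", "type must be 'similarity_metrics'"
    assert cfg.get("tasks"), "at least one task is required"

    supported = {"accuracy", "bleu", "rouge", "em", "f1"}
    for task_name, task in cfg["tasks"].items():
        assert "dataset" in task and "files_url" in task["dataset"], (
            f"task '{task_name}' needs dataset.files_url"
        )
        for metric_name, metric in task.get("metrics", {}).items():
            assert metric.get("type") in supported, (
                f"unsupported metric type for '{metric_name}': {metric.get('type')}"
            )

if __name__ == "__main__":
    validate_config(CONFIG_PATH)
    print("config looks structurally valid")
```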
For example, the following record scores 1.0 on every metric, because the generated output exactly matches the reference:
```json
{
  "input": "What is the capital of France?",
  "reference": "Paris",
  "output": "Paris"
}
```
The corresponding evaluation results:
```json
{
  "groups": {
    "evaluation": {
      "metrics": {
        "evaluation": {
          "scores": {
            "accuracy": {"value": 1.0},
            "bleu_score": {"value": 1.0},
            "rouge_1_score": {"value": 1.0},
            "em_score": {"value": 1.0},
            "f1_score": {"value": 1.0}
          }
        }
      }
    }
  }
}
```
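If you consume results programmatically, the nesting shown above (groups → metrics → scores) can be flattened with a few loops. A minimal sketch, assuming the results are saved to a local file (the path is a placeholder):

```python
import json

# Hypothetical path; substitute the actual location of your results file.
RESULTS_PATH = "results.json"

def extract_scores(results: dict) -> dict:
    """Flatten the nested groups/metrics/scores structure into {score_name: value}."""
    flat = {}
    for group in results.get("groups", {}).values():
        for metric in group.get("metrics", {}).values():
            for name, score in metric.get("scores", {}).items():
                flat[name] = score["value"]
    return flat

if __name__ == "__main__":
    with open(RESULTS_PATH) as f:
        results = json.load(f)
    for name, value in extract_scores(results).items():
        print(f"{name}: {value:.3f}")
```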
Metrics#
| Metric | Config Key | Description | Result Naming / Notes |
|---|---|---|---|
| Accuracy | `accuracy` | Fraction of predictions that exactly match the reference. | Reported as `accuracy`. |
| BLEU | `bleu` | Bilingual Evaluation Understudy score, for translation and text generation. | Reported as `bleu_score`. |
| ROUGE | `rouge` | Recall-Oriented Understudy for Gisting Evaluation. Includes sub-scores (e.g., ROUGE-1, ROUGE-2, ROUGE-L). | Reported as `rouge_1_score` and related sub-scores. |
| Exact Match (EM) | `em` | Percentage of predictions that match the reference exactly. | Reported as `em_score`. |
| F1 Score | `f1` | Harmonic mean of precision and recall, often used for QA and classification. | Reported as `f1_score`. |
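To build intuition for how EM and F1 behave on a single prediction/reference pair, here is an illustrative sketch of the standard exact-match and token-level F1 computations. It is not necessarily the evaluator's exact implementation; normalization details (casing, punctuation, tokenization) may differ:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.strip().lower().split()
    ref_tokens = reference.strip().lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris"))               # 1.0
print(token_f1("the capital is Paris", "Paris"))   # 0.4
```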
Custom Dataset Format#
input.json#
The `input.json` file contains a list of input data dictionaries (key/value pairs).
An input data dictionary can have the following fields:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `prompt` | string | Yes | — | The prompt supplied to the model for inference. |
| `ideal_response` | string | Yes | — | The ideal ground-truth response for this prompt. |
| `category` | string | Yes | `""` | Category metadata for this input data dictionary. Defaults to an empty string. |
| `source` | string | No | `""` | Source metadata for this input data dictionary. Optional; defaults to an empty string. |
Sample `input.json` with two entries:
```json
[
  {
    "prompt": "prompt 1",
    "ideal_response": "ideal response 1",
    "category": "",
    "source": ""
  },
  {
    "prompt": "prompt 2",
    "ideal_response": "ideal response 2",
    "category": "",
    "source": ""
  }
]
```
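A file in this format can be generated with a few lines of code. The sketch below is illustrative; the records and output path are placeholders:

```python
import json

# Placeholder records; replace with your own prompts and ground-truth answers.
records = [
    {"prompt": "What is the capital of France?", "ideal_response": "Paris", "category": "", "source": ""},
    {"prompt": "What is 2 + 2?", "ideal_response": "4", "category": "", "source": ""},
]

with open("input.json", "w") as f:
    json.dump(records, f, indent=2)
```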
output.json#
The `output.json` file contains a list of output data dictionaries.
Note
The `llm_name` field may be set to `"offline"` if the evaluation is run in offline mode.
An output data dictionary has the following fields:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `input` | object | Yes | — | The input data dictionary, as described above. |
| `response` | string | Yes | — | The response or prediction string generated by the LLM, corresponding to the input. |
| `llm_name` | string | Yes | — | The name of the LLM that generated the response. This matches the model name provided in the evaluation API call. |
Sample `output.json` with two entries:
```json
[
  {
    "input": {
      "prompt": "prompt 1",
      "ideal_response": "response 1",
      "category": "",
      "source": ""
    },
    "response": "generated response 1",
    "llm_name": "llm name"
  },
  {
    "input": {
      "prompt": "prompt 2",
      "ideal_response": "response 2",
      "category": "",
      "source": ""
    },
    "response": "generated response 2",
    "llm_name": "llm name"
  }
]
```
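Before running an evaluation in offline mode, you may want to confirm that every entry in `output.json` carries the required fields and that each response lines up with its ground truth. A minimal sketch, assuming the file above is saved locally (the path is a placeholder):

```python
import json

# Placeholder path; point this at your generated output.json.
OUTPUT_PATH = "output.json"

with open(OUTPUT_PATH) as f:
    outputs = json.load(f)

matches = 0
for entry in outputs:
    # Required fields per the table above.
    assert {"input", "response", "llm_name"} <= entry.keys()
    reference = entry["input"]["ideal_response"]
    prediction = entry["response"]
    matches += int(prediction.strip().lower() == reference.strip().lower())

print(f"exact-match rate: {matches / len(outputs):.2%} over {len(outputs)} entries")
```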