Use Custom Data with NVIDIA NeMo Evaluator#
You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets that are formatted as `json`, `jsonl`, or `csv`.
Currently, NeMo Evaluator supports custom data for the following evaluation types:

- Similarity Metrics Evaluations
- LLM-as-a-Judge Evaluations
- Retriever Evaluations
- RAG Evaluations
Upload Your Custom Data#
You must upload custom datasets to NeMo Data Store. Custom datasets are referred to as `hf://datasets/<Dataset Namespace>/<Dataset Name>` in NeMo Evaluator evaluation targets and evaluation configurations. To upload datasets, you can use either the Hugging Face Hub client or the NeMo Data Store APIs.
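For illustration only, here is a hypothetical sketch of how such a URL might appear in an evaluation configuration. The surrounding structure and the `files_url` field name are assumptions here, not confirmed by this page; consult the evaluation configuration reference for your evaluation type.

```python
# Hypothetical sketch of referencing an uploaded dataset in an evaluation
# configuration. The structure and the "files_url" field name are assumptions;
# check the configuration reference for your evaluation type.
evaluation_config = {
    "type": "similarity_metrics",
    "tasks": {
        "my-task": {
            "dataset": {
                # hf://datasets/<Dataset Namespace>/<Dataset Name>
                "files_url": "hf://datasets/my-namespace/my-prompt-dataset",
            },
        },
    },
}
```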
The following sections walk through uploading a custom dataset by using the Hugging Face Hub client, step by step.
Uploading Custom Datasets#
To run evaluations on custom datasets, you must first upload the data to the NeMo Data Store. Follow the steps below to upload your custom dataset.
(1) Install `huggingface_hub`#
NVIDIA NeMo Data Store is compatible with the Hugging Face Hub API. Let’s install this library via `pip` and import it:
```python
# Install the library
%pip install huggingface_hub

# Import
import huggingface_hub as hh
```
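The repository-creation code later on this page notes that it expects `huggingface_hub` version 0.26.2 or newer, so it can be worth confirming the installed version first:

```python
import huggingface_hub

# The repository-creation example below assumes huggingface_hub >= 0.26.2.
print(huggingface_hub.__version__)
```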
(2) Prepare the dataset for upload#
Let’s create a sample JSON dataset of prompt/response pairs to upload. Each entry in the dataset should contain the `"prompt"` key, and can optionally contain `"ideal_response"` and `"category"` fields.

Use the following Python code to create a sample dataset to use in the following examples.
```python
import json

my_prompt_dataset = [
    {
        "prompt": "Where is the Pacific Ocean?",
        "ideal_response": "The Pacific Ocean is located between Asia and Australia to the east, and the Americas to the west.",
        "category": "Closed QA"
    },
    {
        "prompt": "What are some places to visit in Paris?",
        "ideal_response": "Some popular places to visit in Paris include the Eiffel Tower, the Louvre Museum and Notre-Dame Cathedral",
        "category": "Generation"
    },
    {
        "prompt": "Summarize this article's main points in 3-4 sentences:\n\nThe article discusses the impact of climate change on the Arctic region. It highlights the rapid rate of warming in the Arctic, which is causing the ice to melt at an alarming rate. The article also discusses the impact of this warming on the region's wildlife and indigenous communities. Finally, it outlines potential solutions to mitigate the effects of climate change in the Arctic.",
        "ideal_response": "The article discusses the impact of climate change on the Arctic region, highlighting the rapid rate of warming and the resulting ice melt. It also discusses the impact on wildlife and indigenous communities, and outlines potential solutions to mitigate the effects of climate change in the Arctic.",
        "category": "Summarization"
    },
    {
        "prompt": "What do you think about the recent developments in AI?",
        "ideal_response": "Recent developments in AI have been significant, with advancements in natural language processing, computer vision, and machine learning. These developments have the potential to revolutionize various industries and improve the way we live and work.",
        "category": "Generation"
    },
    {
        "prompt": "What is common between the colors red, yellow, and blue?",
        "ideal_response": "The colors red, yellow, and blue are all primary colors. This means that they cannot be created by mixing other colors together, and they are used as the basis for creating all other colors.",
        "category": "Closed QA"
    },
    {
        "prompt": "What are some favorite foods in Italy?",
        "ideal_response": "Some favorite foods in Italy include pasta, pizza, gelato, and risotto. Italian cuisine is known for its use of fresh, high-quality ingredients and simple yet flavorful dishes.",
        "category": "Open QA"
    },
    {
        "prompt": "What are the main causes of deforestation?",
        "ideal_response": "The main causes of deforestation include agricultural expansion, logging, and infrastructure development. These activities result in the clearing of forests, which has a significant impact on the environment and biodiversity.",
        "category": "Closed QA"
    },
    {
        "prompt": "What are some common symptoms of COVID-19?",
        "ideal_response": "Some common symptoms of COVID-19 include fever, cough, and shortness of breath. Other symptoms may include fatigue, body aches, and loss of taste or smell.",
        "category": "Classification"
    },
    {
        "prompt": "What are some ways to reduce plastic waste?",
        "ideal_response": "Some ways to reduce plastic waste include using reusable bags and containers, recycling plastic products, and avoiding single-use plastics. By reducing plastic waste, we can help protect the environment and reduce the impact of plastic pollution.",
        "category": "Generation"
    },
]

# Write the dataset to disk.
my_prompt_dataset_filepath = 'my-prompt-dataset.json'
with open(my_prompt_dataset_filepath, "w") as json_file:
    json.dump(my_prompt_dataset, json_file, indent=4)
```
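NeMo Evaluator also accepts `jsonl` and `csv` datasets. As a minimal sketch, the same records can be written in JSON Lines form, one JSON object per line:

```python
# Write the same dataset as JSON Lines: one JSON object per line.
with open("my-prompt-dataset.jsonl", "w") as jsonl_file:
    for entry in my_prompt_dataset:
        jsonl_file.write(json.dumps(entry) + "\n")
```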
(3) Create a dataset repository#
The next step is to create a repository in NeMo Data Store where the dataset will be stored. Refer to the NeMo Data Store documentation (link) for how to set up NeMo Data Store, and use the hostname of your installation as the `<DATASTORE_ENDPOINT_HOSTNAME>` value below.

To create a dataset repository, use the Hugging Face Hub API client as shown in the following code.
Important
Before you run the following code, replace `<DATASTORE_ENDPOINT_HOSTNAME>`, `<Dataset Namespace>/<Dataset Name>`, and `<path_to_my-prompt-dataset_folder_path>` with your own values. Ensure that the `folder_path` contains only the files you want included in the dataset repository.
```python
from huggingface_hub import HfApi  # huggingface_hub version >= 0.26.2

DATASTORE_HOSTNAME = "<DATASTORE_ENDPOINT_HOSTNAME>"  # Update this before you run the code

# Create the Hugging Face API client with the upload token and the Data Store endpoint URL
api = HfApi(endpoint=f"http://{DATASTORE_HOSTNAME}/v1/hf", token="")

repo_type = "dataset"
repo_id = "datasets/<Dataset Namespace>/<Dataset Name>"  # Update this before you run the code
folder_path = "<path_to_my-prompt-dataset_folder_path>"  # Update this before you run the code

# Create the repo
url = api.create_repo(
    repo_id=repo_id,
    repo_type=repo_type,
)
```
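Optionally, you can verify that the repository was created. `repo_exists` is a standard Hugging Face Hub client method; whether your Data Store deployment supports this particular route is an assumption worth checking:

```python
# Optional sanity check: confirm the repository now exists.
# Assumes the Data Store endpoint supports this Hugging Face Hub route.
print(api.repo_exists(repo_id=repo_id, repo_type=repo_type))
```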
(4) Upload the dataset#
You need to upload the dataset before you can use it. Use the Hugging Face Hub API as shown in the following code.
```python
# Upload the dataset folder to the Data Store.
upload_url = api.upload_folder(
    repo_id=repo_id,
    folder_path=folder_path,
    repo_type=repo_type,
)
print(f"Uploaded File to {upload_url}")
```
If the upload succeeds, the URL of the uploaded folder is returned. If something goes wrong during the upload step, the Hugging Face Hub API raises an error detailing what went wrong.
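If you prefer to handle upload failures explicitly instead of letting the exception propagate, one option (a sketch, not a requirement) is to catch the client's HTTP error type:

```python
from huggingface_hub.utils import HfHubHTTPError

try:
    upload_url = api.upload_folder(
        repo_id=repo_id,
        folder_path=folder_path,
        repo_type=repo_type,
    )
    print(f"Uploaded File to {upload_url}")
except HfHubHTTPError as err:
    # The raised error carries the server's response detailing what went wrong.
    print(f"Upload failed: {err}")
```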
Custom Data for Similarity Metrics#
Input file#
The `input.json` file contains a list of input data dictionaries (key/value pairs).

An input data dictionary can have four fields:

- `prompt` — A string supplied to the model for inference.
- `ideal_response` — A string representing the ideal ground truth response for this prompt.
- `category` — A string parameter to capture category metadata for this input data dictionary.
- `source` — Optional. A string parameter to capture source metadata for this input data dictionary.
Sample `input_file` with two entries:
```json
[
    {
        "prompt": "prompt 1",
        "ideal_response": "ideal response 1",
        "category": "",
        "source": ""
    },
    {
        "prompt": "prompt 2",
        "ideal_response": "ideal response 2",
        "category": "",
        "source": ""
    }
]
```
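Before uploading, a quick sanity check like the following illustrative helper (not part of NeMo Evaluator) can catch entries with missing fields; adjust the required set if your evaluation only needs `prompt`:

```python
import json

REQUIRED_FIELDS = {"prompt", "ideal_response", "category"}

def validate_input_file(path: str) -> None:
    """Illustrative check that every entry carries the expected fields."""
    with open(path) as f:
        entries = json.load(f)
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"Entry {i} is missing fields: {sorted(missing)}")

validate_input_file("input.json")
```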
Output file#
The `output.json` file contains a list of output data dictionaries.

An output data dictionary has three fields:

- `input` — The input data dictionary, whose structure is described previously.
- `response` — The response/prediction string generated by the LLM, corresponding to the `input`.
- `llm_name` — A string containing the name of the LLM that generated the `response`. This is the same as the model name provided while making a launch evaluation API call.
Sample `output_file` with two entries:
```json
[
    {
        "input": {
            "prompt": "prompt 1",
            "ideal_response": "response 1",
            "category": "",
            "source": ""
        },
        "response": "generated response 1",
        "llm_name": "llm name"
    },
    {
        "input": {
            "prompt": "prompt 2",
            "ideal_response": "response 2",
            "category": "",
            "source": ""
        },
        "response": "generated response 2",
        "llm_name": "llm name"
    }
]
```
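If you assemble the output file yourself, for example from responses generated offline, a minimal sketch could pair each input entry with its generated response. The `generate` function here is a placeholder for however you run inference:

```python
import json

def generate(prompt: str) -> str:
    # Placeholder: call your model here.
    raise NotImplementedError

with open("input.json") as f:
    inputs = json.load(f)

outputs = [
    {
        "input": entry,
        "response": generate(entry["prompt"]),
        # Use the same model name provided in the launch evaluation API call.
        "llm_name": "llm name",
    }
    for entry in inputs
]

with open("output.json", "w") as f:
    json.dump(outputs, f, indent=4)
```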
Custom Data for Retriever and RAG#
For Retriever custom datasets, the BEIR and SQuAD formats are supported.

For RAG pipelines that include a retrieval pipeline, the BEIR and SQuAD formats are supported. For RAG pipelines with pre-generated retrieved documents, the Ragas format is supported:
- Answer Generation + Answer Evaluation — The `retriever` pipeline is replaced by a `cached_outputs` field that contains pre-generated retrieved documents. The `cached_outputs` field specifies a file that adheres to the Ragas dataset format.
- Answer Evaluation — The `rag` pipeline is replaced by a `cached_outputs` field that contains pre-generated retrieved documents and pre-generated answers. The `cached_outputs` field specifies a file that adheres to the Ragas dataset format.
BEIR#
For BEIR, make sure the dataset is in the following format:
- corpus file — A `.jsonl` (JSON Lines) file that contains a list of dictionaries, each with three fields: `_id` with a unique document identifier, `title` with the document title (optional), and `text` with a document paragraph or passage. For example:

  ```json
  {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
  ```

- queries file — A `.jsonl` (JSON Lines) file that contains a list of dictionaries, each with two fields: `_id` with a unique query identifier and `text` with the query text. For example:

  ```json
  {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
  ```

- qrels file — A `.tsv` (tab-separated) file that contains three columns: `query-id`, `corpus-id`, and `score`, in this order. Keep the first row as a header. For example:

  ```
  query-id	corpus-id	score
  q1	doc1	1
  ```
SQuAD#
For SQuAD, make sure the dataset is a JSON file formatted as follows:
Note

The `id` field should be of type string.
```json
{
    "data": [
        {
            "paragraphs": [
                {
                    "context": "my context",
                    "document_id": "my id",
                    "qas": [
                        {
                            "question": "my question",
                            "id": "my id",
                            "answers": [
                                {"text": "my answer"}
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
```
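As a sketch, the same structure can be assembled and written from Python; note that the `id` value is a string:

```python
import json

squad_data = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "my context",
                    "document_id": "my id",
                    "qas": [
                        {
                            "question": "my question",
                            "id": "my id",  # Must be a string
                            "answers": [{"text": "my answer"}],
                        }
                    ],
                }
            ]
        }
    ]
}

# The file name here is an arbitrary choice for illustration.
with open("squad_dataset.json", "w") as f:
    json.dump(squad_data, f, indent=4)
```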
Ragas#
For Ragas, make sure the dataset is a JSON file in the following format. This dataset format is called Ragas because it is the format used by the `ragas` library. The `#` lines in the example are explanatory annotations, not valid JSON.
```
{
    "question": ["question #1", "question #2"],

    # Optional. Used for Answer Generation and Answer Evaluation (for some specific RAG metrics)
    "contexts": [
        ["context #1 for question #1", "context #2 for question #1"],
        ["context #1 for question #2", "context #2 for question #2"]
    ],

    # Optional. Used for Answer Evaluation (for some specific RAG metrics)
    "answer": ["predicted answer for question #1", "predicted answer for question #2"],

    # Optional. Used for Answer Evaluation (for some specific RAG metrics)
    "ground_truths": ["ground truth answer for question #1", "ground truth answer for question #2"]
}
```