# Customize the Evaluation Loop
In this tutorial, you learn how to fine-tune the Llama 3.1 8B Instruct model on a sample dataset, and how to measure the improvement by running evaluation jobs before and after customization.
Note
The time to complete this tutorial is approximately 45 minutes. In this tutorial, you run a customization job and two evaluation jobs. Customization job duration increases with model parameter count and dataset size. For more information about evaluation job duration, refer to Expected Evaluation Duration.
## Prerequisites
Before you begin, complete the following prerequisites:
## Upload Datasets
To fine-tune a model, you first need a dataset split into three parts: one for training, one for testing, and one for validation. All data must be in JSONL format, where each line is a JSON object with `prompt` and `completion` fields whose values fit the task you are training the model to perform. An example line for a question-answering dataset could be:
{"prompt": "What is the distance from the earth to the sun? A: ", "completion": "93 millions miles"}
The following steps show how to use the Hugging Face APIs integrated into NeMo Data Store to download the demo datasets and upload them to the `default` namespace. The demo dataset is a subset of SQuAD, prepared for question answering.
Set the following environment variables:
```bash
export HF_ENDPOINT="http://data-store.test/v1/hf"
export HF_TOKEN="dummy-unused-value"
```
Create a dataset repository:
```bash
huggingface-cli repo create sample-basic-test --type dataset
```
Create a new folder `~/tmp/sample_test_data`:

```bash
mkdir -p ~/tmp/sample_test_data
```
Download the following sample datasets that you’ll use for fine-tuning and evaluating the Llama 3.1 8B Instruct model. Ensure that the datasets are accessible in the respective directories outlined as follows:
- Save the training dataset (618.61 KiB) in the local `~/tmp/sample_test_data/training` directory.
- Save the validation dataset (75.69 KiB) in the local `~/tmp/sample_test_data/validation` directory.
- Save the test dataset (80.68 KiB) in the local `~/tmp/sample_test_data/testing` directory.
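If you want to create all three directories before saving the downloads, shell brace expansion does it in one command:

```bash
# Create the training/validation/testing layout expected by the services
mkdir -p ~/tmp/sample_test_data/{training,validation,testing}
```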
The following is the folder structure under `~/tmp/sample_test_data`:

```
├── testing
│   └── testing.jsonl
├── training
│   └── training.jsonl
└── validation
    └── validation.jsonl
```
Note

NeMo Customizer and NeMo Evaluator expect testing files to be in the `testing` folder, training files to be in the `training` folder, and validation files to be in the `validation` folder. Make sure that you put the files in the right places.

Upload the datasets:
Note

Make sure you point at folders that only contain the `.jsonl` files you want to use in the dataset. If your dataset folder is large, you may have to upload the files individually.

```bash
huggingface-cli upload --repo-type dataset default/sample-basic-test ~/tmp/sample_test_data
```
Example Output

```
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
Start hashing 3 files.
Finished hashing 3 files.
training.jsonl: 100%|████████████| 618.6k/618.6k [00:00<00:00, 131kB/s]
testing.jsonl: 100%|████████████| 82.6k/82.6k [00:00<00:00, 141kB/s]
validation.jsonl: 100%|████████████| 77.5k/77.5k [00:00<00:00, 126kB/s]
Upload 3 LFS files: 100%|████████████| 3/3 [00:00<00:00, 2.00it/s]
https://nemo-datastore-endpoint/v1/hf/datasets/sample-basic-test/tree/main/.
```
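If you do need to upload files individually, a loop such as the following sketch works; it assumes the `training`/`validation`/`testing` layout shown above:

```bash
# Upload each split's file on its own, preserving the folder layout in the repository
for split in training validation testing; do
  huggingface-cli upload --repo-type dataset default/sample-basic-test \
    ~/tmp/sample_test_data/${split}/${split}.jsonl ${split}/${split}.jsonl
done
```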
To use a dataset for operations such as evaluations and customizations, you need to register it using the `/v1/datasets` endpoint. Registering the dataset enables you to refer to it by its namespace and name afterward.

Register the dataset created in the previous step. Format the `files_url` field as `hf://datasets/{namespace}/{dataset-name}`.

```bash
curl -X POST "http://nemo.test/v1/datasets" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "sample-basic-test",
    "namespace": "default",
    "description": "This is an example of a dataset",
    "files_url": "hf://datasets/default/sample-basic-test",
    "project": "sample_project"
  }' | jq
```
Example Output

```json
{
  "schema_version": "1.0",
  "id": "dataset-9C5GLZ6i9ZgW3cZ9RHP9ej",
  "description": "This is an example of a dataset",
  "type_prefix": null,
  "namespace": "default",
  "project": "sample_project",
  "created_at": "2024-12-09T01:01:28.54288",
  "updated_at": "2024-12-09T01:01:28.606851",
  "custom_fields": {},
  "ownership": null,
  "name": "sample-basic-test",
  "version_id": "main",
  "version_tags": [],
  "format": null,
  "files_url": "hf://datasets/default/sample-basic-test"
}
```
## Evaluate the Llama 3.1 8B Instruct Model
Run a custom evaluation job to evaluate a model on custom datasets by comparing the LLM-generated response with a ground-truth response. A custom evaluation with `bleu` and `string-check` metrics is ideal for cases where the LLM generations are not expected to be highly creative.
For more information, refer to Custom Evaluations.
To evaluate the Llama 3.1 8B Instruct model using the dataset that was uploaded in the Upload Datasets section, complete the following:
Create a target by running the following command:
```bash
curl -X POST \
  "http://nemo.test/v1/evaluation/targets" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "http://nemo-nim-proxy:8000/v1/completions",
        "model_id": "meta/llama-3.1-8b-instruct"
      }
    }
  }' | jq
```
This returns a target ID. To see a sample response, refer to Create Target Response.
Set an environment variable for the target:
```bash
export EVALUATOR_TARGET=default/<target-id>
```
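If you prefer to script this step, you can create the target and capture its ID in one pass instead of copying it by hand. A sketch, assuming the creation response carries a top-level `id` field as shown in Create Target Response:

```bash
# Create the target and derive EVALUATOR_TARGET from the response's "id" field
TARGET_ID=$(curl -s -X POST "http://nemo.test/v1/evaluation/targets" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "http://nemo-nim-proxy:8000/v1/completions",
        "model_id": "meta/llama-3.1-8b-instruct"
      }
    }
  }' | jq -r '.id')
export EVALUATOR_TARGET="default/${TARGET_ID}"
```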
Create a `custom` evaluation config:

```bash
curl -X POST \
  "http://nemo.test/v1/evaluation/configs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "custom",
    "params": { "parallelism": 4 },
    "tasks": {
      "my-custom-task": {
        "type": "completion",
        "params": {
          "template": {
            "prompt": "{{prompt}}",
            "max_tokens": 20,
            "temperature": 0.7,
            "top_p": 0.9
          }
        },
        "dataset": {
          "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
        },
        "metrics": {
          "bleu": {
            "type": "bleu",
            "params": { "references": ["{{ideal_response}}"] }
          },
          "string-check": {
            "type": "string-check",
            "params": { "check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"] }
          }
        }
      }
    }
  }' | jq
```
This command returns the configuration ID. To see a sample response, refer to Create Config Response.
Store the configuration information as an environment variable:
```bash
export EVALUATOR_CONFIG=default/<config-id>
```
Submit an evaluation job using the target and the configuration environment variables:
```bash
curl -X POST \
  "http://nemo.test/v1/evaluation/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "{
    \"target\": \"${EVALUATOR_TARGET}\",
    \"config\": \"${EVALUATOR_CONFIG}\"
  }" | jq
```
To see a sample response, refer to Create Job Response.
Copy and save the evaluation job ID from the response.
```bash
export EVALUATION_JOB_ID=<evaluation-job-id>
```
Check the status of the evaluation job using the following command:
```bash
curl -X GET \
  "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}" \
  -H 'accept: application/json' | jq
```
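Rather than re-running the command manually, you can poll until the job reaches a terminal state. A minimal sketch, assuming the response carries a top-level `status` field and that `failed` is a possible terminal value alongside `completed`:

```bash
# Poll the evaluation job every 30 seconds until it completes or fails
while true; do
  STATUS=$(curl -s "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}" \
    -H 'accept: application/json' | jq -r '.status')
  echo "status: ${STATUS}"
  if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
    break
  fi
  sleep 30
done
```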
After the job completes, you can see the evaluation results by running the following command:
```bash
curl -X GET \
  "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}/results" \
  -H 'accept: application/json' | jq
```
Example Results

```json
{
  "created_at": "2025-03-19T17:21:23.016944",
  "updated_at": "2025-03-19T17:21:23.016946",
  "id": "evaluation_result-4hDyVY4XwLnDDp645yVgu1",
  "job": "eval-EHBprrm8kMYQgjPZX31GSs",
  "tasks": {
    "my-custom-task": {
      "metrics": {
        "bleu": {
          "scores": {
            "sentence": {
              "value": 3.316167505330637,
              "stats": {
                "count": 90,
                "sum": 298.45507547975734,
                "mean": 3.316167505330637
              }
            },
            "corpus": {
              "value": 2.1254915037706192
            }
          }
        },
        "string-check": {
          "scores": {
            "string-check": {
              "value": 0,
              "stats": {
                "count": 90,
                "sum": 0,
                "mean": 0
              }
            }
          }
        }
      }
    }
  },
  "groups": {},
  "namespace": "default",
  "custom_fields": {}
}
```
For more information about custom evaluation jobs, refer to Custom Evaluations.
## Customize the Llama 3.1 8B Instruct Model
Use the following procedure to fine-tune the Llama 3.1 8B Instruct model with the uploaded datasets.
Run the customization command:
```bash
curl -X POST \
  "http://nemo.test/v1/customization/jobs" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {
      "name": "sample-basic-test",
      "namespace": "default"
    },
    "hyperparameters": {
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 3,
      "batch_size": 16,
      "learning_rate": 0.0001,
      "lora": { "adapter_dim": 16 }
    },
    "project": "test-project",
    "ownership": {
      "created_by": "me",
      "access_policies": { "arbitrary": "json" }
    },
    "output_model": "default/test-example-model@v1"
  }' | jq
```
Example Output

```json
{
  "id": "cust-JGTaMbJMdqjJU8WbQdN9Q2",
  "created_at": "2024-12-09T04:06:28.542884",
  "updated_at": "2024-12-09T04:06:28.542884",
  "config": {
    "schema_version": "1.0",
    "id": "af783f5b-d985-4e5b-bbb7-f9eec39cc0b1",
    "created_at": "2024-12-09T04:06:28.542657",
    "updated_at": "2024-12-09T04:06:28.569837",
    "custom_fields": {},
    "name": "meta/llama-3_1-8b-instruct",
    "base_model": "meta/llama-3_1-8b-instruct",
    "model_path": "llama-3_1-8b-instruct",
    "training_types": [],
    "finetuning_types": ["lora"],
    "precision": "bf16",
    "num_gpus": 4,
    "num_nodes": 1,
    "micro_batch_size": 1,
    "tensor_parallel_size": 1,
    "max_seq_length": 4096
  },
  "dataset": {
    "schema_version": "1.0",
    "id": "dataset-XU4pvGzr5tvawnbVxeJMTb",
    "created_at": "2024-12-09T04:06:28.542657",
    "updated_at": "2024-12-09T04:06:28.542660",
    "custom_fields": {},
    "name": "default/sample-basic-test",
    "version_id": "main",
    "version_tags": []
  },
  "hyperparameters": {
    "finetuning_type": "lora",
    "training_type": "sft",
    "batch_size": 16,
    "epochs": 10,
    "learning_rate": 0.0001,
    "lora": {
      "adapter_dim": 16
    }
  },
  "output_model": "default/test-example-model@v1",
  "status": "created",
  "project": "test-project",
  "custom_fields": {},
  "ownership": {
    "created_by": "me",
    "access_policies": {
      "arbitrary": "json"
    }
  }
}
```
Store the customization job ID as an environment variable. Replace the placeholder with the `id` value from the JSON response (`cust-JGTaMbJMdqjJU8WbQdN9Q2` in the example above):

```bash
export CUST_ID=<customization-job-id>
```
Check the status of the customization job. Use the following command to verify that the job has completed:
curl "http://nemo.test/v1/customization/jobs/${CUST_ID}/status" | jq
If the `status` field changes to `completed`, the job has finished creating and uploading the `output_model`.

Example Successful Response

```json
{
  "created_at": "2024-12-09T04:06:28.580220",
  "updated_at": "2024-12-09T04:21:19.852832",
  "status": "completed",
  "steps_completed": 1210,
  "epochs_completed": 10,
  "percentage_done": 100.0,
  "best_epoch": 3,
  "train_loss": 1.718016266822815,
  "val_loss": 1.8661999702453613
}
```
Test the fine-tuned model by sending a prompt to the `output_model`:

```bash
curl -X POST "http://nim.test/v1/completions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default/test-example-model@v1",
    "prompt": "When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer: ",
    "max_tokens": 128
  }' | jq
```
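The endpoint follows the OpenAI-compatible completions schema that NIM exposes, so you can reduce the response to just the generated text with a `jq` filter. A sketch with a shortened prompt for brevity:

```bash
# Same call as above, printing only the generated completion text
curl -s -X POST "http://nim.test/v1/completions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default/test-example-model@v1",
    "prompt": "When is the upcoming GTC event? Answer: ",
    "max_tokens": 128
  }' | jq -r '.choices[0].text'
```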
## Run Another Custom Evaluation Job on the Fine-Tuned Model
Run another evaluation job with the same configuration that you used to evaluate the Llama 3.1 8B Instruct model. Verify that the evaluation metrics have changed after the customization.
To run a custom evaluation job on the fine-tuned model, do the following:
Create a new target by running the following command:
```bash
curl -X POST \
  "http://nemo.test/v1/evaluation/targets" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "http://nemo-nim-proxy:8000/v1/completions",
        "model_id": "default/test-example-model@v1"
      }
    }
  }' | jq
```
To see a sample response, refer to Create Target Response.
Set up an environment variable for the newly created target from the previous step.
```bash
export CUSTOMIZED_MODEL_TARGET=default/<new-target-id>
```
Submit an evaluation job with the new target and the existing configuration from the section Evaluate the Llama 3.1 8B Instruct Model.
```bash
curl -X POST \
  "http://nemo.test/v1/evaluation/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "{
    \"target\": \"${CUSTOMIZED_MODEL_TARGET}\",
    \"config\": \"${EVALUATOR_CONFIG}\"
  }" | jq
```
To see a sample response, refer to Create Job Response.
Create an environment variable for the newly created evaluation job ID.
```bash
export CUSTOMIZED_MODEL_EVALUATION_JOB_ID=<new-evaluation-job-id>
```
Check the status of the evaluation job by using the following command:
```bash
curl -X GET \
  "http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}" \
  -H 'accept: application/json' | jq
```
While the job is running, the status is `running`. After the job completes, the status is `completed`.

After the job completes, you can see the results of the evaluation by using the following command:
```bash
curl -X GET \
  "http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}/results" \
  -H 'accept: application/json' | jq
```
Example Results

```json
{
  "created_at": "2025-03-19T17:46:34.216345",
  "updated_at": "2025-03-19T17:46:34.216347",
  "id": "evaluation_result-LDRwuXthNK8XtjLwNtL2ZG",
  "job": "eval-RNLGdCpnWc7XYGkXER2sAv",
  "tasks": {
    "my-custom-task": {
      "metrics": {
        "bleu": {
          "scores": {
            "sentence": {
              "value": 58.27715543476614,
              "stats": {
                "count": 90,
                "sum": 5244.943989128952,
                "mean": 58.27715543476614
              }
            },
            "corpus": {
              "value": 30.477136043689168
            }
          }
        },
        "string-check": {
          "scores": {
            "string-check": {
              "value": 0.4666666666666667,
              "stats": {
                "count": 90,
                "sum": 42,
                "mean": 0.4666666666666667
              }
            }
          }
        }
      }
    }
  },
  "groups": {},
  "namespace": "default",
  "custom_fields": {}
}
```
Check for improvement in the scores. For example, the `corpus` score for `bleu` improves from ~2.12 to ~30.47, and the `string-check` score improves from 0 to ~0.47.
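To compare the two runs side by side, you can loop over both job IDs and print the headline scores; the `jq` paths match the results payloads shown above:

```bash
# Print corpus BLEU and string-check accuracy for the base and fine-tuned evaluation jobs
for JOB_ID in "${EVALUATION_JOB_ID}" "${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}"; do
  curl -s "http://nemo.test/v1/evaluation/jobs/${JOB_ID}/results" \
    -H 'accept: application/json' \
    | jq '{job,
           corpus_bleu: .tasks["my-custom-task"].metrics.bleu.scores.corpus.value,
           string_check: .tasks["my-custom-task"].metrics["string-check"].scores["string-check"].value}'
done
```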