Customize and Evaluate Large Language Models#
In this tutorial, you learn how to fine-tune the Llama 3.1 8B Instruct model on a sample dataset and evaluate the model performance before and after customization.
Note
The time to complete this tutorial is approximately 45 minutes. In this tutorial, you run a customization job and two evaluation jobs. Customization job duration increases with model parameter count and dataset size. For more information about evaluation job duration, refer to Expected Evaluation Duration.
Prerequisites#
Before you begin, complete the following prerequisites:
Upload Datasets#
To fine-tune a model, you need to split your dataset into three parts: one for training, one for testing, and one for validation.
All dataset files must be in JSONL format. The dataset in this example is in the prompt-completion format, where each line of the file must contain `prompt` and `completion` fields with values tailored to the specific task for which you train the model. The following is an example line of a dataset JSONL file designed for question-answering tasks:
{"prompt": "What is the distance from the earth to the sun? A: ", "completion": "93 millions miles"}
The following steps show how to use the Hugging Face APIs integrated in NeMo Data Store, download the sample datasets prepared for this tutorial, and upload them to NeMo Data Store under the default namespace. The demo dataset is a subset of SQuAD (the Stanford Question Answering Dataset) for question answering.
Set the following environment variables:
```bash
export HF_ENDPOINT="http://data-store.test/v1/hf"
export HF_TOKEN="dummy-unused-value"
```
Create a dataset repository:
```bash
huggingface-cli repo create sample-basic-test --type dataset
```
Create a new folder `~/tmp/sample_test_data`:

```bash
mkdir -p ~/tmp/sample_test_data
```
Download the following sample datasets that you’ll use for fine-tuning and evaluating the Llama 3.1 8B Instruct model. Ensure that the datasets are accessible in the respective directories outlined as follows:
- Save the training dataset (618.61 KiB) in the local `~/tmp/sample_test_data/training` directory.
- Save the validation dataset (75.69 KiB) in the local `~/tmp/sample_test_data/validation` directory.
- Save the test dataset (80.68 KiB) in the local `~/tmp/sample_test_data/testing` directory.
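A minimal sketch for creating the expected local directory layout before saving the downloaded files (paths as used throughout this tutorial):

```python
from pathlib import Path

# Create ~/tmp/sample_test_data/{training,validation,testing}
base = Path.home() / "tmp" / "sample_test_data"
for split in ("training", "validation", "testing"):
    (base / split).mkdir(parents=True, exist_ok=True)
```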
The following is the folder structure under `~/tmp/sample_test_data`:

```
├── testing
│   └── testing.jsonl
├── training
│   └── training.jsonl
└── validation
    └── validation.jsonl
```
Note
NeMo Customizer and NeMo Evaluator expect testing files to be in the `testing` folder, training files to be in the `training` folder, and validation files to be in the `validation` folder. Make sure that you put the files in the right places.

Upload the dataset files. The following command uploads the dataset files to the `sample-basic-test` repository in the `default` namespace. The `default` namespace here is a namespace in the NeMo microservices platform managed by the NeMo Entity Store microservice and not associated with Kubernetes.

Note

Make sure you point at folders that only contain the `.jsonl` files you want to use as the dataset files. If your dataset folder is large, you might need to upload the files individually.

```bash
huggingface-cli upload --repo-type dataset \
  default/sample-basic-test \
  ~/tmp/sample_test_data
```
Example Output
```
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
Start hashing 3 files.
Finished hashing 3 files.
training.jsonl: 100%|████████████████████████████████| 618.6k/618.6k [00:00<00:00, 131kB/s]
testing.jsonl: 100%|█████████████████████████████████| 82.6k/82.6k [00:00<00:00, 141kB/s]
validation.jsonl: 100%|██████████████████████████████| 77.5k/77.5k [00:00<00:00, 126kB/s]
Upload 3 LFS files: 100%|████████████████████████████| 3/3 [00:00<00:00, 2.00it/s]
https://nemo-datastore-endpoint/v1/hf/datasets/sample-basic-test/tree/main/.
```
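If you prefer to script the upload rather than use the CLI, the `huggingface_hub` Python API can target the same NeMo Data Store endpoint. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; as above, the token value is not used by Data Store:

```python
import os
from huggingface_hub import HfApi

# Point the Hugging Face client at NeMo Data Store instead of huggingface.co
api = HfApi(endpoint="http://data-store.test/v1/hf", token="dummy-unused-value")

# Upload the local dataset folder to the default/sample-basic-test dataset repository
api.upload_folder(
    repo_id="default/sample-basic-test",
    repo_type="dataset",
    folder_path=os.path.expanduser("~/tmp/sample_test_data"),
)
```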
To use a dataset for operations such as evaluations and customizations, you need to register the dataset using the `/v1/datasets` endpoint. Registering the dataset enables you to refer to it by its namespace and name afterward.

Register the dataset created in the previous step. Format the `files_url` field as `hf://datasets/{namespace}/{dataset-name}`.

```python
import time

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http://nemo.test",
    inference_base_url="http://nim.test",
)

dataset = client.datasets.create(
    name="sample-basic-test",
    namespace="default",
    description="This is an example of a dataset",
    files_url="hf://datasets/default/sample-basic-test",
    project="sample_project"
)
print(dataset)
```
```bash
curl -X POST "http://nemo.test/v1/datasets" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "sample-basic-test",
    "namespace": "default",
    "description": "This is an example of a dataset",
    "files_url": "hf://datasets/default/sample-basic-test",
    "project": "sample_project"
  }' | jq
```
Example Output
{ "schema_version": "1.0", "id": "dataset-9C5GLZ6i9ZgW3cZ9RHP9ej", "description": "This is a example of a dataset", "type_prefix": null, "namespace": "default", "project": "sample_project", "created_at": "2024-12-09T01:01:28.54288", "updated_at": "2024-12-09T01:01:28.606851", "custom_fields": {}, "ownership": null, "name": "sample-basic-test", "version_id": "main", "version_tags": [], "format": null, "files_url": "hf://datasets/default/sample-basic-test" }
Evaluate the Llama 3.1 8B Instruct Model Before Customization#
Run a custom evaluation job to evaluate a model on custom datasets by comparing the LLM-generated response with a ground truth response. The custom evaluation with `bleu` and `string-check` metrics is ideal for cases where the LLM generations are not expected to be highly creative.
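To make it concrete what these two metrics measure, the following sketch computes a string check (exact match after trimming whitespace) and a sentence-level BLEU score locally for a single prediction. This is illustrative only; NeMo Evaluator computes these scores server-side and may use a different BLEU implementation. The NLTK usage below is an assumption for demonstration:

```python
from nltk.translate.bleu_score import sentence_bleu  # pip install nltk

reference = "The distance from the earth to the sun is about 93 million miles"
prediction = "The distance from the earth to the sun is roughly 93 million miles "

# string-check: does the trimmed generation exactly equal the trimmed reference?
string_check = 1 if prediction.strip() == reference.strip() else 0

# sentence-level BLEU: n-gram overlap between the generation and the reference
bleu = sentence_bleu([reference.split()], prediction.split())

print(f"string-check: {string_check}, sentence BLEU: {bleu:.3f}")
```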
To evaluate the Llama 3.1 8B Instruct model using the dataset that was uploaded in the Upload Datasets section, complete the following:
Create an evaluation job with the target and the configuration defined. `target` is the model under evaluation, and `config` is the configuration for the evaluation job. The dataset that was uploaded in the previous section is specified within the configuration at `config.tasks.<task-name>.dataset.files_url`. The dataset URL is expected to be in the Hugging Face format `hf://datasets/<dataset-namespace>/<dataset-name>/<relative-path-to-file>`.

```python
job = client.evaluation.jobs.create(
    target={
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": "http://nemo-nim-proxy:8000/v1/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
            }
        }
    },
    config={
        "type": "custom",
        "params": {"parallelism": 4},
        "tasks": {
            "my-custom-task": {
                "type": "completion",
                "params": {
                    "template": {
                        "prompt": "{{prompt}}",
                        "max_tokens": 20,
                        "temperature": 0.7,
                        "top_p": 0.9
                    }
                },
                "dataset": {
                    "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                    "bleu": {
                        "type": "bleu",
                        "params": {"references": ["{{ideal_response}}"]}
                    },
                    "string-check": {
                        "type": "string-check",
                        "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                    }
                }
            }
        }
    }
)
print(job)
```
For more information about the parameters in the code above, refer to the following references:
```bash
curl -X POST \
  "http://nemo.test/v1/evaluation/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "http://nemo-nim-proxy:8000/v1/completions",
          "model_id": "meta/llama-3.1-8b-instruct"
        }
      }
    },
    "config": {
      "type": "custom",
      "params": {"parallelism": 4},
      "tasks": {
        "my-custom-task": {
          "type": "completion",
          "params": {
            "template": {
              "prompt": "{{prompt}}",
              "max_tokens": 20,
              "temperature": 0.7,
              "top_p": 0.9
            }
          },
          "dataset": {
            "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
          },
          "metrics": {
            "bleu": {
              "type": "bleu",
              "params": {"references": ["{{ideal_response}}"]}
            },
            "string-check": {
              "type": "string-check",
              "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
            }
          }
        }
      }
    }
  }' | jq
```
To see a sample response, refer to To Create an Evaluation Job.
Save the job ID:
```python
# The job ID is available from the job object created in the previous step
evaluation_job_id = job.id
```
```bash
export EVALUATION_JOB_ID=<evaluation-job-id>
```
Check the status of the evaluation job using the following command:
```python
# Using the job ID from the previous step
job_status = client.evaluation.jobs.retrieve(evaluation_job_id)
print(job_status)
```
```bash
curl -X GET \
  "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}" \
  -H 'accept: application/json' | jq
```
After the job completes, you can see the evaluation results by running the following command:
```python
# Using the job ID from the previous step
results = client.evaluation.jobs.results(evaluation_job_id)
print(results)
```
```bash
curl -X GET \
  "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}/results" \
  -H 'accept: application/json' | jq
```
Evaluation Result Before Customization
{ "created_at": "2025-03-19T17:21:23.016944", "updated_at": "2025-03-19T17:21:23.016946", "id": "evaluation_result-4hDyVY4XwLnDDp645yVgu1", "job": "eval-EHBprrm8kMYQgjPZX31GSs", "tasks": { "my-custom-task": { "metrics": { "bleu": { "scores": { "sentence": { "value": 3.316167505330637, "stats": { "count": 90, "sum": 298.45507547975734, "mean": 3.316167505330637 } }, "corpus": { "value": 2.1254915037706192 } } }, "string-check": { "scores": { "string-check": { "value": 0, "stats": { "count": 90, "sum": 0, "mean": 0 } } } } } } }, "groups": {}, "namespace": "default", "custom_fields": {} }
Make note of the corpus-level BLEU and string-check scores for the result. For this example, the corpus-level BLEU score is 2.12 and the string-check score is 0 before customization.
Customize the Llama 3.1 8B Instruct Model#
Use the following procedure to fine-tune the Llama 3.1 8B Instruct model with the uploaded datasets.
Run the following command to start a customization job:
Note
You can check the available customization configs by running `client.customization.configs.list()` in the Python SDK or `GET /v1/customization/configs` in cURL. The customization configs with the `A100` tag are suitable for most GPUs with 80 GB of memory, while those with the `L40` tag are suitable for most GPUs with 40 GB of memory. Choose the appropriate config for your GPU and specify it in the `config` parameter.

```python
customization_job = client.customization.jobs.create(
    config="meta/llama-3.1-8b-instruct@v1.0.0+A100",
    dataset={
        "name": "sample-basic-test",
        "namespace": "default"
    },
    hyperparameters={
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 3,
        "batch_size": 16,
        "learning_rate": 0.0001,
        "lora": {"adapter_dim": 16}
    },
    project="test-project",
    ownership={
        "created_by": "your-username",
        "access_policies": {
            "arbitrary": "json"
        }
    },
    output_model="default/test-example-model@v1"
)
print(customization_job)
```
For more information about the parameters in the code above, refer to Customization Jobs Resource.
```bash
curl -X POST \
  "http://nemo.test/v1/customization/jobs" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "config": "meta/llama-3.1-8b-instruct@v1.0.0+A100",
    "dataset": {
      "name": "sample-basic-test",
      "namespace": "default"
    },
    "hyperparameters": {
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 3,
      "batch_size": 16,
      "learning_rate": 0.0001,
      "lora": {"adapter_dim": 16}
    },
    "project": "test-project",
    "ownership": {
      "created_by": "your-username",
      "access_policies": {
        "arbitrary": "json"
      }
    },
    "output_model": "default/test-example-model@v1"
  }' | jq
```
Example Output
{ "id": "cust-JGTaMbJMdqjJU8WbQdN9Q2", "created_at": "2024-12-09T04:06:28.542884", "updated_at": "2024-12-09T04:06:28.542884", "config": { "schema_version": "1.0", "id": "af783f5b-d985-4e5b-bbb7-f9eec39cc0b1", "created_at": "2024-12-09T04:06:28.542657", "updated_at": "2024-12-09T04:06:28.569837", "custom_fields": {}, "name": "meta/llama-3_1-8b-instruct", "base_model": "meta/llama-3_1-8b-instruct", "model_path": "llama-3_1-8b-instruct", "training_types": [], "finetuning_types": ["lora"], "precision": "bf16", "num_gpus": 4, "num_nodes": 1, "micro_batch_size": 1, "tensor_parallel_size": 1, "max_seq_length": 4096 }, "dataset": { "schema_version": "1.0", "id": "dataset-XU4pvGzr5tvawnbVxeJMTb", "created_at": "2024-12-09T04:06:28.542657", "updated_at": "2024-12-09T04:06:28.542660", "custom_fields": {}, "name": "default/sample-basic-test", "version_id": "main", "version_tags": [] }, "hyperparameters": { "finetuning_type": "lora", "training_type": "sft", "batch_size": 16, "epochs": 10, "learning_rate": 0.0001, "lora": { "adapter_dim": 16 } }, "output_model": "default/test-example-model@v1", "status": "created", "project": "test-project", "custom_fields": {}, "ownership": { "created_by": "me", "access_policies": { "arbitrary": "json" } } }
Save the customization job ID:
```python
# The job ID is available from the customization_job object created in the previous step
cust_id = customization_job.id
```
```bash
export CUST_ID=<customization-job-id>
```
Check the status of the customization job. Use the following command to verify that the job has completed:
```python
# Using the customization job ID from the previous step
job_status = None
while job_status not in ("completed", "failed", "cancelled"):
    status = client.customization.jobs.status.retrieve(cust_id)
    job_status = status.status
    time.sleep(60)
print(job_status)
```
```bash
watch -n 60 "curl 'http://nemo.test/v1/customization/jobs/${CUST_ID}/status' | jq"
```
If the `status` field changes to `completed`, it indicates that the job finished creating and uploading the `output_model`.

Example Successful Response
{ "created_at": "2024-12-09T04:06:28.580220", "updated_at": "2024-12-09T04:21:19.852832", "status": "completed", "steps_completed": 1210, "epochs_completed": 10, "percentage_done": 100.0, "best_epoch": 3, "train_loss": 1.718016266822815, "val_loss": 1.8661999702453613 }
Test the fine-tuned model by sending a prompt to the `output_model`.

```python
completion = client.completions.create(
    model="default/test-example-model@v1",
    prompt="When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer: ",
    max_tokens=128
)
print(completion)
```
curl -X POST "http://nim.test/v1/completions" \ -H 'accept: application/json' \ -H 'Content-Type: application/json' \ -d '{ "model": "default/test-example-model@v1", "prompt": "When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer: ", "max_tokens": 128 }' | jq
Evaluate the Llama 3.1 8B Instruct Model After Customization#
Run another evaluation job with the same configuration that you used for evaluating the Llama 3.1 8B Instruct model before customization. By re-using the same evaluation config, you can compare the evaluation metrics before and after customization.
To run an evaluation job on the fine-tuned model, specify the custom model as the target:
Submit an evaluation job with a new evaluation target but the same evaluation config from the section Evaluate the Llama 3.1 8B Instruct Model Before Customization.
```python
customized_job = client.evaluation.jobs.create(
    target={
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": "http://nemo-nim-proxy:8000/v1/completions",
                "model_id": "default/test-example-model@v1"
            }
        }
    },
    config={
        "type": "custom",
        "params": {"parallelism": 4},
        "tasks": {
            "my-custom-task": {
                "type": "completion",
                "params": {
                    "template": {
                        "prompt": "{{prompt}}",
                        "max_tokens": 20,
                        "temperature": 0.7,
                        "top_p": 0.9
                    }
                },
                "dataset": {
                    "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                    "bleu": {
                        "type": "bleu",
                        "params": {"references": ["{{ideal_response}}"]}
                    },
                    "string-check": {
                        "type": "string-check",
                        "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                    }
                }
            }
        }
    }
)
print(customized_job)
```
For more information about the parameters in the code above, refer to Evaluation Jobs Resource.
```bash
curl -X POST \
  'http://nemo.test/v1/evaluation/jobs' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "http://nemo-nim-proxy:8000/v1/completions",
          "model_id": "default/test-example-model@v1"
        }
      }
    },
    "config": {
      "type": "custom",
      "params": {"parallelism": 4},
      "tasks": {
        "my-custom-task": {
          "type": "completion",
          "params": {
            "template": {
              "prompt": "{{prompt}}",
              "max_tokens": 20,
              "temperature": 0.7,
              "top_p": 0.9
            }
          },
          "dataset": {
            "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
          },
          "metrics": {
            "bleu": {
              "type": "bleu",
              "params": {"references": ["{{ideal_response}}"]}
            },
            "string-check": {
              "type": "string-check",
              "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
            }
          }
        }
      }
    }
  }' | jq
```
To see a sample response, refer to To Create an Evaluation Job.
Save the evaluation job ID:
```python
# The job ID is available from the customized_job object created in the previous step
customized_model_evaluation_job_id = customized_job.id
```
```bash
export CUSTOMIZED_MODEL_EVALUATION_JOB_ID=<new-evaluation-job-id>
```
Check the status of the evaluation job by using the following command:
```python
# Using the job ID from the previous step
job_status = None
while job_status not in ("completed", "failed", "cancelled"):
    job = client.evaluation.jobs.retrieve(customized_model_evaluation_job_id)
    job_status = job.status
    print(job_status)
    time.sleep(60)
```
```bash
watch -n 60 "curl -X GET \
  'http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}' \
  -H 'accept: application/json' | jq"
```
While the job is running, the status is `running`. After the job completes, the status is `completed`.

After the job completes, you can see the results of the evaluation by using the following command:
```python
# Using the job ID from the previous step
results = client.evaluation.jobs.results(customized_model_evaluation_job_id)
print(results)
```
For more information about the parameters in the code above, refer to Evaluation Results Resource.
```bash
curl -X GET \
  "http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}/results" \
  -H 'accept: application/json' | jq
```
Evaluation Result After Customization
{ "created_at": "2025-03-19T17:46:34.216345", "updated_at": "2025-03-19T17:46:34.216347", "id": "evaluation_result-LDRwuXthNK8XtjLwNtL2ZG", "job": "eval-RNLGdCpnWc7XYGkXER2sAv", "tasks": { "my-custom-task": { "metrics": { "bleu": { "scores": { "sentence": { "value": 58.27715543476614, "stats": { "count": 90, "sum": 5244.943989128952, "mean": 58.27715543476614 } }, "corpus": { "value": 30.477136043689168 } } }, "string-check": { "scores": { "string-check": { "value": 0.4666666666666667, "stats": { "count": 90, "sum": 42, "mean": 0.4666666666666667 } } } } } } }, "groups": {}, "namespace": "default", "custom_fields": {} }
Make note of the corpus-level BLEU and string-check scores for the result. For this example, the corpus-level BLEU score is 30.47 and the string-check score is 0.46 after customization.
Conclusion#
Compare the scores from the original model evaluation to those from the evaluation of the fine-tuned model. In the example outputs provided in this tutorial, both the corpus-level BLEU and string-check scores improve.
- The corpus-level BLEU score improves from ~2.12 to ~30.47.
- The string-check score improves from ~0 to ~0.46.
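If you want to compare the two runs programmatically rather than reading the JSON by eye, you can fetch both result payloads from the results endpoint shown earlier and extract the corpus-level BLEU and string-check values. The following is a minimal sketch; the job IDs are placeholders for the IDs you saved during this tutorial:

```python
import requests

def fetch_scores(job_id: str) -> tuple[float, float]:
    """Return (corpus BLEU, string-check score) for an evaluation job."""
    url = f"http://nemo.test/v1/evaluation/jobs/{job_id}/results"
    metrics = requests.get(url, headers={"accept": "application/json"}).json()["tasks"]["my-custom-task"]["metrics"]
    corpus_bleu = metrics["bleu"]["scores"]["corpus"]["value"]
    string_check = metrics["string-check"]["scores"]["string-check"]["value"]
    return corpus_bleu, string_check

# Placeholder job IDs; use the evaluation job IDs saved earlier in this tutorial
before = fetch_scores("<evaluation-job-id>")
after = fetch_scores("<new-evaluation-job-id>")
print(f"BLEU: {before[0]:.2f} -> {after[0]:.2f}, string-check: {before[1]:.2f} -> {after[1]:.2f}")
```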