Customize and Evaluate Large Language Models#

In this tutorial, you learn how to fine-tune the Llama 3.1 8B Instruct model on a sample dataset and evaluate the model performance before and after customization.

Note

The time to complete this tutorial is approximately 45 minutes. In this tutorial, you run a customization job and two evaluation jobs. Customization job duration increases with model parameter count and dataset size. For more information about evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

Before you begin, complete the following prerequisites:

Upload Datasets#

To fine-tune a model, you need to split your dataset into three parts: one for training, one for testing, and one for validation.

All dataset files must be in JSONL format. The dataset in this example is in the prompt-completion format, where each line of the file must contain prompt and completion fields with values tailored to the specific task for which you train the model. The following is an example line of a dataset JSONL file designed for question-answering tasks:

{"prompt": "What is the distance from the earth to the sun? A: ", "completion": "93 millions miles"}

The following steps show how to use the Hugging Face APIs integrated into NeMo Data Store, download the prepared demo datasets, and upload them to the default namespace. The demo dataset is a subset of SQuAD for question-answer generation.

  1. Set the following environment variables:

    export HF_ENDPOINT="http://data-store.test/v1/hf"
    export HF_TOKEN="dummy-unused-value"
    
  2. Create a dataset repository:

    huggingface-cli repo create sample-basic-test --type dataset
    
  3. Create a new folder ~/tmp/sample_test_data:

    mkdir -p ~/tmp/sample_test_data
    
  4. Download the following sample datasets that you’ll use to fine-tune and evaluate the Llama 3.1 8B Instruct model. Ensure that each dataset file is placed in its respective directory, as outlined below:

    The following is the folder structure under ~/tmp/sample_test_data:

    ├── testing
    │   └── testing.jsonl
    ├── training
    │   └── training.jsonl
    └── validation
        └── validation.jsonl
    

    Note

    NeMo Customizer and NeMo Evaluator expect testing files to be in the testing folder, training files to be in the training folder, and validation files to be in the validation folder. Make sure that you put the files in the right places.
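
    Optionally, sanity-check the layout before uploading. The following is a small sketch that assumes the exact paths used in this tutorial and verifies that each folder contains a .jsonl file whose lines parse as JSON with prompt and completion fields:

    import json
    from pathlib import Path

    base = Path.home() / "tmp" / "sample_test_data"
    for split in ("training", "validation", "testing"):
        path = base / split / f"{split}.jsonl"
        with open(path) as f:
            for i, line in enumerate(f, start=1):
                record = json.loads(line)
                assert "prompt" in record and "completion" in record, f"{path}:{i}"
        print(f"{path} OK")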

  5. Upload the dataset files. The following command uploads the dataset files to the sample-basic-test repository in the default namespace. The default namespace here is a namespace within the NeMo microservices platform, managed by the NeMo Entity Store microservice; it is not a Kubernetes namespace.

    Note

    Make sure you point at folders that only contain the .jsonl files you want to use as the dataset files. If your dataset folder is large, you might need to upload the files individually.

    huggingface-cli upload --repo-type dataset \
       default/sample-basic-test \
       ~/tmp/sample_test_data
    
    Example Output
    Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
    Start hashing 3 files.
    Finished hashing 3 files.
    training.jsonl: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 618.6k/618.6k [00:00<00:00, 131kB/s]
    testing.jsonl: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 82.6k/82.6k [00:00<00:00, 141kB/s]
    validation.jsonl: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 77.5k/77.5k [00:00<00:00, 126kB/s]
    Upload 3 LFS files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  2.00it/s]
    https://nemo-datastore-endpoint/v1/hf/datasets/sample-basic-test/tree/main/.
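
    If you prefer to upload files individually, for example when the dataset folder is large, the huggingface_hub Python API can target the same Data Store endpoint. The following is a sketch that assumes the HF_ENDPOINT and HF_TOKEN values set earlier and the folder layout from the previous step:

    import os
    from huggingface_hub import HfApi

    # Point the client at the NeMo Data Store Hugging Face-compatible endpoint
    api = HfApi(endpoint=os.environ["HF_ENDPOINT"], token=os.environ["HF_TOKEN"])

    for split in ("training", "validation", "testing"):
        api.upload_file(
            path_or_fileobj=os.path.expanduser(f"~/tmp/sample_test_data/{split}/{split}.jsonl"),
            path_in_repo=f"{split}/{split}.jsonl",
            repo_id="default/sample-basic-test",
            repo_type="dataset",
        )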
    
  6. To use a dataset for operations such as evaluations and customizations, you need to register a dataset using the /v1/datasets endpoint. Registering the dataset enables you to refer to it by its namespace and name afterward.

    Register the dataset created in the previous step. Format the files_url field as hf://datasets/{namespace}/{dataset-name}.

    import time
    from nemo_microservices import NeMoMicroservices
    
    client = NeMoMicroservices(
       base_url="http://nemo.test",
       inference_base_url="http://nim.test",
    )
    
    dataset = client.datasets.create(
        name="sample-basic-test",
        namespace="default",
        description="This is an example of a dataset",
        files_url="hf://datasets/default/sample-basic-test",
        project="sample_project"
    )
    print(dataset)
    
    curl -X POST "http://nemo.test/v1/datasets" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "name": "sample-basic-test",
          "namespace": "default",
          "description": "This is an example of a dataset",
          "files_url": "hf://datasets/default/sample-basic-test",
          "project": "sample_project"
       }' | jq
    
    Example Output
    {
      "schema_version": "1.0",
      "id": "dataset-9C5GLZ6i9ZgW3cZ9RHP9ej",
      "description": "This is a example of a dataset",
      "type_prefix": null,
      "namespace": "default",
      "project": "sample_project",
      "created_at": "2024-12-09T01:01:28.54288",
      "updated_at": "2024-12-09T01:01:28.606851",
      "custom_fields": {},
      "ownership": null,
      "name": "sample-basic-test",
      "version_id": "main",
      "version_tags": [],
      "format": null,
      "files_url": "hf://datasets/default/sample-basic-test"
    }
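
    To confirm the registration, you can fetch the dataset back by its namespace and name. The following is a sketch that calls the dataset endpoint directly; the URL pattern is an assumption based on the registration endpoint above:

    import requests

    # Hypothetical lookup of the registered dataset by namespace and name
    resp = requests.get("http://nemo.test/v1/datasets/default/sample-basic-test")
    resp.raise_for_status()
    print(resp.json())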
    

Evaluate the Llama 3.1 8B Instruct Model Before Customization#

Run a custom evaluation job to evaluate a model on custom datasets by comparing the LLM-generated response with a ground truth response. The custom evaluation with bleu and string-check metrics is ideal for cases where the LLM generations are not expected to be highly creative.

To evaluate the Llama 3.1 8B Instruct model using the dataset that was uploaded in the Upload Datasets section, complete the following:

  1. Create an evaluation job with the target and the configuration defined.

    target is the model under evaluation, and config is the configuration for the evaluation job.

    The dataset that was uploaded in the previous section is specified in the configuration at config.tasks.<task-name>.dataset.files_url. The dataset URL must use the Hugging Face format hf://datasets/<dataset-namespace>/<dataset-name>/<relative-path-to-file>.

    job = client.evaluation.jobs.create(
       target={
          "type": "model",
          "model": {
             "api_endpoint": {
                "url": "http://nemo-nim-proxy:8000/v1/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
             }
          }
       },
       config={
          "type": "custom",
          "params": {
             "parallelism": 4
          },
          "tasks": {
             "my-custom-task": {
                "type": "completion",
                "params": {
                   "template": {
                      "prompt": "{{prompt}}",
                      "max_tokens": 20,
                      "temperature": 0.7,
                      "top_p": 0.9
                   }
                },
                "dataset": {
                   "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                   "bleu": {
                      "type": "bleu",
                      "params": {"references": ["{{ideal_response}}"]}
                   },
                   "string-check": {
                      "type": "string-check",
                      "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                   }
                }
             }
          }
       }
    )
    print(job)
    

    For more information about the parameters in the code above, refer to Evaluation Jobs Resource.

    curl -X POST \
       "http://nemo.test/v1/evaluation/jobs" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "target": {
             "type": "model",
             "model": {
                "api_endpoint": {
                   "url": "http://nemo-nim-proxy:8000/v1/completions",
                   "model_id": "meta/llama-3.1-8b-instruct"
                }
             }
          },
          "config": {
             "type": "custom",
             "params": {
                "parallelism": 4
             },
             "tasks": {
                "my-custom-task": {
                   "type": "completion",
                   "params": {
                      "template": {
                         "prompt": "{{prompt}}",
                         "max_tokens": 20,
                         "temperature": 0.7,
                         "top_p": 0.9
                      }
                   },
                   "dataset": {
                      "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                   },
                   "metrics": {
                      "bleu": {
                         "type": "bleu",
                         "params": {"references": ["{{ideal_response}}"]}
                      },
                      "string-check": {
                         "type": "string-check",
                         "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                      }
                   }
                }
             }
          }
       }' | jq
    

    To see a sample response, refer to To Create an Evaluation Job.

  2. Save the job ID:

    # The job ID is available from the job object created in the previous step
    evaluation_job_id = job.id
    
    export EVALUATION_JOB_ID=<evaluation-job-id>
    
  3. Check the status of the evaluation job using the following command:

    # Using the job ID from the previous step
    job_status = client.evaluation.jobs.retrieve(evaluation_job_id)
    print(job_status)
    
    curl -X GET \
       "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}" \
       -H 'accept: application/json' | jq
    
  4. After the job completes, you can see the evaluation results by running the following command:

    # Using the job ID from the previous step
    results = client.evaluation.jobs.results(evaluation_job_id)
    print(results)
    
    curl -X GET \
    "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}/results" \
    -H 'accept: application/json' | jq
    
    Evaluation Result Before Customization
    {
       "created_at": "2025-03-19T17:21:23.016944",
       "updated_at": "2025-03-19T17:21:23.016946",
       "id": "evaluation_result-4hDyVY4XwLnDDp645yVgu1",
       "job": "eval-EHBprrm8kMYQgjPZX31GSs",
       "tasks": {
          "my-custom-task": {
             "metrics": {
                "bleu": {
                   "scores": {
                      "sentence": {
                         "value": 3.316167505330637,
                         "stats": {
                            "count": 90,
                            "sum": 298.45507547975734,
                            "mean": 3.316167505330637
                         }
                      },
                      "corpus": {
                         "value": 2.1254915037706192
                      }
                   }
                },
                "string-check": {
                   "scores": {
                      "string-check": {
                         "value": 0,
                         "stats": {
                            "count": 90,
                            "sum": 0,
                            "mean": 0
                         }
                      }
                   }
                }
             }
          }
       },
       "groups": {},
       "namespace": "default",
       "custom_fields": {}
    }
    

    Make note of the corpus-level BLEU and string-check scores for the result. For this example, the corpus-level BLEU score is 2.12 and the string-check score is 0 before customization.
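
    If you want to pull these numbers out programmatically instead of reading the JSON by eye, the following sketch fetches the results payload again and indexes into it; the key path mirrors the example output above:

    import requests

    # Fetch the raw results JSON for the evaluation job and extract the two scores
    resp = requests.get(
        f"http://nemo.test/v1/evaluation/jobs/{evaluation_job_id}/results",
        headers={"accept": "application/json"},
    )
    resp.raise_for_status()
    metrics = resp.json()["tasks"]["my-custom-task"]["metrics"]

    bleu_corpus = metrics["bleu"]["scores"]["corpus"]["value"]
    string_check = metrics["string-check"]["scores"]["string-check"]["value"]
    print(f"corpus BLEU: {bleu_corpus:.2f}, string-check: {string_check:.2f}")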

Customize the Llama 3.1 8B Instruct Model#

Use the following procedure to fine-tune the Llama 3.1 8B Instruct model with the uploaded datasets.

  1. Run the following command to run a customization job:

    Note

    You can check the available customization configs by running client.customization.configs.list() in the Python SDK or GET /v1/customization/configs in cURL. The customization configs with the A100 tag are suitable for most GPUs with 80 GB of memory, while those with the L40 tag are suitable for most GPUs with 40 GB of memory. Choose the appropriate config for your GPU and specify it in the config parameter.
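
    For example, you can list the configs first and pick one that matches your hardware (a minimal sketch using the call mentioned above; the shape of the response may vary by release):

    # List the available customization configurations and review the tags
    configs = client.customization.configs.list()
    print(configs)

    Then submit the customization job: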

    customization_job = client.customization.jobs.create(
        config="meta/llama-3.1-8b-instruct@v1.0.0+A100",
        dataset={
            "name": "sample-basic-test",
            "namespace": "default"
        },
        hyperparameters={
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": 3,
            "batch_size": 16,
            "learning_rate": 0.0001,
            "lora": {"adapter_dim": 16}
        },
        project="test-project",
        ownership={
            "created_by": "your-username",
            "access_policies": {
                "arbitrary": "json"
            }
        },
        output_model="default/test-example-model@v1"
    )
    print(customization_job)
    

    For more information about the parameters in the code above, refer to Customization Jobs Resource.

    curl -X POST \
       "http://nemo.test/v1/customization/jobs" \
       -H 'Accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
             "config": "meta/llama-3.1-8b-instruct@v1.0.0+A100",
             "dataset": {
                "name": "sample-basic-test",
                "namespace": "default"
             },
             "hyperparameters": {
                "training_type": "sft",
                "finetuning_type": "lora",
                "epochs": 3,
                "batch_size": 16,
                "learning_rate": 0.0001,
                "lora": {"adapter_dim": 16}
             },
             "project": "test-project",
             "ownership": {
                "created_by": "your-username",
                "access_policies": {
                   "arbitrary": "json"
                }
             },
             "output_model": "default/test-example-model@v1"
          }' | jq
    
    Example Output
    {
      "id": "cust-JGTaMbJMdqjJU8WbQdN9Q2",
      "created_at": "2024-12-09T04:06:28.542884",
      "updated_at": "2024-12-09T04:06:28.542884",
      "config": {
        "schema_version": "1.0",
        "id": "af783f5b-d985-4e5b-bbb7-f9eec39cc0b1",
        "created_at": "2024-12-09T04:06:28.542657",
        "updated_at": "2024-12-09T04:06:28.569837",
        "custom_fields": {},
        "name": "meta/llama-3_1-8b-instruct",
        "base_model": "meta/llama-3_1-8b-instruct",
        "model_path": "llama-3_1-8b-instruct",
        "training_types": [],
        "finetuning_types": ["lora"],
        "precision": "bf16",
        "num_gpus": 4,
        "num_nodes": 1,
        "micro_batch_size": 1,
        "tensor_parallel_size": 1,
        "max_seq_length": 4096
      },
      "dataset": {
        "schema_version": "1.0",
        "id": "dataset-XU4pvGzr5tvawnbVxeJMTb",
        "created_at": "2024-12-09T04:06:28.542657",
        "updated_at": "2024-12-09T04:06:28.542660",
        "custom_fields": {},
        "name": "default/sample-basic-test",
        "version_id": "main",
        "version_tags": []
      },
      "hyperparameters": {
        "finetuning_type": "lora",
        "training_type": "sft",
        "batch_size": 16,
        "epochs": 10,
        "learning_rate": 0.0001,
        "lora": {
          "adapter_dim": 16
        }
      },
      "output_model": "default/test-example-model@v1",
      "status": "created",
      "project": "test-project",
      "custom_fields": {},
      "ownership": {
        "created_by": "me",
        "access_policies": {
          "arbitrary": "json"
        }
      }
    }
    
  2. Save the customization job ID:

    # The job ID is available from the customization_job object created in the previous step
    cust_id = customization_job.id
    
    export CUST_ID=<customization-job-id>
    
  3. Check the status of the customization job. Use the following command to verify that the job has completed:

    # Using the customization job ID from the previous step
    job_status = None
    while job_status not in ("completed", "failed", "cancelled"):
       status = client.customization.jobs.status.retrieve(cust_id)
       job_status = status.status
       print(job_status)
       time.sleep(60)
    
    watch -n 60 "curl 'http://nemo.test/v1/customization/jobs/${CUST_ID}/status' | jq"
    

    When the status field changes to completed, the job has finished creating and uploading the output_model.

    Example Successful Response
    {
      "created_at": "2024-12-09T04:06:28.580220",
      "updated_at": "2024-12-09T04:21:19.852832",
      "status": "completed",
      "steps_completed": 1210,
      "epochs_completed": 10,
      "percentage_done": 100.0,
      "best_epoch": 3,
      "train_loss": 1.718016266822815,
      "val_loss": 1.8661999702453613
    }
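
    Optionally, confirm that the customized model is being served before you prompt it. The following is a sketch that assumes the NIM proxy exposes the standard OpenAI-compatible /v1/models listing:

    import requests

    # List the models served behind the NIM proxy and look for the customized model
    resp = requests.get("http://nim.test/v1/models")
    resp.raise_for_status()
    model_ids = [m["id"] for m in resp.json()["data"]]
    print("default/test-example-model@v1" in model_ids)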
    
  4. Test the fine-tuned model by sending a prompt to the output_model.

    completion = client.completions.create(
        model="default/test-example-model@v1",
        prompt="When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer: ",
        max_tokens=128
    )
    print(completion)
    
    curl -X POST "http://nim.test/v1/completions" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "model": "default/test-example-model@v1",
          "prompt": "When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer: ",
          "max_tokens": 128
       }' | jq
    

Evaluate the Llama 3.1 8B Instruct Model After Customization#

Run another evaluation job with the same configuration that you used for evaluating the Llama 3.1 8B Instruct model before customization. By reusing the same evaluation config, you can compare the evaluation metrics before and after customization.

To run an evaluation job on the fine-tuned model, specify the custom model as the target:

  1. Submit an evaluation job with a new evaluation target but the same evaluation config from the section Evaluate the Llama 3.1 8B Instruct Model Before Customization.

    customized_job = client.evaluation.jobs.create(
       target={
          "type": "model",
          "model": {
             "api_endpoint": {
                "url": "http://nemo-nim-proxy:8000/v1/completions",
                "model_id": "default/test-example-model@v1"
             }
          }
       },
       config={
          "type": "custom",
          "params": {
             "parallelism": 4
          },
          "tasks": {
             "my-custom-task": {
                "type": "completion",
                "params": {
                   "template": {
                      "prompt": "{{prompt}}",
                      "max_tokens": 20,
                      "temperature": 0.7,
                      "top_p": 0.9
                   }
                },
                "dataset": {
                   "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                   "bleu": {
                      "type": "bleu",
                      "params": {"references": ["{{ideal_response}}"]}
                   },
                   "string-check": {
                      "type": "string-check",
                      "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                   }
                }
             }
          }
       }
    )
    print(customized_job)
    

    For more information about the parameters in the code above, refer to Evaluation Jobs Resource.

    curl -X POST \
       'http://nemo.test/v1/evaluation/jobs' \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "target": {
             "type": "model",
             "model": {
                "api_endpoint": {
                   "url": "http://nemo-nim-proxy:8000/v1/completions",
                   "model_id": "default/test-example-model@v1"
                }
             }
          },
          "config": {
             "type": "custom",
             "params": {
                "parallelism": 4
             },
             "tasks": {
                "my-custom-task": {
                   "type": "completion",
                   "params": {
                      "template": {
                         "prompt": "{{prompt}}",
                         "max_tokens": 20,
                         "temperature": 0.7,
                         "top_p": 0.9
                      }
                   },
                   "dataset": {
                      "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                   },
                   "metrics": {
                      "bleu": {
                         "type": "bleu",
                         "params": {"references": ["{{ideal_response}}"]}
                      },
                      "string-check": {
                         "type": "string-check",
                         "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                      }
                   }
                }
             }
          }
       }' | jq
    

    To see a sample response, refer to To Create an Evaluation Job.

  2. Save the evaluation job ID:

    # The job ID is available from the customized_job object created in the previous step
    customized_model_evaluation_job_id = customized_job.id
    
    export CUSTOMIZED_MODEL_EVALUATION_JOB_ID=<new-evaluation-job-id>
    
  3. Check the status of the evaluation job by using the following command:

    # Using the job ID from the previous step
    job_status = None
    while job_status not in ("completed", "failed", "cancelled"):
       job = client.evaluation.jobs.retrieve(customized_model_evaluation_job_id)
       job_status = job.status
       print(job_status)
       time.sleep(60)
    
    watch -n 60 "curl -X GET \
       'http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}' \
       -H 'accept: application/json' | jq"
    

    While the job is running, the status is running. After the job completes, the status is completed.

  4. After the job completes, you can see the results of the evaluation by using the following command:

    # Using the job ID from the previous step
    results = client.evaluation.jobs.results(customized_model_evaluation_job_id)
    print(results)
    

    For more information about the parameters in the code above, refer to Evaluation Results Resource.

    curl -X GET \
    "http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}/results" \
    -H 'accept: application/json' | jq
    
    Evaluation Result After Customization
    {
       "created_at": "2025-03-19T17:46:34.216345",
       "updated_at": "2025-03-19T17:46:34.216347",
       "id": "evaluation_result-LDRwuXthNK8XtjLwNtL2ZG",
       "job": "eval-RNLGdCpnWc7XYGkXER2sAv",
       "tasks": {
          "my-custom-task": {
             "metrics": {
             "bleu": {
                "scores": {
                   "sentence": {
                      "value": 58.27715543476614,
                      "stats": {
                         "count": 90,
                         "sum": 5244.943989128952,
                         "mean": 58.27715543476614
                      }
                   },
                   "corpus": {
                      "value": 30.477136043689168
                   }
                }
             },
             "string-check": {
                "scores": {
                   "string-check": {
                      "value": 0.4666666666666667,
                      "stats": {
                         "count": 90,
                         "sum": 42,
                         "mean": 0.4666666666666667
                      }
                   }
                }
             }
             }
          }
       },
       "groups": {},
       "namespace": "default",
       "custom_fields": {}
    }
    

    Make note of the corpus-level BLEU and string-check scores for the result. For this example, the corpus-level BLEU score is 30.47 and the string-check score is 0.46 after customization.

Conclusion#

Compare the scores from the original model evaluation to those from the evaluation of the fine-tuned model. In the example outputs provided in this tutorial, both the corpus-level BLEU and string-check scores improve.

  • The corpus-level BLEU score improves from ~2.12 to ~30.47.

  • The string-check score improves from ~0 to ~0.46.
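
If you want this comparison in one place, you can fetch both result payloads and print the scores side by side. The following sketch reuses the score extraction shown earlier and the two evaluation job IDs saved in this tutorial:

    import requests

    def corpus_scores(job_id):
        """Return the corpus-level BLEU and string-check scores for an evaluation job."""
        resp = requests.get(f"http://nemo.test/v1/evaluation/jobs/{job_id}/results")
        resp.raise_for_status()
        metrics = resp.json()["tasks"]["my-custom-task"]["metrics"]
        return (
            metrics["bleu"]["scores"]["corpus"]["value"],
            metrics["string-check"]["scores"]["string-check"]["value"],
        )

    before = corpus_scores(evaluation_job_id)
    after = corpus_scores(customized_model_evaluation_job_id)
    print(f"corpus BLEU:  {before[0]:.2f} -> {after[0]:.2f}")
    print(f"string-check: {before[1]:.2f} -> {after[1]:.2f}")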