Customize and Evaluate Large Language Models#

In this tutorial, you learn how to fine-tune the Llama 3.1 8B Instruct model on a sample dataset and evaluate the model performance before and after customization.

Note

The time to complete this tutorial is approximately 45 minutes. In this tutorial, you run a customization job and two evaluation jobs. Customization job duration increases with model parameter count and dataset size. For more information about evaluation job duration, refer to Expected Evaluation Duration.

Prerequisites#

Before you begin, complete the following prerequisites:

Upload Datasets#

To fine-tune a model, you need to split your dataset into three parts: one for training, one for testing, and one for validation.

All dataset files must be in JSONL format. The dataset in this example is in the prompt-completion format, where each line of the file must contain prompt and completion fields with values tailored to the specific task for which you train the model. The following is an example line of a dataset JSONL file designed for question-answering tasks:

{"prompt": "What is the distance from the earth to the sun? A: ", "completion": "93 millions miles"}

The following steps show how to use the Hugging Face APIs integrated into NeMo Data Store, download the prepared demo datasets, and upload them to the default namespace. The demo dataset is a subset of SQuAD for question-answer generation.

  1. Set the following environment variables:

    export HF_ENDPOINT="http://data-store.test/v1/hf"
    export HF_TOKEN="dummy-unused-value"
    
  2. Create a dataset repository:

    huggingface-cli repo create sample-basic-test --type dataset
    
  3. Create a new folder ~/tmp/sample_test_data:

    mkdir -p ~/tmp/sample_test_data
    
  4. Download the following sample datasets that you’ll use to fine-tune and evaluate the Llama 3.1 8B Instruct model. Ensure that each dataset file is placed in its respective directory, as outlined below:

    The following is the folder structure under ~/tmp/sample_test_data:

    ├── testing
    │   └── testing.jsonl
    ├── training
    │   └── training.jsonl
    └── validation
        └── validation.jsonl
    

    Note

    NeMo Customizer and NeMo Evaluator expect testing files to be in the testing folder, training files to be in the training folder, and validation files to be in the validation folder. Make sure that you put the files in the right places.
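
    Optionally, sanity-check the layout before uploading. The following is a small sketch that assumes the exact paths used in this tutorial and verifies that each folder contains a .jsonl file whose lines parse as JSON with prompt and completion fields:

    import json
    from pathlib import Path

    base = Path.home() / "tmp" / "sample_test_data"
    for split in ("training", "validation", "testing"):
        path = base / split / f"{split}.jsonl"
        with open(path) as f:
            for i, line in enumerate(f, start=1):
                record = json.loads(line)
                assert "prompt" in record and "completion" in record, f"{path}:{i}"
        print(f"{path} OK")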

  5. Upload the dataset files. The following command uploads the dataset files to the sample-basic-test repository in the default namespace. The default namespace here is a namespace within the NeMo microservices platform, managed by the NeMo Entity Store microservice; it is not a Kubernetes namespace.

    Note

    Make sure you point at folders that only contain the .jsonl files you want to use as the dataset files. If your dataset folder is large, you might need to upload the files individually.

    huggingface-cli upload --repo-type dataset \
       default/sample-basic-test \
       ~/tmp/sample_test_data
    
    Example Output
    Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
    Start hashing 3 files.
    Finished hashing 3 files.
    training.jsonl: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 618.6k/618.6k [00:00<00:00, 131kB/s]
    testing.jsonl: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 82.6k/82.6k [00:00<00:00, 141kB/s]
    validation.jsonl: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 77.5k/77.5k [00:00<00:00, 126kB/s]
    Upload 3 LFS files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  2.00it/s]
    https://nemo-datastore-endpoint/v1/hf/datasets/sample-basic-test/tree/main/.
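
    If you prefer to upload files individually, for example when the dataset folder is large, the huggingface_hub Python API can target the same Data Store endpoint. The following is a sketch that assumes the HF_ENDPOINT and HF_TOKEN values set earlier and the folder layout from the previous step:

    import os
    from huggingface_hub import HfApi

    # Point the client at the NeMo Data Store Hugging Face-compatible endpoint
    api = HfApi(endpoint=os.environ["HF_ENDPOINT"], token=os.environ["HF_TOKEN"])

    for split in ("training", "validation", "testing"):
        api.upload_file(
            path_or_fileobj=os.path.expanduser(f"~/tmp/sample_test_data/{split}/{split}.jsonl"),
            path_in_repo=f"{split}/{split}.jsonl",
            repo_id="default/sample-basic-test",
            repo_type="dataset",
        )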
    
  6. To use a dataset for operations such as evaluations and customizations, you need to register a dataset using the /v1/datasets endpoint. Registering the dataset enables you to refer to it by its namespace and name afterward.

    Register the dataset created in the previous step. Format the files_url field as hf://datasets/{namespace}/{dataset-name}.

    import time
    from nemo_microservices import NeMoMicroservices
    
    client = NeMoMicroservices(
       base_url="http://nemo.test",
       inference_base_url="http://nim.test",
    )
    
    dataset = client.datasets.create(
        name="sample-basic-test",
        namespace="default",
        description="This is an example of a dataset",
        files_url="hf://datasets/default/sample-basic-test",
        project="sample_project"
    )
    print(dataset)
    
    curl -X POST "http://nemo.test/v1/datasets" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "name": "sample-basic-test",
          "namespace": "default",
          "description": "This is an example of a dataset",
          "files_url": "hf://datasets/default/sample-basic-test",
          "project": "sample_project"
       }' | jq
    
    Example Output
    {
      "schema_version": "1.0",
      "id": "dataset-9C5GLZ6i9ZgW3cZ9RHP9ej",
      "description": "This is a example of a dataset",
      "type_prefix": null,
      "namespace": "default",
      "project": "sample_project",
      "created_at": "2024-12-09T01:01:28.54288",
      "updated_at": "2024-12-09T01:01:28.606851",
      "custom_fields": {},
      "ownership": null,
      "name": "sample-basic-test",
      "version_id": "main",
      "version_tags": [],
      "format": null,
      "files_url": "hf://datasets/default/sample-basic-test"
    }
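
    To confirm the registration, you can fetch the dataset back by its namespace and name. The following is a sketch that calls the dataset endpoint directly; the URL pattern is an assumption based on the registration endpoint above:

    import requests

    # Hypothetical lookup of the registered dataset by namespace and name
    resp = requests.get("http://nemo.test/v1/datasets/default/sample-basic-test")
    resp.raise_for_status()
    print(resp.json())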
    

Evaluate the Llama 3.1 8B Instruct Model Before Customization#

Run a custom evaluation job to evaluate a model on custom datasets by comparing the LLM-generated response with a ground truth response. The custom evaluation with bleu and string-check metrics is ideal for cases where the LLM generations are not expected to be highly creative.

To evaluate the Llama 3.1 8B Instruct model using the dataset that was uploaded in the Upload Datasets section, complete the following:

  1. Create an evaluation job with the target and the configuration defined.

    target is the model under evaluation, and config is the configuration for the evaluation job.

    The dataset that was uploaded in the previous section is specified in the configuration at config.tasks.<task-name>.dataset.files_url. The dataset URL must use the Hugging Face format hf://datasets/<dataset-namespace>/<dataset-name>/<relative-path-to-file>.

    job = client.evaluation.jobs.create(
       target={
          "type": "model",
          "model": {
             "api_endpoint": {
                "url": "http://nemo-nim-proxy:8000/v1/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
             }
          }
       },
       config={
          "type": "custom",
          "params": {
             "parallelism": 4
          },
          "tasks": {
             "my-custom-task": {
                "type": "completion",
                "params": {
                   "template": {
                      "prompt": "{{prompt}}",
                      "max_tokens": 20,
                      "temperature": 0.7,
                      "top_p": 0.9
                   }
                },
                "dataset": {
                   "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                   "bleu": {
                      "type": "bleu",
                      "params": {"references": ["{{ideal_response}}"]}
                   },
                   "string-check": {
                      "type": "string-check",
                      "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                   }
                }
             }
          }
       }
    )
    print(job)
    

    For more information about the parameters in the code above, refer to Evaluation Jobs Resource.

    curl -X POST \
       "http://nemo.test/v1/evaluation/jobs" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "target": {
             "type": "model",
             "model": {
                "api_endpoint": {
                   "url": "http://nemo-nim-proxy:8000/v1/completions",
                   "model_id": "meta/llama-3.1-8b-instruct"
                }
             }
          },
          "config": {
             "type": "custom",
             "params": {
                "parallelism": 4
             },
             "tasks": {
                "my-custom-task": {
                   "type": "completion",
                   "params": {
                      "template": {
                         "prompt": "{{prompt}}",
                         "max_tokens": 20,
                         "temperature": 0.7,
                         "top_p": 0.9
                      }
                   },
                   "dataset": {
                      "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                   },
                   "metrics": {
                      "bleu": {
                         "type": "bleu",
                         "params": {"references": ["{{ideal_response}}"]}
                      },
                      "string-check": {
                         "type": "string-check",
                         "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                      }
                   }
                }
             }
          }
       }' | jq
    

    To see a sample response, refer to To Create an Evaluation Job.

  2. Save the job ID:

    # The job ID is available from the job object created in the previous step
    evaluation_job_id = job.id
    
    export EVALUATION_JOB_ID=<evaluation-job-id>
    
  3. Check the status of the evaluation job using the following command:

    # Using the job ID from the previous step
    job_status = client.evaluation.jobs.retrieve(evaluation_job_id)
    print(job_status)
    
    curl -X GET \
       "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}" \
       -H 'accept: application/json' | jq
    
  4. After the job completes, you can see the evaluation results by running the following command:

    # Using the job ID from the previous step
    results = client.evaluation.jobs.results(evaluation_job_id)
    print(results)
    
    curl -X GET \
    "http://nemo.test/v1/evaluation/jobs/${EVALUATION_JOB_ID}/results" \
    -H 'accept: application/json' | jq
    
    Evaluation Result Before Customization
    {
       "created_at": "2025-03-19T17:21:23.016944",
       "updated_at": "2025-03-19T17:21:23.016946",
       "id": "evaluation_result-4hDyVY4XwLnDDp645yVgu1",
       "job": "eval-EHBprrm8kMYQgjPZX31GSs",
       "tasks": {
          "my-custom-task": {
             "metrics": {
                "bleu": {
                   "scores": {
                      "sentence": {
                         "value": 3.316167505330637,
                         "stats": {
                            "count": 90,
                            "sum": 298.45507547975734,
                            "mean": 3.316167505330637
                         }
                      },
                      "corpus": {
                         "value": 2.1254915037706192
                      }
                   }
                },
                "string-check": {
                   "scores": {
                      "string-check": {
                         "value": 0,
                         "stats": {
                            "count": 90,
                            "sum": 0,
                            "mean": 0
                         }
                      }
                   }
                }
             }
          }
       },
       "groups": {},
       "namespace": "default",
       "custom_fields": {}
    }
    

    Make note of the corpus-level BLEU and string-check scores for the result. For this example, the corpus-level BLEU score is 2.12 and the string-check score is 0 before customization.
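
    If you want to pull these numbers out programmatically instead of reading the JSON by eye, the following sketch fetches the results payload again and indexes into it; the key path mirrors the example output above:

    import requests

    # Fetch the raw results JSON for the evaluation job and extract the two scores
    resp = requests.get(
        f"http://nemo.test/v1/evaluation/jobs/{evaluation_job_id}/results",
        headers={"accept": "application/json"},
    )
    resp.raise_for_status()
    metrics = resp.json()["tasks"]["my-custom-task"]["metrics"]

    bleu_corpus = metrics["bleu"]["scores"]["corpus"]["value"]
    string_check = metrics["string-check"]["scores"]["string-check"]["value"]
    print(f"corpus BLEU: {bleu_corpus:.2f}, string-check: {string_check:.2f}")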

Customize the Llama 3.1 8B Instruct Model#

Use the following procedure to fine-tune the Llama 3.1 8B Instruct model with the uploaded datasets.

  1. Run the following command to run a customization job:

    Note

    You can check the available customization configs by running client.customization.configs.list() in the Python SDK or GET /v1/customization/configs in cURL. The customization configs with the A100 tag are suitable for most GPUs with 80 GB of memory, while those with the L40 tag are suitable for most GPUs with 40 GB of memory. Choose the appropriate config for your GPU and specify it in the config parameter.
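
    For example, you can list the configs first and pick one that matches your hardware (a minimal sketch using the call mentioned above; the shape of the response may vary by release):

    # List the available customization configurations and review the tags
    configs = client.customization.configs.list()
    print(configs)

    Then submit the customization job: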

    customization_job = client.customization.jobs.create(
        config="meta/llama-3.1-8b-instruct@v1.0.0+A100",
        dataset={
            "name": "sample-basic-test",
            "namespace": "default"
        },
        hyperparameters={
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": 3,
            "batch_size": 16,
            "learning_rate": 0.0001,
            "lora": {"adapter_dim": 16}
        },
        project="test-project",
        ownership={
            "created_by": "your-username",
            "access_policies": {
                "arbitrary": "json"
            }
        },
        output_model="default/test-example-model@v1"
    )
    print(customization_job)
    

    For more information about the parameters in the code above, refer to Customization Jobs Resource.

    curl -X POST \
       "http://nemo.test/v1/customization/jobs" \
       -H 'Accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
             "config": "meta/llama-3.1-8b-instruct@v1.0.0+A100",
             "dataset": {
                "name": "sample-basic-test",
                "namespace": "default"
             },
             "hyperparameters": {
                "training_type": "sft",
                "finetuning_type": "lora",
                "epochs": 3,
                "batch_size": 16,
                "learning_rate": 0.0001,
                "lora": {"adapter_dim": 16}
             },
             "project": "test-project",
             "ownership": {
                "created_by": "your-username",
                "access_policies": {
                   "arbitrary": "json"
                }
             },
             "output_model": "default/test-example-model@v1"
          }' | jq
    
    Example Output
    {
      "id": "cust-JGTaMbJMdqjJU8WbQdN9Q2",
      "created_at": "2024-12-09T04:06:28.542884",
      "updated_at": "2024-12-09T04:06:28.542884",
      "config": {
        "schema_version": "1.0",
        "id": "af783f5b-d985-4e5b-bbb7-f9eec39cc0b1",
        "created_at": "2024-12-09T04:06:28.542657",
        "updated_at": "2024-12-09T04:06:28.569837",
        "custom_fields": {},
        "name": "meta/llama-3_1-8b-instruct",
        "base_model": "meta/llama-3_1-8b-instruct",
        "model_path": "llama-3_1-8b-instruct",
        "training_types": [],
        "finetuning_types": ["lora"],
        "precision": "bf16",
        "num_gpus": 4,
        "num_nodes": 1,
        "micro_batch_size": 1,
        "tensor_parallel_size": 1,
        "max_seq_length": 4096
      },
      "dataset": {
        "schema_version": "1.0",
        "id": "dataset-XU4pvGzr5tvawnbVxeJMTb",
        "created_at": "2024-12-09T04:06:28.542657",
        "updated_at": "2024-12-09T04:06:28.542660",
        "custom_fields": {},
        "name": "default/sample-basic-test",
        "version_id": "main",
        "version_tags": []
      },
      "hyperparameters": {
        "finetuning_type": "lora",
        "training_type": "sft",
        "batch_size": 16,
        "epochs": 10,
        "learning_rate": 0.0001,
        "lora": {
          "adapter_dim": 16
        }
      },
      "output_model": "default/test-example-model@v1",
      "status": "created",
      "project": "test-project",
      "custom_fields": {},
      "ownership": {
        "created_by": "me",
        "access_policies": {
          "arbitrary": "json"
        }
      }
    }
    
  2. Save the customization job ID:

    # The job ID is available from the customization_job object created in the previous step
    cust_id = customization_job.id
    
    export CUST_ID=<customization-job-id>
    
  3. Check the status of the customization job. Use the following command to verify that the job has completed:

    # Using the customization job ID from the previous step
    job_status = None
    while job_status not in ("completed", "failed", "cancelled"):
       status = client.customization.jobs.status.retrieve(cust_id)
       job_status = status.status
       print(job_status)
       time.sleep(60)
    
    watch -n 60 "curl 'http://nemo.test/v1/customization/jobs/${CUST_ID}/status' | jq"
    

    When the status field changes to completed, the job has finished creating and uploading the output_model.

    Example Successful Response
    {
      "created_at": "2024-12-09T04:06:28.580220",
      "updated_at": "2024-12-09T04:21:19.852832",
      "status": "completed",
      "steps_completed": 1210,
      "epochs_completed": 10,
      "percentage_done": 100.0,
      "best_epoch": 3,
      "train_loss": 1.718016266822815,
      "val_loss": 1.8661999702453613
    }
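
    Optionally, confirm that the customized model is being served before you prompt it. The following is a sketch that assumes the NIM proxy exposes the standard OpenAI-compatible /v1/models listing:

    import requests

    # List the models served behind the NIM proxy and look for the customized model
    resp = requests.get("http://nim.test/v1/models")
    resp.raise_for_status()
    model_ids = [m["id"] for m in resp.json()["data"]]
    print("default/test-example-model@v1" in model_ids)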
    
  4. Test the fine-tuned model by sending a prompt to the output_model.

    completion = client.completions.create(
        model="default/test-example-model@v1",
        prompt="When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer: ",
        max_tokens=128
    )
    print(completion)
    
    curl -X POST "http://nim.test/v1/completions" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "model": "default/test-example-model@v1",
          "prompt": "When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer: ",
          "max_tokens": 128
       }' | jq
    

Evaluate the Llama 3.1 8B Instruct Model After Customization#

Run another evaluation job with the same configuration that you used for evaluating the Llama 3.1 8B Instruct model before customization. By reusing the same evaluation config, you can compare the evaluation metrics before and after customization.

To run an evaluation job on the fine-tuned model, specify the custom model as the target:

  1. Submit an evaluation job with a new evaluation target but the same evaluation config from the section Evaluate the Llama 3.1 8B Instruct Model Before Customization.

    customized_job = client.evaluation.jobs.create(
       target={
          "type": "model",
          "model": {
             "api_endpoint": {
                "url": "http://nemo-nim-proxy:8000/v1/completions",
                "model_id": "default/test-example-model@v1"
             }
          }
       },
       config={
          "type": "custom",
          "params": {
             "parallelism": 4
          },
          "tasks": {
             "my-custom-task": {
                "type": "completion",
                "params": {
                   "template": {
                      "prompt": "{{prompt}}",
                      "max_tokens": 20,
                      "temperature": 0.7,
                      "top_p": 0.9
                   }
                },
                "dataset": {
                   "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                },
                "metrics": {
                   "bleu": {
                      "type": "bleu",
                      "params": {"references": ["{{ideal_response}}"]}
                   },
                   "string-check": {
                      "type": "string-check",
                      "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                   }
                }
             }
          }
       }
    )
    print(customized_job)
    

    For more information about the parameters in the code above, refer to Evaluation Jobs Resource.

    curl -X POST \
       'http://nemo.test/v1/evaluation/jobs' \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "target": {
             "type": "model",
             "model": {
                "api_endpoint": {
                   "url": "http://nemo-nim-proxy:8000/v1/completions",
                   "model_id": "default/test-example-model@v1"
                }
             }
          },
          "config": {
             "type": "custom",
             "params": {
                "parallelism": 4
             },
             "tasks": {
                "my-custom-task": {
                   "type": "completion",
                   "params": {
                      "template": {
                         "prompt": "{{prompt}}",
                         "max_tokens": 20,
                         "temperature": 0.7,
                         "top_p": 0.9
                      }
                   },
                   "dataset": {
                      "files_url": "hf://datasets/default/sample-basic-test/testing/testing.jsonl"
                   },
                   "metrics": {
                      "bleu": {
                         "type": "bleu",
                         "params": {"references": ["{{ideal_response}}"]}
                      },
                      "string-check": {
                         "type": "string-check",
                         "params": {"check": ["{{ideal_response | trim}}", "equals", "{{output_text | trim}}"]}
                      }
                   }
                }
             }
          }
       }' | jq
    

    To see a sample response, refer to To Create an Evaluation Job.

  2. Save the evaluation job ID:

    # The job ID is available from the customized_job object created in the previous step
    customized_model_evaluation_job_id = customized_job.id
    
    export CUSTOMIZED_MODEL_EVALUATION_JOB_ID=<new-evaluation-job-id>
    
  3. Check the status of the evaluation job by using the following command:

    # Using the job ID from the previous step
    job_status = None
    while job_status not in ("completed", "failed", "cancelled"):
       job = client.evaluation.jobs.retrieve(customized_model_evaluation_job_id)
       job_status = job.status
       print(job_status)
       time.sleep(60)
    
    watch -n 60 "curl -X GET \
       'http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}' \
       -H 'accept: application/json' | jq"
    

    While the job is running, the status is running. After the job completes, the status is completed.

  4. After the job completes, you can see the results of the evaluation by using the following command:

    # Using the job ID from the previous step
    results = client.evaluation.jobs.results(customized_model_evaluation_job_id)
    print(results)
    

    For more information about the parameters in the code above, refer to Evaluation Results Resource.

    curl -X GET \
    "http://nemo.test/v1/evaluation/jobs/${CUSTOMIZED_MODEL_EVALUATION_JOB_ID}/results" \
    -H 'accept: application/json' | jq
    
    Evaluation Result After Customization
    {
       "created_at": "2025-03-19T17:46:34.216345",
       "updated_at": "2025-03-19T17:46:34.216347",
       "id": "evaluation_result-LDRwuXthNK8XtjLwNtL2ZG",
       "job": "eval-RNLGdCpnWc7XYGkXER2sAv",
       "tasks": {
          "my-custom-task": {
             "metrics": {
             "bleu": {
                "scores": {
                   "sentence": {
                      "value": 58.27715543476614,
                      "stats": {
                         "count": 90,
                         "sum": 5244.943989128952,
                         "mean": 58.27715543476614
                      }
                   },
                   "corpus": {
                      "value": 30.477136043689168
                   }
                }
             },
             "string-check": {
                "scores": {
                   "string-check": {
                      "value": 0.4666666666666667,
                      "stats": {
                         "count": 90,
                         "sum": 42,
                         "mean": 0.4666666666666667
                      }
                   }
                }
             }
             }
          }
       },
       "groups": {},
       "namespace": "default",
       "custom_fields": {}
    }
    

    Make note of the corpus-level BLEU and string-check scores for the result. For this example, the corpus-level BLEU score is 30.47 and the string-check score is 0.46 after customization.

Conclusion#

Compare the scores from the original model evaluation to those from the evaluation of the fine-tuned model. In the example outputs provided in this tutorial, both the corpus-level BLEU and string-check scores improve.

  • The corpus-level BLEU score improves from ~2.12 to ~30.47.

  • The string-check score improves from ~0 to ~0.46.
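
If you want this comparison in one place, you can fetch both result payloads and print the scores side by side. The following sketch reuses the score extraction shown earlier and the two evaluation job IDs saved in this tutorial:

    import requests

    def corpus_scores(job_id):
        """Return the corpus-level BLEU and string-check scores for an evaluation job."""
        resp = requests.get(f"http://nemo.test/v1/evaluation/jobs/{job_id}/results")
        resp.raise_for_status()
        metrics = resp.json()["tasks"]["my-custom-task"]["metrics"]
        return (
            metrics["bleu"]["scores"]["corpus"]["value"],
            metrics["string-check"]["scores"]["string-check"]["value"],
        )

    before = corpus_scores(evaluation_job_id)
    after = corpus_scores(customized_model_evaluation_job_id)
    print(f"corpus BLEU:  {before[0]:.2f} -> {after[0]:.2f}")
    print(f"string-check: {before[1]:.2f} -> {after[1]:.2f}")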