Model Evaluation

NVIDIA provides a simple tool to help evaluate trained checkpoints. You can evaluate the capabilities of the StarCoder2 models on the following task:

  • human_eval

Run Evaluation

To run evaluation, update conf/config.yaml:

defaults:
  - evaluation: starcoder2/human_eval.yaml

stages:
  - evaluation

Execute the launcher pipeline: python3 main.py
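
The evaluation config can also be selected at launch time with Hydra command-line overrides instead of editing conf/config.yaml (a sketch, assuming the launcher's standard Hydra override syntax):

python3 main.py evaluation=starcoder2/human_eval stages=[evaluation]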

Configuration

Default configurations for evaluation can be found in conf/evaluation/starcoder2/human_eval.yaml:

run:
  name: eval_${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  ntasks_per_node: 1
  convert_name: convert_nemo
  model_train_name: starcoder2
  task_name: "human_eval"  # HumanEval
  convert_dir: ${base_results_dir}/${.model_train_name}/${.convert_name}
  fine_tuning_dir: ${base_results_dir}/${.model_train_name}/${.task_name}
  results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}_evaluation
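
The convert_dir, fine_tuning_dir, and results_dir paths are all derived from base_results_dir through interpolation, so they can be redirected together by overriding the base directory at launch (a sketch; the path below is illustrative):

python3 main.py base_results_dir=/workspace/results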

task_name sets the evaluation task to execute. Currently, only HumanEval is supported.

model:
  model_type: nemo-StarCoder2
  nemo_model: null # path to the .nemo file produced by checkpoint conversion
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  precision: bf16 # must match training precision: 32, 16, or bf16
  eval_batch_size: 4
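
model_parallel_size is derived automatically as the product of the two parallelism settings; for example, tensor_model_parallel_size: 2 with pipeline_model_parallel_size: 2 resolves to a model_parallel_size of 4, so evaluation would require 4 GPUs for the model.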

nemo_model sets the path to the .nemo checkpoint on which to run evaluation.
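
The checkpoint path and other model settings can likewise be supplied as Hydra overrides at launch (a sketch; the .nemo path is illustrative, and the parallel sizes must match those used when the checkpoint was converted):

python3 main.py evaluation.model.nemo_model=/path/to/starcoder2.nemo evaluation.model.eval_batch_size=8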