BigCode Evaluation Type#
The BigCode Evaluation Harness is a framework for evaluating code generation models. Use this evaluation type to benchmark models on code generation tasks such as HumanEval and MBPP.
Tip
For the full list of BigCode tasks, refer to tasks.
Prerequisites#
Set up or select an existing evaluation target.
Target Configuration#
BigCode evaluations require specific endpoint configurations depending on the task:
- Completions endpoint: Required for humaneval, humanevalplus, and all multiple-* tasks
- Chat endpoint: Required for humaneval_instruct and mbppplus_nemo
- Either endpoint: Supported for mbpp and mbppplus
Completions endpoint:
{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}
Chat endpoint:
{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}
Important
The endpoint URL in your target configuration must match the model_type parameter in your evaluation configuration:
- Completions endpoint (/completions) → "model_type": "completions"
- Chat endpoint (/chat/completions) → "model_type": "chat"
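The matching rule above can be checked before a job is submitted. The following is a minimal sketch, not part of any official client: the helper names `infer_model_type` and `check_target_matches_config` are hypothetical, and the logic only inspects the URL path suffix described in the note.

```python
from urllib.parse import urlparse


def infer_model_type(url: str) -> str:
    """Map an endpoint URL to the model_type it implies (hypothetical helper)."""
    path = urlparse(url).path.rstrip("/")
    # Check the chat suffix first: "/chat/completions" also ends with "/completions".
    if path.endswith("/chat/completions"):
        return "chat"
    if path.endswith("/completions"):
        return "completions"
    raise ValueError(f"Cannot infer model_type from endpoint URL: {url}")


def check_target_matches_config(target: dict, config: dict) -> None:
    """Raise ValueError if the target URL and the config's model_type disagree."""
    url = target["target"]["model"]["api_endpoint"]["url"]
    expected = infer_model_type(url)
    actual = config["params"]["extra"]["model_type"]
    if actual != expected:
        raise ValueError(f"model_type is {actual!r}, but {url} implies {expected!r}")
```

Running the check on a target and evaluation configuration pair surfaces a mismatch locally instead of as a failed evaluation job.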
Supported Tasks#
The BigCode evaluation harness supports the following tasks in this implementation:
| Task | Description | Supported Endpoints |
|---|---|---|
| humaneval | Original HumanEval task | completions only |
| humaneval_instruct | Instruction-following version | chat only |
| humanevalplus | Enhanced HumanEval with additional test cases | completions only |
| mbpp | Mostly Basic Python Problems | chat and completions |
| mbppplus | Enhanced MBPP with additional test cases | chat and completions |
| mbppplus_nemo | NEMO-specific MBPP variant | chat only |


| Task | Language | Supported Endpoints |
|---|---|---|
| multiple-cpp | C++ | completions only |
| multiple-cs | C# | completions only |
| multiple-d | D | completions only |
| multiple-go | Go | completions only |
| multiple-java | Java | completions only |
| multiple-jl | Julia | completions only |
| multiple-js | JavaScript | completions only |
| multiple-lua | Lua | completions only |
| multiple-php | PHP | completions only |
| multiple-pl | Perl | completions only |
| multiple-r | R | completions only |
| multiple-rb | Ruby | completions only |
| multiple-rkt | Racket | completions only |
| multiple-rs | Rust | completions only |
| multiple-scala | Scala | completions only |
| multiple-sh | Shell | completions only |
| multiple-swift | Swift | completions only |
| multiple-ts | TypeScript | completions only |
All multi-language tasks evaluate code generation capabilities in their respective programming languages and only support the completions endpoint.
Options#
HumanEval#
The HumanEval task evaluates a model’s ability to generate correct Python code for a set of programming problems. Each problem includes a function signature and a docstring, and the model must generate a correct implementation.
Example configuration:
{
  "type": "humaneval",
  "name": "my-configuration-bigcode-humaneval-1",
  "namespace": "my-organization",
  "params": {
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 10,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}
Example sample record:
{
  "task_id": "HumanEval/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n return a + b\n",
  "output": "def add(a, b):\n return a + b\n"
}
Example results:
{
  "tasks": {
    "humaneval": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
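The results payload nests each score several levels deep. As an illustration, the sketch below pulls a single value out of that structure. It assumes the layout shown in the HumanEval results example; `get_score` is a hypothetical helper, not an API of the evaluation service.

```python
import json

# Results payload copied from the HumanEval example in this section.
RESULTS_JSON = """
{
  "tasks": {
    "humaneval": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {"value": 1.0}
          }
        }
      }
    }
  }
}
"""


def get_score(results: dict, task: str, metric: str) -> float:
    """Return tasks -> <task> -> metrics -> <metric> -> scores -> <metric> -> value."""
    return results["tasks"][task]["metrics"][metric]["scores"][metric]["value"]


results = json.loads(RESULTS_JSON)
score = get_score(results, "humaneval", "pass@1")
print(f"humaneval pass@1 = {score}")
```

The same helper works for the other tasks on this page by changing the task key, for example "mbpp" or "multiple-java".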
MBPP#
The MBPP (Mostly Basic Python Problems) task evaluates a model’s ability to solve basic Python programming problems. Each problem includes a prompt and test cases.
Example configuration (completions):
{
  "type": "mbpp",
  "name": "my-configuration-bigcode-mbpp-completions",
  "namespace": "my-organization",
  "params": {
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 10,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}
Example configuration (chat):
{
  "type": "mbpp",
  "name": "my-configuration-bigcode-mbpp-chat",
  "namespace": "my-organization",
  "params": {
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 10,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "chat"
    }
  }
}
Example sample record:
{
  "task_id": "MBPP/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n return n % 2 == 0\n",
  "output": "def is_even(n):\n return n % 2 == 0\n"
}
Example results:
{
  "tasks": {
    "mbpp": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
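MBPP problems come with test cases, and a generated solution counts as solved only if it passes them. The sketch below illustrates that idea in its simplest form. It is not the harness's implementation: `passes_tests` is a hypothetical helper, the assert-style test strings are an assumption about the test format, and a real evaluation would be expected to run candidates in an isolated process with timeouts rather than with a bare `exec`.

```python
def passes_tests(candidate_code: str, test_cases: list[str]) -> bool:
    """Return True if the candidate defines code that satisfies every test.

    Illustration only: never exec untrusted model output outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the generated function
        for test in test_cases:
            exec(test, namespace)         # each test is an assert statement
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False
    return True


candidate = "def is_even(n):\n    return n % 2 == 0\n"
tests = ["assert is_even(2)", "assert not is_even(3)"]
print(passes_tests(candidate, tests))
```

A problem contributes to pass@1 when its single generated sample passes all of its tests in this sense.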
MBPP+#
The MBPP+ task is an enhanced version of MBPP with additional test cases for more comprehensive evaluation.
Example configuration (completions):
{
  "type": "mbppplus",
  "name": "my-configuration-bigcode-mbppplus-completions",
  "namespace": "my-organization",
  "params": {
    "parallelism": 1,
    "limit_samples": 1,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}
Example configuration (chat):
{
  "type": "mbppplus",
  "name": "my-configuration-bigcode-mbppplus-chat",
  "namespace": "my-organization",
  "params": {
    "parallelism": 1,
    "limit_samples": 1,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "chat"
    }
  }
}
Example sample record:
{
  "task_id": "MBPP+/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n return n % 2 == 0\n",
  "output": "def is_even(n):\n return n % 2 == 0\n"
}
Example results:
{
  "tasks": {
    "mbppplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
HumanEval+#
The HumanEval+ task is an enhanced version of HumanEval with additional test cases to provide more robust evaluation.
Example configuration:
{
  "type": "humanevalplus",
  "name": "my-configuration-bigcode-humanevalplus-1",
  "namespace": "my-organization",
  "params": {
    "parallelism": 1,
    "limit_samples": 1,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}
Example sample record:
{
  "task_id": "HumanEval+/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n return a + b\n",
  "output": "def add(a, b):\n return a + b\n"
}
Example results:
{
  "tasks": {
    "humanevalplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
Multi-language Code Generation#
The multi-language tasks evaluate code generation in various programming languages. All multi-language tasks use the completions model type.
Example configuration:
{
  "type": "multiple-java",
  "name": "my-configuration-bigcode-java-1",
  "namespace": "my-organization",
  "params": {
    "parallelism": 1,
    "limit_samples": 1,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}
Example sample record:
{
  "task_id": "HumanEval-Java/0",
  "prompt": "public static int add(int a, int b) {\n",
  "reference": "public static int add(int a, int b) {\n return a + b;\n}\n",
  "output": "public static int add(int a, int b) {\n return a + b;\n}\n"
}
Example results:
{
  "tasks": {
    "multiple-java": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
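Because every multi-language task shares the same shape and always uses the completions model type, configurations for several languages can be generated from one template. The sketch below mirrors the multiple-java example. `make_multiple_config` is a hypothetical helper, and any suffix other than `java` is assumed to follow the same multiple-* naming pattern.

```python
import json


def make_multiple_config(suffix: str, namespace: str = "my-organization") -> dict:
    """Build a configuration for the multiple-<suffix> task (hypothetical helper)."""
    return {
        "type": f"multiple-{suffix}",
        "name": f"my-configuration-bigcode-{suffix}-1",
        "namespace": namespace,
        "params": {
            "parallelism": 1,
            "limit_samples": 1,
            "max_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.01,
            "extra": {
                "batch_size": 1,
                "top_k": 1,
                # Multi-language tasks only support the completions endpoint.
                "model_type": "completions",
            },
        },
    }


java_config = make_multiple_config("java")
print(json.dumps(java_config, indent=2))
```

Calling the helper with `"java"` reproduces the configuration shown in this section.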
Parameters#
| Parameter | Description | Required | Default |
|---|---|---|---|
| parallelism | Number of parallel processes for evaluation. | No | 1 |
| limit_samples | Limit the number of samples to evaluate (useful for testing). | No | All samples |
| max_tokens | Maximum number of tokens to generate. | Yes | (none) |
| temperature | Controls randomness in generation (0.0 = deterministic). | No | 1.0 |
| top_p | Nucleus sampling parameter. | No | 0.01 |
| | List of stop sequences to terminate generation. | No | [] |
| extra.batch_size | Batch size for generation. | No | 1 |
| extra.top_k | Top-k sampling parameter. | No | 1 |
| extra.model_type | Model endpoint type: "chat" or "completions". Required for most tasks. | Conditional | Auto-detected |
| | HuggingFace token for accessing private models or datasets. | No | (none) |
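The required and default values listed above can be made explicit when assembling a `params` object by hand. The following is a minimal sketch under that reading of the table: `build_params` is a hypothetical helper, and the service itself is assumed to apply the same defaults when a field is omitted.

```python
# Defaults as listed in the parameters table; limit_samples is left out,
# because omitting it means "all samples".
DEFAULT_PARAMS = {"parallelism": 1, "temperature": 1.0, "top_p": 0.01}
DEFAULT_EXTRA = {"batch_size": 1, "top_k": 1}


def build_params(max_tokens=None, extra=None, **overrides) -> dict:
    """Merge user values over the documented defaults (hypothetical helper)."""
    if max_tokens is None:
        # max_tokens is the only parameter marked as required.
        raise ValueError("max_tokens is required")
    params = {**DEFAULT_PARAMS, **overrides, "max_tokens": max_tokens}
    params["extra"] = {**DEFAULT_EXTRA, **(extra or {})}
    return params


params = build_params(max_tokens=512, limit_samples=10, extra={"model_type": "chat"})
print(params)
```

Setting `model_type` explicitly, as in the usage line, avoids relying on auto-detection for tasks that accept only one endpoint type.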
Task-Specific Model Type Support#
| Task | Supported Model Types | Notes |
|---|---|---|
| humaneval | completions | Original HumanEval task |
| humaneval_instruct | chat | Instruction-following variant |
| humanevalplus | completions | Enhanced version with more tests |
| mbpp | chat, completions | Both formats supported |
| mbppplus | chat, completions | Both formats supported |
| mbppplus_nemo | chat | NEMO-specific variant |
| multiple-* | completions | All 18 multi-language tasks |
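The task-to-model-type rules on this page can be encoded as a small lookup and checked before a configuration is submitted. This is a minimal sketch built from the endpoint requirements stated in Target Configuration. The names `SUPPORTED_MODEL_TYPES`, `supported_model_types`, and `is_supported` are hypothetical.

```python
SUPPORTED_MODEL_TYPES = {
    "humaneval": {"completions"},
    "humaneval_instruct": {"chat"},
    "humanevalplus": {"completions"},
    "mbpp": {"chat", "completions"},
    "mbppplus": {"chat", "completions"},
    "mbppplus_nemo": {"chat"},
}


def supported_model_types(task: str) -> set[str]:
    """Return the model types a task accepts; raise ValueError if unknown."""
    if task.startswith("multiple-"):
        # All multi-language tasks only support the completions endpoint.
        return {"completions"}
    try:
        return SUPPORTED_MODEL_TYPES[task]
    except KeyError:
        raise ValueError(f"Unknown BigCode task: {task}") from None


def is_supported(task: str, model_type: str) -> bool:
    """True if the given model_type is valid for the task."""
    return model_type in supported_model_types(task)


print(is_supported("mbpp", "chat"))
print(is_supported("humaneval", "chat"))
```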
Metrics#
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| pass@k | Fraction of problems for which at least one of the model's k generated samples passes all unit tests. | 0.0 to 1.0 (where 1.0 means all problems were solved correctly) | |
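A pass@k score of the kind reported in the results on this page is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass. The sketch below implements that standard estimator. Whether this evaluation type computes its scores in exactly this way is an assumption, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k drawn samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# With one sample per problem, pass@1 is simply the fraction of problems solved.
print(mean_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1))
```

With a single sample per problem (n = 1, k = 1), the estimator reduces to the plain fraction of problems whose sample passes.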