BigCode Evaluation Type#

The BigCode Evaluation Harness is a framework for evaluating code generation models. Use this evaluation type to benchmark code generation tasks such as HumanEval, MBPP, and their variants.

Tip

For the full list of BigCode tasks, refer to the tasks reference.

Prerequisites#

Target Configuration#

BigCode evaluations require specific endpoint configurations depending on the task:

  • Completions endpoint: Required for humaneval, humanevalplus, and all multiple-* tasks

  • Chat endpoint: Required for humaneval_instruct and mbppplus_nemo

  • Either endpoint: Supported for mbpp and mbppplus

Completions endpoint target:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Chat endpoint target:

{
  "target": {
    "type": "model", 
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Important

The endpoint URL in your target configuration must match the model_type parameter in your evaluation configuration:

  • Completions endpoint (/completions) → "model_type": "completions"

  • Chat endpoint (/chat/completions) → "model_type": "chat"
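
For example, a target whose URL ends in /chat/completions must be paired with "model_type": "chat" in the evaluation parameters. A minimal sketch of the two settings that must agree, reusing the placeholder URL and model ID from above and trimming the evaluation configuration to the relevant fields:

Target (chat endpoint):

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Matching evaluation configuration (trimmed):

{
  "type": "mbpp",
  "name": "my-configuration-bigcode-mbpp-chat",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 512,
    "extra": {
      "model_type": "chat"
    }
  }
}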


Supported Tasks#

This implementation of the BigCode evaluation harness supports the following tasks:

Python Code Generation Tasks#

| Task | Description | Supported Endpoints |
|------|-------------|---------------------|
| humaneval | Original HumanEval task | completions only |
| humaneval_instruct | Instruction-following version | chat only |
| humanevalplus | Enhanced HumanEval with additional test cases | completions only |
| mbpp | Mostly Basic Python Problems | chat and completions |
| mbppplus | Enhanced MBPP with additional test cases | chat and completions |
| mbppplus_nemo | NeMo-specific MBPP variant | chat only |

Multi-language Code Generation Tasks#

| Task | Language | Supported Endpoints |
|------|----------|---------------------|
| multiple-cpp | C++ | completions only |
| multiple-cs | C# | completions only |
| multiple-d | D | completions only |
| multiple-go | Go | completions only |
| multiple-java | Java | completions only |
| multiple-jl | Julia | completions only |
| multiple-js | JavaScript | completions only |
| multiple-lua | Lua | completions only |
| multiple-php | PHP | completions only |
| multiple-pl | Perl | completions only |
| multiple-r | R | completions only |
| multiple-rb | Ruby | completions only |
| multiple-rkt | Racket | completions only |
| multiple-rs | Rust | completions only |
| multiple-scala | Scala | completions only |
| multiple-sh | Shell | completions only |
| multiple-swift | Swift | completions only |
| multiple-ts | TypeScript | completions only |

All multi-language tasks evaluate code generation in their respective programming languages and support only the completions endpoint.


Options#

HumanEval#

The HumanEval task evaluates a model’s ability to generate correct Python code for a set of programming problems. Each problem includes a function signature and a docstring, and the model must generate a correct implementation.

Example configuration:

{
    "type": "humaneval",
    "name": "my-configuration-bigcode-humaneval-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Example record (prompt, reference, and model output):

{
  "task_id": "HumanEval/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n    return a + b\n",
  "output": "def add(a, b):\n    return a + b\n"
}

Example results:

{
  "tasks": {
    "humaneval": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
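
HumanEval also has an instruction-following variant, humaneval_instruct, which is chat-only. A configuration sketch that assumes the same parameter set as the HumanEval example above; the name and namespace are illustrative:

{
    "type": "humaneval_instruct",
    "name": "my-configuration-bigcode-humaneval-instruct-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}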

MBPP#

The MBPP (Mostly Basic Python Problems) task evaluates a model’s ability to solve basic Python programming problems. Each problem includes a prompt and test cases.

Completions configuration:

{
    "type": "mbpp",
    "name": "my-configuration-bigcode-mbpp-completions",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Chat configuration:

{
    "type": "mbpp",
    "name": "my-configuration-bigcode-mbpp-chat",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}

Example record:

{
  "task_id": "MBPP/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n    return n % 2 == 0\n",
  "output": "def is_even(n):\n    return n % 2 == 0\n"
}

Example results:

{
  "tasks": {
    "mbpp": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

MBPP+#

The MBPP+ task is an enhanced version of MBPP with additional test cases for more comprehensive evaluation.

Completions configuration:

{
    "type": "mbppplus",
    "name": "my-configuration-bigcode-mbppplus-completions",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Chat configuration:

{
    "type": "mbppplus",
    "name": "my-configuration-bigcode-mbppplus-chat",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}

Example record:

{
  "task_id": "MBPP+/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n    return n % 2 == 0\n",
  "output": "def is_even(n):\n    return n % 2 == 0\n"
}

Example results:

{
  "tasks": {
    "mbppplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
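
The chat-only mbppplus_nemo variant follows the same shape. A configuration sketch with "model_type": "chat"; the name and namespace are illustrative:

{
    "type": "mbppplus_nemo",
    "name": "my-configuration-bigcode-mbppplus-nemo-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}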

HumanEval+#

The HumanEval+ task is an enhanced version of HumanEval with additional test cases to provide more robust evaluation.

Example configuration:

{
    "type": "humanevalplus",
    "name": "my-configuration-bigcode-humanevalplus-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Example record:

{
  "task_id": "HumanEval+/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n    return a + b\n",
  "output": "def add(a, b):\n    return a + b\n"
}

Example results:

{
  "tasks": {
    "humanevalplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Multi-language Code Generation#

The multi-language tasks evaluate code generation in various programming languages. All multi-language tasks use the completions model type.

Example configuration (Java):

{
    "type": "multiple-java",
    "name": "my-configuration-bigcode-java-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Example record:

{
  "task_id": "HumanEval-Java/0",
  "prompt": "public static int add(int a, int b) {\n",
  "reference": "public static int add(int a, int b) {\n    return a + b;\n}\n",
  "output": "public static int add(int a, int b) {\n    return a + b;\n}\n"
}

Example results:

{
  "tasks": {
    "multiple-java": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
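
Only the task type changes when you target another language. For example, a Rust evaluation (multiple-rs) reuses the same parameters; a sketch with an illustrative name:

{
    "type": "multiple-rs",
    "name": "my-configuration-bigcode-rust-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}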

Parameters#

BigCode Configuration Parameters#

| Parameter | Description | Required | Default |
|-----------|-------------|----------|---------|
| parallelism | Number of parallel processes for evaluation. | No | 1 |
| limit_samples | Limit the number of samples to evaluate (useful for testing). | No | All samples |
| max_tokens | Maximum number of tokens to generate. | Yes | — |
| temperature | Controls randomness in generation (0.0 = deterministic). | No | 1.0 |
| top_p | Nucleus sampling parameter. | No | 0.01 |
| stop | List of stop sequences to terminate generation. | No | [] |
| extra.batch_size | Batch size for generation. | No | 1 |
| extra.top_k | Top-k sampling parameter. | No | 1 |
| extra.model_type | Model endpoint type: "chat" or "completions". Required for most tasks. | Conditional | Auto-detected |
| extra.hf_token | Hugging Face token for accessing private models or datasets. | No | — |
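
The stop and extra.hf_token parameters do not appear in the task examples above. An illustrative combination follows; the token value is a placeholder, and the stop sequences are one common choice for cutting completions at the next top-level Python definition:

{
    "type": "humaneval",
    "name": "my-configuration-bigcode-humaneval-stop",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 512,
        "stop": ["\nclass ", "\ndef ", "\nif __name__"],
        "extra": {
            "model_type": "completions",
            "hf_token": "hf_your-token-here"
        }
    }
}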

Task-Specific Model Type Support#

Task Model Type Compatibility#

| Task | Supported Model Types | Notes |
|------|-----------------------|-------|
| humaneval | completions | Original HumanEval task |
| humaneval_instruct | chat | Instruction-following variant |
| humanevalplus | completions | Enhanced version with more tests |
| mbpp | chat, completions | Both formats supported |
| mbppplus | chat, completions | Both formats supported |
| mbppplus_nemo | chat | NeMo-specific variant |
| multiple-* (all languages) | completions | All 18 multi-language tasks |

Metrics#

BigCode Supported Metrics#

| Metric Name | Description | Value Range | Notes |
|-------------|-------------|-------------|-------|
| pass@k | Fraction of problems for which at least one of the model's k generated solutions passes all test cases. | 0.0 to 1.0 (1.0 means every problem was solved) | k is typically 1, 5, or 10; higher values of k measure performance when more samples are drawn per problem. Only pass@k is supported for BigCode tasks. |
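
pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). Assuming n samples are generated per problem, of which c pass all tests:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$

With k = 1 and a single sample per problem, this reduces to the fraction of problems whose one generated solution passes, matching the example results shown above.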