BigCode Evaluation Type#

BigCode Evaluation Harness is a framework for evaluating code generation models. Use this evaluation type to benchmark code generation tasks such as HumanEval, MBPP, and the Transcode translation tasks.

Tip

For the full list of BigCode tasks, refer to the BigCode Evaluation Harness tasks documentation.

Options#

HumanEval#

The HumanEval task evaluates a model’s ability to generate correct Python code for a set of programming problems. Each problem includes a function signature and a docstring, and the model must generate a correct implementation.

Example configuration:

{
    "type": "humaneval",
    "name": "my-configuration-bigcode-humaneval-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.0,
        "extra": {
            "batch_size": 1,
            "top_k": 1
        }
    }
}
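
The params block is shared across all BigCode task types. The gloss below reflects common evaluation-harness conventions; it is an interpretation, not a normative reference, so confirm field semantics against your platform documentation.

# Assumed meanings of the shared "params" fields (interpretations, not
# normative definitions; verify against your platform's reference docs).
params = {
    "parallelism": 1,     # concurrent requests sent to the model endpoint
    "limit_samples": 1,   # cap on the number of benchmark problems evaluated
    "max_tokens": 512,    # maximum tokens generated per completion
    "temperature": 1.0,   # sampling temperature
    "top_p": 0.0,         # nucleus-sampling threshold
    "extra": {
        "batch_size": 1,  # generations batched together per request
        "top_k": 1,       # top-k sampling cutoff
    },
}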

Example input and output:

{
  "task_id": "HumanEval/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n    return a + b\n",
  "output": "def add(a, b):\n    return a + b\n"
}

Example results:

{
  "tasks": {
    "humaneval": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
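
Scoring is based on functional correctness: the generated program is executed and run against the problem's unit tests. A minimal sketch of that check for the toy add record above (the test string is illustrative, not taken from the dataset):

def passes(program: str, test: str) -> bool:
    """Run a candidate program against one unit test.
    Real harnesses execute this in a sandboxed subprocess with a timeout."""
    namespace = {}
    try:
        exec(program, namespace)  # define the candidate function
        exec(test, namespace)     # raises AssertionError on failure
        return True
    except Exception:
        return False

output = "def add(a, b):\n    return a + b\n"
print(passes(output, "assert add(2, 3) == 5"))  # True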

MBPP#

The MBPP (Mostly Basic Python Problems) task evaluates a model’s ability to solve basic Python programming problems. Each problem includes a prompt and test cases.

Example configuration:

{
    "type": "mbpp",
    "name": "my-configuration-bigcode-mbpp-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.0,
        "extra": {
            "batch_size": 1,
            "top_k": 1
        }
    }
}

Example input and output:

{
  "task_id": "MBPP/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n    return n % 2 == 0\n",
  "output": "def is_even(n):\n    return n % 2 == 0\n"
}

Example results:

{
  "tasks": {
    "mbpp": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
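
MBPP-style problems typically ship a list of assert statements as their test cases, all of which must pass. A short sketch reusing the toy is_even record above (the tests are illustrative):

program = "def is_even(n):\n    return n % 2 == 0\n"
tests = [
    "assert is_even(2) == True",
    "assert is_even(3) == False",
]

namespace = {}
exec(program, namespace)    # define the candidate function
try:
    for t in tests:
        exec(t, namespace)  # a failing assert raises AssertionError
    passed = True
except AssertionError:
    passed = False
print(passed)  # True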

Transcode (C++)#

The Transcode (C++) task evaluates a model’s ability to translate Python code to C++. Each problem includes a Python function and the expected C++ translation.

Example configuration:

{
    "type": "transcode_cpp",
    "name": "my-configuration-bigcode-transcode-cpp-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 1024,
        "temperature": 1.0,
        "top_p": 0.0,
        "extra": {
            "batch_size": 1,
            "top_k": 1
        }
    }
}

Example input and output:

{
  "task_id": "Transcode/0",
  "prompt": "def add(a, b):\n    return a + b\n",
  "reference": "int add(int a, int b) { return a + b; }\n",
  "output": "int add(int a, int b) { return a + b; }\n"
}

Example results:

{
  "tasks": {
    "transcode_cpp": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
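
Checking a translation requires compiling and running the generated C++. A rough sketch of that flow, under the assumptions that g++ is on PATH and that the wrapping test main is illustrative rather than taken from the dataset:

import os
import subprocess
import tempfile

generated = "int add(int a, int b) { return a + b; }\n"

# Wrap the generated function in a tiny test harness.
source = "#include <cassert>\n" + generated + \
    "int main() { assert(add(2, 3) == 5); return 0; }\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "candidate.cpp")
    exe = os.path.join(tmp, "candidate")
    with open(src, "w") as f:
        f.write(source)
    # A nonzero return code from either step means a compile or test failure.
    compiled = subprocess.run(["g++", "-o", exe, src]).returncode == 0
    passed = compiled and subprocess.run([exe]).returncode == 0

print("pass" if passed else "fail")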

Transcode (Java)#

The Transcode (Java) task evaluates a model’s ability to translate Python code to Java. Each problem includes a Python function and the expected Java translation.

Example configuration:

{
    "type": "transcode_java",
    "name": "my-configuration-bigcode-transcode-java-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 1024,
        "temperature": 1.0,
        "top_p": 0.0,
        "extra": {
            "batch_size": 1,
            "top_k": 1
        }
    }
}

Example input and output:

{
  "task_id": "Transcode/1",
  "prompt": "def add(a, b):\n    return a + b\n",
  "reference": "public int add(int a, int b) { return a + b; }\n",
  "output": "public int add(int a, int b) { return a + b; }\n"
}

Example results:

{
  "tasks": {
    "transcode_java": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
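
All four task types report results in the same nested shape. A small helper (hypothetical, but matching the structure shown above) to pull a score out of a results payload:

import json

def get_score(results_json: str, task: str, metric: str = "pass@1") -> float:
    """Extract a metric value from a BigCode results payload."""
    results = json.loads(results_json)
    return results["tasks"][task]["metrics"][metric]["scores"][metric]["value"]

payload = '{"tasks": {"transcode_java": {"metrics": {"pass@1": ' \
          '{"scores": {"pass@1": {"value": 1.0}}}}}}}'
print(get_score(payload, "transcode_java"))  # 1.0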

Metrics#

BigCode Supported Metrics#

Metric Name: pass@k
Description: Fraction of problems for which at least one of the model's k generated solutions passes all test cases.
Value Range: 0.0 to 1.0, where 1.0 means every problem was solved.
Notes: k is typically 1, 5, or 10; higher k values measure performance with more sampling. pass@k is the only metric supported for BigCode tasks.
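
pass@k is conventionally estimated with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n >= k samples per problem, count the c that pass, and average 1 - C(n-c, k) / C(n, k) over problems. A standard implementation of that per-problem estimator:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    # Numerically stable product form of 1 - C(n - c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With k=1 this reduces to c / n, the fraction of passing samples.
print(pass_at_k(10, 3, 1))  # 0.3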