BigCode Evaluation Type#

The BigCode Evaluation Harness is a framework for evaluating code generation models. Use this evaluation type to benchmark code generation tasks such as HumanEval, MBPP, and their variants.

Tip

For the full list of BigCode tasks, refer to the tasks reference.

Prerequisites#

Target Configuration#

BigCode evaluations require specific endpoint configurations depending on the task:

  • Completions endpoint: Required for humaneval, humanevalplus, and all multiple-* tasks

  • Chat endpoint: Required for humaneval_instruct and mbppplus_nemo

  • Either endpoint: Supported for mbpp and mbppplus

Completions endpoint target:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Chat endpoint target:

{
  "target": {
    "type": "model", 
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Important

The endpoint URL in your target configuration must match the model_type parameter in your evaluation configuration:

  • Completions endpoint (/completions) → "model_type": "completions"

  • Chat endpoint (/chat/completions) → "model_type": "chat"
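
For example, a target whose URL ends in /chat/completions must be paired with "model_type": "chat" in the evaluation parameters. A minimal sketch of the two settings that must agree, reusing the placeholder URL and model ID from above and trimming the evaluation configuration to the relevant fields:

Target (chat endpoint):

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://nim.int.aire.nvidia.com/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Matching evaluation configuration (trimmed):

{
  "type": "mbpp",
  "name": "my-configuration-bigcode-mbpp-chat",
  "namespace": "my-organization",
  "params": {
    "max_tokens": 512,
    "extra": {
      "model_type": "chat"
    }
  }
}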


Supported Tasks#

This implementation of the BigCode evaluation harness supports the following tasks:

Python Code Generation Tasks#

| Task | Description | Supported Endpoints |
|------|-------------|---------------------|
| humaneval | Original HumanEval task | completions only |
| humaneval_instruct | Instruction-following version | chat only |
| humanevalplus | Enhanced HumanEval with additional test cases | completions only |
| mbpp | Mostly Basic Python Problems | chat and completions |
| mbppplus | Enhanced MBPP with additional test cases | chat and completions |
| mbppplus_nemo | NeMo-specific MBPP variant | chat only |

Multi-language Code Generation Tasks#

| Task | Language | Supported Endpoints |
|------|----------|---------------------|
| multiple-cpp | C++ | completions only |
| multiple-cs | C# | completions only |
| multiple-d | D | completions only |
| multiple-go | Go | completions only |
| multiple-java | Java | completions only |
| multiple-jl | Julia | completions only |
| multiple-js | JavaScript | completions only |
| multiple-lua | Lua | completions only |
| multiple-php | PHP | completions only |
| multiple-pl | Perl | completions only |
| multiple-r | R | completions only |
| multiple-rb | Ruby | completions only |
| multiple-rkt | Racket | completions only |
| multiple-rs | Rust | completions only |
| multiple-scala | Scala | completions only |
| multiple-sh | Shell | completions only |
| multiple-swift | Swift | completions only |
| multiple-ts | TypeScript | completions only |

All multi-language tasks evaluate code generation in their respective programming languages and support only the completions endpoint.


Options#

HumanEval#

The HumanEval task evaluates a model’s ability to generate correct Python code for a set of programming problems. Each problem includes a function signature and a docstring, and the model must generate a correct implementation.

Example configuration:

{
    "type": "humaneval",
    "name": "my-configuration-bigcode-humaneval-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Example record (prompt, reference, and model output):

{
  "task_id": "HumanEval/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n    return a + b\n",
  "output": "def add(a, b):\n    return a + b\n"
}

Example results:

{
  "tasks": {
    "humaneval": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
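
HumanEval also has an instruction-following variant, humaneval_instruct, which is chat-only. A configuration sketch that assumes the same parameter set as the HumanEval example above; the name and namespace are illustrative:

{
    "type": "humaneval_instruct",
    "name": "my-configuration-bigcode-humaneval-instruct-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}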

MBPP#

The MBPP (Mostly Basic Python Problems) task evaluates a model’s ability to solve basic Python programming problems. Each problem includes a prompt and test cases.

Completions configuration:

{
    "type": "mbpp",
    "name": "my-configuration-bigcode-mbpp-completions",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Chat configuration:

{
    "type": "mbpp",
    "name": "my-configuration-bigcode-mbpp-chat",
    "namespace": "my-organization",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}

Example record:

{
  "task_id": "MBPP/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n    return n % 2 == 0\n",
  "output": "def is_even(n):\n    return n % 2 == 0\n"
}

Example results:

{
  "tasks": {
    "mbpp": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

MBPP+#

The MBPP+ task is an enhanced version of MBPP with additional test cases for more comprehensive evaluation.

Completions configuration:

{
    "type": "mbppplus",
    "name": "my-configuration-bigcode-mbppplus-completions",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Chat configuration:

{
    "type": "mbppplus",
    "name": "my-configuration-bigcode-mbppplus-chat",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}

Example record:

{
  "task_id": "MBPP+/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n    return n % 2 == 0\n",
  "output": "def is_even(n):\n    return n % 2 == 0\n"
}

Example results:

{
  "tasks": {
    "mbppplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
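
The chat-only mbppplus_nemo variant follows the same shape. A configuration sketch with "model_type": "chat"; the name and namespace are illustrative:

{
    "type": "mbppplus_nemo",
    "name": "my-configuration-bigcode-mbppplus-nemo-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "chat"
        }
    }
}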

HumanEval+#

The HumanEval+ task is an enhanced version of HumanEval with additional test cases to provide more robust evaluation.

Example configuration:

{
    "type": "humanevalplus",
    "name": "my-configuration-bigcode-humanevalplus-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Example record:

{
  "task_id": "HumanEval+/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n    return a + b\n",
  "output": "def add(a, b):\n    return a + b\n"
}

Example results:

{
  "tasks": {
    "humanevalplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Multi-language Code Generation#

The multi-language tasks evaluate code generation in various programming languages. All multi-language tasks use the completions model type.

Example configuration (Java):

{
    "type": "multiple-java",
    "name": "my-configuration-bigcode-java-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}

Example record:

{
  "task_id": "HumanEval-Java/0",
  "prompt": "public static int add(int a, int b) {\n",
  "reference": "public static int add(int a, int b) {\n    return a + b;\n}\n",
  "output": "public static int add(int a, int b) {\n    return a + b;\n}\n"
}

Example results:

{
  "tasks": {
    "multiple-java": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
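
Only the task type changes when you target another language. For example, a Rust evaluation (multiple-rs) reuses the same parameters; a sketch with an illustrative name:

{
    "type": "multiple-rs",
    "name": "my-configuration-bigcode-rust-1",
    "namespace": "my-organization",
    "params": {
        "parallelism": 1,
        "limit_samples": 1,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions"
        }
    }
}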

Parameters#

BigCode Configuration Parameters#

| Parameter | Description | Required | Default |
|-----------|-------------|----------|---------|
| parallelism | Number of parallel processes for evaluation. | No | 1 |
| limit_samples | Limit the number of samples to evaluate (useful for testing). | No | All samples |
| max_tokens | Maximum number of tokens to generate. | Yes | — |
| temperature | Controls randomness in generation (0.0 = deterministic). | No | 1.0 |
| top_p | Nucleus sampling parameter. | No | 0.01 |
| stop | List of stop sequences to terminate generation. | No | [] |
| extra.batch_size | Batch size for generation. | No | 1 |
| extra.top_k | Top-k sampling parameter. | No | 1 |
| extra.model_type | Model endpoint type: "chat" or "completions". Required for most tasks. | Conditional | Auto-detected |
| extra.hf_token | Hugging Face token for accessing private models or datasets. | No | — |
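
The stop and extra.hf_token parameters do not appear in the task examples above. An illustrative combination follows; the token value is a placeholder, and the stop sequences are one common choice for cutting completions at the next top-level Python definition:

{
    "type": "humaneval",
    "name": "my-configuration-bigcode-humaneval-stop",
    "namespace": "my-organization",
    "params": {
        "max_tokens": 512,
        "stop": ["\nclass ", "\ndef ", "\nif __name__"],
        "extra": {
            "model_type": "completions",
            "hf_token": "hf_your-token-here"
        }
    }
}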

Task-Specific Model Type Support#

Task Model Type Compatibility#

| Task | Supported Model Types | Notes |
|------|-----------------------|-------|
| humaneval | completions | Original HumanEval task |
| humaneval_instruct | chat | Instruction-following variant |
| humanevalplus | completions | Enhanced version with more tests |
| mbpp | chat, completions | Both formats supported |
| mbppplus | chat, completions | Both formats supported |
| mbppplus_nemo | chat | NeMo-specific variant |
| multiple-* (all languages) | completions | All 18 multi-language tasks |

Metrics#

BigCode Supported Metrics#

| Metric Name | Description | Value Range | Notes |
|-------------|-------------|-------------|-------|
| pass@k | Fraction of problems for which at least one of the model's k generated solutions passes all test cases. | 0.0 to 1.0 (1.0 means every problem was solved) | k is typically 1, 5, or 10; higher values of k measure performance when more samples are drawn per problem. Only pass@k is supported for BigCode tasks. |
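
pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). Assuming n samples are generated per problem, of which c pass all tests:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$

With k = 1 and a single sample per problem, this reduces to the fraction of problems whose one generated solution passes, matching the example results shown above.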