BigCode Evaluations#
BigCode Evaluation Harness is a framework for the evaluation of code generation models. Use this evaluation type to benchmark code generation tasks such as HumanEval, MBPP, and others.
Tip
Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.
Tip
For the full list of BigCode tasks, refer to tasks.
Target Configuration#
BigCode evaluations require specific endpoint configurations depending on the task:
- Completions endpoint: Required for `humaneval`, `humanevalplus`, and all `multiple-*` tasks
- Chat endpoint: Required for `humaneval_instruct` and `mbppplus_nemo`
- Either endpoint: Supported for `mbpp` and `mbppplus`
Completions endpoint:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}

Chat endpoint:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "url": "https://<nim-base-url>/v1/chat/completions",
        "model_id": "meta/llama-3.3-70b-instruct"
      }
    }
  }
}
Important
The endpoint URL in your target configuration must match the model_type parameter in your evaluation configuration:
- Completions endpoint (`/v1/completions`) → `"model_type": "completions"`
- Chat endpoint (`/v1/chat/completions`) → `"model_type": "chat"`
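For example, a matched pair might look like the following minimal sketch (the task and endpoint values are illustrative; the field names follow the target and configuration formats shown on this page):

# Illustrative pairing: a chat endpoint target with a matching "chat" model_type.
target = {
    "type": "model",
    "model": {
        "api_endpoint": {
            "url": "https://<nim-base-url>/v1/chat/completions",  # chat endpoint
            "model_id": "meta/llama-3.3-70b-instruct"
        }
    }
}

config = {
    "type": "humaneval_instruct",  # a chat-only task
    "params": {
        "max_tokens": 512,
        "extra": {"model_type": "chat"}  # must match the /v1/chat/completions URL above
    }
}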
Example Job Execution#
You can execute an Evaluation Job using either the Python SDK or cURL as follows, replacing <my-eval-config> with one of the task configurations shown on this page:
Note
See Job Target and Configuration Matrix for details on target / config compatibility.
Python SDK (v2 API):

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)

job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "name": "my-target-dataset-1",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    # Replace NIM_BASE_URL with your specific deployment
                    "url": f"{NIM_BASE_URL}/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            }
        },
        "config": <my-eval-config>
    }
)
curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"spec": {
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}
}'
Python SDK (v1 API):

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)

job = client.evaluation.jobs.create(
    namespace="my-organization",
    target={
        "type": "model",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                # Replace NIM_BASE_URL with your specific deployment
                "url": f"{NIM_BASE_URL}/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
            }
        }
    },
    config=<my-eval-config>
)
curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": {
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}'
For a full example, see Run an Academic LM Harness Eval.
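As a concrete illustration, here is one way the HumanEval configuration from this page could be plugged into the v2 SDK call above. This is a minimal sketch: the endpoint URL and model ID are placeholders, and the configuration values mirror the HumanEval example later on this page.

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)

# HumanEval runs against a completions endpoint, so model_type is "completions".
humaneval_config = {
    "type": "humaneval",
    "params": {
        "parallelism": 10,
        "request_timeout": 300,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "extra": {"batch_size": 1, "top_k": 1, "model_type": "completions"}
    }
}

job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "name": "my-target-dataset-1",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    # Completions endpoint to match model_type above.
                    "url": "https://<nim-base-url>/v1/completions",
                    "model_id": "meta/llama-3.3-70b-instruct"
                }
            }
        },
        "config": humaneval_config
    }
)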
Supported Tasks#
The BigCode evaluation harness supports the following tasks in this implementation:
| Task | Description | Supported Endpoints / Model Types |
|---|---|---|
| `humaneval` | Original HumanEval task | completions only |
| `humaneval_instruct` | Instruction-following version | chat only |
| `humanevalplus` | Enhanced HumanEval with additional test cases | completions only |
| `mbpp` | Mostly Basic Python Problems | chat and completions |
| `mbppplus` | Enhanced MBPP with additional test cases | chat and completions |
| `mbppplus_nemo` | NeMo-specific MBPP variant | chat only |
Endpoints map directly to extra.model_type in your evaluation configuration: /v1/completions → completions; /v1/chat/completions → chat.
HumanEval#
The HumanEval task evaluates a model’s ability to generate correct Python code for a set of programming problems. Each problem includes a function signature and a docstring, and the model must generate a correct implementation.
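To make the task format concrete, a HumanEval-style problem looks roughly like the following illustrative sketch (not an actual dataset item): the model receives the signature and docstring and must produce the body, which is then checked against the task's unit tests.

# Illustrative HumanEval-style problem (hypothetical, for orientation only).
def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in the given string,
    ignoring case."""
    # --- model-generated completion starts here ---
    return sum(1 for ch in text.lower() if ch in "aeiou")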
Example configuration (completions endpoint):

{
  "type": "humaneval",
  "params": {
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 10,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}

Example sample (prompt, reference, and model output):

{
  "task_id": "HumanEval/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n return a + b\n",
  "output": "def add(a, b):\n return a + b\n"
}

Example results:

{
  "tasks": {
    "humaneval": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
HumanEval+#
The HumanEval+ task is an enhanced version of HumanEval with additional test cases to provide more robust evaluation.
Example configuration (completions endpoint):

{
  "type": "humanevalplus",
  "params": {
    "parallelism": 1,
    "limit_samples": 1,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}

Example sample (prompt, reference, and model output):

{
  "task_id": "HumanEval+/0",
  "prompt": "def add(a, b):\n",
  "reference": "def add(a, b):\n return a + b\n",
  "output": "def add(a, b):\n return a + b\n"
}

Example results:

{
  "tasks": {
    "humanevalplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
MBPP#
The MBPP (Mostly Basic Python Problems) task evaluates a model’s ability to solve basic Python programming problems. Each problem includes a prompt and test cases.
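For orientation, an MBPP-style problem pairs a short natural-language prompt with assert-based test cases, roughly like the following illustrative sketch (not an actual dataset item):

# Illustrative MBPP-style problem (hypothetical, for orientation only).
# Prompt: "Write a function to check whether a number is even."
def is_even(n):
    return n % 2 == 0

# Test cases the generated solution must pass.
assert is_even(2)
assert not is_even(7)
assert is_even(0)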
Example configuration (completions endpoint):

{
  "type": "mbpp",
  "params": {
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 10,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}

Example configuration (chat endpoint):

{
  "type": "mbpp",
  "params": {
    "parallelism": 10,
    "request_timeout": 300,
    "limit_samples": 10,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "chat"
    }
  }
}

Example sample (prompt, reference, and model output):

{
  "task_id": "MBPP/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n return n % 2 == 0\n",
  "output": "def is_even(n):\n return n % 2 == 0\n"
}

Example results:

{
  "tasks": {
    "mbpp": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
MBPP+#
The MBPP+ task is an enhanced version of MBPP with additional test cases for more comprehensive evaluation.
Example configuration (completions endpoint):

{
  "type": "mbppplus",
  "params": {
    "parallelism": 1,
    "limit_samples": 1,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "completions"
    }
  }
}

Example configuration (chat endpoint):

{
  "type": "mbppplus",
  "params": {
    "parallelism": 1,
    "limit_samples": 1,
    "max_tokens": 512,
    "temperature": 1.0,
    "top_p": 0.01,
    "extra": {
      "batch_size": 1,
      "top_k": 1,
      "model_type": "chat"
    }
  }
}

Example sample (prompt, reference, and model output):

{
  "task_id": "MBPP+/0",
  "prompt": "def is_even(n):\n",
  "reference": "def is_even(n):\n return n % 2 == 0\n",
  "output": "def is_even(n):\n return n % 2 == 0\n"
}

Example results:

{
  "tasks": {
    "mbppplus": {
      "metrics": {
        "pass@1": {
          "scores": {
            "pass@1": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
Parameters#
| Parameter | Description | Required | Default |
|---|---|---|---|
| `parallelism` | Number of parallel processes for evaluation. | No | 1 |
| `limit_samples` | Limit the number of samples to evaluate (useful for testing). | No | All samples |
| `max_tokens` | Maximum number of tokens to generate. | Yes | — |
| `temperature` | Controls randomness in generation (0.0 = deterministic). | No | 1.0 |
| `top_p` | Nucleus sampling parameter. | No | 0.01 |
| `stop` | List of stop sequences to terminate generation. | No | [] |
| `extra.batch_size` | Batch size for generation. | No | 1 |
| `extra.top_k` | Top-k sampling parameter. | No | 1 |
| `extra.model_type` | Model endpoint type: "chat" or "completions". Required for most tasks. | Conditional | Auto-detected |
| `extra.hf_token` | HuggingFace token for accessing private models or datasets. | No | — |
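Putting the table together, a complete params block might look like the sketch below. The parameter names follow the table above; the placement of stop at the top level and hf_token under extra, as well as the specific stop sequences, are assumptions for illustration.

# Hypothetical params block combining the documented parameters.
config = {
    "type": "mbpp",
    "params": {
        "parallelism": 10,
        "limit_samples": 10,
        "max_tokens": 512,
        "temperature": 1.0,
        "top_p": 0.01,
        "stop": ["\nclass ", "\nif __name__"],  # assumed placement and values
        "extra": {
            "batch_size": 1,
            "top_k": 1,
            "model_type": "completions",
            "hf_token": "<your-hf-token>"  # assumed placement under extra
        }
    }
}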
Metrics#
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| `pass@1` | Fraction of problems for which at least one of the model's generated samples passes all test cases. | 0.0 to 1.0 (where 1.0 means all problems were solved correctly) | Reported per task in the results (see the results examples above). |
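For reference, pass@k scores are typically computed with the unbiased estimator introduced alongside HumanEval; a minimal sketch in Python (the function name is illustrative):

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: samples generated per problem, c: samples that passed all tests,
    k: budget being scored (k=1 for pass@1).
    """
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With a single sample per problem (n=1, k=1), pass@1 is simply the
# fraction of problems whose one sample passed.
print(pass_at_k(1, 1, 1))  # 1.0
print(pass_at_k(1, 0, 1))  # 0.0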