Code Generation Evaluation#
Evaluate programming capabilities through code generation, completion, and algorithmic problem solving using the BigCode evaluation harness.
Overview#
Code generation evaluation assesses a model’s ability to:
Code Generation: Write complete functions from natural language descriptions
Code Completion: Fill in missing code segments
Algorithm Implementation: Solve programming challenges and competitive programming problems
Before You Start#
Ensure you have:
Model Endpoint: An OpenAI-compatible endpoint for your model
API Access: Valid API key for your model endpoint
Sufficient Context: Models with adequate context length for code problems
Pre-Flight Check#
Verify your setup before running code evaluation:
import os

import requests

# Fill these in for your deployment; the values below match the examples used in this guide
endpoint_url = "https://integrate.api.nvidia.com/v1/chat/completions"
model_id = "meta/llama-3.1-8b-instruct"
api_key = os.environ.get("MY_API_KEY", "your_api_key")


def check_endpoint() -> bool:
    try:
        response = requests.post(
            endpoint_url,
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model_id,
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 10,
            },
            timeout=10,
        )
        assert response.status_code == 200, f"Endpoint returned status {response.status_code}"
        print("✓ Endpoint ready for evaluation")
        return True
    except Exception as e:
        print(f"✗ Endpoint check failed: {e}")
        print("Ensure your API key is valid and the endpoint is accessible")
        return False


if __name__ == "__main__":
    check_endpoint()
Tip
Run this script directly: python docs/evaluation/_snippets/prerequisites/endpoint_check.py
Choose Your Approach#
Recommended: the fastest way to run code generation evaluations is through the unified nv-eval CLI:
# List available code generation tasks
nv-eval ls tasks | grep -E "(mbpp|humaneval)"
# Run MBPP evaluation
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o 'evaluation.tasks=["mbpp"]' \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o target.api_endpoint.api_key=${YOUR_API_KEY}
# Run multiple code generation benchmarks
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o 'evaluation.tasks=["mbpp", "humaneval"]'
For programmatic evaluation in custom workflows:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)
# Configure code generation evaluation
eval_config = EvaluationConfig(
type="mbpp",
output_dir="./results",
params=ConfigParams(
limit_samples=10, # Remove for full dataset
temperature=0.2, # Low temperature for consistent code
max_new_tokens=1024, # Sufficient tokens for complete functions
top_p=0.9
)
)
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
url="https://integrate.api.nvidia.com/v1/chat/completions",
model_id="meta/llama-3.1-8b-instruct",
type=EndpointType.CHAT,
api_key="your_api_key"
)
)
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")
For specialized container workflows:
# Pull and run BigCode evaluation container
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1 bash
# Inside container - set environment
export MY_API_KEY=your_api_key_here
# Run code generation evaluation
eval-factory run_eval \
--eval_type mbpp \
--model_id meta/llama-3.1-8b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir /tmp/results \
--overrides 'config.params.limit_samples=10,config.params.temperature=0.2'
Container Access#
The BigCode evaluation harness is available through Docker containers. No separate package installation is required:
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1
Discovering Available Tasks#
Use the launcher CLI to discover all available code generation tasks:
# List all available benchmarks
nv-eval ls tasks
# Filter for code generation tasks
nv-eval ls tasks | grep -E "(mbpp|humaneval)"
Available Tasks#
The BigCode harness provides these programming benchmarks:
| Task | Description | Language | Endpoint Type |
|---|---|---|---|
| mbpp | Mostly Basic Programming Problems | Python | chat |
| mbppplus | Extended MBPP with additional test cases | Python | chat |
| humaneval | Hand-written programming problems | Python | completions |
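Note that humaneval targets a completions-style endpoint rather than a chat endpoint, so its target configuration differs from the MBPP examples above. The sketch below illustrates this with the programmatic API; the EndpointType.COMPLETIONS value and the /v1/completions URL are assumptions inferred from the Endpoint Type column, so verify both against your installed nemo_evaluator version and your deployment.
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, ConfigParams, EvaluationConfig, EvaluationTarget, EndpointType
)

# humaneval expects a completions-style endpoint (see table above).
# EndpointType.COMPLETIONS and the /v1/completions URL are assumptions --
# check both against your nemo_evaluator version and deployment.
humaneval_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/completions",  # assumed completions route
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.COMPLETIONS,  # assumed enum member
        api_key="your_api_key",
    )
)

humaneval_config = EvaluationConfig(
    type="humaneval",
    output_dir="./results/humaneval/",
    params=ConfigParams(limit_samples=10, temperature=0.2, max_new_tokens=1024),
)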
Basic Code Generation Evaluation#
The Mostly Basic Programming Problems (MBPP) benchmark tests fundamental programming skills. Use any of the three approaches above to run MBPP evaluations.
Understanding Results#
Code generation evaluations typically report pass@k metrics, which give the percentage of problems for which at least one of k generated samples passes all test cases.
Advanced Configuration#
Custom Evaluation Parameters
# Advanced configuration for code generation
eval_params = ConfigParams(
limit_samples=100, # Evaluate on subset for testing
parallelism=4, # Concurrent evaluation requests
temperature=0.2, # Low temperature for consistent code
max_new_tokens=1024 # Sufficient tokens for complete functions
)
eval_config = EvaluationConfig(
type="mbpp",
output_dir="/results/mbpp_advanced/",
params=eval_params
)
Multiple Task Evaluation
Evaluate across different code generation benchmarks:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)
# Configure target endpoint (reused for all tasks)
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
url="https://integrate.api.nvidia.com/v1/chat/completions",
model_id="meta/llama-3.1-8b-instruct",
type=EndpointType.CHAT,
api_key="your_api_key"
)
)
code_tasks = ["mbpp", "mbppplus"]
results = {}
for task in code_tasks:
eval_config = EvaluationConfig(
type=task,
output_dir=f"./results/{task}/",
params=ConfigParams(
limit_samples=50,
temperature=0.1,
parallelism=2
)
)
results[task] = evaluate(
eval_cfg=eval_config,
target_cfg=target_config
)
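After the loop completes, results maps each task name to whatever evaluate() returned for it, and per-task artifacts are written under ./results/<task>/. A minimal way to review the returned objects (their exact structure depends on the installed nemo_evaluator version) is:
# Summarize the returned objects; their structure depends on the nemo_evaluator version
for task, result in results.items():
    print(f"{task}: {result}")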
Understanding Metrics#
Pass@k Interpretation#
Code generation evaluations typically report pass@k metrics:
Pass@1: Percentage of problems for which a single generated sample passes all test cases
Pass@k: Percentage of problems for which at least one of k generated samples passes all test cases (when multiple samples are generated per problem)
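For reference, pass@k is commonly computed with the unbiased estimator from the original HumanEval paper: for a problem where n samples are generated and c of them pass every test, pass@k = 1 - C(n-c, k) / C(n, k). The harness reports these metrics for you; the sketch below only illustrates the arithmetic.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated, c: samples passing all tests, k: attempt budget.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 20 samples per problem, 5 of them pass
print(pass_at_k(20, 5, 1))   # 0.25
print(pass_at_k(20, 5, 10))  # ~0.984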