Code Generation Evaluation#

Evaluate programming capabilities through code generation, completion, and algorithmic problem solving using the BigCode evaluation harness.

Overview#

Code generation evaluation assesses a model’s ability to:

  • Code Generation: Write complete functions from natural language descriptions

  • Code Completion: Fill in missing code segments

  • Algorithm Implementation: Solve programming challenges and competitive programming problems
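
For a concrete sense of the task format, an MBPP-style problem pairs a short natural language description with hidden unit tests, and the model must produce a working function. The snippet below is an illustrative paraphrase, not an actual benchmark item:

# Illustrative MBPP-style task (paraphrased): "Write a function to find
# the elements shared by two lists."

def shared_elements(list1, list2):
    """Return the elements that appear in both input lists."""
    return list(set(list1) & set(list2))

# The harness scores generated code by running unit tests like this one.
assert sorted(shared_elements([3, 4, 5, 6], [5, 7, 4, 10])) == [4, 5]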

Before You Start#

Ensure you have:

  • Model Endpoint: An OpenAI-compatible endpoint for your model

  • API Access: Valid API key for your model endpoint

  • Sufficient Context: Models with adequate context length for code problems

Pre-Flight Check#

Verify your setup before running code evaluation:

    import os

    import requests

    # Endpoint details; adjust these to match your deployment.
    endpoint_url = "https://integrate.api.nvidia.com/v1/chat/completions"
    model_id = "meta/llama-3.1-8b-instruct"
    api_key = os.environ["YOUR_API_KEY"]  # export YOUR_API_KEY before running

    def check_endpoint(endpoint_url, api_key, model_id):
        """Send a minimal chat request to confirm the endpoint responds."""
        try:
            response = requests.post(
                endpoint_url,
                headers={"Authorization": f"Bearer {api_key}"},
                json={
                    "model": model_id,
                    "messages": [{"role": "user", "content": "Hello"}],
                    "max_tokens": 10
                },
                timeout=10
            )
            assert response.status_code == 200, f"Endpoint returned status {response.status_code}"
            print("✓ Endpoint ready for evaluation")
            return True
        except Exception as e:
            print(f"✗ Endpoint check failed: {e}")
            print("Ensure your API key is valid and the endpoint is accessible")
            return False

    check_endpoint(endpoint_url, api_key, model_id)

Tip

Run this script directly: python docs/evaluation/_snippets/prerequisites/endpoint_check.py


Choose Your Approach#

Recommended: The fastest way to run code generation evaluations is the unified launcher CLI (nv-eval):

# List available code generation tasks
nv-eval ls tasks | grep -E "(mbpp|humaneval)"

# Run MBPP evaluation
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mbpp"]' \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.api_key=${YOUR_API_KEY}

# Run multiple code generation benchmarks
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mbpp", "humaneval"]'

For programmatic evaluation in custom workflows:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure code generation evaluation
eval_config = EvaluationConfig(
    type="mbpp",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,    # Remove for full dataset
        temperature=0.2,     # Low temperature for consistent code
        max_new_tokens=1024, # Sufficient tokens for complete functions
        top_p=0.9
    )
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct", 
        type=EndpointType.CHAT,
        api_key="your_api_key"
    )
)

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")

For specialized container workflows:

# Pull and run BigCode evaluation container
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1 bash

# Inside container - set environment
export MY_API_KEY=your_api_key_here

# Run code generation evaluation
eval-factory run_eval \
    --eval_type mbpp \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /tmp/results \
    --overrides 'config.params.limit_samples=10,config.params.temperature=0.2'

Container Access#

The BigCode evaluation harness is available through Docker containers. No separate package installation is required:

docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1

Discovering Available Tasks#

Use the launcher CLI to discover all available code generation tasks:

# List all available benchmarks
nv-eval ls tasks

# Filter for code generation tasks
nv-eval ls tasks | grep -E "(mbpp|humaneval)"

Available Tasks#

The BigCode harness provides these programming benchmarks:

| Task      | Description                              | Language | Endpoint Type |
|-----------|------------------------------------------|----------|---------------|
| mbpp      | Mostly Basic Programming Problems        | Python   | chat          |
| mbppplus  | Extended MBPP with additional test cases | Python   | chat          |
| humaneval | Hand-written programming problems        | Python   | completions   |
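
Note that humaneval expects a completions-style endpoint rather than a chat endpoint. The sketch below shows how the target configuration might look; the /v1/completions URL is an assumption about your deployment, and the COMPLETIONS member name should be verified against the EndpointType enum in your installed version:

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

# Sketch only: the URL and enum member are assumptions to check against your setup.
humaneval_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/completions",  # assumed completions route
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.COMPLETIONS,  # verify this member exists in your version
        api_key="your_api_key"
    )
)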

Basic Code Generation Evaluation#

The Mostly Basic Programming Problems (MBPP) benchmark tests fundamental Python programming skills. Use any of the three approaches above to run MBPP evaluations.

Understanding Results#

Code generation evaluations typically report pass@k metrics, which measure the fraction of problems for which at least one of k generated code samples passes all of the problem's test cases.
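
For example, if the harness generates a single sample per problem and 65 of 100 problems pass their tests, pass@1 is 0.65.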

Advanced Configuration#

Custom Evaluation Parameters
from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig

# Advanced configuration for code generation
eval_params = ConfigParams(
    limit_samples=100,           # Evaluate on subset for testing
    parallelism=4,              # Concurrent evaluation requests
    temperature=0.2,            # Low temperature for consistent code
    max_new_tokens=1024         # Sufficient tokens for complete functions
)

eval_config = EvaluationConfig(
    type="mbpp",
    output_dir="/results/mbpp_advanced/",
    params=eval_params
)
Multiple Task Evaluation

Evaluate across different code generation benchmarks:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure target endpoint (reused for all tasks)
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct", 
        type=EndpointType.CHAT,
        api_key="your_api_key"
    )
)

code_tasks = ["mbpp", "mbppplus"]
results = {}

for task in code_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(
            limit_samples=50,
            temperature=0.1,
            parallelism=2
        )
    )
    
    results[task] = evaluate(
        eval_cfg=eval_config,
        target_cfg=target_config
    )
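
After the loop completes, you can print a quick per-task summary; this sketch relies only on the returned objects' string representation:

for task, result in results.items():
    print(f"{task}: {result}")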

Understanding Metrics#

Pass@k Interpretation#

Code generation evaluations typically report pass@k metrics:

  • Pass@1: Percentage of problems solved on the first attempt

  • Pass@k: Percentage of problems for which at least one of k generated samples passes (requires generating multiple samples per problem)
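
When n samples are generated per problem and c of them pass, the reported pass@k is usually computed with the unbiased estimator popularized by the HumanEval/Codex evaluation rather than by literally running k attempts. A minimal sketch of that calculation for a single problem:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c passing."""
    if n - c < k:
        return 1.0  # any draw of k samples must include a passing one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 30 of them pass
print(pass_at_k(n=200, c=30, k=1))   # ≈ 0.15
print(pass_at_k(n=200, c=30, k=10))  # noticeably higher than pass@1

Averaging this value over all problems gives the benchmark's reported pass@k.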