Code Generation Evaluation#

Evaluate programming capabilities through code generation, completion, and algorithmic problem solving using the BigCode evaluation harness.

Overview#

Code generation evaluation assesses a model’s ability to:

  • Code Generation: Write complete functions from natural language descriptions

  • Code Completion: Fill in missing code segments

  • Algorithm Implementation: Solve programming challenges and competitive programming problems
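
For a concrete sense of the task format, an MBPP-style problem pairs a short natural language description with hidden unit tests, and the model must produce a working function. The snippet below is an illustrative paraphrase, not an actual benchmark item:

# Illustrative MBPP-style task (paraphrased): "Write a function to find
# the elements shared by two lists."

def shared_elements(list1, list2):
    """Return the elements that appear in both input lists."""
    return list(set(list1) & set(list2))

# The harness scores generated code by running unit tests like this one.
assert sorted(shared_elements([3, 4, 5, 6], [5, 7, 4, 10])) == [4, 5]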

Before You Start#

Ensure you have:

  • Model Endpoint: An OpenAI-compatible endpoint for your model

  • API Access: Valid API key for your model endpoint

  • Sufficient Context: Models with adequate context length for code problems

Pre-Flight Check#

Verify your setup before running code evaluation:

    import os

    import requests

    # Endpoint details; adjust these to match your deployment.
    endpoint_url = "https://integrate.api.nvidia.com/v1/chat/completions"
    model_id = "meta/llama-3.1-8b-instruct"
    api_key = os.environ["YOUR_API_KEY"]  # export YOUR_API_KEY before running

    def check_endpoint(endpoint_url, api_key, model_id):
        """Send a minimal chat request to confirm the endpoint responds."""
        try:
            response = requests.post(
                endpoint_url,
                headers={"Authorization": f"Bearer {api_key}"},
                json={
                    "model": model_id,
                    "messages": [{"role": "user", "content": "Hello"}],
                    "max_tokens": 10
                },
                timeout=10
            )
            assert response.status_code == 200, f"Endpoint returned status {response.status_code}"
            print("✓ Endpoint ready for evaluation")
            return True
        except Exception as e:
            print(f"✗ Endpoint check failed: {e}")
            print("Ensure your API key is valid and the endpoint is accessible")
            return False

    check_endpoint(endpoint_url, api_key, model_id)

Tip

Run this script directly: python docs/evaluation/_snippets/prerequisites/endpoint_check.py


Choose Your Approach#

Recommended: The fastest way to run code generation evaluations is the unified launcher CLI (nv-eval):

# List available code generation tasks
nv-eval ls tasks | grep -E "(mbpp|humaneval)"

# Run MBPP evaluation
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mbpp"]' \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.api_key=${YOUR_API_KEY}

# Run multiple code generation benchmarks
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mbpp", "humaneval"]'

For programmatic evaluation in custom workflows:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure code generation evaluation
eval_config = EvaluationConfig(
    type="mbpp",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,    # Remove for full dataset
        temperature=0.2,     # Low temperature for consistent code
        max_new_tokens=1024, # Sufficient tokens for complete functions
        top_p=0.9
    )
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct", 
        type=EndpointType.CHAT,
        api_key="your_api_key"
    )
)

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")

For specialized container workflows:

# Pull and run BigCode evaluation container
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1 bash

# Inside container - set environment
export MY_API_KEY=your_api_key_here

# Run code generation evaluation
eval-factory run_eval \
    --eval_type mbpp \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /tmp/results \
    --overrides 'config.params.limit_samples=10,config.params.temperature=0.2'

Container Access#

The BigCode evaluation harness is available through Docker containers. No separate package installation is required:

docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1

Discovering Available Tasks#

Use the launcher CLI to discover all available code generation tasks:

# List all available benchmarks
nv-eval ls tasks

# Filter for code generation tasks
nv-eval ls tasks | grep -E "(mbpp|humaneval)"

Available Tasks#

The BigCode harness provides these programming benchmarks:

| Task      | Description                              | Language | Endpoint Type |
|-----------|------------------------------------------|----------|---------------|
| mbpp      | Mostly Basic Programming Problems        | Python   | chat          |
| mbppplus  | Extended MBPP with additional test cases | Python   | chat          |
| humaneval | Hand-written programming problems        | Python   | completions   |
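
Note that humaneval expects a completions-style endpoint rather than a chat endpoint. The sketch below shows how the target configuration might look; the /v1/completions URL is an assumption about your deployment, and the COMPLETIONS member name should be verified against the EndpointType enum in your installed version:

from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EndpointType, EvaluationTarget

# Sketch only: the URL and enum member are assumptions to check against your setup.
humaneval_target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/completions",  # assumed completions route
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.COMPLETIONS,  # verify this member exists in your version
        api_key="your_api_key"
    )
)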

Basic Code Generation Evaluation#

The Mostly Basic Programming Problems (MBPP) benchmark tests fundamental Python programming skills. Use any of the three approaches above to run MBPP evaluations.

Understanding Results#

Code generation evaluations typically report pass@k metrics, which measure the fraction of problems for which at least one of k generated code samples passes all of the problem's test cases.
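
For example, if the harness generates a single sample per problem and 65 of 100 problems pass their tests, pass@1 is 0.65.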

Advanced Configuration#

Custom Evaluation Parameters
from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig

# Advanced configuration for code generation
eval_params = ConfigParams(
    limit_samples=100,           # Evaluate on subset for testing
    parallelism=4,              # Concurrent evaluation requests
    temperature=0.2,            # Low temperature for consistent code
    max_new_tokens=1024         # Sufficient tokens for complete functions
)

eval_config = EvaluationConfig(
    type="mbpp",
    output_dir="/results/mbpp_advanced/",
    params=eval_params
)
Multiple Task Evaluation

Evaluate across different code generation benchmarks:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure target endpoint (reused for all tasks)
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct", 
        type=EndpointType.CHAT,
        api_key="your_api_key"
    )
)

code_tasks = ["mbpp", "mbppplus"]
results = {}

for task in code_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(
            limit_samples=50,
            temperature=0.1,
            parallelism=2
        )
    )
    
    results[task] = evaluate(
        eval_cfg=eval_config,
        target_cfg=target_config
    )
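
After the loop completes, you can print a quick per-task summary; this sketch relies only on the returned objects' string representation:

for task, result in results.items():
    print(f"{task}: {result}")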

Understanding Metrics#

Pass@k Interpretation#

Code generation evaluations typically report pass@k metrics:

  • Pass@1: Percentage of problems solved on the first attempt

  • Pass@k: Percentage of problems for which at least one of k generated samples passes (requires generating multiple samples per problem)
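
When n samples are generated per problem and c of them pass, the reported pass@k is usually computed with the unbiased estimator popularized by the HumanEval/Codex evaluation rather than by literally running k attempts. A minimal sketch of that calculation for a single problem:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c passing."""
    if n - c < k:
        return 1.0  # any draw of k samples must include a passing one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 30 of them pass
print(pass_at_k(n=200, c=30, k=1))   # ≈ 0.15
print(pass_at_k(n=200, c=30, k=10))  # noticeably higher than pass@1

Averaging this value over all problems gives the benchmark's reported pass@k.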