Evaluate EvalPlus

EvalPlus evaluates Python function-completion ability. NeMo Gym’s evalplus resources server sends HumanEval+ or MBPP+ prompts to an agent, extracts the model’s Python code, runs EvalPlus base and plus tests, and returns a pass/fail score and verdict fields for each task.

Use this tutorial for a small, reproducible coding evaluation before scaling to a larger benchmark run.

What This Tutorial Runs

This walkthrough uses:

| Component | Value |
| --- | --- |
| Benchmark resources server | resources_servers/evalplus |
| Config | resources_servers/evalplus/configs/evalplus.yaml |
| Agent | evalplus_simple_agent |
| Dataset | resources_servers/evalplus/data/example.jsonl |
| Tasks | 5 HumanEval examples |
| Pass score (reward field) | 1.0 when the generated code passes EvalPlus plus tests |

The config currently selects dataset: humaneval. The same resources server also supports MBPP+ through the dataset field.

Prerequisites

Complete Installation and configure a model endpoint as shown in Quickstart.

EvalPlus executes generated Python code in the verifier. Run this tutorial only in an environment where executing untrusted code is expected and isolated according to your team’s policy.

When using a thinking model with vLLM, start vLLM with the appropriate reasoning parser so <think>...</think> text is stripped before EvalPlus extracts code.
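To illustrate why the reasoning text matters, here is a minimal sketch of what stripping looks like. It is not how vLLM's reasoning parser is implemented; `strip_think` and the sample response are hypothetical and only show that draft code inside a `<think>` span could otherwise be mistaken for the final answer.

```python
import re

# Hypothetical helper for illustration only. In the tutorial setup the
# vLLM reasoning parser removes this text on the server side.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
FENCE = "`" * 3  # built at runtime to avoid nesting literal fences


def strip_think(text: str) -> str:
    """Drop every <think>...</think> span and trim surrounding whitespace."""
    return THINK_RE.sub("", text).strip()


# Made-up response: the reasoning contains a wrong draft, the answer is correct.
raw = (
    "<think>First idea:\n"
    f"{FENCE}python\ndef add(a, b):\n    return a - b\n{FENCE}\n"
    "That is wrong.</think>\n"
    f"{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}"
)

print(strip_think(raw))
```

Without the stripping step, a code extractor that picks the first fenced block would grade the discarded draft rather than the final answer.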

Start Servers

Run the following from the repository root. The values are examples; choose the --model-url, --model-type, and --model that match the endpoint and model you want to evaluate.

$ : "${NVIDIA_API_KEY:?Set NVIDIA_API_KEY before running this tutorial.}"
$
$ gym env start \
> --resources-server evalplus \
> --model-type vllm_model \
> --model-url "https://integrate.api.nvidia.com/v1" \
> --model-api-key "${NVIDIA_API_KEY}" \
> --model "nvidia/nemotron-3-ultra-550b-a55b"

This starts:

  • evalplus, the resources server
  • evalplus_simple_agent, the agent server
  • policy_model, the model server

This example uses vllm_model as a Responses API to Chat Completions bridge against NVIDIA’s OpenAI-compatible endpoint. To use an external vLLM server instead, point the same model wrapper at your vLLM endpoint and choose the model name served by that endpoint:

$ gym env start \
> --resources-server evalplus \
> --model-type vllm_model \
> --model-url "http://0.0.0.0:10240/v1" \
> --model-api-key "dummy_key" \
> --model "<served-model-name>"

How Evaluation Works

The evalplus resources server contains the evaluation logic. During rollout collection, the server’s verify() method (resources_servers/evalplus/app.py) runs three steps for each agent response:

  1. Extracts the Python code block from the model output
  2. Runs it against EvalPlus base and plus tests via check_correctness_remote
  3. Returns is_correct (base tests), is_correct_plus (plus tests), and reward=1.0 if plus tests pass
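The first and third steps can be sketched as follows. This is a simplified illustration, not the code in resources_servers/evalplus/app.py: `extract_code` and `build_verdict` are hypothetical names, the choice of the last fenced block is an assumption, and the test execution done by check_correctness_remote is not reproduced (its result is passed in as two booleans).

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime to avoid nesting literal fences
CODE_BLOCK_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def extract_code(model_output: str) -> Optional[str]:
    """Return the last fenced code block, or None if there is none.

    Assumption: the last block is the final answer. The real server may
    use a different rule.
    """
    blocks = CODE_BLOCK_RE.findall(model_output)
    return blocks[-1].strip() if blocks else None


def build_verdict(is_correct: bool, is_correct_plus: bool) -> dict:
    """Map base and plus test outcomes to the fields named above."""
    return {
        "is_correct": is_correct,
        "is_correct_plus": is_correct_plus,
        # Reward depends only on the stricter plus tests.
        "reward": 1.0 if is_correct_plus else 0.0,
    }


response = f"Here is my solution:\n{FENCE}python\ndef f():\n    return 1\n{FENCE}"
print(extract_code(response))
print(build_verdict(is_correct=True, is_correct_plus=False))
```

Note that a solution passing only the base tests still receives a reward of 0.0 in this sketch, matching the rule that the reward tracks the plus tests.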

gym eval run --no-serve then aggregates these per-task results into pass@k and majority@k metrics and writes them to a sidecar file whose name ends in _aggregate_metrics.json.

Evaluation is not a separate step; it happens inside the resources server during rollout collection.

Collect Rollouts

In a second terminal, activate the NeMo Gym environment and collect a five-task smoke test:

$ source .venv/bin/activate
$ mkdir -p results
$
$ gym eval run --no-serve \
> --agent evalplus_simple_agent \
> --input resources_servers/evalplus/data/example.jsonl \
> --output results/evalplus_rollouts.jsonl \
> --limit 5 \
> --num-repeats 1

This writes the rollout file and sidecar artifacts under results/, including:

  • results/evalplus_rollouts.jsonl
  • results/evalplus_rollouts_materialized_inputs.jsonl
  • results/evalplus_rollouts_aggregate_metrics.json
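A quick way to confirm that the run produced these files is sketched below. The naming rule (output stem plus a suffix) is inferred from the three file names listed above and is an assumption, not a documented contract; `sidecar_paths` is a hypothetical helper.

```python
from pathlib import Path


def sidecar_paths(output: str) -> dict:
    """Derive the expected sidecar paths from the --output value.

    Assumption: sidecars reuse the output stem with a fixed suffix,
    as in the file list above.
    """
    out = Path(output)
    stem = out.with_suffix("")  # e.g. results/evalplus_rollouts
    return {
        "rollouts": out,
        "materialized_inputs": Path(f"{stem}_materialized_inputs.jsonl"),
        "aggregate_metrics": Path(f"{stem}_aggregate_metrics.json"),
    }


for name, path in sidecar_paths("results/evalplus_rollouts.jsonl").items():
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```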

Inspect Rollouts

Each rollout row contains the model response, extracted code, verifier verdicts, and a pass/fail score. The most useful fields are:

| Field | Meaning |
| --- | --- |
| extracted_model_code | Python code EvalPlus extracted from the model response. |
| base_status | Whether the code passed the base HumanEval or MBPP tests. |
| plus_status | Whether the code passed EvalPlus extra tests. |
| is_correct | Boolean base-test verdict. |
| is_correct_plus | Boolean plus-test verdict. |
| reward | 1.0 when plus tests pass, otherwise 0.0. |

Use the rollout JSONL to debug individual failures. For example, compare extracted_model_code with the prompt in responses_create_params.input and the verifier verdicts.
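One useful slice is the set of rollouts that pass the base tests but fail the plus tests. The sketch below assumes the fields in the table are top-level keys of each JSONL row, which may not match the actual row layout; `load_rollouts` and `base_only_passes` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_rollouts(path) -> list:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def base_only_passes(rows: list) -> list:
    """Rows that pass the base tests but fail the stricter plus tests."""
    return [
        row for row in rows
        if row.get("is_correct") and not row.get("is_correct_plus")
    ]


path = Path("results/evalplus_rollouts.jsonl")
if path.exists():
    for row in base_only_passes(load_rollouts(path)):
        # Assumed top-level keys; adjust if the fields are nested.
        print("plus_status:", row.get("plus_status"))
        print(row.get("extracted_model_code"))
        print("-" * 40)
```

These rows usually point at edge cases that the base tests do not cover, so they are a good starting point for reading extracted_model_code by hand.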

Read Aggregate Metrics

Use the aggregate metrics file for the top-level scorecard:

$ cat results/evalplus_rollouts_aggregate_metrics.json

For EvalPlus, compare base-test and plus-test metrics. A model can pass the base tests while failing the stricter plus tests, so treat plus-test performance as the stricter headline metric.
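If the metrics file is nested, a flat listing can make the base and plus numbers easier to compare side by side. The sketch below makes no assumption about the key names in the file; it simply walks whatever JSON structure is present. `flatten` is a hypothetical helper.

```python
import json
from pathlib import Path


def flatten(obj, prefix=""):
    """Yield (dotted_key, value) pairs for every scalar leaf in nested JSON."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}{index}.")
    else:
        # Drop the trailing dot added by the parent level.
        yield prefix[:-1], obj


path = Path("results/evalplus_rollouts_aggregate_metrics.json")
if path.exists():
    metrics = json.loads(path.read_text(encoding="utf-8"))
    for key, value in sorted(flatten(metrics), key=lambda kv: kv[0]):
        print(f"{key}: {value}")
```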

What You Can Evaluate

This tutorial uses evalplus_simple_agent, a thin passthrough that sends the prompt directly to the model and returns its output, so scores measure the model’s coding ability in isolation.

The same setup works for evaluating an agent harness. Swap evalplus_simple_agent for any other agent config (e.g. a reflection agent or a code-execution agent) and set its resources_server.name to evalplus. The resources server and verifier are unchanged, so scores are directly comparable across agents and models.

| What you’re measuring | Agent config |
| --- | --- |
| Model coding ability | evalplus_simple_agent (passthrough) |
| Agent harness quality | Any agent wired to the evalplus resources server |

The same gym eval run --no-serve command works in both cases; what changes is what the score reflects.

Scale the Run

After the smoke test succeeds:

  1. Remove --limit 5 to run the configured dataset.
  2. Increase --num-repeats to get statistically meaningful scores (see below).
  3. Keep the model, agent, dataset, sampling settings, and EvalPlus config fixed when comparing models.
  4. Save rollout JSONL, aggregate metrics, model ID, config files, and git SHA with the run notes.
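For the last step, a small script can capture the run metadata next to the results. This is a sketch under assumptions: the field names in the notes and the helper names are hypothetical, and nothing here is produced by NeMo Gym itself.

```python
import json
import subprocess
from datetime import datetime, timezone


def git_sha() -> str:
    """Return the current commit SHA, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_notes(model: str, agent: str, num_repeats: int, sha: str) -> dict:
    """Collect the settings that must stay fixed when comparing runs."""
    return {
        "model": model,
        "agent": agent,
        "num_repeats": num_repeats,
        "git_sha": sha,
        "config": "resources_servers/evalplus/configs/evalplus.yaml",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


notes = build_run_notes(
    model="<served-model-name>",
    agent="evalplus_simple_agent",
    num_repeats=4,
    sha=git_sha(),
)
print(json.dumps(notes, indent=2))
```

Writing the output to a file under results/ keeps the notes alongside the rollout JSONL and aggregate metrics they describe.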

Choosing --num-repeats

Sampling at temperature > 0 is stochastic, so a single attempt per task gives a point estimate with no measure of stability. --num-repeats controls how many independent attempts are collected per task, and it unlocks a richer set of aggregate metrics:

| --num-repeats | Metrics available |
| --- | --- |
| 1 | pass@1 only |
| >= 2 | pass@k, majority@k, pass@1 standard error and standard deviation across runs, within-task variance |
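For intuition about pass@k, the sketch below implements the widely used unbiased estimator from the original HumanEval work: with n attempts per task of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether NeMo Gym computes pass@k exactly this way is an assumption; `pass_at_k` is a hypothetical helper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn from n, passes.

    n: attempts collected for the task (--num-repeats)
    c: attempts that passed
    k: attempt budget being evaluated, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One passing attempt out of four.
print(pass_at_k(n=4, c=1, k=1))  # 0.25
print(pass_at_k(n=4, c=1, k=4))  # 1.0
```

The benchmark-level score is the mean of this per-task value over all tasks.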

Even --num-repeats 4 is enough to surface std_err_across_runs and pass@4, which help you judge whether a score difference between two models is likely real or within noise. Use --num-repeats 1 for smoke tests; use --num-repeats 4 or higher when reporting results.
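A standard error across runs can be read as follows. The sketch assumes it is the sample standard deviation of the per-run pass@1 scores divided by the square root of the number of runs; the exact definition behind std_err_across_runs may differ. The scores are made-up illustrative values, not measured results.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_std_err(run_scores: list) -> tuple:
    """Return (mean, standard error) of per-run pass@1 scores."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(run_scores), stdev(run_scores) / sqrt(len(run_scores))


# Hypothetical per-run pass@1 for two models with --num-repeats 4.
model_a = [0.80, 0.82, 0.78, 0.80]
model_b = [0.81, 0.79, 0.83, 0.81]

mean_a, se_a = mean_and_std_err(model_a)
mean_b, se_b = mean_and_std_err(model_b)

# Compare the gap with the combined uncertainty of both estimates.
gap = abs(mean_a - mean_b)
combined_se = sqrt(se_a ** 2 + se_b ** 2)
print(f"gap={gap:.4f}, combined standard error={combined_se:.4f}")
```

A gap smaller than about two combined standard errors is hard to distinguish from sampling noise, which is a common rule of thumb rather than a formal test.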