Evaluate EvalPlus
EvalPlus evaluates Python function-completion ability. NeMo Gym’s evalplus resources server sends HumanEval+ or MBPP+ prompts to an agent, extracts the model’s Python code, runs EvalPlus base and plus tests, and returns a pass/fail score and verdict fields for each task.
Use this tutorial for a small, reproducible coding evaluation before scaling to a larger benchmark run.
What This Tutorial Runs
This walkthrough uses:
The config currently selects dataset: humaneval. The same resources server also supports MBPP+ through the dataset field.
Prerequisites
Complete Installation and configure a model endpoint as shown in Quickstart.
EvalPlus executes generated Python code in the verifier. Run this tutorial only in an environment where code execution is expected and is isolated according to your team’s policy.
When using a thinking model with vLLM, start vLLM with the appropriate reasoning parser so <think>...</think> text is stripped before EvalPlus extracts code.
Start Servers
Run from the repository root:
The values below are examples. Choose the --model-url, --model-type, and --model that match the endpoint and model you want to evaluate.
This starts:
- evalplus, the resources server
- evalplus_simple_agent, the agent server
- policy_model, the model server
This example uses vllm_model as a Responses API to Chat Completions bridge against NVIDIA’s OpenAI-compatible endpoint. To use an external vLLM server instead, point the same model wrapper at your vLLM endpoint and choose the model name served by that endpoint:
How Evaluation Works
The evalplus resources server contains the evaluation logic. During rollout collection, the server’s verify() method (resources_servers/evalplus/app.py) runs three steps for each agent response:
- Extracts the Python code block from the model output
- Runs it against EvalPlus base and plus tests via check_correctness_remote
- Returns is_correct (base tests), is_correct_plus (plus tests), and reward=1.0 if plus tests pass
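The extraction and scoring steps can be pictured with a minimal sketch. This is not the code in resources_servers/evalplus/app.py: the helper names extract_python_code and score are hypothetical, taking the last fenced block is an assumption, and so is a reward of 0.0 when plus tests fail (the tutorial only states that reward=1.0 when they pass). The test execution step (check_correctness_remote) is left out.

```python
import re
from typing import Optional

# Built indirectly so this example can itself sit inside a fenced block.
TICKS = "`" * 3

# Matches a fenced block with an optional "python" or "py" language tag.
_FENCE = re.compile(
    TICKS + r"(?:python|py)?[ \t]*\n(.*?)" + TICKS,
    re.DOTALL | re.IGNORECASE,
)


def extract_python_code(model_output: str) -> Optional[str]:
    """Return the last fenced code block (assumption), or None if there is none."""
    blocks = _FENCE.findall(model_output)
    return blocks[-1].strip() if blocks else None


def score(is_correct: bool, is_correct_plus: bool) -> dict:
    """Map verifier verdicts to the fields named in this tutorial."""
    return {
        "is_correct": is_correct,            # base tests
        "is_correct_plus": is_correct_plus,  # plus tests
        # 1.0 on a plus-test pass is from the tutorial; 0.0 otherwise is assumed.
        "reward": 1.0 if is_correct_plus else 0.0,
    }
```

A response that passes base tests but fails plus tests would therefore carry is_correct=True, is_correct_plus=False, and, under the assumption above, a reward of 0.0.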
gym eval run --no-serve then aggregates these per-task results into pass@k and majority@k metrics in _aggregate_metrics.json.
Evaluation is not a separate step; it happens inside the resources server during rollout collection.
Collect Rollouts
In a second terminal, activate the NeMo Gym environment and collect a five-task smoke test:
This writes the rollout file and sidecar artifacts under results/, including:
- results/evalplus_rollouts.jsonl
- results/evalplus_rollouts_materialized_inputs.jsonl
- results/evalplus_rollouts_aggregate_metrics.json
Inspect Rollouts
Each rollout row contains the model response, extracted code, verifier verdicts, and a pass/fail score. The most useful fields are:
Use the rollout JSONL to debug individual failures. For example, compare extracted_model_code with the prompt in responses_create_params.input and the verifier verdicts.
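One way to find rows worth debugging is to filter for tasks that pass the base tests but fail the plus tests. The sketch below assumes that is_correct, is_correct_plus, and extracted_model_code are top-level keys in each rollout row; the field names come from this tutorial, but their exact position in the row is an assumption, so adjust the lookups to match your file. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def parse_rollouts(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def base_only_passes(rows: Iterable[dict]) -> List[dict]:
    """Rows that pass base tests but fail the stricter plus tests."""
    return [
        row for row in rows
        if row.get("is_correct") and not row.get("is_correct_plus")
    ]


if __name__ == "__main__":
    path = Path("results/evalplus_rollouts.jsonl")
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for row in base_only_passes(parse_rollouts(fh)):
                # Assumed top-level key; see the note above.
                print(row.get("extracted_model_code"))
                print("-" * 40)
```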
Read Aggregate Metrics
Use the aggregate metrics file for the top-level scorecard:
For EvalPlus, compare base-test and plus-test metrics. A model can pass the base tests while failing the stricter plus tests, so treat plus-test performance as the stricter headline metric.
What You Can Evaluate
This tutorial uses evalplus_simple_agent, a thin passthrough that sends the prompt directly to the model and returns its output, so scores measure the model’s coding ability in isolation.
The same setup works for evaluating an agent harness. Swap evalplus_simple_agent for any other agent config (e.g. a reflection agent or a code-execution agent) and set its resources_server.name to evalplus. The resources server and verifier are unchanged, so scores are directly comparable across agents and models.
The same gym eval run --no-serve command works in both cases; what changes is what the score reflects.
Scale the Run
After the smoke test succeeds:
- Remove --limit 5 to run the configured dataset.
- Increase --num-repeats to get statistically meaningful scores (see below).
- Keep the model, agent, dataset, sampling settings, and EvalPlus config fixed when comparing models.
- Save rollout JSONL, aggregate metrics, model ID, config files, and git SHA with the run notes.
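Capturing the run notes can be scripted. The following is a minimal sketch, not part of NeMo Gym: the helper names, the run_notes.json file name, and the example model ID and config path are all hypothetical. It records the git SHA with git rev-parse HEAD and falls back to "unknown" when git is unavailable.

```python
import json
import subprocess
from datetime import datetime, timezone
from typing import List, Optional


def current_git_sha() -> str:
    """Return the current commit SHA, or "unknown" if git cannot provide it."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_notes(
    model_id: str,
    config_files: List[str],
    artifacts: List[str],
    git_sha: Optional[str] = None,
) -> dict:
    """Collect the items listed above into one JSON-serializable record."""
    return {
        "model_id": model_id,
        "config_files": config_files,
        "artifacts": artifacts,
        "git_sha": git_sha if git_sha is not None else current_git_sha(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    notes = build_run_notes(
        model_id="my-model",                   # hypothetical model ID
        config_files=["path/to/config.yaml"],  # hypothetical config path
        artifacts=[
            "results/evalplus_rollouts.jsonl",
            "results/evalplus_rollouts_aggregate_metrics.json",
        ],
    )
    with open("run_notes.json", "w", encoding="utf-8") as fh:
        json.dump(notes, fh, indent=2)
```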
Choosing --num-repeats
All models are stochastic at temperature > 0, so a single attempt per task is a point estimate with no measure of stability. --num-repeats controls how many independent attempts are collected per task, and it unlocks a richer set of aggregate metrics:
Even --num-repeats 4 is enough to surface std_err_across_runs and pass@4, which help you judge whether a score difference between two models is real or just noise. Use --num-repeats 1 for smoke tests; use --num-repeats 4 or higher when reporting results.
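To see why repeats matter, it helps to compute these quantities by hand. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n attempts with c correct, and a plain standard error of the mean over per-run scores. Whether NeMo Gym computes pass@k and std_err_across_runs in exactly this way is an assumption, and the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import mean, stdev
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task with n attempts and c correct."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Every size-k subset of attempts contains at least one correct attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-task pass@k estimate over all tasks."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)


def std_err(run_scores: Sequence[float]) -> float:
    """Standard error of the mean across runs (needs at least two runs)."""
    return stdev(run_scores) / sqrt(len(run_scores))
```

With a single repeat there is only one run score, so a standard error cannot be computed at all; that is the practical reason to use more than one repeat when reporting results.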