Evaluation | NeMo Gym

For the underlying concepts, read Evaluation and Environments. For server-level architecture, read Architecture.

NeMo Gym evaluation is environment-native: the same dataset, resources server, verifier, and rollout machinery can support model comparison, harness comparison, benchmark development, ablation studies, and training-data analysis.

Gym Evaluation Loop

Choose a dataset and environment to measure performance.
Choose the model and agent harness.
Collect rollouts with fixed sampling settings and an explicit repeat count.
Verify each rollout and compute metrics.
Compare against the verifier.
Inspect failures, sometimes-pass tasks, and missing rows.
Use this information to decide whether the next action is model training, harness work, verifier repair, etc.

Evaluation involves many components, and the most important rule is to vary one thing at a time for causal comparison. Changing the model, harness, prompt, verifier, and dataset together can still produce a useful score as a release gate, but it cannot explain what caused the difference.

Models

In this document, a model is an LLM endpoint that can be interacted with via an HTTP request. In NeMo Gym, models are accessed through the model server. The model server keeps provider-specific details behind the Responses API boundary, so the same evaluation can compare hosted models, self-hosted vLLM models, local checkpoints, or models before and after training.

Setting up a model for evaluation requires considering its specific parameters that can influence the result, such as temperature, top-p, max tokens, and seed.

Agent Harnesses

An agent harness is the orchestration layer that turns a model into an agent. The model produces responses, but the harness decides how those responses are used: it builds prompts, manages conversation state, routes tool calls, executes retries, handles observations from the environment, and decides when the task is complete.

In NeMo Gym, the harness usually lives behind the agent server. The agent server owns the rollout loop, while the model server performs stateless inference and the resources server owns the environment, state, and verifier.

Harness changes often affect metrics, and some models are better tuned to use specific harnesses due to their training data.

NeMo Gym treats an agent as model plus harness. The model server stays stateless; the agent server owns the loop that calls the model, routes tool calls, manages conversation state, and asks the resources server to verify the final attempt.

Agent Server

Understand how harnesses orchestrate rollouts.

Architecture

Review how model, agent, and resources servers fit together.

Benchmarks

A benchmark is an environment configured for repeatable comparison. It needs a dataset, resources server, verifier, expected runtime, documented dependencies, and known metrics.

When choosing or building a benchmark, check:

Task fit: Does the task distribution match the target capability?
Verifier fidelity: Does reward measure the target behavior, or just a proxy?
Model compatibility: Can the model endpoint support the required input/output mode, tool calls, images, long context, or structured output?
Harness fit: Does the harness expose the tools and interaction pattern required by the benchmark?
Variance: Are repeats stable enough to distinguish real changes from noise?
Reproducibility: Can another run use the same data, code, config, and model reference?

For integrated external benchmarks, first reproduce the original benchmark’s reported numbers outside Gym when possible. Then integrate it into Gym and rerun against the same model set. This separates benchmark reproduction issues from Gym integration issues.

Browse Environments

Browse built-in benchmark and training environments.

Add a Benchmark

Use the benchmark contribution checklist.

Next Steps

Quickstart

Run a small evaluation and inspect the generated outputs.

Evaluation Tutorials

Follow benchmark-specific evaluation tutorials.

Aggregate Metrics

Customize metrics and key metrics for an environment.

Reward Profiling

Compute per-task pass rates and variance from repeated rollouts.

Training

Use evaluation results to drive post-training.