Quickstart

See Installation if you need to install NeMo Gym.

Working with an AI coding assistant? NeMo Gym ships Agent Skills — vetted workflows for tasks like adding benchmarks, debugging rollouts, and editing docs.

Configure Your Model

Create an env.yaml file in the project root with your model endpoint credentials:

policy_base_url: https://api.openai.com/v1
policy_api_key: <your-openai-api-key>
policy_model_name: gpt-4.1-2025-04-14

This quickstart uses OpenAI. NeMo Gym supports local and hosted inference — see Configure Model for vLLM, Fireworks, OpenRouter, and others.
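Before starting the servers, it can help to confirm that env.yaml is filled in. The following is a minimal sketch, not part of NeMo Gym: the helper names parse_flat_yaml and check_env_config are hypothetical, the parser only handles flat `key: value` lines like the ones shown above, and treating all three keys as required is an assumption based on this quickstart.

```python
from pathlib import Path

# Keys shown in this quickstart's env.yaml; treating all three as required is an assumption.
REQUIRED_KEYS = ("policy_base_url", "policy_api_key", "policy_model_name")


def parse_flat_yaml(text: str) -> dict:
    """Parse flat `key: value` lines. This is not a general YAML parser."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URL values keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return config


def check_env_config(text: str) -> list:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    config = parse_flat_yaml(text)
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key, "")
        if not value:
            problems.append(f"missing value for {key}")
        elif value.startswith("<") and value.endswith(">"):
            problems.append(f"{key} still holds a placeholder: {value}")
    return problems


if __name__ == "__main__":
    env_path = Path("env.yaml")
    if env_path.exists():
        for problem in check_env_config(env_path.read_text()):
            print(problem)
```

Run from the project root, the script prints one line per problem and nothing if the file looks complete. It does not verify that the key or endpoint actually works.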

Run Evaluation

Run your agent on a set of tasks and score the results. This example uses a simple tool-calling agent, simple_agent, with the mcqa (multiple-choice Q&A) environment and its included example data.

1. Start servers

NeMo Gym uses local servers to coordinate your model, agent, and task verification. Start them first:

$ gym env start \
    --resources-server mcqa \
    --model-type openai_model

You should see three server instances starting:

[1] mcqa (resources_servers/mcqa)
[2] mcqa_simple_agent (responses_api_agents/simple_agent)
[3] policy_model (responses_api_models/openai_model)

2. Evaluate your agent

In a new terminal, run your agent on a small set of tasks (five here, set by --limit 5) to verify everything works:

$ source .venv/bin/activate
$ gym eval run --no-serve \
    --agent mcqa_simple_agent \
    --input resources_servers/mcqa/data/example.jsonl \
    --output results/mcqa_rollouts.jsonl \
    --limit 5 \
    --num-repeats 1

You should see a progress bar followed by aggregate metrics:

Collecting rollouts: 100%|██████| 5/5 [01:22<00:00, 16.44s/it]
Key metrics for mcqa_simple_agent:
{
"mean/reward": 0.8,
"pass@1[avg-of-1]/accuracy": 80.0,
"pass@1/accuracy": 80.0
}
Finished rollout collection! View results at:
Fully materialized inputs: results/mcqa_rollouts_materialized_inputs.jsonl
Rollouts: results/mcqa_rollouts.jsonl
Aggregate metrics: results/mcqa_rollouts_aggregate_metrics.json

For per-task pass rates, see gym eval profile in the CLI Reference.
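To sanity-check the aggregate numbers, you can recompute the mean reward directly from the rollouts file. This is a minimal sketch under one loud assumption: each line of results/mcqa_rollouts.jsonl is a JSON object with a top-level numeric field named reward. That field name is inferred from the "mean/reward" metric above and is not confirmed by this page, so inspect one line of your file and adjust reward_key if the schema differs. The function name mean_reward is hypothetical.

```python
import json
from pathlib import Path


def mean_reward(jsonl_lines, reward_key="reward"):
    """Average a numeric reward field over JSONL rollout records.

    `reward_key` is an assumed field name; change it to match the real schema.
    """
    rewards = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        rewards.append(float(record[reward_key]))
    if not rewards:
        raise ValueError("no rollouts found")
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    rollouts_path = Path("results/mcqa_rollouts.jsonl")
    if rollouts_path.exists():
        with rollouts_path.open() as handle:
            print(f"mean reward: {mean_reward(handle):.2f}")
```

If the assumption holds, the printed value should agree with "mean/reward" in the aggregate metrics file for the same run.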

Explore

Now that you have a working setup, explore what’s available.

NeMo Gym ships with environments across many domains. You can use these existing environments in addition to building your own.

$ gym list benchmarks
Available benchmarks in NeMo Gym (76)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark name ┃ Domain ┃ Agent name ┃ Num repeats ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ aime24 │ math │ aime24_math_with_judge_simple_ag… │ 32 │
│ aime25 │ math │ aime25_math_with_judge_simple_ag… │ 32 │
│ browsecomp │ agent │ browsecomp_tavily_search_simple_… │ 1 │
│ gpqa │ knowledge │ gpqa_mcqa_simple_agent │ 8 │
│ ifbench │ instruction_fol… │ ifbench_benchmark_simple_agent │ 5 │
│ ... │ ... │ ... │ ... │
│ tau2 │ agent │ tau2_benchmark_agent │ 8 │
│ xstest │ safety │ xstest_benchmark_simple_agent │ 4 │
└──────────────────┴────────────────────┴───────────────────────────────────┴─────────────┘

This lists benchmarks with pre-configured agents. For the full set of environments (including training environments), see the Available Environments table.

Every CLI command supports -h or --help for detailed usage information:

$ gym --help
$ gym env start --help

Next Steps