Quickstart

See Installation if you need to install NeMo Gym.

Configure Your Model

Create an env.yaml file in the project root with your model endpoint credentials:

policy_base_url: https://api.openai.com/v1
policy_api_key: <your-openai-api-key>
policy_model_name: gpt-4.1-2025-04-14

This quickstart uses OpenAI. NeMo Gym supports both local and hosted inference; see Configure Model for vLLM, Fireworks, OpenRouter, and others.
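Before starting the servers, it can help to confirm that env.yaml defines the three keys shown above. The following is a minimal sketch, assuming the file contains only flat key: value pairs as in the example. The helper missing_keys is hypothetical and not part of NeMo Gym, and it uses a naive line-based parse rather than a YAML library.

```python
from pathlib import Path

# Keys taken from the env.yaml example in this quickstart.
REQUIRED_KEYS = ("policy_base_url", "policy_api_key", "policy_model_name")


def missing_keys(text: str) -> list[str]:
    """Return required keys that are absent or empty in flat `key: value` text.

    Hypothetical helper: a naive parse that ignores nesting, quoting,
    and every other YAML feature.
    """
    found = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        found[key.strip()] = value.strip()
    return [k for k in REQUIRED_KEYS if not found.get(k)]


if __name__ == "__main__":
    path = Path("env.yaml")
    if not path.exists():
        print("env.yaml not found in the current directory")
    else:
        print("missing keys:", missing_keys(path.read_text()) or "none")
```

Run it from the project root. An empty result only means the keys are present, not that the credentials are valid.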

Run Evaluation

Run your agent on a set of tasks and score the results. This example pairs a simple tool-calling agent, simple_agent, with the mcqa (multiple-choice Q&A) environment and its included example data.

1. Start servers

NeMo Gym uses local servers to coordinate your model, agent, and task verification. Start them first:

$ environment_config="resources_servers/mcqa/configs/mcqa.yaml"
$ model_config="responses_api_models/openai_model/configs/openai_model.yaml"
$ 
$ ng_run "+config_paths=[${environment_config},${model_config}]"

You should see three server instances starting:

[1] mcqa (resources_servers/mcqa)
[2] mcqa_simple_agent (responses_api_agents/simple_agent)
[3] policy_model (responses_api_models/openai_model)
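The ng_run command above passes both config files as one bracketed, comma-separated +config_paths argument. If you script this step, a small pre-flight check can catch a mistyped path before the servers start. This is a sketch only: config_paths_override and missing_files are hypothetical helpers, not NeMo Gym functions, and the paths are the ones used in this quickstart.

```python
from pathlib import Path


def config_paths_override(paths):
    """Join config paths into a single `+config_paths=[a,b]` argument."""
    return "+config_paths=[" + ",".join(str(p) for p in paths) + "]"


def missing_files(paths, root="."):
    """Return the paths that are not regular files relative to `root`."""
    return [str(p) for p in paths if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    configs = [
        "resources_servers/mcqa/configs/mcqa.yaml",
        "responses_api_models/openai_model/configs/openai_model.yaml",
    ]
    print("missing:", missing_files(configs) or "none")
    # Prints the command line to run; it does not launch anything itself.
    print("ng_run", config_paths_override(configs))
```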

2. Evaluate your agent

In a new terminal, run your agent on a small sample of tasks (five here, set by +limit=5) to verify everything works:

$ source .venv/bin/activate
$ 
$ ng_collect_rollouts \
>     +agent_name=mcqa_simple_agent \
>     +input_jsonl_fpath=resources_servers/mcqa/data/example.jsonl \
>     +output_jsonl_fpath=results/mcqa_rollouts.jsonl \
>     +limit=5 \
>     +num_repeats=1

You should see a progress bar followed by aggregate metrics:

Collecting rollouts: 100%|██████| 5/5 [01:22<00:00, 16.44s/it]
Key metrics for mcqa_simple_agent:
{
    "mean/reward": 0.8,
    "pass@1[avg-of-1]/accuracy": 80.0,
    "pass@1/accuracy": 80.0
}
Finished rollout collection! View results at:
Fully materialized inputs: results/mcqa_rollouts_materialized_inputs.jsonl
Rollouts: results/mcqa_rollouts.jsonl
Aggregate metrics: results/mcqa_rollouts_aggregate_metrics.json

For per-task pass rates, see ng_reward_profile in the CLI Reference.
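You can also inspect the rollouts file directly. The sketch below assumes each line of results/mcqa_rollouts.jsonl is a JSON object with a numeric top-level reward field. That field name is inferred from the mean/reward metric printed above, not from a documented schema, so adjust the key if your output differs. load_jsonl and mean_reward are hypothetical helpers.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def mean_reward(records, key="reward"):
    """Average the numeric `key` field over records that have it.

    Returns None when no record carries a numeric value under `key`.
    The default key name is an assumption, not a documented schema.
    """
    values = [r[key] for r in records if isinstance(r.get(key), (int, float))]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    path = Path("results/mcqa_rollouts.jsonl")
    if path.exists():
        records = load_jsonl(path)
        print(f"{len(records)} rollouts, mean reward: {mean_reward(records)}")
    else:
        print(f"{path} not found; run ng_collect_rollouts first")
```

If the assumption holds, the printed mean should match the mean/reward value reported by ng_collect_rollouts for the same run.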

Explore

Now that you have a working setup, explore what’s available.

NeMo Gym ships with environments across many domains. You can use these existing environments in addition to building your own.

$ ng_list_benchmarks

Available benchmarks in NeMo Gym
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark name   ┃ Agent name                        ┃ Num repeats ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ aalcr            │ aalcr_benchmark_simple_agent      │ 16          │
│ aime25           │ aime25_math_with_judge_simple_ag… │ 32          │
│ browsecomp       │ browsecomp_tavily_search_simple_… │ 1           │
│ gpqa             │ gpqa_mcqa_simple_agent            │ 8           │
│ ifbench          │ ifbench_benchmark_simple_agent    │ 5           │
│ ...              │ ...                               │ ...         │
│ tau2             │ tau2_benchmark_agent              │ 8           │
│ xstest           │ xstest_benchmark_simple_agent     │ 4           │
└──────────────────┴───────────────────────────────────┴─────────────┘

This lists benchmarks with pre-configured agents. For the full set of environments (including training environments), see the Available Environments table.

Every CLI command supports +h=true or +help=true for detailed usage information:

$ ng_help
$ ng_run +help=true

Next Steps

Browse Environments

Browse available environments for evaluation and training.

Agents

Explore available agent harnesses and learn how to integrate your own agent.

Training

Improve your agent or model with RL or fine-tuning.

Build Custom Environments

Create your own evaluation or training environments.