Available Environments
NeMo Gym includes a curated collection of environments for training and evaluation across multiple domains. This page is generated from docs/data/environments.yaml. To update it, run:
python scripts/generate_environments_yaml.py
Example Environment Patterns
Multi Step
example
Multi-step tool calling
Session State Mgmt
example
Session state management (in-memory)
Single Tool Call
example
Basic single-step tool calling
Environments for Training & Evaluation
Aviary
4 configs
agent coding math
Calendar
1 config
agent
Circle Click
1 config
other
Code Gen
1 config
coding
Equivalence Llm Judge
4 configs
agent knowledge
config equivalence_llm_judge.yaml
readme README
domain knowledge
description Short answer questions with LLM-as-a-judge
value Improve knowledge-related benchmarks like GPQA / HLE
config nl2bash-equivalency.yaml
readme README
domain agent
description Short bash command generation questions with LLM-as-a-judge
value Improve foundational bash and instruction-following (IF) capabilities
Ether0
1 config
knowledge
config ether0.yaml
readme README
domain knowledge
description ether0 chemistry benchmark verifiers
value Evaluate chemistry knowledge and reasoning with the ether0 benchmark
Genrm Compare
1 config
config genrm_compare.yaml
readme README
Google Search
1 config
agent
config google_search.yaml
readme README
domain agent
description Multi-choice question answering problems with search tools integrated
value Improve knowledge-related benchmarks with search tools
Instruction Following
1 config
instruction_following
config instruction_following.yaml
readme README
domain instruction_following
description Instruction-following datasets targeting IFEval- and IFBench-style capabilities
value Improve IFEval and IFBench
Jailbreak Detection
1 config
safety
Math Advanced Calculations
1 config
agent
readme README
domain agent
description An instruction-following math environment with counter-intuitive calculators
value Improve instruction following capabilities in specific math environments
Math Formal Lean
6 configs
math
config math_formal_lean.yaml
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
readme README
domain math
description Lean4 formal proof verification environment with multi-turn self-correction
value Improve formal theorem proving capabilities
config nemotron_clean_easy.yaml
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
config nemotron_first_try_hard.yaml
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
config nemotron_medium_500.yaml
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
config nemotron_very_easy.yaml
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
Math With Judge
7 configs
math
config math_with_judge.yaml
readme README
domain math
description Math dataset with math-verify and LLM-as-a-judge
value Improve math capabilities including AIME 24 / 25
Mcqa
1 config
knowledge
config mcqa.yaml
readme README
domain knowledge
description Multi-choice question answering problems
value Improve benchmarks like MMLU / GPQA / HLE
dataset Nemotron-RL-knowledge-mcqa
Mini Swe Agent
1 config
coding
config mini_swe_agent.yaml
readme README
domain coding
description A software development environment with mini-swe-agent orchestration
value Improve software development capabilities, as measured by benchmarks like SWE-bench
dataset SWE-Gym
Multichallenge
2 configs
knowledge
config multichallenge.yaml
readme README
domain knowledge
description MultiChallenge benchmark evaluation with LLM judge
config multichallenge_nrl.yaml
readme README
domain knowledge
description MultiChallenge benchmark evaluation with LLM judge
Ns Tools
1 config
agent
config ns_tools.yaml
readme README
domain agent
description NeMo Skills tool execution with math verification
Over Refusal Detection
3 configs
safety
readme README
domain safety
description Over-refusal detection: monitors whether the model responds helpfully to safe prompts
Reasoning Gym
2 configs
knowledge
Single Step Tool Use With Argument Comparison
4 configs
agent
Structured Outputs
1 config
instruction_following
config structured_outputs_json.yaml
readme README
domain instruction_following
description Checks whether responses follow the structured output requirements given in prompts
value Improve instruction following capabilities
Swerl Gen
1 config
coding
config swerl_gen.yaml
readme README
domain coding
description Runs sandboxed evaluation for SWE-style tasks (either patch generation or reproduction test generation)
value Improve SWE capabilities useful for benchmarks like SWE-bench
Swerl Llm Judge
1 config
coding
config swerl_llm_judge.yaml
readme README
domain coding
description SWE-style multiple-choice LLM-judge tasks scored via <solution>...</solution> choice.
value Improve SWE capabilities useful for benchmarks like SWE-bench
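The Swerl Llm Judge description mentions scoring via a `<solution>...</solution>` choice. As a rough illustration of what tag-based choice scoring can look like, here is a minimal sketch. The function names, the "last tag wins" rule, and the case-insensitive comparison are all assumptions made for this example, not the environment's actual implementation; see the environment's README for the real behavior.

```python
import re
from typing import Optional

# Hypothetical pattern: the real environment's parsing rules may differ.
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL | re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the stripped content of the last <solution> tag, or None if absent."""
    matches = SOLUTION_RE.findall(response)
    if not matches:
        return None
    # Assumption: if the model emits several tags, the final one is its answer.
    return matches[-1].strip()


def score(response: str, expected: str) -> float:
    """Return 1.0 when the extracted choice matches the expected label, else 0.0."""
    choice = extract_choice(response)
    if choice is None:
        return 0.0
    # Assumption: labels are compared case-insensitively.
    return 1.0 if choice.upper() == expected.strip().upper() else 0.0
```

In this sketch a response with no `<solution>` tag scores 0.0 rather than raising an error, so malformed outputs are simply treated as incorrect.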
Tavily Search
2 configs
agent
Terminus Judge
2 configs
agent
config terminus_judge.yaml
readme README
domain agent
description Single-step terminal-based task (rubrics v4 judge prompt)
value Improve on terminal-style tasks
config terminus_judge_simple.yaml
readme README
domain agent
description Single-step terminal-based task (simple judge prompt)
value Improve on terminal-style tasks
Text To Sql
1 config
coding
config text_to_sql.yaml
readme README
domain coding
description Text-to-SQL generation with LLM-as-a-judge equivalence checking
value Improve text-to-SQL capabilities across multiple dialects
Workplace Assistant
1 config
agent
config workplace_assistant.yaml
readme README
domain agent
description Workplace assistant multi-step tool-using environment
value Improve multi-step tool use capability