Browse Environments

NeMo Gym currently includes 90+ environments covering math, coding, reasoning, knowledge, agentic tool use, instruction following, and safety. Each environment is a resources server that defines a dataset, verification logic, and optional tools.

Math

Environment	Description	Verification	Benchmark
math_with_judge	OpenMathReasoning, DAPO, and MathStackOverflow datasets	math-verify + LLM judge	✓
math_with_code	Competitive math with calculator tools	Boxed answer + numeric match
math_formal_lean	Lean4 formal proof verification	Lean4 compiler
math_with_autograder	Hard math benchmarks (e.g. IMO AnswerBench)	math-verify + LLM autograder	✓
polymath	Multilingual math across 18 languages and 4 difficulty tiers	LLM judge (weighted)	✓
imo_gradingbench	Four-class IMO proof grading	Last-word extraction	✓
imo_proofbench_judge	IMO ProofBench with 0–7 rubric	LLM judge	✓
proof_verification	Proof scoring against ground truth	LLM judge + meta-verifier
physics_judge	Open-ended physics QA	LLM judge + math-verify	✓
ugphysics_judge	Undergraduate physics benchmarks	LLM judge (TRUE/FALSE) + math-verify	✓
newton_bench	Scientific law discovery across 12 physics domains	Execution
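
As an illustration of the "Boxed answer + numeric match" style of verification listed for math_with_code, here is a minimal sketch. It is not NeMo Gym's implementation: the function names `extract_boxed` and `numeric_reward`, the comma stripping, and the relative tolerance are all assumptions made for this example.

```python
import math
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    # Return the contents of the LAST \boxed{...} in the text,
    # tracking brace depth so nested braces are kept intact.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def numeric_reward(response: str, expected: float, rel_tol: float = 1e-6) -> float:
    # Hypothetical reward: 1.0 if the boxed value parses as a number
    # close to the expected answer, otherwise 0.0.
    boxed = extract_boxed(response)
    if boxed is None:
        return 0.0
    try:
        value = float(boxed.replace(",", ""))  # assumption: commas are thousands separators
    except ValueError:
        return 0.0
    return 1.0 if math.isclose(value, expected, rel_tol=rel_tol) else 0.0
```

Symbolic answers such as fractions fail the numeric parse in this sketch and score 0.0, which is one reason several rows above pair a parser like math-verify with an LLM judge.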

Coding

Environment	Description	Verification	Benchmark
bigcodebench	BigCodeBench Python solutions against unittest suite	Code execution	✓
evalplus	HumanEval+ and MBPP+ function completion	Code execution	✓
code_gen	Competitive coding problem solving	Code execution	✓
competitive_coding_challenges	Contest-style programming problems	Code execution	✓
code_fim	Code fill-in-the-middle (HumanEval-Infilling)	Code execution	✓
bird_sql	Text-to-SQL on BIRD dev (1,534 SQLite tasks)	SQL result-set equality	✓
spider2_lite	Text-to-SQL on Spider 2.0-Lite (135 enterprise tasks)	SQL result-set equality	✓
text_to_sql	Text-to-SQL across multiple SQL dialects	LLM judge (SQL equivalence)	✓
swerl_gen	SWE patch and test generation in a sandboxed environment	Code execution (pytest)	✓
scicode	Multi-step scientific code generation	Code execution	✓
cvdp	RTL hardware design code generation	Code execution (simulation)	✓
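
The "SQL result-set equality" check listed for bird_sql and spider2_lite can be sketched with Python's standard `sqlite3` module. This is an assumption-laden illustration rather than the environments' actual code: the function name `result_sets_equal`, the `ordered` flag, and the choice to compare rows as multisets by default are all hypothetical.

```python
import sqlite3
from collections import Counter


def result_sets_equal(
    conn: sqlite3.Connection,
    predicted_sql: str,
    gold_sql: str,
    ordered: bool = False,
) -> bool:
    # Run both queries against the same database and compare the rows.
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute cannot match
    gold_rows = conn.execute(gold_sql).fetchall()
    if ordered:
        return predicted_rows == gold_rows
    # Assumption: row order is ignored, but duplicate rows still count.
    return Counter(predicted_rows) == Counter(gold_rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 3)])
    print(result_sets_equal(conn, "SELECT name FROM t WHERE x > 1",
                            "SELECT name FROM t WHERE x >= 2"))
```

Comparing executed results instead of query text lets syntactically different but equivalent queries earn the same reward.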

Knowledge & Reasoning

Environment	Description	Verification	Benchmark
gpqa_diamond	Graduate-level science multiple choice (GPQA Diamond)	Exact match	✓
mcqa	Multiple-choice QA covering MMLU, GPQA, HLE	Exact match	✓
reasoning_gym	100+ tasks: algebra, logic, geometry, graph theory, games	Exact match
hotpotqa_qa	Closed-book multi-hop QA (HotPotQA)	SQuAD-style substring match	✓
simpleqa	Short-form factual QA with abstention scoring	LLM judge (3-tier)	✓
omniscience	Factual recall and calibration QA	LLM judge	✓
labbench2_vlm	Scientific VLM QA: figures, tables, lab protocols	LLM judge	✓
arc_agi	Abstract reasoning puzzles (ARC-AGI)	Exact match (grid)	✓
nvarc	ARC-AGI in inductive (Python) and transductive (grid) modes	Code execution / exact match	✓
multichallenge	Multi-turn inference memory and instruction retention	LLM judge (rubric)
mrcr	Multi-round coreference resolution	F1 (SequenceMatcher)	✓
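
For the "SQuAD-style substring match" listed for hotpotqa_qa, the sketch below shows one plausible reading: normalize both strings the way the SQuAD evaluation script does (lowercase, strip punctuation and English articles, collapse whitespace), then test whether any gold answer appears inside the response. The function names and the 0.0/1.0 reward are assumptions for illustration, not the environment's actual code.

```python
import re
import string
from typing import Iterable

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation,
    # drop the articles a/an/the, and collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def substring_match(response: str, gold_answers: Iterable[str]) -> float:
    # Hypothetical reward: 1.0 if any normalized gold answer is
    # contained in the normalized response, else 0.0.
    normalized_response = normalize_answer(response)
    for gold in gold_answers:
        normalized_gold = normalize_answer(gold)
        if normalized_gold and normalized_gold in normalized_response:
            return 1.0
    return 0.0
```

Substring matching is lenient toward verbose responses, so it rewards a correct answer embedded in a longer sentence where strict exact match would not.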

Agentic / Tool Use

Environment	Description	Verification	Benchmark
workplace_assistant	Workplace tasks: 26 tools, 5 databases, 690 tasks	Rule-based (task completion)	✓
aviary	Multi-hop QA with Wikipedia search + GSM8k with calculator	LLM judge + execution	✓
tavily_search	Web search tool use (Tavily API)	Execution + optional LLM judge	✓
calendar	Multi-turn calendar scheduling with constraint satisfaction	Rule-based (constraints)	✓
finance_sec_search	SEC EDGAR filing search for financial analysis	LLM judge + execution	✓
google_search	MCQA with integrated Google search tool	Exact match
xlam_fc	Function calling from Salesforce xlam-60k	Exact match	✓
single_step_tool_use_with_argument_comparison	Pivot RL for tool use across conversational, SWE, and search domains	Argument comparison	✓
math_with_code	Math with calculator tool use	Boxed answer + numeric match
rdkit_chemistry	Molecular chemistry with RDKit tool use	Execution (RDKit)

Instruction Following & Safety

Environment	Description	Verification	Benchmark
instruction_following	IFEval and IFBench-style instruction following	LLM judge
ifbench	IFBench with 57 instruction types (AllenAI library)	LLM judge	✓
format_verification	Citation format and freeform text formatting	Regex / rule-based
structeval	StructEval: JSON, YAML, CSV, TOML, XML schema adherence	Rule-based (schema parsing)
structured_outputs	Schema adherence across structured output formats	Rule-based (schema validation)	✓
indirect_prompt_injection	Resistance to injected instructions in tool-use trajectories	Rule-based (attack detection)	✓
jailbreak_detection	Jailbreak resistance with Nemotron judge	LLM judge	✓
xstest	Over-refusal calibration (XSTest)	Rule-based	✓
over_refusal_detection	Training to avoid refusals of safe prompts	LLM judge

Other

Environment	Description	Verification	Benchmark
asr_with_pc	ASR with WER (standard, case-sensitive, punctuation)	Execution (WER metrics)	✓
wmt_translation	Machine translation with BLEU and xCOMET-XXL	Execution (COMET metrics)	✓
longmt_eval	Document-level translation (SEGALE pipeline + COMETKiwi)	Execution (neural QE)	✓
vlm_eval_kit	VLM benchmarks: MMBench, OCRBench, and others	Execution (VLMEvalKit)	✓
graphwalks	Long-context graph BFS/DFS reasoning	F1 over node sets	✓
blackjack	Gymnasium-style Blackjack (multi-step)	Win/draw/loss
grl_sokoban	Single-box Sokoban puzzle	Execution (puzzle solved)
speed_bench	Speculative-decoding throughput measurement	vLLM Prometheus metrics	✓
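
The "F1 over node sets" metric listed for graphwalks can be sketched as a set-based F1 score. The function name `set_f1` and the convention that two empty sets score 1.0 are assumptions for this example; the environment may handle edge cases differently.

```python
from typing import Hashable, Iterable


def set_f1(predicted: Iterable[Hashable], expected: Iterable[Hashable]) -> float:
    # Treat both answers as sets of node identifiers and compute F1.
    predicted_set, expected_set = set(predicted), set(expected)
    if not predicted_set and not expected_set:
        return 1.0  # assumption: an empty prediction for an empty target is correct
    true_positives = len(predicted_set & expected_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(expected_set)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting nodes a, b, c when the target is b, c, d gives precision and recall of 2/3 each, so the F1 score is also 2/3. Unlike exact match, this gives partial credit for partially correct traversals.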

Add a Benchmark

Use the contribution checklist to add a new benchmark.

Build Verifiers

Learn how environments verify agent behavior and compute rewards.