Browse Environments
NeMo Gym currently includes 90+ environments covering math, coding, reasoning, knowledge, agentic tool use, instruction following, and safety. Each environment is a resources server that defines a dataset, verification logic, and optional tools.
Math
| Environment | Description | Verification | Benchmark |
|---|---|---|---|
| math_with_judge | OpenMathReasoning, DAPO, and MathStackOverflow datasets | math-verify + LLM judge | ✓ |
| math_with_code | Competitive math with calculator tools | Boxed answer + numeric match | |
| math_formal_lean | Lean4 formal proof verification | Lean4 compiler | |
| math_with_autograder | Hard math benchmarks (e.g. IMO AnswerBench) | math-verify + LLM autograder | ✓ |
| polymath | Multilingual math across 18 languages and 4 difficulty tiers | LLM judge (weighted) | ✓ |
| imo_gradingbench | Four-class IMO proof grading | Last-word extraction | ✓ |
| imo_proofbench_judge | IMO ProofBench with 0–7 rubric | LLM judge | ✓ |
| proof_verification | Proof scoring against ground truth | LLM judge + meta-verifier | |
| physics_judge | Open-ended physics QA | LLM judge + math-verify | ✓ |
| ugphysics_judge | Undergraduate physics benchmarks | LLM judge (TRUE/FALSE) + math-verify | ✓ |
| newton_bench | Scientific law discovery across 12 physics domains | Execution | |
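
To make the "Boxed answer + numeric match" verification style listed for math_with_code concrete, here is a minimal sketch. It is not the NeMo Gym implementation: the function names `extract_boxed` and `numeric_match`, the relative tolerance, and the 0.0/1.0 reward convention are all assumptions made for illustration.

```python
import math
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper: tracks brace depth so nested braces
    such as \\boxed{\\frac{1}{2}} are captured whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def numeric_match(response: str, expected: float, rel_tol: float = 1e-6) -> float:
    """Reward 1.0 if the boxed answer parses as a number close to expected, else 0.0."""
    boxed = extract_boxed(response)
    if boxed is None:
        return 0.0
    try:
        value = float(boxed.replace(",", ""))  # tolerate thousands separators
    except ValueError:
        return 0.0  # non-numeric answers would need a symbolic checker instead
    return 1.0 if math.isclose(value, expected, rel_tol=rel_tol) else 0.0
```

A purely numeric comparison like this cannot score symbolic answers, which is presumably one reason other environments in the table pair math-verify with an LLM judge.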
Coding
| Environment | Description | Verification | Benchmark |
|---|---|---|---|
| bigcodebench | BigCodeBench Python solutions against unittest suite | Code execution | ✓ |
| evalplus | HumanEval+ and MBPP+ function completion | Code execution | ✓ |
| code_gen | Competitive coding problem solving | Code execution | ✓ |
| competitive_coding_challenges | Contest-style programming problems | Code execution | ✓ |
| code_fim | Code fill-in-the-middle (HumanEval-Infilling) | Code execution | ✓ |
| bird_sql | Text-to-SQL on BIRD dev (1,534 SQLite tasks) | SQL result-set equality | ✓ |
| spider2_lite | Text-to-SQL on Spider 2.0-Lite (135 enterprise tasks) | SQL result-set equality | ✓ |
| text_to_sql | Text-to-SQL across multiple SQL dialects | LLM judge (SQL equivalence) | ✓ |
| swerl_gen | SWE patch and test generation in a sandboxed environment | Code execution (pytest) | ✓ |
| scicode | Multi-step scientific code generation | Code execution | ✓ |
| cvdp | RTL hardware design code generation | Code execution (simulation) | ✓ |
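
The "SQL result-set equality" check listed for bird_sql and spider2_lite can be illustrated with the standard-library `sqlite3` module. This is only a sketch under stated assumptions: the `emp` table, the helper names `make_demo_db` and `result_sets_equal`, and the choice to compare rows as an unordered multiset by default are hypothetical, and the actual environments may handle ordering, column names, or timeouts differently.

```python
import sqlite3
from collections import Counter


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
        INSERT INTO emp VALUES ('a', 'x', 10), ('b', 'x', 20), ('c', 'y', 30);
        """
    )
    return conn


def result_sets_equal(
    conn: sqlite3.Connection,
    predicted_sql: str,
    gold_sql: str,
    ordered: bool = False,
) -> bool:
    """Run both queries and compare the rows they return."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    gold_rows = conn.execute(gold_sql).fetchall()
    if ordered:
        return pred_rows == gold_rows
    # Unordered comparison that still respects duplicate rows.
    return Counter(pred_rows) == Counter(gold_rows)
```

Comparing results rather than query text lets two differently written but equivalent queries both score as correct, at the cost of occasional false positives when different queries happen to agree on a particular database.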
Knowledge & Reasoning
| Environment | Description | Verification | Benchmark |
|---|---|---|---|
| gpqa_diamond | Graduate-level science multiple choice (GPQA Diamond) | Exact match | ✓ |
| mcqa | Multiple-choice QA covering MMLU, GPQA, HLE | Exact match | ✓ |
| reasoning_gym | 100+ tasks: algebra, logic, geometry, graph theory, games | Exact match | |
| hotpotqa_qa | Closed-book multi-hop QA (HotPotQA) | SQuAD-style substring match | ✓ |
| simpleqa | Short-form factual QA with abstention scoring | LLM judge (3-tier) | ✓ |
| omniscience | Factual recall and calibration QA | LLM judge | ✓ |
| labbench2_vlm | Scientific VLM QA: figures, tables, lab protocols | LLM judge | ✓ |
| arc_agi | Abstract reasoning puzzles (ARC-AGI) | Exact match (grid) | ✓ |
| nvarc | ARC-AGI in inductive (Python) and transductive (grid) modes | Code execution / exact match | ✓ |
| multichallenge | Multi-turn inference memory and instruction retention | LLM judge (rubric) | |
| mrcr | Multi-round coreference resolution | F1 (SequenceMatcher) | ✓ |
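
As an illustration of the "SQuAD-style substring match" listed for hotpotqa_qa, the sketch below applies the usual SQuAD answer normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) and then checks whether any gold answer appears inside the response. The function names and the 0.0/1.0 reward are assumptions; the environment's own matching rules may differ.

```python
import re
import string
from typing import Sequence

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def substring_match(response: str, gold_answers: Sequence[str]) -> float:
    """Reward 1.0 if any normalized gold answer occurs in the normalized response."""
    norm_response = normalize(response)
    for gold in gold_answers:
        norm_gold = normalize(gold)
        # Skip answers that normalize to nothing; "" is a substring of everything.
        if norm_gold and norm_gold in norm_response:
            return 1.0
    return 0.0
```

Substring matching is more lenient than exact match, so a verbose response that happens to mention the gold answer still receives credit.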
Agentic / Tool Use
| Environment | Description | Verification | Benchmark |
|---|---|---|---|
| workplace_assistant | Workplace tasks: 26 tools, 5 databases, 690 tasks | Rule-based (task completion) | ✓ |
| aviary | Multi-hop QA with Wikipedia search + GSM8k with calculator | LLM judge + execution | ✓ |
| tavily_search | Web search tool use (Tavily API) | Execution + optional LLM judge | ✓ |
| calendar | Multi-turn calendar scheduling with constraint satisfaction | Rule-based (constraints) | ✓ |
| finance_sec_search | SEC EDGAR filing search for financial analysis | LLM judge + execution | ✓ |
| google_search | MCQA with integrated Google search tool | Exact match | |
| xlam_fc | Function calling from Salesforce xlam-60k | Exact match | ✓ |
| single_step_tool_use_with_argument_comparison | Pivot RL for tool use across conversational, SWE, and search domains | Argument comparison | ✓ |
| math_with_code | Math with calculator tool use | Boxed answer match | |
| rdkit_chemistry | Molecular chemistry with RDKit tool use | Execution (RDKit) | |
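
The "Exact match" and "Argument comparison" checks listed for the function-calling environments can be pictured as comparing a predicted tool call against an expected one. The sketch below assumes a hypothetical call format, a dict with `name` and `arguments` keys where `arguments` is either a dict or a JSON-encoded string; the real environments may use a different schema or looser per-argument rules.

```python
import json
from typing import Any, Optional


def parse_arguments(arguments: Any) -> Optional[dict]:
    """Accept a dict or a JSON-encoded object string; return None if unusable."""
    if isinstance(arguments, dict):
        return arguments
    if isinstance(arguments, str):
        try:
            parsed = json.loads(arguments)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, dict) else None
    return None


def tool_call_matches(predicted: dict, expected: dict) -> bool:
    """True when the tool name and every argument value agree exactly."""
    if predicted.get("name") != expected.get("name"):
        return False
    pred_args = parse_arguments(predicted.get("arguments"))
    gold_args = parse_arguments(expected.get("arguments"))
    if pred_args is None or gold_args is None:
        return False  # malformed arguments never count as a match
    # Dict equality ignores key order, so {"a": 1, "b": 2} == {"b": 2, "a": 1}.
    return pred_args == gold_args
```

Parsing the arguments before comparing them avoids penalizing a model for harmless differences in key order or JSON whitespace.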
Instruction Following & Safety
| Environment | Description | Verification | Benchmark |
|---|---|---|---|
| instruction_following | IFEval and IFBench-style instruction following | LLM judge | |
| ifbench | IFBench with 57 instruction types (AllenAI library) | LLM judge | ✓ |
| format_verification | Citation format and freeform text formatting | Regex / rule-based | |
| structeval | StructEval: JSON, YAML, CSV, TOML, XML schema adherence | Rule-based (schema parsing) | |
| structured_outputs | Schema adherence across structured output formats | Rule-based (schema validation) | ✓ |
| indirect_prompt_injection | Resistance to injected instructions in tool-use trajectories | Rule-based (attack detection) | ✓ |
| jailbreak_detection | Jailbreak resistance with Nemotron judge | LLM judge | ✓ |
| xstest | Over-refusal calibration (XSTest) | Rule-based | ✓ |
| over_refusal_detection | Train models to avoid refusing safe prompts | LLM judge | |
Other
| Environment | Description | Verification | Benchmark |
|---|---|---|---|
| asr_with_pc | ASR with WER (standard, case-sensitive, punctuation) | Execution (WER metrics) | ✓ |
| wmt_translation | Machine translation with BLEU and xCOMET-XXL | Execution (COMET metrics) | ✓ |
| longmt_eval | Document-level translation (SEGALE pipeline + COMETKiwi) | Execution (neural QE) | ✓ |
| vlm_eval_kit | VLM benchmarks: MMBench, OCRBench, and others | Execution (VLMEvalKit) | ✓ |
| graphwalks | Long-context graph BFS/DFS reasoning | F1 over node sets | ✓ |
| blackjack | Gymnasium-style Blackjack (multi-step) | Win/draw/loss | |
| grl_sokoban | Single-box Sokoban puzzle | Execution (puzzle solved) | |
| speed_bench | Speculative-decoding throughput measurement | vLLM Prometheus metrics | ✓ |
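
The "F1 over node sets" metric listed for graphwalks can be sketched as a set-based F1 between the nodes a model returns and the expected nodes. The function name `set_f1` and the convention of scoring 1.0 when both sets are empty are assumptions for illustration, not details taken from the environment.

```python
from typing import Hashable, Iterable


def set_f1(predicted: Iterable[Hashable], expected: Iterable[Hashable]) -> float:
    """Harmonic mean of precision and recall over two sets of node IDs."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0  # assumed convention: nothing expected, nothing predicted
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # fraction of predicted nodes that are correct
    recall = overlap / len(gold)     # fraction of expected nodes that were found
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this gives partial credit: a response that finds most of the reachable nodes but adds a few wrong ones still scores above zero.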