Browse Environments

NeMo Gym currently includes 90+ environments covering math, coding, reasoning, knowledge, agentic tool use, instruction following, and safety. Each environment is a resources server that defines a dataset, verification logic, and optional tools.

Math

| Environment | Description | Verification | Benchmark |
| --- | --- | --- | --- |
| math_with_judge | OpenMathReasoning, DAPO, and MathStackOverflow datasets | math-verify + LLM judge | |
| math_with_code | Competitive math with calculator tools | Boxed answer + numeric match | |
| math_formal_lean | Lean4 formal proof verification | Lean4 compiler | |
| math_with_autograder | Hard math benchmarks (e.g. IMO AnswerBench) | math-verify + LLM autograder | |
| polymath | Multilingual math across 18 languages and 4 difficulty tiers | LLM judge (weighted) | |
| imo_gradingbench | Four-class IMO proof grading | Last-word extraction | |
| imo_proofbench_judge | IMO ProofBench with 0–7 rubric | LLM judge | |
| proof_verification | Proof scoring against ground truth | LLM judge + meta-verifier | |
| physics_judge | Open-ended physics QA | LLM judge + math-verify | |
| ugphysics_judge | Undergraduate physics benchmarks | LLM judge (TRUE/FALSE) + math-verify | |
| newton_bench | Scientific law discovery across 12 physics domains | Execution | |
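
As an illustration of what a "boxed answer + numeric match" check can look like, here is a minimal sketch. It is not NeMo Gym's implementation: the function names `extract_boxed`, `numeric_match`, and `reward`, the relative tolerance, and the 0/1 reward convention are all assumptions made for this example.

```python
import math
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested braces stay intact.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def numeric_match(predicted: str, expected: str, rel_tol: float = 1e-6) -> bool:
    """Compare as floats when both parse; otherwise fall back to string equality."""
    try:
        p = float(predicted.replace(",", ""))
        e = float(expected.replace(",", ""))
    except ValueError:
        return predicted.strip() == expected.strip()
    return math.isclose(p, e, rel_tol=rel_tol)


def reward(response: str, expected: str) -> float:
    """Hypothetical 0/1 reward: 1.0 only if a boxed answer exists and matches."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    return 1.0 if numeric_match(answer, expected) else 0.0
```

A response without a `\boxed{...}` span scores 0.0 even when the right number appears in the text, which is the usual reason to pair this kind of check with a format instruction in the prompt.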

Coding

| Environment | Description | Verification | Benchmark |
| --- | --- | --- | --- |
| bigcodebench | BigCodeBench Python solutions against unittest suite | Code execution | |
| evalplus | HumanEval+ and MBPP+ function completion | Code execution | |
| code_gen | Competitive coding problem solving | Code execution | |
| competitive_coding_challenges | Contest-style programming problems | Code execution | |
| code_fim | Code fill-in-the-middle (HumanEval-Infilling) | Code execution | |
| bird_sql | Text-to-SQL on BIRD dev (1,534 SQLite tasks) | SQL result-set equality | |
| spider2_lite | Text-to-SQL on Spider 2.0-Lite (135 enterprise tasks) | SQL result-set equality | |
| text_to_sql | Text-to-SQL across multiple SQL dialects | LLM judge (SQL equivalence) | |
| swerl_gen | SWE patch and test generation in a sandboxed environment | Code execution (pytest) | |
| scicode | Multi-step scientific code generation | Code execution | |
| cvdp | RTL hardware design code generation | Code execution (simulation) | |
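
To make "SQL result-set equality" concrete, the sketch below runs a predicted query and a gold query against the same SQLite database and compares the returned rows. It is an illustrative assumption, not the code used by `bird_sql` or `spider2_lite`: the helper name `result_sets_equal`, the `ordered` flag, and the toy `orders` table are invented for this example.

```python
import sqlite3
from collections import Counter


def result_sets_equal(conn: sqlite3.Connection, predicted_sql: str,
                      gold_sql: str, ordered: bool = False) -> bool:
    """Return True if both queries yield the same rows on this database."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    gold = conn.execute(gold_sql).fetchall()
    if ordered:
        return predicted == gold
    # Compare as multisets: duplicate rows count, row order does not.
    return Counter(predicted) == Counter(gold)


# Toy database standing in for a benchmark's SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders (customer, total)
    VALUES ('ada', 10.0), ('bob', 25.0), ('ada', 5.0);
""")

gold = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
pred = ("SELECT customer, SUM(total) FROM orders "
        "GROUP BY customer ORDER BY customer DESC")

print(result_sets_equal(conn, pred, gold))
```

Comparing results rather than SQL text lets syntactically different but equivalent queries pass. The trade-off is that two queries can agree on one database by coincidence, so the check is only as strong as the data it runs against.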

Knowledge & Reasoning

| Environment | Description | Verification | Benchmark |
| --- | --- | --- | --- |
| gpqa_diamond | Graduate-level science multiple choice (GPQA Diamond) | Exact match | |
| mcqa | Multiple-choice QA covering MMLU, GPQA, HLE | Exact match | |
| reasoning_gym | 100+ tasks: algebra, logic, geometry, graph theory, games | Exact match | |
| hotpotqa_qa | Closed-book multi-hop QA (HotPotQA) | SQuAD-style substring match | |
| simpleqa | Short-form factual QA with abstention scoring | LLM judge (3-tier) | |
| omniscience | Factual recall and calibration QA | LLM judge | |
| labbench2_vlm | Scientific VLM QA: figures, tables, lab protocols | LLM judge | |
| arc_agi | Abstract reasoning puzzles (ARC-AGI) | Exact match (grid) | |
| nvarc | ARC-AGI in inductive (Python) and transductive (grid) modes | Code execution / exact match | |
| multichallenge | Multi-turn inference memory and instruction retention | LLM judge (rubric) | |
| mrcr | Multi-round coreference resolution | F1 (SequenceMatcher) | |
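
A "SQuAD-style substring match" can be sketched as follows. The normalization (lowercase, strip punctuation, drop English articles, collapse whitespace) follows the widely used SQuAD evaluation convention. The function names and the choice to skip gold answers that normalize to an empty string are assumptions for this example, not details taken from `hotpotqa_qa`.

```python
import re
import string
from typing import Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def substring_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if any normalized gold answer occurs inside the normalized prediction."""
    pred = normalize_answer(prediction)
    for gold in gold_answers:
        norm_gold = normalize_answer(gold)
        # Skip golds that normalize to "" since "" is a substring of everything.
        if norm_gold and norm_gold in pred:
            return True
    return False


print(substring_match("It was built in Paris, France.", ["Paris"]))
```

Because the check is character-based, a short gold answer can match inside a longer unrelated word. Matching on token boundaries is a stricter variant if that matters for a given dataset.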

Agentic / Tool Use

| Environment | Description | Verification | Benchmark |
| --- | --- | --- | --- |
| workplace_assistant | Workplace tasks: 26 tools, 5 databases, 690 tasks | Rule-based (task completion) | |
| aviary | Multi-hop QA with Wikipedia search + GSM8k with calculator | LLM judge + execution | |
| tavily_search | Web search tool use (Tavily API) | Execution + optional LLM judge | |
| calendar | Multi-turn calendar scheduling with constraint satisfaction | Rule-based (constraints) | |
| finance_sec_search | SEC EDGAR filing search for financial analysis | LLM judge + execution | |
| google_search | MCQA with integrated Google search tool | Exact match | |
| xlam_fc | Function calling from Salesforce xlam-60k | Exact match | |
| single_step_tool_use_with_argument_comparison | Pivot RL for tool use across conversational, SWE, and search domains | Argument comparison | |
| math_with_code | Math with calculator tool use | Boxed answer match | |
| rdkit_chemistry | Molecular chemistry with RDKit tool use | Execution (RDKit) | |
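
For function-calling environments, an exact-match or argument-comparison check might look roughly like the sketch below. The call format (a dict with `name` and `arguments`, where `arguments` may be a JSON string) mirrors a common tool-call shape but is an assumption here, as are the helper names. The real environments may compare calls differently.

```python
import json
from typing import Any


def parse_arguments(raw: Any) -> Any:
    """Decode arguments that arrive as a JSON string; pass other values through."""
    if isinstance(raw, str):
        return json.loads(raw)
    return raw


def calls_match(predicted: dict, expected: dict) -> bool:
    """True if the tool name and the decoded arguments are both equal."""
    if predicted.get("name") != expected.get("name"):
        return False
    try:
        pred_args = parse_arguments(predicted.get("arguments", {}))
        exp_args = parse_arguments(expected.get("arguments", {}))
    except json.JSONDecodeError:
        return False  # malformed arguments never match
    # Dict equality ignores key order, so argument order does not matter.
    return pred_args == exp_args


expected = {"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}
predicted = {"name": "get_weather", "arguments": '{"days": 3, "city": "Oslo"}'}
print(calls_match(predicted, expected))
```

Decoding before comparing avoids penalizing a model for key order or whitespace in the serialized arguments, which a raw string comparison would do.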

Instruction Following & Safety

| Environment | Description | Verification | Benchmark |
| --- | --- | --- | --- |
| instruction_following | IFEval and IFBench-style instruction following | LLM judge | |
| ifbench | IFBench with 57 instruction types (AllenAI library) | LLM judge | |
| format_verification | Citation format and freeform text formatting | Regex / rule-based | |
| structeval | StructEval: JSON, YAML, CSV, TOML, XML schema adherence | Rule-based (schema parsing) | |
| structured_outputs | Schema adherence across structured output formats | Rule-based (schema validation) | |
| indirect_prompt_injection | Resistance to injected instructions in tool-use trajectories | Rule-based (attack detection) | |
| jailbreak_detection | Jailbreak resistance with Nemotron judge | LLM judge | |
| xstest | Over-refusal calibration (XSTest) | Rule-based | |
| over_refusal_detection | Train models to avoid refusing safe prompts | LLM judge | |
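
A rule-based schema check for structured outputs can be as small as the sketch below, which parses a response as JSON and verifies required keys and value types. The "schema" here is a hypothetical mapping from key to Python type chosen for brevity. It is not a JSON Schema validator and not the logic used by `structeval` or `structured_outputs`.

```python
import json


def validate_json_output(response: str, required: dict) -> bool:
    """True if response is a JSON object containing every required key
    with a value of the expected Python type."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False  # extra prose around the JSON also lands here
    if not isinstance(data, dict):
        return False
    for key, expected_type in required.items():
        if key not in data:
            return False
        value = data[key]
        # bool is a subclass of int in Python; reject it unless bool is expected.
        if isinstance(value, bool) and expected_type is not bool:
            return False
        if not isinstance(value, expected_type):
            return False
    return True


schema = {"name": str, "age": int, "tags": list}
print(validate_json_output('{"name": "Ada", "age": 36, "tags": []}', schema))
```

Checks like this are deterministic and cheap, which is why schema adherence is typically verified by rules rather than by an LLM judge.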

Other

| Environment | Description | Verification | Benchmark |
| --- | --- | --- | --- |
| asr_with_pc | ASR with WER (standard, case-sensitive, punctuation) | Execution (WER metrics) | |
| wmt_translation | Machine translation with BLEU and xCOMET-XXL | Execution (COMET metrics) | |
| longmt_eval | Document-level translation (SEGALE pipeline + COMETKiwi) | Execution (neural QE) | |
| vlm_eval_kit | VLM benchmarks: MMBench, OCRBench, and others | Execution (VLMEvalKit) | |
| graphwalks | Long-context graph BFS/DFS reasoning | F1 over node sets | |
| blackjack | Gymnasium-style Blackjack (multi-step) | Win/draw/loss | |
| grl_sokoban | Single-box Sokoban puzzle | Execution (puzzle solved) | |
| speed_bench | Speculative-decoding throughput measurement | vLLM Prometheus metrics | |
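
Word error rate (WER), the metric named for `asr_with_pc`, is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below computes it with a standard dynamic program. The `normalize` helper is a hypothetical stand-in for a case- and punctuation-insensitive variant and is not taken from the environment's code.

```python
import string


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


print(word_error_rate("Hello, world!", "hello world"))
print(word_error_rate(normalize("Hello, world!"), normalize("hello world")))
```

Scoring the same transcript with and without normalization shows how much of the error comes from casing and punctuation rather than from misrecognized words.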