Task Verification
Goal: Understand what task verification is and how rewards drive model training.
What is Verification?
Every resources server in NeMo Gym implements a verify() function that returns a reward value for task performance.
The Problem: When you ran the weather example in the quickstart, the agent successfully called the tool and provided a response. But was that response good? Should the model be rewarded or penalized? Without verification, you cannot measure performance or guide improvement.
The Solution: Each resources server must define exactly what “good performance” means for its domain.
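Conceptually, verify() maps a finished response (plus any ground truth) to a score. The snippet below is a deliberately minimal sketch of that idea, not the actual NeMo Gym interface:

```python
# Minimal conceptual sketch, not the real NeMo Gym signature:
# a resources server turns a finished response into a scalar reward.

def verify(response: str, ground_truth: str) -> float:
    """Return a reward in [0.0, 1.0] reflecting task performance."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0
```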
Why Verification Matters
Tool Execution ≠ Good Performance
- The agent issued the right tool call, e.g., get_weather("San Francisco")
- But did it give helpful advice? Was the response accurate? Was it efficient?
- Verification answers these questions with numerical scores
Training Signal
Verification scores become the reward signals that drive reinforcement learning:
- High scores → “Do more of this behavior”
- Low scores → “Avoid this behavior”
- No verification = No way to improve the model
Common Verification Patterns
Let’s look at real examples from NeMo Gym’s resources servers:
NeMo Gym's servers cover three recurring patterns: correctness verification (is the answer right?), quality verification (is the response helpful and well-formed?), and efficiency verification (did the agent succeed with minimal steps?). The examples below focus on correctness.
Simple Correctness (mcqa - Multiple Choice Questions):
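The real mcqa server has its own parsing and scoring logic; as a simplified sketch under that caveat, binary multiple-choice verification can be as small as extracting the chosen letter and comparing it to the answer key:

```python
import re

# Simplified sketch (not the actual mcqa implementation): take the last
# standalone A-D letter in the response as the model's choice and compare
# it to the answer key. Reward is binary: 1.0 for a match, else 0.0.

def verify_mcqa(response: str, correct_choice: str) -> float:
    letters = re.findall(r"\b([A-D])\b", response.upper())
    if not letters:
        return 0.0  # no parseable choice at all
    return 1.0 if letters[-1] == correct_choice.upper() else 0.0
```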
Sophisticated Correctness (math_with_judge - Math Problems):
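Math answers resist exact matching ("1/2" vs. "0.5", factored vs. expanded forms), so a layered approach helps: a cheap programmatic check first, with an LLM judge as the fallback. The names and details below are illustrative, not the actual math_with_judge code:

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Grab the text after an 'Answer:' marker, if present."""
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else None

def verify_math(response: str, ground_truth: str, judge_equivalent) -> float:
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0                                   # nothing to grade
    if answer.replace(" ", "") == ground_truth.replace(" ", ""):
        return 1.0                                   # exact match, no judge needed
    # Equivalent-but-different forms go to the judge: a callable that
    # returns True/False for mathematical equivalence.
    return 1.0 if judge_equivalent(answer, ground_truth) else 0.0
```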
What is LLM-as-a-judge?
Some tasks don't have a clean programmatic solution, or the ground truth is difficult to verify. The "right" answer might be phrased many ways, or "good" might mean satisfying a rubric (e.g., "does it follow instructions?", "does it answer the question?", "is it safe/appropriate?").
LLM-as-a-judge means sending the candidate output to another language model with instructions such as "decide whether this is correct/equivalent/compliant," then parsing the judge's answer (e.g., "yes" or "no", a score, or "A=B") and turning it into your reward.
In NeMo Gym, that call usually happens inside the resources server’s verify() method. The policy produces a rollout, and verification may internally call a second model to grade it. Servers such as equivalence_llm_judge, multichallenge, and text_to_sql are concrete examples of this pattern.
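Stripped to its essentials, the pattern might look like the sketch below. Here call_judge is a placeholder for however you reach the judge model (an HTTP client, an SDK call, etc.), not a NeMo Gym API:

```python
# The judge prompt and parsing are illustrative; real servers use more
# robust prompts and verdict formats.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Are the candidate and reference equivalent? Reply yes or no."
)

def judge_reward(call_judge, question: str, reference: str, candidate: str) -> float:
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    verdict = call_judge(prompt)  # judge model's raw text reply
    # Parse the free-text verdict into a binary reward.
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```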
From Verification to Training
How Rewards Drive Learning
1. The model generates a response, which receives a verification score
2. The RL algorithm uses the score to update model parameters
3. Higher-scoring behaviors become more likely
4. Lower-scoring behaviors become less likely
5. The model improves over many training iterations (a toy version of this loop is sketched below)
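The sketch below is a self-contained toy, not NeMo Gym's trainer: a two-option "policy" learns to prefer whichever canned response verify() rewards, which is the loop above in miniature:

```python
import math
import random

def verify(response: str) -> float:
    """Toy verifier: reward responses that mention an umbrella."""
    return 1.0 if "umbrella" in response else 0.0

responses = ["Bring an umbrella, rain is likely.", "It's sunny, wear sunscreen."]
logits = [0.0, 0.0]              # the "policy": one logit per response
lr, baseline = 0.5, 0.5

for _ in range(200):
    weights = [math.exp(x) for x in logits]
    probs = [w / sum(weights) for w in weights]
    i = random.choices(range(len(responses)), weights=probs)[0]  # 1. generate
    reward = verify(responses[i])                                # scored by verify()
    logits[i] += lr * (reward - baseline)   # 2-4. reinforce or suppress

print(f"P(umbrella response) = {probs[0]:.2f}")  # 5. approaches 1.0 over training
```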
What Makes Good Verification
- Reliable: Same response should get same score consistently
- Meaningful: Scores should reflect actual task performance
- Scalable: Can handle thousands of evaluations per second during training
Real-World Verification Examples
Math Tutoring:
- Correctness: Did the model solve the problem correctly?
- Pedagogy: Did the model explain the steps clearly?
- Efficiency: Did the model use the simplest method?
Domains such as customer service and code generation define analogous criteria for their own notions of success.
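One common way to turn several criteria into a single training signal is a weighted sum of sub-scores. The function and weights below are purely illustrative, not taken from any NeMo Gym server:

```python
# Hypothetical multi-criteria verifier for the tutoring example.
# Each sub-score is in [0, 1]; weights encode how much each criterion matters.

def verify_tutoring(correctness: float, pedagogy: float, efficiency: float) -> float:
    return 0.6 * correctness + 0.3 * pedagogy + 0.1 * efficiency

print(verify_tutoring(correctness=1.0, pedagogy=0.5, efficiency=1.0))  # 0.85
```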
What You’ve Learned
This verification system is what makes NeMo Gym powerful for model training:
- Resources servers provide verification logic
- Verification patterns vary by domain but follow common principles
- Reward signals from verification drive model improvement through RL
- Good verification is reliable, meaningful, and scalable