Task Verification

Goal: Understand what task verification is and how rewards drive model training.

What is Verification?

Every resources server in NeMo Gym implements a verify() function that returns a reward score measuring how well the task was performed.

The Problem: When you ran the weather example in the quickstart, the agent successfully called the tool and provided a response. But was that response good? Should the model be rewarded or penalized? Without verification, you cannot measure performance or guide improvement.

The Solution: Each resources server must define exactly what “good performance” means for its domain.
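
To make this concrete, here is a deliberately minimal sketch of the idea. The function shape and the city-substring check are illustrative assumptions for the quickstart weather task, not the actual server's code:

def verify(response: str, expected_city: str) -> float:
    """Hypothetical verifier for the quickstart weather task: reward the
    rollout only if it actually discusses the city that was asked about."""
    return 1.0 if expected_city.lower() in response.lower() else 0.0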

Why Verification Matters

Tool Execution ≠ Good Performance

  • The agent issued the right tool call, e.g., get_weather("San Francisco")
  • But did it give helpful advice? Was the response accurate? Was it efficient?
  • Verification answers these questions with numerical scores

Training Signal

Verification scores become the reward signals that drive reinforcement learning:

  • High scores → “Do more of this behavior”
  • Low scores → “Avoid this behavior”
  • No verification = No way to improve the model

Common Verification Patterns

Let’s look at real examples from NeMo Gym’s resources servers:

Simple Correctness (mcqa - Multiple Choice Questions):

# Extract model answer (A, B, C, or D)
pred = extract_answer_from_response(response)
gold = expected_answer  # e.g., "C"

# Binary scoring: right or wrong
is_correct = (pred == gold)
reward = 1.0 if is_correct else 0.0
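
The snippet leaves extract_answer_from_response undefined. One common approach, shown here as an illustrative assumption rather than mcqa's actual implementation, is a regex that pulls the final letter choice out of the response:

import re

def extract_answer_from_response(response: str) -> str | None:
    """Hypothetical extractor: return the last standalone A-D letter,
    e.g. "The answer is C." -> "C"."""
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else None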

Sophisticated Correctness (math_with_judge - Math Problems):

# Uses math-verify library for mathematical equivalence
library_reward = math_metric.compute(predicted_answer, expected_answer)

# PLUS an LLM judge for edge cases
judge_prompt = f"Are these answers equivalent? {predicted_answer} vs {expected_answer}"
judge_score = await llm_judge(judge_prompt)

# Combines both signals
final_reward = combine_scores(library_reward, judge_score)
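
combine_scores is left abstract above. One plausible combination strategy, shown here as an assumption rather than math_with_judge's actual logic, is to trust whichever signal accepts the answer:

def combine_scores(library_reward: float, judge_score: float) -> float:
    """Hypothetical combiner: taking the max lets the LLM judge rescue
    answers the library rejects for formatting reasons, and vice versa."""
    return max(library_reward, judge_score)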

What is LLM-as-a-judge?

Some tasks don’t have a clean programmatic solution, or the ground truth is difficult to verify. The “right” answer might be phrased in many ways, or “good” might mean satisfying a rubric (e.g., “does it follow instructions?”, “does it answer the question?”, “is it safe/appropriate?”).

LLM-as-a-judge means you send the candidate output to another language model with instructions such as “decide whether this output is correct/equivalent/compliant”, then parse the judge’s answer (e.g., “yes” or “no”, a numeric score, or a verdict like A=B) and turn it into your reward.

In NeMo Gym, that call usually happens inside the resources server’s verify() method. The policy produces a rollout, and verification may internally call a second model to grade it. Servers such as equivalence_llm_judge, multichallenge, and text_to_sql are concrete examples of this pattern.
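
As a minimal sketch of the pattern, assuming an OpenAI-compatible async client (the prompt wording, model name, and parsing below are illustrative, not taken from those servers):

async def judge_reward(candidate: str, reference: str, client) -> float:
    """Map a judge model's yes/no verdict onto a {0.0, 1.0} reward."""
    prompt = (
        "Answer strictly 'yes' or 'no': are these answers equivalent?\n"
        f"Candidate: {candidate}\nReference: {reference}"
    )
    completion = await client.chat.completions.create(
        model="judge-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0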

From Verification to Training

How Rewards Drive Learning

  1. Model generates response → Gets verification score
  2. RL algorithm uses score to update model parameters
  3. Higher-scoring behaviors become more likely
  4. Lower-scoring behaviors become less likely
  5. Model improves over many training iterations
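
To see this dynamic in miniature, here is a runnable toy (not NeMo Gym's trainer) where a two-action "policy" shifts probability toward whatever a stand-in verifier rewards:

import random

# Toy illustration: action probabilities drift toward rewarded behavior.
probs = {"good_answer": 0.5, "bad_answer": 0.5}

def verify(action: str) -> float:
    return 1.0 if action == "good_answer" else 0.0  # stand-in verifier

for _ in range(1000):
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    reward = verify(action)
    # Reward-weighted update: nudge the chosen action toward its reward.
    for a in probs:
        target = reward if a == action else 0.0
        probs[a] += 0.01 * (target - probs[a])
    total = sum(probs.values())  # renormalize into a valid distribution
    probs = {a: p / total for a, p in probs.items()}

print(probs)  # most of the probability mass ends up on "good_answer"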

What Makes Good Verification

Reliable: The same response should always get the same score

# Good: Deterministic scoring
reward = 1.0 if predicted_answer == expected_answer else 0.0

# Bad: Random or inconsistent scoring
reward = random.uniform(0.8, 1.0) if correct else random.uniform(0.0, 0.2)

Meaningful: Scores should reflect actual task performance

# Good: Measures what you care about
reward = accuracy_score + helpfulness_score + efficiency_score

# Bad: Measures irrelevant details
reward = 1.0 if response.startswith("Hello") else 0.0

Scalable: Can handle thousands of evaluations per second during training

# Good: Fast, local computation
reward = simple_string_match(predicted, expected)

# Bad: Expensive API calls for every verification
reward = await expensive_api_call(predicted, expected)
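
When an expensive call is unavoidable, one common mitigation, sketched here with a hypothetical wrapper around the expensive_api_call from the example above, is to memoize verdicts so identical rollouts are only scored once:

# Hypothetical memoization wrapper: pay the expensive judge call only
# once per unique (predicted, expected) pair, then reuse the verdict.
_cache: dict[tuple[str, str], float] = {}

async def cached_verify(predicted: str, expected: str) -> float:
    key = (predicted, expected)
    if key not in _cache:
        _cache[key] = await expensive_api_call(predicted, expected)
    return _cache[key]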

Real-World Verification Examples

  • Correctness: Did the model solve the problem correctly?
  • Pedagogy: Did the model explain the steps clearly?
  • Efficiency: Did the model use the simplest method?
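
A verifier that scores several of these dimensions at once might look like the following sketch; the helper checks and weights are assumptions for illustration, not taken from any NeMo Gym server:

def verify_solution(response: str, expected: str) -> float:
    """Hypothetical multi-criteria reward for a worked math solution."""
    correctness = 1.0 if expected in response else 0.0      # solved the problem?
    pedagogy = 1.0 if "step" in response.lower() else 0.0   # crude proxy for showing steps
    efficiency = 1.0 if len(response) < 2000 else 0.5       # discourage rambling answers
    # Weight correctness most heavily; the weights are purely illustrative.
    return 0.7 * correctness + 0.2 * pedagogy + 0.1 * efficiency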

What You’ve Learned

This verification system is what makes NeMo Gym powerful for model training:

  • Resources servers provide verification logic
  • Verification patterns vary by domain but follow common principles
  • Reward signals from verification drive model improvement through RL
  • Good verification is reliable, meaningful, and scalable