# Task verification
**Goal:** Understand what task verification is and how rewards drive model training.
## What is Verification?
Every resource server in NeMo Gym has a `verify()` function that measures task performance. Its purpose is to define exactly how to judge how well the task was accomplished.
**The Problem:** When you ran the weather example in the quickstart, the model successfully called the tool and gave a response. But was that response good? Should the model be rewarded or penalized for that behavior? Without verification, there's no way to measure improvement.

**The Solution:** Each resource server must define exactly what "good performance" means for its domain.
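To make the idea concrete, here is a minimal sketch of the shape such a function takes. This is illustrative only; the exact signature varies by server, and the names here are assumptions rather than NeMo Gym's actual API:

```python
def verify(response: str, expected_answer: str) -> float:
    """Score how well the task was accomplished (sketch; names are hypothetical).

    Returns a reward, typically in [0.0, 1.0], that the RL trainer
    consumes as its learning signal.
    """
    # The simplest possible domain logic: exact-match scoring
    return 1.0 if response.strip() == expected_answer else 0.0
```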
## Why Verification Matters
**Tool Execution ≠ Good Performance.** The model may have issued the right tool call, e.g. `get_weather("San Francisco")`, but was the advice helpful? Was the response accurate? Was it efficient? Verification answers these questions with numerical scores.
**Training Signal.** Verification scores become the reward signals that drive reinforcement learning:

- High scores → "Do more of this behavior"
- Low scores → "Avoid this behavior"

No verification = no way to improve the model.
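To see what this means in practice, here is a toy, self-contained illustration of two rollouts earning different rewards, which is exactly the signal RL needs. The verifier criterion here is made up:

```python
def verify(response: str) -> float:
    # Made-up criterion: reward responses that actually name the requested city
    return 1.0 if "San Francisco" in response else 0.0

candidates = [
    "It's 18°C and foggy in San Francisco right now.",
    "I couldn't find that information.",
]
print([verify(r) for r in candidates])  # [1.0, 0.0]
# RL will push the model toward the first behavior and away from the second.
```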
## Common Verification Patterns
Let’s look at real examples from NeMo Gym’s resource servers:
**Simple Correctness** (`mcqa`, multiple-choice questions):

```python
# Extract the model's answer (A, B, C, or D)
pred = extract_answer_from_response(response)
gold = expected_answer  # e.g., "C"

# Binary scoring: right or wrong
is_correct = (pred == gold)
reward = 1.0 if is_correct else 0.0
```
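The snippet above assumes an `extract_answer_from_response` helper. One plausible, hypothetical implementation pulls the last standalone letter choice out of the response with a regex:

```python
import re

def extract_answer_from_response(response: str):
    """Hypothetical extractor: return the last standalone A-D choice, or None.

    Naive on purpose: a bare article "A" in prose would also match.
    """
    matches = re.findall(r"\b([ABCD])\b", response)
    return matches[-1] if matches else None
```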
**Sophisticated Correctness** (`math_with_judge`, math problems):

```python
# Uses the math-verify library for mathematical equivalence
library_reward = math_metric.compute(predicted_answer, expected_answer)

# PLUS an LLM judge for edge cases
judge_prompt = f"Are these answers equivalent? {predicted_answer} vs {expected_answer}"
judge_score = await llm_judge(judge_prompt)

# Combine both signals
final_reward = combine_scores(library_reward, judge_score)
```
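`combine_scores` is left abstract above. One plausible policy, an assumption rather than the server's actual rule, is to accept the answer if either signal accepts it:

```python
def combine_scores(library_reward: float, judge_score: float) -> float:
    # Hypothetical policy: trust whichever signal accepted the answer, so a
    # formatting quirk that fools one check doesn't zero out the reward.
    return max(library_reward, judge_score)
```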
**Instruction Following** (`instruction_following`):

```python
# Check whether the response follows each specific instruction
instructions = ["Use exactly 3 sentences", "Include the word 'banana'", "End with a question"]

follow_list = []
for instruction in instructions:
    follows = instruction_checker.verify(model_response, instruction)
    follow_list.append(follows)

# Only reward if ALL instructions were followed
reward = 1.0 if all(follow_list) else 0.0
```
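Each instruction check is just a predicate over the response. A hypothetical checker for the "Use exactly 3 sentences" constraint might look like this:

```python
import re

def check_sentence_count(response: str, expected: int = 3) -> bool:
    """Hypothetical checker: split on sentence-ending punctuation and count."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) == expected
```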
**Tool Usage Patterns:**

```python
# Count unnecessary tool calls
tool_calls = count_tool_calls(model_response)
expected_calls = 1  # Should only need one weather call

# Penalize inefficiency
efficiency_score = 1.0 - abs(tool_calls - expected_calls) * 0.2
reward = max(0.0, efficiency_score)
```
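For example, a response that makes three tool calls when one would do scores 1.0 − |3 − 1| × 0.2 = 0.6, while six calls would bottom out at the 0.0 floor.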
**Response Length:**

```python
# Prefer concise but complete responses
response_length = len(model_response.split())
optimal_length = 50  # words

if response_length <= optimal_length:
    reward = 1.0
else:
    # Penalize verbosity
    reward = max(0.5, 1.0 - (response_length - optimal_length) * 0.01)
```
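For example, an 80-word response scores max(0.5, 1.0 − (80 − 50) × 0.01) = 0.7, and no matter how long the response gets, the reward never drops below 0.5.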
## From Verification to Training

### How Rewards Drive Learning
1. The model generates a response → gets a verification score.
2. The RL algorithm uses the score to update the model's parameters.
3. Higher-scoring behaviors become more likely.
4. Lower-scoring behaviors become less likely.
5. The model improves over many training iterations.
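The mechanics can be seen in miniature with a toy REINFORCE-style update over two canned behaviors. This is a self-contained illustration of reward-weighted learning, not NeMo Gym's actual training loop, and the behaviors and rewards are made up:

```python
import math
import random

# Two canned "behaviors" with verification rewards (made-up numbers)
rewards = {"helpful": 1.0, "unhelpful": 0.0}
logits = {"helpful": 0.0, "unhelpful": 0.0}
lr = 0.5

def sample_behavior():
    """Sample a behavior from the softmax policy defined by the logits."""
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for behavior, logit in logits.items():
        acc += math.exp(logit) / z
        if r <= acc:
            return behavior
    return behavior

for _ in range(500):
    chosen = sample_behavior()
    z = sum(math.exp(v) for v in logits.values())
    for b in logits:
        prob = math.exp(logits[b]) / z
        grad = (1.0 if b == chosen else 0.0) - prob  # d log pi(chosen) / d logit_b
        logits[b] += lr * rewards[chosen] * grad     # REINFORCE: reward-weighted step

print(logits)  # the "helpful" logit climbs; "unhelpful" sinks
```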
### What Makes Good Verification
**Reliable:** The same response should get the same score consistently.

```python
# Good: deterministic scoring
reward = 1.0 if predicted_answer == expected_answer else 0.0

# Bad: random or inconsistent scoring
reward = random.uniform(0.8, 1.0) if correct else random.uniform(0.0, 0.2)
```
**Meaningful:** Scores should reflect actual task performance.

```python
# Good: measures what you care about
reward = accuracy_score + helpfulness_score + efficiency_score

# Bad: measures irrelevant details
reward = 1.0 if response.startswith("Hello") else 0.0
```
**Scalable:** Can handle thousands of evaluations per second during training.

```python
# Good: fast, local computation
reward = simple_string_match(predicted, expected)

# Bad: expensive API calls for every verification
reward = await expensive_api_call(predicted, expected)
```
## Real-World Verification Examples
A math problem server might verify:

- **Correctness:** Did the model solve the problem correctly?
- **Pedagogy:** Did the model explain the steps clearly?
- **Efficiency:** Did the model use the simplest method?

A customer-support server might verify:

- **Accuracy:** Did the model answer the customer's question?
- **Politeness:** Was the tone appropriate?
- **Resolution:** Did the model solve the customer's problem?

A code-generation server might verify:

- **Functionality:** Does the code run correctly?
- **Quality:** Is the code well-structured and readable?
- **Security:** Does the code avoid common vulnerabilities?
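When a server checks several criteria like these, the per-criterion scores typically need to be folded into a single reward. A hypothetical weighted combination (the weights and names are assumptions, not any server's actual formula) might look like:

```python
def combined_reward(correctness: float, pedagogy: float, efficiency: float) -> float:
    """Hypothetical multi-criteria reward; each input is assumed to lie in [0, 1]."""
    return 0.6 * correctness + 0.3 * pedagogy + 0.1 * efficiency
```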
## What You’ve Learned
This verification system is what makes NeMo Gym powerful for model training:
- Resource servers provide both tools AND scoring systems.
- Verification patterns vary by domain but follow common principles.
- Reward signals from verification drive model improvement through RL.
- Good verification is reliable, meaningful, and scalable.