Task Verification

Goal: Understand what task verification is and how rewards drive model training.

What is Verification?

Every resources server in NeMo Gym implements a verify() function that returns a reward score measuring how well the task was performed.

The Problem: When you ran the weather example in the quickstart, the agent successfully called the tool and provided a response. But was that response good? Should the model be rewarded or penalized? Without verification, you cannot measure performance or guide improvement.

The Solution: Each resources server must define exactly what “good performance” means for its domain.
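
To make this concrete, here is a deliberately minimal sketch of the idea. The function shape and the city-substring check are illustrative assumptions for the quickstart weather task, not the actual server's code:

def verify(response: str, expected_city: str) -> float:
    """Hypothetical verifier for the quickstart weather task: reward the
    rollout only if it actually discusses the city that was asked about."""
    return 1.0 if expected_city.lower() in response.lower() else 0.0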

Why Verification Matters

Tool Execution ≠ Good Performance

  • The agent issued the right tool call, e.g., get_weather("San Francisco")
  • But did it give helpful advice? Was the response accurate? Was it efficient?
  • Verification answers these questions with numerical scores

Training Signal

Verification scores become the reward signals that drive reinforcement learning:

  • High scores → “Do more of this behavior”
  • Low scores → “Avoid this behavior”
  • No verification = No way to improve the model

Common Verification Patterns

Let’s look at real examples from NeMo Gym’s resources servers:

Simple Correctness (mcqa - Multiple Choice Questions):

# Extract model answer (A, B, C, or D)
pred = extract_answer_from_response(response)
gold = expected_answer  # e.g., "C"

# Binary scoring: right or wrong
is_correct = (pred == gold)
reward = 1.0 if is_correct else 0.0
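
The snippet leaves extract_answer_from_response undefined. One common approach, shown here as an illustrative assumption rather than mcqa's actual implementation, is a regex that pulls the final letter choice out of the response:

import re

def extract_answer_from_response(response: str) -> str | None:
    """Hypothetical extractor: return the last standalone A-D letter,
    e.g. "The answer is C." -> "C"."""
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else None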

Sophisticated Correctness (math_with_judge - Math Problems):

# Uses math-verify library for mathematical equivalence
library_reward = math_metric.compute(predicted_answer, expected_answer)

# PLUS an LLM judge for edge cases
judge_prompt = f"Are these answers equivalent? {predicted_answer} vs {expected_answer}"
judge_score = await llm_judge(judge_prompt)

# Combines both signals
final_reward = combine_scores(library_reward, judge_score)
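
combine_scores is left abstract above. One plausible combination strategy, shown here as an assumption rather than math_with_judge's actual logic, is to trust whichever signal accepts the answer:

def combine_scores(library_reward: float, judge_score: float) -> float:
    """Hypothetical combiner: taking the max lets the LLM judge rescue
    answers the library rejects for formatting reasons, and vice versa."""
    return max(library_reward, judge_score)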

What is LLM-as-a-judge?

Some tasks don’t have a clean programmatic solution, or the ground truth is difficult to verify. The “right” answer might be phrased in many ways, or “good” might mean satisfying a rubric (e.g., “does it follow instructions?”, “does it answer the question?”, “is it safe/appropriate?”).

LLM-as-a-judge means you send the candidate output to another language model with instructions such as “decide whether this output is correct/equivalent/compliant”, then parse the judge’s answer (e.g., “yes” or “no”, a numeric score, or a verdict like A=B) and turn it into your reward.

In NeMo Gym, that call usually happens inside the resources server’s verify() method. The policy produces a rollout, and verification may internally call a second model to grade it. Servers such as equivalence_llm_judge, multichallenge, and text_to_sql are concrete examples of this pattern.
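
As a minimal sketch of the pattern, assuming an OpenAI-compatible async client (the prompt wording, model name, and parsing below are illustrative, not taken from those servers):

async def judge_reward(candidate: str, reference: str, client) -> float:
    """Map a judge model's yes/no verdict onto a {0.0, 1.0} reward."""
    prompt = (
        "Answer strictly 'yes' or 'no': are these answers equivalent?\n"
        f"Candidate: {candidate}\nReference: {reference}"
    )
    completion = await client.chat.completions.create(
        model="judge-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0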

From Verification to Training

How Rewards Drive Learning

  1. Model generates response → Gets verification score
  2. RL algorithm uses score to update model parameters
  3. Higher-scoring behaviors become more likely
  4. Lower-scoring behaviors become less likely
  5. Model improves over many training iterations
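
To see this dynamic in miniature, here is a runnable toy (not NeMo Gym's trainer) where a two-action "policy" shifts probability toward whatever a stand-in verifier rewards:

import random

# Toy illustration: action probabilities drift toward rewarded behavior.
probs = {"good_answer": 0.5, "bad_answer": 0.5}

def verify(action: str) -> float:
    return 1.0 if action == "good_answer" else 0.0  # stand-in verifier

for _ in range(1000):
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    reward = verify(action)
    # Reward-weighted update: nudge the chosen action toward its reward.
    for a in probs:
        target = reward if a == action else 0.0
        probs[a] += 0.01 * (target - probs[a])
    total = sum(probs.values())  # renormalize into a valid distribution
    probs = {a: p / total for a, p in probs.items()}

print(probs)  # most of the probability mass ends up on "good_answer"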

What Makes Good Verification

Reliable: The same response should always get the same score

# Good: Deterministic scoring
reward = 1.0 if predicted_answer == expected_answer else 0.0

# Bad: Random or inconsistent scoring
reward = random.uniform(0.8, 1.0) if correct else random.uniform(0.0, 0.2)

Meaningful: Scores should reflect actual task performance

# Good: Measures what you care about
reward = accuracy_score + helpfulness_score + efficiency_score

# Bad: Measures irrelevant details
reward = 1.0 if response.startswith("Hello") else 0.0

Scalable: Can handle thousands of evaluations per second during training

# Good: Fast, local computation
reward = simple_string_match(predicted, expected)

# Bad: Expensive API calls for every verification
reward = await expensive_api_call(predicted, expected)
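
When an expensive call is unavoidable, one common mitigation, sketched here with a hypothetical wrapper around the expensive_api_call from the example above, is to memoize verdicts so identical rollouts are only scored once:

# Hypothetical memoization wrapper: pay the expensive judge call only
# once per unique (predicted, expected) pair, then reuse the verdict.
_cache: dict[tuple[str, str], float] = {}

async def cached_verify(predicted: str, expected: str) -> float:
    key = (predicted, expected)
    if key not in _cache:
        _cache[key] = await expensive_api_call(predicted, expected)
    return _cache[key]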

Real-World Verification Examples

  • Correctness: Did the model solve the problem correctly?
  • Pedagogy: Did the model explain the steps clearly?
  • Efficiency: Did the model use the simplest method?
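
A verifier that scores several of these dimensions at once might look like the following sketch; the helper checks and weights are assumptions for illustration, not taken from any NeMo Gym server:

def verify_solution(response: str, expected: str) -> float:
    """Hypothetical multi-criteria reward for a worked math solution."""
    correctness = 1.0 if expected in response else 0.0      # solved the problem?
    pedagogy = 1.0 if "step" in response.lower() else 0.0   # crude proxy for showing steps
    efficiency = 1.0 if len(response) < 2000 else 0.5       # discourage rambling answers
    # Weight correctness most heavily; the weights are purely illustrative.
    return 0.7 * correctness + 0.2 * pedagogy + 0.1 * efficiency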

What You’ve Learned

This verification system is what makes NeMo Gym powerful for model training:

  • Resources servers provide verification logic
  • Verification patterns vary by domain but follow common principles
  • Reward signals from verification drive model improvement through RL
  • Good verification is reliable, meaningful, and scalable