NeMo Evaluator: Deployment Patterns#
Architecture#
flowchart TB
subgraph CLI ["nel CLI"]
run["nel eval run"]
serve["nel serve"]
validate["nel validate"]
end
subgraph Core ["Evaluator Core"]
loop["Eval Loop<br/>seed → model → verify"]
obs["Observability<br/>trajectories · stats · failures"]
metrics["Metrics<br/>pass@k · CIs · regression"]
end
subgraph Envs ["Environment Sources"]
local["EvalEnvironment<br/>(GSM8K, TriviaQA, BYOB)"]
gym_a["GymEnvironment<br/>remote seed/verify"]
vlmkit["VLMEvalKitEnvironment<br/>VLM benchmarks"]
end
subgraph Out ["Outputs"]
traj["trajectories.jsonl"]
stats["runtime_stats.json"]
fail["failure_analysis.json"]
results["results.jsonl"]
bundle["eval-{id}.json"]
reg["regression.json"]
end
subgraph External ["Gym Pipeline"]
gymserve["nel serve<br/>(Gym protocol)"]
ng["ng_collect_rollouts"]
jsonl["dataset.jsonl"]
end
run --> loop
validate --> loop
loop --> obs --> Out
loop --> metrics --> bundle
loop --- local & gym_a & vlmkit
serve --> local
local --> gymserve --> jsonl --> ng
Pattern 1: Direct Evaluation (Local Environment)#
Who: Research / training owner running benchmarks against a model endpoint.
What: Evaluator owns the full loop. Seeds problems from a local EvalEnvironment, calls the model, verifies, produces all artifacts.
# Quick validation
nel validate -b gsm8k --samples 10
# Full run with n-repeats
nel eval run --bench gsm8k --repeats 4 --max-problems 100 -o ./results
# From YAML config
nel eval run eval_config.yaml
# eval_config.yaml
services:
  model:
    type: api
    url: https://integrate.api.nvidia.com/v1/chat/completions
    protocol: chat_completions
    model: nvidia/nemotron-3-super-120b-a12b
    api_key: ${NVIDIA_API_KEY}

benchmarks:
  - name: gsm8k
    repeats: 4
    max_problems: 100
    solver:
      type: simple
      service: model
Multi-benchmark configs support --resume for checkpoint-based recovery:
nel eval run eval_config.yaml --resume
sequenceDiagram
participant CLI as nel eval run
participant Env as EvalEnvironment
participant Model as Model API
participant Disk as Artifacts
loop each problem × n_repeats
CLI->>Env: seed(idx)
Env-->>CLI: prompt, expected, metadata
CLI->>Model: chat(prompt)
Model-->>CLI: ModelResponse (content, tokens, latency)
CLI->>Env: verify(response, expected)
Env-->>CLI: reward, extracted, scoring_details
CLI->>Disk: record step → trajectories.jsonl
end
CLI->>Disk: bundle.json, runtime_stats.json, results.jsonl
Artifacts produced: All. Full trajectory with request/response bindings, token counts, latency breakdown per phase, scoring details, failure categorization.
Environments: Any registered EvalEnvironment subclass — GSM8K, TriviaQA, or any BYOB benchmark.
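For scripted runs the same loop can be driven from Python. A minimal sketch, assuming the engine helpers shown in later patterns (get_environment, ModelClient, run_evaluation) plus a SimpleSolver counterpart to Pattern 4's VLMSolver; the n_repeats and max_problems keyword names mirror the CLI flags and are assumptions:

```python
import asyncio
import os

from nemo_evaluator.engine import ModelClient, run_evaluation
from nemo_evaluator.environments.registry import get_environment
from nemo_evaluator.solvers import SimpleSolver  # assumed; analogous to VLMSolver in Pattern 4

# Same benchmark and endpoint as the CLI example above
env = get_environment("gsm8k")
client = ModelClient(
    base_url="https://integrate.api.nvidia.com/v1",
    model="nvidia/nemotron-3-super-120b-a12b",
    api_key=os.environ["NVIDIA_API_KEY"],
)
bundle = asyncio.run(run_evaluation(env, SimpleSolver(client), n_repeats=4, max_problems=100))
```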
Pattern 2: Serve for Gym Training#
Who: Environment developer making an Evaluator benchmark available to Gym’s RL pipeline.
What: An EvalEnvironment is exposed as an HTTP service speaking Gym’s /seed_session + /verify protocol. Gym agents consume it during training. Evaluator independently evaluates the same benchmark with full observability.
# Serve with evaluator protocol (for nel eval run --adapter)
nel serve -b gsm8k --port 9090
# Serve with gym-compatible protocol (accepts NeMoGymResponse in /verify)
nel serve -b gsm8k --port 9090 --gym-compat
# Also export JSONL for ng_collect_rollouts
nel serve -b gsm8k --port 9090 --gym-compat --export-data /data
sequenceDiagram
participant Gym as Gym Agent
participant Svc as nel serve (EvalEnvironment)
participant Eval as nel eval run (independent)
Note over Gym,Svc: Training loop
Gym->>Svc: POST /seed_session
Svc-->>Gym: {prompt, expected_answer}
Gym->>Gym: model call
Gym->>Svc: POST /verify {response: NeMoGymResponse}
Svc-->>Gym: {reward: 1.0}
Note over Gym: reads reward only
Note over Eval,Svc: Independent evaluation
Eval->>Svc: POST /seed_session {idx}
Svc-->>Eval: {prompt, expected_answer, metadata}
Eval->>Eval: model call (captured)
Eval->>Svc: POST /verify {response, expected}
Svc-->>Eval: {reward, scoring_details, metadata}
Note over Eval: full artifact bundle
Dual-consumer: Same service, two readers. Gym reads reward. Evaluator reads everything.
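For reference, the Gym-side exchange in the diagram reduces to two POSTs. A rough client sketch using the field names shown in the diagram (prompt, expected_answer, reward); the full request/response schema, including the NeMoGymResponse payload, is an assumption here:

```python
import requests

BASE = "http://localhost:9090"  # nel serve -b gsm8k --port 9090 --gym-compat

def my_model_call(prompt: str) -> str:
    return "42"  # placeholder for the agent's own model call

seed = requests.post(f"{BASE}/seed_session", json={"idx": 0}).json()
verdict = requests.post(
    f"{BASE}/verify",
    json={"response": my_model_call(seed["prompt"]), "expected": seed["expected_answer"]},
).json()
print(verdict["reward"])  # Gym reads only the reward
```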
Pattern 3: Consume Remote Environment via Adapter#
Who: Evaluation owner who wants Evaluator’s statistical rigor on an environment running elsewhere.
What: GymEnvironment connects to any server speaking seed_session/verify. Evaluator owns the model call in between, giving full trajectory capture even though the environment is remote.
# Consume a remote environment
nel eval run --bench gym://staging-server:8080 --repeats 4 --max-problems 50
# Consume another nel serve instance
nel eval run --bench gym://127.0.0.1:9090 --repeats 2
sequenceDiagram
participant Eval as nel eval run
participant Env as GymEnvironment
participant Remote as Remote Environment
participant Model as Model API
loop each problem × n_repeats
Eval->>Env: seed(idx)
Env->>Remote: POST /seed_session {idx}
Remote-->>Env: {prompt, expected_answer}
Env-->>Eval: SeedResult
Eval->>Model: chat(prompt)
Note over Eval: trajectory captured
Eval->>Env: verify(response, expected)
Env->>Remote: POST /verify
Remote-->>Env: {reward, scoring_details}
Env-->>Eval: VerifyResult
end
Key: Evaluator owns the model call → full trajectory, tokens, latency regardless of where the environment runs.
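The same adapter flow can be sketched programmatically. This assumes GymEnvironment is constructed from the remote server URL and that the module path and solver class below exist; treat these names as placeholders rather than the shipped API:

```python
import asyncio
import os

from nemo_evaluator.engine import ModelClient, run_evaluation
from nemo_evaluator.environments.gym import GymEnvironment  # module path assumed
from nemo_evaluator.solvers import SimpleSolver              # assumed solver class

env = GymEnvironment("http://staging-server:8080")           # remote seed_session/verify server
client = ModelClient(
    base_url="https://integrate.api.nvidia.com/v1",
    model="nvidia/nemotron-3-super-120b-a12b",
    api_key=os.environ["NVIDIA_API_KEY"],
)
bundle = asyncio.run(run_evaluation(env, SimpleSolver(client), n_repeats=4, max_problems=50))
```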
Pattern 4: VLMEvalKit Benchmarks#
Who: Research team evaluating vision-language models against VLMEvalKit’s 100+ benchmarks with Evaluator’s full observability.
What: VLMEvalKitEnvironment wraps VLMEvalKit datasets, handling image loading and scoring (MCQ, VQA, Y/N). The VLMSolver sends images + text to the model.
nel eval run --bench vlmevalkit://MMBench_DEV_EN --repeats 1 --max-problems 50
import asyncio

from nemo_evaluator.engine import ModelClient, run_evaluation
from nemo_evaluator.environments.vlmevalkit import VLMEvalKitEnvironment
from nemo_evaluator.solvers import VLMSolver

# Wrap a VLMEvalKit dataset and evaluate it with a vision-language solver
env = VLMEvalKitEnvironment("MMBench_DEV_EN")
client = ModelClient(base_url="...", model="...", api_key="...")
solver = VLMSolver(client)
bundle = asyncio.run(run_evaluation(env, solver, n_repeats=1))  # run_evaluation is async
sequenceDiagram
participant Eval as nel eval run
participant VLM as VLMEvalKitEnvironment
participant VK as VLMEvalKit Dataset
participant Model as VLM API
VLM->>VK: build_dataset("MMBench_DEV_EN")
loop each problem × n_repeats
Eval->>VLM: seed(idx)
VLM->>VK: build_prompt(line)
VK-->>VLM: images + prompt text
VLM-->>Eval: SeedResult (prompt, images, choices)
Eval->>Model: chat(images + prompt)
Note over Eval: trajectory captured
Eval->>VLM: verify(response, expected)
VLM-->>Eval: VerifyResult (MCQ/VQA/Y-N scoring)
end
Pattern 5: Serve for ng_collect_rollouts#
Who: Training team that needs batch rollout collection using Gym’s infrastructure.
What: nel serve exposes any EvalEnvironment as a Gym-compatible HTTP server. Gym agents and ng_collect_rollouts consume it via standard /seed_session + /verify endpoints.
# Serve gsm8k as a Gym-compatible resource server
nel serve -b gsm8k --gym-compat --port 9090
# Gym training nodes connect via: gym://localhost:9090
# Or programmatically
import uvicorn
from nemo_evaluator.environments.registry import get_environment
from nemo_evaluator.serving.app import generate_app
env = get_environment("gsm8k")
app = generate_app(env, gym_compat=True)
uvicorn.run(app, host="0.0.0.0", port=9090)
JSONL row format (matches ng_collect_rollouts input spec):
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "Solve: ..."}]
  },
  "expected_answer": "42",
  "uuid": "gsm8k-0",
  "eval_type": "gsm8k",
  "metadata": {"category": "math"}
}
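nel serve --export-data emits rows in this shape automatically; the sketch below builds the same rows by hand, e.g. for a small custom dataset:

```python
import json

rows = [
    {
        "responses_create_params": {
            "input": [{"role": "user", "content": "Solve: 2 + 2"}]
        },
        "expected_answer": "4",
        "uuid": "my-bench-0",
        "eval_type": "my_benchmark",
        "metadata": {"category": "math"},
    }
]

with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")  # one JSON object per line, as ng_collect_rollouts expects
```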
Pattern 6: Regression Comparison#
Who: CI pipeline or evaluation owner comparing model versions.
What: Two run bundles → score deltas with CI overlap check, Mann-Whitney U p-values, per-category deltas, runtime deltas.
from nemo_evaluator.engine.comparison import compare_runs, write_regression
report = compare_runs("results/v1/eval-*.json", "results/v2/eval-*.json")
write_regression(report, "results/regression.json")
Output:
{
  "score_deltas": {
    "pass@1": {
      "baseline": 0.85,
      "candidate": 0.88,
      "delta": 0.03,
      "ci_overlap": true,
      "p_value": 0.031,
      "significant": true
    }
  },
  "category_deltas": {
    "algebra": {"baseline": 0.92, "candidate": 0.95, "delta": 0.03},
    "geometry": {"baseline": 0.71, "candidate": 0.68, "delta": -0.03}
  },
  "runtime_deltas": {
    "tokens_per_second": {"baseline": 41.0, "candidate": 38.5, "delta": -2.5}
  }
}
Artifact Summary#
Every evaluation run (all patterns) produces:
| File | Contents |
|---|---|
| eval-{id}.json | Config, pass@k with CIs, per-category scores, runtime stats, failure report |
| results.jsonl | Per-sample: problem_idx, repeat, reward, extracted answer, tokens, latency |
| trajectories.jsonl | Per-step: full prompt, model response, token breakdown (prompt/completion/reasoning), latency breakdown (seed/model/verify ms), scoring method + details, SHA256 request hash, failure category |
| runtime_stats.json | Latency percentiles (p50/p90/p99), token throughput, finish reason distribution, error count |
| failure_analysis.json | Categorized failures (refusal, format_error, timeout, rate_limit, empty_response) with counts, percentages, exemplars |
| regression.json | Score deltas, CI overlap, per-category and runtime deltas (when comparing two runs) |
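The JSONL artifacts are intended for ad-hoc analysis. A small sketch over results.jsonl, using the per-sample fields listed above (the exact JSON key names are assumptions):

```python
import json
from collections import defaultdict

rewards = defaultdict(list)
with open("results/results.jsonl") as f:
    for line in f:
        row = json.loads(line)
        rewards[row["problem_idx"]].append(row["reward"])

# Fraction of problems solved on the first repeat (a rough pass@1)
solved = sum(rs[0] > 0 for rs in rewards.values())
print(f"pass@1 ≈ {solved / len(rewards):.3f} over {len(rewards)} problems")
```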
Environment Compatibility Matrix#
| Source | Evaluator Owns Model Call | Full Trajectory | n_repeats | pass@k + CIs | Progress Bar | Failure Analysis |
|---|---|---|---|---|---|---|
| Local EvalEnvironment | Yes | Yes | Yes | Yes | Yes | Yes |
| Remote via GymEnvironment | Yes | Yes | Yes | Yes | Yes | Yes |
| VLMEvalKitEnvironment | Yes | Yes | Yes | Yes | Yes | Yes |
| Gym ng_collect_rollouts | No (Gym does) | From output JSONL | Via num_repeats | From reward vectors | Gym’s tqdm | Post-hoc |
| Legacy harnesses | No (subprocess) | From output files | Via config | From parsed scores | Process output | Post-hoc |
BYOB: Writing a New Benchmark#
from nemo_evaluator import benchmark, scorer, ScorerInput, answer_line

@benchmark(
    name="my_benchmark",
    dataset="hf://my-org/my-data?split=test",
    prompt="Question: {question}\nAnswer:",
    target_field="answer",
)
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return answer_line(sample)
Then:
nel validate -b my_benchmark --samples 10 # sanity check
nel eval run --bench my_benchmark --repeats 4 # full evaluation
nel serve -b my_benchmark --gym-compat # serve for Gym
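answer_line is the built-in scorer used above; a benchmark can also hand-roll its own. The sketch below assumes ScorerInput exposes the model response and expected answer as attributes and that the returned dict carries reward/extracted keys, mirroring the verify fields reported elsewhere on this page; those names are assumptions:

```python
import re

from nemo_evaluator import benchmark, scorer, ScorerInput

@benchmark(
    name="my_numeric_benchmark",
    dataset="hf://my-org/my-data?split=test",
    prompt="Question: {question}\nAnswer:",
    target_field="answer",
)
@scorer
def numeric_scorer(sample: ScorerInput) -> dict:
    # Pull the last number out of the model response and compare to the target
    numbers = re.findall(r"-?\d+(?:\.\d+)?", sample.response)   # attribute names assumed
    extracted = numbers[-1] if numbers else None
    return {"reward": float(extracted == sample.expected), "extracted": extracted}
```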
Pattern 7: Distributed Evaluation on SLURM#
User story: “I need to evaluate a 14k-problem benchmark with n=8 repeats. Running serially would take days. I want to shard across 16 SLURM nodes and merge results.”
Architecture#
SLURM Controller
│
sbatch --array=0-15
┌────────────┼────────────┐
▼ ▼ ▼
Shard 0 Shard 1 ... Shard 15
[p0-p874] [p875-p1749] [p13125-p13999]
nel eval run nel eval run nel eval run
│ │ │
└────────────┼────────────┘
▼
Merge Job
(afterok dependency)
│
▼
merged/eval-*.json
merged/results.jsonl
merged/trajectories.jsonl
Config-Driven SLURM#
SLURM evaluations are driven by a YAML config with a cluster: { type: slurm } block:
# slurm_eval.yaml
services:
  model:
    type: vllm
    model: nvidia/Llama-3.1-70B-Instruct
    protocol: chat_completions
    tensor_parallel_size: 4
    port: 8000
    node_pool: compute

benchmarks:
  - name: gsm8k
    repeats: 8
    solver:
      type: simple
      service: model

cluster:
  type: slurm
  walltime: "04:00:00"
  shards: 16
  node_pools:
    compute:
      partition: batch
      nodes: 1
      ntasks_per_node: 1
      gres: "gpu:4"

output:
  dir: ./eval_results/gsm8k_distributed
# Submit via config (16 shards, auto-merge after completion)
nel eval run slurm_eval.yaml
# Generate scripts to inspect first (set submit: false in config)
nel eval run slurm_eval.yaml
Shard merging is handled automatically when all array tasks complete.
How Sharding Works#
Each SLURM array task gets SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT set automatically. nel eval run detects these (or NEL_SHARD_IDX/NEL_TOTAL_SHARDS) and computes its problem range:
| 14000 problems, 16 shards | Shard 0 | Shard 1 | … | Shard 15 |
|---|---|---|---|---|
| Problem range | [0, 875) | [875, 1750) | … | [13125, 14000) |
Each shard writes its own artifacts to shard_N/. The merge job combines all results, recomputes global metrics (pass@k, CI), and aggregates runtime stats.
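A minimal sketch of that range computation and merge, assuming an even contiguous split and the per-sample fields from results.jsonl (function and key names here are illustrative, not the shipped implementation):

```python
import glob
import json
import os

def shard_range(total_problems: int, total_shards: int, shard_idx: int) -> tuple[int, int]:
    # Contiguous split; 14000 problems over 16 shards gives [0, 875), [875, 1750), ...
    per_shard = -(-total_problems // total_shards)  # ceiling division
    start = shard_idx * per_shard
    return start, min(start + per_shard, total_problems)

idx = int(os.environ.get("NEL_SHARD_IDX", os.environ.get("SLURM_ARRAY_TASK_ID", 0)))
count = int(os.environ.get("NEL_TOTAL_SHARDS", os.environ.get("SLURM_ARRAY_TASK_COUNT", 1)))
print(shard_range(14000, count, idx))  # shard 0 of 16 -> (0, 875)

# Merge step: concatenate per-shard results and recompute a global mean reward
rows = []
for path in glob.glob("shard_*/results.jsonl"):
    with open(path) as f:
        rows.extend(json.loads(line) for line in f)
if rows:
    print(f"{len(rows)} samples, mean reward {sum(r['reward'] for r in rows) / len(rows):.3f}")
```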
Serve on SLURM#
For long-running Gym training, serve an environment as a SLURM service:
nel serve -b gsm8k --gym-compat --port 9090
# Wrap in an sbatch script for SLURM submission
# Allocated endpoint is written to eval_results/endpoint.txt
# Gym training nodes connect via: gym://$(cat eval_results/endpoint.txt)
Environment Variables#
| Variable | Source | Purpose |
|---|---|---|
| SLURM_ARRAY_TASK_ID | SLURM | Shard index (0-based) |
| SLURM_ARRAY_TASK_COUNT | SLURM | Total shards |
| NEL_SHARD_IDX | Manual override | Same as above, for non-SLURM use |
| NEL_TOTAL_SHARDS | Manual override | Same as above |
|  | User | Model API key (avoid passing on CLI) |
Pattern 8: Docker / Docker Compose#
User story: “I want to run evaluations in containers for reproducibility, or spin up a serve+eval stack locally.”
# Build the image
docker build -t nemo-evaluator .
# Single eval
docker run -e NEMO_API_KEY=$KEY nemo-evaluator eval run \
--bench gsm8k --repeats 2 --output-dir /results
# Serve + eval (docker compose)
cd deploy
NEMO_API_KEY=$KEY docker compose up serve eval-remote
# Sharded (manually, 4 shards)
for i in 0 1 2 3; do
NEL_SHARD_IDX=$i NEL_TOTAL_SHARDS=4 docker compose \
--profile sharded run -d eval-shard
done
Files: Dockerfile, deploy/docker-compose.yaml
Pattern 9: Kubernetes#
User story: “I need to run evaluations on our K8s cluster, with sharded jobs for large benchmarks and a persistent serve endpoint for Gym training.”
Single Evaluation Job#
kubectl apply -f deploy/k8s/eval-job.yaml
kubectl logs -f job/nel-eval
Serve as K8s Service#
kubectl apply -f deploy/k8s/serve-deployment.yaml
# Gym training pods connect via: gym://nel-serve.default.svc:9090
Includes readiness/liveness probes on /health, ClusterIP service for internal discovery.
Files: deploy/k8s/eval-job.yaml, deploy/k8s/eval-indexed-job.yaml, deploy/k8s/serve-deployment.yaml
Pattern 10: Ray (Distributed)#
User story: “Our Gym training already runs on Ray. I want to run distributed evaluation on the same Ray cluster.”
# Submit as a Ray job
ray job submit --working-dir . -- python -m nemo_evaluator.engine.ray_launcher \
--bench gsm8k --shards 8 --repeats 5 \
--model-url https://integrate.api.nvidia.com/v1 \
--model-id azure/openai/gpt-5.2 \
--output-dir ./eval_results/ray
# Or from within a Ray script
import ray
from deploy.ray_eval import run_shard
futures = [run_shard.remote("gsm8k", i, 8, ...) for i in range(8)]
results = ray.get(futures)
Each run_shard is a Ray remote task that runs run_evaluation() with a computed problem_range. Results are merged locally after all tasks complete. Works on existing Ray clusters used by Gym training.
File: src/nemo_evaluator/engine/ray_launcher.py
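A slightly fuller version of the in-script submission, showing the gather and merge step. run_shard is imported from deploy/ray_eval as in the snippet above (its extra arguments are omitted here), and the per-shard bundle keys used in the merge are assumptions:

```python
import ray
from deploy.ray_eval import run_shard

ray.init(address="auto")                      # attach to the existing Gym Ray cluster
futures = [run_shard.remote("gsm8k", i, 8) for i in range(8)]
shard_bundles = ray.get(futures)              # blocks until all 8 shards finish

# Problem-count-weighted global pass@1 across shards (bundle key names assumed)
total = sum(b["n_problems"] for b in shard_bundles)
pass_at_1 = sum(b["pass_at_1"] * b["n_problems"] for b in shard_bundles) / total
print(f"global pass@1 ≈ {pass_at_1:.3f}")
```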
Pattern 11: GitLab CI Regression Gate#
User story: “I want every MR to automatically evaluate the candidate model against the baseline and block merge if scores regress.”
# Include in your .gitlab-ci.yml:
include:
- local: deploy/gitlab-ci-eval.yml
Pipeline stages:
- eval:baseline – checks out target branch, runs eval
- eval:candidate – runs eval on MR branch
- regression:check – compares bundles, fails if any metric drops >5%
Produces regression.json artifact with per-metric deltas, CI overlap, and category breakdowns.
File: deploy/gitlab-ci-eval.yml
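The regression:check stage boils down to comparing the two bundles and failing on large drops. A sketch using the Pattern 6 helpers, assuming compare_runs returns the same dict structure that write_regression serializes, with the 5% rule re-implemented by hand:

```python
import sys

from nemo_evaluator.engine.comparison import compare_runs, write_regression

report = compare_runs("artifacts/baseline/eval-*.json", "artifacts/candidate/eval-*.json")
write_regression(report, "regression.json")

drops = {
    name: d["delta"]
    for name, d in report["score_deltas"].items()
    if d["baseline"] > 0 and d["delta"] / d["baseline"] < -0.05  # more than a 5% relative drop
}
if drops:
    print(f"Blocking merge; regressions over 5%: {drops}")
    sys.exit(1)
```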
Deployment Matrix#
| Target | Eval | Serve | Sharded | Merge | Config |
|---|---|---|---|---|---|
| Local | ✓ | ✓ | ✓ | automatic | CLI flags |
| SLURM | ✓ | ✓ | ✓ | automatic | YAML config |
| Docker | ✓ | ✓ | ✓ | automatic |  |
| Kubernetes | ✓ | ✓ | ✓ | follow-up | K8s manifests |
| Ray | ✓ | N/A (use K8s) | ✓ | in-process | Python |
| GitLab CI | pipeline job | N/A | N/A | regression stage |  |
All targets use the same nel eval run / nel serve commands, same sharding env vars (NEL_SHARD_IDX, NEL_TOTAL_SHARDS), and same artifact format. The only difference is the orchestration layer.