NeMo Evaluator: Deployment Patterns#

Architecture#

        flowchart TB
    subgraph CLI ["nel CLI"]
        run["nel eval run"]
        serve["nel serve"]
        validate["nel validate"]
    end

    subgraph Core ["Evaluator Core"]
        loop["Eval Loop<br/>seed → model → verify"]
        obs["Observability<br/>trajectories · stats · failures"]
        metrics["Metrics<br/>pass@k · CIs · regression"]
    end

    subgraph Envs ["Environment Sources"]
        local["EvalEnvironment<br/>(GSM8K, TriviaQA, BYOB)"]
        gym_a["GymEnvironment<br/>remote seed/verify"]
        vlmkit["VLMEvalKitEnvironment<br/>VLM benchmarks"]
    end

    subgraph Out ["Outputs"]
        traj["trajectories.jsonl"]
        stats["runtime_stats.json"]
        fail["failure_analysis.json"]
        results["results.jsonl"]
        bundle["eval-{id}.json"]
        reg["regression.json"]
    end

    subgraph External ["Gym Pipeline"]
        gymserve["nel serve<br/>(Gym protocol)"]
        ng["ng_collect_rollouts"]
        jsonl["dataset.jsonl"]
    end

    run --> loop
    validate --> loop
    loop --> obs --> Out
    loop --> metrics --> bundle
    loop --- local & gym_a & vlmkit

    serve --> local
    local --> gymserve --> jsonl --> ng

Pattern 1: Direct Evaluation (Local Environment)#

Who: Research / training owner running benchmarks against a model endpoint.

What: Evaluator owns the full loop. Seeds problems from a local EvalEnvironment, calls the model, verifies, produces all artifacts.

# Quick validation
nel validate -b gsm8k --samples 10

# Full run with n-repeats
nel eval run --bench gsm8k --repeats 4 --max-problems 100 -o ./results

# From YAML config
nel eval run eval_config.yaml

# eval_config.yaml
services:
  model:
    type: api
    url: https://integrate.api.nvidia.com/v1/chat/completions
    protocol: chat_completions
    model: nvidia/nemotron-3-super-120b-a12b
    api_key: ${NVIDIA_API_KEY}

benchmarks:
  - name: gsm8k
    repeats: 4
    max_problems: 100
    solver:
      type: simple
      service: model

Multi-benchmark configs support --resume for checkpoint-based recovery:

nel eval run eval_config.yaml --resume

        sequenceDiagram
    participant CLI as nel eval run
    participant Env as EvalEnvironment
    participant Model as Model API
    participant Disk as Artifacts

    loop each problem × n_repeats
        CLI->>Env: seed(idx)
        Env-->>CLI: prompt, expected, metadata
        CLI->>Model: chat(prompt)
        Model-->>CLI: ModelResponse (content, tokens, latency)
        CLI->>Env: verify(response, expected)
        Env-->>CLI: reward, extracted, scoring_details
        CLI->>Disk: record step → trajectories.jsonl
    end
    CLI->>Disk: bundle.json, runtime_stats.json, results.jsonl

Artifacts produced: All. Full trajectory with request/response bindings, token counts, latency breakdown per phase, scoring details, failure categorization.

Environments: Any registered EvalEnvironment subclass — GSM8K, TriviaQA, or any BYOB benchmark.

Pattern 2: Serve for Gym Training#

Who: Environment developer making an Evaluator benchmark available to Gym’s RL pipeline.

What: An EvalEnvironment is exposed as an HTTP service speaking Gym’s /seed_session + /verify protocol. Gym agents consume it during training. Evaluator independently evaluates the same benchmark with full observability.

# Serve with evaluator protocol (for nel eval run --adapter)
nel serve -b gsm8k --port 9090

# Serve with gym-compatible protocol (accepts NeMoGymResponse in /verify)
nel serve -b gsm8k --port 9090 --gym-compat

# Also export JSONL for ng_collect_rollouts
nel serve -b gsm8k --port 9090 --gym-compat --export-data /data

        sequenceDiagram
    participant Gym as Gym Agent
    participant Svc as nel serve (EvalEnvironment)
    participant Eval as nel eval run (independent)

    Note over Gym,Svc: Training loop
    Gym->>Svc: POST /seed_session
    Svc-->>Gym: {prompt, expected_answer}
    Gym->>Gym: model call
    Gym->>Svc: POST /verify {response: NeMoGymResponse}
    Svc-->>Gym: {reward: 1.0}
    Note over Gym: reads reward only

    Note over Eval,Svc: Independent evaluation
    Eval->>Svc: POST /seed_session {idx}
    Svc-->>Eval: {prompt, expected_answer, metadata}
    Eval->>Eval: model call (captured)
    Eval->>Svc: POST /verify {response, expected}
    Svc-->>Eval: {reward, scoring_details, metadata}
    Note over Eval: full artifact bundle

Dual-consumer: Same service, two readers. Gym reads reward. Evaluator reads everything.

Pattern 3: Consume Remote Environment via Adapter#

Who: Evaluation owner who wants Evaluator’s statistical rigor on an environment running elsewhere.

What: GymEnvironment connects to any server speaking seed_session/verify. Evaluator owns the model call in between, giving full trajectory capture even though the environment is remote.

# Consume a remote environment
nel eval run --bench gym://staging-server:8080 --repeats 4 --max-problems 50

# Consume another nel serve instance
nel eval run --bench gym://127.0.0.1:9090 --repeats 2

        sequenceDiagram
    participant Eval as nel eval run
    participant Env as GymEnvironment
    participant Remote as Remote Environment
    participant Model as Model API

    loop each problem × n_repeats
        Eval->>Env: seed(idx)
        Env->>Remote: POST /seed_session {idx}
        Remote-->>Env: {prompt, expected_answer}
        Env-->>Eval: SeedResult

        Eval->>Model: chat(prompt)
        Note over Eval: trajectory captured

        Eval->>Env: verify(response, expected)
        Env->>Remote: POST /verify
        Remote-->>Env: {reward, scoring_details}
        Env-->>Eval: VerifyResult
    end

Key: Evaluator owns the model call → full trajectory, tokens, latency regardless of where the environment runs.

Pattern 4: VLMEvalKit Benchmarks#

Who: Research team evaluating vision-language models against VLMEvalKit’s 100+ benchmarks with Evaluator’s full observability.

What: VLMEvalKitEnvironment wraps VLMEvalKit datasets, handling image loading and scoring (MCQ, VQA, Y/N). The VLMSolver sends images + text to the model.

nel eval run --bench vlmevalkit://MMBench_DEV_EN --repeats 1 --max-problems 50

from nemo_evaluator.environments.vlmevalkit import VLMEvalKitEnvironment
from nemo_evaluator.engine import run_evaluation, ModelClient
from nemo_evaluator.solvers import VLMSolver

env = VLMEvalKitEnvironment("MMBench_DEV_EN")
client = ModelClient(base_url="...", model="...", api_key="...")
solver = VLMSolver(client)
bundle = await run_evaluation(env, solver, n_repeats=1)

        sequenceDiagram
    participant Eval as nel eval run
    participant VLM as VLMEvalKitEnvironment
    participant VK as VLMEvalKit Dataset
    participant Model as VLM API

    VLM->>VK: build_dataset("MMBench_DEV_EN")

    loop each problem × n_repeats
        Eval->>VLM: seed(idx)
        VLM->>VK: build_prompt(line)
        VK-->>VLM: images + prompt text
        VLM-->>Eval: SeedResult (prompt, images, choices)

        Eval->>Model: chat(images + prompt)
        Note over Eval: trajectory captured

        Eval->>VLM: verify(response, expected)
        VLM-->>Eval: VerifyResult (MCQ/VQA/Y-N scoring)
    end

Pattern 5: Serve for ng_collect_rollouts#

Who: Training team that needs batch rollout collection using Gym’s infrastructure.

What: nel serve exposes any EvalEnvironment as a Gym-compatible HTTP server. Gym agents and ng_collect_rollouts consume it via standard /seed_session + /verify endpoints.

# Serve gsm8k as a Gym-compatible resource server
nel serve -b gsm8k --gym-compat --port 9090
# Gym training nodes connect via: gym://localhost:9090

# Or programmatically
import uvicorn
from nemo_evaluator.environments.registry import get_environment
from nemo_evaluator.serving.app import generate_app

env = get_environment("gsm8k")
app = generate_app(env, gym_compat=True)
uvicorn.run(app, host="0.0.0.0", port=9090)

JSONL row format (matches ng_collect_rollouts input spec):

{
  "responses_create_params": {
    "input": [{"role": "user", "content": "Solve: ..."}]
  },
  "expected_answer": "42",
  "uuid": "gsm8k-0",
  "eval_type": "gsm8k",
  "metadata": {"category": "math"}
}

Pattern 6: Regression Comparison#

Who: CI pipeline or evaluation owner comparing model versions.

What: Two run bundles → score deltas with CI overlap check, Mann-Whitney U p-values, per-category deltas, runtime deltas.

from nemo_evaluator.engine.comparison import compare_runs, write_regression

report = compare_runs("results/v1/eval-*.json", "results/v2/eval-*.json")
write_regression(report, "results/regression.json")

Output:

{
  "score_deltas": {
    "pass@1": {"baseline": 0.85, "candidate": 0.88, "delta": 0.03, "ci_overlap": true, "p_value": 0.031, "significant": true}
  },
  "category_deltas": {
    "algebra": {"baseline": 0.92, "candidate": 0.95, "delta": 0.03},
    "geometry": {"baseline": 0.71, "candidate": 0.68, "delta": -0.03}
  },
  "runtime_deltas": {
    "tokens_per_second": {"baseline": 41.0, "candidate": 38.5, "delta": -2.5}
  }
}

Artifact Summary#

Every evaluation run (all patterns) produces:

File	Contents
`eval-{id}.json`	Config, pass@k with CIs, per-category scores, runtime stats, failure report
`results.jsonl`	Per-sample: problem_idx, repeat, reward, extracted answer, tokens, latency
`trajectories.jsonl`	Per-step: full prompt, model response, token breakdown (prompt/completion/reasoning), latency breakdown (seed/model/verify ms), scoring method + details, SHA256 request hash, failure category
`runtime_stats.json`	Latency percentiles (p50/p90/p99), token throughput, finish reason distribution, error count
`failure_analysis.json`	Categorized failures (refusal, format_error, timeout, rate_limit, empty_response) with counts, percentages, exemplars
`regression.json`	Score deltas, CI overlap, per-category and runtime deltas (when comparing two runs)

Environment Compatibility Matrix#

Source	Evaluator Owns Model Call	Full Trajectory	n_repeats	pass@k + CIs	Progress Bar	Failure Analysis
Local EvalEnvironment	Yes	Yes	Yes	Yes	Yes	Yes
Remote via GymEnvironment	Yes	Yes	Yes	Yes	Yes	Yes
VLMEvalKitEnvironment	Yes	Yes	Yes	Yes	Yes	Yes
Gym ng_collect_rollouts	No (Gym does)	From output JSONL	Via num_repeats	From reward vectors	Gym’s tqdm	Post-hoc
Legacy harnesses	No (subprocess)	From output files	Via config	From parsed scores	Process output	Post-hoc

BYOB: Writing a New Benchmark#

from nemo_evaluator import benchmark, scorer, ScorerInput, answer_line

@benchmark(
    name="my_benchmark",
    dataset="hf://my-org/my-data?split=test",
    prompt="Question: {question}\nAnswer:",
    target_field="answer",
)
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return answer_line(sample)

Then:

nel validate -b my_benchmark --samples 10             # sanity check
nel eval run --bench my_benchmark --repeats 4          # full evaluation
nel serve -b my_benchmark --gym-compat                 # serve for Gym

Pattern 7: Distributed Evaluation on SLURM#

User story: “I need to evaluate a 14k-problem benchmark with n=8 repeats. Running serially would take days. I want to shard across 16 SLURM nodes and merge results.”

Architecture#

                           SLURM Controller
                                 │
                    sbatch --array=0-15
                    ┌────────────┼────────────┐
                    ▼            ▼             ▼
               Shard 0      Shard 1  ...  Shard 15
              [p0-p874]    [p875-p1749]  [p13125-p13999]
            nel eval run   nel eval run   nel eval run
                    │            │             │
                    └────────────┼────────────┘
                                 ▼
                          Merge Job
                     (afterok dependency)
                                 │
                                 ▼
                     merged/eval-*.json
                     merged/results.jsonl
                     merged/trajectories.jsonl

Config-Driven SLURM#

SLURM evaluations are driven by a YAML config with a cluster: { type: slurm } block:

# slurm_eval.yaml
services:
  model:
    type: vllm
    model: nvidia/Llama-3.1-70B-Instruct
    protocol: chat_completions
    tensor_parallel_size: 4
    port: 8000
    node_pool: compute

benchmarks:
  - name: gsm8k
    repeats: 8
    solver:
      type: simple
      service: model

cluster:
  type: slurm
  walltime: "04:00:00"
  shards: 16
  node_pools:
    compute:
      partition: batch
      nodes: 1
      ntasks_per_node: 1
      gres: "gpu:4"

output:
  dir: ./eval_results/gsm8k_distributed

# Submit via config (16 shards, auto-merge after completion)
nel eval run slurm_eval.yaml

# Generate scripts to inspect first (set submit: false in config)
nel eval run slurm_eval.yaml

Shard merging is handled automatically when all array tasks complete.

How Sharding Works#

Each SLURM array task gets SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT set automatically. nel eval run detects these (or NEL_SHARD_IDX/NEL_TOTAL_SHARDS) and computes its problem range:

14000 problems, 16 shards	Shard 0	Shard 1	…	Shard 15
Problem range	[0, 875)	[875, 1750)	…	[13125, 14000)

Each shard writes its own artifacts to shard_N/. The merge job combines all results, recomputes global metrics (pass@k, CI), and aggregates runtime stats.

Serve on SLURM#

For long-running Gym training, serve an environment as a SLURM service:

nel serve -b gsm8k --gym-compat --port 9090

# Wrap in an sbatch script for SLURM submission
# Allocated endpoint is written to eval_results/endpoint.txt
# Gym training nodes connect via: gym://$(cat eval_results/endpoint.txt)

Environment Variables#

Variable	Source	Purpose
`SLURM_ARRAY_TASK_ID`	SLURM	Shard index (0-based)
`SLURM_ARRAY_TASK_COUNT`	SLURM	Total shards
`NEL_SHARD_IDX`	Manual override	Same as above, for non-SLURM use
`NEL_TOTAL_SHARDS`	Manual override	Same as above
`NEMO_API_KEY`	User	Model API key (avoid passing on CLI)

Pattern 8: Docker / Docker Compose#

User story: “I want to run evaluations in containers for reproducibility, or spin up a serve+eval stack locally.”

# Build the image
docker build -t nemo-evaluator .

# Single eval
docker run -e NEMO_API_KEY=$KEY nemo-evaluator eval run \
    --bench gsm8k --repeats 2 --output-dir /results

# Serve + eval (docker compose)
cd deploy
NEMO_API_KEY=$KEY docker compose up serve eval-remote

# Sharded (manually, 4 shards)
for i in 0 1 2 3; do
  NEL_SHARD_IDX=$i NEL_TOTAL_SHARDS=4 docker compose \
    --profile sharded run -d eval-shard
done

Files: Dockerfile, deploy/docker-compose.yaml

Pattern 9: Kubernetes#

User story: “I need to run evaluations on our K8s cluster, with sharded jobs for large benchmarks and a persistent serve endpoint for Gym training.”

Single Evaluation Job#

kubectl apply -f deploy/k8s/eval-job.yaml
kubectl logs -f job/nel-eval

Sharded (Indexed Job)#

Uses K8s completionMode: Indexed – each pod gets JOB_COMPLETION_INDEX mapped to NEL_SHARD_IDX.

kubectl apply -f deploy/k8s/eval-indexed-job.yaml
# 8 pods run in parallel, each evaluating its shard
kubectl wait --for=condition=complete job/nel-eval-sharded --timeout=2h
# Then run the merge job
kubectl apply -f deploy/k8s/eval-merge.yaml

Serve as K8s Service#

kubectl apply -f deploy/k8s/serve-deployment.yaml
# Gym training pods connect via: gym://nel-serve.default.svc:9090

Includes readiness/liveness probes on /health, ClusterIP service for internal discovery.

Files: deploy/k8s/eval-job.yaml, deploy/k8s/eval-indexed-job.yaml, deploy/k8s/serve-deployment.yaml

Pattern 10: Ray (Distributed)#

User story: “Our Gym training already runs on Ray. I want to run distributed evaluation on the same Ray cluster.”

# Submit as a Ray job
ray job submit --working-dir . -- python -m nemo_evaluator.engine.ray_launcher \
    --bench gsm8k --shards 8 --repeats 5 \
    --model-url https://integrate.api.nvidia.com/v1 \
    --model-id azure/openai/gpt-5.2 \
    --output-dir ./eval_results/ray

# Or from within a Ray script
import ray
from deploy.ray_eval import run_shard
futures = [run_shard.remote("gsm8k", i, 8, ...) for i in range(8)]
results = ray.get(futures)

Each run_shard is a Ray remote task that runs run_evaluation() with a computed problem_range. Results are merged locally after all tasks complete. Works on existing Ray clusters used by Gym training.

File: src/nemo_evaluator/engine/ray_launcher.py

Pattern 11: GitLab CI Regression Gate#

User story: “I want every MR to automatically evaluate the candidate model against the baseline and block merge if scores regress.”

# Include in your .gitlab-ci.yml:
include:
  - local: deploy/gitlab-ci-eval.yml

Pipeline stages:

eval:baseline – checks out target branch, runs eval
eval:candidate – runs eval on MR branch
regression:check – compares bundles, fails if any metric drops >5%

Produces regression.json artifact with per-metric deltas, CI overlap, and category breakdowns.

File: deploy/gitlab-ci-eval.yml

Deployment Matrix#

Target	Eval	Serve	Sharded	Merge	Config
Local	`nel eval run`	`nel serve`	`NEL_SHARD_IDX` env	automatic	CLI flags
SLURM	`nel eval run config.yaml`	`nel serve`	`--array`	automatic	YAML config
Docker	`docker run`	`compose up serve`	`compose --profile sharded`	automatic	`docker-compose.yaml`
Kubernetes	`Job`	`Deployment+Service`	`Indexed Job`	follow-up `Job`	K8s manifests
Ray	`ray job submit`	N/A (use K8s)	`ray.remote` tasks	in-process	Python
GitLab CI	pipeline job	N/A	N/A	regression stage	`.gitlab-ci.yml`

All targets use the same nel eval run / nel serve commands, same sharding env vars (NEL_SHARD_IDX, NEL_TOTAL_SHARDS), and same artifact format. The only difference is the orchestration layer.