Architecture#
System Overview#
flowchart TB
subgraph CLI["CLI (nel)"]
RUN["nel eval run"]
SERVE["nel serve"]
VALIDATE["nel validate"]
REPORT["nel eval report"]
COMPARE["nel compare"]
GATE["nel gate"]
end
subgraph ENVS["Environments"]
REG["Registry"]
BYOB["@benchmark + @scorer"]
GYM["GymEnvironment"]
SKILLS["SkillsEnvironment"]
LMEVAL["LMEvalEnvironment"]
VLM["VLMEvalKitEnvironment"]
end
subgraph ENGINE["Engine"]
LOOP["eval_loop"]
SOLVER["Solver"]
MC["ModelClient"]
end
subgraph CFG["Config"]
EVALCFG["EvalConfig"]
COMPOSE["compose"]
end
subgraph ORCH["Orchestration"]
DEPLOY["ModelServer"]
LOCALRUN["Orchestrator"]
SLURMGEN["SlurmGen"]
end
subgraph EXEC["Executors"]
LOCAL["LocalExecutor"]
DOCKER["DockerExecutor"]
SLURM["SlurmExecutor"]
end
subgraph SAND["Sandbox"]
SBASE["Sandbox Protocol"]
SDOCK["DockerSandbox"]
SSLURM["SlurmSandbox"]
SLOCAL["LocalSandbox"]
SECS["ECSFargateSandbox"]
SMGR["SandboxManager"]
end
subgraph PROXY["Proxy"]
LITELLM["LiteLLM Proxy"]
INTERCEPT["Interceptors"]
end
subgraph OBS["Observability"]
TYPES["StepRecord<br/>ModelResponse"]
COLLECT["ArtifactCollector"]
PROG["ConsoleProgress"]
end
subgraph SCORE["Scoring"]
JUDGE["judge.py"]
JSCHEMA["json_schema.py"]
end
RUN --> EVALCFG
RUN --> EXEC
EXEC --> ORCH
ORCH --> DEPLOY
ORCH --> LOOP
LOOP --> SMGR
SERVE --> REG
VALIDATE --> LOOP
LOOP --> REG
LOOP --> SOLVER
SOLVER --> MC
LOOP --> COLLECT
LOOP --> PROG
REG --> BYOB
REG --> GYM
REG --> SKILLS
REG --> LMEVAL
REG --> VLM
LOOP --> JUDGE
LOOP --> LITELLM
LITELLM --> INTERCEPT
style CLI fill:#e8eaf6
style ENVS fill:#e8f5e9
style ENGINE fill:#fff3e0
style CFG fill:#e8eaf6
style ORCH fill:#fff8e1
style EXEC fill:#f3e5f5
style OBS fill:#fce4ec
style SAND fill:#e0f7fa
style SCORE fill:#fff9c4
style PROXY fill:#e1f5fe
Package Map#
| Package | Responsibility | Key types |
|---|---|---|
| | Base class, registry | |
| | 15 built-in benchmarks, scorer functions | |
| | Pluggable inference strategies | |
| | Core eval loop, model client, checkpoint, comparison, export plugins | |
| | Pydantic config schemas, env expansion, YAML composition | |
| | Suite orchestration, model server management, SLURM generation, proxy lifecycle | |
| | HTTP server for environments (evaluator + Gym protocol) | |
| | Executor protocol and backends (local, Docker, SLURM) | |
| | Per-problem isolated execution and strategy patterns | |
| | LiteLLM proxy callback plugins for request/response interception | |
| | Verification scorers, judge pipeline, JSON schema | |
| | Rich telemetry capture | |
| | Statistical aggregation | |
| | CLI commands | |
Environment Abstraction#
Everything is an EvalEnvironment. Built-in benchmarks, remote Gym servers, NeMo Skills tasks, lm-eval harness tasks, and VLMEvalKit benchmarks all implement the same contract:
classDiagram
class EvalEnvironment {
<<abstract>>
+str name
+seed(idx) SeedResult
+verify(response, expected, sandbox?) VerifyResult
+sandbox_specs() list~SandboxSpec~ | None
+dataset_size() int
+close()
}
class ByobEnvironment {
+BenchmarkDefinition definition
}
class GymEnvironment {
+str endpoint
}
class SkillsEnvironment {
+str benchmark
+str eval_type
}
class LMEvalEnvironment {
+str task_name
}
class VLMEvalKitEnvironment {
+str dataset_name
}
class ContainerEnvironment {
+str image_uri
+str task_name
}
EvalEnvironment <|-- ByobEnvironment
EvalEnvironment <|-- GymEnvironment
EvalEnvironment <|-- SkillsEnvironment
EvalEnvironment <|-- LMEvalEnvironment
EvalEnvironment <|-- VLMEvalKitEnvironment
EvalEnvironment <|-- ContainerEnvironment
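To make the contract concrete, here is a minimal sketch of a custom environment that follows the abstract interface above. The SeedResult and VerifyResult dataclasses are stand-ins for the real types, and the field names are illustrative assumptions rather than the exact nemo_evaluator API.

```python
# Hypothetical sketch of an environment implementing the contract above.
# SeedResult/VerifyResult here are stand-ins, not the real nemo_evaluator types.
from dataclasses import dataclass, field


@dataclass
class SeedResult:
    prompt: str
    expected: str
    sandbox_spec: object | None = None


@dataclass
class VerifyResult:
    reward: float
    details: dict = field(default_factory=dict)


class EchoEnvironment:
    """Toy environment: the model must repeat a word verbatim."""

    name = "echo"
    _data = ["alpha", "bravo", "charlie"]

    def seed(self, idx: int) -> SeedResult:
        word = self._data[idx]
        return SeedResult(prompt=f"Repeat exactly: {word}", expected=word)

    def verify(self, response: str, expected: str, sandbox=None) -> VerifyResult:
        reward = 1.0 if expected in response else 0.0
        return VerifyResult(reward=reward, details={"expected": expected})

    def sandbox_specs(self):
        return None  # no per-problem sandbox needed

    def dataset_size(self) -> int:
        return len(self._data)

    def close(self) -> None:
        pass
```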
Resolution#
The registry resolves environment names in order:
1. URI scheme – lm-eval://task, skills://name, gym://host:port, gym://name, vlmevalkit://dataset, container://image#task
2. Built-in registry – names registered via @benchmark or @register
from nemo_evaluator import get_environment
env = get_environment("mmlu") # built-in
env = get_environment("lm-eval://aime25") # lm-eval task
env = get_environment("skills://gpqa") # NeMo Skills
env = get_environment("gym://localhost:9090") # remote Gym
env = get_environment("vlmevalkit://MMBench_DEV_EN") # VLMEvalKit
Evaluation Flow#
sequenceDiagram
participant CLI as nel eval run
participant Exec as Executor
participant Runner as Orchestrator
participant Deploy as ModelServer
participant Loop as eval_loop
participant Solver as Solver
participant Env as Environment
participant Obs as ArtifactCollector
CLI->>Exec: executor.run(config)
Exec->>Runner: run_local(config)
Runner->>Deploy: start()
Deploy-->>Runner: model_url
opt proxy configured
Runner->>Runner: start_proxy(model_url)
Note right of Runner: model_url rewritten to proxy URL
end
Runner->>Loop: run_evaluation(env, solver, config, sandbox_manager)
loop For each problem x repeat
Loop->>Env: seed(idx)
Env-->>Loop: SeedResult(prompt, expected, sandbox_spec?)
opt sandbox configured
Loop->>Loop: manager.acquire(spec)
end
Loop->>Solver: solve(task, sandbox?)
Solver-->>Loop: SolveResult(response)
Loop->>Env: verify(response, expected, sandbox?)
Env-->>Loop: VerifyResult(reward, details)
opt sandbox acquired
Loop->>Loop: manager.release(sandbox)
end
Loop->>Obs: record(StepRecord)
end
Loop-->>Runner: bundle
Runner->>Deploy: stop()
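In plain Python, the inner loop of this sequence looks roughly like the sketch below. The helper names (config.num_repeats, collector.record, the sandbox_manager methods) are assumptions that mirror the diagram, not the actual engine code.

```python
# Sketch of the eval loop from the sequence diagram above (assumed signatures).
async def run_evaluation(env, solver, config, sandbox_manager, collector):
    for idx in range(env.dataset_size()):
        for repeat in range(config.num_repeats):      # repeat count is an assumed field
            task = env.seed(idx)                       # Env.seed(idx) -> SeedResult

            sandbox = None
            if task.sandbox_spec is not None:          # opt: sandbox configured
                sandbox = sandbox_manager.acquire(task.sandbox_spec)

            try:
                result = await solver.solve(task)      # solver may also receive the sandbox
                verdict = env.verify(result.response, task.expected, sandbox)
            finally:
                if sandbox is not None:                # release even if solve/verify fails
                    sandbox_manager.release(sandbox)

            collector.record(                          # one StepRecord per step
                problem_idx=idx,
                repeat=repeat,
                reward=verdict.reward,
                details=verdict.details,
            )
```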
Solver Protocol#
Solvers decouple inference strategy from benchmark logic. The eval loop calls solver.solve(task) and receives a response. In YAML configs, each benchmark selects its solver via solver.type.
| Solver | Config | Protocol | Use case |
|---|---|---|---|
| | | | Standard benchmarks (default) |
| | | | Vision-language benchmarks |
| | | Harbor agent SDK | Agentic evaluation (OpenHands, SWE-agent) |
| | | NEL-driven ReAct loop | Full-observability tool use (Gym HTTP tools, sandbox tools, or both) |
| | | HTTP to nemo-gym | Delegate solve to gym server |
| | (via service) | SSE | NAT agent benchmarks |
| | | OpenClaw CLI | OpenClaw benchmarks |
class Solver(Protocol):
async def solve(self, task: SeedResult) -> SolveResult: ...
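A minimal solver that satisfies this protocol might look like the sketch below. The SolveResult dataclass and the model-client call are assumptions, not the real ModelClient API.

```python
# Hypothetical solver sketch: one chat completion per task.
from dataclasses import dataclass


@dataclass
class SolveResult:
    response: str


class SingleTurnSolver:
    """No tools, no retries: send the prompt once and return the completion."""

    def __init__(self, model_client):
        self.model_client = model_client

    async def solve(self, task) -> SolveResult:
        completion = await self.model_client.chat(          # assumed client method
            messages=[{"role": "user", "content": task.prompt}]
        )
        return SolveResult(response=completion.content)
```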
Executor Protocol#
All execution backends implement the Executor protocol (executors/__init__.py):
class Executor(Protocol):
name: str
def run(self, config, *, dry_run=False, resume=False,
background=False, submit=False) -> None: ...
def status(self, output_dir) -> ProcessState: ...
def stop(self, output_dir) -> bool: ...
@staticmethod
def detect(output_dir) -> bool: ...
The CLI dispatches via get_executor(config.cluster.type) and detect_executor(output_dir) – no if/elif trees.
| Executor | Config | Metadata file | What it does |
|---|---|---|---|
| | | | In-process eval with optional model deployment, checkpointing, and failure isolation |
| | | | Runs eval inside a Docker container with the correct per-harness image |
| | | | Generates self-contained sbatch scripts with per-benchmark containers |
SLURM uses node_pools to declare resource topology. Services and sandboxes reference pools by name, enabling heterogeneous jobs (e.g., GPU nodes for model serving + CPU nodes for sandboxes).
Adding a new executor (e.g. Kubernetes) requires only a new class, a metadata file convention, and a registry entry.
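A skeleton for such a backend is sketched below; the class name, metadata-file name, and method bodies are hypothetical, and only the protocol shape comes from the definition above.

```python
# Hypothetical skeleton for a new executor backend (Kubernetes as an example).
from pathlib import Path


class KubernetesExecutor:
    name = "kubernetes"
    METADATA_FILE = "k8s_job.json"  # assumed convention, mirroring the other backends

    def run(self, config, *, dry_run=False, resume=False,
            background=False, submit=False) -> None:
        ...  # render a Job manifest from config and submit it

    def status(self, output_dir) -> "ProcessState":
        ...  # read the metadata file and query the cluster

    def stop(self, output_dir) -> bool:
        ...  # delete the Job referenced by the metadata file

    @staticmethod
    def detect(output_dir) -> bool:
        # An output directory "belongs" to this backend if its metadata file exists.
        return (Path(output_dir) / KubernetesExecutor.METADATA_FILE).exists()
```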
# Local with external API
nel eval run --bench mmlu --model-url https://api.example.com/v1
# Docker
nel eval run config.yaml # with cluster.type: docker
# SLURM (generates + submits sbatch)
nel eval run config.yaml # with cluster.type: slurm
Model Deployment#
The orchestration/model_server.py module manages the model server lifecycle:
| Config | Internal class | Description |
|---|---|---|
| | | External API, no server management |
| | | Local vLLM process |
| | | Local SGLang process |
| | | Local TensorRT-LLM process |
All deployments implement start() -> url, health_wait(), and stop().
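The orchestrator's use of that lifecycle reduces to the pattern below. The method names come from the contract above; the surrounding function is only illustrative.

```python
# Sketch of how a model server is driven around an evaluation run.
def run_with_model_server(server, evaluate):
    """Start the server, wait for health, run the eval, always stop it."""
    url = server.start()        # start() -> url
    try:
        server.health_wait()    # block until the endpoint answers
        return evaluate(url)    # point the eval loop (or proxy) at the URL
    finally:
        server.stop()           # tear the process down even on failure
```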
Observability Data Model#
classDiagram
class StepRecord {
+str step_id
+int problem_idx
+int repeat
+str prompt
+str expected_answer
+ModelResponse model_response
+float reward
+str extracted_answer
+str scoring_method
+dict scoring_details
+float seed_ms
+float model_ms
+float verify_ms
+str failure_category
}
class ModelResponse {
+str content
+str model
+str finish_reason
+int prompt_tokens
+int completion_tokens
+int reasoning_tokens
+float latency_ms
+dict raw_response
}
class RuntimeStats {
+int total_steps
+int total_tokens
+float elapsed_seconds
+dict latency_percentiles_ms
+float tokens_per_second
+float steps_per_second
+int model_errors
}
class FailureReport {
+int total_failures
+float failure_rate
+dict categories
+list exemplars
}
StepRecord --> ModelResponse
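As an illustration of how these records roll up, the sketch below derives a few RuntimeStats-style aggregates from a list of StepRecord objects with the fields shown above; the function name and the input list are assumptions.

```python
# Sketch: deriving aggregates from StepRecord objects with the fields above.
def summarize(records, elapsed_seconds: float) -> dict:
    total_tokens = sum(
        r.model_response.prompt_tokens + r.model_response.completion_tokens
        for r in records
    )
    latencies = sorted(r.model_response.latency_ms for r in records)
    p50 = latencies[len(latencies) // 2] if latencies else 0.0
    return {
        "total_steps": len(records),
        "total_tokens": total_tokens,
        "tokens_per_second": total_tokens / elapsed_seconds if elapsed_seconds else 0.0,
        "steps_per_second": len(records) / elapsed_seconds if elapsed_seconds else 0.0,
        "latency_p50_ms": p50,
        "mean_reward": sum(r.reward for r in records) / len(records) if records else 0.0,
    }
```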
Resilience and Resume#
Multi-benchmark suites use CheckpointManager to track per-benchmark completion:
flowchart LR
START["nel eval run suite.yaml"] --> B1["Benchmark 1"]
B1 -->|"completed"| CKPT1["checkpoint: done"]
CKPT1 --> B2["Benchmark 2"]
B2 -->|"FAILED"| CKPT2["checkpoint: failed"]
CKPT2 --> B3["Benchmark 3"]
B3 -->|"completed"| CKPT3["checkpoint: done"]
CKPT3 --> SUMMARY["Summary: 2 done, 1 failed"]
SUMMARY --> RESUME["nel eval run suite.yaml --resume"]
RESUME --> B2R["Retry Benchmark 2"]
Failure isolation: A failing benchmark is caught, logged, and skipped. The suite continues to the next benchmark.
Checkpoint tracking: Each benchmark’s status (completed/failed) is persisted to disk under the output directory.
Resume: --resume skips completed benchmarks and retries failed ones. Without --resume, checkpoints are cleared for a fresh run.
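Put together, the resume behavior amounts to the loop sketched below. The CheckpointManager method names (clear, status, mark) are assumptions; the statuses and control flow match the description above.

```python
# Sketch of the suite loop with failure isolation and resume.
def run_suite(benchmarks, checkpoints, resume: bool, run_one):
    if not resume:
        checkpoints.clear()                      # fresh run: drop old checkpoints

    for bench in benchmarks:
        if resume and checkpoints.status(bench) == "completed":
            continue                             # --resume skips completed benchmarks
        try:
            run_one(bench)
            checkpoints.mark(bench, "completed")
        except Exception as err:                 # failure isolation: log and move on
            checkpoints.mark(bench, "failed")
            print(f"{bench} failed: {err}")
```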