# Legacy Evaluator Containers
Run any existing NeMo Evaluator harness container (simple-evals, lm-evaluation-harness, nemo-skills, mtbench, etc.) through NEL’s unified interface.
## How It Works

```mermaid
flowchart TB
    NEL["run_container_eval()"] --> CONFIG["Build config_ef.yaml"]
    CONFIG --> DOCKER["docker run<br/>eval-factory container"]
    DOCKER --> HARNESS["Harness runs<br/>(simple-evals, lm-eval, etc.)"]
    HARNESS --> OUTPUT["results.yml<br/>eval_factory_metrics.json"]
    OUTPUT --> PARSE["Parse & convert<br/>to NEL bundle"]

    subgraph "Container"
        HARNESS
        ADAPTER["AdapterServer<br/>(model proxy)"]
        HARNESS -->|model calls| ADAPTER
        ADAPTER -->|forward| MODEL["Model API"]
    end

    style NEL fill:#e1f5fe
    style HARNESS fill:#fff3e0
```
The container adapter does not decompose the harness. The harness owns the model call, which means you get aggregate scores and response stats from the adapter interceptors, but not per-request trajectories. For full observability, use the NeMo Skills Integration (native mode) instead.
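In outline, the flow above amounts to: write a config, run the image, collect the artifacts. The sketch below is illustrative only; the function name mirrors the diagram, but the config schema, mount point, and CLI invocation are assumptions, and the real `run_container_eval()` additionally routes model traffic through the AdapterServer proxy.

```python
# Minimal sketch of the diagrammed flow (not the real implementation).
# Assumes a docker CLI on PATH and PyYAML installed; the config keys and
# the /workspace mount point are hypothetical.
import json
import subprocess
import tempfile
from pathlib import Path

import yaml


def run_container_eval_sketch(image: str, task: str, model_url: str) -> dict:
    workdir = Path(tempfile.mkdtemp())

    # 1. Build config_ef.yaml: which task to run, where the model endpoint is.
    config = {"task": task, "target": {"api_endpoint": {"url": model_url}}}
    (workdir / "config_ef.yaml").write_text(yaml.safe_dump(config))

    # 2. docker run the eval-factory container with the workdir mounted in.
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{workdir}:/workspace", image],
        check=True,
    )

    # 3. Parse the artifacts the harness leaves behind.
    results = yaml.safe_load((workdir / "results.yml").read_text())
    metrics = json.loads((workdir / "eval_factory_metrics.json").read_text())
    return {"results": results, "metrics": metrics}
```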
## Available Harnesses

List available harnesses and their tasks:

```bash
nel list --source lm-eval
```
| Harness | Container | Example tasks |
|---|---|---|
| simple-evals | `nvcr.io/.../simple-evals` | AIME_2025, GPQA_diamond, MMLU |
| lm-evaluation-harness | `nvcr.io/.../lm-evaluation-harness` | ifeval, arc, hellaswag |
| nemo-skills | `nvcr.io/.../nemo-skills` | gsm8k, math, aime24 |
| mtbench | `nvcr.io/.../mtbench` | mt_bench |
| bfcl | `nvcr.io/.../bfcl` | bfcl_v3, bfcl_v4 |
| hle | `nvcr.io/.../hle` | hle |
| livecodebench | `nvcr.io/.../livecodebench` | livecodebench |
| scicode | `nvcr.io/.../scicode` | scicode |
| vlmevalkit | `nvcr.io/.../vlmevalkit` | VLM benchmarks |
| safety-harness | `nvcr.io/.../safety-harness` | Safety evals |
| helm | `nvcr.io/.../helm` | HELM benchmarks |
Plus 10+ more (see `nel list --source lm-eval`).
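Every row maps onto the same Python interface shown in the next section; only the image and task change. A hedged example of that uniformity (the second image tag is an assumption, mirroring the elided paths used elsewhere on this page):

```python
# Same interface, different harnesses; only image and task vary.
# The lm-evaluation-harness tag below is an assumed example, not verified.
from nemo_evaluator.environments.container import ContainerEnvironment

envs = [
    ContainerEnvironment(image="nvcr.io/.../simple-evals:26.01", task="GPQA_diamond"),
    ContainerEnvironment(image="nvcr.io/.../lm-evaluation-harness:26.01", task="ifeval"),
]
```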
## Python API

```python
import asyncio

from nemo_evaluator.environments.container import ContainerEnvironment
from nemo_evaluator.engine.eval_loop import run_evaluation
from nemo_evaluator.engine.model_client import ModelClient
from nemo_evaluator.solvers import ChatSolver


async def main():
    # Select a harness container image and one of its tasks.
    env = ContainerEnvironment(
        image="nvcr.io/.../simple-evals:26.01",
        task="GPQA_diamond",
    )

    # The model endpoint the harness will call (via the adapter proxy).
    client = ModelClient(
        base_url="https://integrate.api.nvidia.com/v1",
        model="azure/openai/gpt-5.2",
    )
    solver = ChatSolver(client)

    # run_evaluation is a coroutine, so it must be awaited.
    return await run_evaluation(env, solver)


bundle = asyncio.run(main())
```
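The returned bundle carries the aggregate scores described under Output Format below. A hedged inspection sketch; the key names follow that JSON example, but dict-style access on the bundle object is an assumption:

```python
# Print aggregate scores; key names mirror the Output Format example.
# Whether the bundle is a plain dict (assumed here) or a richer object
# depends on the NEL version.
for name, entry in bundle["scores"].items():
    stats = entry.get("stats", {})
    print(f"{name}: {entry['value']:.3f} "
          f"(stddev={stats.get('stddev')}, stderr={stats.get('stderr')})")
```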
## Output Format

The container adapter parses `results.yml` and `eval_factory_metrics.json` from the container output:
```json
{
  "source": "container",
  "image": "nvcr.io/.../simple-evals:26.01",
  "task": "simple_evals.AIME_2025",
  "scores": {
    "AIME_2025/score/micro": {"value": 0.4, "stats": {"stddev": 0.49, "stderr": 0.16}}
  },
  "runtime": {
    "elapsed_seconds": 120.5,
    "inference_time_seconds": 98.2,
    "scoring_time_seconds": 22.3
  },
  "response_stats": {
    "avg_latency_ms": 1250.0,
    "avg_prompt_tokens": 320,
    "avg_completion_tokens": 450,
    "count": 50,
    "successful_count": 48
  }
}
```
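Because container mode only reports these aggregates, any uncertainty estimate has to be derived from the summary stats. For instance, a normal-approximation 95% confidence interval from the reported stderr, using the numbers in the payload above:

```python
# 95% CI from the reported standard error (normal approximation).
# Values taken from the example payload above.
value, stderr = 0.4, 0.16

low, high = value - 1.96 * stderr, value + 1.96 * stderr
print(f"AIME_2025 micro: {value} (95% CI [{low:.3f}, {high:.3f}])")
# -> AIME_2025 micro: 0.4 (95% CI [0.086, 0.714])
```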
## Observability Trade-offs

| Feature | Container mode | Native mode (Skills/BYOB) |
|---|---|---|
| Aggregate scores | Yes | Yes |
| Response stats (avg latency, tokens) | Yes (from interceptors) | Yes |
| Per-request trajectories | No | Yes |
| Failure categorization | No | Yes |
| n_repeats with pass@k | No (harness controls) | Yes |
| Bootstrap CI (see the sketch below) | No | Yes |
| Scoring details per sample | No | Yes |
For full observability, see the NeMo Skills Integration, which wraps Skills benchmarks natively.