Legacy Evaluator Containers

Run any existing NeMo Evaluator harness container (simple-evals, lm-evaluation-harness, nemo-skills, mtbench, etc.) through NEL’s unified interface.

How It Works

flowchart TB
    NEL["run_container_eval()"] --> CONFIG["Build config_ef.yaml"]
    CONFIG --> DOCKER["docker run<br/>eval-factory container"]
    DOCKER --> HARNESS["Harness runs<br/>(simple-evals, lm-eval, etc.)"]
    HARNESS --> OUTPUT["results.yml<br/>eval_factory_metrics.json"]
    OUTPUT --> PARSE["Parse & convert<br/>to NEL bundle"]

    subgraph "Container"
        HARNESS
        ADAPTER["AdapterServer<br/>(model proxy)"]
        HARNESS -->|model calls| ADAPTER
        ADAPTER -->|forward| MODEL["Model API"]
    end

    style NEL fill:#e1f5fe
    style HARNESS fill:#fff3e0

The container adapter does not decompose the harness. The harness owns the model call, which means you get aggregate scores and response stats from the adapter interceptors, but not per-request trajectories. For full observability, use the NeMo Skills Integration (native mode) instead.
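In pseudocode, the diagram reduces to a few sequential steps. The sketch below is illustrative only: run_container_eval() appears in the diagram, but every helper is a hypothetical stand-in for an internal NEL step, stubbed out so the control flow is visible.

def build_config_ef_yaml(task: str) -> str:
    # Hypothetical stand-in: write the harness configuration (config_ef.yaml).
    return "config_ef.yaml"

def docker_run(image: str, config_path: str) -> None:
    # Hypothetical stand-in: start the container; the harness and the
    # AdapterServer model proxy both run inside it.
    ...

def parse_outputs() -> dict:
    # Hypothetical stand-in: read results.yml and eval_factory_metrics.json.
    return {"scores": {}, "response_stats": {}}

def run_container_eval(image: str, task: str) -> dict:
    config_path = build_config_ef_yaml(task)  # 1. Build config_ef.yaml
    docker_run(image, config_path)            # 2. docker run the harness image
    return parse_outputs()                    # 3. Parse and convert to a NEL bundle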

Available Harnesses

nel list --source lm-eval

| Harness | Container | Example tasks |
|---|---|---|
| simple_evals | nvcr.io/.../simple-evals:26.01 | AIME_2025, GPQA_diamond, MMLU |
| lm-evaluation-harness | nvcr.io/.../lm-evaluation-harness:26.01 | ifeval, arc, hellaswag |
| nemo_skills | nvcr.io/.../nemo-skills:26.01 | gsm8k, math, aime24 |
| mtbench | nvcr.io/.../mtbench:26.01 | mt_bench |
| bfcl | nvcr.io/.../bfcl:26.01 | bfcl_v3, bfcl_v4 |
| hle | nvcr.io/.../hle:26.01 | hle |
| livecodebench | nvcr.io/.../livecodebench:26.01 | livecodebench |
| scicode | nvcr.io/.../scicode:26.01 | scicode |
| vlmevalkit | nvcr.io/.../vlmevalkit:26.01 | VLM benchmarks |
| safety_eval | nvcr.io/.../safety-harness:26.01 | Safety evals |
| helm | nvcr.io/.../helm:26.01 | HELM benchmarks |

Plus 10+ more (see nel list --source lm-eval).
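Every harness in the table is driven the same way: pick an image and a task name the harness understands. As a sketch, targeting lm-evaluation-harness's ifeval task reuses the ContainerEnvironment shown under Python API below; only the image and task differ from that example:

from nemo_evaluator.environments.container import ContainerEnvironment

# Same environment class for every harness in the table; only the image
# and the harness-specific task name change.
env = ContainerEnvironment(
    image="nvcr.io/.../lm-evaluation-harness:26.01",
    task="ifeval",
)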

Python API

import asyncio

from nemo_evaluator.engine.eval_loop import run_evaluation
from nemo_evaluator.engine.model_client import ModelClient
from nemo_evaluator.environments.container import ContainerEnvironment
from nemo_evaluator.solvers import ChatSolver

async def main():
    # Select the harness image and the task it should run.
    env = ContainerEnvironment(
        image="nvcr.io/.../simple-evals:26.01",
        task="GPQA_diamond",
    )

    # The harness owns the model calls; the client tells the in-container
    # AdapterServer where to forward them.
    client = ModelClient(
        base_url="https://integrate.api.nvidia.com/v1",
        model="azure/openai/gpt-5.2",
    )
    solver = ChatSolver(client)

    # run_evaluation is a coroutine, so it needs a running event loop.
    return await run_evaluation(env, solver)

bundle = asyncio.run(main())

Output Format

The container adapter parses results.yml and eval_factory_metrics.json from the container output:

{
  "source": "container",
  "image": "nvcr.io/.../simple-evals:26.01",
  "task": "simple_evals.AIME_2025",
  "scores": {
    "AIME_2025/score/micro": {"value": 0.4, "stats": {"stddev": 0.49, "stderr": 0.16}}
  },
  "runtime": {
    "elapsed_seconds": 120.5,
    "inference_time_seconds": 98.2,
    "scoring_time_seconds": 22.3
  },
  "response_stats": {
    "avg_latency_ms": 1250.0,
    "avg_prompt_tokens": 320,
    "avg_completion_tokens": 450,
    "count": 50,
    "successful_count": 48
  }
}
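Because the bundle is plain JSON, downstream reporting is straightforward. A minimal sketch, assuming the bundle has been written to a file with exactly the fields shown above ("bundle.json" is a hypothetical path):

import json

# Load a container-mode bundle; "bundle.json" is a hypothetical file name.
with open("bundle.json") as f:
    bundle = json.load(f)

# Aggregate scores: name -> value plus spread statistics.
for name, score in bundle["scores"].items():
    stats = score.get("stats", {})
    print(f"{name}: {score['value']:.3f} (stderr {stats.get('stderr', 0.0):.3f})")

# Response stats come from the adapter interceptors, so failed requests are
# visible in aggregate even without per-request trajectories.
rs = bundle["response_stats"]
print(f"{rs['count']} requests, {rs['count'] - rs['successful_count']} failed, "
      f"avg latency {rs['avg_latency_ms']:.0f} ms")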

Observability Trade-offs

| Feature | Container mode | Native mode (Skills/BYOB) |
|---|---|---|
| Aggregate scores | Yes | Yes |
| Response stats (avg latency, tokens) | Yes (from interceptors) | Yes |
| Per-request trajectories | No | Yes |
| Failure categorization | No | Yes |
| n_repeats with pass@k | No (harness controls) | Yes |
| Bootstrap CI | No | Yes |
| Scoring details per sample | No | Yes |

For full observability, see the NeMo Skills Integration, which wraps Skills benchmarks natively.