Key Features#

NeMo Evaluator delivers comprehensive AI model evaluation through a dual-library architecture that scales from local development to enterprise production. It provides container-first reproducibility, multi-backend execution, and 100+ benchmarks across 17 evaluation harnesses.

Unified Orchestration (NeMo Evaluator Launcher)#

Multi-Backend Execution#

Run evaluations anywhere with unified configuration and monitoring:

  • Local Execution: Docker-based evaluation on your workstation

  • HPC Clusters: Slurm integration for large-scale parallel evaluation

  • Cloud Platforms: Lepton AI and custom cloud backend support

  • Hybrid Workflows: Mix local development with cloud production

# Single command, multiple backends
nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct
nv-eval run --config-dir examples --config-name slurm_llama_3_1_8b_instruct  
nv-eval run --config-dir examples --config-name lepton_vllm_llama_3_1_8b_instruct

100+ Benchmarks Across 17 Harnesses#

Access a comprehensive benchmark suite from a single CLI:

# Discover available benchmarks
nv-eval ls tasks

# Run academic benchmarks
nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \
  -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]'

# Run safety evaluation
nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \
  -o 'evaluation.tasks=["aegis_v2", "garak"]'

Built-in Result Export#

First-class integration with MLOps platforms:

# Export to MLflow
nv-eval export <invocation_id> --dest mlflow

# Export to Weights & Biases
nv-eval export <invocation_id> --dest wandb

# Export to Google Sheets
nv-eval export <invocation_id> --dest gsheets

Core Evaluation Engine (NeMo Evaluator Core)#

Container-First Architecture#

Pre-built NGC containers guarantee reproducible results across environments:

Container                    Benchmarks                     Use Case
simple-evals                 MMLU Pro, GSM8K, ARC           Academic benchmarks
lm-evaluation-harness        HellaSwag, TruthfulQA, PIQA    Language model evaluation
bigcode-evaluation-harness   HumanEval, MBPP, APPS          Code generation
safety-harness               Toxicity, bias, jailbreaking   Safety assessment
vlmevalkit                   VQA, image captioning          Vision-language models
agentic_eval                 Tool usage, planning           Agentic AI evaluation

# Pull and run any evaluation container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:25.08.1

Advanced Adapter System#

The adapter system routes each request and response through a configurable pipeline of interceptors:

# Configure adapter system in framework YAML configuration
target:
  api_endpoint:
    url: "http://localhost:8080/v1/completions/"
    model_id: "my-model"
    adapter_config:
      interceptors:
        # System message interceptor
        - name: system_message
          config:
            system_message: "You are a helpful AI assistant. Think step by step."
        
        # Request logging interceptor
        - name: request_logging
          config:
            max_requests: 1000
        
        # Caching interceptor
        - name: caching
          config:
            cache_dir: "./evaluation_cache"
        
        # Reasoning interceptor
        - name: reasoning
          config:
            start_reasoning_token: "<think>"
            end_reasoning_token: "</think>"
        
        # Response logging interceptor
        - name: response_logging
          config:
            max_responses: 1000
        
        # Progress tracking interceptor
        - name: progress_tracking

Programmatic API#

Full Python API for integration into ML pipelines:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, EvaluationTarget

# Configure and run evaluation programmatically
result = evaluate(
    eval_cfg=EvaluationConfig(type="mmlu_pro", output_dir="./results"),
    target_cfg=EvaluationTarget(api_endpoint=endpoint_config)
)
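
The endpoint_config passed above describes the model under test. A minimal sketch of constructing it, assuming an ApiEndpoint dataclass is exported from the same api_dataclasses module; the field names shown are assumptions, so verify them against the API reference for your installed version:

from nemo_evaluator.api.api_dataclasses import ApiEndpoint

# Hypothetical endpoint definition; field names are assumptions, check the
# api_dataclasses reference for your installed version.
endpoint_config = ApiEndpoint(
    url="http://localhost:8080/v1/chat/completions",
    model_id="my-model",
    type="chat",
)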

Container Direct Access#

NGC Container Catalog#

Direct access to specialized evaluation containers:

# Academic benchmarks
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:25.08.1

# Code generation evaluation  
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1

# Safety and security testing
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/safety-harness:25.08.1

# Vision-language model evaluation
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/vlmevalkit:25.08.1

Reproducible Evaluation Environments#

Every container provides:

  • Fixed dependencies: Locked versions for consistent results

  • Pre-configured frameworks: Ready-to-run evaluation harnesses

  • Isolated execution: No dependency conflicts between evaluations

  • Version tracking: Tagged releases for exact reproducibility (see the provenance sketch below)
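
The container tag already pins a release; if you also want to record the immutable image digest next to your results for audit purposes, a small helper along these lines works (plain Docker tooling plus the Python standard library, not a NeMo Evaluator API; assumes the image has been pulled locally):

# Record exactly which container image produced a set of results.
# Assumes the Docker CLI is installed and the image is already pulled.
import json
import subprocess

image = "nvcr.io/nvidia/eval-factory/simple-evals:25.08.1"

digest = subprocess.run(
    ["docker", "inspect", "--format", "{{index .RepoDigests 0}}", image],
    capture_output=True, text=True, check=True,
).stdout.strip()

with open("container_provenance.json", "w") as f:
    json.dump({"image": image, "digest": digest}, f, indent=2)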

Enterprise Features#

Multi-Backend Scalability#

Scale from laptop to datacenter with unified configuration:

  • Local Development: Quick iteration with Docker

  • HPC Clusters: Slurm integration for large-scale evaluation

  • Cloud Platforms: Lepton AI and custom backend support

  • Hybrid Workflows: Seamless transition between environments

Advanced Configuration Management#

Hydra-based configuration with full reproducibility:

# Evaluation configuration with overrides
evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 1000
    - name: gsm8k
      overrides:
        config.params.temperature: 0.0

execution:
  output_dir: results

target:
  api_endpoint:
    url: https://my-model-endpoint.com/v1/chat/completions
    model_id: my-custom-model

OpenAI API Compatibility#

Universal Model Support#

Evaluate any model that exposes OpenAI-compatible endpoints (a quick connectivity check is sketched after the list below):

  • Hosted Models: NVIDIA Build, OpenAI, Anthropic, Cohere

  • Self-Hosted: vLLM, TRT-LLM, NeMo Framework

  • Custom Endpoints: Any service implementing OpenAI API spec
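
Before launching a full evaluation, it can help to confirm that an endpoint actually speaks the OpenAI protocol. A minimal sketch using the openai Python client; the base URL, API key, and model name are placeholders for your own deployment:

# Quick connectivity check against an OpenAI-compatible endpoint.
# base_url, api_key, and model are placeholders; substitute your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed-for-most-local-servers",
)

response = client.chat.completions.create(
    model="my-custom-model",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(response.choices[0].message.content)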

Endpoint Type Support#

Support for diverse evaluation endpoint types through the evaluation configuration:

# Text generation evaluation (chat endpoint)
target:
  api_endpoint:
    type: chat
    url: https://api.example.com/v1/chat/completions

# Log-probability evaluation (completions endpoint)
target:
  api_endpoint:
    type: completions
    url: https://api.example.com/v1/completions

# Vision-language evaluation (vlm endpoint)
target:
  api_endpoint:
    type: vlm
    url: https://api.example.com/v1/chat/completions

Extensibility and Customization#

Custom Framework Support#

Add your own evaluation frameworks using Framework Definition Files:

# custom_framework.yml
framework:
  name: my_custom_eval
  description: Custom evaluation for domain-specific tasks
  
defaults:
  command: >-
    python custom_eval.py --model {{target.api_endpoint.model_id}}
    --task {{config.params.task}} --output {{config.output_dir}}
    
evaluations:
  - name: domain_specific_task
    description: Evaluate domain-specific capabilities
    defaults:
      config:
        params:
          task: domain_task
          temperature: 0.0

Advanced Interceptor Configuration#

Fine-tune request/response processing with the adapter system through YAML configuration:

# Production-ready adapter configuration in framework YAML
target:
  api_endpoint:
    url: "https://production-api.com/v1/completions"
    model_id: "production-model"
    adapter_config:
      log_failed_requests: true
      interceptors:
        # System message interceptor
        - name: system_message
          config:
            system_message: "You are an expert AI assistant specialized in this domain."
        
        # Request logging interceptor
        - name: request_logging
          config:
            max_requests: 5000
        
        # Caching interceptor
        - name: caching
          config:
            cache_dir: "./production_cache"
        
        # Reasoning interceptor
        - name: reasoning
          config:
            start_reasoning_token: "<think>"
            end_reasoning_token: "</think>"
        
        # Response logging interceptor
        - name: response_logging
          config:
            max_responses: 5000
        
        # Progress tracking interceptor
        - name: progress_tracking
          config:
            progress_tracking_url: "http://monitoring.internal:3828/progress"

Security and Safety#

Comprehensive Safety Evaluation#

Built-in safety assessment through specialized containers:

# Run safety evaluation suite
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["aegis_v2", "garak"]'

Safety Containers Available:

  • safety-harness: Content safety evaluation using NemoGuard judge models

  • garak: Security vulnerability scanning and prompt injection detection

  • agentic_eval: Tool usage and planning evaluation for agentic AI systems

Monitoring and Observability#

Real-Time Progress Tracking#

Monitor evaluation progress across all backends:

# Check evaluation status
nv-eval status <invocation_id>

# Kill running evaluations
nv-eval kill <invocation_id>

Result Export and Analysis#

Export evaluation results to MLOps platforms for downstream analysis:

# Export to MLflow for experiment tracking
nv-eval export <invocation_id> --dest mlflow

# Export to Weights & Biases for visualization
nv-eval export <invocation_id> --dest wandb

# Export to Google Sheets for sharing
nv-eval export <invocation_id> --dest gsheets