# Architecture Overview
NeMo Evaluator provides a two-tier architecture for comprehensive model evaluation:
```mermaid
graph TB
    subgraph Tier2["Orchestration Layer"]
        Launcher["nemo-evaluator-launcher<br/>• CLI orchestration<br/>• Multi-backend execution (local, Slurm, Lepton)<br/>• Deployment management (vLLM, NIM, SGLang)<br/>• Result export (MLflow, W&B, Google Sheets)"]
    end
    subgraph Tier1["Evaluation Engine"]
        Evaluator["nemo-evaluator<br/>• Adapter system<br/>• Interceptor pipeline<br/>• Containerized evaluation execution<br/>• Result aggregation"]
    end
    subgraph External["NVIDIA Eval Factory Containers"]
        Containers["Evaluation Frameworks<br/>• nvidia-lm-eval (lm-evaluation-harness)<br/>• nvidia-simple-evals<br/>• nvidia-bfcl, nvidia-bigcode-eval<br/>• nvidia-eval-factory-garak<br/>• nvidia-safety-harness"]
    end
    Launcher --> Evaluator
    Evaluator --> Containers
    style Tier2 fill:#e1f5fe
    style Tier1 fill:#f3e5f5
    style External fill:#fff3e0
```
## Component Overview
### Orchestration Layer (`nemo-evaluator-launcher`)
High-level orchestration for complete evaluation workflows.
Key Features:

- CLI and YAML configuration management
- Multi-backend execution (local, Slurm, Lepton)
- Deployment management (vLLM, NIM, SGLang, or bring-your-own-endpoint)
- Result export to MLflow, Weights & Biases, and Google Sheets
- Job monitoring and lifecycle management

Use Cases:

- Automated evaluation pipelines (see the sketch after this list)
- HPC cluster evaluations with Slurm
- Cloud deployments with Lepton AI
- Multi-model comparative studies
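As a sketch of the automated-pipeline use case, the snippet below drives the launcher CLI from Python to evaluate several checkpoints with one config. It uses only the flags demonstrated in the integration patterns later on this page (`run`, `--config-dir`, `--config-name`, `-o`); the checkpoint paths are placeholders.

```python
import subprocess

# Placeholder checkpoints to compare; substitute real paths.
checkpoints = ["/path/to/model-a", "/path/to/model-b"]

for ckpt in checkpoints:
    # Each call mirrors the `nemo-evaluator-launcher run` pattern shown
    # under "Integration Patterns" below.
    subprocess.run(
        [
            "nemo-evaluator-launcher", "run",
            "--config-dir", "packages/nemo-evaluator-launcher/examples",
            "--config-name", "local_llama_3_1_8b_instruct",
            "-o", f"deployment.checkpoint_path={ckpt}",
            "-o", 'evaluation.tasks=["mmlu_pro", "gsm8k"]',
        ],
        check=True,  # fail fast if a run errors out
    )
```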
### Evaluation Engine (`nemo-evaluator`)
Core evaluation capabilities with request/response processing.
Key Features:

- Adapter System: Request/response processing layer for API endpoints
- Interceptor Pipeline: Modular components for logging, caching, and reasoning
- Containerized Execution: Evaluation harnesses run in Docker containers
- Result Aggregation: Standardized result schemas and metrics

Use Cases:

- Programmatic evaluation integration
- Request/response transformation and logging
- Custom interceptor development
- Direct Python API usage
## Interceptor Pipeline
The evaluation engine routes every API call through an interceptor pipeline: a chain of configurable components that each process the request on its way to the endpoint and the response on its way back.
```mermaid
graph LR
    A[Request] --> B[System Message]
    B --> C[Payload Modifier]
    C --> D[Request Logging]
    D --> E[Caching]
    E --> F[API Endpoint]
    F --> G[Response Logging]
    G --> H[Reasoning]
    H --> I[Response Stats]
    I --> J[Response]
    style E fill:#e1f5fe
    style F fill:#f3e5f5
```
Available Interceptors:

- System Message: Inject system prompts into chat requests
- Payload Modifier: Transform request parameters
- Request/Response Logging: Log requests and responses to files
- Caching: Cache responses to avoid redundant API calls
- Reasoning: Extract chain-of-thought content from responses
- Response Stats: Track token usage and latency metrics
- Progress Tracking: Monitor evaluation progress
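To make the pipeline pattern concrete, here is a minimal, self-contained sketch of chained interceptors in plain Python. Everything in it (the `Request`/`Response` dataclasses, the `Interceptor` protocol, `run_pipeline`) is illustrative only and is not the nemo-evaluator API, which configures interceptors declaratively; consult the package for the real interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Request:
    payload: dict


@dataclass
class Response:
    payload: dict


class Interceptor(Protocol):
    """An interceptor may rewrite the request, the response, or both."""
    def on_request(self, request: Request) -> Request: ...
    def on_response(self, response: Response) -> Response: ...


class SystemMessage:
    """Prepends a system prompt to a chat-style request."""

    def __init__(self, content: str) -> None:
        self.content = content

    def on_request(self, request: Request) -> Request:
        request.payload.setdefault("messages", []).insert(
            0, {"role": "system", "content": self.content}
        )
        return request

    def on_response(self, response: Response) -> Response:
        return response  # no response-side behavior


class ResponseStats:
    """Accumulates token-usage counters from each response."""

    def __init__(self) -> None:
        self.total_tokens = 0

    def on_request(self, request: Request) -> Request:
        return request  # no request-side behavior

    def on_response(self, response: Response) -> Response:
        usage = response.payload.get("usage", {})
        self.total_tokens += usage.get("total_tokens", 0)
        return response


def run_pipeline(
    interceptors: list[Interceptor],
    endpoint: Callable[[Request], Response],
    request: Request,
) -> Response:
    # Request-side hooks run in order on the way to the endpoint...
    for interceptor in interceptors:
        request = interceptor.on_request(request)
    response = endpoint(request)
    # ...and response-side hooks run on the way back.
    for interceptor in interceptors:
        response = interceptor.on_response(response)
    return response


# Example wiring with a stub endpoint:
stats = ResponseStats()
pipeline = [SystemMessage("You are a helpful assistant."), stats]
stub = lambda req: Response({"usage": {"total_tokens": 42}})
run_pipeline(pipeline, stub, Request({"messages": []}))
print(stats.total_tokens)  # 42
```

The same shape underlies the diagram above: request-side interceptors (system message, payload modifier, logging, caching) run before the endpoint call, and response-side interceptors (logging, reasoning, stats) run after it.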
## Integration Patterns

### Pattern 1: Launcher with Deployment
Use the launcher to handle both model deployment and evaluation:
```bash
nemo-evaluator-launcher run \
  --config-dir packages/nemo-evaluator-launcher/examples \
  --config-name local_llama_3_1_8b_instruct \
  -o deployment.checkpoint_path=/path/to/model \
  -o 'evaluation.tasks=["mmlu_pro", "gsm8k"]'
```
### Pattern 2: Launcher with Existing Endpoint
Point the launcher to an existing API endpoint:
```bash
nemo-evaluator-launcher run \
  --config-dir packages/nemo-evaluator-launcher/examples \
  --config-name local_llama_3_1_8b_instruct \
  -o target.api_endpoint.url=http://localhost:8080/v1/completions \
  -o deployment.type=none
```
### Pattern 3: Python API
Use the Python API for programmatic integration:
```python
from nemo_evaluator import evaluate, EvaluationConfig, EvaluationTarget, ApiEndpoint, EndpointType

# Configure the target endpoint
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions",
    type=EndpointType.COMPLETIONS,
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure the evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
)

# Run the evaluation
results = evaluate(eval_cfg=eval_config, target_cfg=target)
```
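The call runs the configured benchmark against the target endpoint and writes its artifacts under `output_dir`; the returned `results` object can then be inspected programmatically, which is what makes this pattern suitable for embedding evaluation in larger workflows.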