# Evaluation Model
NeMo Evaluator combines multiple evaluation approaches with OpenAI-compatible endpoint support to enable comprehensive AI model assessment.
## Evaluation Approaches
NeMo Evaluator supports several evaluation approaches through containerized harnesses:
- **Text Generation**: Models generate responses to prompts, assessed for correctness or quality against reference answers or rubrics.
- **Log Probability**: Models assign probabilities to token sequences, enabling confidence measurement without generating text. Effective for choice-based tasks and base model evaluation (see the sketch after this list).
- **Code Generation**: Models generate code from natural language descriptions, evaluated for correctness through test execution.
- **Function Calling**: Models generate structured outputs for tool use and API interaction scenarios.
- **Retrieval Augmented Generation**: Models retrieve content relevant to a query, evaluated for the relevance and coverage of the retrieved content as well as the correctness of the final answer.
- **Visual Understanding**: Models generate responses to prompts that include images or videos, assessed for correctness or quality against reference answers or rubrics.
- **Agentic Workflows**: Models are given complex problems and must select and invoke tools autonomously.
- **Safety & Security**: Models are evaluated against adversarial prompts and safety benchmarks to test alignment and robustness.
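As an illustration of the log-probability approach, the following sketch scores each answer choice of a multiple-choice question by the log-likelihood an OpenAI-compatible `/v1/completions` endpoint assigns to it. The base URL, model name, and the use of `echo` together with `logprobs` are assumptions about the serving backend, not part of NeMo Evaluator itself; the actual harnesses implement their own scoring logic.

```python
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed OpenAI-compatible server
MODEL = "my-base-model"                # placeholder model name

QUESTION = "The capital of France is"
CHOICES = [" Paris", " Berlin", " Madrid"]

def choice_loglikelihood(prompt: str, choice: str) -> float:
    """Sum the log-probabilities of the tokens that make up `choice`, given `prompt`.

    Relies on the server supporting `echo=True` with `logprobs`, which returns
    log-probabilities for the prompt tokens as well. Some, but not all,
    OpenAI-compatible backends support this combination.
    """
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={
            "model": MODEL,
            "prompt": prompt + choice,
            "max_tokens": 0,   # generate nothing; only score the echoed prompt
            "echo": True,
            "logprobs": 1,
        },
        timeout=60,
    ).json()
    lp = resp["choices"][0]["logprobs"]
    start = len(prompt)
    # Keep only tokens whose character offset falls inside the choice text.
    return sum(
        logprob
        for offset, logprob in zip(lp["text_offset"], lp["token_logprobs"])
        if offset >= start and logprob is not None
    )

scores = {c: choice_loglikelihood(QUESTION, c) for c in CHOICES}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

The choice with the highest summed log-probability is taken as the model's answer, so no free-form generation or answer parsing is required.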
One or more evaluation harnesses implement each approach. To discover the available tasks for each approach, run `nemo-evaluator-launcher ls tasks`.
## Endpoint Compatibility
NeMo Evaluator targets OpenAI-compatible API endpoints. The platform supports the following endpoint types:
- `completions`: Direct text completion without chat formatting (`/v1/completions`). Used for base models and academic benchmarks.
- `chat`: Conversational interface with role-based messages (`/v1/chat/completions`). Used for instruction-tuned and chat models (example request below).
- `vlm`: Vision-language model endpoints supporting image inputs.
- `embedding`: Embedding generation endpoints for retrieval evaluation.
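For reference, a minimal request against a `chat`-type endpoint follows the standard OpenAI chat completions format. The base URL, API key, and model name below are placeholders; substitute the values for your own deployment.

```python
import requests

BASE_URL = "http://localhost:8000/v1"  # placeholder OpenAI-compatible endpoint
API_KEY = "not-needed-for-local"       # placeholder; hosted endpoints require a real key
MODEL = "my-chat-model"                # placeholder model name

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize the theory of relativity in one sentence."},
        ],
        "temperature": 0.0,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```

A `completions`-type endpoint accepts the same style of request at `/v1/completions`, with a plain `prompt` string in place of the `messages` array.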
Each evaluation task specifies which endpoint types it supports. Verify compatibility using `nemo-evaluator-launcher ls tasks`.
## Metrics
Individual evaluation harnesses define metrics that vary by task. Common metric types include:
- **Accuracy metrics**: Exact match, normalized accuracy, F1 scores (see the sketch below)
- **Generative metrics**: BLEU, ROUGE, code execution pass rates
- **Probability metrics**: Perplexity, log-likelihood scores
- **Safety metrics**: Refusal rates, toxicity scores, vulnerability detection
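To make the accuracy category concrete, the sketch below computes exact match and a token-level F1 score in the style common to question-answering benchmarks. This is an illustrative re-implementation, not the code any particular harness uses; text normalization rules in real benchmarks are typically stricter and benchmark-specific.

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace. Real harnesses usually also strip
    punctuation and articles (benchmark-specific)."""
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                              # 1.0
print(round(token_f1("the city of Paris", "Paris France"), 3))    # 0.333
```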
The platform returns results in a standardized schema regardless of the source harness. To see metrics for a specific task, refer to Benchmark Catalog or run an evaluation and inspect the results.
For hands-on guides, refer to Run Evaluations.