About Evaluating#

Evaluation is powered by NeMo Platform, a cloud-native platform for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The evaluation API provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.

NeMo Platform enables real-time evaluation of your LLM application through APIs, helping you refine and optimize models for better performance and real-world applicability. The NeMo Evaluator APIs can be automated within development pipelines, enabling faster iteration without the need for live data. This approach is cost-effective and well suited to pre-deployment checks and regression testing.


Evaluation Concepts#

NeMo Platform supports two core evaluation primitives:

  • Metrics: Scoring logic that evaluates model outputs. Use metrics when you need flexible, reusable scoring for your own datasets and task-specific criteria.

  • Benchmarks: Evaluation suites that pair one or more metrics with a dataset. Use benchmarks when you want standardized comparisons or curated end-to-end evaluations, and use custom benchmarks when you want to apply your own metrics to your own datasets.
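To make the distinction concrete, the sketch below shows how a custom metric and a custom benchmark might be expressed as request payloads. The field names (`type`, `params`, `dataset`, `metrics`) and all identifiers are illustrative assumptions, not the exact NeMo Evaluator schema; see the API reference for the real format.

```python
# Illustrative sketch only: field names and identifiers are assumptions,
# not the exact NeMo Evaluator schema. A metric defines reusable scoring
# logic; a benchmark pairs one or more metrics with a dataset.

# A custom metric: scoring logic you can reuse across datasets.
custom_metric = {
    "name": "answer-accuracy",          # hypothetical metric name
    "type": "llm-judge",                # score outputs with a judge model
    "params": {
        "judge_model": "my-judge-llm",  # hypothetical model reference
        "rubric": "Rate factual accuracy from 1 (wrong) to 5 (correct).",
    },
}

# A custom benchmark: the metric above applied to a specific dataset.
custom_benchmark = {
    "name": "support-bot-regression",   # hypothetical benchmark name
    "dataset": "datasets/support-qa",   # hypothetical dataset reference
    "metrics": [custom_metric["name"]],
}

print(custom_benchmark["metrics"])  # ['answer-accuracy']
```

The key design point is reuse: because the metric is defined once by name, the same scoring logic can be referenced from any number of benchmarks over different datasets.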

There are two execution modes and two evaluation patterns:

  • Live evaluation (synchronous): Submit a request and get results immediately. Best for fast iteration, metric development, and small payloads.

  • Jobs (asynchronous): Submit work, monitor status, and fetch results when complete. Best for production workloads, larger datasets, and recurring regression checks.

  • Offline evaluation: Score existing dataset rows (for example, model outputs already generated).

  • Online evaluation: Generate outputs from a model as part of evaluation, then score them.
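The asynchronous jobs mode above follows a submit-poll-fetch pattern. The sketch below shows the polling half of that loop in a testable form; the job states (`"completed"`, `"failed"`) and the `get_status` callable are illustrative assumptions, not the real NeMo Evaluator SDK.

```python
import time

# Sketch of the asynchronous "jobs" pattern: submit work, poll status,
# fetch results when complete. Job state names and the status callable
# are illustrative assumptions, not the real NeMo Evaluator client.

def wait_for_job(get_status, poll_interval=1.0, timeout=60.0):
    """Poll a job until it reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("evaluation job did not finish in time")

# Usage with a stand-in status function that completes on the third poll.
calls = iter(["pending", "running", "completed"])
final = wait_for_job(lambda: next(calls), poll_interval=0.0)
print(final)  # completed
```

Live (synchronous) evaluation skips this loop entirely: the request blocks and returns scores directly, which is why it suits small payloads and quick metric iteration, while jobs suit larger datasets and scheduled regression runs.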

For more detail, see Evaluation Metrics and Evaluation Benchmarks.


Tutorials#

After deploying NeMo Platform by following the quickstart, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different benchmarks and metrics.

  • Run a Benchmark Evaluation: Learn how to run an evaluation with a built-in benchmark.

  • Evaluate Response Quality with LLM-as-a-Judge: Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.


Where to Go Next#


Available Evaluations#

Review configurations, data formats, and result examples for each evaluation.

  • Industry Benchmarks: Standard benchmarks for code generation, safety, reasoning, and tool-calling.

  • Retriever Evaluation Metrics: Evaluate document retrieval pipelines on standard or custom datasets.

  • RAG Evaluation Metrics: Evaluate Retrieval Augmented Generation (RAG) pipelines (retrieval plus generation).

  • Agentic Evaluation Metrics: Assess agent-based and multi-step reasoning models, including topic adherence and tool use.

  • Evaluate with LLM-as-a-Judge: Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges.

  • Similarity Metrics: Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.