About Evaluating#
Evaluation is powered by NeMo Platform, a cloud-native platform for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The evaluation API provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.
NeMo Platform enables real-time evaluation of your LLM application through APIs, helping you refine and optimize LLMs for better performance and real-world applicability. You can automate the NeMo Evaluator APIs within development pipelines, enabling faster iteration without the need for live data. This approach is cost-effective and well suited to pre-deployment checks and regression testing.
Evaluation Concepts#
NeMo Platform supports two core evaluation primitives:
Metrics: Scoring logic that evaluates model outputs. Use metrics when you need flexible, reusable scoring for your own datasets and task-specific criteria.
Benchmarks: Evaluation suites that pair one or more metrics with a dataset. Use benchmarks when you want standardized comparisons or curated end-to-end evaluations, and use custom benchmarks when you want to apply your own metrics to your own datasets.
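To make the distinction concrete, the sketch below builds illustrative configuration objects for a custom metric and a custom benchmark that reuses it. All field names and values here are assumptions for illustration, not the platform's actual schema; see Evaluation Metrics and Evaluation Benchmarks for the real formats.

```python
# Illustrative only: these field names are assumed, not the platform's schema.
custom_metric = {
    "name": "answer-accuracy",
    "type": "llm-judge",                 # scoring logic, reusable across datasets
    "criteria": "Is the answer factually consistent with the reference?",
}

custom_benchmark = {
    "name": "support-bot-regression",
    "metrics": [custom_metric["name"]],  # one or more metrics...
    "dataset": "support-qa-v1",          # ...paired with a dataset
}

print(custom_benchmark["metrics"])  # the benchmark references the reusable metric
```

The key design point: a metric carries only scoring logic, so the same metric can back many benchmarks, while a benchmark fixes both the metrics and the dataset for repeatable comparisons.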
There are two execution modes and two evaluation patterns:
Live evaluation (synchronous): Submit a request and get results immediately. Best for fast iteration, metric development, and small payloads.
Jobs (asynchronous): Submit work, monitor status, and fetch results when complete. Best for production workloads, larger datasets, and recurring regression checks.
Offline evaluation: Score existing dataset rows (for example, model outputs already generated).
Online evaluation: Generate outputs from a model as part of evaluation, then score them.
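As a rough sketch of how the two execution modes differ at the API level: a live call carries a small inline payload and returns scores directly, while a job call references larger inputs and returns a job to monitor. The `metric-evaluate` and `metric-jobs` paths below come from this page; the workspace name, full job path, and payload fields are assumptions.

```python
# Sketch only: endpoint paths follow this page; payload fields are assumed.
WORKSPACE = "default"  # hypothetical workspace name

def live_request(rows):
    """Synchronous: small inline dataset rows, scores come back in the response."""
    return {
        "method": "POST",
        "path": f"/v2/workspaces/{WORKSPACE}/evaluation/metric-evaluate",
        "body": {"rows": rows},
    }

def job_request(fileset):
    """Asynchronous: reference a fileset, then poll the returned job for results."""
    return {
        "method": "POST",
        "path": f"/v2/workspaces/{WORKSPACE}/evaluation/metric-jobs",
        "body": {"fileset": fileset},
    }

print(live_request([{"input": "2+2", "output": "4"}])["path"])
```

Offline versus online evaluation is orthogonal to this choice: either request shape can score pre-generated rows or ask the platform to generate outputs first.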
For deeper details, see Evaluation Metrics and Evaluation Benchmarks.
Tutorials#
After deploying NeMo Platform by following the Quickstart, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different benchmarks and metrics.
Learn how to run an evaluation with a built-in benchmark.
Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
Recommended Evaluation Journey#
Most teams get the best results by starting metric-first, then moving to benchmarks:
Develop and validate your metrics first
Start with Metrics to define how quality should be scored for your use case.
Use live evaluation (`POST /v2/workspaces/{workspace}/evaluation/metric-evaluate`) with small `DatasetRows` payloads to iterate quickly.
Scale metric evaluation to jobs
When metrics are validated, run async metric jobs (`/evaluation/metric-jobs`) on larger datasets. Use filesets for production-scale inputs. See Manage Files.
Package validated metrics into custom benchmarks
Create Custom Benchmarks by combining one or more validated metrics with a dataset.
Use benchmark jobs (`/evaluation/benchmark-jobs`) for repeatable regression testing and model comparisons.
Monitor and analyze results
Track job status and progress with job management APIs.
Retrieve results and artifacts for analysis, reporting, and regression tracking.
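The journey above can be sketched end to end. This is a minimal illustration against a hypothetical in-memory client standing in for the evaluation API: the endpoint paths match this page, while the base workspace, payload fields, job IDs, and status values are assumptions, not the platform's actual response schema.

```python
# Hypothetical stand-in for the evaluation API.
# Endpoint paths follow this page; payloads and status values are assumed.
class FakeEvalClient:
    def __init__(self):
        self._jobs = {}

    def post(self, path, body):
        if path.endswith("/metric-evaluate"):       # 1. live metric evaluation
            return {"scores": [0.9 for _ in body["rows"]]}
        job_id = f"job-{len(self._jobs) + 1}"       # 2-3. async metric/benchmark jobs
        self._jobs[job_id] = {"status": "completed", "results": []}
        return {"id": job_id}

    def get(self, path):
        return self._jobs[path.rsplit("/", 1)[-1]]  # 4. monitor and fetch results

client = FakeEvalClient()

# Step 1: validate the metric with a small live payload.
live = client.post("/v2/workspaces/default/evaluation/metric-evaluate",
                   {"rows": [{"input": "2+2", "output": "4"}]})

# Step 3: package validated metrics into a benchmark job.
job = client.post("/v2/workspaces/default/evaluation/benchmark-jobs",
                  {"benchmark": "support-bot-regression"})

# Step 4: track job status and retrieve results.
status = client.get(f"/v2/workspaces/default/evaluation/benchmark-jobs/{job['id']}")
print(live["scores"], status["status"])
```

In a real pipeline the synchronous call returns scores immediately, while the job endpoints require polling (or webhooks, if available) before results can be fetched.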
Where to Go Next#
For benchmark workflows, see Benchmark Job Management and Benchmark Results.
For metric workflows, see Metric Jobs and Metric Results.
For full endpoint details, see API Reference.
Available Evaluations#
Review configurations, data formats, and result examples for each evaluation.
Standard benchmarks for code generation, safety, reasoning, and tool-calling.
Evaluate document retrieval pipelines on standard or custom datasets.
Evaluate Retrieval-Augmented Generation pipelines (retrieval plus generation).
Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges.
Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.
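As a rough illustration of the kind of scoring such metrics perform, the sketch below implements exact match and text similarity in plain Python. On the platform these scoring expressions are defined with Jinja2 templating; this stand-in is not the platform's implementation, and the normalization choices are assumptions.

```python
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0.
    Normalization (strip + lowercase) is an assumed choice for illustration."""
    return float(prediction.strip().lower() == reference.strip().lower())

def similarity(prediction: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, prediction, reference).ratio()

print(exact_match("Paris", "paris"))   # 1.0
print(similarity("Paris, France", "Paris"))
```

The same pattern generalizes: a template or function receives the prediction and reference fields from each dataset row and emits a numeric score that the platform aggregates.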