About Evaluating#
Evaluation is powered by NeMo Platform, a cloud-native platform for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The evaluation API provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.
NeMo Platform enables real-time evaluation of your LLM application through APIs, helping you refine and optimize LLMs for better performance and real-world applicability. You can automate the NeMo Evaluator APIs within development pipelines, enabling faster iteration without the need for live data. This approach is cost-effective and well suited to pre-deployment checks and regression testing.
Evaluation Concepts#
NeMo Platform supports two core evaluation primitives:
Metrics: Scoring logic that evaluates model outputs. Use metrics when you need flexible, reusable scoring for your own datasets and task-specific criteria.
Benchmarks: Evaluation suites that pair one or more metrics with a dataset. Use benchmarks when you want standardized comparisons or curated end-to-end evaluations, and use custom benchmarks when you want to apply your own metrics to your own datasets.
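To make the distinction concrete, the sketch below builds illustrative configuration objects for a custom metric and a custom benchmark that reuses it. All field names and values here are assumptions for illustration, not the platform's actual schema; see Evaluation Metrics and Evaluation Benchmarks for the real formats.

```python
# Illustrative only: these field names are assumed, not the platform's schema.
custom_metric = {
    "name": "answer-accuracy",
    "type": "llm-judge",                 # scoring logic, reusable across datasets
    "criteria": "Is the answer factually consistent with the reference?",
}

custom_benchmark = {
    "name": "support-bot-regression",
    "metrics": [custom_metric["name"]],  # one or more metrics...
    "dataset": "support-qa-v1",          # ...paired with a dataset
}

print(custom_benchmark["metrics"])  # the benchmark references the reusable metric
```

The key design point: a metric carries only scoring logic, so the same metric can back many benchmarks, while a benchmark fixes both the metrics and the dataset for repeatable comparisons.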
There are two execution modes and two evaluation patterns:
Live evaluation (synchronous): Submit a request and get results immediately. Best for fast iteration, metric development, and small payloads.
Jobs (asynchronous): Submit work, monitor status, and fetch results when complete. Best for production workloads, larger datasets, and recurring regression checks.
Offline evaluation: Score existing dataset rows (for example, model outputs already generated).
Online evaluation: Generate outputs from a model as part of evaluation, then score them.
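As a rough sketch of how the two execution modes differ at the API level: a live call carries a small inline payload and returns scores directly, while a job call references larger inputs and returns a job to monitor. The `metric-evaluate` and `metric-jobs` paths below come from this page; the workspace name, full job path, and payload fields are assumptions.

```python
# Sketch only: endpoint paths follow this page; payload fields are assumed.
WORKSPACE = "default"  # hypothetical workspace name

def live_request(rows):
    """Synchronous: small inline dataset rows, scores come back in the response."""
    return {
        "method": "POST",
        "path": f"/v2/workspaces/{WORKSPACE}/evaluation/metric-evaluate",
        "body": {"rows": rows},
    }

def job_request(fileset):
    """Asynchronous: reference a fileset, then poll the returned job for results."""
    return {
        "method": "POST",
        "path": f"/v2/workspaces/{WORKSPACE}/evaluation/metric-jobs",
        "body": {"fileset": fileset},
    }

print(live_request([{"input": "2+2", "output": "4"}])["path"])
```

Offline versus online evaluation is orthogonal to this choice: either request shape can score pre-generated rows or ask the platform to generate outputs first.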
For deeper details, see Evaluation Metrics and Evaluation Benchmarks.
Tutorials#
After deploying NeMo Platform by following the Quickstart, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different benchmarks and metrics.
Learn how to run an evaluation with a built-in benchmark.
Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
Recommended Evaluation Journey#
Most teams get the best results by starting metric-first, then moving to benchmarks:
Develop and validate your metrics first
Start with Metrics to define how quality should be scored for your use case.
Use live evaluation (`POST /v2/workspaces/{workspace}/evaluation/metric-evaluate`) with small `DatasetRows` payloads to iterate quickly.
Scale metric evaluation to jobs
When metrics are validated, run async metric jobs (`/evaluation/metric-jobs`) on larger datasets. Use filesets for production-scale inputs. See Manage Files.
Package validated metrics into custom benchmarks
Create Custom Benchmarks by combining one or more validated metrics with a dataset.
Use benchmark jobs (`/evaluation/benchmark-jobs`) for repeatable regression testing and model comparisons.
Monitor and analyze results
Track job status and progress with job management APIs.
Retrieve results and artifacts for analysis, reporting, and regression tracking.
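The journey above can be sketched end to end. This is a minimal illustration against a hypothetical in-memory client standing in for the evaluation API: the endpoint paths match this page, while the base workspace, payload fields, job IDs, and status values are assumptions, not the platform's actual response schema.

```python
# Hypothetical stand-in for the evaluation API.
# Endpoint paths follow this page; payloads and status values are assumed.
class FakeEvalClient:
    def __init__(self):
        self._jobs = {}

    def post(self, path, body):
        if path.endswith("/metric-evaluate"):       # 1. live metric evaluation
            return {"scores": [0.9 for _ in body["rows"]]}
        job_id = f"job-{len(self._jobs) + 1}"       # 2-3. async metric/benchmark jobs
        self._jobs[job_id] = {"status": "completed", "results": []}
        return {"id": job_id}

    def get(self, path):
        return self._jobs[path.rsplit("/", 1)[-1]]  # 4. monitor and fetch results

client = FakeEvalClient()

# Step 1: validate the metric with a small live payload.
live = client.post("/v2/workspaces/default/evaluation/metric-evaluate",
                   {"rows": [{"input": "2+2", "output": "4"}]})

# Step 3: package validated metrics into a benchmark job.
job = client.post("/v2/workspaces/default/evaluation/benchmark-jobs",
                  {"benchmark": "support-bot-regression"})

# Step 4: track job status and retrieve results.
status = client.get(f"/v2/workspaces/default/evaluation/benchmark-jobs/{job['id']}")
print(live["scores"], status["status"])
```

In a real pipeline the synchronous call returns scores immediately, while the job endpoints require polling (or webhooks, if available) before results can be fetched.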
Where to Go Next#
For benchmark workflows, see Benchmark Job Management and Benchmark Results.
For metric workflows, see Metric Jobs and Metric Results.
For full endpoint details, see API Reference.
Available Evaluations#
Review configurations, data formats, and result examples for each evaluation.
Standard benchmarks for code generation, safety, reasoning, and tool-calling.
Evaluate document retrieval pipelines on standard or custom datasets.
Evaluate Retrieval-Augmented Generation pipelines (retrieval plus generation).
Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges.
Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.
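As a rough illustration of the kind of scoring such metrics perform, the sketch below implements exact match and text similarity in plain Python. On the platform these scoring expressions are defined with Jinja2 templating; this stand-in is not the platform's implementation, and the normalization choices are assumptions.

```python
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0.
    Normalization (strip + lowercase) is an assumed choice for illustration."""
    return float(prediction.strip().lower() == reference.strip().lower())

def similarity(prediction: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, prediction, reference).ratio()

print(exact_match("Paris", "paris"))   # 1.0
print(similarity("Paris, France", "Paris"))
```

The same pattern generalizes: a template or function receives the prediction and reference fields from each dataset row and emits a numeric score that the platform aggregates.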