Evaluations#

Evaluation is the process of measuring and assessing model performance on specific tasks, benchmarks, or datasets.

The Evaluations page in NeMo Studio provides a centralized interface for managing your model evaluation jobs. You can create, monitor, and analyze evaluation jobs that measure model performance.


Backend Microservices#

In the backend, the UI communicates with NeMo Evaluator to manage evaluation jobs and entities such as configurations, results, and metadata.
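The UI's calls can be approximated by talking to the NeMo Evaluator REST API directly. The following is a minimal sketch, assuming a deployment reachable at a placeholder URL and a `/v1/evaluation/jobs` listing endpoint; the response field names (`data`, `id`, `status`, `created_at`) are illustrative rather than a definitive contract.

```python
import requests

# Hypothetical base URL for a NeMo Evaluator deployment; replace with the
# endpoint of your NeMo platform installation.
EVALUATOR_URL = "http://nemo.example.com"

# List evaluation jobs -- the same data the Evaluations page renders in its
# job table (ID, status, configuration, model, tags, created timestamp).
response = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs")
response.raise_for_status()

# Field names below are assumptions for illustration.
for job in response.json().get("data", []):
    print(job.get("id"), job.get("status"), job.get("created_at"))
```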


Evaluation Page UI Overview#

The following are the main components and features of the Evaluations page.

Evaluation Job Listing#

The Evaluations page displays your evaluation jobs in a table format with the following columns:

  • Evaluation ID: The unique identifier for your evaluation job.

  • Status: The current state of the job.

  • Configuration Name: The name of the evaluation configuration used for the job.

  • Model: The model being evaluated.

  • Tags: The tags associated with the evaluation job.

  • Created: Timestamp showing when the job was created.

You can sort evaluation jobs by clicking on column headers to organize your view.

Evaluation Configuration Listing#

The Evaluations page includes a Configurations tab that displays your evaluation configurations in a table format with the following columns:

  • Configuration Name: The name of the evaluation configuration.

  • Target Type: The type of target being evaluated.

  • Input File: The path of the input file used by the evaluation configuration.

  • Created: Timestamp showing when the configuration was created.

  • Actions: Menu options for managing the configuration.

You can create new evaluation configurations by clicking the Create Configuration button.
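If you prefer to script configuration creation rather than use the Create Configuration button, a hedged sketch against the NeMo Evaluator API might look like the following. The `/v1/evaluation/configs` path and the body fields (`type`, `name`, `namespace`, `params`) are assumptions for illustration; consult the NeMo Evaluator documentation for the exact schema of your evaluation flow.

```python
import requests

EVALUATOR_URL = "http://nemo.example.com"  # hypothetical deployment URL

# Illustrative configuration body; the exact schema depends on the
# evaluation flow you choose, so treat these fields as placeholders
# rather than a definitive contract.
config = {
    "type": "custom",
    "name": "my-eval-config",
    "namespace": "default",
    "params": {"max_tokens": 256, "temperature": 0.0},
}

response = requests.post(f"{EVALUATOR_URL}/v1/evaluation/configs", json=config)
response.raise_for_status()
print("Created configuration:", response.json().get("id"))
```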

Search and Filter#

You can filter and sort evaluation jobs.

  • Filter Options: Apply filters to narrow down jobs based on status, model, or other criteria.

  • Sort: Click column headers to sort jobs by name, score, status, or date.

Evaluation Job Management#

You can manage evaluation jobs on the Evaluations page.

  • Create New Job: Start a new evaluation job by clicking the Launch Evaluation button.

  • View Details: Access detailed information about each job, including configuration, metrics, and results.

  • Job Actions: Use the context menu (three-dot menu icon) to perform actions such as rerunning, comparing, or deleting jobs.

  • Monitor Progress: Track real-time progress of evaluation jobs.


Evaluation Workflow#

The following are the common workflows for evaluating models.

Create an Evaluation Job#

To create a new evaluation job (an API-level sketch of the same workflow follows these steps):

  1. Navigate to the Evaluations page from the left sidebar.

  2. Click the Launch Evaluation button.

  3. Set an evaluation target. For more information about the evaluation target, refer to Create and Manage Evaluation Targets.

  4. Select a model to evaluate. You can select a base model or a customized model.

  5. Choose an evaluation dataset.

  6. Configure evaluation parameters and metrics. For more information about the evaluation metrics, refer to Evaluation Metrics.

  7. Review your configuration and submit the job.
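The same workflow can be sketched against the NeMo Evaluator API: register a target for the model to evaluate, then submit a job that pairs the target with a configuration. The endpoint paths and request bodies below are illustrative assumptions, not the documented schema; `nim.example.com`, `my-base-model`, and `my-eval-config` are placeholders.

```python
import requests

EVALUATOR_URL = "http://nemo.example.com"  # hypothetical deployment URL

# 1. Register an evaluation target pointing at the model to evaluate.
#    The body fields are illustrative; see "Create and Manage Evaluation
#    Targets" for the supported target types.
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": "http://nim.example.com/v1",   # placeholder model endpoint
                "model_id": "my-base-model",          # placeholder model name
            }
        },
    },
)
target.raise_for_status()

# 2. Submit the evaluation job, pairing the target with an existing
#    configuration (dataset, parameters, and metrics).
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": target.json()["id"], "config": "my-eval-config"},
)
job.raise_for_status()
print("Submitted job:", job.json()["id"])
```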

Monitor Job Progress#

While a job is running, you can monitor its progress.

  • View real-time status updates of evaluation jobs on the Evaluations page.

  • Click a job to view its progress and details (a status-polling sketch follows this list).
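Outside the UI, progress can also be checked by polling the job resource. This is a minimal sketch assuming a `GET /v1/evaluation/jobs/{job_id}` endpoint; the job ID and status strings are placeholders and may differ in your deployment.

```python
import time

import requests

EVALUATOR_URL = "http://nemo.example.com"  # hypothetical deployment URL
job_id = "eval-job-123"                    # placeholder job ID

# Poll the job until it reaches a terminal state; the status values used
# here are illustrative assumptions.
while True:
    job = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}").json()
    status = job.get("status")
    print("status:", status)
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)  # wait before checking again
```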

Analyze Results#

After an evaluation job completes, you can analyze the results (a results-retrieval sketch follows this list).

  • View results metrics and scores on the Evaluations page.

  • Compare results across multiple evaluation jobs.

  • Analyze performance on specific tasks or data subsets.
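Results can also be pulled programmatically for offline comparison across jobs. The sketch below assumes a `/results` sub-resource on the job and an illustrative response shape; both are assumptions rather than a documented contract, and the UI surfaces the same scores on the job details view.

```python
import requests

EVALUATOR_URL = "http://nemo.example.com"  # hypothetical deployment URL
job_id = "eval-job-123"                    # placeholder job ID

# Retrieve the metrics for a completed job; the path and response shape
# are assumptions for illustration.
results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results")
results.raise_for_status()

# Print each reported metric so jobs can be compared side by side.
for task, metrics in results.json().get("tasks", {}).items():
    print(task, metrics)
```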


Evaluation Metrics#

Evaluation metrics assess the accuracy and performance of AI models.

The following table summarizes the evaluation metrics available in NeMo Studio, backed by NeMo Evaluator:

| Metric | Description | Value Range | Interpretation |
|--------|-------------|-------------|----------------|
| String-Check | Compares model output text against expected labels for exact string matching (case-sensitive or insensitive, based on configuration). | Binary (0 or 1), or [0.0, 1.0] for an aggregated average. | 1.0 indicates a perfect, exact match across all samples; 0 indicates no match. |
| BLEU | Measures the quality of text generation by comparing n-gram overlap between model output and reference text. | [0.0, 1.0] | 1.0 indicates perfect word and phrase overlap with the reference (closer to human translation quality). |
| ROUGE | Evaluates text generation quality by comparing the overlap (unigram, bigram, or Longest Common Subsequence) of ground truth references against model predictions. | [0.0, 1.0] | 1.0 signifies maximal similarity and recall of content words from the reference text. |
| EM (Exact Match) | Measures the percentage of predictions that perfectly match the ground truth. | [0.0, 1.0]; often expressed as a percentage [0%, 100%]. | 1.0 indicates that every prediction in the dataset is precisely correct. |
| F1 | The harmonic mean of precision and recall, providing a balanced measure for token-level or instance-level model predictions. | [0.0, 1.0] | 1.0 represents perfect precision and recall (optimal performance). Typically used for structured output evaluation. |
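NeMo Evaluator computes these metrics server-side, but the simpler ones are easy to define. The following standalone Python sketch illustrates string-check, exact match, and a simplified set-based token F1 (the standard SQuAD-style F1 counts repeated tokens); it is for intuition only and is not the NeMo Evaluator implementation.

```python
def string_check(prediction: str, reference: str, case_sensitive: bool = False) -> int:
    """Binary string-check: 1 if the texts match exactly, else 0."""
    if not case_sensitive:
        prediction, reference = prediction.lower(), reference.lower()
    return int(prediction.strip() == reference.strip())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """EM: fraction of predictions that match their reference exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Example usage with toy data.
print(exact_match(["Paris", "Berlin"], ["Paris", "Rome"]))                            # 0.5
print(token_f1("the capital of France is Paris", "Paris is the capital of France"))   # 1.0
```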