RAG Evaluation Metrics#
RAG (Retrieval-Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.
Overview#
RAGAS metrics support both offline and online evaluation modes:

- **Offline evaluation**: Uses pre-generated responses from your dataset.
- **Online evaluation**: Responses are generated automatically using a model and prompt template before evaluation:
  - The job's `model` and `prompt_template` are used to generate responses.
  - The generated response (in `sample["output_text"]`) is automatically used as `response` in RAGAS evaluation.
  - RAG context variables can be included in the job's `prompt_template`:
    - `{{user_input}}`: User question/input from the dataset
    - `{{retrieved_contexts}}`: Retrieved context passages from the dataset

RAGAS metrics require:

- **Judge LLM**: An LLM to evaluate answer quality (required for most metrics).
- **Judge Embeddings** (optional): Required for some metrics, such as `response_relevancy`.
- **Data**: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online).
Prerequisites#
Before running RAG evaluations:
Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to judge LLM endpoints (and embeddings model for some metrics)
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:
import os
from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import (
ContextEntityRecallMetricParam,
ContextPrecisionMetricParam,
ContextRecallMetricParam,
ContextRelevanceMetricParam,
DatasetRowsParam,
EvaluationJobParamsParam,
FaithfulnessMetricParam,
MetricOfflineJobParam,
MetricOnlineJobParam,
ModelParam,
NoiseSensitivityMetricParam,
ResponseGroundednessMetricParam,
ResponseRelevancyMetricParam,
)
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
Creating a Secret for API Keys#
If using external endpoints that require authentication (like NVIDIA API), create a secret first:
client.secrets.create(
name="nvidia-api-key",
data="nvapi-YOUR_API_KEY_HERE",
description="NVIDIA API key for RAG metrics"
)
Tip
RAGAS metrics accept both inline model definitions and model references (e.g., `"my-workspace/my-model"`) for the `judge_model` and `embeddings_model` fields. See Model Configuration for details.
Supported RAGAS Metrics#
| Use Case | Metric Type | Description | Required Columns* |
|---|---|---|---|
| Measure retrieval quality | `context_recall` | Coverage of reference information in retrieved context | user_input, retrieved_contexts, reference |
| | `context_precision` | Whether all retrieved chunks are relevant to the question | user_input, retrieved_contexts, reference |
| | `context_relevance` | Relevance of retrieved context to the question | user_input, retrieved_contexts |
| | `context_entity_recall` | Recall of important entities from reference in context | retrieved_contexts, reference |
| Detect hallucinations | `faithfulness` | Measures factual consistency of response with retrieved context | user_input, response, retrieved_contexts |
| | `response_groundedness` | Evaluates whether response is grounded in context without hallucinations | response, retrieved_contexts |
| | `noise_sensitivity` | Robustness to noisy or irrelevant context | user_input, response, reference, retrieved_contexts |
| Check if answers address the question | `response_relevancy`** | Response relevancy to question using embeddings similarity | user_input, response, retrieved_contexts |

\* Required Columns: Dataset columns that must be present for the metric to be evaluated.

\*\* Requires `embeddings_model` in addition to `judge_model`.
Context Recall#
Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.
Score name: `context_recall`

Score range: 0 to 1, with higher scores indicating better recall.
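Conceptually, the judge LLM breaks the reference into claims and checks whether each one can be attributed to the retrieved contexts; the score is the fraction of attributable claims. A minimal sketch of that final ratio (the claim extraction and attribution steps are performed by the judge LLM and elided here):

```python
def context_recall_score(claims_attributable: list[bool]) -> float:
    """Fraction of reference claims the judge found supported by the retrieved contexts."""
    if not claims_attributable:
        return 0.0
    return sum(claims_attributable) / len(claims_attributable)

# Example: 2 of 3 reference claims were attributable to the retrieved contexts
score = context_recall_score([True, True, False])
```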
Data Format#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France."
}
result = client.evaluation.metrics.evaluate(
metric=ContextRecallMetricParam(
type="context_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France.",
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-recall",
type="context_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-recall",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-recall",
type="context_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-recall",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "context_recall",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Context Precision#
Measures the proportion of relevant chunks in the retrieved contexts (precision@k).
Score name: `context_precision`

Score range: 0 to 1, with higher scores indicating better precision.
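Context precision is a rank-aware average: precision is computed at each rank that holds a relevant chunk, then averaged over the relevant ranks. A small illustration, assuming the standard mean-precision@k formulation (the per-chunk relevance verdicts come from the judge LLM):

```python
def context_precision_score(relevance: list[int]) -> float:
    """Mean precision@k over the ranks of relevant chunks.

    relevance[k] is 1 if the chunk at rank k (0-based) was judged relevant, else 0.
    """
    if not any(relevance):
        return 0.0
    total, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / sum(relevance)
```

With `[1, 0, 1]` this averages precision@1 (1/1) and precision@3 (2/3), so ranking relevant chunks earlier raises the score.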
Data Format#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris"
}
result = client.evaluation.metrics.evaluate(
metric=ContextPrecisionMetricParam(
type="context_precision",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris",
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-precision",
type="context_precision",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-precision",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-precision",
type="context_precision",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-precision",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "context_precision",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Context Relevance#
Uses the judge LLM to assess whether retrieved contexts are relevant to the user input.

Score name: `nv_context_relevance`

Score range: 0/1/2 scale, normalized to [0, 1].
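The judge rates relevance on a three-point scale (0 = irrelevant, 1 = partially relevant, 2 = fully relevant). Assuming a linear raw/2 mapping (an assumption; the exact prompt and aggregation are internal to RAGAS), normalization to [0, 1] looks like:

```python
def normalize_rating(raw: int) -> float:
    """Map a 0/1/2 judge rating onto [0, 1] via raw / 2 (assumed linear mapping)."""
    if raw not in (0, 1, 2):
        raise ValueError("rating must be 0, 1, or 2")
    return raw / 2
```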
Data Format#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."]
}
result = client.evaluation.metrics.evaluate(
metric=ContextRelevanceMetricParam(
type="context_relevance",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-relevance",
type="context_relevance",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-relevance",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-relevance",
type="context_relevance",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-relevance",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "nv_context_relevance",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Context Entity Recall#
Measures the recall of entities in the retrieved contexts compared to entities in the reference.
Score name: `context_entity_recall`

Score range: 0 to 1, with higher scores indicating better entity coverage.
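In set terms, the judge extracts entities from both texts and the score is the fraction of reference entities that also appear in the retrieved contexts. A sketch of that ratio (entity extraction itself is performed by the judge LLM):

```python
def entity_recall(context_entities: set[str], reference_entities: set[str]) -> float:
    """|context entities ∩ reference entities| / |reference entities|."""
    if not reference_entities:
        return 0.0
    return len(context_entities & reference_entities) / len(reference_entities)
```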
Data Format#
{
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France."
}
result = client.evaluation.metrics.evaluate(
metric=ContextEntityRecallMetricParam(
type="context_entity_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France.",
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-entity-recall",
type="context_entity_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-entity-recall",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-entity-recall",
type="context_entity_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-entity-recall",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "context_entity_recall",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Faithfulness#
Measures how factually consistent a response is with the retrieved context.
Score name: `faithfulness`

Score range: 0 to 1, with higher scores indicating better consistency.
Data Format#
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."]
}
result = client.evaluation.metrics.evaluate(
metric=FaithfulnessMetricParam(
type="faithfulness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-faithfulness",
type="faithfulness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-faithfulness",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-faithfulness",
type="faithfulness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-faithfulness",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "faithfulness",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Response Groundedness#
Evaluates whether the response is grounded in retrieved contexts.
Score name: `nv_response_groundedness`

Score range: 0/1/2 scale, normalized to [0, 1].
Data Format#
{
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."]
}
result = client.evaluation.metrics.evaluate(
metric=ResponseGroundednessMetricParam(
type="response_groundedness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-groundedness",
type="response_groundedness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-response-groundedness",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-groundedness",
type="response_groundedness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-response-groundedness",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "nv_response_groundedness",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Noise Sensitivity#
Measures how sensitive the response is to irrelevant or noisy content in the retrieved contexts.
Score name: `noise_sensitivity`

Score range: 0 to 1, with lower scores indicating better robustness to noise.
Data Format#
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"reference": "Paris",
"retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."]
}
result = client.evaluation.metrics.evaluate(
metric=NoiseSensitivityMetricParam(
type="noise_sensitivity",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"reference": "Paris",
"retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-noise-sensitivity",
type="noise_sensitivity",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-noise-sensitivity",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-noise-sensitivity",
type="noise_sensitivity",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-noise-sensitivity",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "noise_sensitivity",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Response Relevancy#
Measures how relevant the response is to the user's question using embedding-based cosine similarity. Requires both a judge LLM and an embeddings model.

Score name: `answer_relevancy`

Score range: 0 to 1, with higher scores indicating better relevance.
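Under the hood, the judge LLM generates candidate questions from the response, the embeddings model embeds them alongside the original question, and the score averages the cosine similarities. The sketch below mirrors that final aggregation step (the generation and embedding calls are represented here only as precomputed vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def response_relevancy(question_vec: list[float],
                       generated_question_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the user question and questions generated from the response."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0
```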
Data Format#
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital city of France."]
}
Configuration Options#
| Parameter | Type | Default | Description |
|---|---|---|---|
| `strictness` | int | | Number of parallel questions generated (NIM only supports 1) |
result = client.evaluation.metrics.evaluate(
metric=ResponseRelevancyMetricParam(
type="response_relevancy",
strictness=1,
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
embeddings_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/embeddings",
name="nvidia/nv-embedqa-e5-v5",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-relevancy",
type="response_relevancy",
strictness=1,
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
embeddings_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/embeddings",
name="nvidia/nv-embedqa-e5-v5",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-response-relevancy",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-relevancy",
type="response_relevancy",
strictness=1,
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
embeddings_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/embeddings",
name="nvidia/nv-embedqa-e5-v5",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-response-relevancy",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "answer_relevancy",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Dataset Format#
RAGAS metrics use specific column names:
| Field | Type | Required | Description |
|---|---|---|---|
| `user_input` | string | Yes | User question or input |
| `retrieved_contexts` | list[string] | Some metrics | List of context passages |
| `response` | string | Some metrics | Generated answer (optional for online mode, required for offline) |
| `reference` | string | Some metrics | Reference answer or ground truth |
Note
Online vs Offline Mode:

- Online mode (`MetricOnlineJob`): the `response` column is optional; it is generated automatically using the job's `model` and `prompt_template`.
- Offline mode (`MetricOfflineJob`): the `response` column is required and must contain pre-generated responses.
Example Dataset#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"response": "The capital of France is Paris.",
"reference": "Paris"
}
Note
Different metrics require different columns. Check the metric documentation for specific requirements.
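Because each metric has its own required columns, a quick client-side check before submitting a job can catch format errors early. A small illustrative helper (not part of the SDK; the column sets are taken from the Supported RAGAS Metrics table and apply to offline evaluation, where `response` is not generated for you):

```python
# Required dataset columns per RAGAS metric type (offline evaluation)
REQUIRED_COLUMNS = {
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_relevance": {"user_input", "retrieved_contexts"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "response_groundedness": {"response", "retrieved_contexts"},
    "noise_sensitivity": {"user_input", "response", "reference", "retrieved_contexts"},
    "response_relevancy": {"user_input", "response", "retrieved_contexts"},
}

def missing_columns(metric_type: str, row: dict) -> set[str]:
    """Return the required columns absent from a dataset row for the given metric."""
    return REQUIRED_COLUMNS[metric_type] - row.keys()
```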
Response Format#
All evaluation responses follow this structure:
{
"metric": {
"type": "faithfulness",
"judge_model": {"url": "...", "name": "..."}
},
"aggregate_scores": [
{
"name": "faithfulness",
"count": 1,
"mean": 0.95,
"min": 0.95,
"max": 0.95,
"sum": 0.95
}
],
"row_scores": [
{
"index": 0,
"row": {"user_input": "...", "response": "..."},
"scores": {"faithfulness": 0.95},
"error": null
}
]
}
Working with Results#
# Access aggregate scores
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}, count={score.count}")
# Access per-row scores
for row in result.row_scores:
if row.scores:
print(f"Row {row.index}: {row.scores}")
elif row.error:
print(f"Row {row.index} failed: {row.error}")
Managing Secrets for Authenticated Endpoints#
Store API keys as secrets for secure authentication:
# Create secrets for all endpoints that may require authentication
client.secrets.create(name="judge-api-key", data="<your-judge-key>")
client.secrets.create(name="embedding-api-key", data="<your-embedding-key>")
Reference secrets by name in your metric configuration:
"judge_model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-70b-instruct",
"api_key_secret": "judge-api-key" # Name of secret, not the actual API key
}
Job Management#
After successfully creating a job, navigate to Metrics Job Management to oversee its execution and monitor progress.
Troubleshooting#
Common Errors#
| Error | Cause | Solution |
|---|---|---|
| Missing judge model | Missing judge LLM config for metric | Add `judge_model` to the metric configuration |
| Missing embeddings model | Using `response_relevancy` without an embeddings model | Add `embeddings_model` to the metric configuration |
| Job stuck in "pending" | Model endpoint not accessible | Verify endpoint URLs and API key secrets |
| Authentication failed | Invalid or missing API key | Check that secret names match exactly |
| NaN scores (`nan_count > 0`) | Judge/model call failures (for example auth, endpoint, quota, or timeout). Some RAGAS metrics are known to return NaN instead of raising an error | Inspect row-level `error` fields and validate judge model authentication |
| Low faithfulness scores | Context doesn't support the response | Improve retrieval or response generation |
Warning
If you see `nan_count > 0` with `mean = null`, first validate judge model authentication. For some RAGAS metrics, auth failures can be converted to NaN scores instead of surfacing as a hard error.
Tips for Better Results#
- Use larger judge models (70B+) for more consistent scoring.
- Start with inline datasets to test your configuration before large evaluations.
- Set appropriate timeouts; judge LLM calls can take time with large contexts.
- Use parallelism wisely: increase `parallelism` for faster evaluation, but respect rate limits.
- Column names matter: RAGAS metrics use `user_input`, `retrieved_contexts`, `response`, and `reference`.
Important Notes#
- Dataset Limit: Live evaluation supports up to 10 rows per request. For larger evaluations, use job-based evaluation.
- Secret Management: API keys should be stored as secrets and referenced by name in `api_key_secret`. Never pass API keys directly in the request.
- Column Names: RAGAS metrics use specific column names:
  - `user_input` (not `question`)
  - `response` (not `answer`)
  - `retrieved_contexts` (not `contexts`)
  - `reference` (not `ground_truth`)
- Embeddings Model: Only `response_relevancy` requires an embeddings model. All other metrics use only the judge LLM.
Limitations#
- Judge Model Quality: Evaluation quality depends on the judge model's ability to follow instructions. Larger models (70B+) typically produce more consistent results.
- Dataset Format: RAGAS metrics use specific column names (`user_input`, `retrieved_contexts`, `response`, `reference`). Ensure your data matches this structure.
See also
Retriever Metrics - Evaluate retrieval quality
LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows