RAG Evaluation Metrics#
RAG (Retrieval-Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.
Overview#
RAGAS metrics support both offline and online evaluation modes:

- **Offline evaluation**: Uses pre-generated responses from your dataset.
- **Online evaluation**: Responses are generated automatically using a model and prompt template before evaluation:
  - The job's `model` and `prompt_template` are used to generate responses.
  - The generated response (in `sample["output_text"]`) is automatically used as `response` in RAGAS evaluation.
  - RAG context variables can be included in the job's `prompt_template`:
    - `{{user_input}}`: User question/input from the dataset
    - `{{retrieved_contexts}}`: Retrieved context passages from the dataset

RAGAS metrics require:

- **Judge LLM**: An LLM to evaluate answer quality (required for most metrics).
- **Judge Embeddings** (optional): Required for some metrics, such as `response_relevancy`.
- **Data**: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online).
Prerequisites#
Before running RAG evaluations:
Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to judge LLM endpoints (and embeddings model for some metrics)
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:
import os
from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import (
ContextEntityRecallMetricParam,
ContextPrecisionMetricParam,
ContextRecallMetricParam,
ContextRelevanceMetricParam,
DatasetRowsParam,
EvaluationJobParamsParam,
FaithfulnessMetricParam,
MetricOfflineJobParam,
MetricOnlineJobParam,
ModelParam,
NoiseSensitivityMetricParam,
ResponseGroundednessMetricParam,
ResponseRelevancyMetricParam,
)
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
Creating a Secret for API Keys#
If using external endpoints that require authentication (like NVIDIA API), create a secret first:
client.secrets.create(
name="nvidia-api-key",
data="nvapi-YOUR_API_KEY_HERE",
description="NVIDIA API key for RAG metrics"
)
Tip
RAGAS metrics accept both inline model definitions and model references (e.g., `"my-workspace/my-model"`) for the `judge_model` and `embeddings_model` fields. See Model Configuration for details.
Supported RAGAS Metrics#
| Use Case | Metric Type | Description | Required Columns* |
|---|---|---|---|
| Measure retrieval quality | `context_recall` | Coverage of reference information in retrieved context | user_input, retrieved_contexts, reference |
| | `context_precision` | Whether all retrieved chunks are relevant to the question | user_input, retrieved_contexts, reference |
| | `context_relevance` | Relevance of retrieved context to the question | user_input, retrieved_contexts |
| | `context_entity_recall` | Recall of important entities from reference in context | retrieved_contexts, reference |
| Detect hallucinations | `faithfulness` | Measures factual consistency of response with retrieved context | user_input, response, retrieved_contexts |
| | `response_groundedness` | Evaluates whether response is grounded in context without hallucinations | response, retrieved_contexts |
| | `noise_sensitivity` | Robustness to noisy or irrelevant context | user_input, response, reference, retrieved_contexts |
| Check if answers address the question | `response_relevancy`** | Response relevancy to question using embeddings similarity | user_input, response, retrieved_contexts |

\* Required Columns: Dataset columns that must be present for the metric to be evaluated.

\*\* Requires `embeddings_model` in addition to `judge_model`.
Context Recall#
Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.
Score name: `context_recall`

Score range: 0 to 1, with higher scores indicating better recall.
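Conceptually, the judge LLM breaks the reference into claims and checks whether each one can be attributed to the retrieved contexts; the score is the fraction of attributable claims. A minimal sketch of that final ratio (the claim extraction and attribution steps are performed by the judge LLM and elided here):

```python
def context_recall_score(claims_attributable: list[bool]) -> float:
    """Fraction of reference claims the judge found supported by the retrieved contexts."""
    if not claims_attributable:
        return 0.0
    return sum(claims_attributable) / len(claims_attributable)

# Example: 2 of 3 reference claims were attributable to the retrieved contexts
score = context_recall_score([True, True, False])
```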
Data Format#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France."
}
result = client.evaluation.metrics.evaluate(
metric=ContextRecallMetricParam(
type="context_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France.",
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-recall",
type="context_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-recall",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-recall",
type="context_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-recall",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "context_recall",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Context Precision#
Measures the proportion of relevant chunks in the retrieved contexts (precision@k).
Score name: `context_precision`

Score range: 0 to 1, with higher scores indicating better precision.
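Context precision is a rank-aware average: precision is computed at each rank that holds a relevant chunk, then averaged over the relevant ranks. A small illustration, assuming the standard mean-precision@k formulation (the per-chunk relevance verdicts come from the judge LLM):

```python
def context_precision_score(relevance: list[int]) -> float:
    """Mean precision@k over the ranks of relevant chunks.

    relevance[k] is 1 if the chunk at rank k (0-based) was judged relevant, else 0.
    """
    if not any(relevance):
        return 0.0
    total, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / sum(relevance)
```

With `[1, 0, 1]` this averages precision@1 (1/1) and precision@3 (2/3), so ranking relevant chunks earlier raises the score.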
Data Format#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris"
}
result = client.evaluation.metrics.evaluate(
metric=ContextPrecisionMetricParam(
type="context_precision",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris",
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-precision",
type="context_precision",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-precision",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-precision",
type="context_precision",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-precision",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "context_precision",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Context Relevance#
Uses the judge LLM to assess whether retrieved contexts are relevant to the user input.

Score name: `nv_context_relevance`

Score range: 0/1/2 scale, normalized to [0, 1].
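The judge rates relevance on a three-point scale (0 = irrelevant, 1 = partially relevant, 2 = fully relevant). Assuming a linear raw/2 mapping (an assumption; the exact prompt and aggregation are internal to RAGAS), normalization to [0, 1] looks like:

```python
def normalize_rating(raw: int) -> float:
    """Map a 0/1/2 judge rating onto [0, 1] via raw / 2 (assumed linear mapping)."""
    if raw not in (0, 1, 2):
        raise ValueError("rating must be 0, 1, or 2")
    return raw / 2
```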
Data Format#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."]
}
result = client.evaluation.metrics.evaluate(
metric=ContextRelevanceMetricParam(
type="context_relevance",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-relevance",
type="context_relevance",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-relevance",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-relevance",
type="context_relevance",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-relevance",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "nv_context_relevance",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Context Entity Recall#
Measures the recall of entities in the retrieved contexts compared to entities in the reference.
Score name: `context_entity_recall`

Score range: 0 to 1, with higher scores indicating better entity coverage.
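In set terms, the judge extracts entities from both texts and the score is the fraction of reference entities that also appear in the retrieved contexts. A sketch of that ratio (entity extraction itself is performed by the judge LLM):

```python
def entity_recall(context_entities: set[str], reference_entities: set[str]) -> float:
    """|context entities ∩ reference entities| / |reference entities|."""
    if not reference_entities:
        return 0.0
    return len(context_entities & reference_entities) / len(reference_entities)
```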
Data Format#
{
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France."
}
result = client.evaluation.metrics.evaluate(
metric=ContextEntityRecallMetricParam(
type="context_entity_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"reference": "Paris is the capital of France.",
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-entity-recall",
type="context_entity_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include the metric's required columns)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-context-entity-recall",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-context-entity-recall",
type="context_entity_recall",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-context-entity-recall",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "context_entity_recall",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Faithfulness#
Measures how factually consistent a response is with the retrieved context.
Score name: `faithfulness`

Score range: 0 to 1, with higher scores indicating better consistency.
Data Format#
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."]
}
result = client.evaluation.metrics.evaluate(
metric=FaithfulnessMetricParam(
type="faithfulness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-faithfulness",
type="faithfulness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-faithfulness",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-faithfulness",
type="faithfulness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-faithfulness",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "faithfulness",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Response Groundedness#
Evaluates whether the response is grounded in retrieved contexts.
Score name: `nv_response_groundedness`

Score range: 0/1/2 scale, normalized to [0, 1].
Data Format#
{
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."]
}
result = client.evaluation.metrics.evaluate(
metric=ResponseGroundednessMetricParam(
type="response_groundedness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-groundedness",
type="response_groundedness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-response-groundedness",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-groundedness",
type="response_groundedness",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-response-groundedness",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "nv_response_groundedness",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Noise Sensitivity#
Measures how sensitive the response is to irrelevant or noisy content in the retrieved contexts.
Score name: `noise_sensitivity`

Score range: 0 to 1, with lower scores indicating better robustness to noise.
Data Format#
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"reference": "Paris",
"retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."]
}
result = client.evaluation.metrics.evaluate(
metric=NoiseSensitivityMetricParam(
type="noise_sensitivity",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"reference": "Paris",
"retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-noise-sensitivity",
type="noise_sensitivity",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-noise-sensitivity",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-noise-sensitivity",
type="noise_sensitivity",
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-noise-sensitivity",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "noise_sensitivity",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Response Relevancy#
Measures how relevant the response is to the user's question using embedding-based cosine similarity. Requires both a judge LLM and an embeddings model.

Score name: `answer_relevancy`

Score range: 0 to 1, with higher scores indicating better relevance.
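Under the hood, the judge LLM generates candidate questions from the response, the embeddings model embeds them alongside the original question, and the score averages the cosine similarities. The sketch below mirrors that final aggregation step (the generation and embedding calls are represented here only as precomputed vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def response_relevancy(question_vec: list[float],
                       generated_question_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the user question and questions generated from the response."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0
```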
Data Format#
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital city of France."]
}
Configuration Options#
| Parameter | Type | Default | Description |
|---|---|---|---|
| `strictness` | int | | Number of parallel questions generated (NIM only supports 1) |
result = client.evaluation.metrics.evaluate(
metric=ResponseRelevancyMetricParam(
type="response_relevancy",
strictness=1,
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
embeddings_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/embeddings",
name="nvidia/nv-embedqa-e5-v5",
api_key_secret="nvidia-api-key",
),
),
dataset=DatasetRowsParam(
rows=[{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"retrieved_contexts": ["Paris is the capital city of France."],
}]
),
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}")
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-relevancy",
type="response_relevancy",
strictness=1,
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
embeddings_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/embeddings",
name="nvidia/nv-embedqa-e5-v5",
api_key_secret="nvidia-api-key",
)
)
# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric="my-workspace/my-response-relevancy",
dataset="my-workspace/rag-dataset",
params=EvaluationJobParamsParam(parallelism=16)
)
)
# Create the metric entity
client.evaluation.metrics.create(
name="my-response-relevancy",
type="response_relevancy",
strictness=1,
judge_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
),
embeddings_model=ModelParam(
url="https://integrate.api.nvidia.com/v1/embeddings",
name="nvidia/nv-embedqa-e5-v5",
api_key_secret="nvidia-api-key",
)
)
# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric="my-workspace/my-response-relevancy",
model=ModelParam( # Generation model
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
),
prompt_template={ # Can include RAG context variables
"messages": [{
"role": "user",
"content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
}]
},
dataset="my-workspace/rag-dataset", # response column optional
params=EvaluationJobParamsParam(
parallelism=16,
inference={"temperature": 0.7, "max_tokens": 1024},
)
)
)
{
"scores": [
{
"count": 1,
"histogram": {},
"name": "answer_relevancy",
"nan_count": 0,
"max": 1.0,
"mean": 1.0,
"min": 1.0,
"percentiles": {},
"score_type": "range",
"std_dev": 0.0,
"sum": 1.0,
"variance": 0.0
}
]
}
Dataset Format#
RAGAS metrics use specific column names:
| Field | Type | Required | Description |
|---|---|---|---|
| `user_input` | string | Yes | User question or input |
| `retrieved_contexts` | list[string] | Some metrics | List of context passages |
| `response` | string | Some metrics | Generated answer (optional for online mode, required for offline) |
| `reference` | string | Some metrics | Reference answer or ground truth |
Note
Online vs Offline Mode:

- Online mode (`MetricOnlineJob`): the `response` column is optional; it is generated automatically using the job's `model` and `prompt_template`.
- Offline mode (`MetricOfflineJob`): the `response` column is required and must contain pre-generated responses.
Example Dataset#
{
"user_input": "What is the capital of France?",
"retrieved_contexts": ["Paris is the capital and largest city of France."],
"response": "The capital of France is Paris.",
"reference": "Paris"
}
Note
Different metrics require different columns. Check the metric documentation for specific requirements.
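Because each metric has its own required columns, a quick client-side check before submitting a job can catch format errors early. A small illustrative helper (not part of the SDK; the column sets are taken from the Supported RAGAS Metrics table and apply to offline evaluation, where `response` is not generated for you):

```python
# Required dataset columns per RAGAS metric type (offline evaluation)
REQUIRED_COLUMNS = {
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_relevance": {"user_input", "retrieved_contexts"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "response_groundedness": {"response", "retrieved_contexts"},
    "noise_sensitivity": {"user_input", "response", "reference", "retrieved_contexts"},
    "response_relevancy": {"user_input", "response", "retrieved_contexts"},
}

def missing_columns(metric_type: str, row: dict) -> set[str]:
    """Return the required columns absent from a dataset row for the given metric."""
    return REQUIRED_COLUMNS[metric_type] - row.keys()
```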
Response Format#
All evaluation responses follow this structure:
{
"metric": {
"type": "faithfulness",
"judge_model": {"url": "...", "name": "..."}
},
"aggregate_scores": [
{
"name": "faithfulness",
"count": 1,
"mean": 0.95,
"min": 0.95,
"max": 0.95,
"sum": 0.95
}
],
"row_scores": [
{
"index": 0,
"row": {"user_input": "...", "response": "..."},
"scores": {"faithfulness": 0.95},
"error": null
}
]
}
Working with Results#
# Access aggregate scores
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}, count={score.count}")
# Access per-row scores
for row in result.row_scores:
if row.scores:
print(f"Row {row.index}: {row.scores}")
elif row.error:
print(f"Row {row.index} failed: {row.error}")
Managing Secrets for Authenticated Endpoints#
Store API keys as secrets for secure authentication:
# Create secrets for all endpoints that may require authentication
client.secrets.create(name="judge-api-key", data="<your-judge-key>")
client.secrets.create(name="embedding-api-key", data="<your-embedding-key>")
Reference secrets by name in your metric configuration:
"judge_model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-70b-instruct",
"api_key_secret": "judge-api-key" # Name of secret, not the actual API key
}
Job Management#
After successfully creating a job, navigate to Metrics Job Management to oversee its execution and monitor progress.
Troubleshooting#
Common Errors#
| Error | Cause | Solution |
|---|---|---|
| Missing judge model | Missing judge LLM config for metric | Add `judge_model` to the metric configuration |
| Missing embeddings model | Using `response_relevancy` without an embeddings model | Add `embeddings_model` to the metric configuration |
| Job stuck in "pending" | Model endpoint not accessible | Verify endpoint URLs and API key secrets |
| Authentication failed | Invalid or missing API key | Check that secret names match exactly |
| NaN scores (`nan_count > 0`) | Judge/model call failures (for example auth, endpoint, quota, or timeout). Some RAGAS metrics are known to return NaN instead of raising an error | Inspect row-level `error` fields and validate judge model authentication |
| Low faithfulness scores | Context doesn't support the response | Improve retrieval or response generation |
Warning
If you see `nan_count > 0` with `mean = null`, first validate judge model authentication. For some RAGAS metrics, auth failures can be converted to NaN scores instead of surfacing as a hard error.
Tips for Better Results#
- Use larger judge models (70B+) for more consistent scoring.
- Start with inline datasets to test your configuration before large evaluations.
- Set appropriate timeouts; judge LLM calls can take time with large contexts.
- Use parallelism wisely: increase `parallelism` for faster evaluation, but respect rate limits.
- Column names matter: RAGAS metrics use `user_input`, `retrieved_contexts`, `response`, and `reference`.
Important Notes#
- Dataset Limit: Live evaluation supports up to 10 rows per request. For larger evaluations, use job-based evaluation.
- Secret Management: API keys should be stored as secrets and referenced by name in `api_key_secret`. Never pass API keys directly in the request.
- Column Names: RAGAS metrics use specific column names:
  - `user_input` (not `question`)
  - `response` (not `answer`)
  - `retrieved_contexts` (not `contexts`)
  - `reference` (not `ground_truth`)
- Embeddings Model: Only `response_relevancy` requires an embeddings model. All other metrics use only the judge LLM.
Limitations#
- Judge Model Quality: Evaluation quality depends on the judge model's ability to follow instructions. Larger models (70B+) typically produce more consistent results.
- Dataset Format: RAGAS metrics use specific column names (`user_input`, `retrieved_contexts`, `response`, `reference`). Ensure your data matches this structure.
See also
Retriever Metrics - Evaluate retrieval quality
LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows