RAG Evaluation Metrics#

RAG (Retrieval-Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer-generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.

Overview#

RAGAS metrics score individual RAG components, from retrieval quality to answer groundedness. They support two evaluation modes:

  • Offline evaluation: Uses pre-generated responses from your dataset

  • Online evaluation: Responses are generated automatically using a model and prompt template before evaluation

    1. The job’s model and prompt_template are used to generate responses

    2. The generated response (in sample["output_text"]) is automatically used as the response in the RAGAS evaluation

    3. RAG context variables can be included in the job’s prompt_template:

      • {{user_input}} - User question/input from dataset

      • {{retrieved_contexts}} - Retrieved context passages from dataset
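As an illustration, here is how a prompt_template that uses both variables gets filled in from a dataset row. This is a sketch only: `str.format` stands in for the platform's Jinja-style rendering, and the row values are toy data.

```python
# Sketch of how online mode fills RAG variables into a prompt template.
# str.format stands in here for the platform's Jinja-style rendering.
prompt_template = {
    "messages": [{
        "role": "user",
        "content": "Context:\n{retrieved_contexts}\n\nQuestion: {user_input}\n\nAnswer:",
    }]
}

row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}

rendered = prompt_template["messages"][0]["content"].format(
    user_input=row["user_input"],
    # '\n\n'.join flattens the context list, like the Jinja filter join('\n\n')
    retrieved_contexts="\n\n".join(row["retrieved_contexts"]),
)
print(rendered)
```

The rendered message is what the generation model receives before its output is scored.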

RAGAS metrics require:

  • Judge LLM: An LLM to evaluate answer quality (required for most metrics)

  • Judge Embeddings (optional): Required for some metrics like response_relevancy

  • Data: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online)

Prerequisites#

Before running RAG evaluations:

  1. Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.

  2. Model Endpoints: Access to judge LLM endpoints (and embeddings model for some metrics)

  3. API Keys (if required): Create secrets for any endpoints requiring authentication

  4. Initialize the SDK:

import os
from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import (
    ContextEntityRecallMetricParam,
    ContextPrecisionMetricParam,
    ContextRecallMetricParam,
    ContextRelevanceMetricParam,
    DatasetRowsParam,
    EvaluationJobParamsParam,
    FaithfulnessMetricParam,
    MetricOfflineJobParam,
    MetricOnlineJobParam,
    ModelParam,
    NoiseSensitivityMetricParam,
    ResponseGroundednessMetricParam,
    ResponseRelevancyMetricParam,
)

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Creating a Secret for API Keys#

If using external endpoints that require authentication (like NVIDIA API), create a secret first:

client.secrets.create(
    name="nvidia-api-key",
    data="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA API key for RAG metrics"
)

Tip

RAGAS metrics accept both inline model definitions and model references (e.g., "my-workspace/my-model") for judge_model and embeddings_model fields. See Model Configuration for details.


Supported RAGAS Metrics#

| Use Case | Metric Type | Description | Required Columns* |
| --- | --- | --- | --- |
| Measure retrieval quality | context_recall | Coverage of reference information in retrieved context | user_input, retrieved_contexts, reference |
| | context_precision | Whether all retrieved chunks are relevant to the question | user_input, retrieved_contexts, reference |
| | context_relevance | Relevance of retrieved context to the question | user_input, retrieved_contexts |
| | context_entity_recall | Recall of important entities from reference in context | retrieved_contexts, reference |
| Detect hallucinations | faithfulness | Factual consistency of response with retrieved context | user_input, response, retrieved_contexts |
| | response_groundedness | Whether response is grounded in context without hallucinations | response, retrieved_contexts |
| | noise_sensitivity | Robustness to noisy or irrelevant context | user_input, response, reference, retrieved_contexts |
| Check if answers address the question | response_relevancy** | Response relevancy to question using embeddings similarity | user_input, response, retrieved_contexts |

* Required Columns: Dataset columns that must be present for the metric to be evaluated.

** Requires embeddings_model in addition to judge_model


Context Recall#

Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.

  • Score name: context_recall

  • Score range: 0 to 1, with higher scores indicating better recall.
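Conceptually, the judge decomposes the reference into claims and checks each one against the retrieved contexts; the score is the supported fraction. A minimal sketch of that final aggregation step (the claim decomposition and per-claim verification are done by the judge LLM and are not shown here):

```python
def context_recall_score(claim_supported):
    """Fraction of reference claims the judge found supported by the contexts.

    claim_supported: one boolean verdict per claim extracted from the reference.
    """
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Two of three reference claims are covered by the retrieved contexts -> 2/3
print(context_recall_score([True, True, False]))
```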

Data Format#

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": ["Paris is the capital and largest city of France."],
  "reference": "Paris is the capital of France."
}
Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=ContextRecallMetricParam(
        type="context_recall",
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-recall",
    type="context_recall",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include user_input, retrieved_contexts, and reference)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-context-recall",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-recall",
    type="context_recall",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-context-recall",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_recall",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Context Precision#

Measures the proportion of relevant chunks in the retrieved contexts (precision@k).

  • Score name: context_precision

  • Score range: 0 to 1, with higher scores indicating better precision.
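Under the hood this is a rank-weighted precision: for every retrieved chunk the judge marks relevant, precision@k at that rank is accumulated and averaged. A rough sketch of that aggregation (the per-chunk relevance verdicts come from the judge LLM):

```python
def context_precision(relevance):
    """Mean precision@k over the ranks of relevant chunks.

    relevance: 0/1 judge verdicts for each retrieved chunk, in rank order.
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at each relevant rank
    return total / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2 = 0.833
print(round(context_precision([1, 0, 1]), 3))
```

Because precision@k weights early ranks more heavily, a relevant chunk buried behind irrelevant ones lowers the score even if it is eventually retrieved.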

Data Format#

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": ["Paris is the capital and largest city of France."],
  "reference": "Paris"
}
Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=ContextPrecisionMetricParam(
        type="context_precision",
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris",
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-precision",
    type="context_precision",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include user_input, retrieved_contexts, and reference)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-context-precision",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-precision",
    type="context_precision",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-context-precision",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_precision",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Context Relevance#

Uses the judge LLM to assess whether the retrieved contexts are relevant to the user input.

  • Score name: nv_context_relevance

  • Score range: 0/1/2 scale, normalized to [0,1].
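The judge scores on a 0/1/2 scale, and the reported score is normalized into [0, 1] by dividing by the maximum. A trivial sketch of that normalization:

```python
def normalize_relevance(raw_verdict):
    """Map the judge's 0/1/2 relevance verdict into the reported [0, 1] range."""
    assert raw_verdict in (0, 1, 2)
    return raw_verdict / 2

print(normalize_relevance(1))  # 0.5
```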

Data Format#

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": ["Paris is the capital and largest city of France."]
}
Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=ContextRelevanceMetricParam(
        type="context_relevance",
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-relevance",
    type="context_relevance",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include user_input and retrieved_contexts)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-context-relevance",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-relevance",
    type="context_relevance",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-context-relevance",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "nv_context_relevance",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Context Entity Recall#

Measures the recall of entities in the retrieved contexts compared to entities in the reference.

  • Score name: context_entity_recall

  • Score range: 0 to 1, with higher scores indicating better entity coverage.
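The computation reduces to set overlap: entities are extracted from the reference and from the contexts (RAGAS uses the judge LLM for the extraction), and the score is the covered fraction. A toy sketch with pre-extracted entity sets:

```python
def context_entity_recall(context_entities, reference_entities):
    """Share of reference entities that also appear in the retrieved contexts."""
    ref = set(reference_entities)
    if not ref:
        return 0.0
    return len(ref & set(context_entities)) / len(ref)

# Context mentions Paris but not France -> half the reference entities covered
print(context_entity_recall({"Paris"}, {"Paris", "France"}))  # 0.5
```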

Data Format#

{
  "retrieved_contexts": ["Paris is the capital and largest city of France."],
  "reference": "Paris is the capital of France."
}
Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=ContextEntityRecallMetricParam(
        type="context_entity_recall",
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-entity-recall",
    type="context_entity_recall",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include retrieved_contexts and reference)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-context-entity-recall",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-context-entity-recall",
    type="context_entity_recall",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-context-entity-recall",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_entity_recall",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Faithfulness#

Measures how factually consistent a response is with the retrieved context.

  • Score name: faithfulness

  • Score range: 0 to 1, with higher scores indicating better consistency.

Data Format#

{
  "user_input": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "retrieved_contexts": ["Paris is the capital and largest city of France."]
}
Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=FaithfulnessMetricParam(
        type="faithfulness",
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-faithfulness",
    type="faithfulness",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-faithfulness",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-faithfulness",
    type="faithfulness",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-faithfulness",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "faithfulness",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Response Groundedness#

Evaluates whether the response is grounded in retrieved contexts.

  • Score name: nv_response_groundedness

  • Score range: 0/1/2 scale, normalized to [0,1].

Data Format#

{
  "response": "The capital of France is Paris.",
  "retrieved_contexts": ["Paris is the capital and largest city of France."]
}
Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=ResponseGroundednessMetricParam(
        type="response_groundedness",
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-response-groundedness",
    type="response_groundedness",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-response-groundedness",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-response-groundedness",
    type="response_groundedness",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-response-groundedness",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "nv_response_groundedness",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Noise Sensitivity#

Measures how sensitive the response is to irrelevant or noisy content in the retrieved contexts.

  • Score name: noise_sensitivity

  • Score range: 0 to 1, with lower scores indicating better robustness to noise.

Data Format#

{
  "user_input": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "reference": "Paris",
  "retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."]
}
Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=NoiseSensitivityMetricParam(
        type="noise_sensitivity",
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
            "retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."],
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-noise-sensitivity",
    type="noise_sensitivity",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-noise-sensitivity",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-noise-sensitivity",
    type="noise_sensitivity",
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-noise-sensitivity",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "noise_sensitivity",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Response Relevancy#

Measures how relevant the response is to the user’s question using embedding-based cosine similarity. Requires both a judge LLM and an embeddings model.

  • Score name: answer_relevancy

  • Score range: 0 to 1, with higher scores indicating better relevance.
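Mechanically, the judge generates questions back from the response, the embeddings model embeds them, and the score is the mean cosine similarity to the embedded original question. A sketch of the similarity step, with toy 2-d vectors standing in for real embedding vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def response_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question embedding and the
    embeddings of questions the judge generated back from the response."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0

# One generated question matches the original exactly, one is orthogonal
print(response_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```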

Data Format#

{
  "user_input": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "retrieved_contexts": ["Paris is the capital city of France."]
}

Configuration Options#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| strictness | int | 1 | Number of parallel questions generated (NIM only supports 1) |

Live evaluation:

result = client.evaluation.metrics.evaluate(
    metric=ResponseRelevancyMetricParam(
        type="response_relevancy",
        strictness=1,
        judge_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-70b-instruct",
            api_key_secret="nvidia-api-key",
        ),
        embeddings_model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/embeddings",
            name="nvidia/nv-embedqa-e5-v5",
            api_key_secret="nvidia-api-key",
        ),
    ),
    dataset=DatasetRowsParam(
        rows=[{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital city of France."],
        }]
    ),
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Offline job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-response-relevancy",
    type="response_relevancy",
    strictness=1,
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    ),
    embeddings_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/embeddings",
        name="nvidia/nv-embedqa-e5-v5",
        api_key_secret="nvidia-api-key",
    )
)

# Create the offline evaluation job (dataset must include response)
job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric="my-workspace/my-response-relevancy",
        dataset="my-workspace/rag-dataset",
        params=EvaluationJobParamsParam(parallelism=16)
    )
)
Online job:

# Create the metric entity
client.evaluation.metrics.create(
    name="my-response-relevancy",
    type="response_relevancy",
    strictness=1,
    judge_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        name="meta/llama-3.1-70b-instruct",
        api_key_secret="nvidia-api-key",
    ),
    embeddings_model=ModelParam(
        url="https://integrate.api.nvidia.com/v1/embeddings",
        name="nvidia/nv-embedqa-e5-v5",
        api_key_secret="nvidia-api-key",
    )
)

# Create the online evaluation job (response will be generated automatically)
job = client.evaluation.metric_jobs.create(
    spec=MetricOnlineJobParam(
        metric="my-workspace/my-response-relevancy",
        model=ModelParam(  # Generation model
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="nvidia/llama-3.3-nemotron-super-49b-v1",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template={  # Can include RAG context variables
            "messages": [{
                "role": "user",
                "content": "Context:\n{{ retrieved_contexts | join('\n\n') }}\n\nQuestion: {{ user_input }}\n\nAnswer:"
            }]
        },
        dataset="my-workspace/rag-dataset",  # response column optional
        params=EvaluationJobParamsParam(
            parallelism=16,
            inference={"temperature": 0.7, "max_tokens": 1024},
        )
    )
)
Example output:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "answer_relevancy",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Dataset Format#

RAGAS metrics use specific column names:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| user_input | string | Yes | User question or input |
| retrieved_contexts | list[string] | Some metrics | List of context passages |
| response | string | Some metrics | Generated answer (optional for online mode, required for offline) |
| reference | string | Some metrics | Reference answer or ground truth |

Note

Online vs Offline Mode:

  • Online mode (MetricOnlineJob): response column is optional - will be generated automatically using job’s model and prompt_template

  • Offline mode (MetricOfflineJob): response column is required - must contain pre-generated responses

Example Dataset#

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": ["Paris is the capital and largest city of France."],
  "response": "The capital of France is Paris.",
  "reference": "Paris"
}

Note

Different metrics require different columns. Check the metric documentation for specific requirements.


Response Format#

All evaluation responses follow this structure:

{
  "metric": {
    "type": "faithfulness",
    "judge_model": {"url": "...", "name": "..."}
  },
  "aggregate_scores": [
    {
      "name": "faithfulness",
      "count": 1,
      "mean": 0.95,
      "min": 0.95,
      "max": 0.95,
      "sum": 0.95
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "row": {"user_input": "...", "response": "..."},
      "scores": {"faithfulness": 0.95},
      "error": null
    }
  ]
}

Working with Results#

# Access aggregate scores
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.scores:
        print(f"Row {row.index}: {row.scores}")
    elif row.error:
        print(f"Row {row.index} failed: {row.error}")

Managing Secrets for Authenticated Endpoints#

Store API keys as secrets for secure authentication:

# Create secrets for all endpoints that may require authentication
client.secrets.create(name="judge-api-key", data="<your-judge-key>")
client.secrets.create(name="embedding-api-key", data="<your-embedding-key>")

Reference secrets by name in your metric configuration:

"judge_model": {
    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
    "name": "meta/llama-3.1-70b-instruct",
    "api_key_secret": "judge-api-key"  # Name of secret, not the actual API key
}

Job Management#

After successfully creating a job, navigate to Metrics Job Management to monitor its execution and progress.


Troubleshooting#

Common Errors#

| Error | Cause | Solution |
| --- | --- | --- |
| judge_model is required | Missing judge LLM config for metric | Add judge_model to the metric configuration |
| embeddings_model is required | Using response_relevancy without embeddings | Add embeddings_model to the metric configuration |
| Job stuck in “pending” | Model endpoint not accessible | Verify endpoint URLs and API key secrets |
| Authentication failed | Invalid or missing API key | Check that secret names match exactly |
| nan_count > 0 and mean = null | Judge/model call failures (for example auth, endpoint, quota, or timeout); some RAGAS metrics return NaN instead of raising on these failures | Inspect row-level errors and request logs; verify API key, endpoint, and model access |
| Low faithfulness scores | Context doesn’t support the response | Improve retrieval or response generation |

Warning

If you see nan_count > 0 with mean = null, first validate judge model authentication. For some RAGAS metrics, auth failures can be converted to NaN scores instead of surfacing as a hard error.

Tips for Better Results#

  • Use larger judge models (70B+) for more consistent scoring

  • Start with inline datasets to test your configuration before large evaluations

  • Set appropriate timeouts - judge LLM calls can take time with large contexts

  • Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits

  • Column names matter - RAGAS metrics use user_input, retrieved_contexts, response, and reference


Important Notes#

  1. Dataset Limit: Live evaluation supports up to 10 rows per request. For larger evaluations, use job-based evaluation.

  2. Secret Management: API keys should be stored as secrets and referenced by name in api_key_secret. Never pass API keys directly in the request.

  3. Column Names: RAGAS metrics use specific column names:

    • user_input (not question)

    • response (not answer)

    • retrieved_contexts (not contexts)

    • reference (not ground_truth)

  4. Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.
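If an existing dataset uses the other common column names, a small renaming pass can adapt it before evaluation. The mapping below is a hypothetical helper, not part of the SDK:

```python
# Hypothetical mapping from common RAG column names to the names RAGAS expects.
RENAME = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}

def to_ragas_row(row):
    """Rename a dataset row's columns to the RAGAS schema; other keys pass through."""
    return {RENAME.get(key, key): value for key, value in row.items()}

row = {"question": "What is the capital of France?", "answer": "Paris."}
print(to_ragas_row(row))
```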

Limitations#

  1. Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.

  2. Dataset Format: RAGAS metrics use specific column names (user_input, retrieved_contexts, response, reference). Ensure your data matches this structure.

See also