# Expected Evaluation Duration

An evaluation job can take anywhere from a few minutes to many hours, depending on the target model, the evaluation configuration, and other factors. The following tables list example evaluation durations organized by category.

> **Important**
>
> These are only a few examples of possible evaluations.
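
For generation-bound jobs, a useful back-of-envelope model is total generated tokens divided by effective serving throughput. The sketch below illustrates the arithmetic only; the sample counts, token lengths, and throughput figures are assumptions for illustration, not measurements behind the tables.

```python
# Back-of-envelope duration estimate for a generation-bound evaluation.
# All inputs are illustrative assumptions; measure your own throughput.

def estimate_eval_hours(num_samples: int,
                        tokens_per_sample: int,
                        tokens_per_second: float,
                        concurrency: int = 1) -> float:
    """Rough wall-clock hours: total generated tokens / effective throughput."""
    total_tokens = num_samples * tokens_per_sample
    effective_tps = tokens_per_second * concurrency
    return total_tokens / effective_tps / 3600

# Example: ~1,300 gsm8k problems, ~256 generated tokens each, ~20 tok/s
# effective throughput -> roughly 4.6 hours, the same order of magnitude
# as the gsm8k row below.
print(f"{estimate_eval_hours(1300, 256, 20):.1f} hours")
```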

## Basic Model Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| LM Evaluation Harness (gsm8k task) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | gsm8k (academic) | 5–10 hours |
| BigCode Evaluation | Inference: meta/llama-3.1-8b-instruct | 1 A100 | HumanEval (academic) | 1–5 hours |
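
The first row corresponds to the upstream EleutherAI lm-evaluation-harness. A minimal sketch of running the gsm8k task with that open-source harness directly (`pip install lm-eval`) is shown below; the Hugging Face model id, dtype, and batch size are assumptions, and a hosted evaluation job would be launched through its own job API rather than this call.

```python
# Minimal sketch: running the gsm8k task with the upstream EleutherAI
# lm-evaluation-harness invoked directly. The HF model id, dtype, and
# batch size are assumptions for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["gsm8k"],
    batch_size=8,
    device="cuda:0",
)
print(results["results"]["gsm8k"])   # accuracy metrics for the task
```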

## Similarity and Judgment Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| Similarity Metrics (offline) | Offline generated answers | None (offline) | 20 answers | Minutes |
| Similarity Metrics (online) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | 113 questions/prompts | Minutes–1 hour |
| LLM-as-a-Judge | Inference: meta/llama-3.1-8b-instruct; judge: meta/llama-3.1-70b-instruct | 5 A100s | MT-bench (academic) | 1–5 hours |
| LLM-as-a-Judge (judgment only) | Judge: meta/llama-3.1-70b-instruct | 4 A100s | 2 answers | Minutes |
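
The offline similarity row is fast because it only scores pre-generated answers against references, with no model inference. A minimal sketch using the rouge-score package (`pip install rouge-score`) is shown below; ROUGE is one plausible lexical similarity metric, and the toy answers are illustrative assumptions.

```python
# Sketch of an "offline" similarity-metric evaluation: scoring a small set
# of pre-generated answers against references. With ~20 answers this runs
# in well under a minute; no GPU or inference endpoint is needed.
from rouge_score import rouge_scorer

references = ["Paris is the capital of France."]     # ground-truth answers
predictions = ["The capital of France is Paris."]    # pre-generated answers

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```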

## Retriever Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| Retriever Evaluation (embedding only) | Embedding: nvidia/nv-embedqa-e5-v5 | 1 A100 | FiQA (academic) | Minutes |
| Retriever Evaluation (embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5; reranking: nvidia/nv-rerankqa-mistral-4b-v3 | 2 A100s | FiQA (academic) | Minutes |
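
Retriever evaluations finish in minutes because the expensive step is embedding the corpus and queries; the metric computation itself is cheap. Below is a small sketch of the scoring step, recall@k over relevance judgments; the toy qrels and rankings stand in for an academic set such as FiQA, where the rankings would come from the embedding (and optional reranking) models above.

```python
# Sketch of the metric step of a retriever evaluation: given relevance
# judgments (qrels) and ranked document ids per query, compute recall@k.
# The data below is a toy stand-in for illustration.

def recall_at_k(qrels: dict[str, set[str]],
                rankings: dict[str, list[str]],
                k: int) -> float:
    """Mean fraction of relevant docs retrieved in the top k, per query."""
    scores = []
    for qid, relevant in qrels.items():
        top_k = set(rankings.get(qid, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores)

qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
rankings = {"q1": ["d3", "d9", "d1"], "q2": ["d7", "d2", "d5"]}
print(recall_at_k(qrels, rankings, k=2))   # 0.75
```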

## RAG Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| RAG (with pre-retrieved contexts) | Inference: meta/llama-3.1-70b-instruct; judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 9 A100s | 3 questions | Minutes |
| RAG (with pre-generated answers) | Judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 5 A100s | 322 questions | Minutes–1 hour |
| RAG (retriever with embedding only) | Embedding: nvidia/nv-embedqa-e5-v5; inference: meta/llama-3.1-8b-instruct; judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 7 A100s | FiQA (academic) | 1–5 hours |
| RAG (retriever with embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5; reranking: nvidia/nv-rerankqa-mistral-4b-v3; inference: meta/llama-3.1-70b-instruct; judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 11 A100s | FiQA (academic) | 1–5 hours |
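
Several RAG rows list judge embeddings, which are typically used to score a generated answer against a reference answer by embedding similarity. The sketch below shows that scoring step only; `embed()` is a hypothetical placeholder for a call to an embedding endpoint such as nvidia/nv-embedqa-e5-v5, and the answers are illustrative assumptions.

```python
# Sketch of the "judge embeddings" step in a RAG evaluation: embed the
# generated and reference answers, then score their cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: replace with a real call to your embedding
    # endpoint. A seeded random vector keeps this sketch self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=1024)

generated = "Interest accrues daily on the outstanding balance."
reference = "The outstanding balance accrues interest every day."
print(cosine_similarity(embed(generated), embed(reference)))
```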