# Expected Evaluation Duration

An evaluation job can take anywhere from a few minutes to many hours, depending on the target model, the evaluation configuration, and other factors. The following tables list example evaluation durations organized by category.

> **Important**
>
> These are only a few examples of possible evaluations.
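
For generation-bound jobs, a useful back-of-envelope model is total generated tokens divided by effective serving throughput. The sketch below illustrates the arithmetic only; the sample counts, token lengths, and throughput figures are assumptions for illustration, not measurements behind the tables.

```python
# Back-of-envelope duration estimate for a generation-bound evaluation.
# All inputs are illustrative assumptions; measure your own throughput.

def estimate_eval_hours(num_samples: int,
                        tokens_per_sample: int,
                        tokens_per_second: float,
                        concurrency: int = 1) -> float:
    """Rough wall-clock hours: total generated tokens / effective throughput."""
    total_tokens = num_samples * tokens_per_sample
    effective_tps = tokens_per_second * concurrency
    return total_tokens / effective_tps / 3600

# Example: ~1,300 gsm8k problems, ~256 generated tokens each, ~20 tok/s
# effective throughput -> roughly 4.6 hours, the same order of magnitude
# as the gsm8k row below.
print(f"{estimate_eval_hours(1300, 256, 20):.1f} hours")
```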

## Basic Model Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| LM Evaluation Harness (gsm8k task) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | gsm8k (academic) | 5–10 hours |
| BigCode Evaluation | Inference: meta/llama-3.1-8b-instruct | 1 A100 | HumanEval (academic) | 1–5 hours |
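
The first row corresponds to the upstream EleutherAI lm-evaluation-harness. A minimal sketch of running the gsm8k task with that open-source harness directly (`pip install lm-eval`) is shown below; the Hugging Face model id, dtype, and batch size are assumptions, and a hosted evaluation job would be launched through its own job API rather than this call.

```python
# Minimal sketch: running the gsm8k task with the upstream EleutherAI
# lm-evaluation-harness invoked directly. The HF model id, dtype, and
# batch size are assumptions for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["gsm8k"],
    batch_size=8,
    device="cuda:0",
)
print(results["results"]["gsm8k"])   # accuracy metrics for the task
```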

## Similarity and Judgment Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| Similarity Metrics (offline) | Offline generated answers | None (offline) | 20 answers | Minutes |
| Similarity Metrics (online) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | 113 questions/prompts | Minutes–1 hour |
| LLM-as-a-Judge | Inference: meta/llama-3.1-8b-instruct; judge: meta/llama-3.1-70b-instruct | 5 A100s | MT-bench (academic) | 1–5 hours |
| LLM-as-a-Judge (judgment only) | Judge: meta/llama-3.1-70b-instruct | 4 A100s | 2 answers | Minutes |
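
The offline similarity row is fast because it only scores pre-generated answers against references, with no model inference. A minimal sketch using the rouge-score package (`pip install rouge-score`) is shown below; ROUGE is one plausible lexical similarity metric, and the toy answers are illustrative assumptions.

```python
# Sketch of an "offline" similarity-metric evaluation: scoring a small set
# of pre-generated answers against references. With ~20 answers this runs
# in well under a minute; no GPU or inference endpoint is needed.
from rouge_score import rouge_scorer

references = ["Paris is the capital of France."]     # ground-truth answers
predictions = ["The capital of France is Paris."]    # pre-generated answers

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```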

## Retriever Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| Retriever Evaluation (embedding only) | Embedding: nvidia/nv-embedqa-e5-v5 | 1 A100 | FiQA (academic) | Minutes |
| Retriever Evaluation (embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5; reranking: nvidia/nv-rerankqa-mistral-4b-v3 | 2 A100s | FiQA (academic) | Minutes |
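
Retriever evaluations finish in minutes because the expensive step is embedding the corpus and queries; the metric computation itself is cheap. Below is a small sketch of the scoring step, recall@k over relevance judgments; the toy qrels and rankings stand in for an academic set such as FiQA, where the rankings would come from the embedding (and optional reranking) models above.

```python
# Sketch of the metric step of a retriever evaluation: given relevance
# judgments (qrels) and ranked document ids per query, compute recall@k.
# The data below is a toy stand-in for illustration.

def recall_at_k(qrels: dict[str, set[str]],
                rankings: dict[str, list[str]],
                k: int) -> float:
    """Mean fraction of relevant docs retrieved in the top k, per query."""
    scores = []
    for qid, relevant in qrels.items():
        top_k = set(rankings.get(qid, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores)

qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
rankings = {"q1": ["d3", "d9", "d1"], "q2": ["d7", "d2", "d5"]}
print(recall_at_k(qrels, rankings, k=2))   # 0.75
```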

## RAG Evaluations

| Evaluation | Models | Hardware | Dataset | Expected Time |
|---|---|---|---|---|
| RAG (with pre-retrieved contexts) | Inference: meta/llama-3.1-70b-instruct; judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 9 A100s | 3 questions | Minutes |
| RAG (with pre-generated answers) | Judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 5 A100s | 322 questions | Minutes–1 hour |
| RAG (retriever with embedding only) | Embedding: nvidia/nv-embedqa-e5-v5; inference: meta/llama-3.1-8b-instruct; judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 7 A100s | FiQA (academic) | 1–5 hours |
| RAG (retriever with embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5; reranking: nvidia/nv-rerankqa-mistral-4b-v3; inference: meta/llama-3.1-70b-instruct; judge LLM: meta/llama-3.1-70b-instruct; judge embeddings: nvidia/nv-embedqa-e5-v5 | 11 A100s | FiQA (academic) | 1–5 hours |
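
Several RAG rows list judge embeddings, which are typically used to score a generated answer against a reference answer by embedding similarity. The sketch below shows that scoring step only; `embed()` is a hypothetical placeholder for a call to an embedding endpoint such as nvidia/nv-embedqa-e5-v5, and the answers are illustrative assumptions.

```python
# Sketch of the "judge embeddings" step in a RAG evaluation: embed the
# generated and reference answers, then score their cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: replace with a real call to your embedding
    # endpoint. A seeded random vector keeps this sketch self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=1024)

generated = "Interest accrues daily on the outstanding balance."
reference = "The outstanding balance accrues interest every day."
print(cosine_similarity(embed(generated), embed(reference)))
```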