# Expected Evaluation Duration
The time an evaluation job takes can range from a few minutes to many hours, depending on the target model, the evaluation configuration, and other factors. The following tables list example evaluation durations, organized by category.
> **Important**
>
> These are only a few examples of possible evaluation configurations.
## Basic Model Evaluations

| Example Evaluation | Example Models | Example Hardware | Example Dataset | Example Expected Time |
|---|---|---|---|---|
| LM Evaluation Harness (gsm8k task) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) gsm8k dataset | 5–10 Hours |
| BigCode Evaluation | Inference: meta/llama-3.1-8b-instruct | 1 A100 | (academic) HumanEval dataset | 1–5 Hours |
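Runs such as the gsm8k row above are generation-heavy: every test question requires a full free-form completion, which is largely why even an 8B model on a single A100 lands in the multi-hour range. As a point of reference, the sketch below shows how the open-source EleutherAI LM Evaluation Harness can run the gsm8k task directly from Python; the checkpoint name, dtype, and batch size are illustrative assumptions, and this is not the evaluation service's own API.

```python
# Minimal sketch (not this documentation's service API): running the
# open-source EleutherAI LM Evaluation Harness on the gsm8k task against
# a local Hugging Face checkpoint. Checkpoint, dtype, and batch size are
# illustrative assumptions.
import json

import lm_eval  # pip install lm-eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["gsm8k"],
    batch_size=8,
)

# Per-task metrics (for example, exact_match) are reported under results["results"].
print(json.dumps(results["results"], indent=2, default=str))
```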
## Similarity and Judgment Evaluations

| Example Evaluation | Example Models | Example Hardware | Example Dataset | Example Expected Time |
|---|---|---|---|---|
| Similarity Metrics (offline) | Offline generated answers | — | 20 answers | Minutes |
| Similarity Metrics (online) | Inference: meta/llama-3.1-8b-instruct | 1 A100 | 113 questions/prompts | Minutes–1 Hour |
| LLM-as-a-Judge | Inference: meta/llama-3.1-8b-instruct and judge: meta/llama-3.1-70b-instruct | 5 A100s | (academic) MT-bench dataset | 1–5 Hours |
| LLM-as-a-Judge (judgment only) | Judge: meta/llama-3.1-70b-instruct | 4 A100s | 2 answers | Minutes |
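The online rows above scale roughly with the number of prompts times per-request latency: online similarity metrics make one generation call per prompt, and LLM-as-a-Judge adds a second call to the judge model per answer. The sketch below illustrates that two-call pattern, assuming both models are served behind OpenAI-compatible chat endpoints; the environment variables, prompt wording, and 1–10 rating scale are illustrative assumptions rather than this documentation's judging protocol.

```python
# Minimal sketch of the generate-then-judge pattern, assuming OpenAI-compatible
# endpoints for both models. Endpoint URLs, API keys, and the judging prompt
# are illustrative assumptions.
import os

from openai import OpenAI  # pip install openai

inference = OpenAI(base_url=os.environ["INFERENCE_BASE_URL"], api_key=os.environ.get("API_KEY", "none"))
judge = OpenAI(base_url=os.environ["JUDGE_BASE_URL"], api_key=os.environ.get("API_KEY", "none"))

question = "What is the capital of France?"

# 1) Generate an answer with the inference model.
answer = inference.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": question}],
    temperature=0.0,
).choices[0].message.content

# 2) Ask the judge model to score the answer.
judging_prompt = (
    "Rate the following answer from 1 to 10 for correctness. "
    "Reply with only the number.\n"
    f"Question: {question}\nAnswer: {answer}"
)
score = judge.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": judging_prompt}],
    temperature=0.0,
).choices[0].message.content

print(f"answer={answer!r}\njudge score={score}")
```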
## Retriever Evaluations

| Example Evaluation | Example Models | Example Hardware | Example Dataset | Example Expected Time |
|---|---|---|---|---|
| Retriever Evaluation (embedding only) | Embedding: nvidia/nv-embedqa-e5-v5 | 1 A100 | (academic) FiQA dataset | Minutes |
| Retriever Evaluation (embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5 and reranking: nvidia/nv-rerankqa-mistral-4b-v3 | 2 A100s | (academic) FiQA dataset | Minutes |
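Retriever evaluations are comparatively quick because they only embed queries and documents (and optionally rerank the candidates) and then score the resulting rankings against relevance judgments; no long-form generation is involved. The sketch below illustrates one such ranking metric, Recall@k, over toy data; the helper function and data are illustrative, not this documentation's implementation or the exact metric set reported for FiQA.

```python
# Illustrative only: the kind of ranking metric (Recall@k) a retriever
# evaluation computes from per-query rankings and relevance judgments.
# The toy data is made up; a real run uses the FiQA corpus and qrels.
from typing import Dict, List, Set


def recall_at_k(retrieved: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Average over queries of the fraction of relevant documents found in the top-k results."""
    per_query = []
    for qid, relevant in qrels.items():
        if not relevant:
            continue
        top_k = set(retrieved.get(qid, [])[:k])
        per_query.append(len(top_k & relevant) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


retrieved = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"]}  # ranked doc IDs per query
qrels = {"q1": {"d1", "d4"}, "q2": {"d2"}}                  # relevant doc IDs per query
print(recall_at_k(retrieved, qrels, k=2))                   # 0.5
```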
## RAG Evaluations

| Example Evaluation | Example Models | Example Hardware | Example Dataset | Example Expected Time |
|---|---|---|---|---|
| RAG (with pre-retrieved contexts) | Inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 9 A100s | 3 questions | Minutes |
| RAG (with pre-generated answers) | Judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 5 A100s | 322 questions | Minutes–1 Hour |
| RAG (retriever with embedding only) | Embedding: nvidia/nv-embedqa-e5-v5, inference: meta/llama-3.1-8b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 7 A100s | (academic) FiQA dataset | 1–5 Hours |
| RAG (retriever with embedding + reranking) | Embedding: nvidia/nv-embedqa-e5-v5, reranking: nvidia/nv-rerankqa-mistral-4b-v3, inference: meta/llama-3.1-70b-instruct, judge LLM: meta/llama-3.1-70b-instruct, judge embeddings: nvidia/nv-embedqa-e5-v5 | 11 A100s | (academic) FiQA dataset | 1–5 Hours |
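In these RAG examples, the hardware column is roughly the sum of the models deployed for the run: each embedding or reranking model and the 8B inference model occupies one A100, while each 70B model occupies four. For instance, the embedding-plus-reranking row totals 1 + 1 + 4 + 4 + 1 = 11 A100s, and the pre-generated-answers row, which needs only the judge LLM and judge embeddings, totals 4 + 1 = 5.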