| Harness | Description | Registry | Version | Architecture | Tasks |
|---|---|---|---|---|---|
| AA-LCR | A challenging benchmark measuring language models’ ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer; see the token-counting sketch after this table). | NGC | 26.01 | multiarch | aa_lcr |
| bfcl | The Berkeley Function Calling Leaderboard V3 (also called the Berkeley Tool Calling Leaderboard V3) evaluates an LLM’s ability to call functions (also known as tools) accurately. | NGC | 26.01 | multiarch | bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting |
| bigcode-evaluation-harness | A framework for the evaluation of autoregressive code generation language models. | NGC | 26.01 | multiarch | humaneval, humaneval_instruct, humanevalplus, mbpp-chat, mbpp-completions, mbppplus-chat, mbppplus-completions, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts |
| codec | Contamination detection framework for evaluating language models. | NGC | 26.01 | amd | aime_2024, aime_2025, bbq, bfcl_v3, frames, gpqa_diamond, gsm8k_test, gsm8k_train, hellaswag_test, hellaswag_train, hle, ifbench, ifeval, livecodebench_v1, livecodebench_v5, math_500_problem, math_500_solution, mmlu_pro_test, mmlu_test, openai_humaneval, reward_bench_v1, reward_bench_v2, scicode, swebench_test, swebench_train, taubench, terminalbench |
| garak | Garak is an LLM vulnerability scanner. | NGC | 26.01 | multiarch | garak, garak-completions |
| genai_perf_eval | A tool to evaluate the performance of LLM endpoints, based on GenAI-Perf. | NGC | 26.01 | amd | genai_perf_generation, genai_perf_generation_completions, genai_perf_summarization, genai_perf_summarization_completions |
| helm | A framework for evaluating large language models in medical applications across various healthcare tasks. | NGC | 26.01 | amd | aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med |
| hle | Humanity’s Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. It consists of 3,000 multiple-choice and short-answer questions suitable for automated grading, spanning dozens of subjects including mathematics, the humanities, and the natural sciences, and is developed globally by subject-matter experts. | NGC | 26.01 | multiarch | hle, hle_aa_v2 |
| ifbench | IFBench is a new, challenging benchmark for precise instruction following. | NGC | 26.01 | multiarch | ifbench, ifbench_aa_v2 |
| livecodebench | Holistic and Contamination Free Evaluation of Large Language Models for Code. | NGC | 26.01 | multiarch | codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225, livecodebench_aa_v2, testoutputprediction |
| lm-evaluation-harness | This project provides a unified framework to test generative language models on a large number of different evaluation tasks. | NGC | 26.01 | multiarch | adlr_agieval_en_cot, adlr_arc_challenge_llama_25_shot, adlr_commonsense_qa_7_shot, adlr_global_mmlu_lite_5_shot, adlr_gpqa_diamond_cot_5_shot, adlr_gsm8k_cot_8_shot, adlr_humaneval_greedy, adlr_humaneval_sampled, adlr_math_500_4_shot_sampled, adlr_mbpp_sanitized_3_shot_greedy, adlr_mbpp_sanitized_3_shot_sampled, adlr_mgsm_native_cot_8_shot, adlr_minerva_math_nemo_4_shot, adlr_mmlu, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, adlr_winogrande_5_shot, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq_chat, bbq_completions, commonsense_qa, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str_chat, m_mmlu_id_str_completions, mbpp_plus_chat, mbpp_plus_completions, mgsm, mgsm_cot_chat, mgsm_cot_completions, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_instruct_completions, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox_chat, mmlu_prox_completions, mmlu_prox_de_chat, mmlu_prox_de_completions, mmlu_prox_es_chat, mmlu_prox_es_completions, mmlu_prox_fr_chat, mmlu_prox_fr_completions, mmlu_prox_it_chat, mmlu_prox_it_completions, mmlu_prox_ja_chat, mmlu_prox_ja_completions, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, wikitext, winogrande |
| mmath | MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency. | NGC | 26.01 | multiarch | mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh |
| mtbench | MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models. | NGC | 26.01 | multiarch | mtbench, mtbench-cor1 |
| mteb | The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark designed to evaluate the performance of text embedding models across a wide range of tasks and datasets. It includes 58 datasets covering 8 tasks and 112 languages. | NGC | 26.01 | multiarch | MMTEB, MTEB, MTEB_NL_RETRIEVAL, MTEB_VDR, RTEB, ViDoReV1, ViDoReV2, ViDoReV3, ViDoReV3_Text, ViDoReV3_Text_Image, custom_beir_task, fiqa, hotpotqa, miracl, miracl_lite, mldr, mlqa, nano_fiqa, nq |
| nemo_skills | NeMo Skills is a project for improving the skills of LLMs. | NGC | 26.01 | multiarch | ns_aa_lcr, ns_aime2024, ns_aime2025, ns_bfcl_v3, ns_bfcl_v4, ns_gpqa, ns_hle, ns_hle_aa, ns_hmmt_feb2025, ns_ifbench, ns_ifeval, ns_livecodebench, ns_livecodebench_aa, ns_livecodebench_v5, ns_mmlu, ns_mmlu_pro, ns_mmlu_prox, ns_ruler, ns_scicode, ns_wmt24pp |
| profbench | Professional domain benchmark for evaluating LLMs on Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks. | NGC | 26.01 | multiarch | llm_judge, report_generation |
| ruler | RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. | NGC | 26.01 | multiarch | ruler-128k-chat, ruler-128k-completions, ruler-16k-chat, ruler-16k-completions, ruler-1m-chat, ruler-1m-completions, ruler-256k-chat, ruler-256k-completions, ruler-32k-chat, ruler-32k-completions, ruler-4k-chat, ruler-4k-completions, ruler-512k-chat, ruler-512k-completions, ruler-64k-chat, ruler-64k-completions, ruler-8k-chat, ruler-8k-completions, ruler-chat, ruler-completions |
| safety_eval | A harness for safety evaluations. | NGC | 25.11 | multiarch | aegis_v2, aegis_v2_reasoning, wildguard |
| scicode | SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. | NGC | 26.01 | multiarch | scicode, scicode_aa_v2, scicode_background |
| simple_evals | simple-evals is a lightweight library for evaluating language models. | NGC | 26.01 | multiarch | AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, AIME_2025_aa_v2, aime_2024_nemo, aime_2025_nemo, browsecomp, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_aa_v3, gpqa_diamond_nemo, gpqa_extended, gpqa_main, healthbench, healthbench_consensus, healthbench_hard, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mgsm_aa_v2, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_aa_v2, mmlu_pro_aa_v3, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa |
| tau2_bench | Evaluating Conversational Agents in a Dual-Control Environment. | NGC | 26.01 | multiarch | tau2_bench_airline, tau2_bench_retail, tau2_bench_telecom |
| tooltalk | ToolTalk is designed to evaluate tool-augmented LLMs as chatbots. It contains a handcrafted dataset of 28 easy conversations and 50 hard conversations. | NGC | 26.01 | multiarch | tooltalk |
| vlmevalkit | VLMEvalKit is an open-source evaluation toolkit for large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation across multiple repositories. VLMEvalKit adopts generation-based evaluation for all LVLMs and reports results obtained with both exact matching and LLM-based answer extraction. | NGC | 26.01 | amd | ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocr_reasoning, ocrbench, slidevqa |
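
Document lengths quoted in the descriptions above are measured in cl100k_base tokens (for example, the 10k to 100k token range for AA-LCR). As an illustration only, not part of any harness in this table, the sketch below shows one way such counts could be reproduced with the `tiktoken` library; the `count_cl100k_tokens` helper and the `document.txt` path are hypothetical.

```python
import tiktoken


def count_cl100k_tokens(text: str) -> int:
    """Return the number of cl100k_base tokens in the given text (illustrative helper)."""
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


if __name__ == "__main__":
    # Hypothetical input: any local long-form document.
    with open("document.txt", encoding="utf-8") as f:
        print(count_cl100k_tokens(f.read()), "cl100k_base tokens")
```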