| Harness | Description | Registry | Version | Architecture | Tasks |
|---|---|---|---|---|---|
| AA-LCR | A challenging benchmark measuring language models’ ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer; see the token-counting sketch after this table). | NGC | 26.01 | multiarch | aa_lcr |
| bfcl | The Berkeley Function Calling Leaderboard V3 (also called the Berkeley Tool Calling Leaderboard V3) evaluates an LLM’s ability to call functions (also known as tools) accurately. | NGC | 26.01 | multiarch | bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting |
| bigcode-evaluation-harness | A framework for the evaluation of autoregressive code generation language models. | NGC | 26.01 | multiarch | humaneval, humaneval_instruct, humanevalplus, mbpp-chat, mbpp-completions, mbppplus-chat, mbppplus-completions, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts |
| codec | Contamination detection framework for evaluating language models. | NGC | 26.01 | amd | aime_2024, aime_2025, bbq, bfcl_v3, frames, gpqa_diamond, gsm8k_test, gsm8k_train, hellaswag_test, hellaswag_train, hle, ifbench, ifeval, livecodebench_v1, livecodebench_v5, math_500_problem, math_500_solution, mmlu_pro_test, mmlu_test, openai_humaneval, reward_bench_v1, reward_bench_v2, scicode, swebench_test, swebench_train, taubench, terminalbench |
| garak | Garak is an LLM vulnerability scanner. | NGC | 26.01 | multiarch | garak, garak-completions |
| genai_perf_eval | A tool to evaluate the performance of LLM endpoints, based on GenAI-Perf. | NGC | 26.01 | amd | genai_perf_generation, genai_perf_generation_completions, genai_perf_summarization, genai_perf_summarization_completions |
| helm | A framework for evaluating large language models in medical applications across various healthcare tasks. | NGC | 26.01 | amd | aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med |
| hle | Humanity’s Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. It consists of 3,000 multiple-choice and short-answer questions suitable for automated grading, spanning dozens of subjects including mathematics, the humanities, and the natural sciences, and is developed globally by subject-matter experts. | NGC | 26.01 | multiarch | hle, hle_aa_v2 |
| ifbench | IFBench is a new, challenging benchmark for precise instruction following. | NGC | 26.01 | multiarch | ifbench, ifbench_aa_v2 |
| livecodebench | Holistic and Contamination Free Evaluation of Large Language Models for Code. | NGC | 26.01 | multiarch | codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225, livecodebench_aa_v2, testoutputprediction |
| lm-evaluation-harness | This project provides a unified framework to test generative language models on a large number of different evaluation tasks. | NGC | 26.01 | multiarch | adlr_agieval_en_cot, adlr_arc_challenge_llama_25_shot, adlr_commonsense_qa_7_shot, adlr_global_mmlu_lite_5_shot, adlr_gpqa_diamond_cot_5_shot, adlr_gsm8k_cot_8_shot, adlr_humaneval_greedy, adlr_humaneval_sampled, adlr_math_500_4_shot_sampled, adlr_mbpp_sanitized_3_shot_greedy, adlr_mbpp_sanitized_3_shot_sampled, adlr_mgsm_native_cot_8_shot, adlr_minerva_math_nemo_4_shot, adlr_mmlu, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, adlr_winogrande_5_shot, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq_chat, bbq_completions, commonsense_qa, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str_chat, m_mmlu_id_str_completions, mbpp_plus_chat, mbpp_plus_completions, mgsm, mgsm_cot_chat, mgsm_cot_completions, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_instruct_completions, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox_chat, mmlu_prox_completions, mmlu_prox_de_chat, mmlu_prox_de_completions, mmlu_prox_es_chat, mmlu_prox_es_completions, mmlu_prox_fr_chat, mmlu_prox_fr_completions, mmlu_prox_it_chat, mmlu_prox_it_completions, mmlu_prox_ja_chat, mmlu_prox_ja_completions, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, wikitext, winogrande |
| mmath | MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency. | NGC | 26.01 | multiarch | mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh |
| mtbench | MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models. | NGC | 26.01 | multiarch | mtbench, mtbench-cor1 |
| mteb | The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark designed to evaluate the performance of text embedding models across a wide range of tasks and datasets. It includes 58 datasets covering 8 tasks and 112 languages. | NGC | 26.01 | multiarch | MMTEB, MTEB, MTEB_NL_RETRIEVAL, MTEB_VDR, RTEB, ViDoReV1, ViDoReV2, ViDoReV3, ViDoReV3_Text, ViDoReV3_Text_Image, custom_beir_task, fiqa, hotpotqa, miracl, miracl_lite, mldr, mlqa, nano_fiqa, nq |
| nemo_skills | NeMo Skills is a project for improving the skills of LLMs. | NGC | 26.01 | multiarch | ns_aa_lcr, ns_aime2024, ns_aime2025, ns_bfcl_v3, ns_bfcl_v4, ns_gpqa, ns_hle, ns_hle_aa, ns_hmmt_feb2025, ns_ifbench, ns_ifeval, ns_livecodebench, ns_livecodebench_aa, ns_livecodebench_v5, ns_mmlu, ns_mmlu_pro, ns_mmlu_prox, ns_ruler, ns_scicode, ns_wmt24pp |
| profbench | Professional domain benchmark for evaluating LLMs on Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks. | NGC | 26.01 | multiarch | llm_judge, report_generation |
| ruler | RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. | NGC | 26.01 | multiarch | ruler-128k-chat, ruler-128k-completions, ruler-16k-chat, ruler-16k-completions, ruler-1m-chat, ruler-1m-completions, ruler-256k-chat, ruler-256k-completions, ruler-32k-chat, ruler-32k-completions, ruler-4k-chat, ruler-4k-completions, ruler-512k-chat, ruler-512k-completions, ruler-64k-chat, ruler-64k-completions, ruler-8k-chat, ruler-8k-completions, ruler-chat, ruler-completions |
| safety_eval | A harness for safety evaluations. | NGC | 25.11 | multiarch | aegis_v2, aegis_v2_reasoning, wildguard |
| scicode | SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. | NGC | 26.01 | multiarch | scicode, scicode_aa_v2, scicode_background |
| simple_evals | simple-evals is a lightweight library for evaluating language models. | NGC | 26.01 | multiarch | AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, AIME_2025_aa_v2, aime_2024_nemo, aime_2025_nemo, browsecomp, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_aa_v3, gpqa_diamond_nemo, gpqa_extended, gpqa_main, healthbench, healthbench_consensus, healthbench_hard, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mgsm_aa_v2, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_aa_v2, mmlu_pro_aa_v3, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa |
| tau2_bench | Evaluating Conversational Agents in a Dual-Control Environment. | NGC | 26.01 | multiarch | tau2_bench_airline, tau2_bench_retail, tau2_bench_telecom |
| tooltalk | ToolTalk is designed to evaluate tool-augmented LLMs as chatbots. It contains a handcrafted dataset of 28 easy conversations and 50 hard conversations. | NGC | 26.01 | multiarch | tooltalk |
| vlmevalkit | VLMEvalKit is an open-source evaluation toolkit for large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation across multiple repositories. VLMEvalKit adopts generation-based evaluation for all LVLMs and reports results obtained with both exact matching and LLM-based answer extraction. | NGC | 26.01 | amd | ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocr_reasoning, ocrbench, slidevqa |
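
Document lengths quoted in the descriptions above are measured in cl100k_base tokens (for example, the 10k to 100k token range for AA-LCR). As an illustration only, not part of any harness in this table, the sketch below shows one way such counts could be reproduced with the `tiktoken` library; the `count_cl100k_tokens` helper and the `document.txt` path are hypothetical.

```python
import tiktoken


def count_cl100k_tokens(text: str) -> int:
    """Return the number of cl100k_base tokens in the given text (illustrative helper)."""
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


if __name__ == "__main__":
    # Hypothetical input: any local long-form document.
    with open("document.txt", encoding="utf-8") as f:
        print(count_cl100k_tokens(f.read()), "cl100k_base tokens")
```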