Container	Description	Container Ref	Arch	Tasks
AA-LCR	A challenging benchmark measuring language models’ ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).	`26.01`	`multiarch`	aa_lcr
bfcl	The Berkeley Function Calling Leaderboard V3 (also called Berkeley Tool Calling Leaderboard V3) evaluates the LLM’s ability to call functions (aka tools) accurately.	`26.01`	`multiarch`	bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting
bigcode-evaluation-harness	A framework for the evaluation of autoregressive code generation language models.	`26.01`	`multiarch`	humaneval, humaneval_instruct, humanevalplus, mbpp-chat, mbpp-completions, mbppplus-chat, mbppplus-completions, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts
codec	Contamination detection framework for evaluating language models	`26.01`	`amd`	aime_2024, aime_2025, bbq, bfcl_v3, frames, gpqa_diamond, gsm8k_test, gsm8k_train, hellaswag_test, hellaswag_train, hle, ifbench, ifeval, livecodebench_v1, livecodebench_v5, math_500_problem, math_500_solution, mmlu_pro_test, mmlu_test, openai_humaneval, reward_bench_v1, reward_bench_v2, scicode, swebench_test, swebench_train, taubench, terminalbench
garak	Garak is an LLM vulnerability scanner.	`26.01`	`multiarch`	garak, garak-completions
genai_perf_eval	genai_perf_eval is a tool to evaluate the performance of LLM endpoints, based on GenAI Perf.	`26.01`	`amd`	genai_perf_generation, genai_perf_generation_completions, genai_perf_summarization, genai_perf_summarization_completions
helm	A framework for evaluating large language models in medical applications across various healthcare tasks	`26.01`	`amd`	aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med
hle	Humanity’s Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity’s Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading.	`26.01`	`multiarch`	hle, hle_aa_v2
ifbench	IFBench is a new, challenging benchmark for precise instruction following.	`26.01`	`multiarch`	ifbench, ifbench_aa_v2
livecodebench	Holistic and Contamination Free Evaluation of Large Language Models for Code.	`26.01`	`multiarch`	codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225, livecodebench_aa_v2, testoutputprediction
lm-evaluation-harness	This project provides a unified framework to test generative language models on a large number of different evaluation tasks.	`26.01`	`multiarch`	adlr_agieval_en_cot, adlr_arc_challenge_llama_25_shot, adlr_commonsense_qa_7_shot, adlr_global_mmlu_lite_5_shot, adlr_gpqa_diamond_cot_5_shot, adlr_gsm8k_cot_8_shot, adlr_humaneval_greedy, adlr_humaneval_sampled, adlr_math_500_4_shot_sampled, adlr_mbpp_sanitized_3_shot_greedy, adlr_mbpp_sanitized_3_shot_sampled, adlr_mgsm_native_cot_8_shot, adlr_minerva_math_nemo_4_shot, adlr_mmlu, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, adlr_winogrande_5_shot, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq_chat, bbq_completions, commonsense_qa, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str_chat, m_mmlu_id_str_completions, mbpp_plus_chat, mbpp_plus_completions, mgsm, mgsm_cot_chat, mgsm_cot_completions, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_instruct_completions, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox_chat, mmlu_prox_completions, mmlu_prox_de_chat, mmlu_prox_de_completions, mmlu_prox_es_chat, mmlu_prox_es_completions, mmlu_prox_fr_chat, mmlu_prox_fr_completions, mmlu_prox_it_chat, mmlu_prox_it_completions, mmlu_prox_ja_chat, mmlu_prox_ja_completions, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, wikitext, winogrande
mmath	MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency.	`26.01`	`multiarch`	mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh
mtbench	MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models.	`26.01`	`multiarch`	mtbench, mtbench-cor1
mteb	The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark designed to evaluate the performance of text embedding models across a wide range of tasks and datasets. It includes 58 datasets covering 8 tasks and 112 languages.	`26.01`	`multiarch`	MMTEB, MTEB, MTEB_NL_RETRIEVAL, MTEB_VDR, RTEB, ViDoReV1, ViDoReV2, ViDoReV3, ViDoReV3_Text, ViDoReV3_Text_Image, custom_beir_task, fiqa, hotpotqa, miracl, miracl_lite, mldr, mlqa, nano_fiqa, nq, nvidia_digital_corpora_10k, nvidia_digital_corpora_10k_text, nvidia_earnings_v2, nvidia_earnings_v2_text, nvidia_vidore_v1, nvidia_vidore_v1_text, nvidia_vidore_v2, nvidia_vidore_v2_text, nvidia_vidore_v3, nvidia_vidore_v3_text, techqa
nemo_skills	NeMo Skills is a project to improve the skills of LLMs.	`26.01`	`multiarch`	ns_aa_lcr, ns_aime2024, ns_aime2025, ns_bfcl_v3, ns_bfcl_v4, ns_gpqa, ns_hle, ns_hle_aa, ns_hmmt_feb2025, ns_ifbench, ns_ifeval, ns_livecodebench, ns_livecodebench_aa, ns_livecodebench_v5, ns_mmlu, ns_mmlu_pro, ns_mmlu_prox, ns_ruler, ns_scicode, ns_wmt24pp
profbench	Professional domain benchmark for evaluating LLMs on Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks	`26.01`	`multiarch`	llm_judge, report_generation
ruler	RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity.	`26.01`	`multiarch`	ruler-128k-chat, ruler-128k-completions, ruler-16k-chat, ruler-16k-completions, ruler-1m-chat, ruler-1m-completions, ruler-256k-chat, ruler-256k-completions, ruler-32k-chat, ruler-32k-completions, ruler-4k-chat, ruler-4k-completions, ruler-512k-chat, ruler-512k-completions, ruler-64k-chat, ruler-64k-completions, ruler-8k-chat, ruler-8k-completions, ruler-chat, ruler-completions
safety_eval	Harness for safety evaluations.	`26.01`	`multiarch`	aegis_v2, aegis_v2_completions, aegis_v2_reasoning, compliance, wildguard, wildguard_completions
scicode	SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.	`26.01`	`multiarch`	scicode, scicode_aa_v2, scicode_background
simple_evals	simple-evals - a lightweight library for evaluating language models.	`26.01`	`multiarch`	AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, AIME_2025_aa_v2, aime_2024_nemo, aime_2025_nemo, browsecomp, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_aa_v3, gpqa_diamond_nemo, gpqa_extended, gpqa_main, healthbench, healthbench_consensus, healthbench_hard, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mgsm_aa_v2, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_aa_v2, mmlu_pro_aa_v3, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa
tau2_bench	Evaluating Conversational Agents in a Dual-Control Environment	`26.01`	`multiarch`	tau2_bench_airline, tau2_bench_retail, tau2_bench_telecom
tooltalk	ToolTalk is designed to evaluate tool-augmented LLMs as a chatbot. ToolTalk contains a handcrafted dataset of 28 easy conversations and 50 hard conversations.	`26.01`	`multiarch`	tooltalk
vlmevalkit	VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.	`26.01`	`amd`	ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocr_reasoning, ocrbench, slidevqa
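
Because several task names appear under more than one container (for example, `scicode` is listed under both `codec` and `scicode`), it can help to look up containers by task programmatically. The following is a minimal sketch, assuming the table is available as tab-separated text with the header row shown above. The names `TABLE_EXCERPT`, `parse_rows`, and `containers_for_task` are hypothetical helpers introduced for illustration and are not part of any tool described here. The excerpt holds only three rows copied from the table.

```python
import csv
import io

# Three rows copied from the table above (tab-separated). The full table
# would be parsed the same way; this excerpt just keeps the example short.
TABLE_EXCERPT = (
    "Container\tDescription\tContainer Ref\tArch\tTasks\n"
    "garak\tGarak is an LLM vulnerability scanner.\t`26.01`\t`multiarch`\t"
    "garak, garak-completions\n"
    "scicode\tSciCode is a challenging benchmark designed to evaluate the "
    "capabilities of LLMs in generating code for solving realistic scientific "
    "research problems.\t`26.01`\t`multiarch`\t"
    "scicode, scicode_aa_v2, scicode_background\n"
    "tau2_bench\tEvaluating Conversational Agents in a Dual-Control Environment\t"
    "`26.01`\t`multiarch`\t"
    "tau2_bench_airline, tau2_bench_retail, tau2_bench_telecom\n"
)


def parse_rows(table_text):
    """Parse the tab-separated table into dicts with a cleaned task list."""
    reader = csv.DictReader(
        io.StringIO(table_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        rows.append({
            "container": row["Container"],
            # Strip the backticks used for inline code in the table.
            "ref": row["Container Ref"].strip("`"),
            "arch": row["Arch"].strip("`"),
            # Tasks are comma-separated within the last column.
            "tasks": [t.strip() for t in row["Tasks"].split(",") if t.strip()],
        })
    return rows


def containers_for_task(table_text, task):
    """Return the names of every container whose task list includes `task`."""
    return [r["container"] for r in parse_rows(table_text) if task in r["tasks"]]


print(containers_for_task(TABLE_EXCERPT, "scicode_aa_v2"))  # ['scicode']
```

The lookup returns a list rather than a single name so that, when run over the full table, a task offered by more than one container yields all of them.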