(template-home)=
# NeMo Evaluator SDK Documentation
Welcome to the NeMo Evaluator SDK Documentation.
````{div} sd-d-flex-row
```{button-ref} get-started/install
:ref-type: doc
:color: primary
:class: sd-rounded-pill sd-mr-3
Install
```
```{button-ref} get-started/quickstart/launcher
:ref-type: doc
:color: secondary
:class: sd-rounded-pill sd-mr-3
Quickstart Evaluations
```
```{raw} html
Download Docs for LLM Context
```
````
---
## Introduction to NeMo Evaluator SDK
Discover how NeMo Evaluator SDK works and explore its key features.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`info;1.5em;sd-mr-1` About NeMo Evaluator SDK
:link: about/index
:link-type: doc
Explore the NeMo Evaluator Core and Launcher architecture
:::
:::{grid-item-card} {octicon}`star;1.5em;sd-mr-1` Key Features
:link: about/key-features
:link-type: doc
Discover NeMo Evaluator SDK's powerful capabilities.
:::
:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Concepts
:link: about/concepts/index
:link-type: doc
Master core concepts powering NeMo Evaluator SDK.
:::
:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Release Notes
:link: about/release-notes/index
:link-type: doc
Release notes for the NeMo Evaluator SDK.
:::
::::
## Choose a Quickstart
Select the evaluation approach that best fits your workflow and technical requirements.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher
:link: gs-quickstart-launcher
:link-type: ref
Use the CLI to orchestrate evaluations with automated container management.
+++
{bdg-secondary}`cli`
:::
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Core
:link: gs-quickstart-core
:link-type: ref
Get direct Python API access with full adapter features, custom configurations, and workflow integration capabilities.
+++
{bdg-secondary}`api`
:::
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Container
:link: gs-quickstart-container
:link-type: ref
Gain full control over the container environment with volume mounting, environment variable management, and integration into Docker-based CI/CD pipelines.
+++
{bdg-secondary}`Docker`
:::
::::
## Libraries
### Launcher
Orchestrate evaluations across different execution backends with unified CLI and programmatic interfaces.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration
:link: libraries/nemo-evaluator-launcher/configuration/index
:link-type: doc
Complete configuration schema, examples, and advanced patterns for all use cases.
+++
{bdg-secondary}`Setup`
:::
:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Executors
:link: libraries/nemo-evaluator-launcher/configuration/executors/index
:link-type: doc
Run evaluations on local machines, HPC clusters (Slurm), or cloud platforms (Lepton AI).
+++
{bdg-secondary}`Execution`
:::
:::{grid-item-card} {octicon}`upload;1.5em;sd-mr-1` Exporters
:link: libraries/nemo-evaluator-launcher/exporters/index
:link-type: doc
Export results to MLflow, Weights & Biases, Google Sheets, or local files with one command.
+++
{bdg-secondary}`Export`
:::
::::
### Core
Access the core evaluation engine directly with containerized benchmarks and flexible adapter architecture.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Workflows
:link: libraries/nemo-evaluator/workflows/index
:link-type: doc
Use the evaluation engine through Python API, containers, or programmatic workflows.
+++
{bdg-secondary}`Integration`
:::
:::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Containers
:link: libraries/nemo-evaluator/containers/index
:link-type: doc
Ready-to-use evaluation containers with curated benchmarks and frameworks.
+++
{bdg-secondary}`Containers`
:::
:::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` Interceptors
:link: libraries/nemo-evaluator/interceptors/index
:link-type: doc
Configure request/response interceptors for logging, caching, and custom processing.
+++
{bdg-secondary}`Customization`
:::
:::{grid-item-card} {octicon}`log;1.5em;sd-mr-1` Logging
:link: libraries/nemo-evaluator/logging
:link-type: doc
Comprehensive logging setup for evaluation runs, debugging, and audit trails.
+++
{bdg-secondary}`Monitoring`
:::
:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Extending
:link: libraries/nemo-evaluator/extending/index
:link-type: doc
Add custom benchmarks and frameworks by defining configuration and interfaces.
+++
{bdg-secondary}`Extension`
:::
:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` API Reference
:link: libraries/nemo-evaluator/api
:link-type: doc
Python API documentation for programmatic evaluation control and integration.
+++
{bdg-secondary}`API`
:::
::::
:::{toctree}
:hidden:
Home
:::
:::{toctree}
:caption: About
:hidden:
Overview
Key Features
Concepts
Release Notes
:::
:::{toctree}
:caption: Get Started
:hidden:
Getting Started
Install SDK
Quickstart
:::
:::{toctree}
:caption: Tutorials
:hidden:
About Tutorials
How-To Guides
Tutorials for NeMo Framework
Evaluate an Existing Endpoint
:::
:::{toctree}
:caption: Evaluation
:hidden:
About Model Evaluation
Benchmark Catalog
Tasks Not Explicitly Defined by FDF
Evaluation Techniques
Add Evaluation Packages to NeMo Framework
:::
:::{toctree}
:caption: Model Deployment
:hidden:
About Model Deployment
Bring-Your-Own-Endpoint
Use NeMo Framework
:::
:::{toctree}
:caption: Libraries
:hidden:
About NeMo Evaluator Libraries
Launcher
Core
:::
:::{toctree}
:caption: References
:hidden:
About References
FAQ
NeMo Evaluator Core Python API
NeMo Evaluator Launcher Python API
nemo-evaluator CLI
nemo-evaluator-launcher CLI
:::
(about-overview)=
# About NeMo Evaluator SDK
NeMo Evaluator SDK is NVIDIA's comprehensive platform for AI model evaluation and benchmarking. It consists of two core libraries that work together to enable consistent, scalable, and reproducible evaluation of large language models across diverse capabilities including reasoning, code generation, function calling, and safety.
## System Architecture
NeMo Evaluator SDK consists of two main libraries:
```{list-table} NeMo Evaluator SDK Components
:header-rows: 1
:widths: 30 70
* - Component
- Key Capabilities
* - **nemo-evaluator**
(*Core Evaluation Engine*)
- • {ref}`interceptors-concepts` for request and response processing
• Standardized evaluation workflows and containerized frameworks
• Deterministic configuration and reproducible results
• Consistent result schemas and artifact layouts
* - **nemo-evaluator-launcher**
(*Orchestration Layer*)
- • Unified CLI and programmatic entry points
• Multi-backend execution (local, Slurm, cloud)
• Job monitoring and lifecycle management
• Result export to multiple destinations (MLflow, W&B, Google Sheets)
```
## Target Users
```{list-table} Target User Personas
:header-rows: 1
:widths: 30 70
* - User Type
- Key Benefits
* - **Researchers**
- Access 100+ benchmarks across multiple evaluation harnesses with containerized reproducibility. Run evaluations locally or on HPC clusters.
* - **ML Engineers**
- Integrate evaluations into ML pipelines with programmatic APIs. Deploy models and run evaluations across multiple backends.
* - **Organizations**
- Scale evaluation across teams with unified CLI, multi-backend execution, and result tracking. Export results to MLflow, Weights & Biases, or Google Sheets.
* - **AI Safety Teams**
- Conduct safety assessments using specialized containers for security testing and bias evaluation with detailed logging.
* - **Model Developers**
- Evaluate custom models against standard benchmarks using OpenAI-compatible APIs.
```
---
orphan: true
---
(adapters-concepts)=
# Adapters
Adapters in NeMo Evaluator provide sophisticated request and response processing through a configurable interceptor pipeline. They enable advanced evaluation capabilities like caching, logging, reasoning extraction, and custom prompt injection.
## Architecture Overview
The adapter system transforms simple API calls into sophisticated evaluation workflows through a two-phase pipeline:
1. **Request Processing**: Interceptors modify outgoing requests (system prompts, parameters) before they reach the endpoint
2. **Response Processing**: Interceptors extract reasoning, log data, cache results, and track statistics after receiving responses
The endpoint interceptor bridges these phases by handling HTTP communication with the model API.
## Core Components
- **AdapterConfig**: Configuration class for all interceptor settings
- **Interceptor Pipeline**: Modular components for request/response processing
- **Endpoint Management**: HTTP communication with error handling and retries
- **Resource Management**: Caching, logging, and progress tracking
## Available Interceptors
The adapter system includes several built-in interceptors:
- **System Message**: Inject custom system prompts
- **Payload Modifier**: Transform request parameters
- **Request/Response Logging**: Capture detailed interaction data
- **Caching**: Store and retrieve responses for efficiency
- **Reasoning**: Extract chain-of-thought reasoning
- **Response Stats**: Collect aggregated statistics from API responses
- **Progress Tracking**: Monitor evaluation progress
- **Endpoint**: Handle HTTP communication with the model API
- **Raise Client Errors**: Handle and raise exceptions for client errors
## Integration
The adapter system integrates seamlessly with:
- **Evaluation Frameworks**: Works with any OpenAI-compatible API
- **NeMo Evaluator Core**: Direct integration via `AdapterConfig`
- **NeMo Evaluator Launcher**: YAML configuration support
## Configuration
### Modern Interceptor-Based Configuration
The recommended approach uses the interceptor-based API:
:::{code-block} python
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            enabled=True,
            config={"system_message": "You are a helpful assistant."}
        ),
        InterceptorConfig(name="request_logging", enabled=True),
        InterceptorConfig(
            name="caching",
            enabled=True,
            config={"cache_dir": "./cache", "reuse_cached_responses": True}
        ),
        InterceptorConfig(name="reasoning", enabled=True),
        InterceptorConfig(name="endpoint")
    ]
)
:::
For detailed usage and configuration examples, see {ref}`interceptors-concepts`.
# Architecture Overview
NeMo Evaluator provides a **two-tier architecture** for comprehensive model evaluation:
```{mermaid}
graph TB
subgraph Tier2[" Orchestration Layer"]
Launcher["nemo-evaluator-launcher
• CLI orchestration
• Multi-backend execution (local, Slurm, Lepton)
• Deployment management (vLLM, NIM, SGLang)
• Result export (MLflow, W&B, Google Sheets)"]
end
subgraph Tier1[" Evaluation Engine"]
Evaluator["nemo-evaluator
• Adapter system
• Interceptor pipeline
• Containerized evaluation execution
• Result aggregation"]
end
subgraph External["NVIDIA Eval Factory Containers"]
Containers["Evaluation Frameworks
• nvidia-lm-eval (lm-evaluation-harness)
• nvidia-simple-evals
• nvidia-bfcl, nvidia-bigcode-eval
• nvidia-eval-factory-garak
• nvidia-safety-harness"]
end
Launcher --> Evaluator
Evaluator --> Containers
style Tier2 fill:#e1f5fe
style Tier1 fill:#f3e5f5
style External fill:#fff3e0
```
## Component Overview
### **Orchestration Layer** (`nemo-evaluator-launcher`)
High-level orchestration for complete evaluation workflows.
**Key Features:**
- CLI and YAML configuration management
- Multi-backend execution (local, Slurm, Lepton)
- Deployment management (vLLM, NIM, SGLang, or bring-your-own-endpoint)
- Result export to MLflow, Weights & Biases, and Google Sheets
- Job monitoring and lifecycle management
**Use Cases:**
- Automated evaluation pipelines
- HPC cluster evaluations with Slurm
- Cloud deployments with Lepton AI
- Multi-model comparative studies
### **Evaluation Engine** (`nemo-evaluator`)
Core evaluation capabilities with request/response processing.
**Key Features:**
- **Adapter System**: Request/response processing layer for API endpoints
- **Interceptor Pipeline**: Modular components for logging, caching, and reasoning
- **Containerized Execution**: Evaluation harnesses run in Docker containers
- **Result Aggregation**: Standardized result schemas and metrics
**Use Cases:**
- Programmatic evaluation integration
- Request/response transformation and logging
- Custom interceptor development
- Direct Python API usage
## Interceptor Pipeline
The evaluation engine provides an interceptor system for request/response processing. Interceptors are configurable components that process API requests and responses in a pipeline.
```{mermaid}
graph LR
A[Request] --> B[System Message]
B --> C[Payload Modifier]
C --> D[Request Logging]
D --> E[Caching]
E --> F[API Endpoint]
F --> G[Response Logging]
G --> H[Reasoning]
H --> I[Response Stats]
I --> J[Response]
style E fill:#e1f5fe
style F fill:#f3e5f5
```
**Available Interceptors:**
- **System Message**: Inject system prompts into chat requests
- **Payload Modifier**: Transform request parameters
- **Request/Response Logging**: Log requests and responses to files
- **Caching**: Cache responses to avoid redundant API calls
- **Reasoning**: Extract chain-of-thought from responses
- **Response Stats**: Track token usage and latency metrics
- **Progress Tracking**: Monitor evaluation progress
## Integration Patterns
### **Pattern 1: Launcher with Deployment**
Use the launcher to handle both model deployment and evaluation:
```bash
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_vllm_logprobs.yaml \
-o deployment=vllm \
-o ++deployment.hf_handle=meta-llama/Llama-3.1-8B \
-o ++deployment.served_model_name=meta-llama/Llama-3.1-8B
```
### **Pattern 2: Launcher with Existing Endpoint**
Point the launcher to an existing API endpoint:
```bash
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o target.api_endpoint.url=http://localhost:8080/v1/completions \
-o target.api_endpoint.api_key_name=null \
-o deployment=none
```
### **Pattern 3: Python API**
Use the Python API for programmatic integration:
```python
from nemo_evaluator import evaluate, EvaluationConfig, EvaluationTarget, ApiEndpoint, EndpointType
# Configure target endpoint
api_endpoint = ApiEndpoint(
url="http://localhost:8080/v1/completions",
type=EndpointType.COMPLETIONS
)
target = EvaluationTarget(api_endpoint=api_endpoint)
# Configure evaluation
eval_config = EvaluationConfig(
type="mmlu_pro",
output_dir="./results"
)
# Run evaluation
results = evaluate(eval_cfg=eval_config, target_cfg=target)
```
(evaluation-model)=
# Evaluation Model
NeMo Evaluator provides evaluation approaches and endpoint compatibility for comprehensive AI model assessment.
## Evaluation Approaches
NeMo Evaluator supports several evaluation approaches through containerized harnesses:
- **Text Generation**: Models generate responses to prompts, assessed for correctness or quality against reference answers or rubrics.
- **Log Probability**: Models assign probabilities to token sequences, enabling confidence measurement without text generation. Effective for choice-based tasks and base model evaluation.
- **Code Generation**: Models generate code from natural language descriptions, evaluated for correctness through test execution.
- **Function Calling**: Models generate structured outputs for tool use and API interaction scenarios.
- **Retrieval Augmented Generation**: Models retrieve content based on context and are evaluated for content relevance and coverage, as well as answer correctness.
- **Visual Understanding**: Models generate responses to prompts with images and videos, assessed for correctness or quality against reference answers or rubrics.
- **Agentic Workflows**: Models are tasked with complex problems and need to select and engage tools autonomously.
- **Safety & Security**: Evaluation against adversarial prompts and safety benchmarks to test model alignment and robustness.
One or more evaluation harnesses implement each approach. To discover available tasks for each approach, use `nemo-evaluator-launcher ls tasks`.
## Endpoint Compatibility
NeMo Evaluator targets OpenAI-compatible API endpoints. The platform supports the following endpoint types:
- **`completions`**: Direct text completion without chat formatting (`/v1/completions`). Used for base models and academic benchmarks.
- **`chat`**: Conversational interface with role-based messages (`/v1/chat/completions`). Used for instruction-tuned and chat models.
- **`vlm`**: Vision-language model endpoints supporting image inputs.
- **`embedding`**: Embedding generation endpoints for retrieval evaluation.
Each evaluation task specifies which endpoint types it supports. Verify compatibility using `nemo-evaluator-launcher ls tasks`.
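The difference between the two most common endpoint types comes down to the shape of the request payload. The sketch below is illustrative only; the model name, prompts, and `max_tokens` value are placeholders, not values prescribed by NeMo Evaluator.

```python
# Illustrative payload shapes for OpenAI-compatible endpoints.
# Model name and prompts are placeholders.

def completions_payload(model: str, prompt: str) -> dict:
    """Payload for /v1/completions (base models, academic benchmarks)."""
    return {"model": model, "prompt": prompt, "max_tokens": 64}

def chat_payload(model: str, user_message: str) -> dict:
    """Payload for /v1/chat/completions (instruction-tuned and chat models)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 64,
    }

print(completions_payload("my-model", "2 + 2 ="))
print(chat_payload("my-model", "What is 2 + 2?"))
```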
## Metrics
Individual evaluation harnesses define metrics that vary by task. Common metric types include:
- **Accuracy metrics**: Exact match, normalized accuracy, F1 scores
- **Generative metrics**: BLEU, ROUGE, code execution pass rates
- **Probability metrics**: Perplexity, log-likelihood scores
- **Safety metrics**: Refusal rates, toxicity scores, vulnerability detection
The platform returns results in a standardized schema regardless of the source harness. To see metrics for a specific task, refer to {ref}`eval-benchmarks` or run an evaluation and inspect the results.
For hands-on guides, refer to {ref}`eval-run`.
(evaluation-output)=
# Evaluation Output
This page describes the structure and content of evaluation output files generated by NVIDIA NeMo Evaluator. The evaluation output provides comprehensive information about the evaluation run, including configuration details, results, and metadata.
## Input Configuration
The input configuration comes from the command described in the [Launcher Quickstart Guide](../../get-started/quickstart/launcher.md#quick-start):
```{literalinclude} ../../get-started/_snippets/launcher_full_example.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
:::{note}
For local execution, all artifacts are already present on your machine.
When working with remote executors such as `slurm`, you can download the artifacts with the following command:
```bash
nemo-evaluator-launcher info --copy-artifacts
```
:::
For reference, here is the launcher config used in the command:
```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_basic.yaml
:language: yaml
:start-after: "[docs-start-snippet]"
```
## Output Structure
After running an evaluation, NeMo Evaluator creates a structured output directory containing various artifacts.
If you run the command provided above, it will create the following directory structure inside `execution.output_dir` (`./results` in our case):
```bash
./results/
├── -
│ ├── gpqa_diamond
│ │ ├── artifacts
│ │ ├── logs
│ │ └── run.sh
│ ├── ifeval
│ │ ├── artifacts
│ │ ├── logs
│ │ └── run.sh
│ ├── mbpp
│ │ ├── artifacts
│ │ ├── logs
│ │ └── run.sh
│ └── run_all.sequential.sh
```
Each `artifacts` directory contains output produced by the evaluation job.
Such a directory is also created if you use `nemo-evaluator` or direct container access (see {ref}`gs-quickstart` to compare the different ways of using NeMo Evaluator SDK).
Regardless of the chosen path, the generated artifacts directory has the following content:
```text
/
│ ├── run_config.yml # Task-specific configuration used during execution
│ ├── eval_factory_metrics.json # Evaluation metrics and performance statistics
│ ├── results.yml # Detailed results in YAML format
│ ├── report.html # Human-readable HTML report
│ ├── report.json # JSON format report
│ └── / # Task-specific artifacts
```
These files are standardized and always follow the same structure regardless of the underlying evaluation harness:
```{list-table}
:header-rows: 1
:widths: 20 30 30 50
* - File Name
- Description
- Content
- Usage
* - `results.yml`
- Evaluation results in YAML format
- * Final evaluation scores and metrics
* Evaluation configuration used
- The main results file for programmatic analysis and integration with downstream tools.
* - `run_config.yml`
- Complete evaluation configuration (all parameters, overrides, and settings) used for the run.
- * Task and model settings
* Endpoint configuration
* Interceptor config
* Evaluation-specific overrides
- Enables full reproducibility of evaluations and configuration auditing.
* - `eval_factory_metrics.json`
- Detailed metrics and performance statistics for the evaluation execution.
- * Request/response timings
* Token usage
* Error rates
* Resource utilization
- Performance analysis and failure pattern identification.
* - `report.html` and `report.json`
- Example request-response pairs collected during benchmark execution.
- * Human-readable HTML report
* Machine-readable JSON version with the same content
- For sharing, quick review, analysis, and debugging.
* - Task-specific artifacts
- Artifacts produced by the underlying benchmark (e.g., caches, raw outputs, error logs).
- * Cached queries & responses
* Source/context data
* Special task outputs or logs
- Advanced troubleshooting, debugging, or domain-specific analysis.
```
### Results file
The primary evaluation output is stored in `results.yml`.
It is standardized across all evaluation benchmarks and follows the [API dataclasses](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/api/api_dataclasses.py) specification.
Below is the output of the command from the [Launcher Quickstart Section](../../get-started/quickstart/launcher.md#quick-start) for the `GPQA-Diamond` task.
```{literalinclude} ./_snippets/results.yaml
:language: yaml
```
:::{note}
It is instructive to compare the configuration file cited above with the resulting one.
:::
The evaluation output contains the following general sections:
| Section | Description |
|---------|-------------|
| `command` | The exact command executed to run the evaluation |
| `config` | Evaluation configuration including parameters and settings |
| `results` | Evaluation metrics and scores organized by groups and tasks |
| `target` | Model and API endpoint configuration details |
| `git_hash` | Git commit hash (if available) |
The evaluation metrics are available under the `results` key and are stored in the following structure:
```yaml
metrics:
metric_name:
scores:
score_name:
stats: # optional set of statistics, e.g.:
count: 10 # number of values used for computing the score
min: 0 # minimum of all values used for computing the score
max: 1 # maximum of all values used for computing the score
stderr: 0.13 # standard error
value: 0.42 # score value
```
In the example output above, the metric is a micro-average across the samples (hence the `micro` key in the structure), and the standard deviation (`stddev`) and standard error (`stderr`) statistics are reported.
The types of metrics available in the results differ across evaluation harnesses and tasks, but they are always presented using the same structure as shown above.
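Because the schema is uniform, scores can be extracted generically. The sketch below assumes the results file has already been loaded into a Python dict (e.g. with `yaml.safe_load`); the top-level `tasks` layout, the task name, and the metric values are illustrative, not taken from a real run.

```python
# Flatten the standardized metrics structure into rows.
# The input dict mimics a loaded results file; all names are illustrative.
results = {
    "tasks": {
        "example_task": {
            "metrics": {
                "accuracy": {
                    "scores": {
                        "micro": {"value": 0.42, "stats": {"stderr": 0.13}},
                    }
                }
            }
        }
    }
}

def iter_scores(results: dict):
    """Yield (task, metric, score, value) tuples from the uniform schema."""
    for task_name, task in results["tasks"].items():
        for metric_name, metric in task["metrics"].items():
            for score_name, score in metric["scores"].items():
                yield task_name, metric_name, score_name, score["value"]

for row in iter_scores(results):
    print(row)
```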
## Exporting the Results
Once the evaluation has finished and the `results.yml` file has been produced, the scores can be exported.
This example shows how the local export works. For information on other exporters, see {ref}`exporters-overview`.
The results can be exported using the following command:
```bash
nemo-evaluator-launcher export --dest local --format json
```
This command extracts the scores from `results.yml` and creates a `processed_results.json` file with the following content:
```{literalinclude} ./_snippets/processed_results.json
:language: json
```
The `nemo-evaluator-launcher export` command can accept multiple invocation IDs and gather results across different invocations, regardless of whether they were run locally or with remote executors (see {ref}`executors-overview`), e.g.:
```bash
nemo-evaluator-launcher export --dest local --format json --output_dir combined-results
```
will create `combined-results/processed_results.json` with the same structure as in the example above.
(execution-backend)=
# Execution Backend
NeMo Evaluator can run in various environments: locally, on a cluster, or within NVIDIA Lepton. We refer to these environments as **execution backends**, and NeMo Evaluator Launcher provides corresponding **executors** that take care of evaluation orchestration in each backend. Each executor uses NVIDIA-built Docker containers with pre-installed harnesses; the Launcher automatically selects and runs the right container for you.
## Executors
Each environment requires a slightly different setup, including, among other things, how jobs are submitted, launched, and their results collected. Refer to the list below for the executors available today.
- **Local executor**: orchestrates evaluations on a local machine using the Docker daemon.
- **Slurm executor**: orchestrates evaluations through a Slurm workload manager.
- **Lepton executor**: submits jobs via [Lepton AI](https://www.nvidia.com/en-us/data-center/dgx-cloud-lepton/).
:::{note}
The Slurm executor requires a Slurm cluster with the pyxis SPANK plugin installed. Pyxis allows unprivileged cluster users to run containerized tasks through the `srun` command. Visit the [pyxis GitHub homepage](https://github.com/NVIDIA/pyxis) to learn more.
:::
Each executor may require additional configuration. For example, on the Slurm execution backend, NeMo Evaluator Launcher needs information such as the partition, account, and selected nodes.
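As a purely illustrative sketch, such Slurm-specific settings might appear in a launcher configuration along these lines (the field names below are hypothetical; consult the executor documentation for the actual schema):

```yaml
# Hypothetical sketch of Slurm-specific executor settings.
execution:
  type: slurm
  partition: gpu
  account: my-account
  nodes: 1
```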
(fdf-concept)=
# Framework Definition Files
::::{note}
**Who needs this?** This documentation is for framework developers and organizations creating custom evaluation frameworks. If you're running existing evaluation tasks using {ref}`nemo-evaluator-launcher ` (NeMo Evaluator Launcher CLI) or {ref}`nemo-evaluator ` (NeMo Evaluator CLI), you don't need to create FDFs—they're already provided by framework packages.
::::
A Framework Definition File (FDF) is a YAML configuration file that serves as the single source of truth for integrating evaluation frameworks into the NeMo Evaluator ecosystem. FDFs define how evaluation frameworks are configured, executed, and integrated with the Eval Factory system.
## What an FDF Defines
An FDF specifies five key aspects of an evaluation framework:
- **Framework metadata**: Name, description, package information, and repository URL
- **Default configurations**: Parameters, commands, and settings that apply across all evaluations within that framework
- **Evaluation types**: Available evaluation tasks and their specific configurations
- **Execution commands**: Jinja2-templated commands for running evaluations with dynamic parameter injection
- **API compatibility**: Supported endpoint types (chat, completions, vlm, embedding) and their configurations
## How FDFs Integrate with NeMo Evaluator
FDFs sit at the integration point between your evaluation framework's CLI and NeMo Evaluator's orchestration system:
```{mermaid}
graph LR
A[User runs
nemo-evaluator] --> B[System loads
framework.yml]
B --> C[Merges defaults +
user evaluation config]
C --> D[Renders Jinja2
command template]
D --> E[Executes your
CLI command]
E --> F[Parses output]
style B fill:#e1f5fe
style D fill:#fff3e0
style E fill:#f3e5f5
```
**The workflow:**
1. When you run `nemo-evaluator` (see {ref}`nemo-evaluator-cli`), the system discovers and loads your FDF (`framework.yml`)
2. Configuration values are merged from framework defaults, evaluation-specific settings, and user overrides (see {ref}`parameter-overrides`)
3. The system renders the Jinja2 command template with the merged configuration
4. Your framework's CLI is executed with the generated command
5. Results are parsed and processed by the system
This architecture allows you to integrate any evaluation framework that exposes a CLI interface, without modifying NeMo Evaluator's core code.
## Key Concepts
### Jinja2 Templating
FDFs use Jinja2 template syntax to inject configuration values dynamically into command strings. Variables are referenced using `{{variable}}` syntax:
```yaml
command: >-
my-eval-cli --model {{target.api_endpoint.model_id}}
--task {{config.params.task}}
--output {{config.output_dir}}
```
At runtime, these variables are replaced with actual values from the configuration.
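NeMo Evaluator performs this rendering with Jinja2 itself. The stdlib-only sketch below merely illustrates the substitution step, resolving dotted `{{variable}}` references against a nested configuration dict; the configuration values are made up.

```python
import re

# Illustrative only: resolve dotted {{variable}} references against a
# nested config dict, mimicking what Jinja2 does at render time.
config = {
    "target": {"api_endpoint": {"model_id": "my-model"}},
    "config": {"params": {"task": "task_1"}, "output_dir": "./results"},
}

def render(template: str, ctx: dict) -> str:
    def lookup(match: re.Match) -> str:
        value = ctx
        for part in match.group(1).split("."):
            value = value[part]  # KeyError here plays the role of StrictUndefined
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

command = "my-eval-cli --model {{target.api_endpoint.model_id}} --task {{config.params.task}}"
print(render(command, config))
# my-eval-cli --model my-model --task task_1
```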
### Parameter Inheritance
Configuration values cascade through multiple layers, with later layers overriding earlier ones:
1. **Framework defaults**: Base configuration in the FDF's `defaults` section
2. **Evaluation defaults**: Task-specific overrides in the `evaluations` section
3. **User configuration**: Values from run configuration files
4. **CLI overrides**: Command-line arguments passed at runtime
This inheritance model allows you to define sensible defaults while giving users full control over specific runs. For detailed examples and patterns, see {ref}`advanced-features`.
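The cascade can be pictured as successive deep merges in which later layers win. A minimal sketch with made-up values, not the launcher's actual merge implementation:

```python
# Deep-merge configuration layers; later layers override earlier ones.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative layers, from framework defaults to user overrides.
framework_defaults = {"params": {"temperature": 0.0, "limit": None}}
evaluation_defaults = {"params": {"task": "task_1"}}
user_config = {"params": {"temperature": 0.7}}

merged = framework_defaults
for layer in (evaluation_defaults, user_config):
    merged = deep_merge(merged, layer)
print(merged)
# {'params': {'temperature': 0.7, 'limit': None, 'task': 'task_1'}}
```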
### Endpoint Types
Evaluations declare which API endpoint types they support (see {ref}`evaluation-model` for details). NeMo Evaluator uses adapters to translate between different API formats:
- **`chat`**: OpenAI-compatible chat completions (messages with roles)
- **`completions`**: Text completion endpoints (prompt in, text out)
- **`vlm`**: Vision-language models (text + image inputs)
- **`embedding`**: Embedding generation endpoints
Your FDF specifies which types each evaluation supports, and the system validates compatibility at runtime.
### Validation
FDFs are validated when loaded to catch configuration errors early:
- **Schema validation**: Pydantic models ensure required fields exist and have correct types
- **Template validation**: Jinja2 templates are parsed with `StrictUndefined` to catch undefined variables
- **Reference validation**: Template variables must reference valid fields in the configuration model
- **Consistency validation**: Endpoint types and parameters are checked for consistency
Validation failures produce clear error messages that help you fix configuration issues before runtime. For common validation errors and solutions, see {ref}`fdf-troubleshooting`.
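The reference-validation idea in particular can be sketched with the stdlib: collect every `{{…}}` variable in a command template and reject names outside an allowed set. The allowed names and templates below are illustrative; the real system validates against its Pydantic configuration model.

```python
import re

# Illustrative sketch of template-reference validation: every variable in a
# command template must name a known configuration field.
ALLOWED = {"target.api_endpoint.model_id", "config.params.task", "config.output_dir"}

def undefined_variables(template: str) -> set:
    """Return template variables that do not map to a known field."""
    referenced = set(re.findall(r"\{\{\s*([\w.]+)\s*\}\}", template))
    return referenced - ALLOWED

good = "my-eval-cli --task {{config.params.task}}"
bad = "my-eval-cli --task {{config.params.tsak}}"  # typo caught before runtime
print(undefined_variables(good))  # set()
print(undefined_variables(bad))   # {'config.params.tsak'}
```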
## File Structure
An FDF follows a three-section hierarchical structure:
```yaml
framework: # Framework identification and metadata
name: my-eval
pkg_name: my_eval
full_name: My Evaluation Framework
description: Evaluates specific capabilities
url: https://github.com/example/my-eval
defaults: # Default configurations and commands
command: >-
my-eval-cli --model {{target.api_endpoint.model_id}}
config:
params:
temperature: 0.0
target:
api_endpoint:
type: chat
evaluations: # Available evaluation types
- name: task_1
description: First task
defaults:
config:
params:
task: task_1
```
## Next Steps
Ready to create your own FDF? Refer to {ref}`framework-definition-file` for detailed reference documentation and practical guidance on building Framework Definition Files.
(about-concepts)=
# Concepts
Use this section to understand how {{ product_name_short }} works at a high level. Start with the evaluation model, then read about adapters and deployment choices.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} Evaluation Model
:link: evaluation-model
:link-type: ref
Core evaluation types, OpenAI-compatible endpoints, and metrics.
:::
:::{grid-item-card} Execution Backend
:link: execution-backend
:link-type: ref
Your runtime execution environment.
:::
:::{grid-item-card} Evaluation Output
:link: evaluation-output
:link-type: doc
Standardized output structure across all harnesses and tasks is what makes Evaluator powerful.
:::
:::{grid-item-card} Framework Definition Files
:link: fdf-concept
:link-type: ref
YAML configuration files that integrate evaluation frameworks into NeMo Evaluator.
:::
:::{grid-item-card} Interceptors
:link: interceptors
:link-type: doc
Advanced request/response processing with configurable interceptor pipelines.
:::
::::
```{toctree}
:hidden:
Architecture
Evaluation Model
Evaluation Output
Execution Backend
Framework Definition Files
Interceptors
```
(interceptors-concepts)=
# Interceptors
The **interceptor system** is a core architectural concept in NeMo Evaluator that enables sophisticated request and response processing during model evaluation. The key takeaway is that interceptors provide functionality that can be plugged into many evaluation harnesses without modifying their underlying code.
If you configure at least one interceptor in your evaluation pipeline, a lightweight middleware server is started alongside the evaluation runtime. This server processes each API call through a two-phase pipeline:
1. **Request Processing**: Interceptors modify outgoing requests (system prompts, parameters) before they reach the endpoint
2. **Response Processing**: Interceptors extract reasoning, log data, cache results, and track statistics after receiving responses
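The two phases can be pictured as a chain of objects, each exposing a request hook and a response hook. The sketch below is conceptual only; the class and method names are invented for illustration and are not the SDK's actual interfaces:

```python
# Conceptual two-phase interceptor chain; names are illustrative, not the SDK API.
class SystemMessageInterceptor:
    def __init__(self, message: str):
        self.message = message

    def process_request(self, request: dict) -> dict:
        # Phase 1: prepend a system prompt before the request reaches the endpoint
        request["messages"] = [{"role": "system", "content": self.message}] + request["messages"]
        return request

    def process_response(self, response: dict) -> dict:
        return response  # no response-side behavior

class StatsInterceptor:
    def __init__(self):
        self.requests_seen = 0

    def process_request(self, request: dict) -> dict:
        self.requests_seen += 1  # track how many requests pass through
        return request

    def process_response(self, response: dict) -> dict:
        return response

def run_pipeline(interceptors, request, call_endpoint):
    for i in interceptors:            # phase 1: request processing
        request = i.process_request(request)
    response = call_endpoint(request) # the actual endpoint call
    for i in reversed(interceptors):  # phase 2: response processing
        response = i.process_response(response)
    return response

chain = [SystemMessageInterceptor("Think step by step."), StatsInterceptor()]
result = run_pipeline(
    chain,
    {"messages": [{"role": "user", "content": "Hello"}]},
    call_endpoint=lambda req: {"num_messages": len(req["messages"])},
)
print(result)  # → {'num_messages': 2}
```

Each interceptor sees the request on the way out and the response on the way back, which is why a single component can both inject a prompt and collect statistics.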
:::{note}
You might notice in evaluation logs that the evaluation harness sends requests to `localhost` on a port near `3825` instead of the URL you provided. This is the middleware server at work.
:::
## Conceptual Overview
The interceptor system transforms simple model API calls into sophisticated evaluation workflows through a configurable pipeline of **interceptors**. This design provides:
- **Modularity**: Each interceptor handles a specific concern (logging, caching, reasoning)
- **Composability**: Multiple interceptors can be chained together
- **Configurability**: Interceptors can be enabled/disabled and configured independently
- **Extensibility**: Custom interceptors can be added for specialized processing
The following diagram shows a typical interceptor pipeline configuration. Note that interceptors must follow the order: Request → RequestToResponse → Response, but the specific interceptors and their configuration are flexible:
```{mermaid}
graph TB
A[Evaluation Request] --> B[Adapter System]
B --> C[Interceptor Pipeline]
C --> D[Model API]
D --> E[Response Pipeline]
E --> F[Processed Response]
subgraph "Request Processing"
C --> G[System Message]
G --> H[Payload Modifier]
H --> I[Request Logging]
I --> J[Caching Check]
J --> K[Endpoint Call]
end
subgraph "Response Processing"
E --> L[Response Logging]
L --> M[Reasoning Extraction]
M --> N[Progress Tracking]
N --> O[Cache Storage]
end
style B fill:#f3e5f5
style C fill:#e1f5fe
style E fill:#e8f5e8
```
## Core Concepts
### Adapter Server
The **Adapter Server** is a lightweight server that handles communication between the evaluation harness and the endpoint under test. It provides:
- **Configuration Management**: Unified interface for interceptor settings
- **Pipeline Coordination**: Manages the flow of requests through interceptors
- **Resource Management**: Handles shared resources like caches and logs
- **Error Handling**: Provides consistent error handling across interceptors
### Interceptors
**Interceptors** are modular components that process requests and responses. Key characteristics:
- **Dual Interface**: Each interceptor can process both requests and responses
- **Context Awareness**: Access to evaluation context (benchmark type, model info)
- **Stateful Processing**: Can maintain state across requests
- **Chainable**: Multiple interceptors work together in sequence
## Interceptor Categories
### Processing Interceptors
Transform or augment requests and responses:
- **System Message**: Inject custom system prompts
- **Payload Modifier**: Modify request parameters
- **Reasoning**: Extract chain-of-thought reasoning
### Infrastructure Interceptors
Provide supporting capabilities:
- **Caching**: Store and retrieve responses
- **Logging**: Capture request/response data
- **Progress Tracking**: Monitor evaluation progress
- **Response Stats**: Track request statistics and metrics
- **Raise Client Error**: Raise exceptions for client errors (4xx status codes, excluding 408 and 429)
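The Raise Client Error rule above amounts to a simple status-code predicate: raise on 4xx responses, except the retryable 408 and 429. An illustrative helper (not the SDK's implementation):

```python
# Illustrative predicate for the Raise Client Error interceptor's rule:
# raise on 4xx client errors, except 408 (request timeout) and 429 (rate
# limit), which are retryable and therefore not raised.
def is_raisable_client_error(status_code: int) -> bool:
    return 400 <= status_code < 500 and status_code not in (408, 429)

print([c for c in (400, 404, 408, 429, 500) if is_raisable_client_error(c)])  # → [400, 404]
```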
### Integration Interceptors
Handle external system integration:
- **Endpoint**: Route requests to model APIs
## Configuration Philosophy
The adapter system follows a **configuration-over-code** philosophy:
### Simple Configuration
Enable basic features with minimal configuration:
:::{code-block} python
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(name="caching", enabled=True),
        InterceptorConfig(name="request_logging", enabled=True),
        InterceptorConfig(name="endpoint"),
    ]
)
:::
### Advanced Configuration
Full control over interceptor behavior:
:::{code-block} python
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            enabled=True,
            config={"system_message": "You are an expert."},
        ),
        InterceptorConfig(
            name="caching",
            enabled=True,
            config={"cache_dir": "./cache"},
        ),
        InterceptorConfig(
            name="request_logging",
            enabled=True,
            config={"max_requests": 1000},
        ),
        InterceptorConfig(
            name="reasoning",
            enabled=True,
            config={
                "start_reasoning_token": "",
                "end_reasoning_token": "",
            },
        ),
        InterceptorConfig(name="endpoint"),
    ]
)
:::
### YAML Configuration
Declarative configuration for reproducibility:
```yaml
adapter_config:
  interceptors:
    - name: system_message
      enabled: true
      config:
        system_message: "Think step by step."
    - name: caching
      enabled: true
    - name: reasoning
      enabled: true
    - name: endpoint
```
## Design Benefits
1. **Separation of Concerns**: Each interceptor handles a single responsibility, making the system easier to understand and maintain.
2. **Reusability**: Interceptors can be reused across different evaluation scenarios and benchmarks.
3. **Testability**: Individual interceptors can be tested in isolation, improving reliability.
4. **Performance**: Interceptors can be optimized independently and disabled when not needed.
5. **Extensibility**: New interceptors can be added without modifying existing code.
## Common Use Cases
### Research Workflows
- **Reasoning Analysis**: Extract and analyze model reasoning patterns
- **Prompt Engineering**: Test different system prompts systematically
- **Behavior Studies**: Log detailed interactions for analysis
### Production Evaluations
- **Performance Optimization**: Cache responses to reduce API costs
- **Monitoring**: Track evaluation progress and performance metrics
- **Compliance**: Maintain audit trails of all interactions
### Development and Debugging
- **Request Inspection**: Log requests to debug evaluation issues
- **Response Analysis**: Capture detailed response data
- **Error Tracking**: Monitor and handle evaluation failures
## Integration with Evaluation Frameworks
The adapter system integrates seamlessly with evaluation frameworks:
- **Framework Agnostic**: Works with any OpenAI-compatible API
- **Benchmark Independent**: Same interceptors work across different benchmarks
- **Container Compatible**: Integrates with containerized evaluation frameworks
## Next Steps
For detailed implementation information, see:
- **{ref}`nemo-evaluator-interceptors`**: Individual interceptor guides with configuration examples
- **{ref}`interceptor-caching`**: Response caching setup and optimization
- **{ref}`interceptor-reasoning`**: Chain-of-thought processing configuration
The adapter and interceptor system represents a fundamental shift from simple API calls to sophisticated, configurable evaluation workflows that can adapt to diverse research and production needs.
(about-key-features)=
# Key Features
NeMo Evaluator SDK delivers comprehensive AI model evaluation through a dual-library architecture that scales from local development to enterprise production. It provides container-first reproducibility, multi-backend execution, and a comprehensive set of state-of-the-art benchmarks.
## **Unified Orchestration (NeMo Evaluator Launcher)**
### Multi-Backend Execution
Run evaluations anywhere with unified configuration and monitoring:
- **Local Execution**: Docker-based evaluation on your workstation
- **HPC Clusters**: Slurm integration for large-scale parallel evaluation
- **Cloud Platforms**: Lepton AI and custom cloud backend support
- **Hybrid Workflows**: Mix local development with cloud production
```bash
# Single command, multiple backends
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_basic.yaml
```
### Evaluation Benchmarks & Tasks
Access a comprehensive benchmark suite through a single CLI:
```bash
# Discover available benchmarks
nemo-evaluator-launcher ls tasks
# Run academic benchmarks
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
```
### Built-in Result Export
First-class integration with MLOps platforms:
```bash
# Export to MLflow
nemo-evaluator-launcher export --dest mlflow
# Export to Weights & Biases
nemo-evaluator-launcher export --dest wandb
# Export to Google Sheets
nemo-evaluator-launcher export --dest gsheets
```
## **Core Evaluation Engine (NeMo Evaluator Core)**
### Container-First Architecture
Pre-built NGC containers guarantee reproducible results across environments:
```bash
# Pull and run any evaluation container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
```
### Advanced Adapter System
Sophisticated request/response processing pipeline with interceptor architecture:
```yaml
# Configure adapter system in YAML configuration
target:
  api_endpoint:
    url: "http://localhost:8080/v1/completions/"
    model_id: "my-model"
    adapter_config:
      interceptors:
        # System message interceptor
        - name: system_message
          config:
            system_message: "You are a helpful AI assistant. Think step by step."
        # Request logging interceptor
        - name: request_logging
          config:
            max_requests: 1000
        # Caching interceptor
        - name: caching
          config:
            cache_dir: "./evaluation_cache"
        # Communication with http://localhost:8080/v1/completions/
        - name: endpoint
        # Processing of reasoning traces
        - name: reasoning
          config:
            start_reasoning_token: ""
            end_reasoning_token: ""
        # Response logging interceptor
        - name: response_logging
          config:
            max_responses: 1000
        # Progress tracking interceptor
        - name: progress_tracking
```
### Programmatic API
Full Python API for integration into ML pipelines:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EvaluationConfig, EvaluationTarget

# Describe the endpoint under test
endpoint_config = ApiEndpoint(url="https://integrate.api.nvidia.com/v1/chat/completions",
                              model_id="meta/llama-3.2-3b-instruct", type="chat")

# Configure and run evaluation programmatically
result = evaluate(
    eval_cfg=EvaluationConfig(type="mmlu_pro", output_dir="./results"),
    target_cfg=EvaluationTarget(api_endpoint=endpoint_config),
)
```
## **Container Direct Access**
### NGC Container Catalog
Direct access to specialized evaluation containers through [NGC Catalog](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=label%3A%22NSPECT-JL1B-TVGU%22):
```bash
# Academic benchmarks
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
# Code generation evaluation
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }}
# Safety and security testing
docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }}
# Vision-language model evaluation
docker run --rm -it nvcr.io/nvidia/eval-factory/vlmevalkit:{{ docker_compose_latest }}
```
### Reproducible Evaluation Environments
Every container provides:
- **Fixed dependencies**: Locked versions for consistent results
- **Pre-configured frameworks**: Ready-to-run evaluation harnesses
- **Isolated execution**: No dependency conflicts between evaluations
- **Version tracking**: Tagged releases for exact reproducibility
## **Enterprise Features**
### Multi-Backend Scalability
Scale from laptop to datacenter with unified configuration:
- **Local Development**: Quick iteration with Docker
- **HPC Clusters**: Slurm integration for large-scale evaluation
- **Cloud Platforms**: Lepton AI and custom backend support
- **Hybrid Workflows**: Seamless transition between environments
### Advanced Configuration Management
Hydra-based configuration with full reproducibility:
```yaml
# Evaluation configuration with custom parameters
evaluation:
  tasks:
    - name: mmlu_pro
      nemo_evaluator_config:
        config:
          params:
            limit_samples: 1000
    - name: gsm8k
      nemo_evaluator_config:
        config:
          params:
            temperature: 0.0
execution:
  output_dir: results
target:
  api_endpoint:
    url: https://my-model-endpoint.com/v1/chat/completions
    model_id: my-custom-model
```
## **OpenAI API Compatibility**
### Universal Model Support
NeMo Evaluator supports OpenAI-compatible API endpoints:
- **Hosted Models**: NVIDIA Build, OpenAI, Anthropic, Cohere
- **Self-Hosted**: vLLM, TRT-LLM, NeMo Framework
- **Custom Endpoints**: Any service implementing OpenAI API spec (test compatibility with our {ref}`deployment-testing-compatibility` guide)
The platform supports the following endpoint types:
- **`completions`**: Direct text completion without chat formatting (`/v1/completions`). Used for base models and academic benchmarks.
- **`chat`**: Conversational interface with role-based messages (`/v1/chat/completions`). Used for instruction-tuned and chat models.
- **`vlm`**: Vision-language model endpoints supporting image inputs.
- **`embedding`**: Embedding generation endpoints for retrieval evaluation.
### Endpoint Type Support
Support for diverse evaluation endpoint types through the evaluation configuration:
```yaml
# Text generation evaluation (chat endpoint)
target:
  api_endpoint:
    type: chat
    url: https://api.example.com/v1/chat/completions

# Log-probability evaluation (completions endpoint)
target:
  api_endpoint:
    type: completions
    url: https://api.example.com/v1/completions

# Vision-language evaluation (vlm endpoint)
target:
  api_endpoint:
    type: vlm
    url: https://api.example.com/v1/chat/completions

# Retrieval evaluation (embedding endpoint)
target:
  api_endpoint:
    type: embedding
    url: https://api.example.com/v1/embeddings
```
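A quick way to reason about what a `chat`-type endpoint expects and returns is to look at the payload shapes. The helpers below sketch the OpenAI-compatible chat format; they are illustrative, not SDK code:

```python
# Illustrative helpers for the OpenAI-compatible chat format (not SDK code).
def build_chat_payload(model_id: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build a minimal /v1/chat/completions request body."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def extract_chat_text(response: dict) -> str:
    """Pull the assistant message out of a chat-completions response body."""
    return response["choices"][0]["message"]["content"]

payload = build_chat_payload("my-custom-model", "What is 2+2?")
fake_response = {"choices": [{"message": {"role": "assistant", "content": "4"}}]}
print(extract_chat_text(fake_response))  # → 4
```

`completions`-type endpoints use a flat `prompt` string and return `choices[0].text` instead, which is why base models and log-probability benchmarks are routed there.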
## **Extensibility and Customization**
### Custom Framework Support
Add your own evaluation frameworks using Framework Definition Files:
```yaml
# custom_framework.yml
framework:
  name: my_custom_eval
  description: Custom evaluation for domain-specific tasks
defaults:
  command: >-
    python custom_eval.py --model {{target.api_endpoint.model_id}}
    --task {{config.params.task}} --output {{config.output_dir}}
evaluations:
  - name: domain_specific_task
    description: Evaluate domain-specific capabilities
    defaults:
      config:
        params:
          task: domain_task
          temperature: 0.0
```
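The `{{...}}` placeholders in `command` are resolved against the evaluation configuration at runtime. A rough sketch of that substitution (a simplified stand-in, not the SDK's actual template engine):

```python
import re

# Simplified stand-in for the SDK's command templating: resolves dotted
# {{a.b.c}} placeholders against a nested configuration dict.
def render_command(template: str, context: dict) -> str:
    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{([^}]+)\}\}", lookup, template)

context = {
    "target": {"api_endpoint": {"model_id": "my-model"}},
    "config": {"params": {"task": "domain_task"}, "output_dir": "./results"},
}
cmd = render_command(
    "python custom_eval.py --model {{target.api_endpoint.model_id}} "
    "--task {{config.params.task}} --output {{config.output_dir}}",
    context,
)
print(cmd)
# → python custom_eval.py --model my-model --task domain_task --output ./results
```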
### Advanced Interceptor Configuration
Fine-tune request/response processing with the adapter system through YAML configuration:
```yaml
# Production-ready adapter configuration in framework YAML
target:
  api_endpoint:
    url: "https://production-api.com/v1/completions"
    model_id: "production-model"
    adapter_config:
      log_failed_requests: true
      interceptors:
        # System message interceptor
        - name: system_message
          config:
            system_message: "You are an expert AI assistant specialized in this domain."
        # Request logging interceptor
        - name: request_logging
          config:
            max_requests: 5000
        # Caching interceptor
        - name: caching
          config:
            cache_dir: "./production_cache"
        # Reasoning interceptor
        - name: reasoning
          config:
            start_reasoning_token: ""
            end_reasoning_token: ""
        # Response logging interceptor
        - name: response_logging
          config:
            max_responses: 5000
        # Progress tracking interceptor
        - name: progress_tracking
          config:
            progress_tracking_url: "http://monitoring.internal:3828/progress"
```
## **Security and Safety**
### Comprehensive Safety Evaluation
Built-in safety assessment through specialized containers:
```bash
# Run Aegis and Garak evaluations
export JUDGE_API_KEY=your-judge-api-key # token with access to your judge endpoint
export HF_TOKEN_FOR_AEGIS_V2=hf_your-token # HF token with access to the gated Aegis dataset and meta-llama/Llama-3.1-8B-Instruct
export NGC_API_KEY=nvapi-your-key # token with access to build.nvidia.com
# set judge.url in the config or pass with -o
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_safety.yaml
```
**Safety Containers Available:**
- **safety-harness**: Content safety evaluation using NemoGuard judge models
- **garak**: Security vulnerability scanning and prompt injection detection
## **Monitoring and Observability**
### Real-Time Progress Tracking
Monitor evaluation progress across all backends:
```bash
# Check evaluation status
nemo-evaluator-launcher status
# Kill running evaluations
nemo-evaluator-launcher kill
```
### Result Export and Analysis
Export evaluation results to MLOps platforms for downstream analysis:
```bash
# Export to MLflow for experiment tracking
nemo-evaluator-launcher export --dest mlflow
# Export to Weights & Biases for visualization
nemo-evaluator-launcher export --dest wandb
# Export to Google Sheets for sharing
nemo-evaluator-launcher export --dest gsheets
```
(about-release-notes)=
# Release Notes
## NeMo Evaluator SDK — General Availability (0.1.0)
NVIDIA is excited to announce the general availability of NeMo Evaluator SDK, an open-source platform for robust, reproducible, and scalable evaluation of large language models.
### Overview
NeMo Evaluator SDK provides a comprehensive solution for AI model evaluation and benchmarking, enabling researchers, ML engineers, and organizations to assess model performance across diverse capabilities including reasoning, code generation, function calling, and safety. The platform consists of two core libraries:
- **{ref}`nemo-evaluator`**: The core evaluation engine that manages interactions between evaluation harnesses and the models being tested
- **{ref}`nemo-evaluator-launcher`**: The orchestration layer providing unified CLI and programmatic interfaces for multi-backend execution
### Key Features
**Reproducibility by Default**: All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.
**Scale Anywhere**: Run evaluations from a local machine to a Slurm cluster or cloud-native backends without changing your workflow.
**State-of-the-Art Benchmarking**: Access a comprehensive suite of over 100 benchmarks from 21+ open-source evaluation harnesses, including popular frameworks such as lm-evaluation-harness, bigcode-evaluation-harness, and simple-evals, plus specialized tools for safety, function calling, and agentic AI evaluation.
**Extensible and Customizable**: Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.
**OpenAI-Compatible API Support**: Evaluate any model that exposes an OpenAI-compatible endpoint, including hosted services (build.nvidia.com), self-hosted solutions (NVIDIA NIM, vLLM, TensorRT-LLM), and models trained with NeMo framework.
**Containerized Execution**: All evaluations run in open-source Docker containers for auditable and trustworthy results, with pre-built containers available through the NVIDIA NGC catalog.
(get-started-overview)=
# Get Started
## Before You Start
Before you begin, make sure you have:
- **Python Environment**: Python 3.10 or higher (up to 3.13)
- **OpenAI-Compatible Endpoint**: Hosted or self-deployed model API
- **Docker**: For container-based evaluation workflows (optional)
- **NVIDIA GPU**: For local model deployment (optional)
---
## Quick Start Path
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Installation
:link: gs-install
:link-type: ref
Install {{ product_name_short }} and set up your evaluation environment with all necessary dependencies.
:::
:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Quick Start
:link: gs-quickstart
:link-type: ref
Deploy your first model and run a simple evaluation in just a few minutes.
:::
::::
## Entry Point Decision Guide
NeMo Evaluator provides three primary entry points, each designed for different user needs and workflows. Use this guide to choose the right approach for your use case.
```{mermaid}
flowchart TD
A[I need to evaluate AI models] --> B{What's your primary goal?}
B -->|Quick evaluations with minimal setup| C[NeMo Evaluator Launcher]
B -->|Custom integrations and workflows| D[NeMo Evaluator Core]
B -->|Direct container control| E[Direct Container Usage]
C --> C1["Unified CLI interface<br/>Multi-backend execution<br/>Built-in result export<br/>100+ benchmarks ready"]
D --> D1["Programmatic API control<br/>Custom evaluation workflows<br/>Adapter/interceptor system<br/>Framework extensions"]
E --> E1["Maximum flexibility<br/>Custom container workflows<br/>Direct framework access<br/>Advanced users only"]
C1 --> F[Start with Launcher Quickstart]
D1 --> G[Start with Core API Guide]
E1 --> H[Start with Container Reference]
style C fill:#e1f5fe
style D fill:#f3e5f5
style E fill:#fff3e0
```
## What You'll Learn
By the end of this section, you'll be able to:
1. **Install and configure** NeMo Evaluator components for your needs
2. **Choose the right approach** from the three-tier architecture
3. **Run your first evaluation** using hosted or self-deployed endpoints
4. **Configure advanced features** like adapters and interceptors
5. **Integrate evaluations** into your ML workflows
## Typical Workflows
### **Launcher Workflow** (Most Users)
1. **Install** NeMo Evaluator Launcher
2. **Configure** endpoint and benchmarks in YAML
3. **Run** evaluations with single CLI command
4. **Export** results to MLflow, W&B, or local files
### **Core API Workflow** (Developers)
1. **Install** NeMo Evaluator Core library
2. **Configure** adapters and interceptors programmatically
3. **Integrate** into existing ML pipelines
4. **Customize** evaluation logic and processing
### **Container Workflow** (Container Users)
1. **Pull** pre-built evaluation containers
2. **Run** evaluations directly in isolated environments
3. **Mount** data and results for persistence
4. **Combine** with existing container orchestration
(gs-install)=
# Installation Guide
NeMo Evaluator provides multiple installation paths depending on your needs. Choose the approach that best fits your use case.
## Choose Your Installation Path
```{list-table} Installation Path Comparison
:header-rows: 1
:widths: 25 25 50
* - **Installation Path**
  - **Best For**
  - **Key Features**
* - **NeMo Evaluator Launcher** (Recommended)
  - Most users who want unified CLI and orchestration across backends
  - • Unified CLI for 100+ benchmarks
    • Multi-backend execution (local, Slurm, cloud)
    • Built-in result export to MLflow, W&B, etc.
    • Configuration management with examples
* - **NeMo Evaluator Core**
  - Developers building custom evaluation pipelines
  - • Programmatic Python API
    • Direct container access
    • Custom framework integration
    • Advanced adapter configuration
* - **Container Direct**
  - Users who prefer container-based workflows
  - • Pre-built NGC evaluation containers
    • Guaranteed reproducibility
    • No local installation required
    • Isolated evaluation environments
```
---
## Prerequisites
### System Requirements
- Python 3.10 or higher (supports 3.10, 3.11, 3.12, and 3.13)
- CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100)
- Docker (for container-based workflows)
### Recommended Environment
- Python 3.12
- CUDA 12.9
- Ubuntu 24.04
---
## Installation Methods
::::{tab-set}
:::{tab-item} Launcher (Recommended)
Install NeMo Evaluator Launcher for unified CLI and orchestration:
```{literalinclude} _snippets/install_launcher.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
Quick verification:
```{literalinclude} _snippets/verify_launcher.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
:::
:::{tab-item} Core Library
Install NeMo Evaluator Core for programmatic access:
```{literalinclude} _snippets/install_core.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
Quick verification:
```{literalinclude} _snippets/verify_core.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
:::
:::{tab-item} NGC Containers
Use pre-built evaluation containers from NVIDIA NGC for guaranteed reproducibility:
```bash
# Pull evaluation containers (no local installation needed)
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }}
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }}
```
```bash
# Run container interactively
docker run --rm -it \
-v $(pwd)/results:/workspace/results \
nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash
# Or run evaluation directly
docker run --rm \
-v $(pwd)/results:/workspace/results \
-e NGC_API_KEY=nvapi-xxx \
nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_id meta/llama-3.2-3b-instruct \
--api_key_name NGC_API_KEY \
--output_dir /workspace/results
```
Quick verification:
```bash
# Test container access
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \
nemo-evaluator ls | head -5
echo "Container access verified"
```
:::
::::
---
## Clone the Repository
Clone the NeMo Evaluator repository to get easy access to our ready-to-use examples:
```bash
git clone https://github.com/NVIDIA-NeMo/Evaluator.git
```
Run the example:
```bash
cd Evaluator/
export NGC_API_KEY=nvapi-... # API Key with access to build.nvidia.com
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_reasoning.yaml \
--override execution.output_dir=nemotron-eval
```
## Add Evaluation Harnesses to Your Environment
Build your custom evaluation pipeline by adding evaluation harness packages to your environment of choice:
```bash
pip install nemo-evaluator
```
(core-wheels)=
### Available PyPI Packages
```{list-table}
:header-rows: 1
:widths: 30 70
* - Package Name
  - PyPI URL
* - nvidia-bfcl
  -
* - nvidia-bigcode-eval
  -
* - nvidia-compute-eval
  -
* - nvidia-eval-factory-garak
  -
* - nvidia-genai-perf-eval
  -
* - nvidia-crfm-helm
  -
* - nvidia-hle
  -
* - nvidia-ifbench
  -
* - nvidia-livecodebench
  -
* - nvidia-lm-eval
  -
* - nvidia-mmath
  -
* - nvidia-mtbench-evaluator
  -
* - nvidia-eval-factory-nemo-skills
  -
* - nvidia-safety-harness
  -
* - nvidia-scicode
  -
* - nvidia-simple-evals
  -
* - nvidia-tooltalk
  -
* - nvidia-vlmeval
  -
```
:::{note}
Evaluation harnesses that require complex environments are not available as packages but only as containers.
:::
(gs-quickstart-container)=
# Container Direct
**Best for**: Users who prefer container-based workflows
The Container Direct approach gives you full control over the container environment with volume mounting, environment variable management, and integration into Docker-based CI/CD pipelines.
## Prerequisites
- Docker with GPU support
- OpenAI-compatible endpoint
## Quick Start
```bash
# 1. Pull evaluation container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
# 2. Run container interactively
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash
# 3. Inside container - set up environment
export NGC_API_KEY=nvapi-your-key-here
export HF_TOKEN=hf_your-token-here # If using gated datasets
# 4. Run evaluation
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_id meta/llama-3.2-3b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name NGC_API_KEY \
--output_dir /tmp/results \
--overrides 'config.params.limit_samples=10' # Remove to run on full benchmark
```
## Complete Container Workflow
Here's a complete example with volume mounting and advanced configuration:
```bash
# 1. Create local directories for persistent storage
mkdir -p ./results ./cache ./logs
# 2. Run container with volume mounts
docker run --rm -it \
-v $(pwd)/results:/workspace/results \
-v $(pwd)/cache:/workspace/cache \
-v $(pwd)/logs:/workspace/logs \
-e NGC_API_KEY=nvapi-your-key-here \
-e HF_TOKEN=hf_your-token-here \
nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash
# 3. Inside container - run evaluation
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_id meta/llama-3.2-3b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name NGC_API_KEY \
--output_dir /workspace/results \
--overrides 'config.params.limit_samples=3' # Remove to run on full benchmark
# 4. Exit container and check results
exit
ls -la ./results/
```
## One-Liner Container Execution
For automated workflows, you can run everything in a single command:
```bash
NGC_API_KEY=nvapi-your-key-here
# Run evaluation directly in container
docker run --rm \
-v $(pwd)/results:/workspace/results \
-e NGC_API_KEY="${NGC_API_KEY}" \
nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--model_id meta/llama-3.2-3b-instruct \
--api_key_name NGC_API_KEY \
--output_dir /workspace/results
```
## Key Features
### Full Container Control
- Direct access to container environment
- Custom volume mounting strategies
- Environment variable management
- GPU resource allocation
### CI/CD Integration
- Single-command execution for automation
- Docker Compose compatibility
- Kubernetes deployment ready
- Pipeline integration capabilities
### Persistent Storage
- Volume mounting for results persistence
- Cache directory management
- Log file preservation
- Custom configuration mounting
### Environment Isolation
- Clean, reproducible environments
- Dependency management handled
- Version pinning through container tags
- No local Python environment conflicts
## Advanced Container Patterns
### Docker Compose Integration
```yaml
# docker-compose.yml
version: '3.8'
services:
  nemo-eval:
    image: nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
    volumes:
      - ./results:/workspace/results
      - ./cache:/workspace/cache
      - ./configs:/workspace/configs
    environment:
      - MY_API_KEY=${NGC_API_KEY}
    command: >
      nemo-evaluator run_eval
      --eval_type mmlu_pro
      --model_id meta/llama-3.2-3b-instruct
      --model_url https://integrate.api.nvidia.com/v1/chat/completions
      --model_type chat
      --api_key_name MY_API_KEY
      --output_dir /workspace/results
```
### Batch Processing Script
```bash
#!/bin/bash
# batch_eval.sh
BENCHMARKS=("mmlu_pro" "gpqa_diamond" "humaneval")
NGC_API_KEY=nvapi-your-key-here
HF_TOKEN=hf_your-token-here # Needed for GPQA-Diamond (gated dataset)
for benchmark in "${BENCHMARKS[@]}"; do
echo "Running evaluation for $benchmark..."
docker run --rm \
-v $(pwd)/results:/workspace/results \
-e MY_API_KEY=$NGC_API_KEY \
-e HF_TOKEN=$HF_TOKEN \
nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \
nemo-evaluator run_eval \
--eval_type $benchmark \
--model_id meta/llama-3.2-3b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir /workspace/results/$benchmark \
--overrides 'config.params.limit_samples=10'
echo "Completed $benchmark evaluation"
done
echo "All evaluations completed. Results in ./results/"
```
## Next Steps
- Integrate into your CI/CD pipelines
- Explore Docker Compose for multi-service setups
- Consider Kubernetes deployment for scale
- Try {ref}`gs-quickstart-launcher` for simplified workflows
- See {ref}`gs-quickstart-core` for programmatic API and advanced adapter features
(gs-quickstart-core)=
# NeMo Evaluator Core
**Best for**: Developers who need programmatic control
The NeMo Evaluator Core provides direct Python API access for custom configurations and integration into existing Python workflows.
## Prerequisites
- Python environment
- OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated)
- Verify endpoint compatibility using our {ref}`deployment-testing-compatibility` guide
## Quick Start
```bash
# 1. Install nemo-evaluator and the nvidia-simple-evals harness
pip install nemo-evaluator nvidia-simple-evals
# 2. List available benchmarks and tasks
nemo-evaluator ls
# 3. Run evaluation
# Prerequisites: Set your API key
export NGC_API_KEY="nvapi-..."
# Launch using python:
```
```{literalinclude} ../_snippets/core_basic.py
:language: python
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
## Complete Working Example
### Using Python API
```{literalinclude} ../_snippets/core_full_example.py
:language: python
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
### Using CLI
```{literalinclude} ../_snippets/core_full_cli.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
## Key Features
### Programmatic Integration
- Direct Python API access
- Pydantic-based configuration with type hints
- Integration with existing Python workflows
### Evaluation Configuration
- Fine-grained parameter control via `ConfigParams`
- Multiple evaluation types: `mmlu_pro`, `gsm8k`, `hellaswag`, and more
- Configurable sampling, temperature, and token limits
### Endpoint Support
- Chat endpoints (`EndpointType.CHAT`)
- Completion endpoints (`EndpointType.COMPLETIONS`)
- VLM endpoints (`EndpointType.VLM`)
- Embedding endpoints (`EndpointType.EMBEDDING`)
## Advanced Usage Patterns
### Multi-Benchmark Evaluation
```{literalinclude} ../_snippets/core_multi_benchmark.py
:language: python
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
### Discovering Installed Benchmarks
```python
from nemo_evaluator import show_available_tasks
# List all installed evaluation tasks
show_available_tasks()
```
:::{tip}
To extend the list of available benchmarks, install additional harnesses. See the list of evaluation harnesses available as PyPI wheels: {ref}`core-wheels`.
:::
### Using Adapters and Interceptors
For advanced evaluation scenarios, configure the adapter system with interceptors for request/response processing, caching, logging, and more:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
# Configure evaluation target with adapter
api_endpoint = ApiEndpoint(
url="http://0.0.0.0:8080/v1/completions/",
type=EndpointType.COMPLETIONS,
model_id="my_model"
)
# Create adapter configuration with interceptors
api_endpoint.adapter_config = AdapterConfig(
interceptors=[
InterceptorConfig(
name="system_message",
config={"system_message": "You are a helpful AI assistant. Think step by step."}
),
InterceptorConfig(
name="request_logging",
config={"max_requests": 50}
),
InterceptorConfig(
name="caching",
config={
"cache_dir": "./evaluation_cache",
"reuse_cached_responses": True
}
),
InterceptorConfig(
name="endpoint",
),
InterceptorConfig(
name="response_logging",
config={"max_responses": 50}
),
InterceptorConfig(
name="reasoning",
config={
"start_reasoning_token": "",
"end_reasoning_token": ""
}
),
InterceptorConfig(
name="progress_tracking",
config={"progress_tracking_url": "http://localhost:3828/progress"}
)
]
)
target = EvaluationTarget(api_endpoint=api_endpoint)
# Run evaluation with full adapter pipeline
config = EvaluationConfig(
type="gsm8k",
output_dir="./results/gsm8k",
params=ConfigParams(
limit_samples=10,
temperature=0.0,
max_new_tokens=512,
parallelism=1
)
)
if __name__ == "__main__":
result = evaluate(eval_cfg=config, target_cfg=target)
print(f"Evaluation completed: {result}")
```
**Available Interceptors:**
- `system_message`: Add custom system prompts to chat requests
- `request_logging`: Log incoming requests for debugging
- `response_logging`: Log outgoing responses for debugging
- `caching`: Cache responses to reduce API costs and speed up reruns
- `reasoning`: Extract chain-of-thought reasoning from model responses
- `progress_tracking`: Track evaluation progress and send updates
For complete adapter documentation, refer to {ref}`adapters-usage`.
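The effect of the `caching` interceptor can be pictured with a small stand-alone sketch. This is illustrative standard-library code, not the interceptor's actual implementation (the real interceptor persists entries under `cache_dir` and is configured via `InterceptorConfig`): responses are keyed by a hash of the request payload, so an identical request on a rerun hits the cache instead of the endpoint.

```python
import hashlib
import json

# Illustrative in-memory cache keyed by request content. The real
# `caching` interceptor persists responses on disk under `cache_dir`.
_cache: dict[str, dict] = {}

def cache_key(request_payload: dict) -> str:
    """Stable key: hash of the canonical JSON form of the request."""
    canonical = json.dumps(request_payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_call(request_payload: dict, send) -> dict:
    """Return a cached response if this exact request was seen before."""
    key = cache_key(request_payload)
    if key not in _cache:
        _cache[key] = send(request_payload)
    return _cache[key]
```

Sending the same request twice results in a single upstream call, which is why enabling `reuse_cached_responses` speeds up reruns and reduces API costs.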
## Next Steps
- Integrate into your existing Python workflows
- Run multiple benchmarks in sequence
- Explore available evaluation types with `show_available_tasks()`
- Configure adapters and interceptors for advanced evaluation scenarios
- Consider {ref}`gs-quickstart-launcher` for CLI workflows
- Try {ref}`gs-quickstart-container` for containerized environments
(gs-quickstart)=
# Quickstart
Get up and running with NeMo Evaluator in minutes. Choose your preferred approach based on your needs and experience level.
## Prerequisites
All paths require:
- OpenAI-compatible endpoint (hosted or self-deployed)
- Valid API key for your chosen endpoint
## Quick Reference
| Task | Command |
|------|---------|
| List benchmarks | `nemo-evaluator-launcher ls tasks` |
| Run evaluation | `nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/.yaml` |
| Check status | `nemo-evaluator-launcher status ` |
| Job info | `nemo-evaluator-launcher info ` |
| Export results | `nemo-evaluator-launcher export --dest local --format json` |
| Dry run | Add `--dry-run` to any run command |
| Test with limited samples | Add `-o +config.params.limit_samples=3` |
## Choose Your Path
Select the approach that best matches your workflow and technical requirements:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` NeMo Evaluator Launcher
:link: gs-quickstart-launcher
:link-type: ref
**Recommended for most users**
Unified CLI experience with automated container management, built-in orchestration, and result export capabilities.
:::
:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` NeMo Evaluator Core
:link: gs-quickstart-core
:link-type: ref
**For Python developers**
Programmatic control with full adapter features, custom configurations, and direct API access for integration into existing workflows.
:::
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` NeMo Framework Container
:link: gs-quickstart-nemo-fw
:link-type: ref
**For NeMo Framework Users**
End-to-end training and evaluation of large language models (LLMs).
:::
:::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Container Direct
:link: gs-quickstart-container
:link-type: ref
**For container workflows**
Direct container execution with volume mounting, environment control, and integration into Docker-based CI/CD pipelines.
:::
::::
## Model Endpoints
NeMo Evaluator works with any OpenAI-compatible endpoint. You have several options:
### **Hosted Endpoints** (Recommended)
- **NVIDIA Build**: [build.nvidia.com](https://build.nvidia.com) - Ready-to-use hosted models
- **OpenAI**: Standard OpenAI API endpoints
- **Other providers**: Anthropic, Cohere, or any OpenAI-compatible API
### **Self-Hosted Options**
If you prefer to host your own models, verify OpenAI compatibility using our {ref}`deployment-testing-compatibility` guide.
If you deploy the model locally with Docker, you can use a dedicated Docker network.
This provides a secure connection between the deployment and evaluation containers.
```bash
# create a dedicated docker network
docker network create my-custom-network
# launch deployment
docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \
--model microsoft/Phi-4-mini-instruct --max-model-len 8192
# Or use other serving frameworks
# TRT-LLM, NeMo Framework, etc.
```
Create an evaluation config:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
execution:
output_dir: my_phi_test
extra_docker_args: "--network my-custom-network" # same network as used for deployment
target:
api_endpoint:
model_id: microsoft/Phi-4-mini-instruct
url: http://my-phi-container:8000/v1/chat/completions
api_key_name: null
evaluation:
tasks:
- name: simple_evals.mmlu_pro
nemo_evaluator_config:
config:
params:
limit_samples: 10 # TEST ONLY: Limits to 10 samples for quick testing
parallelism: 1
```
Save the config to a file (e.g. `phi-eval.yaml`) and launch the evaluation:
```bash
nemo-evaluator-launcher run \
--config ./phi-eval.yaml \
  -o execution.output_dir=./phi-results
```
## Validation and Troubleshooting
### Quick Validation Steps
Before running full evaluations, verify your setup:
```bash
# 1. Test your endpoint connectivity
export NGC_API_KEY=nvapi-...
curl -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
-H "Authorization: Bearer $NGC_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.2-3b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 10
}'
# 2. Run a dry-run to validate configuration
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
--dry-run
# 3. Run a minimal test with very few samples
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o +config.params.limit_samples=1 \
-o execution.output_dir=./test_results
```
### Common Issues and Solutions
::::{tab-set}
:::{tab-item} API Key Issues
:sync: api-key
```bash
# Verify your API key is set correctly
echo $NGC_API_KEY
# Test with a simple curl request (see above)
```
:::
:::{tab-item} Container Issues
:sync: container
```bash
# Check Docker is running and has GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
# Pull the latest container if you have issues
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
```
:::
:::{tab-item} Configuration Issues
:sync: config
```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
# Check available evaluation types
nemo-evaluator-launcher ls tasks
```
:::
:::{tab-item} Result Validation
:sync: results
```bash
# Check if results were generated
find ./results -name "*.yml" -type f
# View task results
cat ./results///artifacts/results.yml
# Or export and view processed results
nemo-evaluator-launcher export --dest local --format json
cat ./results//processed_results.json
```
:::
::::
## Next Steps
After completing your quickstart:
::::{tab-set}
:::{tab-item} Explore More Benchmarks
:sync: benchmarks
```bash
# List all available tasks
nemo-evaluator-launcher ls tasks
# Run with limited samples for quick testing
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o +config.params.limit_samples=3
```
:::
:::{tab-item} Export Results
:sync: export
```bash
# Export to MLflow
nemo-evaluator-launcher export --dest mlflow
# Export to Weights & Biases
nemo-evaluator-launcher export --dest wandb
# Export to Google Sheets
nemo-evaluator-launcher export --dest gsheets
# Export to local files
nemo-evaluator-launcher export --dest local --format json
```
:::
:::{tab-item} Scale to Clusters
:sync: scale
```bash
cd packages/nemo-evaluator-launcher
# Run on Slurm cluster
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml
# Run on Lepton AI
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_vllm.yaml
```
:::
::::
```{toctree}
:maxdepth: 1
:hidden:
NeMo Evaluator Launcher
NeMo Evaluator Core
NeMo Framework Container
Container Direct
```
(gs-quickstart-launcher)=
# NeMo Evaluator Launcher
**Best for**: Most users who want a unified CLI experience
The NeMo Evaluator Launcher provides the simplest way to run evaluations with automated container management, built-in orchestration, and comprehensive result export capabilities.
## Prerequisites
- OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated), referred to below as `NGC_API_KEY` when using models hosted on [NVIDIA's serving platform](https://build.nvidia.com)
- Docker installed (for local execution)
- NeMo Evaluator repository cloned (for access to [examples](https://github.com/NVIDIA-NeMo/Evaluator/tree/main/packages/nemo-evaluator-launcher/examples))
```bash
git clone https://github.com/NVIDIA-NeMo/Evaluator.git
```
- Your Hugging Face token with access to the GPQA-Diamond dataset (request access [here](https://huggingface.co/datasets/Idavidrein/gpqa)), referred to below as `HF_TOKEN_FOR_GPQA_DIAMOND`.
## Quick Start
```bash
# 1. Install the launcher
pip install nemo-evaluator-launcher
# Optional: Install with specific exporters
pip install "nemo-evaluator-launcher[all]" # All exporters
pip install "nemo-evaluator-launcher[mlflow]" # MLflow only
pip install "nemo-evaluator-launcher[wandb]" # W&B only
pip install "nemo-evaluator-launcher[gsheets]" # Google Sheets only
# 2. List available benchmarks
nemo-evaluator-launcher ls tasks
# 3. Run evaluation against a hosted endpoint
# Prerequisites: Set your API key and HF token. Visit https://huggingface.co/datasets/Idavidrein/gpqa
# to get access to the gated GPQA dataset for this task.
export NGC_API_KEY=nvapi-...
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_...
# Move into the cloned directory (see above).
cd Evaluator
```
```{literalinclude} ../_snippets/launcher_basic.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
```bash
# 4. Check status
nemo-evaluator-launcher status --json # use the ID printed by the run command
# 5. Find all the recent runs you launched
nemo-evaluator-launcher ls runs --since 2h # list runs from last 2 hours
```
:::{note}
You can use a shortened ID in the `status` command, for example `abcd` instead of the full `abcdef0123456`, or `ab.0` instead of `abcdef0123456.0`, as long as the prefix is unambiguous. This is syntactic sugar for slightly easier usage.
:::
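The prefix matching described in the note above can be sketched as follows. This is an illustrative standard-library model, not the launcher's actual resolution code: a short ID resolves only when exactly one known ID starts with it.

```python
# Illustrative sketch of unambiguous prefix matching for job IDs.
# The launcher's real resolution logic may differ; this only models
# the "no collisions" rule described in the note above.
def resolve_id(prefix: str, known_ids: list[str]) -> str:
    matches = [i for i in known_ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or unknown ID prefix: {prefix!r}")
    return matches[0]
```

With IDs `abcdef0123456` and `abzzzz9876543` known, `abc` resolves uniquely while `ab` is rejected as ambiguous.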
```bash
# 6a. Check the results
cat /artifacts/results.yml # use the output_dir printed by the run command
# 6b. Check the running logs
tail -f /*/logs/stdout.log # use the output_dir printed by the run command
# 7a. Export your results (JSON/CSV)
nemo-evaluator-launcher export --dest local --format json
# 7b. Or see the job details, with lots of useful subcommands inside
nemo-evaluator-launcher info # use the ID printed by the run command
# 8. Kill the running job(s)
nemo-evaluator-launcher kill # use the ID printed by the run command
```
## Complete Working Example
Here's a complete example using NVIDIA Build (build.nvidia.com):
```bash
# Prerequisites: Set your API key and HF token
export NGC_API_KEY=nvapi-...
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_...
```
```{literalinclude} ../_snippets/launcher_full_example.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
**What happens:**
- Pulls appropriate evaluation container
- Runs benchmark against your endpoint
- Saves results to specified directory
- Provides monitoring and status updates
## Key Features
### Automated Container Management
- Automatically pulls and manages evaluation containers
- Handles volume mounting for results
- No manual Docker commands required
### Built-in Orchestration
- Job queuing and parallel execution
- Progress monitoring and status tracking
### Result Export
- Export to MLflow, Weights & Biases, or local formats
- Structured result formatting
- Integration with experiment tracking platforms
### Configuration Management
- YAML-based configuration system
- Override parameters via command line
- Template configurations for common scenarios
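Command-line overrides such as `-o execution.output_dir=./results` can be pictured as dotted keys applied to a nested configuration. The sketch below is illustrative only; the launcher actually uses Hydra-style overrides, which support richer syntax (`+key`, lists, interpolation) than modeled here.

```python
# Illustrative sketch of applying a dotted-key override to a nested
# config dict. Real override handling (Hydra) supports richer syntax
# (+key prepends, lists, interpolation) not modeled here.
def apply_override(config: dict, override: str) -> dict:
    key_path, _, value = override.partition("=")
    keys = key_path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

cfg = {"execution": {"output_dir": "results"}}
apply_override(cfg, "execution.output_dir=./phi-results")
```

After the call, `cfg["execution"]["output_dir"]` holds the overridden value while the rest of the config is untouched.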
## Next Steps
- Explore different evaluation types: `nemo-evaluator-launcher ls tasks`
- Try advanced configurations in the `packages/nemo-evaluator-launcher/examples/` directory
- Export results to your preferred tracking platform
- Scale to cluster execution with Slurm or cloud providers
For more advanced control, consider the {ref}`gs-quickstart-core` Python API or {ref}`gs-quickstart-container` approaches.
(gs-quickstart-nemo-fw)=
# Evaluate checkpoints trained by NeMo Framework
The NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training platform for large language models (LLMs), multimodal models, and speech models. It enables seamless scaling of both pretraining and post-training workloads, from a single GPU to clusters with thousands of nodes, supporting Hugging Face/PyTorch and Megatron models. NeMo includes a suite of libraries and curated training recipes to help users build models from start to finish.
The NeMo Evaluator is integrated within NeMo Framework, offering streamlined deployment and advanced evaluation capabilities for models trained using NeMo, leveraging state-of-the-art evaluation harnesses.
## Prerequisites
- Docker installed
- CUDA-compatible GPU
- [NeMo Framework docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
- Access to a Megatron Bridge checkpoint
## Quick Start
```bash
# 1. Start NeMo Framework Container
TAG=...
CHECKPOINT_PATH="/path/to/checkpoint/mbridge_llama3_8b/iter_0000000" # use absolute path
docker run --rm -it -w /workdir -v $(pwd):/workdir -v $CHECKPOINT_PATH:/checkpoint/ \
--entrypoint bash \
--gpus all \
nvcr.io/nvidia/nemo:${TAG}
```
```bash
# Run inside the container:
# 2. Deploy a Model
python \
/opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
--megatron_checkpoint /checkpoint \
--model_id megatron_model \
--port 8080 \
--host 0.0.0.0
```
```{literalinclude} ../_snippets/nemo_fw_basic.py
:language: python
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
## Key Features
- **Multi-Backend Deployment**: Supports PyTriton and multi-instance evaluations using the Ray Serve deployment backend
- **Production-Ready**: Supports high-performance inference with CUDA graphs and flash decoding
- **Multi-GPU and Multi-Node Support**: Enables distributed inference across multiple GPUs and compute nodes
- **OpenAI-Compatible API**: Provides RESTful endpoints aligned with OpenAI API specifications
- **Comprehensive Evaluation**: Includes state-of-the-art evaluation harnesses for academic benchmarks, reasoning benchmarks, code generation, and safety testing
- **Adapter System**: Benefits from NeMo Evaluator's Adapter System for customizable request and response processing
## Advanced Usage Patterns
### Evaluate LLMs Using Log-Probabilities
```{literalinclude} ../../deployment/nemo-fw/_snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```
### Multi-Instance Deployment with Ray
Deploy multiple instances of your model:
```shell
# Port 8080 serves the Ray endpoint; 2 replicas share 4 GPUs, each
# replica using tensor parallelism 2 (pipeline and context parallelism 1).
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /checkpoint \
    --model_id "megatron_model" \
    --port 8080 \
    --num_gpus 4 \
    --num_replicas 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1
```
Run evaluations with increased parallelism:
```python
from nemo_evaluator.api import check_endpoint, evaluate
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, ApiEndpoint, EvaluationTarget, ConfigParams
# Configure the evaluation target
api_endpoint = ApiEndpoint(
url="http://0.0.0.0:8080/v1/completions/",
type="completions",
model_id="megatron_model",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_params = ConfigParams(top_p=0, temperature=0, parallelism=2)
eval_config = EvaluationConfig(type='mmlu', params=eval_params, output_dir="results")
if __name__ == "__main__":
check_endpoint(
endpoint_url=eval_target.api_endpoint.url,
endpoint_type=eval_target.api_endpoint.type,
model_name=eval_target.api_endpoint.model_id,
)
evaluate(target_cfg=eval_target, eval_cfg=eval_config)
```
## Next Steps
- Integrate evaluation into your training pipeline
- Run deployment and evaluation with NeMo Run
- Configure adapters and interceptors for advanced evaluation scenarios
- Explore {ref}`tutorials-overview`
(tutorials-overview)=
# Tutorials
Master {{ product_name_short }} with hands-on tutorials and practical examples.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`goal;1.5em;sd-mr-1` How-To
:link: how-to/index
:link-type: doc
Hands-on, step-by-step guides showcasing a single feature or use case.
:::
:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Evaluation with NeMo Framework
:link: nemo-fw/index
:link-type: doc
Deploy models and run evaluations using NeMo Framework container.
:::
:::{grid-item-card} {octicon}`light-bulb;1.5em;sd-mr-1` Evaluate an existing endpoint using local executor
:link: local-evaluation-of-existing-endpoint
:link-type: doc
:::
::::
---
orphan: true
---
(create-framework-definition-file)=
# Tutorial: Create a Framework Definition File
Learn by building a complete FDF for a simple evaluation framework.
**What you'll build**: An FDF that wraps a hypothetical CLI tool called `domain-eval`
**Time**: 20 minutes
**Prerequisites**:
- Python evaluation framework with a CLI
- Basic YAML knowledge
- Understanding of your framework's parameters
## What You're Creating
By the end, you'll have integrated your evaluation framework with {{ product_name_short }}, allowing users to run:
```bash
nemo-evaluator run_eval \
--eval_type domain_specific_task \
--model_id meta/llama-3.2-3b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat
```
---
## Step 1: Understand Your Framework
First, document your framework's CLI interface. For our example `domain-eval`:
```bash
# How your CLI currently works
domain-eval \
--model-name gpt-4 \
--api-url https://api.example.com/v1/chat/completions \
--task medical_qa \
--temperature 0.0 \
--output-dir ./results
```
**Action**: Write down your own framework's command structure.
---
## Step 2: Create the Directory Structure
```bash
mkdir -p my-framework/core_evals/domain_eval
cd my-framework/core_evals/domain_eval
touch framework.yml output.py __init__.py
```
**Why this structure?** {{ product_name_short }} discovers frameworks by scanning `core_evals/` directories.
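That discovery step can be pictured with a small sketch. This is illustrative only; the real mechanism discovers frameworks through installed Python packages, not an arbitrary filesystem walk. Each `core_evals/<pkg_name>/framework.yml` found marks one integrated framework.

```python
from pathlib import Path

# Illustrative sketch of framework discovery: find every
# core_evals/<pkg_name>/framework.yml under a root directory.
# The actual mechanism scans installed packages, not a plain path.
def discover_frameworks(root: str) -> list[str]:
    return sorted(
        p.parent.name
        for p in Path(root).glob("*/core_evals/*/framework.yml")
    )
```

For the layout created in this step, the scan would report a single framework package named `domain_eval`.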
---
## Step 3: Add Framework Identification
Create `framework.yml` and start with the identification section:
```yaml
framework:
name: domain-eval # Lowercase, hyphenated
pkg_name: domain_eval # Python package name
full_name: Domain Evaluation Framework
description: Evaluates models on domain-specific medical and legal tasks
url: https://github.com/example/domain-eval
```
**Why these fields?**
- `name`: Used in CLI commands (`--framework domain-eval`)
- `pkg_name`: Used for Python imports
- `full_name`: Shows in documentation
- `url`: Links users to your source code
**Test**: This minimal FDF should now be discoverable (but not runnable yet).
---
## Step 4: Map CLI Parameters to Template Variables
Now map your CLI to {{ product_name_short }}'s configuration structure:
| Your CLI Flag | Maps To | FDF Template Variable |
|---------------|---------|----------------------|
| `--model-name` | Model ID | `{{target.api_endpoint.model_id}}` |
| `--api-url` | Endpoint URL | `{{target.api_endpoint.url}}` |
| `--task` | Task name | `{{config.params.task}}` |
| `--temperature` | Temperature | `{{config.params.temperature}}` |
| `--output-dir` | Output path | `{{config.output_dir}}` |
**Action**: Create this mapping for your own framework.
---
## Step 5: Write the Command Template
Add the `defaults` section with your command template:
```yaml
defaults:
command: >-
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
domain-eval
--model-name {{target.api_endpoint.model_id}}
--api-url {{target.api_endpoint.url}}
--task {{config.params.task}}
--temperature {{config.params.temperature}}
--output-dir {{config.output_dir}}
```
**Understanding the template**:
- `{% if ... %}`: Conditional - exports API key if provided
- `{{variable}}`: Placeholder filled with actual values at runtime
- Line breaks are optional (using `>-` makes it readable)
**Common pattern**: Export environment variables before the command runs.
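To see what the template produces at runtime, here is a simplified stand-in for the real Jinja2 rendering. Only plain dotted `{{a.b.c}}` placeholders are handled; actual FDF templates are rendered by Jinja2 and also support `{% if %}` blocks and filters, which this sketch ignores.

```python
import re

# Simplified stand-in for Jinja2 rendering of an FDF command template.
# Handles only dotted {{a.b.c}} placeholders; real templates also use
# {% if %} conditionals, which this sketch ignores.
def render(template: str, context: dict) -> str:
    def lookup(match):
        node = context
        for key in match.group(1).strip().split("."):
            node = node[key]
        return str(node)
    return re.sub(r"\{\{([^}]+)\}\}", lookup, template)

command = (
    "domain-eval --model-name {{target.api_endpoint.model_id}} "
    "--task {{config.params.task}} --output-dir {{config.output_dir}}"
)
context = {
    "target": {"api_endpoint": {"model_id": "meta/llama-3.2-3b-instruct"}},
    "config": {"params": {"task": "medical_qa"}, "output_dir": "./results"},
}
```

Calling `render(command, context)` yields a fully expanded `domain-eval` invocation, which is essentially what runs inside the evaluation container.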
---
## Step 6: Define Default Parameters
Add default configuration values:
```yaml
defaults:
command: >-
# ... command from previous step ...
config:
params:
temperature: 0.0 # Deterministic by default
max_new_tokens: 1024 # Token limit
parallelism: 10 # Concurrent requests
max_retries: 5 # API retry attempts
request_timeout: 60 # Seconds
target:
api_endpoint:
type: chat # Default to chat endpoint
```
**Why defaults?** Users can run evaluations without specifying every parameter.
---
## Step 7: Define Your Evaluation Tasks
Add the specific tasks your framework supports:
```yaml
evaluations:
- name: medical_qa
description: Medical question answering evaluation
defaults:
config:
type: medical_qa
supported_endpoint_types:
- chat
params:
task: medical_qa # Passed to --task flag
- name: legal_reasoning
description: Legal reasoning and case analysis
defaults:
config:
type: legal_reasoning
supported_endpoint_types:
- chat
- completions # Supports both endpoint types
params:
task: legal_reasoning
temperature: 0.0 # Override for deterministic reasoning
```
**Key points**:
- Each evaluation has a unique `name` (used in CLI)
- `supported_endpoint_types` declares API compatibility
- Task-specific `params` override framework defaults
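The layering of framework defaults under task-specific `params` can be sketched as a recursive dict merge. This is illustrative only; the actual merge semantics are defined inside nemo-evaluator, not by this code.

```python
# Illustrative sketch: task-level values override framework defaults,
# while untouched defaults survive. The real merge is performed
# internally by nemo-evaluator; this only models the layering above.
def merge(defaults: dict, overrides: dict) -> dict:
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged

framework_defaults = {"params": {"temperature": 0.0, "parallelism": 10}}
task_overrides = {"params": {"task": "legal_reasoning", "temperature": 0.0}}
```

Merging the two keeps `parallelism: 10` from the framework defaults while adding the task-specific `task` value.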
---
## Step 8: Create the Output Parser
Create `output.py` to parse your framework's results:
```python
def parse_output(output_dir: str) -> dict:
"""Parse evaluation results from your framework's output format."""
import json
from pathlib import Path
# Adapt this to your framework's output format
results_file = Path(output_dir) / "results.json"
with open(results_file) as f:
raw_results = json.load(f)
# Convert to {{ product_name_short }} format
return {
"tasks": {
"medical_qa": {
"name": "medical_qa",
"metrics": {
"accuracy": raw_results["accuracy"],
"f1_score": raw_results["f1"]
}
}
}
}
```
**What this does**: Translates your framework's output format into {{ product_name_short }}'s standard schema.
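A quick way to sanity-check the parser before wiring it into the FDF is a hypothetical smoke test against a made-up results file in a temporary directory. The `parse_output` here is inlined from the step above so the snippet runs stand-alone; the metric values are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

# Inlined copy of the parse_output from the step above, so this
# smoke test runs stand-alone.
def parse_output(output_dir: str) -> dict:
    results_file = Path(output_dir) / "results.json"
    with open(results_file) as f:
        raw = json.load(f)
    return {
        "tasks": {
            "medical_qa": {
                "name": "medical_qa",
                "metrics": {
                    "accuracy": raw["accuracy"],
                    "f1_score": raw["f1"],
                },
            }
        }
    }

# Hypothetical smoke test: write a fake results.json and parse it.
with tempfile.TemporaryDirectory() as out_dir:
    (Path(out_dir) / "results.json").write_text(
        json.dumps({"accuracy": 0.91, "f1": 0.88})
    )
    parsed = parse_output(out_dir)
```

If the parser is correct, `parsed` contains the metrics under the standard `tasks -> <name> -> metrics` schema.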
---
## Step 9: Test Your FDF
Install your framework package and test:
```bash
# From your-framework/ directory
pip install -e .
# List available evaluations (should show your tasks)
eval-factory list_evals --framework domain-eval
# Run a test evaluation
nemo-evaluator run_eval \
--eval_type medical_qa \
--model_id gpt-3.5-turbo \
--model_url https://api.openai.com/v1/chat/completions \
--model_type chat \
--api_key_name OPENAI_API_KEY \
--output_dir ./test_results \
--overrides "config.params.limit_samples=5"
```
**Expected output**: Your CLI should execute with substituted parameters.
---
## Step 10: Add Conditional Logic (Advanced)
Make parameters optional with Jinja2 conditionals:
```yaml
defaults:
command: >-
domain-eval
--model-name {{target.api_endpoint.model_id}}
--api-url {{target.api_endpoint.url}}
{% if config.params.task is not none %}--task {{config.params.task}}{% endif %}
{% if config.params.temperature is not none %}--temperature {{config.params.temperature}}{% endif %}
{% if config.params.limit_samples is not none %}--num-samples {{config.params.limit_samples}}{% endif %}
--output-dir {{config.output_dir}}
```
**When to use conditionals**: For optional flags that shouldn't appear if not specified.
---
## Complete Example
Here's your full `framework.yml`:
```yaml
framework:
name: domain-eval
pkg_name: domain_eval
full_name: Domain Evaluation Framework
description: Evaluates models on domain-specific tasks
url: https://github.com/example/domain-eval
defaults:
command: >-
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
domain-eval
--model-name {{target.api_endpoint.model_id}}
--api-url {{target.api_endpoint.url}}
--task {{config.params.task}}
--temperature {{config.params.temperature}}
--output-dir {{config.output_dir}}
config:
params:
temperature: 0.0
max_new_tokens: 1024
parallelism: 10
max_retries: 5
request_timeout: 60
target:
api_endpoint:
type: chat
evaluations:
- name: medical_qa
description: Medical question answering
defaults:
config:
type: medical_qa
supported_endpoint_types:
- chat
params:
task: medical_qa
- name: legal_reasoning
description: Legal reasoning tasks
defaults:
config:
type: legal_reasoning
supported_endpoint_types:
- chat
- completions
params:
task: legal_reasoning
```
---
## Next Steps
**Dive deeper into FDF features**: {ref}`framework-definition-file`
**Learn about advanced templating**: {ref}`advanced-features`
**Share your framework**: Package and distribute via PyPI
**Troubleshooting**: {ref}`fdf-troubleshooting`
---
## Common Patterns
### Pattern 1: Framework with Custom CLI Flags
```yaml
command: >-
my-eval --custom-flag {{config.params.extra.custom_value}}
```
Use `extra` dict for framework-specific parameters.
### Pattern 2: Multiple Output Files
```yaml
command: >-
my-eval --results {{config.output_dir}}/results.json
--logs {{config.output_dir}}/logs.txt
```
Organize outputs in subdirectories using `output_dir`.
### Pattern 3: Environment Variable Setup
```yaml
command: >-
export HF_TOKEN=${{target.api_endpoint.api_key}} &&
export TOKENIZERS_PARALLELISM=false &&
my-eval ...
```
Set environment variables before execution.
---
## Summary
You've learned how to:
✅ Create the FDF directory structure
✅ Map your CLI to template variables
✅ Write Jinja2 command templates
✅ Define default parameters
✅ Declare evaluation tasks
✅ Create output parsers
✅ Test your integration
**Your framework is now integrated with {{ product_name_short }}!**
# How-To Guides
These practical, task-oriented guides walk you through specific configurations and workflows in NeMo Evaluator.
Each guide focuses on a single feature or use case, providing clear instructions to help you accomplish common tasks efficiently.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`comment;1.5em;sd-mr-1` Remove Reasoning Traces
:link: how-to-reasoning
:link-type: ref
Configure NeMo Evaluator Launcher for evaluating reasoning models.
:::
:::{grid-item-card} {octicon}`arrow-switch;1.5em;sd-mr-1` Switch Executor
:link: how-to-switch-executors
:link-type: ref
Learn how to switch between execution backends (e.g., converting a config from local to Slurm).
:::
::::
:::{toctree}
:caption: How-To Guides
:hidden:
reasoning
local-to-slurm
:::
(how-to-reasoning)=
# Remove Reasoning Traces
This guide walks you through configuring NeMo Evaluator Launcher for evaluating reasoning models. It shows how to:
- adjust sampling parameters
- remove reasoning traces from the answer
- control the thinking budget

These steps help ensure accurate benchmark evaluation.
:::{tip}
Need more in-depth explanation? See the {ref}`run-eval-reasoning` guide.
:::
## Before You Start
Ensure you have:
- **Model Endpoint**: An OpenAI-compatible API reasoning endpoint for your model (completions or chat). See {ref}`deployment-testing-compatibility` for snippets you can use to test your endpoint and {ref}`run-eval-reasoning` for details on reasoning models.
- **API Access**: Valid API key if your endpoint requires authentication
- **Installed Packages**: NeMo Evaluator or access to evaluation containers
## Prepare your config file
### Configure the Evaluation
1. Select tasks:
```yaml
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
```
2. Adjust sampling parameters for a reasoning model, e.g.:
```yaml
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
config:
params:
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768 # for reasoning + final answer
request_timeout: 3600 # long timeout to account for thinking time
parallelism: 1 # single parallel request to avoid overloading the server
```
3. Enable Reasoning Interceptor to remove reasoning traces from the model's responses:
```yaml
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
config:
params:
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768 # for reasoning + final answer
request_timeout: 3600 # long timeout to account for thinking time
parallelism: 1 # single parallel request to avoid overloading the server
target:
api_endpoint:
adapter_config:
interceptors:
- name: endpoint
- name: reasoning
```
In this example we will use [NVIDIA-Nemotron-Nano-9B-v2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2), which produces reasoning traces in a `...` format.
If your model uses different formatting, configure the interceptor as shown in {ref}`run-eval-reasoning`.
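The interceptor's effect can be sketched in a few lines of Python. This is an illustration only, not the interceptor's actual implementation, and the `<think>`/`</think>` tag names are an assumption you should match to your model's output format:

```python
import re

def strip_reasoning(text: str, start: str = "<think>", end: str = "</think>") -> str:
    """Remove the reasoning span so that only the final answer is scored."""
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", text, flags=re.DOTALL).strip()

raw = "<think>2 + 2 is 4, so the answer is 4.</think>The answer is 4."
print(strip_reasoning(raw))  # -> The answer is 4.
```

Responses that contain no reasoning tags pass through unchanged, which is why enabling the interceptor is safe even for prompts where the model skips thinking.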
4. (Optional) Modify the request to turn the reasoning on.
In this example we work with an endpoint that requires "/think" to be present in the system message to enable reasoning.
We will use the Interceptor to add it to the request.
Adjust the example below to match your endpoint (see detailed instructions in {ref}`run-eval-reasoning`).
```yaml
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
target:
api_endpoint:
adapter_config:
interceptors:
- name: system_message
config:
system_message: "/think"
- name: endpoint
- name: reasoning
```
### Select your execution backend and deployment specification
For the purpose of this example, we will use local execution without deployment.
See other How-to guides to adjust this example to your needs.
1. Configure local executor
```yaml
defaults:
- execution: local
- _self_
execution:
output_dir: nel-results
```
2. Configure target endpoint
```yaml
defaults:
- execution: local
- deployment: none
- _self_
execution:
output_dir: nel-results
target:
api_endpoint:
# see https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2 for endpoint details
model_id: nvidia/nvidia-nemotron-nano-9b-v2
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com
```
### The Full Config
Combine all components into a config file for your experiment:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
execution:
output_dir: nel-results
target:
api_endpoint:
# see https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2 for endpoint details
model_id: nvidia/nvidia-nemotron-nano-9b-v2
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
config:
params:
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768 # for reasoning + final answer
request_timeout: 3600 # long timeout to account for thinking time
parallelism: 1 # single parallel request to avoid overloading the server
target:
api_endpoint:
adapter_config:
interceptors:
- name: system_message
config:
system_message: "/think"
- name: endpoint
- name: reasoning
```
## Verify and execute your experiment
1. Save the prepared config in a file, e.g. `nemotron_eval.yaml`
2. (Recommended) Inspect the configuration with `--dry_run`
```bash
export NGC_API_KEY=nvapi-your-key
nemo-evaluator-launcher run --config nemotron_eval.yaml --dry_run
```
3. (Recommended) Run a short experiment with 10 samples per benchmark to verify your config
```bash
export NGC_API_KEY=nvapi-your-key
nemo-evaluator-launcher run --config nemotron_eval.yaml \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```
:::{tip}
If everything works correctly you should see logs from the `ResponseReasoningInterceptor` similar to the ones below:
```bash
[I 2025-12-02T16:14:28.257] Reasoning tracking information reasoning_words=1905 original_content_words=85 updated_content_words=85 reasoning_finished=True reasoning_started=True reasoning_tokens=unknown updated_content_tokens=unknown logger=ResponseReasoningInterceptor request_id=ccff76b2-2b85-4eed-a9d0-2363b533ae58
```
:::
4. Run the full experiment
```bash
export NGC_API_KEY=nvapi-your-key
nemo-evaluator-launcher run --config nemotron_eval.yaml
```
5. Analyze the metrics and reasoning statistics
After evaluation completes, check these key artifacts:
- **`results.yaml`**: Contains the benchmark metrics (see {ref}`evaluation-output`)
- **`eval_factory_metrics.json`**: Contains reasoning statistics under the `reasoning` key, including:
- `responses_with_reasoning`: How many responses included reasoning traces
- `reasoning_finished_count` vs `reasoning_started_count`: If these match, your `max_new_tokens` was sufficient
- `reasoning_unfinished_count`: Number of responses where reasoning started but was truncated (didn't reach end token)
  - `reasoning_finished_ratio`: Ratio (0-1) of responses whose reasoning completed to all responses that contained reasoning
  - `avg_reasoning_words` and other word- and token-count metrics: Use these for cost analysis
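To make the relationships between these counters concrete, here is a small sketch with made-up numbers; the variable names follow the fields listed above, but the values are hypothetical:

```python
# Hypothetical counter values shaped like the `reasoning` section
# of eval_factory_metrics.json.
reasoning_started_count = 200
reasoning_finished_count = 190

# Responses where reasoning started but was cut off before the end token:
reasoning_unfinished_count = reasoning_started_count - reasoning_finished_count

# Fraction (0-1) of reasoning responses that completed:
reasoning_finished_ratio = reasoning_finished_count / reasoning_started_count

print(reasoning_unfinished_count)  # -> 10
print(reasoning_finished_ratio)    # -> 0.95
# A ratio below 1.0 suggests max_new_tokens truncated some reasoning traces.
```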
:::{tip}
For detailed explanation of reasoning statistics and artifacts, see {ref}`run-eval-reasoning`.
:::
(how-to-switch-executors)=
# Switch Executor
With NeMo Evaluator, you can choose how your evaluations run: locally using Docker, on clusters with Slurm, or through other backends - all managed via _executors_. In this guide you will learn how to switch from one executor to another.
For the purpose of this exercise we will use `local` and `slurm` executors with `vllm` model deployment.
:::{tip}
Learn more about the {ref}`execution-backend` concept and the {ref}`executors-overview` overview for details on available executors and their configuration.
:::
## Before You Start
Ensure you have:
- NeMo Evaluator Launcher config that you would like to run. You can use the config shown below, choose one of our [example configs](https://github.com/NVIDIA-NeMo/Evaluator/tree/main/packages/nemo-evaluator-launcher/examples) or prepare your own configuration.
- NeMo Evaluator Launcher installed in your environment.
- Access to a Slurm cluster with appropriate partitions/queues (needed for the Slurm executor in this guide)
- [Pyxis SPANK plugin](https://github.com/NVIDIA/pyxis) installed on the cluster
## Starting Point: Config for Modifying
We will use the following config as our starting point:
```yaml
defaults:
- execution: local
- deployment: vllm
- _self_
# set required execution arguments
execution:
output_dir: local_results
deployment:
checkpoint_path: null
hf_model_handle: microsoft/Phi-4-mini-instruct
served_model_name: microsoft/Phi-4-mini-instruct
tensor_parallel_size: 1
data_parallel_size: 1
evaluation:
tasks:
- name: ifeval # chat benchmark will automatically use v1/chat/completions endpoint
- name: gsm8k # completions benchmark will automatically use v1/completions endpoint
```
This config deploys Phi-4-mini-instruct with vLLM and evaluates it on the IFEval and GSM8K benchmarks.
The workflow runs locally on the machine from which you launch it.
## Modify the Config
To permanently switch to a different execution backend, replace the execution section of your config:
```yaml
defaults:
- execution: local # old executor: local
- deployment: vllm
- _self_
execution:
output_dir: local_results # path on your local machine
```
with a different one:
```yaml
defaults:
- execution: slurm/default # new executor: slurm
- deployment: vllm
- _self_
execution:
hostname: my-cluster.login.com # SLURM headnode (login) hostname
account: my_account # SLURM account allocation
output_dir: /absolute/path/on/remote # ABSOLUTE path accessible to SLURM compute nodes
```
This will allow you to run the same deployment and evaluation workflow on a remote Slurm cluster.
If you only want to change the executor, there's no need to update other sections of your config.
## Dynamically switch executor with CLI overrides
You can also specify a different execution backend at runtime to dynamically switch from one executor to another:
```bash
export CLUSTER_HOSTNAME=my-cluster.login.com # SLURM headnode (login) hostname
export ACCOUNT=my_account # SLURM account allocation
export OUT_DIR=/absolute/path/on/remote # ABSOLUTE path accessible to SLURM compute nodes
nel run --config local_config.yaml \
-o execution=slurm/default \
-o execution.hostname=$CLUSTER_HOSTNAME \
-o execution.account=$ACCOUNT \
-o execution.output_dir=$OUT_DIR
```
This also allows you to easily switch from one Slurm cluster to another.
(tutorials-local-eval-existing-endpoint)=
# Local Evaluation of Existing Endpoint
This tutorial shows how to evaluate an existing API endpoint using the Local executor.
## Prerequisites
- Docker
- Python environment with the NeMo Evaluator Launcher CLI available (install the launcher by following {ref}`gs-install`)
## Step-by-Step Guide
### 1. Select a Model
You have the following options:
#### Option I: Use the NVIDIA Build API
- **URL**: `https://integrate.api.nvidia.com/v1/chat/completions`
- **Models**: Choose any endpoint from NVIDIA Build's extensive catalog
- **API Key**: Get from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). See [Setting up API Keys](https://docs.omniverse.nvidia.com/guide-sdg/latest/setup.html#preview-and-set-up-an-api-key).
Make sure to export the API key:
```bash
export NGC_API_KEY=nvapi-...
```
#### Option II: Another Hosted Endpoint
- **URL**: Your model's endpoint URL
- **Models**: Any OpenAI-compatible endpoint
- **API_KEY**: If your endpoint is gated, get an API key from your provider and export it:
```bash
export API_KEY=...
```
#### Option III: Deploy Your Own Endpoint
Deploy an OpenAI-compatible endpoint using frameworks like vLLM, SGLang, TRT-LLM, or NIM.
:::{note}
For this tutorial, we will use `meta/llama-3.2-3b-instruct` from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). You will need to export your `NGC_API_KEY` to access this endpoint.
:::
### 2. Select Tasks
Choose which benchmarks to evaluate. You can list all available tasks with the following command:
```bash
nemo-evaluator-launcher ls tasks
```
For a comprehensive list of supported tasks and descriptions, see {ref}`nemo-evaluator-containers`.
**Important**: Each task has a dedicated endpoint type (e.g., `/v1/chat/completions`, `/v1/completions`). Ensure that your model provides the correct endpoint type for the tasks you want to evaluate. Use our {ref}`deployment-testing-compatibility` guide to verify your endpoint supports the required formats.
:::{note}
For this tutorial we will pick: `ifeval` and `humaneval_instruct` as these are fast. They both use the chat endpoint.
:::
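The difference between the two endpoint types comes down to the request body. The sketch below shows the minimal OpenAI-compatible payload for each, reusing the model name from this tutorial; any fields beyond these are optional:

```python
import json

# Chat benchmarks (e.g. ifeval) post a messages list to /v1/chat/completions:
chat_payload = {
    "model": "meta/llama-3.2-3b-instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
}

# Completions benchmarks post a raw prompt string to /v1/completions:
completions_payload = {
    "model": "meta/llama-3.2-3b-instruct",
    "prompt": "Say hello.",
}

print(json.dumps(chat_payload, indent=2))
print(json.dumps(completions_payload, indent=2))
```

An endpoint that only accepts one of these body shapes can only serve tasks of the matching type.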
### 3. Create a Configuration File
Create a `configs` directory:
```bash
mkdir configs
```
Create a configuration file with a descriptive name (e.g., `configs/local_endpoint.yaml`)
and populate it with the following content:
```yaml
defaults:
- execution: local # The evaluation will run locally on your machine using Docker
- deployment: none # Since we are evaluating an existing endpoint, we don't need to deploy the model
- _self_
execution:
output_dir: results/${target.api_endpoint.model_id} # Logs and artifacts will be saved here
mode: sequential # Default: run tasks sequentially. You can also use the mode 'parallel'
target:
api_endpoint:
model_id: meta/llama-3.2-3b-instruct # TODO: update to the model you want to evaluate
url: https://integrate.api.nvidia.com/v1/chat/completions # TODO: update to the endpoint you want to evaluate
api_key_name: NGC_API_KEY # Name of the env variable that stores the API Key with access to build.nvidia.com (or model of your choice)
# specify the benchmarks to evaluate
evaluation:
# Optional: Global evaluation overrides - these apply to all benchmarks below
nemo_evaluator_config:
config:
params:
parallelism: 2
request_timeout: 1600
tasks:
- name: ifeval # use the default benchmark configuration
- name: humaneval_instruct
# Optional: Task overrides - here they apply only to the task `humaneval_instruct`
nemo_evaluator_config:
config:
params:
max_new_tokens: 1024
temperature: 0.3
```
This configuration will create evaluations for 2 tasks: `ifeval` and `humaneval_instruct`.
You can display the whole configuration and scripts which will be executed using `--dry-run`:
```bash
nemo-evaluator-launcher run --config configs/local_endpoint.yaml --dry-run
```
### 4. Run the Evaluation
Once your configuration file is complete, you can run the evaluations:
```bash
nemo-evaluator-launcher run --config configs/local_endpoint.yaml
```
### 5. Run the Same Evaluation for a Different Model (Using CLI Overrides)
You can override the values from your configuration file using CLI overrides:
```bash
export API_KEY=
MODEL_NAME=
URL= # Note: endpoint URL needs to be FULL (e.g., https://api.example.com/v1/chat/completions)
nemo-evaluator-launcher run --config configs/local_endpoint.yaml \
-o target.api_endpoint.model_id=$MODEL_NAME \
-o target.api_endpoint.url=$URL \
-o target.api_endpoint.api_key_name=API_KEY
```
### 6. Check the Job Status and Results
List the runs from the last 2 hours to see the invocation IDs of the two evaluation jobs:
```bash
nemo-evaluator-launcher ls runs --since 2h # list runs from last 2 hours
```
Use the IDs to check the job statuses:
```bash
nemo-evaluator-launcher status --json
```
When jobs finish, you can display results and export them using the available exporters:
```bash
# Check the results
cat results/*/artifacts/results.yml
# Check the running logs
tail -f results/*/*/logs/stdout.log # use the output_dir printed by the run command
# Export metrics and metadata from both runs to json
nemo-evaluator-launcher export --dest local --format json
cat processed_results.json
```
Refer to {ref}`exporters-overview` for available export options.
## Next Steps
- **{ref}`evaluation-configuration`**: Customize evaluation parameters and prompts
- **{ref}`executors-overview`**: Try Slurm or Lepton for different environments
- **{ref}`exporters-overview`**: Send results to W&B, MLFlow, or other platforms
# Tutorials for NeMo Framework
## Before You Start
Before starting the tutorials, ensure you have:
- **NeMo Framework Container**: Running the latest [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo)
- **Model Checkpoint**: Access to a Megatron Bridge checkpoint (tutorials use Llama 3.2 1B Instruct converted from Hugging Face format).
- **GPU Resources**: CUDA-compatible GPU with sufficient memory
- **Jupyter Environment**: Ability to run Jupyter notebooks
---
## Available Tutorials
Build your expertise with these progressive tutorials:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Orchestrating evaluations with NeMo Run
:link: nemo-run
:link-type: doc
Launch deployment and evaluation jobs using NeMo Run.
:::
:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Basic evaluation with MMLU Evaluation
:link: https://github.com/NVIDIA-NeMo/Eval/tree/main/tutorials/mmlu.ipynb
:link-type: url
Deploy models and run evaluations with the MMLU benchmark for both completions and chat endpoints.
:::
:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Enable additional evaluation harnesses
:link: https://github.com/NVIDIA-NeMo/Eval/tree/main/tutorials/simple-evals.ipynb
:link-type: url
Discover how to extend evaluation capabilities by installing additional harnesses and running HumanEval coding assessments.
:::
:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Configure custom tasks
:link: https://github.com/NVIDIA-NeMo/Eval/tree/main/tutorials/wikitext.ipynb
:link-type: url
Master custom evaluation workflows by running WikiText benchmark with advanced configuration and log-probability analysis.
:::
::::
## Run the Notebook Tutorials
1. Start NeMo Framework Container:
```bash
# set your Hugging Face token for access to gated datasets and checkpoints
export HF_TOKEN=hf_...
docker run --rm -it -w /workdir -v $(pwd):/workdir \
-e HF_TOKEN \
--entrypoint bash --gpus all \
nvcr.io/nvidia/nemo:${TAG}
```
2. Launch Jupyter:
```bash
jupyter lab --ip=0.0.0.0 --port=8888 --allow-root
```
3. Navigate to the `tutorials/` directory and open the desired notebook
:::{toctree}
:caption: Tutorials
:hidden:
nemo-run
:::
# Run Evaluations with NeMo Run
This tutorial explains how to run evaluations inside the NeMo Framework container with NeMo Run.
For detailed information about [NeMo Run](https://github.com/NVIDIA/NeMo-Run), please refer to its documentation.
Below is a concise guide focused on using NeMo Run to perform evaluations in NeMo.
## Prerequisites
- Docker installed
- [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo)
- Access to a NeMo 2.0 checkpoint (tutorials use Llama 3.2 1B Instruct)
- CUDA-compatible GPU with sufficient memory (for running locally) or access to a Slurm-based cluster (for running on a cluster)
- NeMo Evaluator repository cloned (for access to [scripts](https://github.com/NVIDIA-NeMo/Evaluator/tree/main/scripts))
```bash
git clone https://github.com/NVIDIA-NeMo/Evaluator.git
```
- (Optional) Your Hugging Face token if you are using gated datasets (e.g. [GPQA-Diamond dataset](https://huggingface.co/datasets/Idavidrein/gpqa)).
## How it works
The [evaluation_with_nemo_run.py](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/scripts/evaluation_with_nemo_run.py) script serves as a reference for launching evaluations with NeMo Run.
This script demonstrates how to use NeMo Run with both local executors (your local workstation) and Slurm-based executors (clusters).
In this setup, the deploy and evaluate processes are launched as two separate jobs with NeMo Run. The evaluate method waits until the PyTriton server is accessible and the model is deployed before starting the evaluations.
For this purpose we define a helper function:
```{literalinclude} ../../../scripts/helpers.py
:language: python
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
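The helper's core idea - poll the endpoint until it answers or a deadline passes - can be sketched as follows. This is a simplified stand-in using only the standard library; the snippet included above is the authoritative version:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Any HTTP response (even an error status) means the server is up.
            urllib.request.urlopen(url, timeout=5)
            return True
        except urllib.error.HTTPError:
            return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)
    return False
```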
The script supports two types of serving: with Triton (default) and with Ray (pass `--serving_backend ray` flag).
User-provided arguments are mapped onto flags expected by the script:
```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py
:language: python
:start-after: "# [snippet-deploy-start]"
:end-before: "# [snippet-deploy-end]"
```
The script supports two modes of running the experiment:
- locally, using your environment
- remotely, sending the job to the Slurm-based cluster
First, an executor is selected based on the arguments provided by the user, either a local one:
```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py
:language: python
:start-after: "# [snippet-local-executor-start]"
:end-before: "# [snippet-local-executor-end]"
```
or a Slurm one:
```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py
:language: python
:start-after: "# [snippet-slurm-executor-start]"
:end-before: "# [snippet-slurm-executor-end]"
```
:::{note}
Please make sure to update `HF_TOKEN` with your token:
- in the NeMo Run script's [local_executor env_vars](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/scripts/evaluation_with_nemo_run.py#L274) if using the local executor
- in the [slurm_executor's env_vars](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/scripts/evaluation_with_nemo_run.py#L237) if using the Slurm executor.
:::
Then, the two jobs are configured:
```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py
:language: python
:start-after: "# [snippet-config-start]"
:end-before: "# [snippet-config-end]"
```
Finally, the experiment is started:
```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py
:language: python
:start-after: "# [snippet-experiment-start]"
:end-before: "# [snippet-experiment-end]"
```
## Run Locally
To run evaluations on your local workstation, use the following command:
```bash
cd Evaluator/scripts
python evaluation_with_nemo_run.py \
--nemo_checkpoint '/workspace/llama3_8b_nemo2/' \
--eval_task 'gsm8k' \
--devices 2
```
:::{note}
When running locally with NeMo Run, you will need to manually terminate the deploy process once evaluations are complete.
:::
## Run on Slurm-based Clusters
To run evaluations on Slurm-based clusters, add the `--slurm` flag to your command and specify any custom parameters such as user, host, remote_job_dir, account, mounts, etc. Refer to the `evaluation_with_nemo_run.py` script for further details. Below is an example command:
```bash
cd Evaluator/scripts
python evaluation_with_nemo_run.py \
--nemo_checkpoint='/workspace/llama3_8b_nemo2' \
--slurm --nodes 1 \
--devices 8 \
--container_image "nvcr.io/nvidia/nemo:25.11" \
--tensor_parallelism_size 8
```
(evaluation-overview)=
# About Evaluation
Evaluate LLMs, VLMs, agentic systems, and retrieval models across 100+ benchmarks using unified workflows.
## Before You Start
Before you run evaluations, ensure you have:
1. **Chosen your approach**: See {ref}`get-started-overview` for installation and setup guidance
2. **Deployed your model**: See {ref}`deployment-overview` for deployment options
3. **OpenAI-compatible endpoint**: Your model must expose a compatible API (see {ref}`deployment-testing-compatibility`).
4. **API credentials**: Access tokens for your model endpoint and Hugging Face Hub.
---
## Quick Start: Academic Benchmarks
:::{admonition} Fastest path to evaluate academic benchmarks
:class: tip
**For researchers and data scientists**: Evaluate your model on standard academic benchmarks in 3 steps.
**Step 1: Choose Your Approach**
- **Launcher CLI** (Recommended): `nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml`
- **Python API**: Direct programmatic control with `evaluate()` function
**Step 2: Select Benchmarks**
Common academic suites:
- **General Knowledge**: `mmlu_pro`, `gpqa_diamond`
- **Mathematical Reasoning**: `AIME_2025`, `mgsm`
- **Instruction Following**: `ifbench`, `mtbench`
Discover all available tasks:
```bash
nemo-evaluator-launcher ls tasks
```
**Step 3: Run Evaluation**
Create `config.yml`:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
evaluation:
tasks:
- name: mmlu_pro
- name: ifbench
```
Launch the job:
```bash
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
```
:::
---
## Evaluation Workflows
Select a workflow based on your environment and desired level of control.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher Workflows
:link: ../get-started/quickstart/launcher
:link-type: doc
Unified CLI for running evaluations across local, Slurm, and cloud backends with built-in result export.
:::
:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` Core API Workflows
:link: ../libraries/nemo-evaluator/workflows/python-api
:link-type: doc
Programmatic evaluation using Python API for integration into ML pipelines and custom workflows.
:::
:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Container Workflows
:link: ../libraries/nemo-evaluator/containers/index
:link-type: doc
Direct container access for specialized use cases and custom evaluation environments.
:::
::::
## Configuration and Customization
Configure your evaluations, create custom tasks, explore benchmarks, and extend the framework with these guides.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`list-unordered;1.5em;sd-mr-1` Benchmark Catalog
:link: eval-benchmarks
:link-type: ref
Explore 100+ available benchmarks across 18 evaluation harnesses and their specific use cases.
:::
:::{grid-item-card} {octicon}`plus;1.5em;sd-mr-1` Extend Framework
:link: ../libraries/nemo-evaluator/extending/framework-definition-file/index
:link-type: doc
Add custom evaluation frameworks using Framework Definition Files for specialized benchmarks.
:::
::::
## Advanced Features
Scale your evaluations, export results, customize adapters, and resolve issues with these advanced features.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Multi-Backend Execution
:link: ../libraries/nemo-evaluator-launcher/configuration/executors/index
:link-type: doc
Run evaluations on local machines, HPC clusters, or cloud platforms with unified configuration.
:::
:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Result Export
:link: ../libraries/nemo-evaluator-launcher/exporters/index
:link-type: doc
Export evaluation results to MLflow, Weights & Biases, Google Sheets, and other platforms.
:::
:::{grid-item-card} {octicon}`shield;1.5em;sd-mr-1` Adapter System
:link: ../libraries/nemo-evaluator/interceptors/index
:link-type: doc
Configure request/response processing, logging, caching, and custom interceptors.
:::
::::
## Core Evaluation Concepts
- For architectural details and core concepts, refer to {ref}`evaluation-model`.
- For container specifications, refer to {ref}`nemo-evaluator-containers`.
(eval-benchmarks)=
# About Selecting Benchmarks
NeMo Evaluator provides a comprehensive suite of benchmarks spanning academic reasoning, code generation, safety testing, and domain-specific evaluations. Whether you're validating a new model's capabilities or conducting rigorous academic research, you'll find the right benchmarks to assess your AI system's performance.
See {ref}`benchmarks-full-list` for the complete catalog of available benchmarks.
## Available via Launcher
```{literalinclude} ../_snippets/commands/list_tasks.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
## Available via Direct Container Access
```{literalinclude} ../_snippets/commands/list_tasks_core.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
## Choosing Benchmarks for Academic Research
:::{admonition} Benchmark Selection Guide
:class: tip
**For General Knowledge**:
- `mmlu_pro` - Expert-level knowledge across 14 domains
- `gpqa_diamond` - Graduate-level science questions
**For Mathematical & Quantitative Reasoning**:
- `AIME_2025` - American Invitational Mathematics Examination (AIME) 2025 questions
- `mgsm` - Multilingual math reasoning
**For Instruction Following & Alignment**:
- `ifbench` - Precise instruction following
- `mtbench` - Multi-turn conversation quality
See benchmark categories below and {ref}`benchmarks-full-list` for more details.
:::
## Benchmark Categories
### **Academic and Reasoning**
```{list-table}
:header-rows: 1
:widths: 20 30 30 50
* - Container
- Description
- NGC Catalog
- Benchmarks
* - **simple-evals**
- Common evaluation tasks
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals)
- GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench
* - **lm-evaluation-harness**
- Language model benchmarks
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness)
- ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande
* - **hle**
- Academic knowledge and problem solving
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle)
- HLE
* - **ifbench**
- Instruction following
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench)
- IFBench
* - **mtbench**
- Multi-turn conversation evaluation
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench)
- MT-Bench
* - **nemo-skills**
- Language model benchmarks (science, math, agentic)
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills)
- AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro
* - **profbench**
  - Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA domains
  - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench)
  - Report Generation, LLM Judge
```
:::{note}
BFCL tasks from the nemo-skills container require function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible.
:::
**Example Usage:**
Create `config.yml`:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
evaluation:
tasks:
- name: ifeval
- name: gsm8k_cot_instruct
- name: gpqa_diamond
```
Run evaluation:
```bash
export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### **Code Generation**
```{list-table}
:header-rows: 1
:widths: 20 30 30 50
* - Container
- Description
- NGC Catalog
- Benchmarks
* - **bigcode-evaluation-harness**
- Code generation evaluation
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness)
- MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts)
* - **livecodebench**
- Coding
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench)
- LiveCodeBench (v1-v6, 0724_0125, 0824_0225)
* - **scicode**
- Coding for scientific research
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode)
- SciCode
```
**Example Usage:**
Create `config.yml`:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
evaluation:
tasks:
    - name: humaneval_instruct
    - name: mbpp
```
Run evaluation:
```bash
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### **Safety and Security**
```{list-table}
:header-rows: 1
:widths: 20 30 30 50
* - Container
- Description
- NGC Catalog
- Benchmarks
* - **garak**
- Safety and vulnerability testing
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak)
- Garak
* - **safety-harness**
- Safety and bias evaluation
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness)
- Aegis v2, WildGuard
```
**Example Usage:**
Create `config.yml`:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
evaluation:
tasks:
- name: aegis_v2
- name: garak
```
Run evaluation:
```bash
export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### **Function Calling**
```{list-table}
:header-rows: 1
:widths: 20 30 30 50
* - Container
- Description
- NGC Catalog
- Benchmarks
* - **bfcl**
- Function calling
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl)
- BFCL v2 and v3
* - **tooltalk**
- Tool usage evaluation
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk)
- ToolTalk
```
:::{note}
Some of the tasks in this category require function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible.
:::
**Example Usage:**
Create `config.yml`:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
evaluation:
tasks:
- name: bfclv2_ast_prompting
- name: tooltalk
```
Run evaluation:
```bash
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
```
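For reference, function-calling benchmarks such as BFCL and ToolTalk send OpenAI-style chat requests that include a `tools` array. The sketch below is a hypothetical payload (the `get_weather` tool schema is illustrative and not part of the SDK); it shows the request shape you can use to probe whether an endpoint accepts tool definitions:

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format; the
# model ID mirrors the example configuration above.
payload = {
    "model": "meta/llama-3.2-3b-instruct",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

# POST this body to your chat/completions endpoint; an endpoint that
# supports function calling responds with a `tool_calls` entry rather
# than (or alongside) plain text content.
print(json.dumps(payload, indent=2))
```

An endpoint that rejects the `tools` field, or never emits `tool_calls`, will not score well on these benchmarks regardless of model quality.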
### **Vision-Language Models**
```{list-table}
:header-rows: 1
:widths: 20 30 30 50
* - Container
- Description
- NGC Catalog
- Benchmarks
* - **vlmevalkit**
- Vision-language model evaluation
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit)
- AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA
```
:::{note}
The tasks in this category require a VLM chat endpoint. Refer to {ref}`deployment-testing-compatibility` to check whether your endpoint is compatible.
:::
**Example Usage:**
Create `config.yml`:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
evaluation:
tasks:
- name: ocrbench
- name: chartqa
```
Run evaluation:
```bash
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### **Domain-Specific**
```{list-table}
:header-rows: 1
:widths: 20 30 30 50
* - Container
- Description
- NGC Catalog
- Benchmarks
* - **helm**
- Holistic evaluation framework
- [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm)
- MedHELM
```
**Example Usage:**
Create `config.yml`:
```yaml
defaults:
- execution: local
- deployment: none
- _self_
evaluation:
tasks:
- name: pubmed_qa
- name: medcalc_bench
```
Run evaluation:
```bash
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
```
## Container Details
For detailed specifications of each container, see {ref}`nemo-evaluator-containers`.
### Quick Container Access
Pull and run any evaluation container directly:
```bash
# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }}
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }}
# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }}
docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }}
```
### Available Tasks by Container
For a complete list of available tasks in each container:
```bash
# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls
# Or use the launcher for unified access
nemo-evaluator-launcher ls tasks
```
## Integration Patterns
NeMo Evaluator provides multiple integration options to fit your workflow:
```bash
# Launcher CLI (recommended for most users)
nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher run --config ./local_mmlu_evaluation.yaml
# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls
# Python API (for programmatic control)
# See the Python API documentation for details
```
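If you want programmatic control without using the core Python API directly, one option is to wrap the launcher CLI in a small helper. The `launcher_cmd` function below is a hypothetical convenience, not part of the SDK; it only assembles the same `run` flags shown above:

```python
import shlex
import subprocess

def launcher_cmd(config_path: str, overrides: dict) -> list[str]:
    """Build a nemo-evaluator-launcher invocation from a config path
    and a mapping of `-o` overrides (hypothetical helper)."""
    cmd = ["nemo-evaluator-launcher", "run", "--config", config_path]
    for key, value in overrides.items():
        cmd += ["-o", f"{key}={value}"]
    return cmd

cmd = launcher_cmd(
    "./local_mmlu_evaluation.yaml",
    {"execution.output_dir": "results"},
)
print(shlex.join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to launch the evaluation
```

This keeps secrets in environment variables (as in the examples above) while letting a Python script or CI job drive the evaluation.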
## Benchmark Selection Best Practices
### For Model Development
**Iterative Testing**:
- Start with `limit_samples=100` for quick feedback during development
- Run full evaluations before major releases
- Track metrics over time to measure improvement
**Configuration**:
```python
# Development testing
params = ConfigParams(
limit_samples=100, # Quick iteration
temperature=0.01, # Deterministic
parallelism=4
)
# Production evaluation
params = ConfigParams(
limit_samples=None, # Full dataset
temperature=0.01, # Deterministic
parallelism=8 # Higher throughput
)
```
### For Specialized Domains
- **Code Models**: Focus on `humaneval`, `mbpp`, `livecodebench`
- **Instruction Models**: Emphasize `ifbench`, `mtbench`
- **Multilingual Models**: Include `arc_multilingual`, `hellaswag_multilingual`, `mgsm`
- **Safety-Critical**: Prioritize `safety-harness` and `garak` evaluations
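As an example, a code-model evaluation can combine several of these tasks in one configuration, following the same `config.yml` pattern used throughout this page (verify task availability with `nemo-evaluator-launcher ls tasks`):

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_
evaluation:
  tasks:
    - name: humaneval
    - name: mbpp-chat
    - name: livecodebench_0824_0225
```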
## Next Steps
- **Container Details**: Browse {ref}`nemo-evaluator-containers` for complete specifications
- **Custom Benchmarks**: Learn {ref}`framework-definition-file` for custom evaluations
:::{toctree}
:caption: Harnesses
:hidden:
AA-LCR
bfcl
bigcode-evaluation-harness
codec
garak
genai_perf_eval
helm
hle
ifbench
livecodebench
lm-evaluation-harness
mmath
mtbench
mteb
nemo_skills
profbench
ruler
safety_eval
scicode
simple_evals
tau2_bench
tooltalk
vlmevalkit
:::
```{list-table}
:header-rows: 1
:widths: 18 30 18 8 26
* - Container
- Description
- Container Ref
- Arch
- Tasks
* - AA-LCR
- A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
- `26.01`
- `multiarch`
- aa_lcr
* - bfcl
- The Berkeley Function Calling Leaderboard V3 (also called Berkeley Tool Calling Leaderboard V3) evaluates the LLM's ability to call functions (aka tools) accurately.
- `26.01`
- `multiarch`
- bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting
* - bigcode-evaluation-harness
- A framework for the evaluation of autoregressive code generation language models.
- `26.01`
- `multiarch`
- humaneval, humaneval_instruct, humanevalplus, mbpp-chat, mbpp-completions, mbppplus-chat, mbppplus-completions, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts
* - codec
- Contamination detection framework for evaluating language models
- `26.01`
- `amd`
- aime_2024, aime_2025, bbq, bfcl_v3, frames, gpqa_diamond, gsm8k_test, gsm8k_train, hellaswag_test, hellaswag_train, hle, ifbench, ifeval, livecodebench_v1, livecodebench_v5, math_500_problem, math_500_solution, mmlu_pro_test, mmlu_test, openai_humaneval, reward_bench_v1, reward_bench_v2, scicode, swebench_test, swebench_train, taubench, terminalbench
* - garak
- Garak is an LLM vulnerability scanner.
- `26.01`
- `multiarch`
- garak, garak-completions
* - genai_perf_eval
- A tool to evaluate the performance of LLM endpoints, built on NVIDIA GenAI-Perf.
- `26.01`
- `amd`
- genai_perf_generation, genai_perf_generation_completions, genai_perf_summarization, genai_perf_summarization_completions
* - helm
- A framework for evaluating large language models in medical applications across various healthcare tasks
- `26.01`
- `amd`
- aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med
* - hle
- Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading.
- `26.01`
- `multiarch`
- hle, hle_aa_v2
* - ifbench
- IFBench is a new, challenging benchmark for precise instruction following.
- `26.01`
- `multiarch`
- ifbench, ifbench_aa_v2
* - livecodebench
- Holistic and Contamination Free Evaluation of Large Language Models for Code.
- `26.01`
- `multiarch`
- codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225, livecodebench_aa_v2, testoutputprediction
* - lm-evaluation-harness
- This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
- `26.01`
- `multiarch`
- adlr_agieval_en_cot, adlr_arc_challenge_llama_25_shot, adlr_commonsense_qa_7_shot, adlr_global_mmlu_lite_5_shot, adlr_gpqa_diamond_cot_5_shot, adlr_gsm8k_cot_8_shot, adlr_humaneval_greedy, adlr_humaneval_sampled, adlr_math_500_4_shot_sampled, adlr_mbpp_sanitized_3_shot_greedy, adlr_mbpp_sanitized_3_shot_sampled, adlr_mgsm_native_cot_8_shot, adlr_minerva_math_nemo_4_shot, adlr_mmlu, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, adlr_winogrande_5_shot, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq_chat, bbq_completions, commonsense_qa, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str_chat, m_mmlu_id_str_completions, mbpp_plus_chat, 
mbpp_plus_completions, mgsm, mgsm_cot_chat, mgsm_cot_completions, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_instruct_completions, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox_chat, mmlu_prox_completions, mmlu_prox_de_chat, mmlu_prox_de_completions, mmlu_prox_es_chat, mmlu_prox_es_completions, mmlu_prox_fr_chat, mmlu_prox_fr_completions, mmlu_prox_it_chat, mmlu_prox_it_completions, mmlu_prox_ja_chat, mmlu_prox_ja_completions, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, wikitext, winogrande
* - mmath
- MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency.
- `26.01`
- `multiarch`
- mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh
* - mtbench
- MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models.
- `26.01`
- `multiarch`
- mtbench, mtbench-cor1
* - mteb
- The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark designed to evaluate the performance of text embedding models across a wide range of tasks and datasets. It includes 58 datasets covering 8 tasks and 112 languages.
- `26.01`
- `multiarch`
- MMTEB, MTEB, MTEB_NL_RETRIEVAL, MTEB_VDR, RTEB, ViDoReV1, ViDoReV2, ViDoReV3, ViDoReV3_Text, ViDoReV3_Text_Image, custom_beir_task, fiqa, hotpotqa, miracl, miracl_lite, mldr, mlqa, nano_fiqa, nq
* - nemo_skills
- NeMo Skills is a project for improving the skills of LLMs.
- `26.01`
- `multiarch`
- ns_aa_lcr, ns_aime2024, ns_aime2025, ns_bfcl_v3, ns_bfcl_v4, ns_gpqa, ns_hle, ns_hle_aa, ns_hmmt_feb2025, ns_ifbench, ns_ifeval, ns_livecodebench, ns_livecodebench_aa, ns_livecodebench_v5, ns_mmlu, ns_mmlu_pro, ns_mmlu_prox, ns_ruler, ns_scicode, ns_wmt24pp
* - profbench
- Professional domain benchmark for evaluating LLMs on Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks
- `26.01`
- `multiarch`
- llm_judge, report_generation
* - ruler
- RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity.
- `26.01`
- `multiarch`
- ruler-128k-chat, ruler-128k-completions, ruler-16k-chat, ruler-16k-completions, ruler-1m-chat, ruler-1m-completions, ruler-256k-chat, ruler-256k-completions, ruler-32k-chat, ruler-32k-completions, ruler-4k-chat, ruler-4k-completions, ruler-512k-chat, ruler-512k-completions, ruler-64k-chat, ruler-64k-completions, ruler-8k-chat, ruler-8k-completions, ruler-chat, ruler-completions
* - safety_eval
- A harness for safety evaluations.
- `25.11`
- `multiarch`
- aegis_v2, aegis_v2_reasoning, wildguard
* - scicode
- SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
- `26.01`
- `multiarch`
- scicode, scicode_aa_v2, scicode_background
* - simple_evals
- simple-evals is a lightweight library for evaluating language models.
- `26.01`
- `multiarch`
- AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, AIME_2025_aa_v2, aime_2024_nemo, aime_2025_nemo, browsecomp, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_aa_v3, gpqa_diamond_nemo, gpqa_extended, gpqa_main, healthbench, healthbench_consensus, healthbench_hard, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mgsm_aa_v2, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_aa_v2, mmlu_pro_aa_v3, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa
* - tau2_bench
- Evaluating Conversational Agents in a Dual-Control Environment
- `26.01`
- `multiarch`
- tau2_bench_airline, tau2_bench_retail, tau2_bench_telecom
* - tooltalk
- ToolTalk is designed to evaluate tool-augmented LLMs as chatbots. ToolTalk contains a handcrafted dataset of 28 easy conversations and 50 hard conversations.
- `26.01`
- `multiarch`
- tooltalk
* - vlmevalkit
- VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.
- `26.01`
- `amd`
- ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocr_reasoning, ocrbench, slidevqa
```
# AA-LCR
This page contains all evaluation tasks for the **AA-LCR** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [aa_lcr](#aa-lcr-aa-lcr)
- A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
```
(aa-lcr-aa-lcr)=
## aa_lcr
A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
::::{tab-set}
:::{tab-item} Container
**Harness:** `AA-LCR`
**Container:**
```
nvcr.io/nvidia/eval-factory/aa-lcr:26.01
```
**Container Digest:**
```
sha256:67dd35302ed15610afc9471a2ff4f515d95a235753f1b259db60748249366939
```
**Container Arch:** `multiarch`
**Task Type:** `aa_lcr`
:::
:::{tab-item} Command
```bash
aa_lcr --model={{target.api_endpoint.model_id}} --endpoint_url={{target.api_endpoint.url}} --temperature={{config.params.temperature}} --top_p={{config.params.top_p}} --request_timeout={{config.params.request_timeout}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --output_dir={{config.output_dir}} {% if target.api_endpoint.api_key_name is not none %}--api_key_name={{target.api_endpoint.api_key_name}}{% endif %} --max_retries={{config.params.max_retries}} --max_new_tokens={{config.params.max_new_tokens}} --async_limit={{config.params.parallelism}} --num_repeats={{config.params.extra.n_samples}} --seed={{config.params.extra.seed}} --judge_model={{config.params.extra.judge.model_id}} --judge_url={{config.params.extra.judge.url}} --judge_temperature={{config.params.extra.judge.temperature}} --judge_top_p={{config.params.extra.judge.top_p}} --judge_max_new_tokens={{config.params.extra.judge.max_new_tokens}} --judge_async_limit={{config.params.extra.judge.parallelism}} {% if config.params.extra.judge.api_key is defined %}--judge_api_key_name={{config.params.extra.judge.api_key}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: AA-LCR
pkg_name: aa_lcr
config:
params:
max_new_tokens: 16384
max_retries: 30
parallelism: 10
temperature: 0.0
request_timeout: 600
top_p: 1.0
extra:
n_samples: 3
seed: 42
judge:
url: https://integrate.api.nvidia.com/v1/chat/completions
model_id: nvdev/qwen/qwen-235b
request_timeout: 600
max_retries: 30
temperature: 0.0
top_p: 1.0
max_new_tokens: 1024
parallelism: 10
api_key: JUDGE_API_KEY
supported_endpoint_types:
- chat
type: aa_lcr
target:
api_endpoint: {}
```
:::
::::
# bfcl
This page contains all evaluation tasks for the **bfcl** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [bfclv2](#bfcl-bfclv2)
- BFCL v2 with Single-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling.
* - [bfclv2_ast](#bfcl-bfclv2-ast)
- BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Uses native function calling.
* - [bfclv2_ast_prompting](#bfcl-bfclv2-ast-prompting)
- BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Not using native function calling.
* - [bfclv3](#bfcl-bfclv3)
- BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling.
* - [bfclv3_ast](#bfcl-bfclv3-ast)
- BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Uses native function calling.
* - [bfclv3_ast_prompting](#bfcl-bfclv3-ast-prompting)
- BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Not using native function calling.
```
(bfcl-bfclv2)=
## bfclv2
BFCL v2 with Single-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bfcl`
**Container:**
```
nvcr.io/nvidia/eval-factory/bfcl:26.01
```
**Container Digest:**
```
sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9
```
**Container Arch:** `multiarch`
**Task Type:** `bfclv2`
:::
:::{tab-item} Command
```bash
{%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \
--dataset_format {{config.params.extra.custom_dataset.format}} \
--dataset_path {{config.params.extra.custom_dataset.path}} \
--test_category {{config.params.task}} \
--processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \
{% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \
echo "Using custom dataset at ${BFCL_DATA_DIR}" && \
{% endif -%}
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bfcl
pkg_name: bfcl
config:
params:
parallelism: 10
task: single_turn
extra:
native_calling: false
custom_dataset:
path: null
format: null
data_template_path: null
supported_endpoint_types:
- chat
- vlm
type: bfclv2
target:
api_endpoint: {}
```
:::
::::
---
(bfcl-bfclv2-ast)=
## bfclv2_ast
BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Uses native function calling.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bfcl`
**Container:**
```
nvcr.io/nvidia/eval-factory/bfcl:26.01
```
**Container Digest:**
```
sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9
```
**Container Arch:** `multiarch`
**Task Type:** `bfclv2_ast`
:::
:::{tab-item} Command
```bash
{%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \
--dataset_format {{config.params.extra.custom_dataset.format}} \
--dataset_path {{config.params.extra.custom_dataset.path}} \
--test_category {{config.params.task}} \
--processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \
{% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \
echo "Using custom dataset at ${BFCL_DATA_DIR}" && \
{% endif -%}
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bfcl
pkg_name: bfcl
config:
params:
parallelism: 10
task: ast
extra:
native_calling: true
custom_dataset:
path: null
format: null
data_template_path: null
supported_endpoint_types:
- chat
- vlm
type: bfclv2_ast
target:
api_endpoint: {}
```
:::
::::
---
(bfcl-bfclv2-ast-prompting)=
## bfclv2_ast_prompting
BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Not using native function calling.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bfcl`
**Container:**
```
nvcr.io/nvidia/eval-factory/bfcl:26.01
```
**Container Digest:**
```
sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9
```
**Container Arch:** `multiarch`
**Task Type:** `bfclv2_ast_prompting`
:::
:::{tab-item} Command
```bash
{%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \
--dataset_format {{config.params.extra.custom_dataset.format}} \
--dataset_path {{config.params.extra.custom_dataset.path}} \
--test_category {{config.params.task}} \
--processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \
{% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \
echo "Using custom dataset at ${BFCL_DATA_DIR}" && \
{% endif -%}
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bfcl
pkg_name: bfcl
config:
params:
parallelism: 10
task: ast
extra:
native_calling: false
custom_dataset:
path: null
format: null
data_template_path: null
supported_endpoint_types:
- chat
- vlm
type: bfclv2_ast_prompting
target:
api_endpoint: {}
```
:::
::::
---
(bfcl-bfclv3)=
## bfclv3
BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bfcl`
**Container:**
```
nvcr.io/nvidia/eval-factory/bfcl:26.01
```
**Container Digest:**
```
sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9
```
**Container Arch:** `multiarch`
**Task Type:** `bfclv3`
:::
:::{tab-item} Command
```bash
{%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \
--dataset_format {{config.params.extra.custom_dataset.format}} \
--dataset_path {{config.params.extra.custom_dataset.path}} \
--test_category {{config.params.task}} \
--processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \
{% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \
echo "Using custom dataset at ${BFCL_DATA_DIR}" && \
{% endif -%}
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bfcl
pkg_name: bfcl
config:
params:
parallelism: 10
task: all
extra:
native_calling: false
custom_dataset:
path: null
format: null
data_template_path: null
supported_endpoint_types:
- chat
- vlm
type: bfclv3
target:
api_endpoint: {}
```
:::
::::
---
(bfcl-bfclv3-ast)=
## bfclv3_ast
BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Uses native function calling.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bfcl`
**Container:**
```
nvcr.io/nvidia/eval-factory/bfcl:26.01
```
**Container Digest:**
```
sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9
```
**Container Arch:** `multiarch`
**Task Type:** `bfclv3_ast`
:::
:::{tab-item} Command
```bash
{%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \
--dataset_format {{config.params.extra.custom_dataset.format}} \
--dataset_path {{config.params.extra.custom_dataset.path}} \
--test_category {{config.params.task}} \
--processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \
{% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \
echo "Using custom dataset at ${BFCL_DATA_DIR}" && \
{% endif -%}
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bfcl
pkg_name: bfcl
config:
params:
parallelism: 10
task: multi_turn,ast
extra:
native_calling: true
custom_dataset:
path: null
format: null
data_template_path: null
supported_endpoint_types:
- chat
- vlm
type: bfclv3_ast
target:
api_endpoint: {}
```
:::
::::
---
(bfcl-bfclv3-ast-prompting)=
## bfclv3_ast_prompting
BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Not using native function calling.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bfcl`
**Container:**
```
nvcr.io/nvidia/eval-factory/bfcl:26.01
```
**Container Digest:**
```
sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9
```
**Container Arch:** `multiarch`
**Task Type:** `bfclv3_ast_prompting`
:::
:::{tab-item} Command
```bash
{%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \
--dataset_format {{config.params.extra.custom_dataset.format}} \
--dataset_path {{config.params.extra.custom_dataset.path}} \
--test_category {{config.params.task}} \
--processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \
{% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \
echo "Using custom dataset at ${BFCL_DATA_DIR}" && \
{% endif -%}
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bfcl
pkg_name: bfcl
config:
params:
parallelism: 10
task: multi_turn,ast
extra:
native_calling: false
custom_dataset:
path: null
format: null
data_template_path: null
supported_endpoint_types:
- chat
- vlm
type: bfclv3_ast_prompting
target:
api_endpoint: {}
```
:::
::::
# bigcode-evaluation-harness
This page contains all evaluation tasks for the **bigcode-evaluation-harness** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [humaneval](#bigcode-evaluation-harness-humaneval)
- HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
* - [humaneval_instruct](#bigcode-evaluation-harness-humaneval-instruct)
- InstructHumanEval is a modified version of OpenAI's HumanEval. For each prompt, the function signature, docstring, and header are extracted to create a flexible setting for evaluating instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that elicits the model's best capabilities.
* - [humanevalplus](#bigcode-evaluation-harness-humanevalplus)
- HumanEvalPlus is a modified version of HumanEval containing 80x more test cases.
* - [mbpp-chat](#bigcode-evaluation-harness-mbpp-chat)
- MBPP consists of Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and more. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the chat endpoint.
* - [mbpp-completions](#bigcode-evaluation-harness-mbpp-completions)
- MBPP consists of Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and more. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the completions endpoint.
* - [mbppplus-chat](#bigcode-evaluation-harness-mbppplus-chat)
- MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint.
* - [mbppplus-completions](#bigcode-evaluation-harness-mbppplus-completions)
- MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint.
* - [mbppplus_nemo](#bigcode-evaluation-harness-mbppplus-nemo)
- MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template.
* - [multiple-cpp](#bigcode-evaluation-harness-multiple-cpp)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cpp" subset.
* - [multiple-cs](#bigcode-evaluation-harness-multiple-cs)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cs" subset.
* - [multiple-d](#bigcode-evaluation-harness-multiple-d)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "d" subset.
* - [multiple-go](#bigcode-evaluation-harness-multiple-go)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "go" subset.
* - [multiple-java](#bigcode-evaluation-harness-multiple-java)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "java" subset.
* - [multiple-jl](#bigcode-evaluation-harness-multiple-jl)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "jl" subset.
* - [multiple-js](#bigcode-evaluation-harness-multiple-js)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "js" subset.
* - [multiple-lua](#bigcode-evaluation-harness-multiple-lua)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "lua" subset.
* - [multiple-php](#bigcode-evaluation-harness-multiple-php)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "php" subset.
* - [multiple-pl](#bigcode-evaluation-harness-multiple-pl)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "pl" subset.
* - [multiple-py](#bigcode-evaluation-harness-multiple-py)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "py" subset.
* - [multiple-r](#bigcode-evaluation-harness-multiple-r)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "r" subset.
* - [multiple-rb](#bigcode-evaluation-harness-multiple-rb)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rb" subset.
* - [multiple-rkt](#bigcode-evaluation-harness-multiple-rkt)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rkt" subset.
* - [multiple-rs](#bigcode-evaluation-harness-multiple-rs)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rs" subset.
* - [multiple-scala](#bigcode-evaluation-harness-multiple-scala)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "scala" subset.
* - [multiple-sh](#bigcode-evaluation-harness-multiple-sh)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "sh" subset.
* - [multiple-swift](#bigcode-evaluation-harness-multiple-swift)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "swift" subset.
* - [multiple-ts](#bigcode-evaluation-harness-multiple-ts)
- MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "ts" subset.
```
(bigcode-evaluation-harness-humaneval)=
## humaneval
HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `humaneval`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: humaneval
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 20
supported_endpoint_types:
- completions
type: humaneval
target:
api_endpoint: {}
```
:::
::::
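The `n_samples: 20` default above reflects how functional-correctness benchmarks like HumanEval are usually scored: pass@k is estimated from n sampled completions per problem, of which c pass the unit tests. A minimal sketch of the standard unbiased estimator (the textbook formula from the HumanEval paper, not necessarily the harness's exact implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    completions drawn without replacement from n is correct, given
    c of the n completions pass the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Sampling more completions than k (here 20 samples for a pass@1 report) reduces the variance of the estimate, which is why the default is larger than 1 even when only pass@1 is reported.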
---
(bigcode-evaluation-harness-humaneval-instruct)=
## humaneval_instruct
InstructHumanEval is a modified version of OpenAI's HumanEval. For each prompt, the function signature, docstring, and header are extracted to create a flexible setting for evaluating instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that elicits the model's best capabilities.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `humaneval_instruct`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: instruct-humaneval-nocontext-py
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 20
supported_endpoint_types:
- chat
type: humaneval_instruct
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-humanevalplus)=
## humanevalplus
HumanEvalPlus is a modified version of HumanEval containing 80x more test cases.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `humanevalplus`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: humanevalplus
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: humanevalplus
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-mbpp-chat)=
## mbpp-chat
MBPP consists of Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and more. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the chat endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `mbpp-chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbpp
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 10
supported_endpoint_types:
- chat
type: mbpp-chat
target:
api_endpoint: {}
```
:::
::::
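Each Defaults tab is a base configuration onto which user-supplied overrides are merged. As a sketch of that pattern (a generic recursive merge, not the SDK's actual merge logic), overriding a single nested parameter such as `config.params.parallelism` leaves the remaining defaults intact:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override onto base, returning a new dict.
    Nested dicts are merged key by key; all other values are replaced."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

defaults = {"config": {"params": {"temperature": 0.1, "parallelism": 10}}}
merged = deep_merge(defaults, {"config": {"params": {"parallelism": 4}}})
```

Here `merged` keeps `temperature: 0.1` from the defaults while `parallelism` becomes 4, which is how a small override file can adjust one knob without restating the whole document.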
---
(bigcode-evaluation-harness-mbpp-completions)=
## mbpp-completions
MBPP consists of Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and more. Each problem consists of a task description, a code solution, and 3 automated test cases. This variant uses the completions endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `mbpp-completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbpp
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 10
supported_endpoint_types:
- completions
type: mbpp-completions
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-mbppplus-chat)=
## mbppplus-chat
MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `mbppplus-chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbppplus
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- chat
type: mbppplus-chat
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-mbppplus-completions)=
## mbppplus-completions
MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `mbppplus-completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbppplus
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: mbppplus-completions
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-mbppplus-nemo)=
## mbppplus_nemo
MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `mbppplus_nemo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 10
task: mbppplus_nemo
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- chat
type: mbppplus_nemo
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-cpp)=
## multiple-cpp
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cpp" subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-cpp`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-cpp
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-cpp
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-cs)=
## multiple-cs
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cs" subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-cs`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-cs
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-cs
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-d)=
## multiple-d
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "d" subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-d`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-d
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-d
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-go)=
## multiple-go
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "go" subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-go`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-go
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-go
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-java)=
## multiple-java
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "java" subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-java`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-java
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-java
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-jl)=
## multiple-jl
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "jl" (Julia) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-jl`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-jl
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-jl
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-js)=
## multiple-js
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "js" (JavaScript) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-js`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-js
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-js
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-lua)=
## multiple-lua
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "lua" (Lua) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-lua`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-lua
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-lua
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-php)=
## multiple-php
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "php" (PHP) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-php`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-php
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-php
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-pl)=
## multiple-pl
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "pl" (Perl) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-pl`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-pl
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-pl
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-py)=
## multiple-py
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "py" (Python) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-py`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-py
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-py
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-r)=
## multiple-r
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "r" (R) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-r`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-r
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-r
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-rb)=
## multiple-rb
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rb" (Ruby) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-rb`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-rb
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-rb
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-rkt)=
## multiple-rkt
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rkt" (Racket) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-rkt`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-rkt
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-rkt
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-rs)=
## multiple-rs
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rs" (Rust) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-rs`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-rs
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-rs
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-scala)=
## multiple-scala
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "scala" (Scala) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-scala`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-scala
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-scala
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-sh)=
## multiple-sh
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "sh" (Bash) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-sh`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-sh
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-sh
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-swift)=
## multiple-swift
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "swift" (Swift) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-swift`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-swift
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-swift
target:
api_endpoint: {}
```
:::
::::
---
(bigcode-evaluation-harness-multiple-ts)=
## multiple-ts
MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "ts" (TypeScript) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `bigcode-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd
```
**Container Arch:** `multiarch`
**Task Type:** `multiple-ts`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bigcode-evaluation-harness
pkg_name: bigcode_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: multiple-ts
temperature: 0.1
request_timeout: 30
top_p: 0.95
extra:
do_sample: true
n_samples: 5
supported_endpoint_types:
- completions
type: multiple-ts
target:
api_endpoint: {}
```
:::
::::
# codec
This page contains all evaluation tasks for the **codec** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [aime_2024](#codec-aime-2024)
- Task for detecting contamination with the AIME 2024 dataset
* - [aime_2025](#codec-aime-2025)
- Task for detecting contamination with the AIME 2025 dataset
* - [bbq](#codec-bbq)
- Task for detecting contamination with the BBQ dataset
* - [bfcl_v3](#codec-bfcl-v3)
- Task for detecting contamination with the BFCL v3 dataset
* - [frames](#codec-frames)
- Task for detecting contamination with the FRAMES dataset
* - [gpqa_diamond](#codec-gpqa-diamond)
  - Task for detecting contamination with the GPQA Diamond subset
* - [gsm8k_test](#codec-gsm8k-test)
- Task for detecting contamination with the GSM8K test set
* - [gsm8k_train](#codec-gsm8k-train)
- Task for detecting contamination with the GSM8K train set
* - [hellaswag_test](#codec-hellaswag-test)
  - Task for detecting contamination with the HellaSwag test set
* - [hellaswag_train](#codec-hellaswag-train)
  - Task for detecting contamination with the HellaSwag train set
* - [hle](#codec-hle)
- Task for detecting contamination with the HLE dataset
* - [ifbench](#codec-ifbench)
- Task for detecting contamination with the IFBench dataset
* - [ifeval](#codec-ifeval)
  - Task for detecting contamination with the IFEval dataset
* - [livecodebench_v1](#codec-livecodebench-v1)
- Task for detecting contamination with the LiveCodeBench v1 dataset
* - [livecodebench_v5](#codec-livecodebench-v5)
- Task for detecting contamination with the LiveCodeBench v5 dataset
* - [math_500_problem](#codec-math-500-problem)
  - Task for detecting contamination with the MATH-500 dataset (problem statements)
* - [math_500_solution](#codec-math-500-solution)
  - Task for detecting contamination with the MATH-500 dataset (solutions)
* - [mmlu_pro_test](#codec-mmlu-pro-test)
- Task for detecting contamination with the MMLU-Pro test set
* - [mmlu_test](#codec-mmlu-test)
- Task for detecting contamination with the MMLU test set
* - [openai_humaneval](#codec-openai-humaneval)
- Task for detecting contamination with the OpenAI HumanEval dataset
* - [reward_bench_v1](#codec-reward-bench-v1)
  - Task for detecting contamination with the RewardBench v1 dataset
* - [reward_bench_v2](#codec-reward-bench-v2)
  - Task for detecting contamination with the RewardBench v2 dataset
* - [scicode](#codec-scicode)
- Task for detecting contamination with the SciCode dataset
* - [swebench_test](#codec-swebench-test)
- Task for detecting contamination with the SWE-bench dataset (test split)
* - [swebench_train](#codec-swebench-train)
- Task for detecting contamination with the SWE-bench dataset (train split)
* - [taubench](#codec-taubench)
- Task for detecting contamination with the Tau-bench dataset
* - [terminalbench](#codec-terminalbench)
- Task for detecting contamination with the Terminal-Bench dataset
```
(codec-aime-2024)=
## aime_2024
Task for detecting contamination with the AIME 2024 dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `aime_2024`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: aime_2024
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: aime_2024
target:
api_endpoint: {}
```
:::
::::
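As with the bigcode harness, the codec command template above can be previewed by substituting the published defaults into a string of the same shape. The endpoint URL, model name, and output directories in this sketch are hypothetical placeholders, and it assumes no `api_key_name` is set (so the `export API_KEY=…` prefix is skipped).

```python
# Preview the rendered codec invocation by substituting the YAML defaults
# above. Endpoint URL and model name below are hypothetical.
defaults = {
    "limit_samples": 1000,
    "max_retries": 10,
    "parallelism": 20,
    "task": "aime_2024",
    "temperature": 0.0,
    "request_timeout": 120,
    "top_p": 1.0,
    "contamination_type": "in_context",
    "n_context_seeds": 5,
    "min_length": 100,
    "max_length": 2048,
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # assumed local endpoint
    "model_id": "my-model",                         # assumed model name
}

# Note: unlike the bigcode defaults, limit_samples is set here, so the
# template's guarded --n_samples flag is included in the rendered command.
command = (
    f"codec --model {endpoint['model_id']} --eval_name {defaults['task']} "
    f"--contamination_type {defaults['contamination_type']} "
    f"--url {endpoint['url']} --temperature {defaults['temperature']} "
    f"--top_p {defaults['top_p']} "
    f"--n_context_seeds {defaults['n_context_seeds']} "
    f"--out_dir ./results --cache_dir ./cache "
    f"--num_threads {defaults['parallelism']} "
    f"--max_retries {defaults['max_retries']} "
    f"--timeout {defaults['request_timeout']} "
    f"--min_length {defaults['min_length']} "
    f"--max_length {defaults['max_length']} "
    f"--n_samples {defaults['limit_samples']}"
)
print(command)
```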
---
(codec-aime-2025)=
## aime_2025
Task for detecting contamination with the AIME 2025 dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `aime_2025`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: aime_2025
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: aime_2025
target:
  api_endpoint: {}
```
:::
::::
---
(codec-bbq)=
## bbq
Task for detecting contamination with the BBQ dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `bbq`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bbq
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bbq
target:
  api_endpoint: {}
```
:::
::::
---
(codec-bfcl-v3)=
## bfcl_v3
Task for detecting contamination with the BFCL v3 dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `bfcl_v3`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bfcl_v3
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bfcl_v3
target:
  api_endpoint: {}
```
:::
::::
---
(codec-frames)=
## frames
Task for detecting contamination with the FRAMES dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `frames`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: frames
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: frames
target:
  api_endpoint: {}
```
:::
::::
---
(codec-gpqa-diamond)=
## gpqa_diamond
Task for detecting contamination with the GPQA Diamond dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `gpqa_diamond`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gpqa_diamond
target:
  api_endpoint: {}
```
:::
::::
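The Defaults tab gives the baseline parameters for a task; in practice a run usually overrides only a few of them (sample limit, sampling settings) while the rest fall through to these defaults. A sketch of how such overrides compose, using an illustrative recursive merge (this helper is not the SDK's actual override mechanism, and the override values are hypothetical):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base` without mutating either dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Baseline params copied from the gpqa_diamond Defaults tab.
defaults = {
    "limit_samples": 1000, "max_retries": 10, "parallelism": 20,
    "task": "gpqa_diamond", "temperature": 0.0, "request_timeout": 120,
    "top_p": 1.0,
    "extra": {"contamination_type": "in_context", "n_context_seeds": 5,
              "min_length": 100, "max_length": 2048},
}

# Hypothetical per-run overrides: a quick smoke test with fewer samples/seeds.
overrides = {"limit_samples": 50, "extra": {"n_context_seeds": 3}}

params = deep_merge(defaults, overrides)
print(params["limit_samples"],
      params["extra"]["n_context_seeds"],
      params["extra"]["min_length"])  # → 50 3 100
```

Note that nested keys not named in the override (`min_length`, `contamination_type`, the task type itself) survive the merge untouched, which is the behavior you want when tweaking one or two knobs on a catalog task.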
---
(codec-gsm8k-test)=
## gsm8k_test
Task for detecting contamination with the GSM8K test set
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `gsm8k_test`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_test
target:
  api_endpoint: {}
```
:::
::::
---
(codec-gsm8k-train)=
## gsm8k_train
Task for detecting contamination with the GSM8K train set
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `gsm8k_train`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_train
target:
  api_endpoint: {}
```
:::
::::
---
(codec-hellaswag-test)=
## hellaswag_test
Task for detecting contamination with the HellaSwag test set
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `hellaswag_test`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_test
target:
  api_endpoint: {}
```
:::
::::
---
(codec-hellaswag-train)=
## hellaswag_train
Task for detecting contamination with the HellaSwag train set
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `hellaswag_train`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_train
target:
  api_endpoint: {}
```
:::
::::
---
(codec-hle)=
## hle
Task for detecting contamination with the HLE dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `hle`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hle
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hle
target:
  api_endpoint: {}
```
:::
::::
---
(codec-ifbench)=
## ifbench
Task for detecting contamination with the IFBench dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `ifbench`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifbench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifbench
target:
  api_endpoint: {}
```
:::
::::
---
(codec-ifeval)=
## ifeval
Task for detecting contamination with the IFEval dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `ifeval`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifeval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifeval
target:
  api_endpoint: {}
```
:::
::::
---
(codec-livecodebench-v1)=
## livecodebench_v1
Task for detecting contamination with the LiveCodeBench v1 dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `livecodebench_v1`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v1
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v1
target:
  api_endpoint: {}
```
:::
::::
---
(codec-livecodebench-v5)=
## livecodebench_v5
Task for detecting contamination with the LiveCodeBench v5 dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `livecodebench_v5`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v5
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v5
target:
  api_endpoint: {}
```
:::
::::
---
(codec-math-500-problem)=
## math_500_problem
Task for detecting contamination with the MATH-500 dataset (problem statements)
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `math_500_problem`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_problem
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_problem
target:
  api_endpoint: {}
```
:::
::::
---
(codec-math-500-solution)=
## math_500_solution
Task for detecting contamination with the MATH-500 dataset (solutions)
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `math_500_solution`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_solution
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_solution
target:
  api_endpoint: {}
```
:::
::::
---
(codec-mmlu-pro-test)=
## mmlu_pro_test
Task for detecting contamination with the MMLU-Pro test set
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `mmlu_pro_test`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_pro_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_pro_test
target:
  api_endpoint: {}
```
:::
::::
---
(codec-mmlu-test)=
## mmlu_test
Task for detecting contamination with the MMLU test set
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `mmlu_test`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_test
target:
  api_endpoint: {}
```
:::
::::
---
(codec-openai-humaneval)=
## openai_humaneval
Task for detecting contamination with the OpenAI HumanEval dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `openai_humaneval`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: openai_humaneval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: openai_humaneval
target:
  api_endpoint: {}
```
:::
::::
---
(codec-reward-bench-v1)=
## reward_bench_v1
Task for detecting contamination with the RewardBench v1 dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `reward_bench_v1`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: reward_bench_v1
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: reward_bench_v1
target:
api_endpoint: {}
```
:::
::::
---
(codec-reward-bench-v2)=
## reward_bench_v2
Task for detecting contamination with the Reward Bench v2 dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `reward_bench_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: reward_bench_v2
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: reward_bench_v2
target:
api_endpoint: {}
```
:::
::::
---
(codec-scicode)=
## scicode
Task for detecting contamination with the SciCode dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `scicode`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: scicode
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: scicode
target:
api_endpoint: {}
```
:::
::::
---
(codec-swebench-test)=
## swebench_test
Task for detecting contamination with the SWE-bench dataset (test split)
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `swebench_test`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: swebench_test
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: swebench_test
target:
api_endpoint: {}
```
:::
::::
---
(codec-swebench-train)=
## swebench_train
Task for detecting contamination with the SWE-bench dataset (train split)
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `swebench_train`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: swebench_train
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: swebench_train
target:
api_endpoint: {}
```
:::
::::
---
(codec-taubench)=
## taubench
Task for detecting contamination with the Tau-bench dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `taubench`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: taubench
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: taubench
target:
api_endpoint: {}
```
:::
::::
---
(codec-terminalbench)=
## terminalbench
Task for detecting contamination with the Terminal-Bench dataset
::::{tab-set}
:::{tab-item} Container
**Harness:** `codec`
**Container:**
```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```
**Container Digest:**
```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```
**Container Arch:** `amd`
**Task Type:** `terminalbench`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: codec
pkg_name: codec
config:
params:
limit_samples: 1000
max_retries: 10
parallelism: 20
task: terminalbench
temperature: 0.0
request_timeout: 120
top_p: 1.0
extra:
contamination_type: in_context
n_context_seeds: 5
min_length: 100
max_length: 2048
supported_endpoint_types:
- completions
type: terminalbench
target:
api_endpoint: {}
```
:::
::::
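The codec tasks above all share the same defaults. When you need different values (for example, a quick smoke run), one pattern is to override only the fields you care about in a config file whose layout mirrors the Defaults tab. The filename and the exact override mechanism depend on how you launch the evaluation (launcher config vs. core API), so treat this as a sketch:

```bash
# Sketch: a partial override of the documented codec defaults. Only the keys
# shown are changed; everything else keeps the values from the Defaults tab.
cat > codec_overrides.yaml << 'EOF'
config:
  params:
    limit_samples: 50      # down from 1000, for a quick smoke run
    parallelism: 4         # down from 20
    extra:
      n_context_seeds: 2   # down from 5
EOF
cat codec_overrides.yaml
```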
# garak
This page contains all evaluation tasks for the **garak** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [garak](#garak-garak)
- Task for running the default set of Garak probes. This variant uses the chat endpoint.
* - [garak-completions](#garak-garak-completions)
- Task for running the default set of Garak probes. This variant uses the completions endpoint.
```
(garak-garak)=
## garak
Task for running the default set of Garak probes. This variant uses the chat endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `garak`
**Container:**
```
nvcr.io/nvidia/eval-factory/garak:26.01
```
**Container Digest:**
```
sha256:72514ac2c35f76fdb139b02f1c1d4159103969946a121592e50b129087dd455e
```
**Container Arch:** `multiarch`
**Task Type:** `garak`
:::
:::{tab-item} Command
```bash
cat > garak_config.yaml << 'EOF'
{% if config.params.extra.seed is not none %}run:
seed: {{config.params.extra.seed}}{% endif %}
plugins:
{% if config.params.extra.probes is not none %}probe_spec: {{config.params.extra.probes}}{% endif %}
extended_detectors: true
target_type: {% if target.api_endpoint.type == "completions" %}nim.NVOpenAICompletion{% elif target.api_endpoint.type == "chat" %}nim.NVOpenAIChat{% endif %}
target_name: {{target.api_endpoint.model_id}}
generators:
nim:
uri: {{target.api_endpoint.url | replace('/chat/completions', '') | replace('/completions', '')}}
{% if config.params.temperature is not none %}temperature: {{config.params.temperature}}{% endif %}
{% if config.params.top_p is not none %}top_p: {{config.params.top_p}}{% endif %}
{% if config.params.max_new_tokens is not none %}max_tokens: {{config.params.max_new_tokens}}{% endif %}
skip_seq_start: {{config.params.extra.skip_seq_start}}
skip_seq_end: {{config.params.extra.skip_seq_end}}
system:
parallel_attempts: {{config.params.parallelism}}
lite: false
EOF
{% if target.api_endpoint.api_key_name is not none %}
export NIM_API_KEY=${{target.api_endpoint.api_key_name}} &&
{% else %}
export NIM_API_KEY=dummy &&
{% endif %}
export XDG_DATA_HOME={{config.output_dir}} &&
garak --config garak_config.yaml --report_prefix=results
```
:::
:::{tab-item} Defaults
```yaml
framework_name: garak
pkg_name: garak
config:
params:
max_new_tokens: 150
parallelism: 32
temperature: 0.1
top_p: 0.7
extra:
probes: null
seed: 42
skip_seq_start:
skip_seq_end:
supported_endpoint_types:
- chat
type: garak
target:
api_endpoint:
api_key_name: API_KEY
```
:::
::::
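The command above writes `garak_config.yaml` from the template at run time. To see the shape of the resulting file, here is a hand-filled sketch for a hypothetical chat endpoint, restricted to a single probe family via `probe_spec`; the probe name, model id, and URL are examples, and the exact nesting is an assumption based on garak's YAML config format:

```bash
# Hand-filled sketch of the generated garak_config.yaml. Endpoint values and
# the probe selection ("dan") are examples, not defaults.
cat > garak_config.yaml << 'EOF'
run:
  seed: 42
plugins:
  probe_spec: dan
  extended_detectors: true
  target_type: nim.NVOpenAIChat
  target_name: my-org/my-model
  generators:
    nim:
      uri: http://localhost:8000/v1
      temperature: 0.1
      top_p: 0.7
      max_tokens: 150
system:
  parallel_attempts: 32
  lite: false
EOF
grep 'probe_spec' garak_config.yaml
```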
---
(garak-garak-completions)=
## garak-completions
Task for running the default set of Garak probes. This variant uses the completions endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `garak`
**Container:**
```
nvcr.io/nvidia/eval-factory/garak:26.01
```
**Container Digest:**
```
sha256:72514ac2c35f76fdb139b02f1c1d4159103969946a121592e50b129087dd455e
```
**Container Arch:** `multiarch`
**Task Type:** `garak-completions`
:::
:::{tab-item} Command
```bash
cat > garak_config.yaml << 'EOF'
{% if config.params.extra.seed is not none %}run:
seed: {{config.params.extra.seed}}{% endif %}
plugins:
{% if config.params.extra.probes is not none %}probe_spec: {{config.params.extra.probes}}{% endif %}
extended_detectors: true
target_type: {% if target.api_endpoint.type == "completions" %}nim.NVOpenAICompletion{% elif target.api_endpoint.type == "chat" %}nim.NVOpenAIChat{% endif %}
target_name: {{target.api_endpoint.model_id}}
generators:
nim:
uri: {{target.api_endpoint.url | replace('/chat/completions', '') | replace('/completions', '')}}
{% if config.params.temperature is not none %}temperature: {{config.params.temperature}}{% endif %}
{% if config.params.top_p is not none %}top_p: {{config.params.top_p}}{% endif %}
{% if config.params.max_new_tokens is not none %}max_tokens: {{config.params.max_new_tokens}}{% endif %}
skip_seq_start: {{config.params.extra.skip_seq_start}}
skip_seq_end: {{config.params.extra.skip_seq_end}}
system:
parallel_attempts: {{config.params.parallelism}}
lite: false
EOF
{% if target.api_endpoint.api_key_name is not none %}
export NIM_API_KEY=${{target.api_endpoint.api_key_name}} &&
{% else %}
export NIM_API_KEY=dummy &&
{% endif %}
export XDG_DATA_HOME={{config.output_dir}} &&
garak --config garak_config.yaml --report_prefix=results
```
:::
:::{tab-item} Defaults
```yaml
framework_name: garak
pkg_name: garak
config:
params:
max_new_tokens: 150
parallelism: 32
temperature: 0.1
top_p: 0.7
extra:
probes: null
seed: 42
skip_seq_start:
skip_seq_end:
supported_endpoint_types:
- completions
type: garak-completions
target:
api_endpoint:
api_key_name: API_KEY
```
:::
::::
# genai_perf_eval
This page contains all evaluation tasks for the **genai_perf_eval** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [genai_perf_generation](#genai-perf-eval-genai-perf-generation)
  - GenAI Perf speed evaluation for the chat endpoint, generation task (short input, long output)
* - [genai_perf_generation_completions](#genai-perf-eval-genai-perf-generation-completions)
  - GenAI Perf speed evaluation for the completions endpoint, generation task (short input, long output)
* - [genai_perf_summarization](#genai-perf-eval-genai-perf-summarization)
  - GenAI Perf speed evaluation for the chat endpoint, summarization task (long input, short output)
* - [genai_perf_summarization_completions](#genai-perf-eval-genai-perf-summarization-completions)
  - GenAI Perf speed evaluation for the completions endpoint, summarization task (long input, short output)
```
(genai-perf-eval-genai-perf-generation)=
## genai_perf_generation
GenAI Perf speed evaluation for the chat endpoint, generation task (short input, long output)
::::{tab-set}
:::{tab-item} Container
**Harness:** `genai_perf_eval`
**Container:**
```
nvcr.io/nvidia/eval-factory/genai-perf:26.01
```
**Container Digest:**
```
sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6
```
**Container Arch:** `amd`
**Task Type:** `genai_perf_generation`
:::
:::{tab-item} Command
```bash
genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir {{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: genai_perf_eval
pkg_name: genai_perf
config:
params:
parallelism: 1
extra:
tokenizer: null
warmup: true
isl: 500
osl: 5000
supported_endpoint_types:
- chat
type: genai_perf_generation
target:
api_endpoint: {}
```
:::
::::
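As with the other harnesses, the templated command can be expanded by hand. The sketch below assumes a hypothetical streaming chat endpoint; numeric values mirror the Defaults tab (concurrency 1, `isl=500`, `osl=5000`), and `--tokenizer` is omitted because its default is `null`:

```bash
# Hand-expanded sketch of the genai_perf_eval invocation above. The model id,
# URL, and artifact directory are examples, not real values.
MODEL_ID="my-org/my-model"
CMD="genai_perf_eval --model_id ${MODEL_ID} \
  --url http://localhost:8000/v1/chat/completions \
  --concurrencies 1 --isl 500 --osl 5000 \
  --endpoint-type chat --artifact-dir /tmp/genai_perf \
  --streaming --warmup"
echo "${CMD}"
```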
---
(genai-perf-eval-genai-perf-generation-completions)=
## genai_perf_generation_completions
GenAI Perf speed evaluation for the completions endpoint, generation task (short input, long output)
::::{tab-set}
:::{tab-item} Container
**Harness:** `genai_perf_eval`
**Container:**
```
nvcr.io/nvidia/eval-factory/genai-perf:26.01
```
**Container Digest:**
```
sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6
```
**Container Arch:** `amd`
**Task Type:** `genai_perf_generation_completions`
:::
:::{tab-item} Command
```bash
genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir {{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: genai_perf_eval
pkg_name: genai_perf
config:
params:
parallelism: 1
task: genai_perf_generation
extra:
tokenizer: null
warmup: true
isl: 500
osl: 5000
supported_endpoint_types:
- completions
type: genai_perf_generation_completions
target:
api_endpoint: {}
```
:::
::::
---
(genai-perf-eval-genai-perf-summarization)=
## genai_perf_summarization
GenAI Perf speed evaluation for the chat endpoint, summarization task (long input, short output)
::::{tab-set}
:::{tab-item} Container
**Harness:** `genai_perf_eval`
**Container:**
```
nvcr.io/nvidia/eval-factory/genai-perf:26.01
```
**Container Digest:**
```
sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6
```
**Container Arch:** `amd`
**Task Type:** `genai_perf_summarization`
:::
:::{tab-item} Command
```bash
genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir {{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: genai_perf_eval
pkg_name: genai_perf
config:
params:
parallelism: 1
extra:
tokenizer: null
warmup: true
isl: 5000
osl: 500
supported_endpoint_types:
- chat
type: genai_perf_summarization
target:
api_endpoint: {}
```
:::
::::
---
(genai-perf-eval-genai-perf-summarization-completions)=
## genai_perf_summarization_completions
GenAI Perf speed evaluation for the completions endpoint, summarization task (long input, short output)
::::{tab-set}
:::{tab-item} Container
**Harness:** `genai_perf_eval`
**Container:**
```
nvcr.io/nvidia/eval-factory/genai-perf:26.01
```
**Container Digest:**
```
sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6
```
**Container Arch:** `amd`
**Task Type:** `genai_perf_summarization_completions`
:::
:::{tab-item} Command
```bash
genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir {{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: genai_perf_eval
pkg_name: genai_perf
config:
params:
parallelism: 1
task: genai_perf_summarization
extra:
tokenizer: null
warmup: true
isl: 5000
osl: 500
supported_endpoint_types:
- completions
type: genai_perf_summarization_completions
target:
api_endpoint: {}
```
:::
::::
# helm
This page contains all evaluation tasks for the **helm** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [aci_bench](#helm-aci-bench)
- Extract and structure information from patient-doctor conversations
* - [ehr_sql](#helm-ehr-sql)
- Given a natural language instruction, generate an SQL query that would be used in clinical research.
* - [head_qa](#helm-head-qa)
- A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019).
* - [med_dialog_healthcaremagic](#helm-med-dialog-healthcaremagic)
- Generate summaries of doctor-patient conversations, healthcaremagic version
* - [med_dialog_icliniq](#helm-med-dialog-icliniq)
- Generate summaries of doctor-patient conversations, icliniq version
* - [medbullets](#helm-medbullets)
- A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025).
* - [medcalc_bench](#helm-medcalc-bench)
  - A dataset of patient notes, each paired with a question asking for a specific computed medical value and a ground-truth answer (Khandekar et al., 2024).
* - [medec](#helm-medec)
- A dataset containing medical narratives with error detection and correction pairs (Abacha et al., 2025).
* - [medhallu](#helm-medhallu)
  - A dataset of PubMed articles and associated question-answer pairs; the task is to classify whether each answer is factual or hallucinated.
* - [medi_qa](#helm-medi-qa)
- Retrieve and rank answers based on medical question understanding
* - [medication_qa](#helm-medication-qa)
- Answer consumer medication-related questions
* - [mtsamples_procedures](#helm-mtsamples-procedures)
- Document and extract information about medical procedures
* - [mtsamples_replicate](#helm-mtsamples-replicate)
- Generate treatment plans based on clinical notes
* - [pubmed_qa](#helm-pubmed-qa)
- A dataset that provides PubMed abstracts and asks associated questions (yes/no/maybe format).
* - [race_based_med](#helm-race-based-med)
  - A collection of LLM outputs in response to medical questions that can elicit race-based biases; the task is to classify whether each output contains racially biased content.
```
(helm-aci-bench)=
## aci_bench
Extract and structure information from patient-doctor conversations
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `aci_bench`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: aci_bench
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: aci_bench
target:
api_endpoint: {}
```
:::
::::
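The helm template above runs two steps: generate dynamic model configs, then run the suite. The sketch below expands it by hand for a hypothetical endpoint; judge API keys are omitted for brevity, and the model id, URL, and output directory are placeholders:

```bash
# Hand-expanded sketch of the two helm steps above. All endpoint values are
# hypothetical; at run time they are injected from the target configuration.
MODEL_ID="my-org/my-model"
OUT_DIR="/tmp/helm_aci_bench"
CMD="helm-generate-dynamic-model-configs --model-name ${MODEL_ID} \
  --base-url http://localhost:8000/v1 --openai-model-name ${MODEL_ID} \
  --output-dir ${OUT_DIR} && \
helm-run --run-entries aci_bench:model=${MODEL_ID} -n 1 \
  --suite aci_bench -o ${OUT_DIR} --local-path ${OUT_DIR}"
echo "${CMD}"
```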
---
(helm-ehr-sql)=
## ehr_sql
Given a natural language instruction, generate an SQL query that would be used in clinical research.
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `ehr_sql`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: ehr_sql
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: ehr_sql
target:
api_endpoint: {}
```
:::
::::
---
(helm-head-qa)=
## head_qa
A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019).
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `head_qa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: head_qa
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: head_qa
target:
api_endpoint: {}
```
:::
::::
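The defaults above can be overridden when launching a run. As an illustrative sketch (the exact override mechanism depends on how you invoke the evaluation; the keys below are taken directly from the Defaults tab), a configuration that caps the number of evaluated instances and raises request parallelism might look like:

```yaml
# Hypothetical override fragment; only keys shown in the defaults above are used.
config:
  params:
    task: head_qa
    parallelism: 4        # expands to "helm-run ... -n 4" in the command template
    limit_samples: 100    # expands to "helm-run ... --max-eval-instances 100"
```

Any key left unset keeps the default shown above (for example, `extra.subset` stays `null`, so no `subset=` entry is added to the run entry string).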
---
(helm-med-dialog-healthcaremagic)=
## med_dialog_healthcaremagic
Generate summaries of doctor-patient conversations (HealthCareMagic version).
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `med_dialog_healthcaremagic`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: med_dialog
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: healthcaremagic
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: med_dialog_healthcaremagic
target:
api_endpoint: {}
```
:::
::::
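The `med_dialog_*` task types share the underlying `med_dialog` task; only `extra.subset` differs, and the command template expands it into the `subset=...` portion of the `helm-run` run entry. A minimal sketch of selecting the subset explicitly, using only keys from the defaults above:

```yaml
# Hypothetical fragment: the subset value is what distinguishes the
# HealthCareMagic and iCliniq variants of the med_dialog task.
config:
  params:
    task: med_dialog
    extra:
      subset: healthcaremagic   # yields run entry "med_dialog:subset=healthcaremagic,model=..."
```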
---
(helm-med-dialog-icliniq)=
## med_dialog_icliniq
Generate summaries of doctor-patient conversations (iCliniq version).
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `med_dialog_icliniq`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: med_dialog
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: icliniq
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: med_dialog_icliniq
target:
api_endpoint: {}
```
:::
::::
---
(helm-medbullets)=
## medbullets
A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025).
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `medbullets`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: medbullets
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: medbullets
target:
api_endpoint: {}
```
:::
::::
---
(helm-medcalc-bench)=
## medcalc_bench
A dataset consisting of a patient note, a question asking for a specific computed medical value, and a ground-truth answer (Khandekar et al., 2024).
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `medcalc_bench`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: medcalc_bench
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: medcalc_bench
target:
api_endpoint: {}
```
:::
::::
---
(helm-medec)=
## medec
A dataset containing medical narratives with error detection and correction pairs (Abacha et al., 2025).
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `medec`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: medec
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: medec
target:
api_endpoint: {}
```
:::
::::
---
(helm-medhallu)=
## medhallu
A dataset of PubMed articles and associated question-answer pairs; the task is to classify whether each answer is factual or hallucinated.
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `medhallu`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: medhallu
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: medhallu
target:
api_endpoint: {}
```
:::
::::
---
(helm-medi-qa)=
## medi_qa
Retrieve and rank candidate answers to consumer medical questions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `medi_qa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: medi_qa
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: medi_qa
target:
api_endpoint: {}
```
:::
::::
---
(helm-medication-qa)=
## medication_qa
Answer consumer medication-related questions
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `medication_qa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: medication_qa
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: medication_qa
target:
api_endpoint: {}
```
:::
::::
---
(helm-mtsamples-procedures)=
## mtsamples_procedures
Document and extract information about medical procedures
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `mtsamples_procedures`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: mtsamples_procedures
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: mtsamples_procedures
target:
api_endpoint: {}
```
:::
::::
---
(helm-mtsamples-replicate)=
## mtsamples_replicate
Generate treatment plans based on clinical notes
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `mtsamples_replicate`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: mtsamples_replicate
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: mtsamples_replicate
target:
api_endpoint: {}
```
:::
::::
---
(helm-pubmed-qa)=
## pubmed_qa
A dataset that provides PubMed abstracts and asks associated questions (yes/no/maybe format).
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `pubmed_qa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: pubmed_qa
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: pubmed_qa
target:
api_endpoint: {}
```
:::
::::
---
(helm-race-based-med)=
## race_based_med
A collection of LLM outputs produced in response to medical questions with race-based biases; the task is to classify whether an output contains racially biased content.
::::{tab-set}
:::{tab-item} Container
**Harness:** `helm`
**Container:**
```
nvcr.io/nvidia/eval-factory/helm:26.01
```
**Container Digest:**
```
sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589
```
**Container Arch:** `amd`
**Task Type:** `race_based_med`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: helm
pkg_name: helm
config:
params:
parallelism: 1
task: race_based_med
extra:
data_path: null
num_output_tokens: null
subject: null
condition: null
max_length: null
num_train_trials: null
subset: null
gpt_judge_api_key: GPT_JUDGE_API_KEY
llama_judge_api_key: LLAMA_JUDGE_API_KEY
claude_judge_api_key: CLAUDE_JUDGE_API_KEY
supported_endpoint_types:
- chat
type: race_based_med
target:
api_endpoint: {}
```
:::
::::
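Once an endpoint is configured, the Jinja placeholders in the command template above render to a plain two-step invocation. The sketch below substitutes the documented defaults for `race_based_med`; the model id, URL, and output directory are hypothetical examples only, and the `export *_JUDGE_API_KEY` prefix is omitted for brevity:

```shell
# Hypothetical endpoint values (illustrative assumptions, not real endpoints).
MODEL_ID="meta/llama-3.1-8b-instruct"
BASE_URL="http://localhost:8000/v1/chat/completions"
OUTPUT_DIR="/tmp/helm-results"

# With the documented defaults, every optional `extra` value is null, so the
# template reduces to the config-generation step followed by helm-run with
# parallelism 1 and the task name used as both run entry and suite.
CMD="helm-generate-dynamic-model-configs --model-name ${MODEL_ID} --base-url ${BASE_URL}"
CMD="$CMD --openai-model-name ${MODEL_ID} --output-dir ${OUTPUT_DIR}"
CMD="$CMD && helm-run --run-entries race_based_med:model=${MODEL_ID} -n 1"
CMD="$CMD --suite race_based_med -o ${OUTPUT_DIR} --local-path ${OUTPUT_DIR}"

echo "$CMD"
```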
# hle
This page contains all evaluation tasks for the **hle** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [hle](#hle-hle)
- Text-only questions from Humanity's Last Exam
* - [hle_aa_v2](#hle-hle-aa-v2)
  - Text-only questions from Humanity's Last Exam, with parameters aligned with Artificial Analysis Index v2
```
(hle-hle)=
## hle
Text-only questions from Humanity's Last Exam
::::{tab-set}
:::{tab-item} Container
**Harness:** `hle`
**Container:**
```
nvcr.io/nvidia/eval-factory/hle:26.01
```
**Container Digest:**
```
sha256:59fa69e20bbaaa251effa5f9d440d60920bc601cfb26f9e03866f1b6aff6dc33
```
**Container Arch:** `multiarch`
**Task Type:** `hle`
:::
:::{tab-item} Command
```bash
hle_eval --dataset=cais/hle --model_name={{target.api_endpoint.model_id}} --model_url={{target.api_endpoint.url}} --temperature={{config.params.temperature}} --top_p={{config.params.top_p}} --timeout={{config.params.request_timeout}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --output_dir={{config.output_dir}} {% if target.api_endpoint.api_key_name is not none %}--api_key_name={{target.api_endpoint.api_key_name}}{% endif %} --max_retries={{config.params.max_retries}} --num_workers={{config.params.parallelism}} --max_new_tokens={{config.params.max_new_tokens}} --text_only --generate --judge
```
:::
:::{tab-item} Defaults
```yaml
framework_name: hle
pkg_name: hle
config:
params:
max_new_tokens: 8192
max_retries: 30
parallelism: 10
temperature: 0.0
request_timeout: 600
top_p: 1.0
extra: {}
supported_endpoint_types:
- chat
type: hle
target:
api_endpoint: {}
```
:::
::::
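With the default parameters above, the `hle_eval` command template renders to a single flat invocation. The following is an illustrative sketch; the model id and URL are hypothetical assumptions, not real endpoints:

```shell
# Hypothetical endpoint values (assumptions for illustration only).
MODEL_ID="meta/llama-3.1-8b-instruct"
MODEL_URL="http://localhost:8000/v1/chat/completions"
OUTPUT_DIR="/tmp/hle-results"

# Render of the command template using the documented defaults
# (temperature 0.0, top_p 1.0, timeout 600, max_retries 30,
# parallelism 10, max_new_tokens 8192; limit_samples is null,
# so --limit is omitted).
CMD="hle_eval --dataset=cais/hle --model_name=${MODEL_ID} --model_url=${MODEL_URL}"
CMD="$CMD --temperature=0.0 --top_p=1.0 --timeout=600 --output_dir=${OUTPUT_DIR}"
CMD="$CMD --max_retries=30 --num_workers=10 --max_new_tokens=8192 --text_only --generate --judge"

echo "$CMD"
```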
---
(hle-hle-aa-v2)=
## hle_aa_v2
Text-only questions from Humanity's Last Exam, with parameters aligned with Artificial Analysis Index v2
::::{tab-set}
:::{tab-item} Container
**Harness:** `hle`
**Container:**
```
nvcr.io/nvidia/eval-factory/hle:26.01
```
**Container Digest:**
```
sha256:59fa69e20bbaaa251effa5f9d440d60920bc601cfb26f9e03866f1b6aff6dc33
```
**Container Arch:** `multiarch`
**Task Type:** `hle_aa_v2`
:::
:::{tab-item} Command
```bash
hle_eval --dataset=cais/hle --model_name={{target.api_endpoint.model_id}} --model_url={{target.api_endpoint.url}} --temperature={{config.params.temperature}} --top_p={{config.params.top_p}} --timeout={{config.params.request_timeout}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --output_dir={{config.output_dir}} {% if target.api_endpoint.api_key_name is not none %}--api_key_name={{target.api_endpoint.api_key_name}}{% endif %} --max_retries={{config.params.max_retries}} --num_workers={{config.params.parallelism}} --max_new_tokens={{config.params.max_new_tokens}} --text_only --generate --judge
```
:::
:::{tab-item} Defaults
```yaml
framework_name: hle
pkg_name: hle
config:
params:
max_new_tokens: 16384
max_retries: 30
parallelism: 10
temperature: 0.0
request_timeout: 600
top_p: 1.0
extra: {}
supported_endpoint_types:
- chat
type: hle_aa_v2
target:
api_endpoint: {}
```
:::
::::
# ifbench
This page contains all evaluation tasks for the **ifbench** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [ifbench](#ifbench-ifbench)
- IFBench with vanilla settings
* - [ifbench_aa_v2](#ifbench-ifbench-aa-v2)
  - IFBench with parameters aligned with Artificial Analysis Index v2
```
(ifbench-ifbench)=
## ifbench
IFBench with vanilla settings
::::{tab-set}
:::{tab-item} Container
**Harness:** `ifbench`
**Container:**
```
nvcr.io/nvidia/eval-factory/ifbench:26.01
```
**Container Digest:**
```
sha256:e99059d2e334ef97826629a004c888f7daed1adb9d724ca73274e1b93c743ac1
```
**Container Arch:** `multiarch`
**Task Type:** `ifbench`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} ifbench --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --results-dir {{config.output_dir}} --inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ifbench
pkg_name: ifbench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 8
temperature: 0.01
top_p: 0.95
extra: {}
supported_endpoint_types:
- chat
type: ifbench
target:
api_endpoint:
stream: false
```
:::
::::
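The `ifbench` command template above renders to the invocation below once the placeholders are filled in. This is a sketch using the documented defaults; the model id, URL, and results directory are hypothetical examples:

```shell
# Hypothetical endpoint values (illustrative assumptions only).
MODEL_ID="meta/llama-3.1-8b-instruct"
MODEL_URL="http://localhost:8000/v1/chat/completions"
RESULTS_DIR="/tmp/ifbench-results"

# Render with the documented defaults: max_tokens 4096, temperature 0.01,
# top_p 0.95, parallelism 8, retries 5 (limit_samples is null, so --limit
# is omitted; the OPENAI_API_KEY export prefix is likewise left out here).
CMD="ifbench --model-url ${MODEL_URL} --model-name ${MODEL_ID} --results-dir ${RESULTS_DIR}"
CMD="$CMD --inference-params max_tokens=4096,temperature=0.01,top_p=0.95"
CMD="$CMD --parallelism 8 --retries 5"

echo "$CMD"
```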
---
(ifbench-ifbench-aa-v2)=
## ifbench_aa_v2
IFBench with parameters aligned with Artificial Analysis Index v2
::::{tab-set}
:::{tab-item} Container
**Harness:** `ifbench`
**Container:**
```
nvcr.io/nvidia/eval-factory/ifbench:26.01
```
**Container Digest:**
```
sha256:e99059d2e334ef97826629a004c888f7daed1adb9d724ca73274e1b93c743ac1
```
**Container Arch:** `multiarch`
**Task Type:** `ifbench_aa_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} ifbench --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --results-dir {{config.output_dir}} --inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ifbench
pkg_name: ifbench
config:
params:
max_new_tokens: 16384
max_retries: 30
parallelism: 8
temperature: 0.0
top_p: 0.95
extra: {}
supported_endpoint_types:
- chat
type: ifbench_aa_v2
target:
api_endpoint:
stream: false
```
:::
::::
# livecodebench
This page contains all evaluation tasks for the **livecodebench** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [codeexecution_v2](#livecodebench-codeexecution-v2)
- “Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result.
* - [codeexecution_v2_cot](#livecodebench-codeexecution-v2-cot)
  - Chain-of-Thought version of the “Execute” task, evaluating code comprehension ability: the model is given a program and an input, and the output should be the result.
* - [codegeneration_notfast](#livecodebench-codegeneration-notfast)
  - Non-fast version of code generation (v2), run with the `--not_fast` flag.
* - [codegeneration_release_latest](#livecodebench-codegeneration-release-latest)
  - Latest release of the code generation dataset.
* - [codegeneration_release_v1](#livecodebench-codegeneration-release-v1)
  - The initial release of the dataset (v1), with problems released between May 2023 and Mar 2024, containing 400 problems.
* - [codegeneration_release_v2](#livecodebench-codegeneration-release-v2)
  - The updated release of the dataset (v2), with problems released between May 2023 and May 2024, containing 511 problems.
* - [codegeneration_release_v3](#livecodebench-codegeneration-release-v3)
  - The updated release of the dataset (v3), with problems released between May 2023 and Jul 2024, containing 612 problems.
* - [codegeneration_release_v4](#livecodebench-codegeneration-release-v4)
  - The updated release of the dataset (v4), with problems released between May 2023 and Sep 2024, containing 713 problems.
* - [codegeneration_release_v5](#livecodebench-codegeneration-release-v5)
  - The updated release of the dataset (v5), with problems released between May 2023 and Jan 2025, containing 880 problems.
* - [codegeneration_release_v6](#livecodebench-codegeneration-release-v6)
  - The updated release of the dataset (v6), with problems released between May 2023 and Apr 2025, containing 1055 problems.
* - [livecodebench_0724_0125](#livecodebench-livecodebench-0724-0125)
  - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Uses the data period and sampling parameters from Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking).
* - [livecodebench_0824_0225](#livecodebench-livecodebench-0824-0225)
  - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Uses the data period and sampling parameters from the NeMo Alignment team.
* - [livecodebench_aa_v2](#livecodebench-livecodebench-aa-v2)
  - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Uses the data period and sampling parameters from Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking).
* - [testoutputprediction](#livecodebench-testoutputprediction)
  - Solve the natural language task on a specified input, evaluating the ability to generate test outputs. The model is given the natural language problem description and an input, and must produce the expected output for the problem.
```
(livecodebench-codeexecution-v2)=
## codeexecution_v2
“Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codeexecution_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codeexecution
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v2
supported_endpoint_types:
- chat
type: codeexecution_v2
target:
api_endpoint: {}
```
:::
::::
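The multi-line `livecodebench` command template above renders to a single invocation once its placeholders are resolved. The sketch below substitutes the documented `codeexecution_v2` defaults; the model id, URL, and output directory are hypothetical examples, and the `API_KEY` export prefix is omitted:

```shell
# Hypothetical endpoint values (illustrative assumptions only).
MODEL_ID="meta/llama-3.1-8b-instruct"
MODEL_URL="http://localhost:8000/v1/chat/completions"
OUT_DIR="/tmp/lcb-results"

# Render with the documented defaults for codeexecution_v2: scenario
# codeexecution, release_v2, n_samples 10, top_p 1e-05, max_tokens 4096.
# start_date/end_date are null and cot_code_execution is false, so those
# flags are omitted from the rendered command.
CMD="livecodebench --model ${MODEL_ID} --scenario codeexecution --release_version release_v2"
CMD="$CMD --url ${MODEL_URL} --temperature 0.0 --top_p 1e-05 --evaluate --codegen_n 10"
CMD="$CMD --use_cache --cache_batch_size 10 --num_process_evaluate 5 --n 10"
CMD="$CMD --max_tokens 4096 --out_dir ${OUT_DIR} --multiprocess 10 --max_retries 5 --timeout 60"

echo "$CMD"
```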
---
(livecodebench-codeexecution-v2-cot)=
## codeexecution_v2_cot
Chain-of-Thought version of the “Execute” task, evaluating code comprehension ability: the model is given a program and an input, and the output should be the result.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codeexecution_v2_cot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codeexecution
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: true
release_version: release_v2
supported_endpoint_types:
- chat
type: codeexecution_v2_cot
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-codegeneration-notfast)=
## codegeneration_notfast
Non-fast version of code generation (v2), run with the `--not_fast` flag.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_notfast`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
args: --not_fast
supported_endpoint_types:
- chat
type: codegeneration_notfast
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-codegeneration-release-latest)=
## codegeneration_release_latest
Latest release of the code generation dataset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_release_latest`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_latest
supported_endpoint_types:
- chat
type: codegeneration_release_latest
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-codegeneration-release-v1)=
## codegeneration_release_v1
The initial release of the dataset (v1), with problems released between May 2023 and Mar 2024, containing 400 problems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_release_v1`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v1
supported_endpoint_types:
- chat
type: codegeneration_release_v1
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-codegeneration-release-v2)=
## codegeneration_release_v2
The updated release of the dataset (v2), with problems released between May 2023 and May 2024, containing 511 problems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_release_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v2
supported_endpoint_types:
- chat
type: codegeneration_release_v2
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-codegeneration-release-v3)=
## codegeneration_release_v3
The updated release of the dataset (v3), with problems released between May 2023 and Jul 2024, containing 612 problems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_release_v3`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v3
supported_endpoint_types:
- chat
type: codegeneration_release_v3
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-codegeneration-release-v4)=
## codegeneration_release_v4
The updated release of the dataset (v4), with problems released between May 2023 and Sep 2024, containing 713 problems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_release_v4`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v4
supported_endpoint_types:
- chat
type: codegeneration_release_v4
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-codegeneration-release-v5)=
## codegeneration_release_v5
The updated release of the dataset (v5), with problems released between May 2023 and Jan 2025, containing 880 problems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_release_v5`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: codegeneration_release_v5
target:
api_endpoint: {}
```
:::
::::
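The Defaults above can be overridden per run. As a hedged sketch (field paths are taken from the Defaults tab; `limit_samples` maps to `--first_n` in the rendered command, and how the fragment is merged depends on how you launch the evaluation):

```yaml
# Hypothetical override fragment; merge mechanics depend on your launcher setup.
config:
  params:
    limit_samples: 50     # smoke-test on the first 50 problems (--first_n)
    max_new_tokens: 8192  # allow longer completions (--max_tokens)
    extra:
      n_samples: 1        # single sample per problem (--n / --codegen_n)
```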
---
(livecodebench-codegeneration-release-v6)=
## codegeneration_release_v6
The updated release of the dataset (v6), with 1055 problems released between May 2023 and Apr 2025.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `codegeneration_release_v6`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_v6
supported_endpoint_types:
- chat
type: codegeneration_release_v6
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-livecodebench-0724-0125)=
## livecodebench_0724_0125
Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. This task uses the data period and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking).
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `livecodebench_0724_0125`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 3
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: 2024-07-01
end_date: 2025-01-01
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: livecodebench_0724_0125
target:
api_endpoint: {}
```
:::
::::
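Date-windowed variants such as this one are the base code-generation task with `extra.start_date` and `extra.end_date` pinned, which render to the `--start_date`/`--end_date` flags in the Command tab. A hypothetical override selecting a different window (the dates below are placeholders):

```yaml
# Hypothetical override fragment; field paths follow the Defaults tab.
config:
  params:
    extra:
      start_date: 2025-01-01  # lower bound of the problem-release window
      end_date: 2025-04-01    # upper bound of the problem-release window
```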
---
(livecodebench-livecodebench-0824-0225)=
## livecodebench_0824_0225
Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. This task uses the data period and sampling parameters used by the NeMo Alignment team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `livecodebench_0824_0225`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 3
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: 2024-08-01
end_date: 2025-02-01
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: livecodebench_0824_0225
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-livecodebench-aa-v2)=
## livecodebench_aa_v2
Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. This task uses the data period and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking).
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `livecodebench_aa_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 16384
max_retries: 30
parallelism: 10
task: codegeneration
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 3
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: 2024-07-01
end_date: 2025-01-01
cot_code_execution: false
release_version: release_v5
supported_endpoint_types:
- chat
type: livecodebench_aa_v2
target:
api_endpoint: {}
```
:::
::::
---
(livecodebench-testoutputprediction)=
## testoutputprediction
Test output prediction: evaluates the ability to generate test outputs for a natural language task on a specified input. The model is given the natural language problem description and a test input, and must produce the corresponding output.
::::{tab-set}
:::{tab-item} Container
**Harness:** `livecodebench`
**Container:**
```
nvcr.io/nvidia/eval-factory/livecodebench:26.01
```
**Container Digest:**
```
sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e
```
**Container Arch:** `multiarch`
**Task Type:** `testoutputprediction`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} livecodebench --model {{target.api_endpoint.model_id}} \
--scenario {{config.params.task}} \
--release_version {{config.params.extra.release_version}} \
--url {{target.api_endpoint.url}} \
--temperature {{config.params.temperature}} \
--top_p {{config.params.top_p}} \
--evaluate \
--codegen_n {{config.params.extra.n_samples}} \
--use_cache \
--cache_batch_size {{config.params.extra.cache_batch_size}} \
--num_process_evaluate {{config.params.extra.num_process_evaluate}} \
--n {{config.params.extra.n_samples}} \
--max_tokens {{config.params.max_new_tokens}} \
--out_dir {{config.output_dir}} \
--multiprocess {{config.params.parallelism}} \
--max_retries {{config.params.max_retries}} \
--timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: livecodebench
pkg_name: livecodebench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
task: testoutputprediction
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
n_samples: 10
num_process_evaluate: 5
cache_batch_size: 10
support_system_role: false
start_date: null
end_date: null
cot_code_execution: false
release_version: release_latest
supported_endpoint_types:
- chat
type: testoutputprediction
target:
api_endpoint: {}
```
:::
::::
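All of the livecodebench tasks above read the endpoint from `target.api_endpoint`, matching the `target.api_endpoint.*` variables in the Command tabs. A hypothetical target block (model id, URL, and key name are placeholders):

```yaml
# Hypothetical target fragment; key names follow the Command template variables.
target:
  api_endpoint:
    model_id: my-org/my-model                       # placeholder model identifier
    url: http://localhost:8000/v1/chat/completions  # chat endpoint (supported_endpoint_types: chat)
    api_key_name: MY_API_KEY                        # env var exported as API_KEY by the command
```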
# lm-evaluation-harness
This page contains all evaluation tasks for the **lm-evaluation-harness** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [adlr_agieval_en_cot](#lm-evaluation-harness-adlr-agieval-en-cot)
- Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_arc_challenge_llama_25_shot](#lm-evaluation-harness-adlr-arc-challenge-llama-25-shot)
- ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_commonsense_qa_7_shot](#lm-evaluation-harness-adlr-commonsense-qa-7-shot)
- CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_global_mmlu_lite_5_shot](#lm-evaluation-harness-adlr-global-mmlu-lite-5-shot)
- Global-MMLU subset (8 languages - es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_gpqa_diamond_cot_5_shot](#lm-evaluation-harness-adlr-gpqa-diamond-cot-5-shot)
- Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_gsm8k_cot_8_shot](#lm-evaluation-harness-adlr-gsm8k-cot-8-shot)
- GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_humaneval_greedy](#lm-evaluation-harness-adlr-humaneval-greedy)
- HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_humaneval_sampled](#lm-evaluation-harness-adlr-humaneval-sampled)
- HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_math_500_4_shot_sampled](#lm-evaluation-harness-adlr-math-500-4-shot-sampled)
- MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_mbpp_sanitized_3_shot_greedy](#lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-greedy)
- MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_mbpp_sanitized_3_shot_sampled](#lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-sampled)
- MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_mgsm_native_cot_8_shot](#lm-evaluation-harness-adlr-mgsm-native-cot-8-shot)
- MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_minerva_math_nemo_4_shot](#lm-evaluation-harness-adlr-minerva-math-nemo-4-shot)
- Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_mmlu](#lm-evaluation-harness-adlr-mmlu)
- MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_mmlu_pro_5_shot_base](#lm-evaluation-harness-adlr-mmlu-pro-5-shot-base)
- MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_race](#lm-evaluation-harness-adlr-race)
- RACE version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_truthfulqa_mc2](#lm-evaluation-harness-adlr-truthfulqa-mc2)
- TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [adlr_winogrande_5_shot](#lm-evaluation-harness-adlr-winogrande-5-shot)
- Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR).
* - [agieval](#lm-evaluation-harness-agieval)
- AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models
* - [arc_challenge](#lm-evaluation-harness-arc-challenge)
- The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.
* - [arc_challenge_chat](#lm-evaluation-harness-arc-challenge-chat)
  - The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. This variant applies a chat template and defaults to zero-shot evaluation.
* - [arc_multilingual](#lm-evaluation-harness-arc-multilingual)
- The multilingual versions of the ARC challenge dataset.
* - [bbh](#lm-evaluation-harness-bbh)
- The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.
* - [bbh_instruct](#lm-evaluation-harness-bbh-instruct)
  - The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. This variant applies a chat template and defaults to zero-shot evaluation.
* - [bbq_chat](#lm-evaluation-harness-bbq-chat)
- The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint).
* - [bbq_completions](#lm-evaluation-harness-bbq-completions)
- The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint).
* - [commonsense_qa](#lm-evaluation-harness-commonsense-qa)
  - CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers.
* - [global_mmlu](#lm-evaluation-harness-global-mmlu)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. It is designed for efficient evaluation of multilingual models in 15 languages (including English).
* - [global_mmlu_ar](#lm-evaluation-harness-global-mmlu-ar)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the AR subset.
* - [global_mmlu_bn](#lm-evaluation-harness-global-mmlu-bn)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the BN subset.
* - [global_mmlu_de](#lm-evaluation-harness-global-mmlu-de)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the DE subset.
* - [global_mmlu_en](#lm-evaluation-harness-global-mmlu-en)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the EN subset.
* - [global_mmlu_es](#lm-evaluation-harness-global-mmlu-es)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ES subset.
* - [global_mmlu_fr](#lm-evaluation-harness-global-mmlu-fr)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the FR subset.
* - [global_mmlu_full](#lm-evaluation-harness-global-mmlu-full)
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
* - [global_mmlu_full_am](#lm-evaluation-harness-global-mmlu-full-am)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AM subset.
* - [global_mmlu_full_ar](#lm-evaluation-harness-global-mmlu-full-ar)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AR subset.
* - [global_mmlu_full_bn](#lm-evaluation-harness-global-mmlu-full-bn)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the BN subset.
* - [global_mmlu_full_cs](#lm-evaluation-harness-global-mmlu-full-cs)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the CS subset.
* - [global_mmlu_full_de](#lm-evaluation-harness-global-mmlu-full-de)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the DE subset.
* - [global_mmlu_full_el](#lm-evaluation-harness-global-mmlu-full-el)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EL subset.
* - [global_mmlu_full_en](#lm-evaluation-harness-global-mmlu-full-en)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EN subset.
* - [global_mmlu_full_es](#lm-evaluation-harness-global-mmlu-full-es)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ES subset.
* - [global_mmlu_full_fa](#lm-evaluation-harness-global-mmlu-full-fa)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FA subset.
* - [global_mmlu_full_fil](#lm-evaluation-harness-global-mmlu-full-fil)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FIL subset.
* - [global_mmlu_full_fr](#lm-evaluation-harness-global-mmlu-full-fr)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the FR subset.
* - [global_mmlu_full_ha](#lm-evaluation-harness-global-mmlu-full-ha)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HA subset.
* - [global_mmlu_full_he](#lm-evaluation-harness-global-mmlu-full-he)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HE subset.
* - [global_mmlu_full_hi](#lm-evaluation-harness-global-mmlu-full-hi)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the HI subset.
* - [global_mmlu_full_id](#lm-evaluation-harness-global-mmlu-full-id)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ID subset.
* - [global_mmlu_full_ig](#lm-evaluation-harness-global-mmlu-full-ig)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the IG subset.
* - [global_mmlu_full_it](#lm-evaluation-harness-global-mmlu-full-it)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the IT subset.
* - [global_mmlu_full_ja](#lm-evaluation-harness-global-mmlu-full-ja)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the JA subset.
* - [global_mmlu_full_ko](#lm-evaluation-harness-global-mmlu-full-ko)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KO subset.
* - [global_mmlu_full_ky](#lm-evaluation-harness-global-mmlu-full-ky)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KY subset.
* - [global_mmlu_full_lt](#lm-evaluation-harness-global-mmlu-full-lt)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the LT subset.
* - [global_mmlu_full_mg](#lm-evaluation-harness-global-mmlu-full-mg)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MG subset.
* - [global_mmlu_full_ms](#lm-evaluation-harness-global-mmlu-full-ms)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MS subset.
* - [global_mmlu_full_ne](#lm-evaluation-harness-global-mmlu-full-ne)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NE subset.
* - [global_mmlu_full_nl](#lm-evaluation-harness-global-mmlu-full-nl)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NL subset.
* - [global_mmlu_full_ny](#lm-evaluation-harness-global-mmlu-full-ny)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NY subset.
* - [global_mmlu_full_pl](#lm-evaluation-harness-global-mmlu-full-pl)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PL subset.
* - [global_mmlu_full_pt](#lm-evaluation-harness-global-mmlu-full-pt)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PT subset.
* - [global_mmlu_full_ro](#lm-evaluation-harness-global-mmlu-full-ro)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the RO subset.
* - [global_mmlu_full_ru](#lm-evaluation-harness-global-mmlu-full-ru)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the RU subset.
* - [global_mmlu_full_si](#lm-evaluation-harness-global-mmlu-full-si)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the SI subset.
* - [global_mmlu_full_sn](#lm-evaluation-harness-global-mmlu-full-sn)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the SN subset.
* - [global_mmlu_full_so](#lm-evaluation-harness-global-mmlu-full-so)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the SO subset.
* - [global_mmlu_full_sr](#lm-evaluation-harness-global-mmlu-full-sr)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the SR subset.
* - [global_mmlu_full_sv](#lm-evaluation-harness-global-mmlu-full-sv)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the SV subset.
* - [global_mmlu_full_sw](#lm-evaluation-harness-global-mmlu-full-sw)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the SW subset.
* - [global_mmlu_full_te](#lm-evaluation-harness-global-mmlu-full-te)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the TE subset.
* - [global_mmlu_full_tr](#lm-evaluation-harness-global-mmlu-full-tr)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the TR subset.
* - [global_mmlu_full_uk](#lm-evaluation-harness-global-mmlu-full-uk)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the UK subset.
* - [global_mmlu_full_vi](#lm-evaluation-harness-global-mmlu-full-vi)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the VI subset.
* - [global_mmlu_full_yo](#lm-evaluation-harness-global-mmlu-full-yo)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the YO subset.
* - [global_mmlu_full_zh](#lm-evaluation-harness-global-mmlu-full-zh)
  - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the ZH subset.
* - [global_mmlu_hi](#lm-evaluation-harness-global-mmlu-hi)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the HI subset.
* - [global_mmlu_id](#lm-evaluation-harness-global-mmlu-id)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ID subset.
* - [global_mmlu_it](#lm-evaluation-harness-global-mmlu-it)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the IT subset.
* - [global_mmlu_ja](#lm-evaluation-harness-global-mmlu-ja)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the JA subset.
* - [global_mmlu_ko](#lm-evaluation-harness-global-mmlu-ko)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the KO subset.
* - [global_mmlu_pt](#lm-evaluation-harness-global-mmlu-pt)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the PT subset.
* - [global_mmlu_sw](#lm-evaluation-harness-global-mmlu-sw)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the SW subset.
* - [global_mmlu_yo](#lm-evaluation-harness-global-mmlu-yo)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the YO subset.
* - [global_mmlu_zh](#lm-evaluation-harness-global-mmlu-zh)
  - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ZH subset.
* - [gpqa](#lm-evaluation-harness-gpqa)
- The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry.
* - [gpqa_diamond_cot](#lm-evaluation-harness-gpqa-diamond-cot)
- The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation.
* - [gsm8k](#lm-evaluation-harness-gsm8k)
- The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.
* - [gsm8k_cot_instruct](#lm-evaluation-harness-gsm8k-cot-instruct)
- The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.
* - [gsm8k_cot_llama](#lm-evaluation-harness-gsm8k-cot-llama)
- The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought evaluation, with the implementation taken from Llama.
* - [gsm8k_cot_zeroshot](#lm-evaluation-harness-gsm8k-cot-zeroshot)
- The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation.
* - [gsm8k_cot_zeroshot_llama](#lm-evaluation-harness-gsm8k-cot-zeroshot-llama)
- The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation, with the implementation taken from Llama.
* - [hellaswag](#lm-evaluation-harness-hellaswag)
- The HellaSwag benchmark tests a language model's commonsense reasoning by having it choose the most logical ending for a given story.
* - [hellaswag_multilingual](#lm-evaluation-harness-hellaswag-multilingual)
- The multilingual versions of the HellaSwag benchmark.
* - [humaneval_instruct](#lm-evaluation-harness-humaneval-instruct)
- The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. Implementation taken from Llama.
* - [ifeval](#lm-evaluation-harness-ifeval)
- IFEval is a dataset designed to test a model's ability to follow explicit instructions, such as "include keyword x" or "use format y." The focus is on the model's adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.
* - [m_mmlu_id_str_chat](#lm-evaluation-harness-m-mmlu-id-str-chat)
- The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint).
* - [m_mmlu_id_str_completions](#lm-evaluation-harness-m-mmlu-id-str-completions)
- The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint).
* - [mbpp_plus_chat](#lm-evaluation-harness-mbpp-plus-chat)
- MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint).
* - [mbpp_plus_completions](#lm-evaluation-harness-mbpp-plus-completions)
- MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint).
* - [mgsm](#lm-evaluation-harness-mgsm)
- The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages.
* - [mgsm_cot_chat](#lm-evaluation-harness-mgsm-cot-chat)
- The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the chat endpoint and defaults to chain-of-thought evaluation.
* - [mgsm_cot_completions](#lm-evaluation-harness-mgsm-cot-completions)
- The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the completions endpoint and defaults to chain-of-thought evaluation.
* - [mmlu](#lm-evaluation-harness-mmlu)
- The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses text generation.
* - [mmlu_cot_0_shot_chat](#lm-evaluation-harness-mmlu-cot-0-shot-chat)
- The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant defaults to chain-of-thought zero-shot evaluation.
* - [mmlu_instruct](#lm-evaluation-harness-mmlu-instruct)
- The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the chat endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.
* - [mmlu_instruct_completions](#lm-evaluation-harness-mmlu-instruct-completions)
- The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the completions endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.
* - [mmlu_logits](#lm-evaluation-harness-mmlu-logits)
- The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the model's logits to evaluate accuracy.
* - [mmlu_pro](#lm-evaluation-harness-mmlu-pro)
- MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint).
* - [mmlu_pro_instruct](#lm-evaluation-harness-mmlu-pro-instruct)
- MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4. This variant applies a chat template and defaults to zero-shot evaluation.
* - [mmlu_prox_chat](#lm-evaluation-harness-mmlu-prox-chat)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation (chat endpoint)
* - [mmlu_prox_completions](#lm-evaluation-harness-mmlu-prox-completions)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation (completions endpoint)
* - [mmlu_prox_de_chat](#lm-evaluation-harness-mmlu-prox-de-chat)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (chat endpoint)
* - [mmlu_prox_de_completions](#lm-evaluation-harness-mmlu-prox-de-completions)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (completions endpoint)
* - [mmlu_prox_es_chat](#lm-evaluation-harness-mmlu-prox-es-chat)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (chat endpoint)
* - [mmlu_prox_es_completions](#lm-evaluation-harness-mmlu-prox-es-completions)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (completions endpoint)
* - [mmlu_prox_fr_chat](#lm-evaluation-harness-mmlu-prox-fr-chat)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (chat endpoint)
* - [mmlu_prox_fr_completions](#lm-evaluation-harness-mmlu-prox-fr-completions)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (completions endpoint)
* - [mmlu_prox_it_chat](#lm-evaluation-harness-mmlu-prox-it-chat)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (chat endpoint)
* - [mmlu_prox_it_completions](#lm-evaluation-harness-mmlu-prox-it-completions)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (completions endpoint)
* - [mmlu_prox_ja_chat](#lm-evaluation-harness-mmlu-prox-ja-chat)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (chat endpoint)
* - [mmlu_prox_ja_completions](#lm-evaluation-harness-mmlu-prox-ja-completions)
- A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (completions endpoint)
* - [mmlu_redux](#lm-evaluation-harness-mmlu-redux)
- MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
* - [mmlu_redux_instruct](#lm-evaluation-harness-mmlu-redux-instruct)
- MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. This variant applies a chat template and defaults to zero-shot evaluation.
* - [musr](#lm-evaluation-harness-musr)
- The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.
* - [openbookqa](#lm-evaluation-harness-openbookqa)
- OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding of a subject. Answering OpenBookQA questions requires additional broad common knowledge, not contained in the book. The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
* - [piqa](#lm-evaluation-harness-piqa)
- Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models.
* - [social_iqa](#lm-evaluation-harness-social-iqa)
- Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations.
* - [truthfulqa](#lm-evaluation-harness-truthfulqa)
- The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions. It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions.
* - [wikilingua](#lm-evaluation-harness-wikilingua)
- The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems.
* - [wikitext](#lm-evaluation-harness-wikitext)
- The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods.
* - [winogrande](#lm-evaluation-harness-winogrande)
- WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options testing commonsense reasoning.
```
(lm-evaluation-harness-adlr-agieval-en-cot)=
## adlr_agieval_en_cot
Version of the AGIEval-EN-CoT benchmark used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_agieval_en_cot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_agieval_en_cot
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: adlr_agieval_en_cot
target:
api_endpoint:
stream: false
```
:::
::::
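Every ADLR task in this reference shares the same Jinja2 command template shown in the Command tab. As an illustrative sketch (not the full template), the snippet below renders a simplified fragment to show how the endpoint type selects the `lm-eval` model backend and how sampling parameters fold into `--gen_kwargs`; the config values mirror the Defaults tab but are supplied here as plain dictionaries for illustration:

```python
from jinja2 import Template

# Simplified fragment of the Command template above (not the full template).
fragment = (
    'lm-eval --tasks {{ config.params.task }}'
    '{% if target.api_endpoint.type == "completions" %} --model local-completions'
    '{% elif target.api_endpoint.type == "chat" %} --model local-chat-completions'
    '{% endif %}'
    '{% if config.params.temperature is not none %}'
    ' --gen_kwargs="temperature={{ config.params.temperature }}'
    '{% if config.params.top_p is not none %},top_p={{ config.params.top_p }}{% endif %}"'
    '{% endif %}'
)

# Values mirroring the adlr_agieval_en_cot defaults.
config = {"params": {"task": "adlr_agieval_en_cot", "temperature": 0.0, "top_p": 1e-05}}
target = {"api_endpoint": {"type": "completions"}}

command = Template(fragment).render(config=config, target=target)
print(command)
```

Rendering with a `chat` endpoint type instead selects `local-chat-completions` (and, in the full template, also adds `--fewshot_as_multiturn --apply_chat_template`).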
---
(lm-evaluation-harness-adlr-arc-challenge-llama-25-shot)=
## adlr_arc_challenge_llama_25_shot
ARC-Challenge-Llama version used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_arc_challenge_llama_25_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_arc_challenge_llama
temperature: 1.0
request_timeout: 30
top_p: 1.0
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 25
supported_endpoint_types:
- completions
type: adlr_arc_challenge_llama_25_shot
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-commonsense-qa-7-shot)=
## adlr_commonsense_qa_7_shot
CommonsenseQA version used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_commonsense_qa_7_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: commonsense_qa
temperature: 1.0
request_timeout: 30
top_p: 1.0
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 7
supported_endpoint_types:
- completions
type: adlr_commonsense_qa_7_shot
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-global-mmlu-lite-5-shot)=
## adlr_global_mmlu_lite_5_shot
Global-MMLU subset (8 languages: es, de, fr, zh, it, ja, pt, ko) used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_global_mmlu_lite_5_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_global_mmlu
temperature: 1.0
request_timeout: 30
top_p: 1.0
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: adlr_global_mmlu_lite_5_shot
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-gpqa-diamond-cot-5-shot)=
## adlr_gpqa_diamond_cot_5_shot
Version of the GPQA-Diamond-CoT benchmark used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_gpqa_diamond_cot_5_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_gpqa_diamond_cot_5_shot
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: adlr_gpqa_diamond_cot_5_shot
target:
api_endpoint:
stream: false
```
:::
::::
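One subtlety when reading the Defaults YAML blocks in these sections: YAML treats `null` and `None` differently. `tokenizer: null` loads as a true null, while `tokenizer_backend: None` loads as the literal string `"None"`, and scientific-notation values such as `top_p: 1.0e-05` load as floats. A small sketch with PyYAML, using values copied from the Defaults tab above, illustrates this:

```python
import yaml

# A fragment of the Defaults YAML shown above.
defaults = yaml.safe_load("""
config:
  params:
    temperature: 0.0
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      tokenized_requests: false
""")

params = defaults["config"]["params"]
print(type(params["extra"]["tokenizer"]))          # tokenizer is a real null
print(repr(params["extra"]["tokenizer_backend"]))  # 'None' is a string, not a null
print(params["top_p"])                             # parsed as a float
```

This distinction matters because the command template checks `config.params.extra.tokenizer is not none`: a string `"None"` would pass that check, while a YAML `null` would not.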
---
(lm-evaluation-harness-adlr-gsm8k-cot-8-shot)=
## adlr_gsm8k_cot_8_shot
GSM8K-CoT version used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_gsm8k_cot_8_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_gsm8k_fewshot_cot
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 8
supported_endpoint_types:
- completions
type: adlr_gsm8k_cot_8_shot
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-humaneval-greedy)=
## adlr_humaneval_greedy
HumanEval Greedy version used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_humaneval_greedy`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: adlr_humaneval_greedy
temperature: 0.0
request_timeout: 30
top_p: 1.0e-05
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: adlr_humaneval_greedy
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-humaneval-sampled)=
## adlr_humaneval_sampled
HumanEval Sampled version used by the NVIDIA Applied Deep Learning Research (ADLR) team.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_humaneval_sampled`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_humaneval_sampled
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: adlr_humaneval_sampled
target:
  api_endpoint:
    stream: false
```
:::
::::
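Note that in the command template above, `--gen_kwargs` is only emitted when at least one of `temperature`, `top_p`, or `max_new_tokens` is set (and `max_new_tokens` is passed to lm-eval as `max_gen_toks`). A minimal Python sketch of that conditional rendering, with an illustrative `params` dict:

```python
def render_gen_kwargs(params):
    """Mimic the template: emit --gen_kwargs only if any sampling param is set."""
    parts = []
    if params.get("temperature") is not None:
        parts.append(f"temperature={params['temperature']}")
    if params.get("top_p") is not None:
        parts.append(f"top_p={params['top_p']}")
    if params.get("max_new_tokens") is not None:
        # lm-eval names this generation knob max_gen_toks
        parts.append(f"max_gen_toks={params['max_new_tokens']}")
    return f'--gen_kwargs="{",".join(parts)}"' if parts else ""

print(render_gen_kwargs({"temperature": 0.6, "top_p": 0.95, "max_new_tokens": None}))
# --gen_kwargs="temperature=0.6,top_p=0.95"
```

This sketch joins the parts cleanly; it is not the SDK's actual renderer, just a readable restatement of the template's conditionals.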
---
(lm-evaluation-harness-adlr-math-500-4-shot-sampled)=
## adlr_math_500_4_shot_sampled
MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_math_500_4_shot_sampled`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_math_500_4_shot_sampled
    temperature: 0.7
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 4
  supported_endpoint_types:
    - completions
  type: adlr_math_500_4_shot_sampled
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-greedy)=
## adlr_mbpp_sanitized_3_shot_greedy
MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_mbpp_sanitized_3_shot_greedy`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mbpp_sanitized_3_shot_greedy
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 3
  supported_endpoint_types:
    - completions
  type: adlr_mbpp_sanitized_3_shot_greedy
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-sampled)=
## adlr_mbpp_sanitized_3_shot_sampled
MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_mbpp_sanitized_3_shot_sampled`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mbpp_sanitized_3shot_sampled
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 3
  supported_endpoint_types:
    - completions
  type: adlr_mbpp_sanitized_3_shot_sampled
target:
  api_endpoint:
    stream: false
```
:::
::::
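The values in a Defaults tab can be overridden in a user-supplied run configuration. As an illustration only (the field paths are assumptions inferred from the Defaults schema shown above, not an exhaustive override reference), a fragment that switches a sampled task to greedy decoding and raises client concurrency might look like:

```yaml
config:
  params:
    temperature: 0.0
    top_p: 1.0e-05
    parallelism: 32
    limit_samples: 100  # evaluate a subset while iterating
```

Any field left out of the override keeps its documented default.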
---
(lm-evaluation-harness-adlr-mgsm-native-cot-8-shot)=
## adlr_mgsm_native_cot_8_shot
MGSM native CoT subset (6 languages: es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_mgsm_native_cot_8_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mgsm_native_cot_8_shot
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 8
  supported_endpoint_types:
    - completions
  type: adlr_mgsm_native_cot_8_shot
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-minerva-math-nemo-4-shot)=
## adlr_minerva_math_nemo_4_shot
Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_minerva_math_nemo_4_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_minerva_math_nemo
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 4
  supported_endpoint_types:
    - completions
  type: adlr_minerva_math_nemo_4_shot
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-mmlu)=
## adlr_mmlu
MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_mmlu`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_str
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
      args: --trust_remote_code
  supported_endpoint_types:
    - completions
  type: adlr_mmlu
target:
  api_endpoint:
    stream: false
```
:::
::::
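A user-supplied configuration is overlaid onto documented defaults like the ones above, so only the fields you set change. A minimal sketch of such a recursive merge (the `deep_merge` helper and the dict shapes are illustrative, not the SDK's actual implementation):

```python
def deep_merge(base, override):
    """Recursively overlay `override` onto `base`, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Defaults abridged from the adlr_mmlu entry above.
defaults = {"params": {"task": "mmlu_str", "temperature": 0.0, "parallelism": 10,
                       "extra": {"num_fewshot": 5, "args": "--trust_remote_code"}}}
user = {"params": {"parallelism": 32, "extra": {"num_fewshot": 0}}}
effective = deep_merge(defaults, user)
# Untouched defaults survive; only the overridden leaves change.
```

After the merge, `effective` still carries `task: mmlu_str` and `args: --trust_remote_code` while `parallelism` and `num_fewshot` take the user's values.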
---
(lm-evaluation-harness-adlr-mmlu-pro-5-shot-base)=
## adlr_mmlu_pro_5_shot_base
MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_mmlu_pro_5_shot_base`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_mmlu_pro_5_shot_base
    temperature: 0.0
    request_timeout: 30
    top_p: 1.0e-05
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
    - completions
  type: adlr_mmlu_pro_5_shot_base
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-race)=
## adlr_race
RACE version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_race`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_race
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: adlr_race
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-truthfulqa-mc2)=
## adlr_truthfulqa_mc2
TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_truthfulqa_mc2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: adlr_truthfulqa_mc2
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: adlr_truthfulqa_mc2
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-adlr-winogrande-5-shot)=
## adlr_winogrande_5_shot
Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `adlr_winogrande_5_shot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: winogrande
    temperature: 1.0
    request_timeout: 30
    top_p: 1.0
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 5
  supported_endpoint_types:
    - completions
  type: adlr_winogrande_5_shot
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-agieval)=
## agieval
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `agieval`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: agieval
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: agieval
target:
  api_endpoint:
    stream: false
```
:::
::::
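As the command template in each entry shows, the endpoint type selects the lm-eval model backend: `completions` maps to `local-completions` and `chat` to `local-chat-completions` (chat endpoints additionally receive `--fewshot_as_multiturn --apply_chat_template`). A sketch of that dispatch, restated from the template:

```python
def lm_eval_backend(endpoint_type):
    """Map an API endpoint type to the lm-eval model backend, per the command template."""
    backends = {
        "completions": "local-completions",
        "chat": "local-chat-completions",
    }
    try:
        return backends[endpoint_type]
    except KeyError:
        # Each task lists its supported_endpoint_types; anything else is rejected.
        raise ValueError(f"unsupported endpoint type: {endpoint_type!r}")
```

Tasks such as `agieval` list only `completions` under `supported_endpoint_types`, so they always resolve to `local-completions`.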
---
(lm-evaluation-harness-arc-challenge)=
## arc_challenge
The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `arc_challenge`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: arc_challenge
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: arc_challenge
target:
api_endpoint:
stream: false
```
:::
::::
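The command shown in the Command tab is a Jinja2 template that gets rendered against the evaluation config before execution. As a sketch of how that expansion works, the snippet below renders a trimmed-down portion of the template; the endpoint URL and model name are illustrative placeholders, not defaults.

```python
from jinja2 import Template

# A trimmed version of the command template above, covering task selection,
# the completions/chat model switch, and a few model_args fields.
tpl = Template(
    'lm-eval --tasks {{ config.params.task }} '
    '--model {% if target.api_endpoint.type == "completions" %}local-completions'
    '{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} '
    '--model_args "base_url={{ target.api_endpoint.url }},'
    'model={{ target.api_endpoint.model_id }},'
    'num_concurrent={{ config.params.parallelism }}"'
)

rendered = tpl.render(
    config={"params": {"task": "arc_challenge", "parallelism": 10}},
    target={"api_endpoint": {
        "type": "completions",
        "url": "http://localhost:8000/v1/completions",  # placeholder endpoint
        "model_id": "my-model",                         # placeholder model name
    }},
)
print(rendered)
```

With these values the template selects `local-completions` (a chat endpoint would select `local-chat-completions` instead) and fills in the task name and concurrency from the Defaults tab.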
---
(lm-evaluation-harness-arc-challenge-chat)=
## arc_challenge_chat
- The ARC challenge dataset consists of 2,590 multiple-choice science exam questions.
- This variant applies a chat template and defaults to zero-shot evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `arc_challenge_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: arc_challenge_chat
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: arc_challenge_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-arc-multilingual)=
## arc_multilingual
The multilingual versions of the ARC challenge dataset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `arc_multilingual`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: arc_multilingual
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: arc_multilingual
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-bbh)=
## bbh
The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `bbh`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: leaderboard_bbh
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: bbh
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-bbh-instruct)=
## bbh_instruct
- The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with.
- This variant applies a chat template and defaults to zero-shot evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `bbh_instruct`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: bbh_zeroshot
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: bbh_instruct
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-bbq-chat)=
## bbq_chat
BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning nine categories: disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age. This variant uses the chat endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `bbq_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: bbq_generate
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: bbq_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-bbq-completions)=
## bbq_completions
BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning nine categories: disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age. This variant uses the completions endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `bbq_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: bbq_generate
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: bbq_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-commonsense-qa)=
## commonsense_qa
- CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
- It contains 12,102 questions with one correct answer and four distractor answers.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `commonsense_qa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: commonsense_qa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 7
supported_endpoint_types:
- completions
type: commonsense_qa
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu)=
## global_mmlu
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- It is designed for efficient evaluation of multilingual models in 15 languages (including English).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-ar)=
## global_mmlu_ar
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the Arabic (AR) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_ar`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_ar
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_ar
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-bn)=
## global_mmlu_bn
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the Bengali (BN) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_bn`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_bn
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_bn
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-de)=
## global_mmlu_de
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the German (DE) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_de`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_de
target:
  api_endpoint:
    stream: false
```
:::
::::
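One YAML subtlety worth noting in the Defaults tabs: `tokenizer: null` parses to a true null, whereas the bare token `None` in `tokenizer_backend: None` parses to the *string* `"None"` (YAML only treats `null`, `~`, and the empty value as null). A quick check with PyYAML:

```python
import yaml

# Reproduce the two spellings used in the Defaults YAML above.
defaults = yaml.safe_load("""
config:
  params:
    extra:
      tokenizer: null
      tokenizer_backend: None
""")

extra = defaults["config"]["params"]["extra"]
print(extra["tokenizer"])                 # a real null (Python None)
print(repr(extra["tokenizer_backend"]))  # the string 'None'
```

This distinction matters when the value is substituted into the command template: `tokenizer_backend=None` is passed through as literal text.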
---
(lm-evaluation-harness-global-mmlu-en)=
## global_mmlu_en
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the EN subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_en`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_en
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_en
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-es)=
## global_mmlu_es
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the ES subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_es`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_es
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-fr)=
## global_mmlu_fr
Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. This variant uses the FR subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_fr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_fr
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full)=
## global_mmlu_full
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full
target:
  api_endpoint:
    stream: false
```
:::
::::
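For orientation, the `--model_args` value in each command is a comma-joined list of `key=value` pairs drawn from the target endpoint and the params shown in the Defaults tab. A rough sketch of that assembly (the endpoint URL and model name below are made-up placeholders):

```python
# Values mirror the Defaults tab; base_url and model are placeholders.
pairs = {
    "base_url": "http://localhost:8000/v1/completions",
    "model": "my-model",
    "tokenized_requests": "False",
    "num_concurrent": 10,   # from config.params.parallelism
    "timeout": 30,          # from config.params.request_timeout
    "max_retries": 5,       # from config.params.max_retries
    "stream": "False",      # from target.api_endpoint.stream
}

# Join into the comma-separated string lm-eval expects for --model_args.
model_args = ",".join(f"{key}={value}" for key, value in pairs.items())
print(model_args)
```

Optional entries such as `tokenizer` are appended only when set, which is why the template guards them with `{% if ... is not none %}`.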
---
(lm-evaluation-harness-global-mmlu-full-am)=
## global_mmlu_full_am
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AM subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_am`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_am
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_am
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ar)=
## global_mmlu_full_ar
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the AR subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ar`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ar
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ar
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-bn)=
## global_mmlu_full_bn
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the BN subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_bn`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_bn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_bn
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-cs)=
## global_mmlu_full_cs
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the CS subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_cs`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_cs
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_cs
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-de)=
## global_mmlu_full_de
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the DE subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_de`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_de
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-el)=
## global_mmlu_full_el
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EL subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_el`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_el
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_el
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-en)=
## global_mmlu_full_en
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the EN subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_en`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_en
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_en
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-es)=
## global_mmlu_full_es
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Spanish (ES) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_es`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_es
target:
  api_endpoint:
    stream: false
```
:::
::::
---
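The Jinja command template shown in each task's Command tab is dense. As an illustration only, the following Python sketch mimics how the default parameters above render into a concrete `lm-eval` invocation for a completions endpoint; the endpoint URL, model ID, and output directory are placeholder assumptions, not real defaults.

```python
# Sketch: assemble the lm-eval command from the catalog's default parameters.
# The endpoint URL, model ID, and output_dir below are hypothetical examples.
params = {
    "task": "global_mmlu_full_es",
    "parallelism": 10,
    "request_timeout": 30,
    "max_retries": 5,
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # hypothetical
    "model_id": "my-model",                         # hypothetical
    "stream": False,
}
output_dir = "/results"  # hypothetical

# --model_args is a single comma-separated string, as in the template.
model_args = ",".join([
    f"base_url={endpoint['url']}",
    f"model={endpoint['model_id']}",
    "tokenized_requests=False",
    f"num_concurrent={params['parallelism']}",
    f"timeout={params['request_timeout']}",
    f"max_retries={params['max_retries']}",
    f"stream={endpoint['stream']}",
])

cmd = (
    f"lm-eval --tasks {params['task']} --model local-completions "
    f'--model_args "{model_args}" --log_samples '
    f"--output_path {output_dir} --use_cache {output_dir}/lm_cache "
    f'--gen_kwargs="temperature={params["temperature"]},top_p={params["top_p"]}"'
)
print(cmd)
```

For a chat endpoint, the template instead selects `local-chat-completions` and appends `--fewshot_as_multiturn --apply_chat_template`, as the Jinja conditionals show.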
(lm-evaluation-harness-global-mmlu-full-fa)=
## global_mmlu_full_fa
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Persian (FA) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_fa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_fa
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-fil)=
## global_mmlu_full_fil
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Filipino (FIL) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_fil`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fil
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_fil
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-fr)=
## global_mmlu_full_fr
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the French (FR) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_fr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_fr
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ha)=
## global_mmlu_full_ha
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Hausa (HA) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ha`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ha
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ha
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-he)=
## global_mmlu_full_he
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Hebrew (HE) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_he`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_he
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_he
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-hi)=
## global_mmlu_full_hi
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Hindi (HI) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_hi`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_hi
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-id)=
## global_mmlu_full_id
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Indonesian (ID) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_id`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_id
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ig)=
## global_mmlu_full_ig
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Igbo (IG) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ig`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ig
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ig
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-it)=
## global_mmlu_full_it
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Italian (IT) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_it`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_it
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ja)=
## global_mmlu_full_ja
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the Japanese (JA) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ja`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ja
target:
  api_endpoint:
    stream: false
```
:::
::::
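The Jinja template in the Command tab assembles the final `lm-eval` invocation from the `config` and `target` fields shown in the Defaults tab. As an illustration, the simplified Python sketch below reproduces the core of that assembly logic; the endpoint URL and model ID are placeholder values, and many optional flags (caching, few-shot, downsampling) are omitted:

```python
def build_lm_eval_command(config: dict, target: dict) -> str:
    """Simplified sketch of the command template's assembly logic."""
    ep = target["api_endpoint"]
    # "completions" endpoints map to the local-completions backend,
    # "chat" endpoints to local-chat-completions
    model = "local-completions" if ep["type"] == "completions" else "local-chat-completions"
    model_args = (
        f"base_url={ep['url']},model={ep['model_id']},"
        f"num_concurrent={config['parallelism']},"
        f"timeout={config['request_timeout']},max_retries={config['max_retries']}"
    )
    cmd = [
        "lm-eval",
        "--tasks", config["task"],
        "--model", model,
        "--model_args", model_args,
        "--log_samples",
    ]
    # sampling parameters are forwarded via --gen_kwargs only when set
    if config.get("temperature") is not None:
        cmd.append(f"--gen_kwargs=temperature={config['temperature']},top_p={config['top_p']}")
    return " ".join(cmd)

# Values mirror the Defaults tab; URL and model ID are placeholders.
defaults = {"task": "global_mmlu_full_ja", "parallelism": 10,
            "request_timeout": 30, "max_retries": 5,
            "temperature": 1e-07, "top_p": 0.9999999}
endpoint = {"api_endpoint": {"type": "completions",
                             "url": "http://localhost:8000/v1/completions",
                             "model_id": "my-model"}}
print(build_lm_eval_command(defaults, endpoint))
```

The same pattern applies to every task on this page; only the `task` value changes between sections.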
---
(lm-evaluation-harness-global-mmlu-full-ko)=
## global_mmlu_full_ko
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KO (Korean) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ko`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_ko
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_ko
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ky)=
## global_mmlu_full_ky
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the KY (Kyrgyz) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ky`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_ky
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_ky
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-lt)=
## global_mmlu_full_lt
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the LT (Lithuanian) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_lt`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_lt
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_lt
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-mg)=
## global_mmlu_full_mg
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MG (Malagasy) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_mg`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_mg
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_mg
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ms)=
## global_mmlu_full_ms
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the MS (Malay) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ms`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_ms
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_ms
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ne)=
## global_mmlu_full_ne
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NE (Nepali) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ne`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_ne
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_ne
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-nl)=
## global_mmlu_full_nl
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NL (Dutch) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_nl`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_nl
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_nl
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ny)=
## global_mmlu_full_ny
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the NY (Nyanja/Chichewa) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ny`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_ny
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_ny
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-pl)=
## global_mmlu_full_pl
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PL (Polish) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_pl`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_pl
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_pl
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-pt)=
## global_mmlu_full_pt
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the PT (Portuguese) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_pt`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_pt
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_pt
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ro)=
## global_mmlu_full_ro
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the RO (Romanian) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ro`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: global_mmlu_full_ro
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: global_mmlu_full_ro
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-ru)=
## global_mmlu_full_ru
Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. This variant uses the RU (Russian) subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_ru`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_ru
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_ru
target:
  api_endpoint:
    stream: false
```
:::
::::
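The Command tab above is a Jinja2 template that the SDK fills in from the Defaults and your endpoint target. As a rough illustration only, the sketch below assembles the core of the resulting `lm-eval` invocation in plain Python from the default values shown; the endpoint URL and model name are placeholder assumptions, and several optional flags from the full template are omitted.

```python
# Sketch: how the Defaults above might render into an lm-eval command for a
# completions endpoint. The endpoint values below are assumptions, not real.
defaults = {
    "task": "global_mmlu_full_ru",
    "parallelism": 10,       # becomes num_concurrent
    "request_timeout": 30,   # becomes timeout
    "max_retries": 5,
    "temperature": 1.0e-07,
    "top_p": 0.9999999,
    "tokenized_requests": False,
}
endpoint = {
    "url": "http://localhost:8000/v1/completions",  # assumption
    "model_id": "my-model",                          # assumption
    "stream": False,
}

# Mirror the template's --model_args assembly (subset of the full template).
model_args = ",".join([
    f"base_url={endpoint['url']}",
    f"model={endpoint['model_id']}",
    f"tokenized_requests={defaults['tokenized_requests']}",
    f"num_concurrent={defaults['parallelism']}",
    f"timeout={defaults['request_timeout']}",
    f"max_retries={defaults['max_retries']}",
    f"stream={endpoint['stream']}",
])
temp = defaults["temperature"]
top_p = defaults["top_p"]
cmd = (
    f"lm-eval --tasks {defaults['task']} --model local-completions "
    f'--model_args "{model_args}" --log_samples '
    f'--gen_kwargs="temperature={temp},top_p={top_p}"'
)
print(cmd)
```

For `chat` endpoints the template instead selects `local-chat-completions` and appends `--fewshot_as_multiturn --apply_chat_template`; note this task only supports `completions`.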
---
(lm-evaluation-harness-global-mmlu-full-si)=
## global_mmlu_full_si
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the SI subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_si`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_si
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_si
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-sn)=
## global_mmlu_full_sn
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the SN subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_sn`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sn
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sn
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-so)=
## global_mmlu_full_so
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the SO subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_so`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_so
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_so
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-sr)=
## global_mmlu_full_sr
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the SR subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_sr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sr
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-sv)=
## global_mmlu_full_sv
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the SV subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_sv`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sv
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sv
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-sw)=
## global_mmlu_full_sw
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the SW subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_sw`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_sw
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_sw
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-te)=
## global_mmlu_full_te
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the TE subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_te`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_te
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_te
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-tr)=
## global_mmlu_full_tr
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the TR subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_tr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_tr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_tr
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-uk)=
## global_mmlu_full_uk
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the UK subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_uk`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_uk
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_uk
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-vi)=
## global_mmlu_full_vi
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the VI subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_vi`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_vi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
    - completions
  type: global_mmlu_full_vi
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-yo)=
## global_mmlu_full_yo
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the YO subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_yo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_yo
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_yo
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-full-zh)=
## global_mmlu_full_zh
- Global-MMLU is a multilingual evaluation set spanning 42 languages, including English.
- This variant uses the ZH subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_full_zh`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_full_zh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_full_zh
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-hi)=
## global_mmlu_hi
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the HI subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_hi`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_hi
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_hi
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-id)=
## global_mmlu_id
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the ID subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_id`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_id
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_id
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-it)=
## global_mmlu_it
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the IT subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_it`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_it
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-ja)=
## global_mmlu_ja
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the JA subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_ja`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ja
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-ko)=
## global_mmlu_ko
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the KO subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_ko`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_ko
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_ko
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-pt)=
## global_mmlu_pt
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the PT subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_pt`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_pt
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_pt
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-sw)=
## global_mmlu_sw
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the SW subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_sw`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_sw
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_sw
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-yo)=
## global_mmlu_yo
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the YO subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_yo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_yo
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_yo
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-global-mmlu-zh)=
## global_mmlu_zh
- Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks.
- This variant uses the ZH subset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `global_mmlu_zh`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: global_mmlu_zh
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: global_mmlu_zh
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(lm-evaluation-harness-gpqa)=
## gpqa
The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: leaderboard_gpqa
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: gpqa
target:
  api_endpoint:
    stream: false
```
:::
::::
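The Defaults tab lists every configurable field for this task, and the Command tab shows where each field is consumed (for example, `config.params.limit_samples` renders as `--limit`). As a minimal sketch of how a handful of these defaults might be overridden in a run configuration — the endpoint URL and model identifier below are hypothetical placeholders, and field paths simply mirror the schema shown above:

```yaml
# Hypothetical override of the gpqa defaults shown above.
# Field paths mirror the Defaults tab; all values are illustrative only.
config:
  params:
    limit_samples: 50     # evaluate a 50-sample subset instead of the full set
    parallelism: 4        # reduce request concurrency against the endpoint
    request_timeout: 60   # allow more time per request for slower endpoints
target:
  api_endpoint:
    type: completions     # gpqa supports only completions endpoints
    url: http://localhost:8000/v1/completions   # hypothetical local endpoint
    model_id: my-model                          # hypothetical model identifier
```

Fields left out of an override keep the defaults shown above; the same pattern applies to every task in this catalog.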
---
(lm-evaluation-harness-gpqa-diamond-cot)=
## gpqa_diamond_cot
The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_diamond_cot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gpqa_diamond_cot_zeroshot
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: gpqa_diamond_cot
target:
api_endpoint:
stream: false
```
:::
::::
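The `--gen_kwargs` portion of the command template is assembled conditionally from `temperature`, `top_p`, and `max_new_tokens`. As a minimal sketch (assuming the `jinja2` package is available), the fragment below reproduces just that portion of the template and renders it with the `gpqa_diamond_cot` default values:

```python
from jinja2 import Template

# The inner --gen_kwargs portion of the command template above.
fragment = (
    '{% if config.params.temperature is not none %}'
    'temperature={{ config.params.temperature }}{% endif %}'
    '{% if config.params.top_p is not none %},top_p={{ config.params.top_p }}{% endif %}'
    '{% if config.params.max_new_tokens is not none %}'
    ',max_gen_toks={{ config.params.max_new_tokens }}{% endif %}'
)

# Values taken from the gpqa_diamond_cot defaults shown above.
params = {"temperature": 1.0e-07, "top_p": 0.9999999, "max_new_tokens": 1024}
print(Template(fragment).render(config={"params": params}))
# temperature=1e-07,top_p=0.9999999,max_gen_toks=1024
```

Each parameter is emitted only when it is not `none`, so overriding a default to `null` drops it from the generated `lm-eval` invocation.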
---
(lm-evaluation-harness-gsm8k)=
## gsm8k
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `gsm8k`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: gsm8k
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: gsm8k
target:
api_endpoint:
stream: false
```
:::
::::
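The Command tab is a Jinja2 template: the Defaults values fill its placeholders to produce the final `lm-eval` invocation. As a minimal sketch (assuming the `jinja2` package is available), the snippet below renders a trimmed fragment of the template with the `gsm8k` defaults to show the substitution:

```python
from jinja2 import Template

# Trimmed fragment of the command template above (--tasks and --model only).
fragment = (
    'lm-eval --tasks {{config.params.task}}'
    ' --model {% if target.api_endpoint.type == "completions" %}local-completions'
    '{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %}'
)

# Example context mirroring the gsm8k defaults; endpoint type comes from the target.
context = {
    "config": {"params": {"task": "gsm8k"}},
    "target": {"api_endpoint": {"type": "completions"}},
}

print(Template(fragment).render(**context))
# lm-eval --tasks gsm8k --model local-completions
```

Because `gsm8k` only lists `completions` under `supported_endpoint_types`, the `local-completions` model backend is selected here.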
---
(lm-evaluation-harness-gsm8k-cot-instruct)=
## gsm8k_cot_instruct
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `gsm8k_cot_instruct`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: gsm8k_zeroshot_cot
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --add_instruction
supported_endpoint_types:
- chat
type: gsm8k_cot_instruct
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-gsm8k-cot-llama)=
## gsm8k_cot_llama
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought evaluation, with the implementation adapted from Llama.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `gsm8k_cot_llama`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gsm8k_cot_llama
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: gsm8k_cot_llama
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-gsm8k-cot-zeroshot)=
## gsm8k_cot_zeroshot
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `gsm8k_cot_zeroshot`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gsm8k_cot_zeroshot
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: gsm8k_cot_zeroshot
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-gsm8k-cot-zeroshot-llama)=
## gsm8k_cot_zeroshot_llama
The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation, with the implementation adapted from Llama.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `gsm8k_cot_zeroshot_llama`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: gsm8k_cot_llama
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: gsm8k_cot_zeroshot_llama
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-hellaswag)=
## hellaswag
The HellaSwag benchmark tests a language model's commonsense reasoning by having it choose the most logical ending for a given story.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `hellaswag`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: hellaswag
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 10
supported_endpoint_types:
- completions
type: hellaswag
target:
api_endpoint:
stream: false
```
:::
::::
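Note that `num_fewshot` only appears in the generated command when it is defined under `params.extra`, as in the `hellaswag` defaults above (`num_fewshot: 10`). A minimal sketch of that conditional, assuming the `jinja2` package is available:

```python
from jinja2 import Template

# The --num_fewshot conditional from the command template above.
fragment = (
    'lm-eval --tasks {{config.params.task}}'
    '{% if config.params.extra.num_fewshot is defined %}'
    ' --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %}'
)

# hellaswag defines num_fewshot under extra; tasks without it omit the flag.
with_fewshot = Template(fragment).render(
    config={"params": {"task": "hellaswag", "extra": {"num_fewshot": 10}}}
)
without_fewshot = Template(fragment).render(
    config={"params": {"task": "gsm8k", "extra": {}}}
)
print(with_fewshot)     # lm-eval --tasks hellaswag --num_fewshot 10
print(without_fewshot)  # lm-eval --tasks gsm8k
```

Tasks whose defaults omit `extra.num_fewshot` therefore fall back to the harness's own per-task few-shot configuration.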
---
(lm-evaluation-harness-hellaswag-multilingual)=
## hellaswag_multilingual
This task evaluates multilingual versions of the HellaSwag commonsense-reasoning benchmark.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `hellaswag_multilingual`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: hellaswag_multilingual
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 10
supported_endpoint_types:
- completions
type: hellaswag_multilingual
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-humaneval-instruct)=
## humaneval_instruct
The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. The implementation is adapted from Llama.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `humaneval_instruct`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: humaneval_instruct
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: humaneval_instruct
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-ifeval)=
## ifeval
IFEval is a dataset designed to test a model's ability to follow explicit instructions, such as "include keyword x" or "use format y." The focus is on the model's adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `ifeval`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: ifeval
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: ifeval
target:
api_endpoint:
stream: false
```
:::
::::
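Because `ifeval` supports only the `chat` endpoint type, the template takes the chat branch: it selects `local-chat-completions` and appends `--fewshot_as_multiturn --apply_chat_template`. A minimal sketch of that branch, assuming the `jinja2` package is available:

```python
from jinja2 import Template

# Endpoint-type branch of the command template above: chat endpoints switch the
# model backend and add the chat-template flags.
fragment = (
    '--model {% if target.api_endpoint.type == "completions" %}local-completions'
    '{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %}'
    '{% if target.api_endpoint.type == "chat" %}'
    ' --fewshot_as_multiturn --apply_chat_template{% endif %}'
)

print(Template(fragment).render(target={"api_endpoint": {"type": "chat"}}))
# --model local-chat-completions --fewshot_as_multiturn --apply_chat_template
```

Pointing the same template at a `completions` endpoint would instead emit plain `--model local-completions` with neither chat flag.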
---
(lm-evaluation-harness-m-mmlu-id-str-chat)=
## m_mmlu_id_str_chat
The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian, with string-based evaluation against a chat endpoint.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `m_mmlu_id_str_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: m_mmlu_id_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code
supported_endpoint_types:
- chat
type: m_mmlu_id_str_chat
target:
api_endpoint:
stream: false
```
:::
::::
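The `--model_args` value in the command template above is assembled from `target.api_endpoint` and `config.params`. As a rough illustration of that assembly (a sketch that mirrors the Jinja conditionals shown above, not NeMo Evaluator's actual template renderer; the example endpoint URL and model name are placeholders):

```python
def build_model_args(endpoint, params):
    """Sketch of how the Jinja template builds lm-eval's --model_args string.

    `endpoint` and `params` are plain dicts standing in for
    target.api_endpoint and config.params.
    """
    extra = params["extra"]
    parts = [
        f"base_url={endpoint['url']}",
        f"model={endpoint['model_id']}",
        f"tokenized_requests={extra['tokenized_requests']}",
    ]
    # The tokenizer key is emitted only when a tokenizer is configured.
    if extra.get("tokenizer") is not None:
        parts.append(f"tokenizer={extra['tokenizer']}")
    parts += [
        f"tokenizer_backend={extra['tokenizer_backend']}",
        f"num_concurrent={params['parallelism']}",
        f"timeout={params['request_timeout']}",
        f"max_retries={params['max_retries']}",
        f"stream={endpoint['stream']}",
    ]
    return ",".join(parts)

# Placeholder endpoint; params mirror the Defaults tab above.
endpoint = {"url": "http://localhost:8000/v1/chat/completions",
            "model_id": "my-model", "stream": False}
params = {"parallelism": 10, "request_timeout": 30, "max_retries": 5,
          "extra": {"tokenized_requests": False, "tokenizer": None,
                    "tokenizer_backend": "None"}}
print(build_model_args(endpoint, params))
```

Overriding `config.params.parallelism` or `config.params.request_timeout` therefore directly changes `num_concurrent` and `timeout` inside `--model_args`.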
---
(lm-evaluation-harness-m-mmlu-id-str-completions)=
## m_mmlu_id_str_completions
The MMLU (Massive Multitask Language Understanding) benchmark translated into Indonesian, with string-based evaluation (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `m_mmlu_id_str_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: m_mmlu_id_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code
supported_endpoint_types:
- completions
type: m_mmlu_id_str_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mbpp-plus-chat)=
## mbpp_plus_chat
MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mbpp_plus_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mbpp_plus
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --confirm_run_unsafe_code
supported_endpoint_types:
- chat
type: mbpp_plus_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mbpp-plus-completions)=
## mbpp_plus_completions
MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mbpp_plus_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mbpp_plus
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --confirm_run_unsafe_code
supported_endpoint_types:
- completions
type: mbpp_plus_completions
target:
api_endpoint:
stream: false
```
:::
::::
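The command template selects lm-eval's `--model` value from the endpoint type, and appends two extra flags only for chat endpoints. A minimal Python sketch of those branches (illustrative only; the helper name is hypothetical):

```python
def model_flags(endpoint_type):
    """Map the API endpoint type to lm-eval's --model value plus the
    chat-only flags the template appends (sketch of the Jinja branches)."""
    if endpoint_type == "completions":
        return "local-completions", []
    if endpoint_type == "chat":
        # Chat endpoints also get few-shot turns as a multi-turn dialog
        # and the model's chat template applied to each request.
        return "local-chat-completions", ["--fewshot_as_multiturn",
                                          "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(model_flags("chat"))
```

This is why each task variant lists `supported_endpoint_types` in its Defaults tab: the rendered command differs between the `_chat` and `_completions` forms.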
---
(lm-evaluation-harness-mgsm)=
## mgsm
The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mgsm`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mgsm_direct
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mgsm
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mgsm-cot-chat)=
## mgsm_cot_chat
The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the chat endpoint and defaults to chain-of-thought evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mgsm_cot_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: mgsm_cot_native
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: mgsm_cot_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mgsm-cot-completions)=
## mgsm_cot_completions
The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. This variant uses the completions endpoint and defaults to chain-of-thought evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mgsm_cot_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: mgsm_cot_native
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- completions
type: mgsm_cot_completions
target:
api_endpoint:
stream: false
```
:::
::::
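In the command template, `--gen_kwargs` is emitted only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and `max_new_tokens` is passed to lm-eval under the name `max_gen_toks`. A hedged Python sketch of that conditional (names mirror the template variables; this is not the real renderer):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Return the --gen_kwargs flag lm-eval receives, or None when no
    sampling parameter is set (mirroring the Jinja conditionals above)."""
    if temperature is None and top_p is None and max_new_tokens is None:
        return None
    fields = []
    if temperature is not None:
        fields.append(f"temperature={temperature}")
    if top_p is not None:
        fields.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval calls this generation parameter max_gen_toks.
        fields.append(f"max_gen_toks={max_new_tokens}")
    return '--gen_kwargs="' + ",".join(fields) + '"'

# The mgsm_cot_completions defaults set all three parameters:
print(build_gen_kwargs(1.0e-07, 0.9999999, 1024))
```

The near-zero `temperature` and near-one `top_p` defaults keep generation effectively greedy while remaining valid values for OpenAI-compatible endpoints that reject exact 0.0/1.0.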
---
(lm-evaluation-harness-mmlu)=
## mmlu
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses text generation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
args: --trust_remote_code
supported_endpoint_types:
- completions
type: mmlu
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-cot-0-shot-chat)=
## mmlu_cot_0_shot_chat
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant defaults to zero-shot chain-of-thought evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_cot_0_shot_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_cot_0_shot_chat
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --trust_remote_code
supported_endpoint_types:
- chat
type: mmlu_cot_0_shot_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-instruct)=
## mmlu_instruct
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the chat endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_instruct`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code --add_instruction
supported_endpoint_types:
- chat
type: mmlu_instruct
target:
api_endpoint:
stream: false
```
:::
::::
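The Defaults tab shows nested values (for example `config.params.extra.args`) that can be overridden individually while the rest of the defaults stay in effect. A minimal sketch of such a recursive overlay, using plain dicts (illustrative only; NeMo Evaluator's actual configuration merge behavior may differ):

```python
import copy

def merge_overrides(defaults, overrides):
    """Recursively overlay user overrides on a defaults dict.

    Nested dicts merge key by key; any other value is replaced outright.
    The defaults dict itself is left unmodified.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged

# Defaults mirror the mmlu_instruct entry above; the override bumps
# concurrency and few-shot count without touching the other keys.
defaults = {"params": {"max_retries": 5, "parallelism": 10,
                       "extra": {"num_fewshot": 0,
                                 "args": "--trust_remote_code --add_instruction"}}}
overrides = {"params": {"parallelism": 32, "extra": {"num_fewshot": 5}}}
print(merge_overrides(defaults, overrides))
```

Untouched keys such as `max_retries` and `extra.args` survive the merge, which is why the Defaults tabs list every parameter the rendered command depends on.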
---
(lm-evaluation-harness-mmlu-instruct-completions)=
## mmlu_instruct_completions
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the completions endpoint, defaults to zero-shot evaluation, and instructs the model to produce a single-letter response.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_instruct_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_str
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --trust_remote_code --add_instruction
supported_endpoint_types:
- completions
type: mmlu_instruct_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-logits)=
## mmlu_logits
The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. This variant uses the model's logits to evaluate accuracy.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_logits`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: mmlu_logits
target:
api_endpoint:
stream: false
```
:::
::::
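The values in the Defaults tab can be overridden per run. As a minimal sketch (the key paths mirror the Defaults tab above, but the override mechanism and file layout depend on how you launch the evaluation), a configuration that smoke-tests `mmlu_logits` on a small sample might look like:

```yaml
# Hypothetical override fragment; keys mirror the Defaults tab above.
config:
  type: mmlu_logits
  params:
    task: mmlu
    limit_samples: 100    # evaluate only 100 samples; null runs the full set
    parallelism: 10       # concurrent requests against the endpoint
target:
  api_endpoint:
    stream: false         # mmlu_logits supports completions endpoints only
```

Any parameter shown in the Defaults tab that is not overridden keeps its default value.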
---
(lm-evaluation-harness-mmlu-pro)=
## mmlu_pro
MMLU-Pro is a refined version of the MMLU dataset with 10 answer choices instead of 4 (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pro`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_pro
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: mmlu_pro
target:
api_endpoint:
stream: false
```
:::
::::
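To make the Command template concrete: substituting the values from the Defaults tab and a hypothetical local completions endpoint (the base URL `http://localhost:8080/v1/completions`, model id `my-model`, and output directory `/results` are placeholders, not defaults), the template renders to approximately:

```
lm-eval --tasks mmlu_pro --num_fewshot 5 --model local-completions \
  --model_args "base_url=http://localhost:8080/v1/completions,model=my-model,tokenized_requests=False,tokenizer_backend=None,num_concurrent=10,timeout=30,max_retries=5,stream=False" \
  --log_samples --output_path /results --use_cache /results/lm_cache \
  --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

Because `api_key_name`, `limit_samples`, `tokenizer`, and `downsampling_ratio` default to null, the corresponding template branches are omitted from the rendered command.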
---
(lm-evaluation-harness-mmlu-pro-instruct)=
## mmlu_pro_instruct
MMLU-Pro is a refined version of the MMLU dataset with 10 answer choices instead of 4. This variant applies a chat template and defaults to zero-shot evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pro_instruct`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
task: mmlu_pro
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
supported_endpoint_types:
- chat
type: mmlu_pro_instruct
target:
api_endpoint:
stream: false
```
:::
::::
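The chat variant exposes the same parameter surface plus a generation budget. As a hedged sketch (keys mirror the Defaults tab above; how overrides are supplied depends on your launch method), capping response length while keeping near-greedy decoding might look like:

```yaml
# Hypothetical override fragment for the chat variant.
config:
  type: mmlu_pro_instruct
  params:
    max_new_tokens: 512    # default is 1024; lower to cap response length
    temperature: 1.0e-07   # near-greedy decoding, as in the defaults
    parallelism: 10
target:
  api_endpoint:
    stream: false          # this variant supports chat endpoints only
```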
---
(lm-evaluation-harness-mmlu-prox-chat)=
## mmlu_prox_chat
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation (chat endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_chat
target:
api_endpoint:
stream: false
```
:::
::::
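MMLU-ProX ships one task type per language and endpoint pairing (for example, `mmlu_prox_de_chat` later on this page). Selecting a variant comes down to the `type` and `task` keys; a minimal sketch, with keys mirroring that variant's Defaults tab:

```yaml
# Hypothetical selection of the German chat variant.
config:
  type: mmlu_prox_de_chat
  params:
    task: mmlu_prox_de
target:
  api_endpoint:
    stream: false
```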
---
(lm-evaluation-harness-mmlu-prox-completions)=
## mmlu_prox_completions
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-de-chat)=
## mmlu_prox_de_chat
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the German dataset (chat endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_de_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_de
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_de_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-de-completions)=
## mmlu_prox_de_completions
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the German dataset (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_de_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_de
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_de_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-es-chat)=
## mmlu_prox_es_chat
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the Spanish dataset (chat endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_es_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_es
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_es_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-es-completions)=
## mmlu_prox_es_completions
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the Spanish dataset (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_es_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_es
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_es_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-fr-chat)=
## mmlu_prox_fr_chat
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the French dataset (chat endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_fr_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_fr
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_fr_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-fr-completions)=
## mmlu_prox_fr_completions
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the French dataset (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_fr_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_fr
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_fr_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-it-chat)=
## mmlu_prox_it_chat
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the Italian dataset (chat endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_it_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_it
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_it_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-it-completions)=
## mmlu_prox_it_completions
MMLU-ProX is a multilingual benchmark for advanced large language model evaluation; this variant targets the Italian dataset (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_it_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_it
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_it_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-ja-chat)=
## mmlu_prox_ja_chat
The Japanese subset of MMLU-ProX, a multilingual benchmark for advanced large language model evaluation (chat endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_ja_chat`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_ja
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- chat
type: mmlu_prox_ja_chat
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-prox-ja-completions)=
## mmlu_prox_ja_completions
The Japanese subset of MMLU-ProX, a multilingual benchmark for advanced large language model evaluation (completions endpoint).
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_prox_ja_completions`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_prox_ja
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_prox_ja_completions
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-mmlu-redux)=
## mmlu_redux
MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_redux`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: mmlu_redux
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: mmlu_redux
target:
api_endpoint:
stream: false
```
:::
::::
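The command template above reads its endpoint settings from `target.api_endpoint`. The following sketch shows what a minimal target override might look like; the URL, model id, and variable name are placeholders, not defaults.

```yaml
target:
  api_endpoint:
    type: completions                          # must match a supported endpoint type for this task
    url: http://localhost:8000/v1/completions  # placeholder endpoint URL
    model_id: my-model                         # placeholder served-model name
    api_key_name: MY_API_KEY                   # name of the env var holding the API key
```

Note that `api_key_name` is the *name* of an environment variable: the template renders it as `OPENAI_API_KEY=$MY_API_KEY` in front of the `lm-eval` invocation.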
---
(lm-evaluation-harness-mmlu-redux-instruct)=
## mmlu_redux_instruct
- MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
- This variant applies a chat template and defaults to zero-shot evaluation.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_redux_instruct`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_new_tokens: 8192
max_retries: 5
parallelism: 10
task: mmlu_redux
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 0
args: --add_instruction
supported_endpoint_types:
- chat
type: mmlu_redux_instruct
target:
api_endpoint:
stream: false
```
:::
::::
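Any of the `config.params` values above can be overridden per run, and the command template also honors `config.params.limit_samples` for quick smoke tests. A hedged example of a params override, with the flag each value renders into:

```yaml
config:
  params:
    limit_samples: 50     # renders as `--limit 50`
    max_new_tokens: 4096  # renders into --gen_kwargs as max_gen_toks=4096
    parallelism: 4        # renders as num_concurrent=4 inside --model_args
```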
---
(lm-evaluation-harness-musr)=
## musr
The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `musr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: leaderboard_musr
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: musr
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-openbookqa)=
## openbookqa
- OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding of a subject.
- Answering OpenBookQA questions requires additional broad common knowledge, not contained in the book.
- The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `openbookqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: openbookqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: openbookqa
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-piqa)=
## piqa
Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `piqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: piqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: piqa
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-social-iqa)=
## social_iqa
Social IQa contains 38,000 multiple-choice questions for probing emotional and social intelligence in a variety of everyday situations.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `social_iqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: social_iqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --trust_remote_code
supported_endpoint_types:
- completions
type: social_iqa
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-truthfulqa)=
## truthfulqa
- The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions.
- It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `truthfulqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: truthfulqa
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
supported_endpoint_types:
- completions
type: truthfulqa
target:
api_endpoint:
stream: false
```
:::
::::
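The Defaults tab above shows the base configuration for this task; a run configuration typically overlays user-supplied values on top of it. The helper below is an illustrative standard-library sketch of that overlay, not the SDK's actual merge logic, and the override values are examples only.

```python
# Hypothetical sketch: overlay a user override onto the task defaults shown
# above without mutating either input. This mirrors the idea of config
# merging; it is not the SDK's implementation.
from copy import deepcopy

defaults = {
    "config": {
        "params": {
            "max_retries": 5,
            "parallelism": 10,
            "task": "truthfulqa",
            "temperature": 1.0e-07,
            "request_timeout": 30,
            "top_p": 0.9999999,
        },
        "type": "truthfulqa",
    },
    "target": {"api_endpoint": {"stream": False}},
}

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto a deep copy of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Example: cap the sample count and lower concurrency for a smoke test.
override = {"config": {"params": {"limit_samples": 50, "parallelism": 2}}}
merged = deep_merge(defaults, override)
```

Untouched keys (such as `max_retries` and the `target` block) survive the merge, while overridden leaves replace the defaults.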
---
(lm-evaluation-harness-wikilingua)=
## wikilingua
The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `wikilingua`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: wikilingua
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --trust_remote_code
supported_endpoint_types:
- chat
type: wikilingua
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-wikitext)=
## wikitext
- The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
- This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `wikitext`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: wikitext
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
args: --trust_remote_code
supported_endpoint_types:
- completions
type: wikitext
target:
api_endpoint:
stream: false
```
:::
::::
---
(lm-evaluation-harness-winogrande)=
## winogrande
WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options, designed to test commonsense reasoning.
::::{tab-set}
:::{tab-item} Container
**Harness:** `lm-evaluation-harness`
**Container:**
```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```
**Container Digest:**
```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```
**Container Arch:** `multiarch`
**Task Type:** `winogrande`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
params:
max_retries: 5
parallelism: 10
task: winogrande
temperature: 1.0e-07
request_timeout: 30
top_p: 0.9999999
extra:
tokenizer: null
tokenizer_backend: None
downsampling_ratio: null
tokenized_requests: false
num_fewshot: 5
supported_endpoint_types:
- completions
type: winogrande
target:
api_endpoint:
stream: false
```
:::
::::
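The `--gen_kwargs` portion of the lm-evaluation-harness command template above is only emitted when at least one of `temperature`, `top_p`, or `max_new_tokens` is set. The sketch below mirrors that conditional assembly in Python (the function name and dict shape are illustrative, not part of the harness; `join` is used instead of the template's per-flag commas, which is a slightly cleaner equivalent):

```python
def build_gen_kwargs(params):
    """Mimic the template's --gen_kwargs logic: include a sampling
    parameter only when it is set, and omit the flag entirely when
    none are set."""
    pairs = []
    if params.get("temperature") is not None:
        pairs.append(f"temperature={params['temperature']}")
    if params.get("top_p") is not None:
        pairs.append(f"top_p={params['top_p']}")
    if params.get("max_new_tokens") is not None:
        # lm-eval names this generation parameter max_gen_toks
        pairs.append(f"max_gen_toks={params['max_new_tokens']}")
    if not pairs:
        return ""
    return '--gen_kwargs="' + ",".join(pairs) + '"'

# With the defaults above (temperature=1.0e-07, top_p=0.9999999):
print(build_gen_kwargs({"temperature": 1.0e-07, "top_p": 0.9999999}))
```

Note that the near-zero `temperature` and near-one `top_p` defaults effectively produce greedy decoding while keeping both parameters explicitly set.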
# mmath
This page contains all evaluation tasks for the **mmath** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [mmath_ar](#mmath-mmath-ar)
- Arabic mmath
* - [mmath_en](#mmath-mmath-en)
- English mmath
* - [mmath_es](#mmath-mmath-es)
- Spanish mmath
* - [mmath_fr](#mmath-mmath-fr)
- French mmath
* - [mmath_ja](#mmath-mmath-ja)
- Japanese mmath
* - [mmath_ko](#mmath-mmath-ko)
- Korean mmath
* - [mmath_pt](#mmath-mmath-pt)
- Portuguese mmath
* - [mmath_th](#mmath-mmath-th)
- Thai mmath
* - [mmath_vi](#mmath-mmath-vi)
- Vietnamese mmath
* - [mmath_zh](#mmath-mmath-zh)
- Chinese mmath
```
(mmath-mmath-ar)=
## mmath_ar
Arabic mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_ar`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: ar
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_ar
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-en)=
## mmath_en
English mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_en`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: en
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_en
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-es)=
## mmath_es
Spanish mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_es`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: es
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_es
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-fr)=
## mmath_fr
French mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_fr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: fr
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_fr
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-ja)=
## mmath_ja
Japanese mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_ja`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
      language: ja
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_ja
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-ko)=
## mmath_ko
Korean mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_ko`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: ko
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_ko
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-pt)=
## mmath_pt
Portuguese mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_pt`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: pt
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_pt
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-th)=
## mmath_th
Thai mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_th`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: th
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_th
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-vi)=
## mmath_vi
Vietnamese mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_vi`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: vi
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_vi
target:
api_endpoint:
stream: false
```
:::
::::
---
(mmath-mmath-zh)=
## mmath_zh
Chinese mmath
::::{tab-set}
:::{tab-item} Container
**Harness:** `mmath`
**Container:**
```
nvcr.io/nvidia/eval-factory/mmath:26.01
```
**Container Digest:**
```
sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc
```
**Container Arch:** `multiarch`
**Task Type:** `mmath_zh`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mmath
pkg_name: mmath
config:
params:
max_new_tokens: 32768
max_retries: 5
parallelism: 8
temperature: 0.6
request_timeout: 3600
top_p: 0.95
extra:
language: zh
n_samples: 4
supported_endpoint_types:
- chat
type: mmath_zh
target:
api_endpoint:
stream: false
```
:::
::::
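Unlike the lm-evaluation-harness template, the mmath command template above emits every flag unconditionally (only `--limit` is guarded). Substituting the defaults into the template can be sketched as follows; the endpoint URL, model name, and output directory are placeholder assumptions, and `render_mmath` is an illustrative helper, not part of the package:

```python
# Defaults from the mmath_zh YAML above
defaults = {
    "max_new_tokens": 32768, "max_retries": 5, "parallelism": 8,
    "temperature": 0.6, "request_timeout": 3600, "top_p": 0.95,
    "language": "zh", "n_samples": 4,
}

def render_mmath(endpoint_url, model_id, p, output_dir="/results"):
    """Assemble the argv the Jinja2 template renders: every flag is
    always present except --limit, which appears only when
    limit_samples is set."""
    return [
        "mmath",
        "--model-url", endpoint_url,
        "--model-name", model_id,
        "--lang", p["language"],
        "--output-dir", output_dir,
        "--parallelism", str(p["parallelism"]),
        "--retries", str(p["max_retries"]),
        "--max-tokens", str(p["max_new_tokens"]),
        "--temperature", str(p["temperature"]),
        "--top-p", str(p["top_p"]),
        "--request-timeout", str(p["request_timeout"]),
        "--n-samples", str(p["n_samples"]),
    ]

cmd = render_mmath("http://localhost:8000/v1/chat/completions", "my-model", defaults)
print(" ".join(cmd))
```

The one-hour `request_timeout` and 32768-token budget reflect that each sample may require a long chain-of-thought generation, repeated `n_samples` times.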
# mtbench
This page contains all evaluation tasks for the **mtbench** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [mtbench](#mtbench-mtbench)
- Standard MT-Bench
* - [mtbench-cor1](#mtbench-mtbench-cor1)
- Corrected MT-Bench
```
(mtbench-mtbench)=
## mtbench
Standard MT-Bench
::::{tab-set}
:::{tab-item} Container
**Harness:** `mtbench`
**Container:**
```
nvcr.io/nvidia/eval-factory/mtbench:26.01
```
**Container Digest:**
```
sha256:69c930de81fdc8d3a55824fc7ebee9632c858ba56234f43ad9d1590e7fc861b1
```
**Container Arch:** `multiarch`
**Task Type:** `mtbench`
:::
:::{tab-item} Command
```bash
mtbench-evaluator {% if target.api_endpoint.model_id is not none %} --model {{target.api_endpoint.model_id}}{% endif %} {% if target.api_endpoint.url is not none %} --url {{target.api_endpoint.url}}{% endif %} {% if target.api_endpoint.api_key_name is not none %} --api_key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.request_timeout is not none %} --timeout {{config.params.request_timeout}}{% endif %} {% if config.params.max_retries is not none %} --max_retries {{config.params.max_retries}}{% endif %} {% if config.params.parallelism is not none %} --parallelism {{config.params.parallelism}}{% endif %} {% if config.params.max_new_tokens is not none %} --max_tokens {{config.params.max_new_tokens}}{% endif %} --workdir {{config.output_dir}} {% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %} --top_p {{config.params.top_p}}{% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %} --generate --judge {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key_name is not none %} --judge_api_key_name {{config.params.extra.judge.api_key_name}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mtbench
pkg_name: mtbench
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
request_timeout: 30
extra:
judge:
url: null
model_id: gpt-4
api_key_name: null
request_timeout: 60
max_retries: 16
temperature: 0.0
top_p: 0.0001
max_tokens: 2048
supported_endpoint_types:
- chat
type: mtbench
target:
api_endpoint: {}
```
:::
::::
---
(mtbench-mtbench-cor1)=
## mtbench-cor1
Corrected MT-Bench
::::{tab-set}
:::{tab-item} Container
**Harness:** `mtbench`
**Container:**
```
nvcr.io/nvidia/eval-factory/mtbench:26.01
```
**Container Digest:**
```
sha256:69c930de81fdc8d3a55824fc7ebee9632c858ba56234f43ad9d1590e7fc861b1
```
**Container Arch:** `multiarch`
**Task Type:** `mtbench-cor1`
:::
:::{tab-item} Command
```bash
mtbench-evaluator {% if target.api_endpoint.model_id is not none %} --model {{target.api_endpoint.model_id}}{% endif %} {% if target.api_endpoint.url is not none %} --url {{target.api_endpoint.url}}{% endif %} {% if target.api_endpoint.api_key_name is not none %} --api_key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.request_timeout is not none %} --timeout {{config.params.request_timeout}}{% endif %} {% if config.params.max_retries is not none %} --max_retries {{config.params.max_retries}}{% endif %} {% if config.params.parallelism is not none %} --parallelism {{config.params.parallelism}}{% endif %} {% if config.params.max_new_tokens is not none %} --max_tokens {{config.params.max_new_tokens}}{% endif %} --workdir {{config.output_dir}} {% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %} --top_p {{config.params.top_p}}{% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %} --generate --judge {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key_name is not none %} --judge_api_key_name {{config.params.extra.judge.api_key_name}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mtbench
pkg_name: mtbench
config:
params:
max_new_tokens: 1024
max_retries: 5
parallelism: 10
request_timeout: 30
extra:
judge:
url: null
model_id: gpt-4
api_key_name: null
request_timeout: 60
max_retries: 16
temperature: 0.0
top_p: 0.0001
max_tokens: 2048
args: --judge_reference_model gpt-4-0125-preview
supported_endpoint_types:
- chat
type: mtbench-cor1
target:
api_endpoint: {}
```
:::
::::
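The mtbench command template guards every judge flag with an `is not none` check, so judge settings left as `null` in the defaults are simply omitted from the command line. A minimal Python sketch of that pattern (the function name is illustrative; the flag names match the template above):

```python
def judge_flags(judge):
    """Emit each --judge_* flag only when the corresponding config
    value is set, mirroring the template's `is not none` guards."""
    mapping = [
        ("url", "--judge_url"),
        ("model_id", "--judge_model"),
        ("api_key_name", "--judge_api_key_name"),
        ("request_timeout", "--judge_request_timeout"),
        ("max_retries", "--judge_max_retries"),
        ("temperature", "--judge_temperature"),
        ("top_p", "--judge_top_p"),
        ("max_tokens", "--judge_max_tokens"),
    ]
    argv = []
    for key, flag in mapping:
        if judge.get(key) is not None:
            argv += [flag, str(judge[key])]
    return argv

# The defaults above leave url and api_key_name as null,
# so those two flags never reach the command line.
judge_defaults = {"url": None, "model_id": "gpt-4", "api_key_name": None,
                  "request_timeout": 60, "max_retries": 16,
                  "temperature": 0.0, "top_p": 0.0001, "max_tokens": 2048}
print(judge_flags(judge_defaults))
```

With `url` unset, the judge client falls back to whatever endpoint the evaluator resolves for the `gpt-4` judge model; override `extra.judge.url` to point the judge at a self-hosted endpoint.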
# mteb
This page contains all evaluation tasks for the **mteb** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [MMTEB](#mteb-mmteb)
- MMTEB
* - [MTEB](#mteb-mteb)
- MTEB
* - [MTEB_NL_RETRIEVAL](#mteb-mteb-nl-retrieval)
- MTEB_NL_RETRIEVAL
* - [MTEB_VDR](#mteb-mteb-vdr)
- MTEB Visual Document Retrieval benchmark
* - [RTEB](#mteb-rteb)
- RTEB
* - [ViDoReV1](#mteb-vidorev1)
- ViDoReV1
* - [ViDoReV2](#mteb-vidorev2)
- ViDoReV2
* - [ViDoReV3](#mteb-vidorev3)
- ViDoReV3
* - [ViDoReV3_Text](#mteb-vidorev3-text)
- ViDoReV3 Text (text_image markdown only)
* - [ViDoReV3_Text_Image](#mteb-vidorev3-text-image)
- ViDoReV3 Text+Image (text_image markdown + images)
* - [custom_beir_task](#mteb-custom-beir-task)
- Custom BEIR-formatted text retrieval benchmark
* - [fiqa](#mteb-fiqa)
- Financial Opinion Mining and Question Answering
* - [hotpotqa](#mteb-hotpotqa)
- HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
* - [miracl](#mteb-miracl)
- MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages.
* - [miracl_lite](#mteb-miracl-lite)
- MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages.
* - [mldr](#mteb-mldr)
- MLDR
* - [mlqa](#mteb-mlqa)
- MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
* - [nano_fiqa](#mteb-nano-fiqa)
- NanoFiQA2018 is a smaller subset of the Financial Opinion Mining and Question Answering dataset.
* - [nq](#mteb-nq)
- Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
```
(mteb-mmteb)=
## MMTEB
MMTEB
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `MMTEB`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
params:
max_retries: 10
parallelism: 20
task: MTEB(Multilingual, v2)
request_timeout: 300
extra:
query_prompt_template: null
document_prompt_template: null
ranker:
model_id: null
url: null
api_key: null
endpoint_type: nim
top_k: 40
truncate: END
batch_size: 128
eval_split: test
dataset_path: null
cache_path: null
args: null
version_lite: null
language: null
supported_endpoint_types:
- embedding
type: MMTEB
target:
api_endpoint: {}
```
:::
::::
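Note how the template above handles credentials: `target.api_endpoint.api_key_name` stores the *name* of an environment variable, and the rendered command exports `API_TOKEN` (or `RANKER_API_TOKEN`) from it, so secrets never appear in the config. A small sketch of that indirection, assuming a hypothetical variable name `MY_MTEB_KEY`:

```python
import os

def resolve_api_token(api_key_name, env=os.environ):
    """Mirror the template's `export API_TOKEN=${api_key_name}`:
    the config holds an env-var name, and the secret is read from
    the environment at run time (never stored in the config)."""
    if api_key_name is None:
        return None  # template skips the export entirely
    return env.get(api_key_name)

fake_env = {"MY_MTEB_KEY": "s3cret"}
assert resolve_api_token("MY_MTEB_KEY", fake_env) == "s3cret"
assert resolve_api_token(None, fake_env) is None
```

The same name-indirection applies to `extra.ranker.api_key`, which feeds `RANKER_API_TOKEN` when a reranker is configured.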
---
(mteb-mteb)=
## MTEB
MTEB
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `MTEB`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: MTEB(eng, v2)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: MTEB
target:
  api_endpoint: {}
```
:::
::::
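The Jinja guards in the command template above all follow one pattern: a flag is emitted only when its config value is non-null. A minimal Python sketch of that assembly logic, covering just a few of the flags (function and argument names are illustrative, not part of the NeMo Evaluator API):

```python
def build_mteb_command(model_id, url, task, cache_path=None, language=None):
    """Assemble an mteb CLI call, omitting flags whose value is None.

    Mirrors the `{% if ... is not none %}` guards in the command template;
    only a fragment of the real flag set is reproduced here.
    """
    parts = [
        f"mteb --encoder_name {model_id}",
        f"--encoder_url {url}",
        f'--task "{task}"',
    ]
    if cache_path is not None:
        parts.append(f"--cache_path {cache_path}")
    if language is not None:
        parts.append(f"--langs {language}")
    return " ".join(parts)

# Null defaults drop their flags entirely:
print(build_mteb_command("my-embedder", "http://localhost:8080/v1", "MTEB(eng, v2)"))
```

Overriding a null default (for example `language`) is what causes the corresponding flag (`--langs`) to appear in the rendered command.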
---
(mteb-mteb-nl-retrieval)=
## MTEB_NL_RETRIEVAL
MTEB Dutch retrieval benchmark (`MTEB(nld, v1, retrieval)`)
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `MTEB_NL_RETRIEVAL`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: MTEB(nld, v1, retrieval)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: MTEB_NL_RETRIEVAL
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-mteb-vdr)=
## MTEB_VDR
MTEB Visual Document Retrieval benchmark
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `MTEB_VDR`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: VisualDocumentRetrieval
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: MTEB_VDR
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-rteb)=
## RTEB
RTEB (Retrieval Embedding Benchmark), beta
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `RTEB`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: RTEB(beta)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: RTEB
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-vidorev1)=
## ViDoReV1
ViDoRe (Visual Document Retrieval) benchmark, version 1
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `ViDoReV1`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: ViDoRe(v1)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: ViDoReV1
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-vidorev2)=
## ViDoReV2
ViDoRe (Visual Document Retrieval) benchmark, version 2
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `ViDoReV2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: ViDoRe(v2)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: ViDoReV2
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-vidorev3)=
## ViDoReV3
ViDoRe (Visual Document Retrieval) benchmark, version 3
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `ViDoReV3`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: ViDoRe(v3)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: ViDoReV3
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-vidorev3-text)=
## ViDoReV3_Text
ViDoRe v3, text-only variant (markdown from the `text_image` subset, without images)
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `ViDoReV3_Text`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: ViDoRe(v3, Text)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: ViDoReV3_Text
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-vidorev3-text-image)=
## ViDoReV3_Text_Image
ViDoRe v3, text-plus-image variant (markdown from the `text_image` subset plus images)
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `ViDoReV3_Text_Image`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: ViDoRe(v3, Text+Image)
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: ViDoReV3_Text_Image
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-custom-beir-task)=
## custom_beir_task
Custom BEIR-formatted text retrieval benchmark
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `custom_beir_task`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: custom_beir_task
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: custom_beir_task
target:
  api_endpoint: {}
```
:::
::::
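To point `custom_beir_task` at local data, override the null `dataset_path` default in a user config; the command template then exports it as `MTEB_INTERNAL_DATASET_PATH` before invoking `mteb`. A minimal sketch, assuming a BEIR-formatted dataset mounted at a hypothetical path:

```yaml
config:
  params:
    extra:
      # Assumed local mount; must contain a BEIR-formatted dataset
      dataset_path: /datasets/my_beir_corpus
      eval_split: test
```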
---
(mteb-fiqa)=
## fiqa
Financial Opinion Mining and Question Answering
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `fiqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: FiQA2018
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: fiqa
target:
  api_endpoint: {}
```
:::
::::
---
(mteb-hotpotqa)=
## hotpotqa
HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `hotpotqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
  params:
    max_retries: 10
    parallelism: 20
    task: HotpotQA
    request_timeout: 300
    extra:
      query_prompt_template: null
      document_prompt_template: null
      ranker:
        model_id: null
        url: null
        api_key: null
        endpoint_type: nim
      top_k: 40
      truncate: END
      batch_size: 128
      eval_split: test
      dataset_path: null
      cache_path: null
      args: null
      version_lite: null
      language: null
  supported_endpoint_types:
  - embedding
  type: hotpotqa
target:
  api_endpoint: {}
```
:::
::::
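The `ranker` block in every Defaults tab ships all-null, which disables the optional reranking stage. Setting `model_id` makes the command template append `--ranker_name`, `--ranker_url`, and `--ranker_endpoint_type`; setting `api_key` additionally exports `RANKER_API_TOKEN` from the named environment variable. A sketch with assumed endpoint values:

```yaml
config:
  params:
    extra:
      ranker:
        model_id: my-reranker                  # assumed model name
        url: http://localhost:8000/v1/ranking  # assumed reranker endpoint
        api_key: RANKER_API_KEY                # env var holding the token
        endpoint_type: nim
```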
---
(mteb-miracl)=
## miracl
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages.
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `miracl`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
params:
max_retries: 10
parallelism: 20
task: MIRACLRetrieval
request_timeout: 300
extra:
query_prompt_template: null
document_prompt_template: null
ranker:
model_id: null
url: null
api_key: null
endpoint_type: nim
top_k: 40
truncate: END
batch_size: 128
eval_split: test
dataset_path: null
cache_path: null
args: null
version_lite: null
language: null
supported_endpoint_types:
- embedding
type: miracl
target:
api_endpoint: {}
```
:::
::::
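Per the command template, the `--ranker_name`/`--ranker_url`/`--ranker_endpoint_type` flags are emitted only when `config.params.extra.ranker.model_id` is set. A sketch of enabling a reranking stage (the model name, URL, and env-var name below are placeholders, not defaults):

```yaml
config:
  params:
    extra:
      ranker:
        model_id: my-reranker                  # placeholder reranker model
        url: http://localhost:9000/v1/ranking  # placeholder endpoint
        api_key: MY_RANKER_KEY                 # env var name; exported as RANKER_API_TOKEN
        endpoint_type: nim
```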
---
(mteb-miracl-lite)=
## miracl_lite
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages. The `miracl_lite` task runs a reduced "lite" version of the benchmark (`version_lite: true` in the defaults).
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `miracl_lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
params:
max_retries: 10
parallelism: 20
task: MIRACLRetrieval
request_timeout: 300
extra:
query_prompt_template: null
document_prompt_template: null
ranker:
model_id: null
url: null
api_key: null
endpoint_type: nim
top_k: 40
truncate: END
batch_size: 128
eval_split: test
dataset_path: null
cache_path: null
args: null
version_lite: true
language: null
supported_endpoint_types:
- embedding
type: miracl_lite
target:
api_endpoint: {}
```
:::
::::
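MIRACL covers 18 languages, and the template forwards `config.params.extra.language` to mteb's `--langs` flag when it is set. A sketch restricting the lite run to a single language (the value format follows the mteb harness CLI; `bn` is illustrative):

```yaml
config:
  params:
    extra:
      language: bn   # illustrative; evaluate only the Bengali subset
```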
---
(mteb-mldr)=
## mldr
MLDR (Multi-Long Doc Retrieval) is a multilingual retrieval dataset for evaluating search over long documents across 13 languages.
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `mldr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
params:
max_retries: 10
parallelism: 20
task: MultiLongDocRetrieval
request_timeout: 300
extra:
query_prompt_template: null
document_prompt_template: null
ranker:
model_id: null
url: null
api_key: null
endpoint_type: nim
top_k: 40
truncate: END
batch_size: 128
eval_split: test
dataset_path: null
cache_path: null
args: null
version_lite: null
language: null
supported_endpoint_types:
- embedding
type: mldr
target:
api_endpoint: {}
```
:::
::::
---
(mteb-mlqa)=
## mlqa
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `mlqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
params:
max_retries: 10
parallelism: 20
task: MLQARetrieval
request_timeout: 300
extra:
query_prompt_template: null
document_prompt_template: null
ranker:
model_id: null
url: null
api_key: null
endpoint_type: nim
top_k: 40
truncate: END
batch_size: 128
eval_split: test
dataset_path: null
cache_path: null
args: null
version_lite: null
language: null
supported_endpoint_types:
- embedding
type: mlqa
target:
api_endpoint: {}
```
:::
::::
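When `dataset_path` is set, the template exports it as `MTEB_INTERNAL_DATASET_PATH` before invoking `mteb`, which supports offline runs with pre-downloaded data; `cache_path` similarly adds `--cache_path` for reusing computed embeddings. A sketch (both paths are placeholders):

```yaml
config:
  params:
    extra:
      dataset_path: /data/mteb/mlqa   # placeholder; local copy of the dataset
      cache_path: /data/mteb/cache    # placeholder; embedding cache location
```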
---
(mteb-nano-fiqa)=
## nano_fiqa
NanoFiQA2018 is a smaller subset of FiQA 2018, the Financial Opinion Mining and Question Answering dataset.
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `nano_fiqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
params:
max_retries: 10
parallelism: 20
task: NanoFiQA2018Retrieval
request_timeout: 300
extra:
query_prompt_template: null
document_prompt_template: null
ranker:
model_id: null
url: null
api_key: null
endpoint_type: nim
top_k: 40
truncate: END
batch_size: 128
eval_split: train
dataset_path: null
cache_path: null
args: null
version_lite: null
language: null
supported_endpoint_types:
- embedding
type: nano_fiqa
target:
api_endpoint: {}
```
:::
::::
---
(mteb-nq)=
## nq
Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
::::{tab-set}
:::{tab-item} Container
**Harness:** `mteb`
**Container:**
```
nvcr.io/nvidia/eval-factory/mteb:26.01
```
**Container Digest:**
```
sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb
```
**Container Arch:** `multiarch`
**Task Type:** `nq`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: mteb
pkg_name: mteb
config:
params:
max_retries: 10
parallelism: 20
task: NQ
request_timeout: 300
extra:
query_prompt_template: null
document_prompt_template: null
ranker:
model_id: null
url: null
api_key: null
endpoint_type: nim
top_k: 40
truncate: END
batch_size: 128
eval_split: test
dataset_path: null
cache_path: null
args: null
version_lite: null
language: null
supported_endpoint_types:
- embedding
type: nq
target:
api_endpoint: {}
```
:::
::::
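Some embedding models expect instruction prefixes on queries and documents; when set, `query_prompt_template` and `document_prompt_template` are passed through to `mteb` as quoted CLI arguments. A sketch (the prefix wording and template format below are assumptions; consult the mteb harness documentation for the exact syntax it expects):

```yaml
config:
  params:
    extra:
      query_prompt_template: "query: "      # assumed prefix-style template
      document_prompt_template: "passage: " # assumed prefix-style template
```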
# nemo_skills
This page contains all evaluation tasks for the **nemo_skills** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [ns_aa_lcr](#nemo-skills-ns-aa-lcr)
- AA-LCR
* - [ns_aime2024](#nemo-skills-ns-aime2024)
- AIME2024
* - [ns_aime2025](#nemo-skills-ns-aime2025)
- AIME2025
* - [ns_bfcl_v3](#nemo-skills-ns-bfcl-v3)
- BFCLv3
* - [ns_bfcl_v4](#nemo-skills-ns-bfcl-v4)
- BFCLv4
* - [ns_gpqa](#nemo-skills-ns-gpqa)
- GPQA Diamond
* - [ns_hle](#nemo-skills-ns-hle)
  - Humanity's Last Exam
* - [ns_hle_aa](#nemo-skills-ns-hle-aa)
  - Humanity's Last Exam aligned with AA
* - [ns_hmmt_feb2025](#nemo-skills-ns-hmmt-feb2025)
- HMMT February 2025 (MathArena/hmmt_feb_2025)
* - [ns_ifbench](#nemo-skills-ns-ifbench)
- IFBench - Instruction Following Benchmark
* - [ns_ifeval](#nemo-skills-ns-ifeval)
- IFEval - Instruction-Following Evaluation for Large Language Models
* - [ns_livecodebench](#nemo-skills-ns-livecodebench)
- LiveCodeBench v6
* - [ns_livecodebench_aa](#nemo-skills-ns-livecodebench-aa)
- LiveCodeBench with AA custom prompt format (315 problems from July 2024 to Dec 2024, release_v5)
* - [ns_livecodebench_v5](#nemo-skills-ns-livecodebench-v5)
- LiveCodeBench v5
* - [ns_mmlu](#nemo-skills-ns-mmlu)
- MMLU
* - [ns_mmlu_pro](#nemo-skills-ns-mmlu-pro)
- MMLU-PRO
* - [ns_mmlu_prox](#nemo-skills-ns-mmlu-prox)
- MMLU-ProX
* - [ns_ruler](#nemo-skills-ns-ruler)
- RULER - Long Context Understanding
* - [ns_scicode](#nemo-skills-ns-scicode)
- SciCode
* - [ns_wmt24pp](#nemo-skills-ns-wmt24pp)
- WMT24++
```
(nemo-skills-ns-aa-lcr)=
## ns_aa_lcr
AA-LCR (Artificial Analysis Long Context Reasoning), a benchmark for evaluating reasoning over long-context inputs.
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_aa_lcr`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
params:
parallelism: 16
task: aalcr
extra:
use_sandbox: false
num_repeats: null
prompt_config: null
args: null
system_message: null
dataset_split: null
judge_support: true
judge:
url: null
model_id: null
api_key: null
generation_type: null
random_seed: 1234
temperature: 0.0
top_p: 1.0
max_new_tokens: 4096
args: null
parallelism: null
ruler:
data_dir: null
cluster: null
setup: null
max_seq_length: null
tokenizer_path: null
template_tokens: null
num_samples: null
tasks: null
supported_endpoint_types:
- chat
type: ns_aa_lcr
target: {}
```
:::
::::
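`ns_aa_lcr` sets `judge_support: true`, and the command template only adds the `--judge_*` flags when `config.params.extra.judge.url` is set. A sketch of pointing the judge at an OpenAI-compatible endpoint (the URL, model name, and env-var name are placeholders):

```yaml
config:
  params:
    extra:
      judge:
        url: http://localhost:8000/v1  # placeholder judge endpoint
        model_id: my-judge-model       # placeholder judge model
        api_key: JUDGE_API_KEY         # name of the env var holding the judge key
```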
---
(nemo-skills-ns-aime2024)=
## ns_aime2024
AIME2024 (American Invitational Mathematics Examination 2024), a competition-mathematics benchmark.
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_aime2024`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
params:
parallelism: 16
task: aime24
extra:
use_sandbox: false
num_repeats: null
prompt_config: null
args: null
system_message: null
dataset_split: null
judge_support: true
judge:
url: null
model_id: null
api_key: null
generation_type: math_judge
random_seed: 1234
temperature: null
top_p: null
max_new_tokens: null
args: null
parallelism: null
ruler:
data_dir: null
cluster: null
setup: null
max_seq_length: null
tokenizer_path: null
template_tokens: null
num_samples: null
tasks: null
supported_endpoint_types:
- chat
type: ns_aime2024
target: {}
```
:::
::::
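AIME 2024 contains only 30 problems, so scores are commonly averaged over repeated runs. Per the command template, a `num_repeats` greater than 1 is appended to `--benchmarks` (e.g. `aime24:8`), and `parallelism` is divided by the repeat count when computing `++max_concurrent_requests`. A sketch:

```yaml
config:
  params:
    parallelism: 64
    extra:
      num_repeats: 8   # renders --benchmarks=aime24:8 and ++max_concurrent_requests=8
```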
---
(nemo-skills-ns-aime2025)=
## ns_aime2025
AIME2025 (American Invitational Mathematics Examination 2025), a competition-mathematics benchmark.
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_aime2025`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: aime25
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: true
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: math_judge
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_aime2025
target: {}
```
:::
::::
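The `++max_concurrent_requests` expression in the command template above divides the `parallelism` budget across repeats, flooring at one request. A minimal sketch of that arithmetic in plain shell (the variable values here are illustrative, not defaults):

```shell
# Mirror of the Jinja expression:
#   max([int(parallelism / num_repeats), 1]) when num_repeats > 1,
#   else int(parallelism).
parallelism=16
num_repeats=4

if [ "$num_repeats" -gt 1 ]; then
  # Split the request budget across repeats, never dropping below 1.
  mcr=$(( parallelism / num_repeats ))
  if [ "$mcr" -lt 1 ]; then mcr=1; fi
else
  mcr="$parallelism"
fi

echo "max_concurrent_requests=$mcr"
```

With 32 repeats and parallelism 16, integer division yields 0, so the floor kicks in and each repeat still gets one concurrent request.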
---
(nemo-skills-ns-bfcl-v3)=
## ns_bfcl_v3
BFCL v3 (Berkeley Function Calling Leaderboard v3)
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_bfcl_v3`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: bfcl_v3
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: ++use_client_parsing=False
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_bfcl_v3
target: {}
```
:::
::::
---
(nemo-skills-ns-bfcl-v4)=
## ns_bfcl_v4
BFCL v4 (Berkeley Function Calling Leaderboard v4)
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_bfcl_v4`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: bfcl_v4
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: ++use_client_parsing=False
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_bfcl_v4
target: {}
```
:::
::::
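When `use_sandbox` is true, the command template above backgrounds a code-execution sandbox server, records its PID, runs the evaluation, then kills the sandbox while preserving the evaluation's exit code. A minimal sketch of that lifecycle pattern (`sleep` stands in for the sandbox server and `true` for `ns eval`; the final `exit "$EXIT_CODE"` from the real command is shown as a comment):

```shell
# Start the helper in the background and remember its PID.
sleep 30 > /dev/null 2>&1 & SANDBOX_PID=$!

# Run the main job; capture its exit code immediately afterwards.
true ; EXIT_CODE=$?

# Always tear the helper down, even if it already died.
kill "$SANDBOX_PID" 2>/dev/null || true

echo "eval_exit_code=$EXIT_CODE"
# The real command then runs: exit "$EXIT_CODE"
```

Capturing `$?` before the `kill` is what lets the job's success or failure propagate out of the container rather than being masked by the cleanup step.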
---
(nemo-skills-ns-gpqa)=
## ns_gpqa
GPQA Diamond
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_gpqa`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: gpqa
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_gpqa
target: {}
```
:::
::::
---
(nemo-skills-ns-hle)=
## ns_hle
Humanity's Last Exam (HLE)
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_hle`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: hle
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_hle
target: {}
```
:::
::::
---
(nemo-skills-ns-hle-aa)=
## ns_hle_aa
Humanity's Last Exam (HLE), aligned with AA
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_hle_aa`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: hle
    extra:
      use_sandbox: false
      num_repeats: 1
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: true
      judge:
        url: https://inference-api.nvidia.com/v1
        model_id: us/azure/openai/gpt-4.1
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_hle_aa
target: {}
```
:::
::::
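In the defaults above, `ns_hle_aa` ships with a judge endpoint preconfigured, while tasks such as `ns_aime2025` set `judge_support: true` but leave `judge.url` null, so the judge branch of the command stays inactive until an endpoint is supplied. A hypothetical override fragment (key paths match the defaults; the endpoint, model id, and env-var name are placeholders — note that `api_key` names the environment variable holding the key, per `++server.api_key_env_var` in the command template):

```yaml
config:
  params:
    extra:
      judge:
        url: https://your-judge-endpoint/v1   # placeholder endpoint
        model_id: your-org/your-judge-model   # placeholder model id
        api_key: JUDGE_API_KEY                # env var that holds the key
```

How these overrides are passed (launcher run config, `-o` overrides, or a Python `EvaluationConfig`) depends on which quickstart workflow you follow.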
---
(nemo-skills-ns-hmmt-feb2025)=
## ns_hmmt_feb2025
HMMT February 2025 (MathArena/hmmt_feb_2025)
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_hmmt_feb2025`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: hmmt_feb25
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: true
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: math_judge
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_hmmt_feb2025
target: {}
```
:::
::::
---
(nemo-skills-ns-ifbench)=
## ns_ifbench
IFBench - Instruction Following Benchmark
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_ifbench`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: ifbench
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_ifbench
target: {}
```
:::
::::
---
(nemo-skills-ns-ifeval)=
## ns_ifeval
IFEval - Instruction-Following Evaluation for Large Language Models
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_ifeval`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: ifeval
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_ifeval
target: {}
```
:::
::::
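The Defaults tab shows the full parameter tree; at run time you typically override only a few leaves (for example `limit_samples` or `temperature`) and every other value keeps its default. A minimal sketch of that overlay semantics, assuming a plain nested-dict config (the `merge` helper is illustrative, not launcher API):

```python
from copy import deepcopy

def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user overrides onto the default config tree."""
    out = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

defaults = {"params": {"parallelism": 16, "task": "ifeval",
                       "extra": {"use_sandbox": False, "num_repeats": None}}}
run_cfg = merge(defaults, {"params": {"extra": {"num_repeats": 4}}})
print(run_cfg["params"]["parallelism"])           # 16 -- untouched default
print(run_cfg["params"]["extra"]["num_repeats"])  # 4  -- overridden leaf
```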
---
(nemo-skills-ns-livecodebench)=
## ns_livecodebench
LiveCodeBench v6
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_livecodebench`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: livecodebench
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: test_v6_2408_2505
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_livecodebench
target: {}
```
:::
::::
---
(nemo-skills-ns-livecodebench-aa)=
## ns_livecodebench_aa
LiveCodeBench with AA custom prompt format (315 problems from July 2024 to Dec 2024, release_v5)
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_livecodebench_aa`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: livecodebench
    extra:
      use_sandbox: false
      num_repeats: 3
      prompt_config: /nemo_run/code/eval_factory_prompts/livecodebench-aa.yaml
      args: null
      system_message: null
      dataset_split: test_v5_2407_2412
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_livecodebench_aa
target: {}
```
:::
::::
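For `ruler*` tasks the command template composes an `ns prepare_data ruler` invocation from the `extra.ruler` block, guarding the optional flags with `{% if ... is not none %}` checks. A simplified sketch of that None-skipping flag assembly (here every flag is guarded for brevity; in the actual template some flags are unconditional, and the helper name is mine):

```python
def ruler_flags(cfg: dict) -> list[str]:
    """Emit --key=value flags for ruler settings, skipping unset (None)
    values, in the spirit of the template's `is not none` guards."""
    flags = []
    for key in ("data_dir", "cluster", "setup", "max_seq_length",
                "tokenizer_path", "template_tokens", "num_samples"):
        if cfg.get(key) is not None:
            flags.append(f"--{key}={cfg[key]}")
    return flags

cfg = {"data_dir": "/data/ruler", "max_seq_length": 131072,
       "cluster": None, "setup": None}
print(ruler_flags(cfg))  # ['--data_dir=/data/ruler', '--max_seq_length=131072']
```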
---
(nemo-skills-ns-livecodebench-v5)=
## ns_livecodebench_v5
LiveCodeBench v5
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_livecodebench_v5`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: livecodebench
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: test_v5_2407_2412
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_livecodebench_v5
target: {}
```
:::
::::
---
(nemo-skills-ns-mmlu)=
## ns_mmlu
MMLU
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_mmlu`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: mmlu
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_mmlu
target: {}
```
:::
::::
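When `extra.num_repeats` is set above 1, the command template divides `parallelism` by `num_repeats` (integer division, clamped to at least 1) to derive `++max_concurrent_requests`, so total concurrency stays roughly constant across repeats. A minimal shell sketch of that arithmetic, with illustrative values:

```shell
# Per-repeat concurrency as derived in the command template:
# max(int(parallelism / num_repeats), 1)
parallelism=16
num_repeats=5

conc=$(( parallelism / num_repeats ))   # integer division: 3
if [ "$conc" -lt 1 ]; then conc=1; fi   # clamp to a minimum of 1

echo "max_concurrent_requests=$conc"
```

With `parallelism: 16` and five repeats this yields 3 concurrent requests per repeat; if `num_repeats` exceeds `parallelism`, the clamp keeps the value at 1.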
---
(nemo-skills-ns-mmlu-pro)=
## ns_mmlu_pro
MMLU-Pro
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_mmlu_pro`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: mmlu-pro
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_mmlu_pro
target: {}
```
:::
::::
---
(nemo-skills-ns-mmlu-prox)=
## ns_mmlu_prox
MMLU-ProX
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_mmlu_prox`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: mmlu-prox
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_mmlu_prox
target: {}
```
:::
::::
---
(nemo-skills-ns-ruler)=
## ns_ruler
RULER - Long Context Understanding
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_ruler`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: ruler.evaluation_128k
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: /workspace/ruler_data
        cluster: local
        setup: evaluation_128k
        max_seq_length: 131072
        tokenizer_path: null
        template_tokens: 50
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - completions
  type: ns_ruler
target: {}
```
:::
::::
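Unlike the other `nemo_skills` tasks, RULER prepares its own data via `ns prepare_data ruler`, so runs typically override the `extra.ruler` block. An illustrative override fragment (keys mirror the defaults above; the `tokenizer_path` value is a placeholder you would replace with your model's tokenizer):

```yaml
config:
  params:
    task: ruler.evaluation_128k
    extra:
      ruler:
        data_dir: /workspace/ruler_data
        setup: evaluation_128k
        max_seq_length: 131072
        tokenizer_path: /models/my-model/tokenizer  # placeholder path
```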
---
(nemo-skills-ns-scicode)=
## ns_scicode
SciCode
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_scicode`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: scicode
    extra:
      use_sandbox: true
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_scicode
target: {}
```
:::
::::
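Because `use_sandbox` defaults to `true` here, the command first launches the local sandbox server in the background, records its PID, and kills it after `ns eval` finishes while preserving the evaluation's exit code. That lifecycle pattern, sketched with a stand-in helper process and a deliberately failing main job:

```shell
# Background-helper lifecycle as in the ns_scicode command: start a
# helper, run the main job, then kill the helper and keep the job's
# exit code. `sleep 30` stands in for the sandbox server.
sleep 30 > /dev/null 2>&1 &
HELPER_PID=$!

false                                   # stand-in main job that fails
EXIT_CODE=$?                            # capture before any cleanup

kill "$HELPER_PID" 2>/dev/null || true  # cleanup must not mask failure
echo "exit_code=$EXIT_CODE"
```

The `|| true` after `kill` mirrors the template: cleanup failures must not overwrite the evaluation's own exit status.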
---
(nemo-skills-ns-wmt24pp)=
## ns_wmt24pp
WMT24++
::::{tab-set}
:::{tab-item} Container
**Harness:** `nemo_skills`
**Container:**
```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```
**Container Digest:**
```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```
**Container Arch:** `multiarch`
**Task Type:** `ns_wmt24pp`
:::
:::{tab-item} Command
```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if 
config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} ++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} 
++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif %} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: wmt24pp
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
  - chat
  type: ns_wmt24pp
target: {}
```
:::
::::
# profbench
This page contains all evaluation tasks for the **profbench** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [llm_judge](#profbench-llm-judge)
- Run LLM judge on provided ProfBench reports and score them
* - [report_generation](#profbench-report-generation)
- Generate professional reports and evaluate them (full pipeline)
```
(profbench-llm-judge)=
## llm_judge
Run LLM judge on provided ProfBench reports and score them
::::{tab-set}
:::{tab-item} Container
**Harness:** `profbench`
**Container:**
```
nvcr.io/nvidia/eval-factory/profbench:26.01
```
**Container Digest:**
```
sha256:7b2766affe4c2070ec803a893f7bf1ff2fc735df562aa520ec910c9ef58d3598
```
**Container Arch:** `multiarch`
**Task Type:** `llm_judge`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} {% if config.params.extra.run_generation %}
python -m profbench.run_report_generation \
--model {{target.api_endpoint.model_id}} \
--library {{config.params.extra.library}} \
--timeout {{config.params.request_timeout}} \
--parallel {{config.params.parallelism}} \
--retry-attempts {{config.params.max_retries}} \
--folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.version is not none %} --version {{config.params.extra.version}}{% endif %}{% if config.params.extra.web_search %} --web-search{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} &&
GENERATION_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) &&
{% endif %} {% if config.params.extra.run_judge_generated %}
python -m profbench.run_best_llm_judge_on_generated_reports \
--filename $GENERATION_OUTPUT \
--api-key $API_KEY \
--model {{target.api_endpoint.model_id}} \
--library {{config.params.extra.library}} \
--timeout {{config.params.request_timeout}} \
--parallel {{config.params.parallelism}} \
--retry-attempts {{config.params.max_retries}} \
--output-folder {{config.output_dir}}/judgements{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} &&
JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/judgements/*.jsonl | head -1) &&
python -m profbench.score_report_generation $JUDGE_OUTPUT
{% endif %} {% if config.params.extra.run_judge_provided %}
python -m profbench.run_llm_judge_on_provided_reports \
--model {{target.api_endpoint.model_id}} \
--library {{config.params.extra.library}} \
--timeout {{config.params.request_timeout}} \
--parallel {{config.params.parallelism}} \
--retry-attempts {{config.params.max_retries}} \
--folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.extra.debug %} --debug{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} &&
JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) &&
python -m profbench.score_llm_judge $JUDGE_OUTPUT
{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: profbench
pkg_name: profbench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
temperature: 0.0
request_timeout: 600
top_p: 1.0e-05
extra:
run_generation: false
run_judge_generated: false
run_judge_provided: true
library: openai
version: lite
web_search: false
reasoning: false
reasoning_effort: null
debug: false
supported_endpoint_types:
- chat
type: llm_judge
target:
api_endpoint: {}
```
:::
::::
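The Command tab above is a Jinja2 template: every `{% if ... is not none %}` guard means an optional parameter is appended to the command line only when it is set. As a rough illustration of that flag logic (not the SDK's actual rendering code; the model ID and parameter values below are hypothetical), the same pattern can be sketched in plain Python:

```python
# Sketch of the optional-flag logic the Jinja template above encodes:
# a flag is appended only when its parameter is set. All values here
# are hypothetical stand-ins for the resolved configuration.
params = {
    "request_timeout": 600,
    "parallelism": 10,
    "limit_samples": None,   # unset -> flag omitted
    "temperature": 0.0,      # set   -> flag included
}

cmd = [
    "python", "-m", "profbench.run_llm_judge_on_provided_reports",
    "--model", "meta/llama-3.1-8b-instruct",
    "--timeout", str(params["request_timeout"]),
    "--parallel", str(params["parallelism"]),
]

# Mirrors `{% if config.params.limit_samples is not none %} --limit-samples ...{% endif %}`.
if params["limit_samples"] is not None:
    cmd += ["--limit-samples", str(params["limit_samples"])]
# Mirrors `{% if config.params.temperature is not none %} --temperature ...{% endif %}`.
if params["temperature"] is not None:
    cmd += ["--temperature", str(params["temperature"])]

print(" ".join(cmd))
```

Because `limit_samples` is `None` in this sketch, `--limit-samples` never reaches the command, while `--temperature 0.0` does.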
---
(profbench-report-generation)=
## report_generation
Generate professional reports and evaluate them (full pipeline)
::::{tab-set}
:::{tab-item} Container
**Harness:** `profbench`
**Container:**
```
nvcr.io/nvidia/eval-factory/profbench:26.01
```
**Container Digest:**
```
sha256:7b2766affe4c2070ec803a893f7bf1ff2fc735df562aa520ec910c9ef58d3598
```
**Container Arch:** `multiarch`
**Task Type:** `report_generation`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}
export API_KEY=${{target.api_endpoint.api_key_name}} &&
{% endif %} {% if config.params.extra.run_generation %}
python -m profbench.run_report_generation \
--model {{target.api_endpoint.model_id}} \
--library {{config.params.extra.library}} \
--timeout {{config.params.request_timeout}} \
--parallel {{config.params.parallelism}} \
--retry-attempts {{config.params.max_retries}} \
--folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.version is not none %} --version {{config.params.extra.version}}{% endif %}{% if config.params.extra.web_search %} --web-search{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} &&
GENERATION_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) &&
{% endif %} {% if config.params.extra.run_judge_generated %}
python -m profbench.run_best_llm_judge_on_generated_reports \
--filename $GENERATION_OUTPUT \
--api-key $API_KEY \
--model {{target.api_endpoint.model_id}} \
--library {{config.params.extra.library}} \
--timeout {{config.params.request_timeout}} \
--parallel {{config.params.parallelism}} \
--retry-attempts {{config.params.max_retries}} \
--output-folder {{config.output_dir}}/judgements{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} &&
JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/judgements/*.jsonl | head -1) &&
python -m profbench.score_report_generation $JUDGE_OUTPUT
{% endif %} {% if config.params.extra.run_judge_provided %}
python -m profbench.run_llm_judge_on_provided_reports \
--model {{target.api_endpoint.model_id}} \
--library {{config.params.extra.library}} \
--timeout {{config.params.request_timeout}} \
--parallel {{config.params.parallelism}} \
--retry-attempts {{config.params.max_retries}} \
--folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.extra.debug %} --debug{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} &&
JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) &&
python -m profbench.score_llm_judge $JUDGE_OUTPUT
{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: profbench
pkg_name: profbench
config:
params:
max_new_tokens: 4096
max_retries: 5
parallelism: 10
temperature: 0.0
request_timeout: 600
top_p: 1.0e-05
extra:
run_generation: true
run_judge_generated: true
run_judge_provided: false
library: openai
version: lite
web_search: false
reasoning: false
reasoning_effort: null
debug: false
supported_endpoint_types:
- chat
type: report_generation
target:
api_endpoint: {}
```
:::
::::
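Between pipeline stages, the template hands the newest JSONL from the previous step to the next one with `ls -t ... | head -1` (for example, `GENERATION_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1)`). A standalone sketch of that pattern, using a hypothetical scratch directory rather than a real output folder:

```shell
# Create two result files, the second one newer.
mkdir -p /tmp/profbench_demo && cd /tmp/profbench_demo
echo '{}' > run_a.jsonl
sleep 1
echo '{}' > run_b.jsonl

# Same pattern as the GENERATION_OUTPUT/JUDGE_OUTPUT captures above:
# `ls -t` sorts newest-first, `head -1` keeps the most recent file.
LATEST=$(ls -t *.jsonl | head -1)
echo "$LATEST"   # prints run_b.jsonl
```

This relies on file modification times, so it picks up whatever the previous stage wrote last in the shared output directory.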
# ruler
This page contains all evaluation tasks for the **ruler** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [ruler-128k-chat](#ruler-ruler-128k-chat)
- RULER with context length of 128k (chat mode)
* - [ruler-128k-completions](#ruler-ruler-128k-completions)
- RULER with context length of 128k (completions mode)
* - [ruler-16k-chat](#ruler-ruler-16k-chat)
- RULER with context length of 16k (chat mode)
* - [ruler-16k-completions](#ruler-ruler-16k-completions)
- RULER with context length of 16k (completions mode)
* - [ruler-1m-chat](#ruler-ruler-1m-chat)
- RULER with context length of 1M (chat mode)
* - [ruler-1m-completions](#ruler-ruler-1m-completions)
- RULER with context length of 1M (completions mode)
* - [ruler-256k-chat](#ruler-ruler-256k-chat)
- RULER with context length of 256k (chat mode)
* - [ruler-256k-completions](#ruler-ruler-256k-completions)
- RULER with context length of 256k (completions mode)
* - [ruler-32k-chat](#ruler-ruler-32k-chat)
- RULER with context length of 32k (chat mode)
* - [ruler-32k-completions](#ruler-ruler-32k-completions)
- RULER with context length of 32k (completions mode)
* - [ruler-4k-chat](#ruler-ruler-4k-chat)
- RULER with context length of 4k (chat mode)
* - [ruler-4k-completions](#ruler-ruler-4k-completions)
- RULER with context length of 4k (completions mode)
* - [ruler-512k-chat](#ruler-ruler-512k-chat)
- RULER with context length of 512k (chat mode)
* - [ruler-512k-completions](#ruler-ruler-512k-completions)
- RULER with context length of 512k (completions mode)
* - [ruler-64k-chat](#ruler-ruler-64k-chat)
- RULER with context length of 64k (chat mode)
* - [ruler-64k-completions](#ruler-ruler-64k-completions)
- RULER with context length of 64k (completions mode)
* - [ruler-8k-chat](#ruler-ruler-8k-chat)
- RULER with context length of 8k (chat mode)
* - [ruler-8k-completions](#ruler-ruler-8k-completions)
- RULER with context length of 8k (completions mode)
* - [ruler-chat](#ruler-ruler-chat)
  - RULER (chat mode) without a preset context length. You must explicitly set the `max_seq_length` parameter.
* - [ruler-completions](#ruler-ruler-completions)
  - RULER (completions mode) without a preset context length. You must explicitly set the `max_seq_length` parameter.
```
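For `ruler-chat` and `ruler-completions`, `max_seq_length` has no default, so a run fails unless you supply it. A hypothetical override fragment, following the parameter layout shown in each task's Defaults tab (the enclosing file structure and how overrides are passed depend on your launcher setup; the value `64000` is only an example):

```yaml
# Hypothetical override for ruler-chat / ruler-completions;
# field names mirror the Defaults tabs below.
config:
  params:
    extra:
      max_seq_length: 64000
```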
(ruler-ruler-128k-chat)=
## ruler-128k-chat
RULER with context length of 128k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-128k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 128000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-128k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-128k-completions)=
## ruler-128k-completions
RULER with context length of 128k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-128k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 128000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-128k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-16k-chat)=
## ruler-16k-chat
RULER with context length of 16k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-16k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 16000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-16k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-16k-completions)=
## ruler-16k-completions
RULER with context length of 16k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-16k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 16000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-16k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-1m-chat)=
## ruler-1m-chat
RULER with context length of 1M (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-1m-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 1000000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-1m-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-1m-completions)=
## ruler-1m-completions
RULER with context length of 1M (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-1m-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 1000000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-1m-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-256k-chat)=
## ruler-256k-chat
RULER with context length of 256k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-256k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 256000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-256k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-256k-completions)=
## ruler-256k-completions
RULER with context length of 256k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-256k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 256000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-256k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-32k-chat)=
## ruler-32k-chat
RULER with context length of 32k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-32k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 32000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-32k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-32k-completions)=
## ruler-32k-completions
RULER with context length of 32k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-32k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 32000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-32k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-4k-chat)=
## ruler-4k-chat
RULER with context length of 4k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-4k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 4000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-4k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-4k-completions)=
## ruler-4k-completions
RULER with context length of 4k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-4k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 4000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-4k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-512k-chat)=
## ruler-512k-chat
RULER with context length of 512k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-512k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 512000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-512k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-512k-completions)=
## ruler-512k-completions
RULER with context length of 512k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-512k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 512000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-512k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-64k-chat)=
## ruler-64k-chat
RULER with context length of 64k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-64k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 64000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-64k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-64k-completions)=
## ruler-64k-completions
RULER with context length of 64k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-64k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 64000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-64k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-8k-chat)=
## ruler-8k-chat
RULER with context length of 8k (chat mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-8k-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 8000
subtasks: all
supported_endpoint_types:
- chat
type: ruler-8k-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-8k-completions)=
## ruler-8k-completions
RULER with context length of 8k (completions mode)
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-8k-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: 8000
subtasks: all
supported_endpoint_types:
- completions
type: ruler-8k-completions
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-chat)=
## ruler-chat
RULER (chat mode) without a preset context length. Users must explicitly set the `max_seq_length` parameter.
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-chat`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: null
subtasks: all
supported_endpoint_types:
- chat
type: ruler-chat
target:
api_endpoint: {}
```
:::
::::
---
(ruler-ruler-completions)=
## ruler-completions
RULER (completions mode) without a preset context length. Users must explicitly set the `max_seq_length` parameter.
::::{tab-set}
:::{tab-item} Container
**Harness:** `ruler`
**Container:**
```
nvcr.io/nvidia/eval-factory/long-context-eval:26.01
```
**Container Digest:**
```
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689
```
**Container Arch:** `multiarch`
**Task Type:** `ruler-completions`
:::
:::{tab-item} Command
```bash
python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: ruler
pkg_name: long_context_eval
config:
params:
parallelism: 1
temperature: 0.0
request_timeout: 300
top_p: 0.0001
extra:
tokenizer: null
tokenizer_backend: hf
max_seq_length: null
subtasks: all
supported_endpoint_types:
- completions
type: ruler-completions
target:
api_endpoint: {}
```
:::
::::
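Both `ruler-chat` and `ruler-completions` ship with `max_seq_length: null`, so a run configuration must override it. A sketch of such an override, mirroring the key layout of the Defaults tab (the `max_seq_length` and `tokenizer` values are placeholders, not recommendations):

```yaml
config:
  params:
    extra:
      max_seq_length: 16000                 # required: these tasks have no preset context length
      tokenizer: <tokenizer-path-or-hf-id>  # placeholder; point at your model's tokenizer
      tokenizer_backend: hf
```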
# safety_eval
This page contains all evaluation tasks for the **safety_eval** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [aegis_v2](#safety-eval-aegis-v2)
- Aegis V2 without evaluating reasoning traces. This version is used by the NeMo Safety Toolkit.
* - [aegis_v2_reasoning](#safety-eval-aegis-v2-reasoning)
- Aegis V2 with evaluation of reasoning traces.
* - [wildguard](#safety-eval-wildguard)
- Wildguard
```
(safety-eval-aegis-v2)=
## aegis_v2
Aegis V2 without evaluating reasoning traces. This version is used by the NeMo Safety Toolkit.
::::{tab-set}
:::{tab-item} Container
**Harness:** `safety_eval`
**Container:**
```
nvcr.io/nvidia/eval-factory/safety-harness:25.11
```
**Container Digest:**
```
sha256:08eeb3f5c3282522ca30da7d3ddc2cab1a48909be05ba561a0dae9a299c637f0
```
**Container Arch:** `multiarch`
**Task Type:** `aegis_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset {{config.params.extra.dataset}}{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: aegis_v2
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
evaluate_reasoning_traces: false
supported_endpoint_types:
- chat
- completions
type: aegis_v2
target:
api_endpoint:
stream: false
```
:::
::::
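The judge endpoint in the Defaults tab is `null` and must be supplied. Note that, per the Command tab, `judge.api_key` is expanded with `${...}`, so its value is the *name* of an environment variable holding the key, not the key itself. A sketch of the override with placeholder values:

```yaml
config:
  params:
    extra:
      judge:
        url: <judge-endpoint-url>   # placeholder
        model_id: <judge-model-id>  # placeholder
        api_key: JUDGE_KEY_VAR      # name of the env var holding the key
```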
---
(safety-eval-aegis-v2-reasoning)=
## aegis_v2_reasoning
Aegis V2 with evaluation of reasoning traces.
::::{tab-set}
:::{tab-item} Container
**Harness:** `safety_eval`
**Container:**
```
nvcr.io/nvidia/eval-factory/safety-harness:25.11
```
**Container Digest:**
```
sha256:08eeb3f5c3282522ca30da7d3ddc2cab1a48909be05ba561a0dae9a299c637f0
```
**Container Arch:** `multiarch`
**Task Type:** `aegis_v2_reasoning`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset {{config.params.extra.dataset}}{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: aegis_v2
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
evaluate_reasoning_traces: true
supported_endpoint_types:
- chat
- completions
type: aegis_v2_reasoning
target:
api_endpoint:
stream: false
```
:::
::::
---
(safety-eval-wildguard)=
## wildguard
Wildguard
::::{tab-set}
:::{tab-item} Container
**Harness:** `safety_eval`
**Container:**
```
nvcr.io/nvidia/eval-factory/safety-harness:25.11
```
**Container Digest:**
```
sha256:08eeb3f5c3282522ca30da7d3ddc2cab1a48909be05ba561a0dae9a299c637f0
```
**Container Arch:** `multiarch`
**Task Type:** `wildguard`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset {{config.params.extra.dataset}}{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: wildguard
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
supported_endpoint_types:
- chat
- completions
type: wildguard
target:
api_endpoint:
stream: false
```
:::
::::
# scicode
This page contains all evaluation tasks for the **scicode** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [scicode](#scicode-scicode)
- SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. Includes the default system prompt ("You are a helpful assistant.").
* - [scicode_aa_v2](#scicode-scicode-aa-v2)
- SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. This variant mimics the setup used by Artificial Analysis in their Intelligence Benchmark (v2): it includes scientist-annotated background in the prompts and uses all available problems for evaluation (including the "dev" set). Does not include the default system prompt ("You are a helpful assistant.").
* - [scicode_background](#scicode-scicode-background)
- SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. This variant includes scientist-annotated background in the prompts. Includes the default system prompt ("You are a helpful assistant.").
```
(scicode-scicode)=
## scicode
SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. Includes the default system prompt ("You are a helpful assistant.").
::::{tab-set}
:::{tab-item} Container
**Harness:** `scicode`
**Container:**
```
nvcr.io/nvidia/eval-factory/scicode:26.01
```
**Container Digest:**
```
sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0
```
**Container Arch:** `multiarch`
**Task Type:** `scicode`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: scicode
pkg_name: scicode
config:
params:
max_new_tokens: 2048
max_retries: 2
parallelism: 1
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
with_background: false
include_dev: false
n_samples: 1
eval_threads: null
include_system_prompt: true
regex_path: null
prompt_template_type: null
supported_endpoint_types:
- chat
type: scicode
target:
api_endpoint:
stream: false
```
:::
::::
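For reference, the `scicode_aa_v2` variant below keeps this structure but changes only the following defaults (values taken from its Defaults tab):

```yaml
config:
  params:
    max_new_tokens: 16384   # up from 2048
    max_retries: 30         # up from 2
    extra:
      with_background: true
      include_dev: true
      n_samples: 3
      include_system_prompt: false
      regex_path: aa_regex.txt
      prompt_template_type: background_comment_template.txt
```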
---
(scicode-scicode-aa-v2)=
## scicode_aa_v2
SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. This variant mimics the setup used by Artificial Analysis in their Intelligence Benchmark (v2): it includes scientist-annotated background in the prompts and uses all available problems for evaluation (including the "dev" set). Does not include the default system prompt ("You are a helpful assistant.").
::::{tab-set}
:::{tab-item} Container
**Harness:** `scicode`
**Container:**
```
nvcr.io/nvidia/eval-factory/scicode:26.01
```
**Container Digest:**
```
sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0
```
**Container Arch:** `multiarch`
**Task Type:** `scicode_aa_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: scicode
pkg_name: scicode
config:
params:
max_new_tokens: 16384
max_retries: 30
parallelism: 1
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
with_background: true
include_dev: true
n_samples: 3
eval_threads: null
include_system_prompt: false
regex_path: aa_regex.txt
prompt_template_type: background_comment_template.txt
supported_endpoint_types:
- chat
type: scicode_aa_v2
target:
api_endpoint:
stream: false
```
:::
::::
---
(scicode-scicode-background)=
## scicode_background
SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. This variant includes scientist-annotated background in the prompts. Includes the default system prompt ("You are a helpful assistant.").
::::{tab-set}
:::{tab-item} Container
**Harness:** `scicode`
**Container:**
```
nvcr.io/nvidia/eval-factory/scicode:26.01
```
**Container Digest:**
```
sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0
```
**Container Arch:** `multiarch`
**Task Type:** `scicode_background`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: scicode
pkg_name: scicode
config:
params:
max_new_tokens: 2048
max_retries: 2
parallelism: 1
temperature: 0.0
request_timeout: 60
top_p: 1.0e-05
extra:
with_background: true
include_dev: false
n_samples: 1
eval_threads: null
include_system_prompt: true
regex_path: null
prompt_template_type: null
supported_endpoint_types:
- chat
type: scicode_background
target:
api_endpoint:
stream: false
```
:::
::::
# simple_evals
This page contains all evaluation tasks for the **simple_evals** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [AA_AIME_2024](#simple-evals-aa-aime-2024)
- AIME 2024 questions, math, using Artificial Analysis's setup.
* - [AA_math_test_500](#simple-evals-aa-math-test-500)
- OpenAI math test 500, using Artificial Analysis's setup.
* - [AIME_2024](#simple-evals-aime-2024)
- AIME 2024 questions, math
* - [AIME_2025](#simple-evals-aime-2025)
- AIME 2025 questions, math
* - [AIME_2025_aa_v2](#simple-evals-aime-2025-aa-v2)
- AIME 2025 questions, math - params aligned with Artificial Analysis Index v2
* - [aime_2024_nemo](#simple-evals-aime-2024-nemo)
- AIME 2024 questions, math, using NeMo's alignment template
* - [aime_2025_nemo](#simple-evals-aime-2025-nemo)
- AIME 2025 questions, math, using NeMo's alignment template
* - [browsecomp](#simple-evals-browsecomp)
- BrowseComp is a benchmark for measuring the ability of agents to browse the web.
* - [gpqa_diamond](#simple-evals-gpqa-diamond)
- gpqa_diamond 0-shot CoT
* - [gpqa_diamond_aa_v2](#simple-evals-gpqa-diamond-aa-v2)
- gpqa_diamond questions with custom regex extraction patterns for AA v2
* - [gpqa_diamond_aa_v2_llama_4](#simple-evals-gpqa-diamond-aa-v2-llama-4)
- gpqa_diamond questions with custom regex extraction patterns for Llama 4
* - [gpqa_diamond_aa_v3](#simple-evals-gpqa-diamond-aa-v3)
- GPQA Diamond with AA v3 methodology - multi-stage regex extraction for robust answer parsing
* - [gpqa_diamond_nemo](#simple-evals-gpqa-diamond-nemo)
- gpqa_diamond questions, reasoning, using NeMo's alignment template
* - [gpqa_extended](#simple-evals-gpqa-extended)
- gpqa_extended 0-shot CoT
* - [gpqa_main](#simple-evals-gpqa-main)
- gpqa_main 0-shot CoT
* - [healthbench](#simple-evals-healthbench)
- HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare.
* - [healthbench_consensus](#simple-evals-healthbench-consensus)
- HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by the consensus of multiple physicians.
* - [healthbench_hard](#simple-evals-healthbench-hard)
- HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1000 examples chosen because they are difficult for current frontier models.
* - [humaneval](#simple-evals-humaneval)
- HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
* - [humanevalplus](#simple-evals-humanevalplus)
- HumanEvalPlus is a dataset of 164 programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
* - [math_test_500](#simple-evals-math-test-500)
- OpenAI math test 500
* - [math_test_500_nemo](#simple-evals-math-test-500-nemo)
- math_test_500 questions, math, using NeMo's alignment template
* - [mgsm](#simple-evals-mgsm)
- MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K are translated by human annotators into 10 languages.
* - [mgsm_aa_v2](#simple-evals-mgsm-aa-v2)
- MGSM is a benchmark of grade-school math problems - params aligned with Artificial Analysis Index v2
* - [mmlu](#simple-evals-mmlu)
- MMLU 0-shot CoT
* - [mmlu_am](#simple-evals-mmlu-am)
- Global-MMLU 0-shot CoT in Amharic (am)
* - [mmlu_ar](#simple-evals-mmlu-ar)
- Global-MMLU 0-shot CoT in Arabic (ar)
* - [mmlu_ar-lite](#simple-evals-mmlu-ar-lite)
- Global-MMLU-Lite 0-shot CoT in Arabic (ar)
* - [mmlu_bn](#simple-evals-mmlu-bn)
- Global-MMLU 0-shot CoT in Bengali (bn)
* - [mmlu_bn-lite](#simple-evals-mmlu-bn-lite)
- Global-MMLU-Lite 0-shot CoT in Bengali (bn)
* - [mmlu_cs](#simple-evals-mmlu-cs)
- Global-MMLU 0-shot CoT in Czech (cs)
* - [mmlu_de](#simple-evals-mmlu-de)
- Global-MMLU 0-shot CoT in German (de)
* - [mmlu_de-lite](#simple-evals-mmlu-de-lite)
- Global-MMLU-Lite 0-shot CoT in German (de)
* - [mmlu_el](#simple-evals-mmlu-el)
- Global-MMLU 0-shot CoT in Greek (el)
* - [mmlu_en](#simple-evals-mmlu-en)
- Global-MMLU 0-shot CoT in English (en)
* - [mmlu_en-lite](#simple-evals-mmlu-en-lite)
- Global-MMLU-Lite 0-shot CoT in English (en)
* - [mmlu_es](#simple-evals-mmlu-es)
- Global-MMLU 0-shot CoT in Spanish (es)
* - [mmlu_es-lite](#simple-evals-mmlu-es-lite)
- Global-MMLU-Lite 0-shot CoT in Spanish (es)
* - [mmlu_fa](#simple-evals-mmlu-fa)
- Global-MMLU 0-shot CoT in Persian (fa)
* - [mmlu_fil](#simple-evals-mmlu-fil)
- Global-MMLU 0-shot CoT in Filipino (fil)
* - [mmlu_fr](#simple-evals-mmlu-fr)
- Global-MMLU 0-shot CoT in French (fr)
* - [mmlu_fr-lite](#simple-evals-mmlu-fr-lite)
- Global-MMLU-Lite 0-shot CoT in French (fr)
* - [mmlu_ha](#simple-evals-mmlu-ha)
- Global-MMLU 0-shot CoT in Hausa (ha)
* - [mmlu_he](#simple-evals-mmlu-he)
- Global-MMLU 0-shot CoT in Hebrew (he)
* - [mmlu_hi](#simple-evals-mmlu-hi)
- Global-MMLU 0-shot CoT in Hindi (hi)
* - [mmlu_hi-lite](#simple-evals-mmlu-hi-lite)
- Global-MMLU-Lite 0-shot CoT in Hindi (hi)
* - [mmlu_id](#simple-evals-mmlu-id)
- Global-MMLU 0-shot CoT in Indonesian (id)
* - [mmlu_id-lite](#simple-evals-mmlu-id-lite)
- Global-MMLU-Lite 0-shot CoT in Indonesian (id)
* - [mmlu_ig](#simple-evals-mmlu-ig)
- Global-MMLU 0-shot CoT in Igbo (ig)
* - [mmlu_it](#simple-evals-mmlu-it)
- Global-MMLU 0-shot CoT in Italian (it)
* - [mmlu_it-lite](#simple-evals-mmlu-it-lite)
- Global-MMLU-Lite 0-shot CoT in Italian (it)
* - [mmlu_ja](#simple-evals-mmlu-ja)
- Global-MMLU 0-shot CoT in Japanese (ja)
* - [mmlu_ja-lite](#simple-evals-mmlu-ja-lite)
- Global-MMLU-Lite 0-shot CoT in Japanese (ja)
* - [mmlu_ko](#simple-evals-mmlu-ko)
- Global-MMLU 0-shot CoT in Korean (ko)
* - [mmlu_ko-lite](#simple-evals-mmlu-ko-lite)
- Global-MMLU-Lite 0-shot CoT in Korean (ko)
* - [mmlu_ky](#simple-evals-mmlu-ky)
- Global-MMLU 0-shot CoT in Kyrgyz (ky)
* - [mmlu_llama_4](#simple-evals-mmlu-llama-4)
- MMLU questions with custom regex extraction patterns for Llama 4
* - [mmlu_lt](#simple-evals-mmlu-lt)
- Global-MMLU 0-shot CoT in Lithuanian (lt)
* - [mmlu_mg](#simple-evals-mmlu-mg)
- Global-MMLU 0-shot CoT in Malagasy (mg)
* - [mmlu_ms](#simple-evals-mmlu-ms)
- Global-MMLU 0-shot CoT in Malay (ms)
* - [mmlu_my-lite](#simple-evals-mmlu-my-lite)
- Global-MMLU-Lite 0-shot CoT in Burmese (my)
* - [mmlu_ne](#simple-evals-mmlu-ne)
- Global-MMLU 0-shot CoT in Nepali (ne)
* - [mmlu_nl](#simple-evals-mmlu-nl)
- Global-MMLU 0-shot CoT in Dutch (nl)
* - [mmlu_ny](#simple-evals-mmlu-ny)
- Global-MMLU 0-shot CoT in Nyanja (ny)
* - [mmlu_pl](#simple-evals-mmlu-pl)
- Global-MMLU 0-shot CoT in Polish (pl)
* - [mmlu_pro](#simple-evals-mmlu-pro)
- The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models' capabilities. It contains 12K complex questions across various disciplines.
* - [mmlu_pro_aa_v2](#simple-evals-mmlu-pro-aa-v2)
- MMLU-Pro - params aligned with Artificial Analysis Index v2
* - [mmlu_pro_aa_v3](#simple-evals-mmlu-pro-aa-v3)
- MMLU-Pro with AA v3 methodology - multi-stage regex extraction with A-J options
* - [mmlu_pro_llama_4](#simple-evals-mmlu-pro-llama-4)
- MMLU-Pro questions with custom regex extraction patterns for Llama 4
* - [mmlu_pt](#simple-evals-mmlu-pt)
- Global-MMLU 0-shot CoT in Portuguese (pt)
* - [mmlu_pt-lite](#simple-evals-mmlu-pt-lite)
- Global-MMLU-Lite 0-shot CoT in Portuguese (pt)
* - [mmlu_ro](#simple-evals-mmlu-ro)
- Global-MMLU 0-shot CoT in Romanian (ro)
* - [mmlu_ru](#simple-evals-mmlu-ru)
- Global-MMLU 0-shot CoT in Russian (ru)
* - [mmlu_si](#simple-evals-mmlu-si)
- Global-MMLU 0-shot CoT in Sinhala (si)
* - [mmlu_sn](#simple-evals-mmlu-sn)
- Global-MMLU 0-shot CoT in Shona (sn)
* - [mmlu_so](#simple-evals-mmlu-so)
- Global-MMLU 0-shot CoT in Somali (so)
* - [mmlu_sr](#simple-evals-mmlu-sr)
- Global-MMLU 0-shot CoT in Serbian (sr)
* - [mmlu_sv](#simple-evals-mmlu-sv)
- Global-MMLU 0-shot CoT in Swedish (sv)
* - [mmlu_sw](#simple-evals-mmlu-sw)
- Global-MMLU 0-shot CoT in Swahili (sw)
* - [mmlu_sw-lite](#simple-evals-mmlu-sw-lite)
- Global-MMLU-Lite 0-shot CoT in Swahili (sw)
* - [mmlu_te](#simple-evals-mmlu-te)
- Global-MMLU 0-shot CoT in Telugu (te)
* - [mmlu_tr](#simple-evals-mmlu-tr)
- Global-MMLU 0-shot CoT in Turkish (tr)
* - [mmlu_uk](#simple-evals-mmlu-uk)
- Global-MMLU 0-shot CoT in Ukrainian (uk)
* - [mmlu_vi](#simple-evals-mmlu-vi)
- Global-MMLU 0-shot CoT in Vietnamese (vi)
* - [mmlu_yo](#simple-evals-mmlu-yo)
- Global-MMLU 0-shot CoT in Yoruba (yo)
* - [mmlu_yo-lite](#simple-evals-mmlu-yo-lite)
- Global-MMLU-Lite 0-shot CoT in Yoruba (yo)
* - [mmlu_zh-lite](#simple-evals-mmlu-zh-lite)
- Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh)
* - [simpleqa](#simple-evals-simpleqa)
- SimpleQA is a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.
```
(simple-evals-aa-aime-2024)=
## AA_AIME_2024
AIME 2024 questions, math, using Artificial Analysis's setup.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `AA_AIME_2024`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AA_AIME_2024
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AA_AIME_2024
target:
  api_endpoint: {}
```
:::
::::
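When `custom_config` is set, the command above first echoes it as JSON to `temp_config.json`, then uses an inline `python3 -c` one-liner to rewrite it as `custom_config.yml` for `--custom_eval_cfg_file`. That conversion step is equivalent to this standalone sketch (the override keys shown are illustrative, not a documented schema):

```python
import json

import yaml  # PyYAML, available inside the evaluation container

# Illustrative overrides standing in for config.params.extra.custom_config.
custom_config = {"temperature": 0.6, "n_samples": 4}

# Step 1: the template dumps the overrides as JSON (the `| tojson` filter).
json_text = json.dumps(custom_config)

# Step 2: the inline one-liner reloads the JSON and writes block-style YAML
# (default_flow_style=False produces one `key: value` pair per line).
config_data = json.loads(json_text)
yaml_text = yaml.dump(config_data, default_flow_style=False)
print(yaml_text)
```

Routing the overrides through JSON keeps shell quoting simple while still handing the harness a YAML file.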
---
(simple-evals-aa-math-test-500)=
## AA_math_test_500
OpenAI math test 500, using Artificial Analysis's setup.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `AA_math_test_500`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AA_math_test_500
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AA_math_test_500
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-aime-2024)=
## AIME_2024
AIME 2024 questions, math
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `AIME_2024`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AIME_2024
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2024
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-aime-2025)=
## AIME_2025
AIME 2025 questions, math
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `AIME_2025`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: AIME_2025
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2025
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-aime-2025-aa-v2)=
## AIME_2025_aa_v2
AIME 2025 questions, math - params aligned with Artificial Analysis Index v2
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `AIME_2025_aa_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: AIME_2025
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: AIME_2025_aa_v2
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-aime-2024-nemo)=
## aime_2024_nemo
AIME 2024 questions, math, using NeMo's alignment template
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `aime_2024_nemo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: aime_2024_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: aime_2024_nemo
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-aime-2025-nemo)=
## aime_2025_nemo
AIME 2025 questions, math, using NeMo's alignment template
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `aime_2025_nemo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: aime_2025_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 10
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: aime_2025_nemo
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-browsecomp)=
## browsecomp
BrowseComp is a benchmark for measuring the ability of agents to browse the web.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `browsecomp`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: browsecomp
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: browsecomp
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-gpqa-diamond)=
## gpqa_diamond
gpqa_diamond 0-shot CoT
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_diamond`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-gpqa-diamond-aa-v2)=
## gpqa_diamond_aa_v2
gpqa_diamond questions with custom regex extraction patterns for AA v2
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_diamond_aa_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: aa_v2_regex
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v2
target:
  api_endpoint: {}
```
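The `extraction` entry above drives answer parsing: the regex is searched against the model response and `match_group: 1` selects the captured choice letter. A minimal sketch of that behavior (the helper name is illustrative, not part of simple_evals):

```python
import re

# Regex copied verbatim from the aa_v2 `extraction` entry; group 1 is the
# multiple-choice letter, and the lookahead rejects letters embedded in words.
AA_V2_REGEX = r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"


def extract_answer(response):
    """Return the extracted multiple-choice letter, or None when no match."""
    match = re.search(AA_V2_REGEX, response)
    return match.group(1) if match else None


print(extract_answer("Careful analysis rules out A and D.\n**Answer:** C"))  # -> C
```

The `(?i)` flag also accepts lowercase letters, and the leading `[\*\_]{0,2}` tolerates markdown bold or italic around "Answer".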
:::
::::
---
(simple-evals-gpqa-diamond-aa-v2-llama-4)=
## gpqa_diamond_aa_v2_llama_4
gpqa_diamond questions with custom regex extraction patterns for Llama 4
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_diamond_aa_v2_llama_4`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v2_llama_4
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-gpqa-diamond-aa-v3)=
## gpqa_diamond_aa_v3
GPQA Diamond with the AA v3 methodology: multi-stage regex extraction for robust answer parsing
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_diamond_aa_v3`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following
          format: ''Answer: A/B/C/D'' (e.g. ''Answer: A'').
          {Question}
          A) {A}
          B) {B}
          C) {C}
          D) {D}
          '
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: primary_answer_format
        - regex: \\boxed\{[^}]*([A-Z])[^}]*\}
          match_group: 1
          name: latex_boxed
        - regex: answer is ([a-zA-Z])
          match_group: 1
          name: natural_language
        - regex: answer is \(([a-zA-Z])\)
          match_group: 1
          name: with_parenthesis
        - regex: ([A-Z])\)\s*[^A-Z]*
          match_group: 1
          name: choice_format
        - regex: ([A-Z])\s+is\s+the\s+correct\s+answer
          match_group: 1
          name: explicit_statement
        - regex: ([A-Z])\s*$
          match_group: 1
          name: standalone_letter_end
        - regex: ([A-Z])\s*\.
          match_group: 1
          name: letter_with_period
        - regex: ([A-Z])\s*[^\w]
          match_group: 1
          name: letter_nonword
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_aa_v3
target:
  api_endpoint: {}
```
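The `prompt_template` above is filled per question: `{Question}` and `{A}`–`{D}` are substituted with the record's fields. A sketch of that substitution (the question text below is a made-up example, not from the dataset):

```python
# Template text reproduced from the aa_v3 `prompt_template` default above.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be in the following format: 'Answer: A/B/C/D' "
    "(e.g. 'Answer: A').\n"
    "{Question}\n"
    "A) {A}\n"
    "B) {B}\n"
    "C) {C}\n"
    "D) {D}\n"
)

# Substitute one record's fields into the template.
prompt = PROMPT_TEMPLATE.format(
    Question="Which particle mediates the electromagnetic force?",
    A="Gluon", B="Photon", C="W boson", D="Higgs boson",
)
print(prompt)
```

The fixed "Answer: A/B/C/D" instruction is what makes the `primary_answer_format` regex in the extraction list a reliable first-stage parser.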
:::
::::
---
(simple-evals-gpqa-diamond-nemo)=
## gpqa_diamond_nemo
gpqa_diamond questions, reasoning, using NeMo's alignment template
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_diamond_nemo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_diamond_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 5
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_diamond_nemo
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-gpqa-extended)=
## gpqa_extended
gpqa_extended 0-shot CoT
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_extended`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_extended
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_extended
target:
  api_endpoint: {}
```
:::
::::
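In the command template above, each judge flag is emitted only when the corresponding `judge.*` value is non-null, via `{% if ... is not none %}` guards. The following stdlib-Python sketch (a hypothetical helper, not part of the SDK) mirrors that flag-assembly logic for the judge parameters:

```python
def build_judge_flags(judge: dict) -> list[str]:
    """Map judge config keys to simple_evals CLI flags, skipping null values,
    mirroring the `{% if ... is not none %}` guards in the Jinja template."""
    flag_names = {
        "url": "--judge_url",
        "model_id": "--judge_model_id",
        "api_key": "--judge_api_key_name",
        "backend": "--judge_backend",
        "request_timeout": "--judge_request_timeout",
        "max_retries": "--judge_max_retries",
        "temperature": "--judge_temperature",
        "top_p": "--judge_top_p",
        "max_tokens": "--judge_max_tokens",
        "max_concurrent_requests": "--judge_max_concurrent_requests",
    }
    args: list[str] = []
    for key, flag in flag_names.items():
        value = judge.get(key)
        if value is not None:  # null in YAML -> flag omitted entirely
            args += [flag, str(value)]
    return args

# With the defaults above (url/model_id/api_key/max_concurrent_requests null),
# only the backend and sampling/timeout flags are emitted:
judge_defaults = {"url": None, "model_id": None, "api_key": None,
                  "backend": "openai", "request_timeout": 600, "max_retries": 16,
                  "temperature": 0.0, "top_p": 0.0001, "max_tokens": 1024,
                  "max_concurrent_requests": None}
print(build_judge_flags(judge_defaults))
```

Because null values drop the flag rather than passing an empty string, the harness's own internal defaults apply whenever a judge parameter is left unset.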
---
(simple-evals-gpqa-main)=
## gpqa_main
GPQA main subset, evaluated 0-shot with chain-of-thought (CoT) prompting.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `gpqa_main`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: gpqa_main
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: gpqa_main
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-healthbench)=
## healthbench
HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `healthbench`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-healthbench-consensus)=
## healthbench_consensus
HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by consensus among multiple physicians.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `healthbench_consensus`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench_consensus
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench_consensus
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-healthbench-hard)=
## healthbench_hard
HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1,000 examples chosen because they are difficult for current frontier models.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `healthbench_hard`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: healthbench_hard
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: JUDGE_API_KEY
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: healthbench_hard
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-humaneval)=
## humaneval
HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `humaneval`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: humaneval
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: humaneval
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-humanevalplus)=
## humanevalplus
HumanEvalPlus extends the 164 HumanEval programming problems with additional test cases, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `humanevalplus`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: humanevalplus
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: humanevalplus
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-math-test-500)=
## math_test_500
OpenAI's 500-problem test subset of the MATH benchmark (MATH-500).
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `math_test_500`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: math_test_500
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: math_test_500
target:
  api_endpoint: {}
```
:::
::::
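The Defaults tab shows the full nested configuration tree; in practice a user supplies only the keys they want to change, and those overrides are layered onto the defaults. A hedged sketch of that layering (the `deep_merge` helper is hypothetical; the SDK's actual merge logic may differ):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`,
    so untouched defaults survive and nested keys can be replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Defaults trimmed to a few keys; override points the judge at a local endpoint.
defaults = {"params": {"temperature": 0.0,
                       "extra": {"judge": {"url": None, "backend": "openai"}}}}
override = {"params": {"extra": {"judge": {"url": "http://localhost:8000/v1"}}}}
merged = deep_merge(defaults, override)
print(merged["params"]["extra"]["judge"])
```

Note that only `judge.url` changes; sibling defaults such as `judge.backend` and `params.temperature` are preserved.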
---
(simple-evals-math-test-500-nemo)=
## math_test_500_nemo
The math_test_500 questions, evaluated using NeMo's alignment template.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `math_test_500_nemo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: math_test_500_nemo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 3
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: math_test_500_nemo
target:
  api_endpoint: {}
```
:::
::::
---
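When `custom_config` is set, the command template above performs a small preprocessing step before invoking `simple_evals`: it serializes the config to JSON (via Jinja's `tojson` filter), writes it to `temp_config.json`, and then converts that file to `custom_config.yml` with an inline `python3 -c` call. The following is a minimal, stdlib-only sketch of the JSON half of that round-trip; the container's inline step additionally uses PyYAML (`yaml.dump`) for the final YAML file, and the config payload here is a hypothetical example, not a required schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical custom_config payload; any JSON-serializable mapping works.
custom_config = {"num_fewshot": 0, "prompt_template": "Q: {question}\nA:"}

output_dir = Path(tempfile.mkdtemp())

# Mirrors: echo '{{config.params.extra.custom_config | tojson}}' > .../temp_config.json
temp_config = output_dir / "temp_config.json"
temp_config.write_text(json.dumps(custom_config))

# Mirrors the inline python3 -c step, which reads this JSON back
# (and then dumps it as YAML via PyYAML, omitted here to stay stdlib-only).
config_data = json.loads(temp_config.read_text())
assert config_data == custom_config  # the JSON round-trip is lossless
```

Because the payload travels through JSON first, anything you put in `custom_config` must be JSON-serializable.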
(simple-evals-mgsm)=
## mgsm
MGSM is a benchmark of grade-school math problems: the same 250 problems from GSM8K, each translated by human annotators into 10 languages.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mgsm`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mgsm
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mgsm
target:
  api_endpoint: {}
```
:::
::::
---
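To see how the Defaults tab maps onto the command template, the sketch below assembles the non-judge portion of the `simple_evals` argument vector from the mgsm default parameters. The model ID and endpoint URL are placeholders for illustration, not real services; flag names follow the template above.

```python
# Default params from the mgsm Defaults tab (config.params subset).
params = {
    "task": "mgsm",
    "temperature": 0.0,
    "top_p": 1.0e-05,
    "max_new_tokens": 16384,
    "parallelism": 10,
    "max_retries": 5,
    "request_timeout": 60,
}

argv = [
    "simple_evals",
    "--model", "my-org/my-model",                 # hypothetical model_id
    "--eval_name", params["task"],
    "--url", "http://localhost:8000/v1/chat/completions",  # hypothetical endpoint
    "--temperature", str(params["temperature"]),
    "--top_p", str(params["top_p"]),
    "--max_tokens", str(params["max_new_tokens"]),
    "--num_threads", str(params["parallelism"]),
    "--max_retries", str(params["max_retries"]),
    "--timeout", str(params["request_timeout"]),
]

# Each Jinja {{config.params.*}} placeholder becomes one flag/value pair.
print(" ".join(argv))
```

Judge flags (`--judge_url`, `--judge_model_id`, and so on) are appended only when the corresponding `extra.judge` fields are non-null, which is why they default to `null` in the YAML.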
(simple-evals-mgsm-aa-v2)=
## mgsm_aa_v2
MGSM is a benchmark of grade-school math problems, with parameters aligned to the Artificial Analysis Index v2.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mgsm_aa_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mgsm
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mgsm_aa_v2
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu)=
## mmlu
MMLU 0-shot CoT
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-am)=
## mmlu_am
Global-MMLU 0-shot CoT in Amharic (am)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_am`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_am
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_am
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ar)=
## mmlu_ar
Global-MMLU 0-shot CoT in Arabic (ar)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ar`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ar
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ar
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ar-lite)=
## mmlu_ar-lite
Global-MMLU-Lite 0-shot CoT in Arabic (ar)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ar-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ar-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ar-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-bn)=
## mmlu_bn
Global-MMLU 0-shot CoT in Bengali (bn)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_bn`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_bn
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_bn
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-bn-lite)=
## mmlu_bn-lite
Global-MMLU-Lite 0-shot CoT in Bengali (bn)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_bn-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_bn-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_bn-lite
target:
  api_endpoint: {}
```
:::
::::
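The Command template above emits each `--judge_*` flag only when the corresponding `extra.judge` value in the Defaults is non-null. A minimal Python sketch of that null-aware flag assembly, for illustration only (the `build_judge_flags` helper is hypothetical, not part of the SDK; the flag names and defaults are taken from the template and Defaults above):

```python
# Sketch: mirror how the Jinja command template turns the "Defaults" YAML
# into simple_evals CLI flags. build_judge_flags is a hypothetical helper
# for illustration, not an SDK API.

defaults = {
    "task": "mmlu_bn-lite",
    "temperature": 0.0,
    "top_p": 1.0e-05,
    "max_new_tokens": 16384,
    "judge": {
        "url": None,           # null in YAML -> flag omitted
        "model_id": None,      # null in YAML -> flag omitted
        "backend": "openai",   # set -> --judge_backend emitted
        "request_timeout": 600,
    },
}

def build_judge_flags(params: dict) -> list[str]:
    """Assemble CLI flags, skipping judge settings that are None."""
    flags = [
        "--eval_name", str(params["task"]),
        "--temperature", str(params["temperature"]),
        "--top_p", str(params["top_p"]),
        "--max_tokens", str(params["max_new_tokens"]),
    ]
    # Equivalent to the template's {% if ... is not none %} guards.
    for key, value in params["judge"].items():
        if value is not None:
            flags += [f"--judge_{key}", str(value)]
    return flags

print(" ".join(build_judge_flags(defaults)))
```

With the defaults shown, only `--judge_backend openai` and `--judge_request_timeout 600` are emitted for the judge; setting `extra.judge.url` and `extra.judge.model_id` enables LLM-as-a-judge scoring against that endpoint.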
---
(simple-evals-mmlu-cs)=
## mmlu_cs
Global-MMLU 0-shot CoT in Czech (cs)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_cs`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_cs
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_cs
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-de)=
## mmlu_de
Global-MMLU 0-shot CoT in German (de)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_de`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_de
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_de
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-de-lite)=
## mmlu_de-lite
Global-MMLU-Lite 0-shot CoT in German (de)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_de-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_de-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_de-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-el)=
## mmlu_el
Global-MMLU 0-shot CoT in Greek (el)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_el`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_el
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_el
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-en)=
## mmlu_en
Global-MMLU 0-shot CoT in English (en)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_en`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_en
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-en-lite)=
## mmlu_en-lite
Global-MMLU-Lite 0-shot CoT in English (en)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_en-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_en-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-es)=
## mmlu_es
Global-MMLU 0-shot CoT in Spanish (es)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_es`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_es
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-es-lite)=
## mmlu_es-lite
Global-MMLU-Lite 0-shot CoT in Spanish (es)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_es-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_es-lite
target:
  api_endpoint: {}
```
:::
::::
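Every value in a Defaults tab is an override point: settings you supply at run time are overlaid onto this tree, and anything you omit keeps its default. A minimal sketch of that overlay in Python (the `deep_merge` helper and the override keys chosen here are illustrative, not the launcher's actual implementation):

```python
# Illustrative sketch: overlaying run-time overrides onto a "Defaults" tree.
# deep_merge is hypothetical; the real launcher's merge semantics may differ.
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base`, returning a new dict."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# A slice of the defaults shown above.
defaults = {
    "config": {
        "params": {
            "temperature": 0.0,
            "parallelism": 10,
            "extra": {"n_samples": 1, "judge": {"backend": "openai"}},
        }
    }
}

# Override only the knobs you care about; siblings keep their defaults.
overrides = {"config": {"params": {"parallelism": 32, "extra": {"n_samples": 4}}}}
merged = deep_merge(defaults, overrides)
# merged keeps temperature=0.0 and judge.backend="openai",
# while parallelism and n_samples take the overridden values.
```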
---
(simple-evals-mmlu-fa)=
## mmlu_fa
Global-MMLU 0-shot CoT in Persian (fa)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_fa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fa
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fa
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-fil)=
## mmlu_fil
Global-MMLU 0-shot CoT in Filipino (fil)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_fil`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fil
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fil
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-fr)=
## mmlu_fr
Global-MMLU 0-shot CoT in French (fr)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_fr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fr
target:
  api_endpoint: {}
```
:::
::::
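In the command template, each judge flag is wrapped in an `{% if ... is not none %}` guard, so a flag reaches the command line only when its value is non-null. The same guard logic, sketched in Python (the `build_judge_flags` helper is illustrative and not part of the SDK; the flag names follow the template):

```python
# Sketch of the template's conditional-flag logic: a judge flag is emitted
# only when its value is not None, mirroring the Jinja `is not none` guards.
# build_judge_flags is a hypothetical helper, not an SDK function.

def build_judge_flags(judge: dict) -> list[str]:
    flag_names = [
        "url", "model_id", "backend", "request_timeout",
        "max_retries", "temperature", "top_p", "max_tokens",
    ]
    args: list[str] = []
    for name in flag_names:
        value = judge.get(name)
        if value is not None:  # null defaults produce no flag at all
            args += [f"--judge_{name}", str(value)]
    return args

# With catalog-style defaults, only the non-null settings become flags.
judge_defaults = {"url": None, "model_id": None, "backend": "openai",
                  "request_timeout": 600, "max_retries": 16}
flags = build_judge_flags(judge_defaults)
# flags == ["--judge_backend", "openai", "--judge_request_timeout", "600",
#           "--judge_max_retries", "16"]
```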
---
(simple-evals-mmlu-fr-lite)=
## mmlu_fr-lite
Global-MMLU-Lite 0-shot CoT in French (fr)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_fr-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_fr-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ha)=
## mmlu_ha
Global-MMLU 0-shot CoT in Hausa (ha)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ha`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ha
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ha
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-he)=
## mmlu_he
Global-MMLU 0-shot CoT in Hebrew (he)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_he`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_he
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_he
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-hi)=
## mmlu_hi
Global-MMLU 0-shot CoT in Hindi (hi)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_hi`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_hi
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_hi
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-hi-lite)=
## mmlu_hi-lite
Global-MMLU-Lite 0-shot CoT in Hindi (hi)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_hi-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_hi-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_hi-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-id)=
## mmlu_id
Global-MMLU 0-shot CoT in Indonesian (id)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_id`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_id
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_id
target:
  api_endpoint: {}
```
:::
::::
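The Command tab above is a Jinja2 template that is rendered against the resolved run configuration before execution: `{{dotted.path}}` placeholders are filled from the target and config objects, and `{% if %}` blocks drop flags whose values are unset. As a minimal stdlib-only sketch of that substitution (the endpoint values below are illustrative placeholders, and the template's conditionals are omitted):

```python
import re

# Minimal stand-in for the Jinja2 rendering the launcher performs:
# substitute {{dotted.path}} placeholders from a nested context dict.
# The {% if %} conditionals of the full template are not handled here.
def render(template: str, context: dict) -> str:
    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", lookup, template)

# A trimmed slice of the command template above (only a few flags shown).
command = (
    "simple_evals --model {{target.api_endpoint.model_id}}"
    " --eval_name {{config.params.task}}"
    " --url {{target.api_endpoint.url}}"
    " --temperature {{config.params.temperature}}"
)

# Context mirroring the shape of the Defaults YAML; model_id and url
# are hypothetical values you would supply for your own endpoint.
context = {
    "target": {"api_endpoint": {
        "model_id": "my-model",
        "url": "http://localhost:8000/v1",
    }},
    "config": {"params": {"task": "mmlu_id", "temperature": 0.0}},
}

print(render(command, context))
# → simple_evals --model my-model --eval_name mmlu_id --url http://localhost:8000/v1 --temperature 0.0
```

The same substitution applies to every task in this catalog; only `task`/`type` and the endpoint details change.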
---
(simple-evals-mmlu-id-lite)=
## mmlu_id-lite
Global-MMLU-Lite 0-shot CoT in Indonesian (id)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_id-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_id-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_id-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ig)=
## mmlu_ig
Global-MMLU 0-shot CoT in Igbo (ig)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ig`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ig
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ig
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-it)=
## mmlu_it
Global-MMLU 0-shot CoT in Italian (it)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_it`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_it
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_it
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-it-lite)=
## mmlu_it-lite
Global-MMLU-Lite 0-shot CoT in Italian (it)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_it-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_it-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_it-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ja)=
## mmlu_ja
Global-MMLU 0-shot CoT in Japanese (ja)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ja`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ja
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ja
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ja-lite)=
## mmlu_ja-lite
Global-MMLU-Lite 0-shot CoT in Japanese (ja)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ja-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ja-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ja-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ko)=
## mmlu_ko
Global-MMLU 0-shot CoT in Korean (ko)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ko`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ko
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ko
target:
  api_endpoint: {}
```
:::
::::
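For orientation, when rendered against a concrete target the command template above collapses to an ordinary `simple_evals` invocation. The following is an illustrative sketch only (the model ID, URL, output directory, and API key variable are hypothetical placeholders, and all optional judge/custom-config branches are omitted):

```
export API_KEY=$MY_API_KEY && simple_evals \
  --model my-org/my-model \
  --eval_name mmlu_ko \
  --url https://my-endpoint.example.com/v1/chat/completions \
  --temperature 0.0 --top_p 1e-05 --max_tokens 16384 \
  --out_dir /results/mmlu_ko --cache_dir /results/mmlu_ko/cache \
  --num_threads 10 --max_retries 5 --timeout 60 --num_repeats 1
```

The flag values correspond to the defaults shown in the Defaults tab.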
---
(simple-evals-mmlu-ko-lite)=
## mmlu_ko-lite
Global-MMLU-Lite 0-shot CoT in Korean (ko)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ko-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ko-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ko-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ky)=
## mmlu_ky
Global-MMLU 0-shot CoT in Kyrgyz (ky)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ky`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ky
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ky
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-llama-4)=
## mmlu_llama_4
MMLU questions with custom regex extraction patterns for Llama 4
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_llama_4`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_llama_4
target:
  api_endpoint: {}
```
:::
::::
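The two `extraction` regexes in the defaults above can be exercised directly in Python. This minimal sketch applies them in order and returns capture group 1, mirroring the `match_group: 1` setting; the `extract_answer` helper is illustrative and not part of the `simple_evals` API:

```python
import re

# Extraction patterns copied verbatim from the mmlu_llama_4 defaults.
PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",
    r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])",
]

def extract_answer(text):
    """Return the first matched answer letter, or None if no pattern fires."""
    for pattern in PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group(1)  # match_group: 1 in the config
    return None

print(extract_answer("**Answer:** C"))          # C  (markdown bold handled)
print(extract_answer("The best answer is B."))  # B
print(extract_answer("I am not sure."))         # None
```

The negative lookahead `(?![a-zA-Z0-9])` keeps the patterns from matching a letter that is merely the start of a longer word.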
---
(simple-evals-mmlu-lt)=
## mmlu_lt
Global-MMLU 0-shot CoT in Lithuanian (lt)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_lt`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_lt
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_lt
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-mg)=
## mmlu_mg
Global-MMLU 0-shot CoT in Malagasy (mg)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_mg`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_mg
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_mg
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ms)=
## mmlu_ms
Global-MMLU 0-shot CoT in Malay (ms)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ms`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ms
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ms
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-my-lite)=
## mmlu_my-lite
Global-MMLU-Lite 0-shot CoT in Burmese (my)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_my-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_my-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_my-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ne)=
## mmlu_ne
Global-MMLU 0-shot CoT in Nepali (ne)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ne`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ne
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ne
target:
  api_endpoint: {}
```
:::
::::
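When `custom_config` is set, the command template above first writes the override to a temporary JSON file and then converts it to YAML before invoking `simple_evals`. A minimal sketch of that inline conversion step, using a hypothetical override and local file names in place of the rendered `{{config.output_dir}}` paths:

```python
import json

import yaml  # PyYAML, available inside the evaluation container

# Hypothetical custom_config override; any JSON-serializable mapping works.
custom_config = {"prompt_template": "Answer briefly: {Question}"}

# Mirror the template's steps: dump to JSON, reload, rewrite as block-style YAML.
with open("temp_config.json", "w") as f:
    json.dump(custom_config, f)

config_data = json.load(open("temp_config.json"))
yaml.dump(config_data, open("custom_config.yml", "w"), default_flow_style=False)
```

The resulting `custom_config.yml` is what the rendered command passes to `simple_evals` via `--custom_eval_cfg_file`.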
---
(simple-evals-mmlu-nl)=
## mmlu_nl
Global-MMLU 0-shot CoT in Dutch (nl)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_nl`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_nl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_nl
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-ny)=
## mmlu_ny
Global-MMLU 0-shot CoT in Nyanja (ny)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ny`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ny
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ny
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-pl)=
## mmlu_pl
Global-MMLU 0-shot CoT in Polish (pl)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pl`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_pl
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-pro)=
## mmlu_pro
The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding benchmark, designed to evaluate large language models' capabilities more rigorously. It contains 12K complex questions across a range of disciplines.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pro`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_pro
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-pro-aa-v2)=
## mmlu_pro_aa_v2
MMLU-Pro with parameters aligned to the Artificial Analysis Index v2
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pro_aa_v2`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_pro_aa_v2
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-pro-aa-v3)=
## mmlu_pro_aa_v3
MMLU-Pro with the Artificial Analysis (AA) v3 methodology: multi-stage regex answer extraction over options A-J
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pro_aa_v3`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following
          format: ''Answer: A/B/C/D/E/F/G/H/I/J'' (e.g. ''Answer: A'').
          {Question}
          A) {A}
          B) {B}
          C) {C}
          D) {D}
          E) {E}
          F) {F}
          G) {G}
          H) {H}
          I) {I}
          J) {J}
          '
        extraction:
          - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
            match_group: 1
            name: primary_answer_format
          - regex: \\boxed\{[^}]*([A-Z])[^}]*\}
            match_group: 1
            name: latex_boxed
          - regex: answer is ([a-zA-Z])
            match_group: 1
            name: natural_language
          - regex: answer is \(([a-zA-Z])\)
            match_group: 1
            name: with_parenthesis
          - regex: ([A-Z])\)\s*[^A-Z]*
            match_group: 1
            name: choice_format
          - regex: ([A-Z])\s+is\s+the\s+correct\s+answer
            match_group: 1
            name: explicit_statement
          - regex: ([A-Z])\s*$
            match_group: 1
            name: standalone_letter_end
          - regex: ([A-Z])\s*\.
            match_group: 1
            name: letter_with_period
          - regex: ([A-Z])\s*[^\w]
            match_group: 1
            name: letter_nonword
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_pro_aa_v3
target:
  api_endpoint: {}
```
:::
::::
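The multi-stage extraction above tries each regex in order and keeps the first capture that matches. A sketch of how that cascade behaves, using the first four patterns from the defaults (the full config defines nine; the remainder are progressively looser fallbacks):

```python
import re

# First four extraction patterns from the mmlu_pro_aa_v3 defaults, in order.
PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # primary_answer_format
    r"\\boxed\{[^}]*([A-Z])[^}]*\}",                                                # latex_boxed
    r"answer is ([a-zA-Z])",                                                        # natural_language
    r"answer is \(([a-zA-Z])\)",                                                    # with_parenthesis
]


def extract_answer(text):
    """Return the first capture group from the first pattern that matches."""
    for pattern in PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


print(extract_answer("Working through the options...\n\n**Answer: C**"))  # C
print(extract_answer(r"so the result is \boxed{B}"))                      # B
```

Note that `natural_language` cannot match a parenthesized letter (the character after `answer is ` must itself be a letter), so a response like `the answer is (d)` falls through to `with_parenthesis`.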
---
(simple-evals-mmlu-pro-llama-4)=
## mmlu_pro_llama_4
MMLU-Pro questions with custom regex extraction patterns for Llama 4
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pro_llama_4`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
          - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
            match_group: 1
            name: answer_colon_llama4
          - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
            match_group: 1
            name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_pro_llama_4
target:
  api_endpoint: {}
```
:::
::::
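Unlike `mmlu_pro_aa_v3`, this task's `custom_config` overrides only the extraction patterns: an `Answer: X` format is tried first, then a case-insensitive "best answer is" phrasing. A sketch of that two-stage check; a response matching neither pattern yields no extracted answer:

```python
import re

# The two Llama 4 extraction patterns from the defaults above, tried in order.
LLAMA4_PATTERNS = [
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])",  # answer_colon_llama4
    r"(?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])",           # answer_is_llama4
]


def extract_llama4_answer(text):
    for pattern in LLAMA4_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None  # neither pattern matched; the response is unparsed


print(extract_llama4_answer("The best answer is B."))  # B
```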
---
(simple-evals-mmlu-pt)=
## mmlu_pt
Global-MMLU 0-shot CoT in Portuguese (pt)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pt`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_pt
target:
  api_endpoint: {}
:::
::::
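The command template above embeds an inline `python3 -c` one-liner that converts the JSON custom config into YAML before passing it via `--custom_eval_cfg_file`. A readable sketch of that same conversion, with a temporary directory standing in for `{{config.output_dir}}` and an illustrative payload (the real file contents come from `config.params.extra.custom_config`):

```python
import json
import tempfile
from pathlib import Path

import yaml  # PyYAML, as used by the inline conversion step

# Illustrative stand-in for {{config.output_dir}} in the template above.
output_dir = Path(tempfile.mkdtemp())

# The `echo '{{... | tojson}}' > temp_config.json` step dumps the custom
# config as JSON; here we fake that step with a tiny example payload.
(output_dir / "temp_config.json").write_text(
    json.dumps({"num_fewshot": 0, "language": "pt"})
)

# Equivalent of the `python3 -c` one-liner: read the JSON back and
# rewrite it as block-style YAML for --custom_eval_cfg_file.
config_data = json.loads((output_dir / "temp_config.json").read_text())
(output_dir / "custom_config.yml").write_text(
    yaml.dump(config_data, default_flow_style=False)
)

print((output_dir / "custom_config.yml").read_text())
```

The round trip exists only because `--custom_eval_cfg_file` expects YAML while the launcher serializes the override as JSON; the data itself is unchanged.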
---
(simple-evals-mmlu-pt-lite)=
## mmlu_pt-lite
Global-MMLU-Lite 0-shot CoT in Portuguese (pt)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_pt-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_pt-lite
target:
  api_endpoint: {}
:::
::::
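The Defaults tab shows the YAML the launcher starts from; in practice you load these defaults and layer overrides on top before launching. A minimal sketch, assuming PyYAML, using an abbreviated excerpt of the defaults above (not the full config):

```python
import yaml  # PyYAML, same library the command template's conversion step uses

# Abbreviated excerpt of the Defaults tab above, for illustration only.
defaults = """\
framework_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    parallelism: 10
    task: mmlu_pt-lite
    temperature: 0.0
"""

cfg = yaml.safe_load(defaults)

# Override a default before launching, e.g. lower the generation budget.
cfg["config"]["params"]["max_new_tokens"] = 4096

print(cfg["config"]["params"]["task"])
```

Any key shown in the Defaults tab can be overridden the same way; keys you leave untouched keep the values listed above.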
---
(simple-evals-mmlu-ro)=
## mmlu_ro
Global-MMLU 0-shot CoT in Romanian (ro)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ro`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ro
target:
  api_endpoint: {}
:::
::::
---
(simple-evals-mmlu-ru)=
## mmlu_ru
Global-MMLU 0-shot CoT in Russian (ru)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_ru`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ru
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ru
target:
  api_endpoint: {}
:::
::::
---
(simple-evals-mmlu-si)=
## mmlu_si
Global-MMLU 0-shot CoT in Sinhala (si)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_si`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_si
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_si
target:
  api_endpoint: {}
:::
::::
---
(simple-evals-mmlu-sn)=
## mmlu_sn
Global-MMLU 0-shot CoT in Shona (sn)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_sn`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sn
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_sn
target:
  api_endpoint: {}
:::
::::
---
(simple-evals-mmlu-so)=
## mmlu_so
Global-MMLU 0-shot CoT in Somali (so)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_so`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_so
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_so
target:
  api_endpoint: {}
:::
::::
---
(simple-evals-mmlu-sr)=
## mmlu_sr
Global-MMLU 0-shot CoT in Serbian (sr)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_sr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_sr
target:
  api_endpoint: {}
:::
::::
---
(simple-evals-mmlu-sv)=
## mmlu_sv
Global-MMLU 0-shot CoT in Swedish (sv)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_sv`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sv
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sv
target:
  api_endpoint: {}
```
:::
::::
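The defaults above can be selectively overridden rather than rewritten in full: any key you set in your own run configuration replaces the documented default, and everything you omit keeps its value. The sketch below is illustrative only; the `mmlu_sv` task type and the `config.params` keys come from the defaults above, while the endpoint URL, model ID, and API key variable name are placeholders you must supply for your own deployment.

```yaml
# Hypothetical override fragment: only the keys set here change;
# all other parameters keep the documented defaults.
config:
  type: mmlu_sv
  params:
    limit_samples: 100   # smoke-test on the first 100 samples only
    parallelism: 32      # raise request concurrency for a faster endpoint
target:
  api_endpoint:
    url: https://your-endpoint.example.com/v1/chat/completions  # placeholder
    model_id: your-model-id                                     # placeholder
    api_key_name: API_KEY                                       # placeholder env var name
```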
---
(simple-evals-mmlu-sw)=
## mmlu_sw
Global-MMLU 0-shot CoT in Swahili (sw)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_sw`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sw
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sw
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-sw-lite)=
## mmlu_sw-lite
Global-MMLU-Lite 0-shot CoT in Swahili (sw)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_sw-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sw-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sw-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-te)=
## mmlu_te
Global-MMLU 0-shot CoT in Telugu (te)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_te`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_te
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_te
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-tr)=
## mmlu_tr
Global-MMLU 0-shot CoT in Turkish (tr)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_tr`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_tr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_tr
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-uk)=
## mmlu_uk
Global-MMLU 0-shot CoT in Ukrainian (uk)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_uk`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_uk
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_uk
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-vi)=
## mmlu_vi
Global-MMLU 0-shot CoT in Vietnamese (vi)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_vi`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_vi
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_vi
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-yo)=
## mmlu_yo
Global-MMLU 0-shot CoT in Yoruba (yo)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_yo`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_yo
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_yo
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-yo-lite)=
## mmlu_yo-lite
Global-MMLU-Lite 0-shot CoT in Yoruba (yo)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_yo-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
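In the template above, each `--judge_*` flag is emitted only when the corresponding default is not `null`. A minimal Python sketch of that conditional flag construction (illustrative values, not the real template engine; with the shipped defaults, only `--judge_backend`, `--judge_request_timeout`, and `--judge_max_retries` would appear, since `url` and `model_id` default to `null`):

```python
# Judge defaults as in the Defaults tab; None stands in for YAML null.
judge = {
    "url": None, "model_id": None, "backend": "openai",
    "request_timeout": 600, "max_retries": 16,
}
args = ["simple_evals", "--eval_name", "mmlu_yo-lite"]
flag_map = {
    "url": "--judge_url", "model_id": "--judge_model_id",
    "backend": "--judge_backend", "request_timeout": "--judge_request_timeout",
    "max_retries": "--judge_max_retries",
}
# Mirror the template's `{% if ... is not none %}` guards.
for key, flag in flag_map.items():
    if judge[key] is not None:
        args += [flag, str(judge[key])]
print(" ".join(args))
```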
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_yo-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_yo-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-mmlu-zh-lite)=
## mmlu_zh-lite
Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh)
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `mmlu_zh-lite`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_zh-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_zh-lite
target:
  api_endpoint: {}
```
:::
::::
---
(simple-evals-simpleqa)=
## simpleqa
A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.
::::{tab-set}
:::{tab-item} Container
**Harness:** `simple_evals`
**Container:**
```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```
**Container Digest:**
```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```
**Container Arch:** `multiarch`
**Task Type:** `simpleqa`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: simpleqa
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: simpleqa
target:
  api_endpoint: {}
```
:::
::::
# tau2_bench
This page contains all evaluation tasks for the **tau2_bench** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [tau2_bench_airline](#tau2-bench-tau2-bench-airline)
- tau2-bench - Airline Domain
* - [tau2_bench_retail](#tau2-bench-tau2-bench-retail)
- tau2-bench - Retail Domain
* - [tau2_bench_telecom](#tau2-bench-tau2-bench-telecom)
- tau2-bench - Telecom Domain (used by Artificial Analysis Index v2)
```
(tau2-bench-tau2-bench-airline)=
## tau2_bench_airline
tau2-bench - Airline Domain
::::{tab-set}
:::{tab-item} Container
**Harness:** `tau2_bench`
**Container:**
```
nvcr.io/nvidia/eval-factory/tau2-bench:26.01
```
**Container Digest:**
```
sha256:24aae1ed0eb955810a597382b1cbbfd8da64f9f74e1e64a4afd6a271d1b98be3
```
**Container Arch:** `multiarch`
**Task Type:** `tau2_bench_airline`
:::
:::{tab-item} Command
```bash
{% if config.params.extra.cache.enabled %}export LLM_CACHE_ENABLED=true && export CACHE_TYPE=disk && export CACHE_DIR={{config.params.extra.cache.cache_dir}} && {% endif %} tau2 run --domain {{config.params.task}} --agent-llm openai/{{target.api_endpoint.model_id}} --user-llm openai/{{config.params.extra.user.model_id}} {% if config.params.extra.judge.enabled %}--judge-llm openai/{{config.params.extra.judge.model_id}}{% endif %} {% if target.api_endpoint.api_key_name is not none %}--agent-api-key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.extra.user.api_key is not none %}--user-api-key {{config.params.extra.user.api_key}}{% endif %} {% if config.params.extra.judge.enabled and config.params.extra.judge.api_key is not none %}--judge-api-key {{config.params.extra.judge.api_key}}{% endif %} --agent-llm-args '{"base_url": "{{target.api_endpoint.url}}", "temperature": {{config.params.temperature}}, "top_p": {{config.params.top_p}}, "max_completion_tokens": {{config.params.max_new_tokens}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.agent_args is defined and config.params.extra.agent_args is not none %}{% for key, value in config.params.extra.agent_args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' --user-llm-args '{"base_url": "{{config.params.extra.user.url}}", "temperature": {{config.params.extra.user.temperature}}, "top_p": {{config.params.extra.user.top_p}}, "max_completion_tokens": {{config.params.extra.user.max_new_tokens}}, "timeout": {{config.params.extra.user.request_timeout}}{% if config.params.extra.user.args is defined and config.params.extra.user.args is not none %}{% for key, value in config.params.extra.user.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' {% if config.params.extra.judge.enabled %}--judge-llm-args '{"base_url": "{{config.params.extra.judge.url}}", "temperature": {{config.params.extra.judge.temperature}}, "top_p": {{config.params.extra.judge.top_p}}, "max_completion_tokens": {{config.params.extra.judge.max_new_tokens}}, "timeout": {{config.params.extra.judge.request_timeout}}{% if config.params.extra.judge.args is defined and config.params.extra.judge.args is not none %}{% for key, value in config.params.extra.judge.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}'{% endif %} {% if config.params.extra.judge.enabled %}--judge-system-prompt "{{config.params.extra.judge.system_prompt}}"{% endif %} {% if config.params.extra.judge.enabled %}--judge-window-size {{config.params.extra.judge_window_size}}{% endif %} --max-concurrency {{config.params.parallelism}} --max-retries {{config.params.max_retries}} --max-steps {{config.params.extra.max_steps}} --results-dir {{config.output_dir}} --num-trials {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --num-tasks {{config.params.limit_samples}} {% endif %} {% if config.params.extra.skip_failed_samples %} --skip-failed-samples {% endif %}
```
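The `--agent-llm-args` value in the command above is a JSON object that the template assembles inline, merging the fixed sampling parameters with any entries from `config.params.extra.agent_args`. A rough stdlib sketch of that assembly, with placeholder values standing in for the Jinja context:

```python
import json

# Placeholder values standing in for config.params.* and target.api_endpoint.url.
params = {"temperature": 0.0, "top_p": 0.95,
          "max_new_tokens": 16384, "request_timeout": 3600}
extra_agent_args = {"seed": 42}  # hypothetical config.params.extra.agent_args

agent_llm_args = {
    "base_url": "http://localhost:8000/v1/chat/completions",
    "temperature": params["temperature"],
    "top_p": params["top_p"],
    "max_completion_tokens": params["max_new_tokens"],
    "timeout": params["request_timeout"],
    # The template's {% for %} loop appends extra args as additional JSON keys.
    **extra_agent_args,
}
# This JSON string is what the template passes as --agent-llm-args '...'.
print(json.dumps(agent_llm_args))
```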
:::
:::{tab-item} Defaults
```yaml
framework_name: tau2_bench
pkg_name: nvidia_tau2
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: airline
    temperature: 0.0
    request_timeout: 3600
    top_p: 0.95
    extra:
      n_samples: 3
      max_steps: 100
      judge_window_size: 30
      skip_failed_samples: false
      cache:
        enabled: true
        cache_dir: .cache/llm_cache
      user:
        url: https://integrate.api.nvidia.com/v1/chat/completions
        model_id: nvdev/qwen/qwen-235b
        api_key: USER_API_KEY
        temperature: 0.0
        max_new_tokens: 4096
        top_p: 0.95
        request_timeout: 3600
      judge:
        enabled: false
        url: https://integrate.api.nvidia.com/v1/chat/completions
        model_id: openai/gpt-oss-120b
        system_prompt: Reasoning:medium
        api_key: JUDGE_API_KEY
        temperature: 0.6
        max_new_tokens: 16000
        top_p: 0.95
        request_timeout: 3600
  supported_endpoint_types:
  - chat
  type: tau2_bench_airline
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(tau2-bench-tau2-bench-retail)=
## tau2_bench_retail
tau2-bench - Retail Domain
::::{tab-set}
:::{tab-item} Container
**Harness:** `tau2_bench`
**Container:**
```
nvcr.io/nvidia/eval-factory/tau2-bench:26.01
```
**Container Digest:**
```
sha256:24aae1ed0eb955810a597382b1cbbfd8da64f9f74e1e64a4afd6a271d1b98be3
```
**Container Arch:** `multiarch`
**Task Type:** `tau2_bench_retail`
:::
:::{tab-item} Command
```bash
{% if config.params.extra.cache.enabled %}export LLM_CACHE_ENABLED=true && export CACHE_TYPE=disk && export CACHE_DIR={{config.params.extra.cache.cache_dir}} && {% endif %} tau2 run --domain {{config.params.task}} --agent-llm openai/{{target.api_endpoint.model_id}} --user-llm openai/{{config.params.extra.user.model_id}} {% if config.params.extra.judge.enabled %}--judge-llm openai/{{config.params.extra.judge.model_id}}{% endif %} {% if target.api_endpoint.api_key_name is not none %}--agent-api-key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.extra.user.api_key is not none %}--user-api-key {{config.params.extra.user.api_key}}{% endif %} {% if config.params.extra.judge.enabled and config.params.extra.judge.api_key is not none %}--judge-api-key {{config.params.extra.judge.api_key}}{% endif %} --agent-llm-args '{"base_url": "{{target.api_endpoint.url}}", "temperature": {{config.params.temperature}}, "top_p": {{config.params.top_p}}, "max_completion_tokens": {{config.params.max_new_tokens}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.agent_args is defined and config.params.extra.agent_args is not none %}{% for key, value in config.params.extra.agent_args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' --user-llm-args '{"base_url": "{{config.params.extra.user.url}}", "temperature": {{config.params.extra.user.temperature}}, "top_p": {{config.params.extra.user.top_p}}, "max_completion_tokens": {{config.params.extra.user.max_new_tokens}}, "timeout": {{config.params.extra.user.request_timeout}}{% if config.params.extra.user.args is defined and config.params.extra.user.args is not none %}{% for key, value in config.params.extra.user.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' {% if config.params.extra.judge.enabled %}--judge-llm-args '{"base_url": "{{config.params.extra.judge.url}}", "temperature": {{config.params.extra.judge.temperature}}, "top_p": {{config.params.extra.judge.top_p}}, "max_completion_tokens": {{config.params.extra.judge.max_new_tokens}}, "timeout": {{config.params.extra.judge.request_timeout}}{% if config.params.extra.judge.args is defined and config.params.extra.judge.args is not none %}{% for key, value in config.params.extra.judge.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}'{% endif %} {% if config.params.extra.judge.enabled %}--judge-system-prompt "{{config.params.extra.judge.system_prompt}}"{% endif %} {% if config.params.extra.judge.enabled %}--judge-window-size {{config.params.extra.judge_window_size}}{% endif %} --max-concurrency {{config.params.parallelism}} --max-retries {{config.params.max_retries}} --max-steps {{config.params.extra.max_steps}} --results-dir {{config.output_dir}} --num-trials {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --num-tasks {{config.params.limit_samples}} {% endif %} {% if config.params.extra.skip_failed_samples %} --skip-failed-samples {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: tau2_bench
pkg_name: nvidia_tau2
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: retail
    temperature: 0.0
    request_timeout: 3600
    top_p: 0.95
    extra:
      n_samples: 3
      max_steps: 100
      judge_window_size: 30
      skip_failed_samples: false
      cache:
        enabled: true
        cache_dir: .cache/llm_cache
      user:
        url: https://integrate.api.nvidia.com/v1/chat/completions
        model_id: nvdev/qwen/qwen-235b
        api_key: USER_API_KEY
        temperature: 0.0
        max_new_tokens: 4096
        top_p: 0.95
        request_timeout: 3600
      judge:
        enabled: false
        url: https://integrate.api.nvidia.com/v1/chat/completions
        model_id: openai/gpt-oss-120b
        system_prompt: Reasoning:medium
        api_key: JUDGE_API_KEY
        temperature: 0.6
        max_new_tokens: 16000
        top_p: 0.95
        request_timeout: 3600
  supported_endpoint_types:
  - chat
  type: tau2_bench_retail
target:
  api_endpoint:
    stream: false
```
:::
::::
---
(tau2-bench-tau2-bench-telecom)=
## tau2_bench_telecom
tau2-bench - Telecom Domain (used by Artificial Analysis Index v2)
::::{tab-set}
:::{tab-item} Container
**Harness:** `tau2_bench`
**Container:**
```
nvcr.io/nvidia/eval-factory/tau2-bench:26.01
```
**Container Digest:**
```
sha256:24aae1ed0eb955810a597382b1cbbfd8da64f9f74e1e64a4afd6a271d1b98be3
```
**Container Arch:** `multiarch`
**Task Type:** `tau2_bench_telecom`
:::
:::{tab-item} Command
```bash
{% if config.params.extra.cache.enabled %}export LLM_CACHE_ENABLED=true && export CACHE_TYPE=disk && export CACHE_DIR={{config.params.extra.cache.cache_dir}} && {% endif %} tau2 run --domain {{config.params.task}} --agent-llm openai/{{target.api_endpoint.model_id}} --user-llm openai/{{config.params.extra.user.model_id}} {% if config.params.extra.judge.enabled %}--judge-llm openai/{{config.params.extra.judge.model_id}}{% endif %} {% if target.api_endpoint.api_key_name is not none %}--agent-api-key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.extra.user.api_key is not none %}--user-api-key {{config.params.extra.user.api_key}}{% endif %} {% if config.params.extra.judge.enabled and config.params.extra.judge.api_key is not none %}--judge-api-key {{config.params.extra.judge.api_key}}{% endif %} --agent-llm-args '{"base_url": "{{target.api_endpoint.url}}", "temperature": {{config.params.temperature}}, "top_p": {{config.params.top_p}}, "max_completion_tokens": {{config.params.max_new_tokens}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.agent_args is defined and config.params.extra.agent_args is not none %}{% for key, value in config.params.extra.agent_args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' --user-llm-args '{"base_url": "{{config.params.extra.user.url}}", "temperature": {{config.params.extra.user.temperature}}, "top_p": {{config.params.extra.user.top_p}}, "max_completion_tokens": {{config.params.extra.user.max_new_tokens}}, "timeout": {{config.params.extra.user.request_timeout}}{% if config.params.extra.user.args is defined and config.params.extra.user.args is not none %}{% for key, value in config.params.extra.user.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' {% if config.params.extra.judge.enabled %}--judge-llm-args '{"base_url": "{{config.params.extra.judge.url}}", "temperature": {{config.params.extra.judge.temperature}}, "top_p": {{config.params.extra.judge.top_p}}, "max_completion_tokens": {{config.params.extra.judge.max_new_tokens}}, "timeout": {{config.params.extra.judge.request_timeout}}{% if config.params.extra.judge.args is defined and config.params.extra.judge.args is not none %}{% for key, value in config.params.extra.judge.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}'{% endif %} {% if config.params.extra.judge.enabled %}--judge-system-prompt "{{config.params.extra.judge.system_prompt}}"{% endif %} {% if config.params.extra.judge.enabled %}--judge-window-size {{config.params.extra.judge_window_size}}{% endif %} --max-concurrency {{config.params.parallelism}} --max-retries {{config.params.max_retries}} --max-steps {{config.params.extra.max_steps}} --results-dir {{config.output_dir}} --num-trials {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --num-tasks {{config.params.limit_samples}} {% endif %} {% if config.params.extra.skip_failed_samples %} --skip-failed-samples {% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: tau2_bench
pkg_name: nvidia_tau2
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: telecom
    temperature: 0.0
    request_timeout: 3600
    top_p: 0.95
    extra:
      n_samples: 3
      max_steps: 100
      judge_window_size: 30
      skip_failed_samples: false
      cache:
        enabled: true
        cache_dir: .cache/llm_cache
      user:
        url: https://integrate.api.nvidia.com/v1/chat/completions
        model_id: nvdev/qwen/qwen-235b
        api_key: USER_API_KEY
        temperature: 0.0
        max_new_tokens: 4096
        top_p: 0.95
        request_timeout: 3600
      judge:
        enabled: false
        url: https://integrate.api.nvidia.com/v1/chat/completions
        model_id: openai/gpt-oss-120b
        system_prompt: Reasoning:medium
        api_key: JUDGE_API_KEY
        temperature: 0.6
        max_new_tokens: 16000
        top_p: 0.95
        request_timeout: 3600
  supported_endpoint_types:
  - chat
  type: tau2_bench_telecom
target:
  api_endpoint:
    stream: false
```
:::
::::
# tooltalk
This page contains all evaluation tasks for the **tooltalk** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [tooltalk](#tooltalk-tooltalk)
- ToolTalk task with default settings.
```
(tooltalk-tooltalk)=
## tooltalk
ToolTalk task with default settings.
::::{tab-set}
:::{tab-item} Container
**Harness:** `tooltalk`
**Container:**
```
nvcr.io/nvidia/eval-factory/tooltalk:26.01
```
**Container Digest:**
```
sha256:2c032e8274fd3a825b3c2774d33d0caddfa198fe24980dd99b8e3ae77c8aadee
```
**Container Arch:** `multiarch`
**Task Type:** `tooltalk`
:::
:::{tab-item} Command
```bash
{% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} python -m tooltalk.evaluation.evaluate_{{'openai' if 'azure' in target.api_endpoint.url or 'api.openai' in target.api_endpoint.url else 'nim'}} --dataset data/easy --database data/databases --model {{target.api_endpoint.model_id}} {% if config.params.max_new_tokens is not none %}--max_new_tokens {{config.params.max_new_tokens}}{% endif %} {% if config.params.temperature is not none %}--temperature {{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}--top_p {{config.params.top_p}}{% endif %} --api_mode all --output_dir {{config.output_dir}} --url {{target.api_endpoint.url}} {% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %}
```
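The template above picks the evaluation module from the endpoint URL: Azure or OpenAI URLs dispatch to `evaluate_openai`, and anything else falls back to `evaluate_nim`. The same dispatch logic, restated in Python:

```python
def tooltalk_module(url: str) -> str:
    # Mirrors the template's inline conditional: OpenAI-style evaluation for
    # Azure or OpenAI endpoint URLs, NIM-style evaluation otherwise.
    suffix = "openai" if ("azure" in url or "api.openai" in url) else "nim"
    return f"tooltalk.evaluation.evaluate_{suffix}"

print(tooltalk_module("https://api.openai.com/v1/chat/completions"))
# -> tooltalk.evaluation.evaluate_openai
print(tooltalk_module("http://localhost:8000/v1/chat/completions"))
# -> tooltalk.evaluation.evaluate_nim
```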
:::
:::{tab-item} Defaults
```yaml
framework_name: tooltalk
pkg_name: tooltalk
config:
  params:
    extra: {}
  supported_endpoint_types:
  - chat
  type: tooltalk
target:
  api_endpoint: {}
```
:::
::::
# vlmevalkit
This page contains all evaluation tasks for the **vlmevalkit** harness.
```{list-table}
:header-rows: 1
:widths: 30 70
* - Task
- Description
* - [ai2d_judge](#vlmevalkit-ai2d-judge)
- A benchmark for evaluating diagram understanding capabilities of large vision-language models.
* - [chartqa](#vlmevalkit-chartqa)
- A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
* - [mathvista-mini](#vlmevalkit-mathvista-mini)
- Evaluating Math Reasoning in Visual Contexts
* - [mmmu_judge](#vlmevalkit-mmmu-judge)
- A benchmark for evaluating multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
* - [ocr_reasoning](#vlmevalkit-ocr-reasoning)
- A comprehensive benchmark of 1,069 human-annotated examples that evaluates multimodal large language models on text-rich image reasoning, assessing both final answers and the reasoning process across six core abilities and 18 practical tasks.
* - [ocrbench](#vlmevalkit-ocrbench)
- A comprehensive evaluation benchmark designed to assess the OCR capabilities of large multimodal models
* - [slidevqa](#vlmevalkit-slidevqa)
- Evaluates a model's ability to answer questions about slide decks by selecting relevant slides from multiple images
```
(vlmevalkit-ai2d-judge)=
## ai2d_judge
A benchmark for evaluating diagram understanding capabilities of large vision-language models.
::::{tab-set}
:::{tab-item} Container
**Harness:** `vlmevalkit`
**Container:**
```
nvcr.io/nvidia/eval-factory/vlmevalkit:26.01
```
**Container Digest:**
```
sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc
```
**Container Arch:** `amd`
**Task Type:** `ai2d_judge`
:::
:::{tab-item} Command
```bash
cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'
{
  "model": {
    "{{target.api_endpoint.model_id.split('/')[-1]}}": {
      "class": "CustomOAIEndpoint",
      "model": "{{target.api_endpoint.model_id}}",
      "api_base": "{{target.api_endpoint.url}}",
      "api_key_var_name": "{{target.api_endpoint.api_key_name}}",
      "max_tokens": {{config.params.max_new_tokens}},
      "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %}
      "top_p": {{config.params.top_p}},{% endif %}
      "retry": {{config.params.max_retries}},
      "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %},
      "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %},
      "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %},
      "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %},
      "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %},
      "verbose": {{config.params.extra.verbose}}{% endif %}
    }
  },
  "data": {
    "{{config.params.extra.dataset.name}}": {
      "class": "{{config.params.extra.dataset.class}}",
      "dataset": "{{config.params.extra.dataset.name}}",
      "model": "{{target.api_endpoint.model_id}}"
    }
  }
}
EOF
python -m vlmeval.run \
--config {{config.output_dir}}/vlmeval_config.json \
--work-dir {{config.output_dir}} \
--api-nproc {{config.params.parallelism}} \
{%- if config.params.extra.judge is defined %}
--judge {{config.params.extra.judge.model}} \
--judge-args '{{config.params.extra.judge.args}}' \
{%- endif %}
{% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %}
```
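The heredoc above materializes a `vlmeval_config.json` file before invoking `vlmeval.run`. An equivalent stdlib sketch of that step (all values are illustrative placeholders for the Jinja context, not real endpoints):

```python
import json

# Placeholders for target.api_endpoint.* and config.params.*; the real values
# are filled in when the template is rendered.
model_id = "nvidia/example-vlm"
vlmeval_config = {
    "model": {
        # The model key is the last path segment, as in model_id.split('/')[-1].
        model_id.split("/")[-1]: {
            "class": "CustomOAIEndpoint",
            "model": model_id,
            "api_base": "http://localhost:8000/v1/chat/completions",
            "api_key_var_name": "API_KEY",
            "max_tokens": 2048,
            "temperature": 0.0,
            "retry": 5,
            "timeout": 60,
        }
    },
    "data": {
        "AI2D_TEST": {
            "class": "ImageMCQDataset",
            "dataset": "AI2D_TEST",
            "model": model_id,
        }
    },
}

# Equivalent of the `cat > ... << 'EOF'` heredoc in the command above.
with open("vlmeval_config.json", "w") as f:
    json.dump(vlmeval_config, f, indent=2)
```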
:::
:::{tab-item} Defaults
```yaml
framework_name: vlmevalkit
pkg_name: vlmevalkit
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 4
    temperature: 0.0
    request_timeout: 60
    extra:
      dataset:
        name: AI2D_TEST
        class: ImageMCQDataset
      judge:
        model: gpt-4o
        args: '{"use_azure": true}'
  supported_endpoint_types:
  - vlm
  type: ai2d_judge
target:
  api_endpoint: {}
```
:::
::::
---
(vlmevalkit-chartqa)=
## chartqa
A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
::::{tab-set}
:::{tab-item} Container
**Harness:** `vlmevalkit`
**Container:**
```
nvcr.io/nvidia/eval-factory/vlmevalkit:26.01
```
**Container Digest:**
```
sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc
```
**Container Arch:** `amd`
**Task Type:** `chartqa`
:::
:::{tab-item} Command
```bash
cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'
{
  "model": {
    "{{target.api_endpoint.model_id.split('/')[-1]}}": {
      "class": "CustomOAIEndpoint",
      "model": "{{target.api_endpoint.model_id}}",
      "api_base": "{{target.api_endpoint.url}}",
      "api_key_var_name": "{{target.api_endpoint.api_key_name}}",
      "max_tokens": {{config.params.max_new_tokens}},
      "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %}
      "top_p": {{config.params.top_p}},{% endif %}
      "retry": {{config.params.max_retries}},
      "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %},
      "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %},
      "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %},
      "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %},
      "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %},
      "verbose": {{config.params.extra.verbose}}{% endif %}
    }
  },
  "data": {
    "{{config.params.extra.dataset.name}}": {
      "class": "{{config.params.extra.dataset.class}}",
      "dataset": "{{config.params.extra.dataset.name}}",
      "model": "{{target.api_endpoint.model_id}}"
    }
  }
}
EOF
python -m vlmeval.run \
--config {{config.output_dir}}/vlmeval_config.json \
--work-dir {{config.output_dir}} \
--api-nproc {{config.params.parallelism}} \
{%- if config.params.extra.judge is defined %}
--judge {{config.params.extra.judge.model}} \
--judge-args '{{config.params.extra.judge.args}}' \
{%- endif %}
{% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: vlmevalkit
pkg_name: vlmevalkit
config:
  params:
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 4
    temperature: 0.0
    request_timeout: 60
    extra:
      dataset:
        name: ChartQA_TEST
        class: ImageVQADataset
  supported_endpoint_types:
  - vlm
  type: chartqa
target:
  api_endpoint: {}
```
:::
::::
---
(vlmevalkit-mathvista-mini)=
## mathvista-mini
Evaluating Math Reasoning in Visual Contexts
::::{tab-set}
:::{tab-item} Container
**Harness:** `vlmevalkit`
**Container:**
```
nvcr.io/nvidia/eval-factory/vlmevalkit:26.01
```
**Container Digest:**
```
sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc
```
**Container Arch:** `amd`
**Task Type:** `mathvista-mini`
:::
:::{tab-item} Command
```bash
cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'
{
  "model": {
    "{{target.api_endpoint.model_id.split('/')[-1]}}": {
      "class": "CustomOAIEndpoint",
      "model": "{{target.api_endpoint.model_id}}",
      "api_base": "{{target.api_endpoint.url}}",
      "api_key_var_name": "{{target.api_endpoint.api_key_name}}",
      "max_tokens": {{config.params.max_new_tokens}},
      "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %}
      "top_p": {{config.params.top_p}},{% endif %}
      "retry": {{config.params.max_retries}},
      "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %},
      "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %},
      "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %},
      "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %},
      "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %},
      "verbose": {{config.params.extra.verbose}}{% endif %}
    }
  },
  "data": {
    "{{config.params.extra.dataset.name}}": {
      "class": "{{config.params.extra.dataset.class}}",
      "dataset": "{{config.params.extra.dataset.name}}",
      "model": "{{target.api_endpoint.model_id}}"
    }
  }
}
EOF
python -m vlmeval.run \
--config {{config.output_dir}}/vlmeval_config.json \
--work-dir {{config.output_dir}} \
--api-nproc {{config.params.parallelism}} \
{%- if config.params.extra.judge is defined %}
--judge {{config.params.extra.judge.model}} \
--judge-args '{{config.params.extra.judge.args}}' \
{%- endif %}
{% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: vlmevalkit
pkg_name: vlmevalkit
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 4
temperature: 0.0
request_timeout: 60
extra:
dataset:
name: MathVista_MINI
class: MathVista
judge:
model: gpt-4o
args: '{"use_azure": true}'
supported_endpoint_types:
- vlm
type: mathvista-mini
target:
api_endpoint: {}
```
:::
::::
---
(vlmevalkit-mmmu-judge)=
## mmmu_judge
A benchmark for evaluating multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
::::{tab-set}
:::{tab-item} Container
**Harness:** `vlmevalkit`
**Container:**
```
nvcr.io/nvidia/eval-factory/vlmevalkit:26.01
```
**Container Digest:**
```
sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc
```
**Container Arch:** `amd`
**Task Type:** `mmmu_judge`
:::
:::{tab-item} Command
```bash
cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'
{
"model": {
"{{target.api_endpoint.model_id.split('/')[-1]}}": {
"class": "CustomOAIEndpoint",
"model": "{{target.api_endpoint.model_id}}",
"api_base": "{{target.api_endpoint.url}}",
"api_key_var_name": "{{target.api_endpoint.api_key_name}}",
"max_tokens": {{config.params.max_new_tokens}},
"temperature": {{config.params.temperature}},{% if config.params.top_p is not none %}
"top_p": {{config.params.top_p}},{% endif %}
"retry": {{config.params.max_retries}},
"timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %},
"wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %},
"img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %},
"img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %},
"system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %},
"verbose": {{config.params.extra.verbose}}{% endif %}
}
},
"data": {
"{{config.params.extra.dataset.name}}": {
"class": "{{config.params.extra.dataset.class}}",
"dataset": "{{config.params.extra.dataset.name}}",
"model": "{{target.api_endpoint.model_id}}"
}
}
}
EOF
python -m vlmeval.run \
--config {{config.output_dir}}/vlmeval_config.json \
--work-dir {{config.output_dir}} \
--api-nproc {{config.params.parallelism}} \
{%- if config.params.extra.judge is defined %}
--judge {{config.params.extra.judge.model}} \
--judge-args '{{config.params.extra.judge.args}}' \
{%- endif %}
{% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: vlmevalkit
pkg_name: vlmevalkit
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 4
temperature: 0.0
request_timeout: 60
extra:
dataset:
name: MMMU_DEV_VAL
class: MMMUDataset
judge:
model: gpt-4o
args: '{"use_azure": true}'
supported_endpoint_types:
- vlm
type: mmmu_judge
target:
api_endpoint: {}
```
:::
::::
---
(vlmevalkit-ocr-reasoning)=
## ocr_reasoning
Comprehensive benchmark of 1,069 human-annotated examples designed to evaluate multimodal large language models on text-rich image reasoning tasks by assessing both final answers and the reasoning process across six core abilities and 18 practical tasks.
::::{tab-set}
:::{tab-item} Container
**Harness:** `vlmevalkit`
**Container:**
```
nvcr.io/nvidia/eval-factory/vlmevalkit:26.01
```
**Container Digest:**
```
sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc
```
**Container Arch:** `amd`
**Task Type:** `ocr_reasoning`
:::
:::{tab-item} Command
```bash
cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'
{
"model": {
"{{target.api_endpoint.model_id.split('/')[-1]}}": {
"class": "CustomOAIEndpoint",
"model": "{{target.api_endpoint.model_id}}",
"api_base": "{{target.api_endpoint.url}}",
"api_key_var_name": "{{target.api_endpoint.api_key_name}}",
"max_tokens": {{config.params.max_new_tokens}},
"temperature": {{config.params.temperature}},{% if config.params.top_p is not none %}
"top_p": {{config.params.top_p}},{% endif %}
"retry": {{config.params.max_retries}},
"timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %},
"wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %},
"img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %},
"img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %},
"system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %},
"verbose": {{config.params.extra.verbose}}{% endif %}
}
},
"data": {
"{{config.params.extra.dataset.name}}": {
"class": "{{config.params.extra.dataset.class}}",
"dataset": "{{config.params.extra.dataset.name}}",
"model": "{{target.api_endpoint.model_id}}"
}
}
}
EOF
python -m vlmeval.run \
--config {{config.output_dir}}/vlmeval_config.json \
--work-dir {{config.output_dir}} \
--api-nproc {{config.params.parallelism}} \
{%- if config.params.extra.judge is defined %}
--judge {{config.params.extra.judge.model}} \
--judge-args '{{config.params.extra.judge.args}}' \
{%- endif %}
{% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: vlmevalkit
pkg_name: vlmevalkit
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 4
temperature: 0.0
request_timeout: 60
extra:
dataset:
name: OCR_Reasoning
class: OCR_Reasoning
judge:
model: gpt-4o
args: '{"use_azure": true}'
supported_endpoint_types:
- vlm
type: ocr_reasoning
target:
api_endpoint: {}
```
:::
::::
---
(vlmevalkit-ocrbench)=
## ocrbench
Comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models
::::{tab-set}
:::{tab-item} Container
**Harness:** `vlmevalkit`
**Container:**
```
nvcr.io/nvidia/eval-factory/vlmevalkit:26.01
```
**Container Digest:**
```
sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc
```
**Container Arch:** `amd`
**Task Type:** `ocrbench`
:::
:::{tab-item} Command
```bash
cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'
{
"model": {
"{{target.api_endpoint.model_id.split('/')[-1]}}": {
"class": "CustomOAIEndpoint",
"model": "{{target.api_endpoint.model_id}}",
"api_base": "{{target.api_endpoint.url}}",
"api_key_var_name": "{{target.api_endpoint.api_key_name}}",
"max_tokens": {{config.params.max_new_tokens}},
"temperature": {{config.params.temperature}},{% if config.params.top_p is not none %}
"top_p": {{config.params.top_p}},{% endif %}
"retry": {{config.params.max_retries}},
"timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %},
"wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %},
"img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %},
"img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %},
"system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %},
"verbose": {{config.params.extra.verbose}}{% endif %}
}
},
"data": {
"{{config.params.extra.dataset.name}}": {
"class": "{{config.params.extra.dataset.class}}",
"dataset": "{{config.params.extra.dataset.name}}",
"model": "{{target.api_endpoint.model_id}}"
}
}
}
EOF
python -m vlmeval.run \
--config {{config.output_dir}}/vlmeval_config.json \
--work-dir {{config.output_dir}} \
--api-nproc {{config.params.parallelism}} \
{%- if config.params.extra.judge is defined %}
--judge {{config.params.extra.judge.model}} \
--judge-args '{{config.params.extra.judge.args}}' \
{%- endif %}
{% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: vlmevalkit
pkg_name: vlmevalkit
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 4
temperature: 0.0
request_timeout: 60
extra:
dataset:
name: OCRBench
class: OCRBench
supported_endpoint_types:
- vlm
type: ocrbench
target:
api_endpoint: {}
```
:::
::::
---
(vlmevalkit-slidevqa)=
## slidevqa
Evaluates ability to answer questions about slide decks by selecting relevant slides from multiple images
::::{tab-set}
:::{tab-item} Container
**Harness:** `vlmevalkit`
**Container:**
```
nvcr.io/nvidia/eval-factory/vlmevalkit:26.01
```
**Container Digest:**
```
sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc
```
**Container Arch:** `amd`
**Task Type:** `slidevqa`
:::
:::{tab-item} Command
```bash
cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'
{
"model": {
"{{target.api_endpoint.model_id.split('/')[-1]}}": {
"class": "CustomOAIEndpoint",
"model": "{{target.api_endpoint.model_id}}",
"api_base": "{{target.api_endpoint.url}}",
"api_key_var_name": "{{target.api_endpoint.api_key_name}}",
"max_tokens": {{config.params.max_new_tokens}},
"temperature": {{config.params.temperature}},{% if config.params.top_p is not none %}
"top_p": {{config.params.top_p}},{% endif %}
"retry": {{config.params.max_retries}},
"timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %},
"wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %},
"img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %},
"img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %},
"system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %},
"verbose": {{config.params.extra.verbose}}{% endif %}
}
},
"data": {
"{{config.params.extra.dataset.name}}": {
"class": "{{config.params.extra.dataset.class}}",
"dataset": "{{config.params.extra.dataset.name}}",
"model": "{{target.api_endpoint.model_id}}"
}
}
}
EOF
python -m vlmeval.run \
--config {{config.output_dir}}/vlmeval_config.json \
--work-dir {{config.output_dir}} \
--api-nproc {{config.params.parallelism}} \
{%- if config.params.extra.judge is defined %}
--judge {{config.params.extra.judge.model}} \
--judge-args '{{config.params.extra.judge.args}}' \
{%- endif %}
{% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: vlmevalkit
pkg_name: vlmevalkit
config:
params:
max_new_tokens: 2048
max_retries: 5
parallelism: 4
temperature: 0.0
request_timeout: 60
extra:
dataset:
name: SLIDEVQA
class: SlideVQA
judge:
model: gpt-4o
args: '{"use_azure": true}'
supported_endpoint_types:
- vlm
type: slidevqa
target:
api_endpoint: {}
```
:::
::::
(benchmarks-full-list)=
# Available Benchmarks
```{include} all/benchmarks-table.md
```
# Benchmark Catalog
Comprehensive catalog of hundreds of benchmarks across popular evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` About Selecting Benchmarks
:link: eval-benchmarks
:link-type: ref
Browse benchmark categories and choose the ones best suited for your model and use case
:::
:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Available Benchmarks
:link: benchmarks-full-list
:link-type: ref
Detailed descriptions of all available tasks, grouped by evaluation harness.
:::
::::
:::{toctree}
:caption: Benchmark Catalog
:hidden:
About Selecting Benchmarks
Available Benchmarks
:::
(eval-custom-tasks)=
# Tasks Not Explicitly Defined in the Framework Definition File
## Introduction
NeMo Evaluator provides a unified interface and a curated set of pre-defined task configurations for launching evaluations.
These task configurations are specified in the [Framework Definition File (FDF)](../about/concepts/framework-definition-file.md) to provide a simple and standardized way of running evaluations, with minimum user-provided input required.
However, you can choose to evaluate your model on a task that was not explicitly included in the FDF.
To do so, specify your task as `"<harness_name>.<task_name>"`, where the task name comes from the underlying evaluation harness, and ensure that all of the task parameters (for example, sampling parameters and few-shot settings) are specified correctly.
Additionally, you need to determine which [endpoint type](../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) is appropriate for the task.
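Concretely, the evaluation type joins the harness name and the harness-native task name with a dot. A minimal sketch (the task identifier `polemo2_in` is taken from the harness's task files and should be verified against your installed harness version):

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams, EvaluationConfig

# Harness name and harness-native task name, joined with a dot
eval_config = EvaluationConfig(
    type="lm-evaluation-harness.polemo2_in",
    output_dir="./results",
    params=ConfigParams(limit_samples=10),  # small subset for a first run
)
```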
## Run Evaluation
In this example, we will use the [PolEmo 2.0](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/polemo2) task from LM Evaluation Harness.
This task consists of consumer reviews in Polish and assesses sentiment analysis abilities.
It requires a "completions" endpoint and defines its sampling parameters as part of the [task configuration](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/polemo2/polemo2_in.yaml) in the underlying harness.
:::{note}
Make sure to review the task configuration in the underlying harness and confirm that the sampling parameters are defined and match your preferred way of running the benchmark.
You can configure the evaluation using the `params` field in the `EvaluationConfig`.
:::
### 1. Prepare the Environment
Start `lm-evaluation-harness` Docker container:
```bash
docker run --rm -it nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }}
```
or install `nemo-evaluator` and `nvidia-lm-eval` Python package in your environment of choice:
```bash
pip install nemo-evaluator nvidia-lm-eval
```
### 2. Run the Evaluation
```{literalinclude} _snippets/polemo2.py
:language: python
:start-after: "## Run the evaluation"
```
# Add Evaluation Packages to NeMo Framework
The NeMo Framework Docker image comes with [nvidia-lm-eval](https://pypi.org/project/nvidia-lm-eval/) pre-installed.
However, you can add more evaluation methods by installing additional NeMo Evaluator packages.
For each package, follow these steps:
1. Install the required package.
2. Deploy your model:
```{literalinclude} ../get-started/_snippets/deploy.sh
:language: shell
:start-after: "## Deploy"
```
Wait for the server to start and become ready to accept requests:
```python
from nemo_evaluator.api import check_endpoint
check_endpoint(
endpoint_url="http://0.0.0.0:8080/v1/completions/",
endpoint_type="completions",
model_name="megatron_model",
)
```
Make sure to open two separate terminals within the same container for executing the deployment and evaluation.
3. (Optional) Export the required environment variables.
4. Run the evaluation of your choice.
The examples below show how to enable and launch evaluations for different packages.
:::{tip}
All examples below use only a subset of samples.
To run the evaluation on the whole dataset, remove the `limit_samples` parameter.
:::
## Enable On-Demand Evaluation Packages
:::{note}
If multiple harnesses are installed in your environment and they define a task with the same name, you must use the `<harness_name>.<task_name>` format to avoid ambiguity. For example:
```python
eval_config = EvaluationConfig(type="lm-evaluation-harness.mmlu")
eval_config = EvaluationConfig(type="simple_evals.mmlu")
```
:::
::::{tab-set}
:::{tab-item} BFCL
1. Install the [nvidia-bfcl](https://pypi.org/project/nvidia-bfcl/) package:
```bash
pip install nvidia-bfcl
```
2. Run the evaluation:
```{literalinclude} _snippets/bfcl.py
:language: python
:start-after: "## Run the evaluation"
```
:::
:::{tab-item} garak
1. Install the [nvidia-eval-factory-garak](https://pypi.org/project/nvidia-eval-factory-garak/) package:
```bash
pip install nvidia-eval-factory-garak
```
2. Run the evaluation:
```{literalinclude} _snippets/garak.py
:language: python
:start-after: "## Run the evaluation"
```
:::
:::{tab-item} BigCode
1. Install the [nvidia-bigcode-eval](https://pypi.org/project/nvidia-bigcode-eval/) package:
```bash
pip install nvidia-bigcode-eval
```
2. Run the evaluation:
```{literalinclude} _snippets/bigcode.py
:language: python
:start-after: "## Run the evaluation"
```
:::
:::{tab-item} simple-evals
1. Install the [nvidia-simple-evals](https://pypi.org/project/nvidia-simple-evals/) package:
```bash
pip install nvidia-simple-evals
```
In the example below, we use the `AIME_2025` task, which follows an LLM-as-a-judge approach to check output correctness.
By default, [Llama 3.3 70B](https://build.nvidia.com/meta/llama-3_3-70b-instruct) NVIDIA NIM is used for judging.
2. To run the evaluation, set your [build.nvidia.com](https://build.nvidia.com/) API key as the `JUDGE_API_KEY` environment variable:
```bash
export JUDGE_API_KEY=...
```
To customize the judge settings, see the instructions for the [NVIDIA Eval Factory package](https://pypi.org/project/nvidia-simple-evals/).
3. Run the evaluation:
```{literalinclude} _snippets/simple_evals.py
:language: python
:start-after: "## Run the evaluation"
```
:::
:::{tab-item} safety-harness
1. Install the [nvidia-safety-harness](https://pypi.org/project/nvidia-safety-harness/) package:
```bash
pip install nvidia-safety-harness
```
2. Deploy the judge model.
In the example below, we use the `aegis_v2` task, which requires the [Llama 3.1 NemoGuard 8B ContentSafety](https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-contentsafety/latest/getting-started.html) model to assess your model's responses.
The model is available through NVIDIA NIM.
See the [instructions](https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-contentsafety/latest/getting-started.html) on deploying the judge model.
If you set up a gated judge endpoint, you must export your API key as the `JUDGE_API_KEY` variable:
```bash
export JUDGE_API_KEY=...
```
3. To access the evaluation dataset, you must authenticate with the [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/quick-start#authentication).
4. Run the evaluation:
```{literalinclude} _snippets/safety.py
:language: python
:start-after: "## Run the evaluation"
```
Make sure to modify the judge configuration in the provided snippet to match your Llama 3.1 NemoGuard 8B ContentSafety endpoint.
:::
::::
---
orphan: true
---
(eval-parameters)=
# Evaluation Configuration Parameters
Comprehensive reference for configuring evaluation tasks in {{ product_name_short }}, covering universal parameters, framework-specific settings, and optimization patterns.
:::{admonition} Quick Navigation
:class: info
**Looking for task-specific guides?**
- {ref}`text-gen` - Text generation evaluation
- {ref}`logprobs` - Log-probability evaluation
- {ref}`code-generation` - Code generation evaluation
**Looking for available benchmarks?**
- {ref}`eval-benchmarks` - Browse available benchmarks by category
**Need help getting started?**
- {ref}`evaluation-overview` - Overview of evaluation workflows
- {ref}`eval-run` - Step-by-step evaluation guides
:::
## Overview
All evaluation tasks in {{ product_name_short }} use the `ConfigParams` class for configuration. This provides a consistent interface across different evaluation harnesses while allowing framework-specific customization through the `extra` parameter.
```python
from nemo_evaluator.api.api_dataclasses import ConfigParams
# Basic configuration
params = ConfigParams(
temperature=0,
top_p=1.0,
max_new_tokens=256,
limit_samples=100
)
# Advanced configuration with framework-specific parameters
params = ConfigParams(
temperature=0,
parallelism=8,
extra={
"num_fewshot": 5,
"tokenizer": "/path/to/tokenizer",
"custom_prompt": "Answer the question:"
}
)
```
## Universal Parameters
These parameters are available for all evaluation tasks regardless of the underlying harness or benchmark.
### Core Generation Parameters
```{list-table}
:header-rows: 1
:widths: 15 10 30 25 20
* - Parameter
- Type
- Description
- Example Values
- Notes
* - `temperature`
- `float`
- Sampling randomness
- `0` (deterministic), `0.7` (creative)
- Use `0` for reproducible results
* - `top_p`
- `float`
- Nucleus sampling threshold
- `1.0` (disabled), `0.9` (selective)
- Controls diversity of generated text
* - `max_new_tokens`
- `int`
- Maximum response length
- `256`, `512`, `1024`
- Limits generation length
```
### Evaluation Control Parameters
```{list-table}
:header-rows: 1
:widths: 15 10 30 25 20
* - Parameter
- Type
- Description
- Example Values
- Notes
* - `limit_samples`
- `int/float`
- Evaluation subset size
- `100` (count), `0.1` (10% of dataset)
- Use for quick testing or resource limits
* - `task`
- `str`
- Task-specific identifier
- `"custom_task"`
- Used by some harnesses for task routing
```
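The dual int/float semantics of `limit_samples` can be illustrated with a small helper (a sketch of the interpretation described in the table above, not the SDK's internal code):

```python
def resolve_limit(limit_samples, dataset_size):
    """Interpret limit_samples as an absolute count (int) or a fraction (float)."""
    if limit_samples is None:
        return dataset_size  # no limit: evaluate the full dataset
    if isinstance(limit_samples, float):
        return int(dataset_size * limit_samples)  # e.g. 0.1 -> 10% of the dataset
    return min(limit_samples, dataset_size)  # absolute sample count

print(resolve_limit(100, 14042))   # 100 samples
print(resolve_limit(0.1, 14042))   # 1404 samples (10%)
print(resolve_limit(None, 14042))  # 14042 samples (full dataset)
```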
### Performance Parameters
```{list-table}
:header-rows: 1
:widths: 15 10 30 25 20
* - Parameter
- Type
- Description
- Example Values
- Notes
* - `parallelism`
- `int`
- Concurrent request threads
- `1`, `8`, `16`
- Balance against server capacity
* - `max_retries`
- `int`
- Retry attempts for failed requests
- `3`, `5`, `10`
- Increases robustness for network issues
* - `request_timeout`
- `int`
- Request timeout (seconds)
- `60`, `120`, `300`
- Adjust for model response time
```
## Framework-Specific Parameters
Framework-specific parameters are passed through the `extra` dictionary within `ConfigParams`.
::::{dropdown} LM-Evaluation-Harness Parameters
:icon: code-square
```{list-table}
:header-rows: 1
:widths: 15 10 30 25 20
* - Parameter
- Type
- Description
- Example Values
- Use Cases
* - `num_fewshot`
- `int`
- Few-shot examples count
- `0`, `5`, `25`
- Academic benchmarks
* - `tokenizer`
- `str`
- Tokenizer path
- `"/path/to/tokenizer"`
- Log-probability tasks
* - `tokenizer_backend`
- `str`
- Tokenizer implementation
- `"huggingface"`, `"sentencepiece"`
- Custom tokenizer setups
* - `trust_remote_code`
- `bool`
- Allow remote code execution
- `True`, `False`
- For custom tokenizers
* - `add_bos_token`
- `bool`
- Add beginning-of-sequence token
- `True`, `False`
- Model-specific formatting
* - `add_eos_token`
- `bool`
- Add end-of-sequence token
- `True`, `False`
- Model-specific formatting
* - `fewshot_delimiter`
- `str`
- Separator between examples
- `"\\n\\n"`, `"\\n---\\n"`
- Custom prompt formatting
* - `fewshot_seed`
- `int`
- Reproducible example selection
- `42`, `1337`
- Ensures consistent few-shot examples
* - `description`
- `str`
- Custom prompt prefix
- `"Answer the question:"`
- Task-specific instructions
* - `bootstrap_iters`
- `int`
- Statistical bootstrap iterations
- `1000`, `10000`
- For confidence intervals
```
::::
::::{dropdown} Simple-Evals Parameters
:icon: code-square
```{list-table}
:header-rows: 1
:widths: 15 10 30 25 20
* - Parameter
- Type
- Description
- Example Values
- Use Cases
* - `pass_at_k`
- `list[int]`
- Code evaluation metrics
- `[1, 5, 10]`
- Code generation tasks
* - `timeout`
- `int`
- Code execution timeout
- `5`, `10`, `30`
- Code generation tasks
* - `max_workers`
- `int`
- Parallel execution workers
- `4`, `8`, `16`
- Code execution parallelism
* - `languages`
- `list[str]`
- Target programming languages
- `["python", "java", "cpp"]`
- Multi-language evaluation
```
::::
::::{dropdown} BigCode-Evaluation-Harness Parameters
:icon: code-square
```{list-table}
:header-rows: 1
:widths: 15 10 30 25 20
* - Parameter
- Type
- Description
- Example Values
- Use Cases
* - `num_workers`
- `int`
- Parallel execution workers
- `4`, `8`, `16`
- Code execution parallelism
* - `eval_metric`
- `str`
- Evaluation metric
- `"pass_at_k"`, `"bleu"`
- Different scoring methods
* - `languages`
- `list[str]`
- Programming languages
- `["python", "javascript"]`
- Language-specific evaluation
```
::::
::::{dropdown} Safety and Specialized Harnesses
:icon: code-square
```{list-table}
:header-rows: 1
:widths: 15 10 30 25 20
* - Parameter
- Type
- Description
- Example Values
- Use Cases
* - `probes`
- `str`
- Garak security probes
- `"ansiescape.AnsiEscaped"`
- Security evaluation
* - `detectors`
- `str`
- Garak security detectors
- `"base.TriggerListDetector"`
- Security evaluation
* - `generations`
- `int`
- Number of generations per prompt
- `1`, `5`, `10`
- Safety evaluation
```
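For example, a garak security scan might be configured as follows (a sketch; the probe and detector names must match those available in your installed garak version):

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

safety_params = ConfigParams(
    limit_samples=20,  # small subset for a first pass
    extra={
        "probes": "ansiescape.AnsiEscaped",       # security probe to run
        "detectors": "base.TriggerListDetector",  # detector scoring the outputs
        "generations": 5,                         # generations per prompt
    },
)
```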
::::
## Configuration Patterns
::::{dropdown} Academic Benchmarks (Deterministic)
:icon: code-square
```python
academic_params = ConfigParams(
temperature=0.01, # Near-deterministic generation (0.0 not supported by all endpoints)
top_p=1.0, # No nucleus sampling
max_new_tokens=256, # Moderate response length
limit_samples=None, # Full dataset evaluation
parallelism=4, # Conservative parallelism
extra={
"num_fewshot": 5, # Standard few-shot count
"fewshot_seed": 42 # Reproducible examples
}
)
```
::::
::::{dropdown} Creative Tasks (Controlled Randomness)
:icon: code-square
```python
creative_params = ConfigParams(
temperature=0.7, # Moderate creativity
top_p=0.9, # Nucleus sampling
max_new_tokens=512, # Longer responses
extra={
"repetition_penalty": 1.1, # Reduce repetition
"do_sample": True # Enable sampling
}
)
```
::::
::::{dropdown} Code Generation (Balanced)
:icon: code-square
```python
code_params = ConfigParams(
temperature=0.2, # Slight randomness for diversity
top_p=0.95, # Selective sampling
max_new_tokens=1024, # Sufficient for code solutions
extra={
"pass_at_k": [1, 5, 10], # Multiple success metrics
"timeout": 10, # Code execution timeout
"stop_sequences": ["```", "\\n\\n"] # Code block terminators
}
)
```
::::
::::{dropdown} Log-Probability Tasks
:icon: code-square
```python
logprob_params = ConfigParams(
# No generation parameters needed for log-probability tasks
limit_samples=100, # Quick testing
extra={
"tokenizer_backend": "huggingface",
"tokenizer": "/path/to/nemo_tokenizer",
"trust_remote_code": True
}
)
```
::::
::::{dropdown} High-Throughput Evaluation
:icon: code-square
```python
performance_params = ConfigParams(
temperature=0.01, # Near-deterministic for speed
parallelism=16, # High concurrency
max_retries=5, # Robust retry policy
request_timeout=120, # Generous timeout
limit_samples=0.1, # 10% sample for testing
extra={
"batch_size": 8, # Batch requests if supported
"cache_requests": True # Enable caching
}
)
```
::::
## Parameter Selection Guidelines
### By Evaluation Type
**Text Generation Tasks**:
- Use `temperature=0.01` for near-deterministic, reproducible results (most endpoints don't support exactly 0.0)
- Set appropriate `max_new_tokens` based on expected response length
- Configure `parallelism` based on server capacity
**Log-Probability Tasks**:
- Always specify `tokenizer` and `tokenizer_backend` in `extra`
- Generation parameters (temperature, top_p) are not used
- Focus on tokenizer configuration accuracy
**Code Generation Tasks**:
- Use moderate `temperature` (0.1-0.3) for diversity without excessive randomness
- Set higher `max_new_tokens` (1024+) for complete solutions
- Configure `timeout` and `pass_at_k` in `extra`
**Safety Evaluation**:
- Use appropriate `probes` and `detectors` in `extra`
- Consider multiple `generations` per prompt
- Use chat endpoints for instruction-following safety tests
### By Resource Constraints
**Limited Compute**:
- Reduce `parallelism` to 1-4
- Use `limit_samples` for subset evaluation
- Increase `request_timeout` for slower responses
**High-Performance Clusters**:
- Increase `parallelism` to 16-32
- Enable request batching in `extra` if supported
- Use full dataset evaluation (`limit_samples=None`)
**Development/Testing**:
- Use a `limit_samples` value between 10 and 100 for quick validation
- Set `temperature=0.01` for consistent results
- Enable verbose logging in `extra` if available
## Common Configuration Errors
### Tokenizer Issues
:::{admonition} Problem
:class: error
Missing tokenizer for log-probability tasks
```python
# Incorrect - missing tokenizer
params = ConfigParams(extra={})
```
:::
:::{admonition} Solution
:class: tip
Always specify tokenizer for log-probability tasks
```python
# Correct
params = ConfigParams(
extra={
"tokenizer_backend": "huggingface",
"tokenizer": "/path/to/nemo_tokenizer"
}
)
```
:::
### Performance Issues
:::{admonition} Problem
:class: error
Excessive parallelism overwhelming server
```python
# Incorrect - too many concurrent requests
params = ConfigParams(parallelism=100)
```
:::
:::{admonition} Solution
:class: tip
Start conservative and scale up
```python
# Correct - reasonable concurrency
params = ConfigParams(parallelism=8, max_retries=3)
```
:::
### Parameter Conflicts
:::{admonition} Problem
:class: error
Mixing generation and log-probability parameters
```python
# Incorrect - generation params unused for log-probability
params = ConfigParams(
temperature=0.7, # Ignored for log-probability tasks
extra={"tokenizer": "/path"}
)
```
:::
:::{admonition} Solution
:class: tip
Use appropriate parameters for task type
```python
# Correct - only relevant parameters
params = ConfigParams(
limit_samples=100, # Relevant for all tasks
extra={"tokenizer": "/path"} # Required for log-probability
)
```
:::
## Best Practices
### Development Workflow
1. **Start Small**: Use `limit_samples=10` for initial validation
2. **Test Configuration**: Verify parameters work before full evaluation
3. **Monitor Resources**: Check memory and compute usage during evaluation
4. **Document Settings**: Record successful configurations for reproducibility
### Production Evaluation
1. **Deterministic Settings**: Use `temperature=0.01` for consistent results
2. **Full Datasets**: Remove `limit_samples` for complete evaluation
3. **Robust Configuration**: Set appropriate retries and timeouts
4. **Resource Planning**: Scale `parallelism` based on available infrastructure
### Parameter Tuning
1. **Task-Appropriate**: Match parameters to evaluation methodology
2. **Incremental Changes**: Adjust one parameter at a time
3. **Baseline Comparison**: Compare against known good configurations
4. **Performance Monitoring**: Track evaluation speed and resource usage
## Next Steps
- **Basic Usage**: See {ref}`text-gen` for getting started
- **Custom Tasks**: Learn {ref}`eval-custom-tasks` for specialized evaluations
- **Troubleshooting**: Refer to {ref}`troubleshooting-index` for common issues
- **Benchmarks**: Browse {ref}`eval-benchmarks` for task-specific recommendations
---
orphan: true
---
(code-generation)=
# Code Generation Evaluation
Evaluate programming capabilities through code generation, completion, and algorithmic problem solving using the BigCode evaluation harness.
## Overview
Code generation evaluation assesses a model's ability to:
- **Generate Code**: Write complete functions from natural language descriptions
- **Code Completion**: Fill in missing code segments
- **Algorithm Implementation**: Solve programming challenges and competitive programming problems
## Before You Start
Ensure you have:
- **Model Endpoint**: An OpenAI-compatible endpoint for your model
- **API Access**: Valid API key for your model endpoint
- **Sufficient Context**: Models with adequate context length for code problems
### Pre-Flight Check
Verify your setup before running code evaluation: {ref}`deployment-testing-compatibility`.
## Choose Your Approach
::::{tab-set}
:::{tab-item} NeMo Evaluator Launcher
:sync: launcher
**Recommended** - The fastest way to run code generation evaluations with unified CLI:
```bash
# List available code generation tasks
nemo-evaluator-launcher ls tasks | grep -E "(mbpp|humaneval)"

# Run MBPP evaluation
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o 'evaluation.tasks=["mbpp"]' \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
  -o target.api_endpoint.api_key=${YOUR_API_KEY}

# Run multiple code generation benchmarks
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o 'evaluation.tasks=["mbpp", "humaneval"]'
```
:::
:::{tab-item} Core API
:sync: api
For programmatic evaluation in custom workflows:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)
# Configure code generation evaluation
eval_config = EvaluationConfig(
    type="mbpp",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,     # Remove for full dataset
        temperature=0.2,      # Low temperature for consistent code
        max_new_tokens=1024,  # Sufficient tokens for complete functions
        top_p=0.9
    )
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.2-3b-instruct",
        type=EndpointType.CHAT,
        api_key="your_api_key"
    )
)

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")
```
:::
:::{tab-item} Containers Directly
:sync: containers
For specialized container workflows:
```bash
# Pull and run BigCode evaluation container
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} bash
# Inside container - set environment
export MY_API_KEY=your_api_key_here
# Run code generation evaluation
nemo-evaluator run_eval \
  --eval_type mbpp \
  --model_id meta/llama-3.2-3b-instruct \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_type chat \
  --api_key_name MY_API_KEY \
  --output_dir /tmp/results \
  --overrides 'config.params.limit_samples=10,config.params.temperature=0.2'
```
:::
::::
## Container Access
The BigCode evaluation harness is available through Docker containers. No separate package installation is required:
```bash
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }}
```
## Discovering Available Tasks
Use the launcher CLI to discover all available code generation tasks:
```bash
# List all available benchmarks
nemo-evaluator-launcher ls tasks
# Filter for code generation tasks
nemo-evaluator-launcher ls tasks | grep -E "(mbpp|humaneval)"
```
## Available Tasks
The BigCode harness provides these programming benchmarks:
```{list-table}
:header-rows: 1
:widths: 20 40 20 20
* - Task
- Description
- Language
- Endpoint Type
* - `mbpp`
- Mostly Basic Programming Problems
- Python
- chat
* - `mbppplus`
- Extended MBPP with additional test cases
- Python
- chat
* - `humaneval`
- Hand-written programming problems
- Python
- completions
```
## Basic Code Generation Evaluation
The Mostly Basic Programming Problems (MBPP) benchmark tests fundamental programming skills. Use any of the three approaches above to run MBPP evaluations.
### Understanding Results
Code generation evaluations typically report pass@k metrics that indicate what percentage of problems were solved correctly within k attempts.
## Advanced Configuration
::::{dropdown} Custom Evaluation Parameters
:icon: code-square
```python
# Advanced configuration for code generation
eval_params = ConfigParams(
    limit_samples=100,   # Evaluate on a subset for testing
    parallelism=4,       # Concurrent evaluation requests
    temperature=0.2,     # Low temperature for consistent code
    max_new_tokens=1024  # Sufficient tokens for complete functions
)

eval_config = EvaluationConfig(
    type="mbpp",
    output_dir="/results/mbpp_advanced/",
    params=eval_params
)
```
::::
::::{dropdown} Multiple Task Evaluation
:icon: code-square
Evaluate across different code generation benchmarks:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)
# Configure target endpoint (reused for all tasks)
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.2-3b-instruct",
        type=EndpointType.CHAT,
        api_key="your_api_key"
    )
)

code_tasks = ["mbpp", "mbppplus"]
results = {}

for task in code_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(
            limit_samples=50,
            temperature=0.1,
            parallelism=2
        )
    )
    results[task] = evaluate(
        eval_cfg=eval_config,
        target_cfg=target_config
    )
```
::::
## Understanding Metrics
### Pass@k Interpretation
Code generation evaluations typically report pass@k metrics:
- **Pass@1**: Percentage of problems solved on the first attempt
- **Pass@k**: Percentage of problems solved in k attempts (if multiple samples are generated)
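When n samples per problem are available, pass@k is usually computed with the standard unbiased estimator rather than by literally resampling. A minimal sketch of that general formula (not a NeMo Evaluator API):

```python
# Unbiased pass@k estimator: the probability that at least one of k
# sampled completions passes, given n samples of which c passed.
# pass@k = 1 - C(n - c, k) / C(n, k)
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # any k-subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 passed
print(pass_at_k(10, 3, 1))  # 1 - C(7,1)/C(10,1) = 0.3
```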
---
orphan: true
---
(function-calling)=
# Function Calling Evaluation
Assess tool use capabilities, API calling accuracy, and structured output generation for agent-like behaviors using the Berkeley Function Calling Leaderboard (BFCL).
## Overview
Function calling evaluation measures a model's ability to:
- **Tool Discovery**: Identify appropriate functions for given tasks
- **Parameter Extraction**: Extract correct parameters from natural language
- **API Integration**: Generate proper function calls and handle responses
- **Multi-Step Reasoning**: Chain function calls for complex workflows
- **Error Handling**: Manage invalid parameters and API failures
## Before You Start
Ensure you have:
- **Chat Model Endpoint**: Function calling requires chat-formatted OpenAI-compatible endpoints
- **API Access**: Valid API key for your model endpoint
- **Structured Output Support**: Model capable of generating JSON/function call formats
---
## Choose Your Approach
::::{tab-set}
:::{tab-item} NeMo Evaluator Launcher
:sync: launcher
**Recommended** - The fastest way to run function calling evaluations with unified CLI:
```bash
# List available function calling tasks
nemo-evaluator-launcher ls tasks | grep -E "(bfcl|function)"
# Run BFCL AST prompting evaluation
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o 'evaluation.tasks=["bfclv3_ast_prompting"]' \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
  -o target.api_endpoint.api_key=${YOUR_API_KEY}
```
:::
:::{tab-item} Core API
:sync: api
For programmatic evaluation in custom workflows:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    ConfigParams,
    EndpointType
)
# Configure function calling evaluation
eval_config = EvaluationConfig(
    type="bfclv3_ast_prompting",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,    # Remove for full dataset
        temperature=0.1,     # Low temperature for precise function calls
        max_new_tokens=512,  # Adequate for function call generation
        top_p=0.9
    )
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.2-3b-instruct",
        type=EndpointType.CHAT,
        api_key="your_api_key"
    )
)

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")
```
:::
:::{tab-item} Containers Directly
:sync: containers
For specialized container workflows:
```bash
# Pull and run BFCL evaluation container
docker run --rm -it nvcr.io/nvidia/eval-factory/bfcl:{{ docker_compose_latest }} bash
# Inside container - set environment
export MY_API_KEY=your_api_key_here
# Run function calling evaluation
nemo-evaluator run_eval \
  --eval_type bfclv3_ast_prompting \
  --model_id meta/llama-3.2-3b-instruct \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_type chat \
  --api_key_name MY_API_KEY \
  --output_dir /tmp/results \
  --overrides 'config.params.limit_samples=10,config.params.temperature=0.1'
```
:::
::::
## Installation
Install the BFCL evaluation package for local development:
```bash
pip install nvidia-bfcl==25.7.1
```
## Discovering Available Tasks
Use the launcher CLI to discover all available function calling tasks:
```bash
# List all available benchmarks
nemo-evaluator-launcher ls tasks
# Filter for function calling tasks
nemo-evaluator-launcher ls tasks | grep -E "(bfcl|function)"
```
## Available Function Calling Tasks
BFCL provides comprehensive function calling benchmarks:
| Task | Description | Complexity | Format |
|------|-------------|------------|---------|
| `bfclv3_ast_prompting` | AST-based function calling with structured output | Intermediate | Structured |
| `bfclv2_ast_prompting` | BFCL v2 AST-based function calling (legacy) | Intermediate | Structured |
## Basic Function Calling Evaluation
The primary BFCL tasks use AST-based evaluation of structured function calls. Use any of the three approaches above to run BFCL evaluations.
### Understanding Function Calling Format
BFCL evaluates models on their ability to generate proper function calls:
**Input Example**:
```text
What's the weather like in San Francisco and New York?
Available functions:
- get_weather(city: str, units: str = "celsius") -> dict
```
**Expected Output**:
```json
[
{"name": "get_weather", "arguments": {"city": "San Francisco"}},
{"name": "get_weather", "arguments": {"city": "New York"}}
]
```
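As an illustration of the kind of check such an evaluation performs, the sketch below validates generated calls against the declared signature. The `valid_call` helper and `AVAILABLE` registry are hypothetical, not part of BFCL:

```python
# Illustrative check (not part of BFCL): verify that each generated call
# references a known function and supplies its required arguments.
import json

AVAILABLE = {"get_weather": {"required": {"city"}, "optional": {"units"}}}

def valid_call(call: dict) -> bool:
    spec = AVAILABLE.get(call.get("name"))
    if spec is None:
        return False  # unknown function
    args = set(call.get("arguments", {}))
    allowed = spec["required"] | spec["optional"]
    # all required args present, no unexpected args
    return spec["required"] <= args and args <= allowed

output = json.loads(
    '[{"name": "get_weather", "arguments": {"city": "San Francisco"}},'
    ' {"name": "get_weather", "arguments": {"city": "New York"}}]'
)
assert all(valid_call(c) for c in output)
```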
## Advanced Configuration
### Custom Evaluation Parameters
```python
# Optimized settings for function calling
eval_params = ConfigParams(
    limit_samples=100,
    parallelism=2,       # Conservative for complex reasoning
    temperature=0.1,     # Low temperature for precise function calls
    max_new_tokens=512,  # Adequate for function call generation
    top_p=0.9            # Focused sampling for accuracy
)

eval_config = EvaluationConfig(
    type="bfclv3_ast_prompting",
    output_dir="/results/bfcl_optimized/",
    params=eval_params
)
```
### Multi-Task Function Calling Evaluation
Evaluate multiple BFCL versions:
```python
function_calling_tasks = [
    "bfclv2_ast_prompting",  # BFCL v2
    "bfclv3_ast_prompting"   # BFCL v3 (latest)
]
results = {}

for task in function_calling_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"/results/{task}/",
        params=ConfigParams(
            limit_samples=50,
            temperature=0.0,  # Deterministic for consistency
            parallelism=1     # Sequential for complex reasoning
        )
    )
    result = evaluate(
        target_cfg=target_config,
        eval_cfg=eval_config
    )
    results[task] = result
    # Access metrics from the EvaluationResult object
    print(f"Completed {task} evaluation")
    print(f"Results: {result}")
```
## Understanding Metrics
### Results Structure
The `evaluate()` function returns an `EvaluationResult` object containing task-level and metric-level results:
```python
from nemo_evaluator.core.evaluate import evaluate
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
# Access task results
if result.tasks:
    for task_name, task_result in result.tasks.items():
        print(f"Task: {task_name}")
        for metric_name, metric_result in task_result.metrics.items():
            for score_name, score in metric_result.scores.items():
                print(f"  {metric_name}.{score_name}: {score.value}")
```
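The nested task → metric → score mapping can also be flattened for quick tabulation. This is an illustrative helper with placeholder numbers, not real BFCL output:

```python
# Illustrative helper: flatten a nested task -> metric -> score mapping
# (the same shape traversed in the EvaluationResult example) into a flat
# dict for easy tabulation. The numbers are placeholders.
def flatten_scores(tasks: dict) -> dict:
    flat = {}
    for task, metrics in tasks.items():
        for metric, scores in metrics.items():
            for score, value in scores.items():
                flat[f"{task}/{metric}.{score}"] = value
    return flat

tasks = {"bfclv3_ast_prompting": {"accuracy": {"ast_summary": 0.85}}}
print(flatten_scores(tasks))  # {'bfclv3_ast_prompting/accuracy.ast_summary': 0.85}
```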
### Interpreting BFCL Scores
BFCL evaluations measure function calling accuracy across various dimensions. The specific metrics depend on the BFCL version and configuration. Check the `results.yml` output file for detailed metric breakdowns.
---
*For more function calling tasks and advanced configurations, see the [BFCL package documentation](https://pypi.org/project/nvidia-bfcl/).*
(eval-run)=
# Evaluation Techniques
Follow step-by-step guides for different evaluation scenarios and methodologies in NeMo Evaluator.
## Before You Start
Ensure you have:
1. Completed the initial getting started guides for {ref}`gs-install` and {ref}`gs-quickstart`.
2. Prepared your endpoint and API key, or the checkpoint you wish to deploy.
3. Prepared your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) for accessing gated datasets.
## Evaluations
Select an evaluation type tailored to your model capabilities.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Text Generation
:link: text-gen
:link-type: ref
Measure model performance through natural language generation for academic benchmarks, reasoning tasks, and general knowledge assessment.
:::
:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Log-Probability
:link: logprobs
:link-type: ref
Assess model confidence and uncertainty using log-probabilities for multiple-choice scenarios without text generation.
:::
:::{grid-item-card} {octicon}`comment;1.5em;sd-mr-1` Reasoning
:link: run-eval-reasoning
:link-type: ref
Control the thinking budget and post-process responses to extract the reasoning content and the final answer.
:::
::::
:::{toctree}
:hidden:
Text Generation
Log Probability
Reasoning
:::
(text-gen)=
# Text Generation Evaluation
Text generation evaluation is the primary method for assessing LLM capabilities where models produce natural language responses to prompts. This approach evaluates the quality, accuracy, and appropriateness of generated text across various tasks and domains.
:::{tip}
In the example below we use the `gpqa_diamond` benchmark, but the instructions provided apply to all text generation tasks, such as:
- `mmlu`
- `mmlu_pro`
- `ifeval`
- `gsm8k`
- `mgsm`
- `mbpp`
:::
## Before You Start
Ensure you have:
- **Model Endpoint**: An OpenAI-compatible API endpoint for your model (completions or chat). See {ref}`deployment-testing-compatibility` for snippets you can use to test your endpoint.
- **API Access**: Valid API key if your endpoint requires authentication
- **Installed Packages**: NeMo Evaluator or access to evaluation containers
## Evaluation Approach
In text generation evaluation:
1. **Prompt Construction**: Models receive carefully crafted prompts (questions, instructions, or text to continue)
2. **Response Generation**: Models generate natural language responses using their trained parameters
3. **Response Assessment**: Generated text is evaluated for correctness, quality, or adherence to specific criteria
4. **Metric Calculation**: Numerical scores are computed based on evaluation criteria
This differs from **log-probability evaluation** where models assign confidence scores to predefined choices.
For log-probability methods, see {ref}`logprobs`.
## Use NeMo Evaluator Launcher
Use an example config for evaluating the [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model:
```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_basic.yaml
:language: yaml
:start-after: "[docs-start-snippet]"
```
To launch the evaluation, run:
```bash
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token-here # GPQA is a gated dataset
export NGC_API_KEY=nvapi-your-token-here # API Key with access to build.nvidia.com
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
```
## Use NeMo Evaluator
Start `simple-evals` docker container:
```bash
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
```
or install `nemo-evaluator` and `nvidia-simple-evals` Python package in your environment of choice:
```bash
pip install nemo-evaluator nvidia-simple-evals
```
### Run with CLI
```bash
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token-here # GPQA is a gated dataset
export NGC_API_KEY=nvapi-your-token-here # API Key with access to build.nvidia.com
# Run evaluation
nemo-evaluator run_eval \
  --eval_type gpqa_diamond \
  --model_id meta/llama-3.2-3b-instruct \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_type chat \
  --api_key_name NGC_API_KEY \
  --output_dir ./llama_3_1_8b_instruct_results
```
### Run with Python API
```python
# set env variables before entering Python:
# export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token-here # GPQA is a gated dataset
# export NGC_API_KEY=nvapi-your-token-here # API Key with access to build.nvidia.com
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)
# Configure target endpoint
api_endpoint = ApiEndpoint(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    type=EndpointType.CHAT,
    model_id="meta/llama-3.2-3b-instruct",
    api_key="NGC_API_KEY"  # name of the variable storing the key
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation task
config = EvaluationConfig(
    type="gpqa_diamond",
    output_dir="./llama_3_1_8b_instruct_results"
)

# Execute evaluation
results = evaluate(target_cfg=target, eval_cfg=config)
```
(logprobs)=
# Evaluate LLMs Using Log-Probabilities
## Introduction
While the most typical approach to LLM evaluation involves assessing the quality of a model's generated response to a question, an alternative method uses **log-probabilities**.
In this approach, we quantify a model's "surprise" or uncertainty when processing a text sequence.
This is done by calculating the sum of log-probabilities that the model assigns to each token.
A higher sum indicates the model is more confident about the sequence.
In this evaluation approach:
* The LLM is given a single combined text containing both the question and a potential answer.
* Next, the sum of log-probabilities is calculated only for the tokens that belong to the answer.
* This allows an assessment of how likely it is that the model would provide that answer for the given question.
For multiple-choice scenarios, the answer with the highest sum is treated as the one selected by the model.
The sum of log-probabilities can be used to calculate different metrics, such as **perplexity**.
Log-probabilities can also be analyzed to determine whether the model would produce a given response under greedy decoding, an approach commonly used to compute **accuracy**.
Using log-probabilities is especially useful for evaluating base (pre-trained) models, as it eliminates the need for complex instruction-following and does not require the model to adhere to a specific output format.
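The selection rule and the perplexity metric described above can be sketched in a few lines, using made-up per-token log-probabilities:

```python
# Sketch of the selection rule above: sum the log-probabilities assigned to
# the answer tokens of each candidate and pick the highest sum. The numbers
# below are made up for illustration.
import math

candidates = {
    "answer_a": [-0.2, -0.5, -0.1],  # sum = -0.8 (model is most confident)
    "answer_b": [-1.0, -2.3, -0.7],  # sum = -4.0
}

sums = {name: sum(lps) for name, lps in candidates.items()}
selected = max(sums, key=sums.get)
print(selected)  # answer_a

# Perplexity of a sequence: exp of the negative mean token log-probability
def perplexity(logprobs):
    return math.exp(-sum(logprobs) / len(logprobs))

print(round(perplexity(candidates["answer_a"]), 3))
```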
:::{tip}
In the example below we use the `piqa` benchmark, but the instructions provided apply to all `lm-evaluation-harness` tasks utilizing log-probabilities, such as:
- arc_challenge
- arc_multilingual
- bbh
- commonsense_qa
- hellaswag
- hellaswag_multilingual
- musr
- openbookqa
- social_iqa
- truthfulqa
- winogrande
:::
## Before You Start
Ensure you have:
- **Completions Endpoint**: Log-probability tasks require a completions endpoint (not chat) that supports the `logprobs` and `echo` parameters (see {ref}`compatibility-log-probs`)
- **Model Tokenizer**: Access to tokenizer files for client-side tokenization (supported types: `huggingface` or `tiktoken`)
- **API Access**: Valid API key for your model endpoint if it is gated
- **Authentication**: Hugging Face token for gated datasets and tokenizers
## Use NeMo Evaluator Launcher
Use an example config for deploying and evaluating the [Meta Llama 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B) model:
```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_vllm_logprobs.yaml
:language: yaml
:start-after: "[docs-start-snippet]"
```
To launch the evaluation, run:
```bash
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_vllm_logprobs.yaml
```
:::{tip}
Set `deployment: none` and provide a `target` specification if you want to evaluate an existing endpoint instead:
```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_local
  env_vars:
    HF_TOKEN: ${oc.env:HF_TOKEN} # needed to access the gated meta-llama/Llama-3.1-8B model

target:
  api_endpoint:
    model_id: meta-llama/Llama-3.1-8B
    url: https://your-endpoint.com/v1/completions
    api_key_name: API_KEY # API key with access to the provided url

# specify the benchmarks to evaluate
evaluation:
  nemo_evaluator_config: # global config settings that apply to all tasks
    config:
      params:
        extra: # for log-probability tasks like piqa, you need to specify the tokenizer
          tokenizer: meta-llama/Llama-3.1-8B # or a path to a locally stored checkpoint
          tokenizer_backend: huggingface # or "tiktoken"
  tasks:
    - name: piqa
```
:::
## Use NeMo Evaluator
Start `lm-evaluation-harness` docker container:
```bash
docker run --rm -it nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }}
```
or install `nemo-evaluator` and `nvidia-lm-eval` Python package in your environment of choice:
```bash
pip install nemo-evaluator nvidia-lm-eval
```
To launch the evaluation, run the following Python code:
```{literalinclude} ../_snippets/piqa_hf.py
:language: python
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
Make sure to provide the source for the tokenizer and a backend for loading it.
For models trained with the NeMo Framework, the tokenizer is stored inside the checkpoint directory.
For the NeMo format, it is available in the `context/nemo_tokenizer` subdirectory:
```python
extra={
    "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer",
    "tokenizer_backend": "huggingface",
},
```
For Megatron Bridge checkpoints, the tokenizer is stored under the `tokenizer` subdirectory:
```python
extra={
    "tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer",
    "tokenizer_backend": "huggingface",
},
```
## How it works
When the server receives a `logprobs` parameter in the request, it returns the log-probabilities of tokens.
When combined with `echo=true`, the model includes the input in its response, along with the corresponding log-probabilities.
The received response is then processed on the client (benchmark) side to isolate the log-probabilities corresponding specifically to the answer portion of the input.
For this purpose the input is tokenized, which makes it possible to trace which log-probabilities originated from the question and which from the answer.
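The client-side step can be sketched as follows. A real harness uses the model's tokenizer (Hugging Face or tiktoken); here a whitespace split and made-up log-probabilities stand in so the example is self-contained:

```python
# Sketch of the client-side step above: with echo=true the endpoint returns
# one log-probability per input token; tokenizing the question and answer
# separately tells us how many trailing log-probs belong to the answer.
def tokenize(text):  # stand-in for a real Hugging Face / tiktoken tokenizer
    return text.split()

question = "The capital of France is"
answer = "Paris"

# Pretend these came back from a completions call with logprobs + echo=true,
# one value per token of the echoed input (question + answer)
echoed_logprobs = [-2.1, -0.3, -0.9, -1.2, -0.4, -0.05]
assert len(tokenize(f"{question} {answer}")) == len(echoed_logprobs)

# Only the trailing answer tokens contribute to the answer's score
n_answer_tokens = len(tokenize(answer))
answer_logprobs = echoed_logprobs[-n_answer_tokens:]
print(sum(answer_logprobs))  # the sum used to rank this answer
```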
(run-eval-reasoning)=
# Evaluation of Reasoning Models
Reasoning models require a distinct approach compared to standard language models. Their outputs are typically longer, may contain dedicated reasoning tokens, and are more susceptible to generating loops or repetitive sequences. Evaluating these models effectively requires custom parameter settings and careful handling of generation constraints.
## Before You Start
Ensure you have:
- **Model Endpoint**: An OpenAI-compatible API endpoint for your model (completions or chat). See {ref}`deployment-testing-compatibility` for snippets you can use to test your endpoint.
- **API Access**: Valid API key if your endpoint requires authentication
- **Installed Packages**: NeMo Evaluator or access to evaluation containers
## Recommended Settings
### Generation Settings
Below are recommended generation settings for some popular reasoning-optimized models. These configurations are typically published in each model's **model card**:
| Model | Temperature | Top-p | Top-k |
|---------------------|-------------|--------|--------|
| **NVIDIA Nemotron** | 0.6 | 0.95 | — |
| **DeepSeek R1** | 0.6 | 0.95 | — |
| **Qwen3 235B** | 0.6 | 0.95 | 20 |
| **Phi-4 Reasoning** | 0.8 | 0.95 | 50 |
### Token Configuration
- `max_new_tokens` must be **significantly increased** for reasoning tasks, as it covers the length of both the reasoning trace and the final answer.
- Check the model card for the settings recommended by the model creators.
- Verify that the specified `max_new_tokens` is enough for the model to finish reasoning.
:::{tip}
You can verify successful reasoning completion in the logs via the {ref}`interceptor-reasoning` Interceptor, for example:
```
[I 2025-12-02T16:14:28.257] Reasoning tracking information reasoning_words=1905 original_content_words=85 updated_content_words=85 reasoning_finished=True reasoning_started=True reasoning_tokens=unknown updated_content_tokens=unknown logger=ResponseReasoningInterceptor request_id=ccff76b2-2b85-4eed-a9d0-2363b533ae58
```
:::
## Reasoning Output Formats
Reasoning models produce outputs that contain both the **reasoning trace** (the model's step-by-step thinking process) and the **final answer**. The reasoning trace typically includes intermediate thoughts, calculations, and logical steps before arriving at the conclusion.
There are two main ways to structure reasoning output:
### 1. Wrapped with reasoning tokens
For example, the reasoning trace is enclosed in special tokens such as `<think>` and `</think>`, followed by the final answer:
```
<think>
... reasoning trace ...
</think>
... final answer ...
```
Most benchmarks expect only the final answer to be present in the model's response.
If your model endpoint returns the reasoning trace in the main content, it needs to be removed from the assistant messages.
You can do it using the {ref}`interceptor-reasoning` Interceptor.
The interceptor will remove reasoning trace from the content and (optionally) track statistics for reasoning traces.
:::{note}
The `ResponseReasoningInterceptor` is configured by default for the `<think> ... </think>` format. If your model uses these special tokens, you do not need to modify anything in your configuration.
:::
### 2. Returned as `reasoning_content` field in messages output
If your model is deployed with e.g. vLLM, sglang or NIM, the reasoning part of the model's output is likely returned in the separate `reasoning_content` field in messages output (see [vLLM documentation](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html) and [sglang documentation](https://sgl-project.github.io/advanced_features/separate_reasoning.html)).
In the messages returned by the endpoint, there are:
- `reasoning_content`: The reasoning part of the output.
- `content`: The content of the final answer.
Unlike the first method, this setup does not require any extra response parsing.
However, in some benchmarks, errors may appear if the reasoning has not finished and the benchmark does not support an empty answer in `content`.
#### Enabling reasoning parser in vLLM
To enable the `reasoning_content` field in vLLM, you need to pass the `--reasoning-parser` argument to the vLLM server.
In NeMo Evaluator Launcher, you can do this via `deployment.extra_args`:
```yaml
deployment:
  hf_model_handle: Qwen/Qwen3-Next-80B-A3B-Thinking
  extra_args: "--reasoning-parser deepseek_r1"
```
Available reasoning parsers depend on your vLLM version. Common options include `deepseek_r1` for models using the `<think> ... </think>` format.
See the [vLLM reasoning outputs documentation](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html) for details.
---
## Control the Reasoning Effort
Some models allow turning reasoning on or off, or setting its level of effort. There are usually two ways of doing this:
- **Special instruction in the system prompt**
- **Extra parameters passed to the chat_template**
:::{tip}
Check the model card and documentation of the deployment of your choice to see how you can control the reasoning effort for your model.
If there are several options available, it is recommended to use the dedicated chat template parameters over the system prompt.
:::
### Control reasoning with the system prompt
In this example we will use the [NVIDIA-Nemotron-Nano-9B-v2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard) model.
This model allows you to control the reasoning effort by including `/think` or `/no_think` in the system prompt, e.g.:
```json
{
"model": "nvidia/nvidia-nemotron-nano-9b-v2",
"messages": [
{"role": "system", "content": "You are a helpful assistant. /think"},
{"role": "user", "content": "What is 2+2?"}
],
"temperature": 0.6,
"top_p": 0.95,
"max_tokens": 32768
}
```
When launching the evaluation, we can use the {ref}`interceptor-system-messages` Interceptor to add `/think` or `/no_think` to the system prompt.
```yaml
config:
  params:
    temperature: 0.6
    top_p: 0.95
    max_new_tokens: 32768 # for reasoning + final answer
target:
  api_endpoint:
    adapter_config:
      process_reasoning_traces: true # strips reasoning tokens and collects reasoning stats
      use_system_prompt: true # turn reasoning on with a special system prompt
      custom_system_prompt: >-
        "/think"
```
### Control reasoning with additional parameters
In this example we will use the [Granite-3.3-8B-Instruct](https://build.nvidia.com/ibm/granite-3_3-8b-instruct/modelcard) model.
Unlike NVIDIA-Nemotron-Nano-9B-v2, this model turns reasoning on with an additional `thinking` parameter passed to the chat template:
```json
{
"model": "ibm/granite-3.3-8b-instruct",
"messages": [
{
"role": "user",
"content": "What is 2+2?"
}
],
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 8192,
"seed": 42,
"stream": true,
"chat_template_kwargs": {
"thinking": true
}
}
```
When running the evaluation, use the {ref}`interceptor-payload-modification` Interceptor to add this parameter to benchmarks' requests:
```yaml
config:
  params:
    temperature: 0.6
    top_p: 0.95
    max_new_tokens: 32768 # for reasoning + final answer
target:
  api_endpoint:
    adapter_config:
      process_reasoning_traces: true
      params_to_add:
        chat_template_kwargs:
          thinking: true
```
## Benchmarks for Reasoning
Reasoning models excel at tasks that require multi-step thinking, logical deduction, and complex problem-solving. The following benchmark categories are particularly well-suited for evaluating reasoning capabilities:
- **CoT tasks**: e.g., AIME, Math, GPQA-diamond
- **Coding**: e.g., SciCode, LiveCodeBench
:::{tip}
When evaluating your model on a task that does not require step-by-step thinking, consider turning the reasoning off or lowering the thinking budget.
:::
## Full Working Example
### Run the evaluation
An example config is available in `packages/nemo-evaluator-launcher/examples/local_reasoning.yaml`:
```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_reasoning.yaml
:language: yaml
:start-after: "[docs-start-snippet]"
```
To launch the evaluation, run:
```bash
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_reasoning.yaml
```
### Analyze the artifacts
NeMo Evaluator produces several artifacts for analysis after evaluation completion.
The primary output file is `results.yaml`, which stores the metrics produced by the benchmark (see {ref}`evaluation-output` for more details).
The `eval_factory_metrics.json` file provides valuable insights into your model's behavior.
When the reasoning interceptor is enabled, this file contains a `reasoning` key that stores statistics about reasoning traces in your model's responses:
```json
"reasoning": {
"description": "Reasoning statistics saved during processing",
"total_responses": 3672,
"responses_with_reasoning": 2860,
"reasoning_finished_count": 2860,
"reasoning_finished_ratio": 1.0,
"reasoning_started_count": 2860,
"reasoning_unfinished_count": 0,
"avg_reasoning_words": 153.21,
"avg_original_content_words": 192.17,
"avg_updated_content_words": 38.52,
"max_reasoning_words": 806,
"max_original_content_words": 863,
"max_updated_content_words": 863,
"max_reasoning_tokens": null,
"avg_reasoning_tokens": null,
"max_updated_content_tokens": null,
"avg_updated_content_tokens": null,
"total_reasoning_words": 561696,
"total_original_content_words": 705555,
"total_updated_content_words": 140999,
"total_reasoning_tokens": 0,
"total_updated_content_tokens": 0
},
```
In the example above, the model used reasoning for 2860 out of 3672 responses (approximately 78%).
The matching values for `reasoning_started_count` and `reasoning_finished_count` (and `reasoning_unfinished_count` being 0) indicate that the `max_new_tokens` parameter was set sufficiently high, allowing the model to complete all reasoning traces without truncation.
These statistics also enable cost analysis for reasoning operations.
While the endpoint in this example does not return reasoning token usage statistics (the `*_tokens` fields are null or zero), you can still analyze computational cost using the word count metrics from the responses.
For more information on available artifacts, see {ref}`evaluation-output`.
(deployment-overview)=
# Serve and Deploy Models
Deploy and serve models with NeMo Evaluator's flexible deployment options. Select a deployment strategy that matches your workflow, infrastructure, and requirements.
## Overview
NeMo Evaluator keeps model serving separate from evaluation execution, giving you flexible architectures and scalable workflows. Choose who manages deployment based on your needs.
### Key Concepts
- **Model-Evaluation Separation**: Models are served via OpenAI-compatible APIs; evaluations run in containers
- **Deployment Responsibility**: Choose who manages the model serving infrastructure
- **Multi-Backend Support**: Deploy locally, on HPC clusters, or in the cloud
## Deployment Strategy Guide
### **Launcher-Orchestrated Deployment** (Recommended)
Let NeMo Evaluator Launcher handle both model deployment and evaluation orchestration:
```bash
# Launcher deploys model AND runs evaluation
HOSTNAME=cluster-login-node.com
ACCOUNT=my_account
OUT_DIR=/absolute/path/on/login/node
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/slurm_vllm_checkpoint_path.yaml \
-o execution.hostname=$HOSTNAME \
-o execution.output_dir=$OUT_DIR \
-o execution.account=$ACCOUNT \
-o deployment.checkpoint_path=/shared/models/llama-3.1-8b-it
```
**When to use:**
- You want automated deployment lifecycle management
- You prefer integrated monitoring and cleanup
- You want the simplest path from model to results
**Supported deployment types:** vLLM, NIM, SGLang, TRT-LLM, or no deployment (existing endpoints)
:::{seealso}
For detailed YAML configuration reference for each deployment type, see the {ref}`configuration-overview` in the NeMo Evaluator Launcher library.
:::
### **Bring-Your-Own-Endpoint**
You handle model deployment, NeMo Evaluator handles evaluation:
**Launcher users with existing endpoints:**
```bash
# Point launcher to your deployed model
URL=http://localhost:8000/v1/chat/completions
MODEL=your-model-name
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o target.api_endpoint.url=$URL \
-o target.api_endpoint.model_id=$MODEL
```
**Core library users:**
```python
from nemo_evaluator import evaluate, ApiEndpoint, EvaluationTarget, EvaluationConfig
api_endpoint = ApiEndpoint(url="http://localhost:8080/v1/completions")
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="gsm8k", output_dir="./results")
evaluate(target_cfg=target, eval_cfg=config)
```
**When to use:**
- You have existing model serving infrastructure
- You need custom deployment configurations
- You want to deploy once and run many evaluations
- You have specific security or compliance requirements
## Available Deployment Types
The launcher supports multiple deployment types through Hydra configuration:
**vLLM Deployment**
```yaml
deployment:
  type: vllm
  image: vllm/vllm-openai:latest
  hf_model_handle: hf-model/handle  # HuggingFace ID
  checkpoint_path: null  # or provide a path to the stored checkpoint
  served_model_name: your-model-name
  port: 8000
```
**NIM Deployment**
```yaml
deployment:
  type: nim
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
  port: 8000
```
**SGLang Deployment**
```yaml
deployment:
  type: sglang
  image: lmsysorg/sglang:latest
  hf_model_handle: hf-model/handle  # HuggingFace ID
  checkpoint_path: null  # or provide a path to the stored checkpoint
  served_model_name: your-model-name
  port: 8000
```
**No Deployment**
```yaml
deployment:
  type: none  # Use existing endpoint
```
## Bring-Your-Own-Endpoint Options
Choose from these approaches when managing your own deployment:
### Hosted Services
- **NVIDIA Build**: Ready-to-use hosted models with OpenAI-compatible APIs
- **OpenAI API**: Direct integration with OpenAI's models
- **Other providers**: Any service providing OpenAI-compatible endpoints
### Enterprise Integration
- **Kubernetes deployments**: Container orchestration in production environments
- **Existing MLOps pipelines**: Integration with current model serving infrastructure
- **Custom infrastructure**: Specialized deployment requirements
(adapters-client-mode)=
# Client Mode
The NeMo Evaluator adapter system supports **Client Mode**, where adapters run in-process through a custom httpx transport, providing a simpler alternative to the proxy server architecture.
## Overview
| Feature | Server Mode | Client Mode |
|---------|------------|-------------|
| **Architecture** | Separate proxy server process | In-process via httpx transport |
| **Setup** | Automatic server startup/shutdown | Simple client instantiation |
| **Use Case** | Framework-driven evaluations | Direct API usage, notebooks |
| **Overhead** | Network proxy | Direct in-process execution |
| **Debugging** | Separate process | Same process, easier debugging |
## Quick Start
```python
from nemo_evaluator.client import NeMoEvaluatorClient
from nemo_evaluator.api.api_dataclasses import EndpointModelConfig
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
# Configure model and adapters
config = EndpointModelConfig(
    model_id="my-model",
    url="https://api.example.com/v1/chat/completions",
    api_key_name="API_KEY",  # Environment variable name
    adapter_config=AdapterConfig(
        mode="client",  # Use client mode (no server)
        interceptors=[
            InterceptorConfig(name="caching", enabled=True),
            InterceptorConfig(name="endpoint", enabled=True),
        ]
    ),
    is_base_url=False,  # True if URL is base, False for complete endpoint
)

# Create client
async with NeMoEvaluatorClient(config, output_dir="./output") as client:
    response = await client.chat_completion(
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response)
```
## Mode Configuration
### Adapter Mode Field
The `mode` field in `AdapterConfig` controls whether a server process is spawned:
- **`mode="server"`** (default): Spawns adapter server process in `evaluate()` calls
- **`mode="client"`**: Skips server spawning, for use with `NeMoEvaluatorClient`
When using `NeMoEvaluatorClient` directly, set `mode="client"` to prevent unnecessary server creation if the config is also used in `evaluate()` calls.
## URL Modes
Client mode supports two URL configurations via the `is_base_url` flag:
### Base URL Mode (`is_base_url=True`)
Use when the URL is a base URL and the client should append paths:
```python
config = EndpointModelConfig(
    url="https://api.example.com/v1",  # Base URL
    is_base_url=True,
    ...
)
# Requests go to: https://api.example.com/v1/chat/completions
```
### Passthrough Mode (`is_base_url=False`)
Use when the URL is the complete endpoint:
```python
config = EndpointModelConfig(
    url="https://api.example.com/v1/chat/completions",  # Complete endpoint
    is_base_url=False,  # Default
    ...
)
# Requests go to: https://api.example.com/v1/chat/completions (as-is)
```
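As an illustration only (this is not the client's internal code), the effect of `is_base_url` on the final request URL can be sketched as:

```python
def resolve_request_url(url: str, is_base_url: bool, path: str = "/chat/completions") -> str:
    """Illustrative sketch of how is_base_url selects the final request URL."""
    if is_base_url:
        # Base URL mode: the client appends the endpoint path
        return url.rstrip("/") + path
    # Passthrough mode: the configured URL is used as-is
    return url

# Base URL mode appends the path:
assert resolve_request_url("https://api.example.com/v1", True) == \
    "https://api.example.com/v1/chat/completions"
# Passthrough mode leaves the URL untouched:
assert resolve_request_url("https://api.example.com/v1/chat/completions", False) == \
    "https://api.example.com/v1/chat/completions"
```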
## API Reference
### Initialization
```python
from nemo_evaluator.client import NeMoEvaluatorClient
from nemo_evaluator.api.api_dataclasses import EndpointModelConfig
client = NeMoEvaluatorClient(
    endpoint_model_config=EndpointModelConfig(
        model_id="model-name",
        url="https://api.example.com/v1/chat/completions",
        api_key_name="API_KEY",
        adapter_config=adapter_config,
        is_base_url=False,
        temperature=0.7,
        top_p=0.9,
        max_new_tokens=100,
        request_timeout=60,
        max_retries=3,
        parallelism=5,
    ),
    output_dir="./eval_output"
)
```
### Methods
#### Chat Completion
```python
# Single request (async)
response = await client.chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    seed=42  # Optional
)

# Batch requests (sync wrapper)
responses = client.chat_completions(
    messages_list=[
        [{"role": "user", "content": "Hello"}],
        [{"role": "user", "content": "Hi"}],
    ],
    seeds=[42, 43],  # Optional
    show_progress=True
)

# Batch requests (async)
responses = await client.batch_chat_completions(
    messages_list=[...],
    seeds=[...],
    show_progress=True
)
```
#### Text Completion
```python
# Single completion
response = await client.completion(
    prompt="Once upon a time",
    seed=42
)

# Batch completions
responses = client.completions(
    prompts=["Prompt 1", "Prompt 2"],
    seeds=[42, 43],
    show_progress=True
)
```
#### Embeddings
```python
# Single embedding
embedding = await client.embedding(text="Hello world")

# Batch embeddings
embeddings = client.embeddings(
    texts=["Text 1", "Text 2"],
    show_progress=True
)
```
### Context Manager
```python
# Recommended: ensures post-eval hooks run
async with NeMoEvaluatorClient(config, output_dir="./output") as client:
    response = await client.chat_completion(messages=[...])
    # Hooks run automatically on exit
```
### Manual Cleanup
```python
client = NeMoEvaluatorClient(config, output_dir="./output")
try:
    response = await client.chat_completion(messages=[...])
finally:
    await client.aclose()  # Runs post-eval hooks
## Adapter Configuration
Client mode uses the same `AdapterConfig` as server mode, but with `mode="client"` to prevent server spawning:
```python
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
adapter_config = AdapterConfig(
    mode="client",  # Prevents adapter server from spawning
    interceptors=[
        InterceptorConfig(
            name="system_message",
            config={"system_message": "You are helpful."}
        ),
        InterceptorConfig(name="request_logging"),
        InterceptorConfig(
            name="caching",
            config={"cache_dir": "./cache"}
        ),
        InterceptorConfig(name="reasoning"),
        InterceptorConfig(name="endpoint"),  # Required
    ],
    post_eval_hooks=[
        {"name": "post_eval_report", "config": {"report_types": ["html"]}}
    ]
)
```
**Note:** When using `NeMoEvaluatorClient`, the `mode` is automatically set to `"client"` if not specified.
## Implementation Details
### Architecture
```
┌─────────────────────────┐
│  Your Script/Notebook   │
└───────────┬─────────────┘
            │
            ↓
┌─────────────────────────┐
│   NeMoEvaluatorClient   │
│  (AsyncOpenAI wrapper)  │
└───────────┬─────────────┘
            │
            ↓
┌────────────────────────────┐
│   AsyncAdapterTransport    │
│ (httpx.AsyncBaseTransport) │
│                            │
│  ┌───────────────────┐     │
│  │ Adapter Pipeline  │     │
│  │ - Interceptors    │     │
│  │ - Post-eval hooks │     │
│  └───────────────────┘     │
└───────────┬────────────────┘
            │ HTTP
            ↓
┌─────────────────────────┐
│     Model Endpoint      │
└─────────────────────────┘
```
### Request Flow
1. User calls `client.chat_completion(...)`
2. AsyncOpenAI client constructs httpx.Request
3. AsyncAdapterTransport intercepts the request
4. Request wrapped for adapter compatibility (HttpxRequestWrapper)
5. Request passes through interceptor chain (in thread pool for sync interceptors)
6. Endpoint interceptor makes HTTP call
7. Response passes back through response interceptors
8. Response converted back to httpx.Response
9. AsyncOpenAI client parses and returns completion
### Sync/Async Bridging
Client mode handles the async/sync boundary automatically:
- AsyncAdapterTransport is async (implements `httpx.AsyncBaseTransport`)
- Adapter pipeline and interceptors are synchronous
- `asyncio.to_thread()` runs sync pipeline in thread pool
- Seamless integration with async OpenAI client
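The bridging pattern can be sketched with the standard library alone. The function names below are hypothetical stand-ins for the transport and the sync interceptor chain; the key point is that `asyncio.to_thread()` keeps the event loop responsive while a blocking pipeline runs:

```python
import asyncio

def run_sync_pipeline(request: dict) -> dict:
    """Stand-in for the synchronous interceptor chain (hypothetical)."""
    request = {**request, "intercepted": True}  # request-side interceptors
    return {"echo": request}                    # endpoint call + response side

async def handle_request(request: dict) -> dict:
    # Run the blocking pipeline in a worker thread so the event loop stays
    # free, mirroring how the async transport bridges to sync interceptors.
    return await asyncio.to_thread(run_sync_pipeline, request)

result = asyncio.run(handle_request({"prompt": "hi"}))
```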
## When to Use Client Mode
### Use Client Mode When:
- You are writing custom evaluation scripts
- You are working in Jupyter notebooks
- You need direct API control
- You want a simpler setup
- You are debugging in the same process
- You are running single-process evaluations
### Use Server Mode When:
- You are running framework-driven evaluations with `evaluate()`
- You need shared adapter state across processes
- You are working with harnesses that don't support custom clients
- You are running distributed evaluations
## Examples
### Basic Usage
```python
from nemo_evaluator.client import NeMoEvaluatorClient
from nemo_evaluator.api.api_dataclasses import EndpointModelConfig
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
config = EndpointModelConfig(
    model_id="llama-3-70b",
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    api_key_name="NVIDIA_API_KEY",
    is_base_url=False,
    adapter_config=AdapterConfig(
        interceptors=[
            InterceptorConfig(name="caching"),
            InterceptorConfig(name="endpoint"),
        ]
    ),
)

async with NeMoEvaluatorClient(config, "./output") as client:
    response = await client.chat_completion(
        messages=[{"role": "user", "content": "What is AI?"}]
    )
    print(response)
```
### Batch Processing
```python
# Process multiple prompts with progress bar
prompts = [
    [{"role": "user", "content": f"Question {i}"}]
    for i in range(100)
]
responses = client.chat_completions(
    messages_list=prompts,
    show_progress=True
)
```
### With All Interceptors
```python
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            config={"system_message": "Be concise."}
        ),
        InterceptorConfig(name="request_logging"),
        InterceptorConfig(name="response_logging"),
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./cache",
                "reuse_cached_responses": True,
                "save_requests": True,
                "save_responses": True,
            }
        ),
        InterceptorConfig(
            name="reasoning",
            config={"start_reasoning_token": "<think>"}
        ),
        InterceptorConfig(name="response_stats"),
        InterceptorConfig(name="endpoint"),
    ],
    post_eval_hooks=[
        {"name": "post_eval_report", "config": {"report_types": ["html", "json"]}}
    ]
)
```
## See Also
- {ref}`adapters-concepts` - Conceptual overview of the adapter system
- {ref}`adapters-configuration` - Available interceptors and configuration options
- {ref}`deployment-adapters-recipes` - Common adapter patterns and recipes
(adapters-configuration)=
# Configuration
Configure the adapter system using the `AdapterConfig` class from `nemo_evaluator.adapters.adapter_config`. This class uses a registry-based interceptor architecture where you configure a list of interceptors, each with their own parameters.
## Core Configuration Structure
`AdapterConfig` accepts the following structure:
```python
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="interceptor_name",
            enabled=True,  # Optional, defaults to True
            config={
                # Interceptor-specific parameters
            }
        )
    ],
    endpoint_type="chat"  # Optional, defaults to "chat"
)
```
## Available Interceptors
### System Message Interceptor
**Name:** `system_message`
Adds a system message to requests as a message with the `system` role.
```{list-table}
:header-rows: 1
:widths: 20 15 15 50
* - Parameter
  - Type
  - Default
  - Description
* - `system_message`
  - `str`
  - Required
  - System message to add to requests
* - `strategy`
  - `str`
  - `"prepend"`
  - Strategy for handling existing system messages. Options: `"replace"` (replaces existing), `"append"` (appends to existing), `"prepend"` (prepends to existing)
```
**Example:**
```python
InterceptorConfig(
    name="system_message",
    config={
        "system_message": "You are a helpful assistant."
    }
)

# With explicit strategy
InterceptorConfig(
    name="system_message",
    config={
        "system_message": "You are a helpful assistant.",
        "strategy": "replace"  # Replace existing system messages
    }
)
```
### Reasoning Interceptor
**Name:** `reasoning`
Processes reasoning content in responses by detecting and removing reasoning tokens, tracking reasoning statistics, and optionally extracting reasoning to separate fields.
```{list-table}
:header-rows: 1
:widths: 25 15 20 40
* - Parameter
  - Type
  - Default
  - Description
* - `start_reasoning_token`
  - `str \| None`
  - `"<think>"`
  - Token marking start of reasoning section
* - `end_reasoning_token`
  - `str`
  - `"</think>"`
  - Token marking end of reasoning section
* - `add_reasoning`
  - `bool`
  - `True`
  - Whether to add reasoning information
* - `migrate_reasoning_content`
  - `bool`
  - `False`
  - Migrate `reasoning_content` to `content` field with tokens
* - `enable_reasoning_tracking`
  - `bool`
  - `True`
  - Enable reasoning tracking and logging
* - `include_if_not_finished`
  - `bool`
  - `True`
  - Include reasoning if end token not found
* - `enable_caching`
  - `bool`
  - `True`
  - Cache individual request reasoning statistics
* - `cache_dir`
  - `str`
  - `"/tmp/reasoning_interceptor"`
  - Cache directory for reasoning stats
* - `stats_file_saving_interval`
  - `int \| None`
  - `None`
  - Save stats to file every N responses (`None` = only save via post-eval hook)
* - `logging_aggregated_stats_interval`
  - `int`
  - `100`
  - Log aggregated stats every N responses
```
**Example:**
```python
InterceptorConfig(
    name="reasoning",
    config={
        "start_reasoning_token": "<think>",
        "end_reasoning_token": "</think>",
        "enable_reasoning_tracking": True
    }
)
```
### Request Logging Interceptor
**Name:** `request_logging`
Logs incoming requests with configurable limits and detail levels.
```{list-table}
:header-rows: 1
:widths: 20 15 15 50
* - Parameter
  - Type
  - Default
  - Description
* - `log_request_body`
  - `bool`
  - `True`
  - Whether to log request body
* - `log_request_headers`
  - `bool`
  - `True`
  - Whether to log request headers
* - `max_requests`
  - `int \| None`
  - `2`
  - Maximum requests to log (None for unlimited)
```
**Example:**
```python
InterceptorConfig(
    name="request_logging",
    config={
        "max_requests": 50,
        "log_request_body": True
    }
)
```
### Response Logging Interceptor
**Name:** `response_logging`
Logs outgoing responses with configurable limits and detail levels.
```{list-table}
:header-rows: 1
:widths: 20 15 15 50
* - Parameter
  - Type
  - Default
  - Description
* - `log_response_body`
  - `bool`
  - `True`
  - Whether to log response body
* - `log_response_headers`
  - `bool`
  - `True`
  - Whether to log response headers
* - `max_responses`
  - `int \| None`
  - `None`
  - Maximum responses to log (None for unlimited)
```
**Example:**
```python
InterceptorConfig(
    name="response_logging",
    config={
        "max_responses": 50,
        "log_response_body": True
    }
)
```
### Caching Interceptor
**Name:** `caching`
Caches requests and responses to disk with options for reusing cached responses.
```{list-table}
:header-rows: 1
:widths: 25 15 15 45
* - Parameter
  - Type
  - Default
  - Description
* - `cache_dir`
  - `str`
  - `"/tmp"`
  - Directory to store cache files
* - `reuse_cached_responses`
  - `bool`
  - `False`
  - Whether to reuse cached responses
* - `save_requests`
  - `bool`
  - `False`
  - Whether to save requests to cache
* - `save_responses`
  - `bool`
  - `True`
  - Whether to save responses to cache
* - `max_saved_requests`
  - `int \| None`
  - `None`
  - Maximum requests to save (None for unlimited)
* - `max_saved_responses`
  - `int \| None`
  - `None`
  - Maximum responses to save (None for unlimited)
```
**Notes:**
- If `reuse_cached_responses` is `True`, `save_responses` is automatically set to `True` and `max_saved_responses` to `None`
- The system generates cache keys automatically using SHA256 hash of request data
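A cache key of this shape can be reproduced with the standard library. This is a sketch of the idea, not the interceptor's exact canonicalization, which may differ in detail:

```python
import hashlib
import json

def cache_key(request_body: dict) -> str:
    """SHA256 over a canonical JSON serialization of the request."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Semantically equal requests map to the same key regardless of field order:
k1 = cache_key({"model": "m", "messages": [{"role": "user", "content": "hi"}]})
k2 = cache_key({"messages": [{"role": "user", "content": "hi"}], "model": "m"})
assert k1 == k2 and len(k1) == 64
```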
**Example:**
```python
InterceptorConfig(
    name="caching",
    config={
        "cache_dir": "./evaluation_cache",
        "reuse_cached_responses": True
    }
)
```
### Progress Tracking Interceptor
**Name:** `progress_tracking`
Tracks evaluation progress by counting processed samples and optionally sending updates to a webhook.
```{list-table}
:header-rows: 1
:widths: 25 15 20 40
* - Parameter
  - Type
  - Default
  - Description
* - `progress_tracking_url`
  - `str \| None`
  - `"http://localhost:8000"`
  - URL to post progress updates. Supports shell variable expansion.
* - `progress_tracking_interval`
  - `int`
  - `1`
  - Update every N samples
* - `request_method`
  - `str`
  - `"PATCH"`
  - HTTP method for progress updates
* - `output_dir`
  - `str \| None`
  - `None`
  - Directory to save progress file (creates a `progress` file in this directory)
```
**Example:**
```python
InterceptorConfig(
    name="progress_tracking",
    config={
        "progress_tracking_url": "http://monitor:8000/progress",
        "progress_tracking_interval": 10
    }
)
```
### Endpoint Interceptor
**Name:** `endpoint`
Makes the actual HTTP request to the upstream API. This interceptor has no configurable parameters and is typically added automatically as the final interceptor in the chain.
**Example:**
```python
InterceptorConfig(name="endpoint")
```
## Configuration Examples
### Basic Configuration
```python
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 10}
        ),
        InterceptorConfig(
            name="caching",
            config={"cache_dir": "./cache"}
        )
    ]
)
```
### Advanced Configuration
```python
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig
adapter_config = AdapterConfig(
    interceptors=[
        # System prompting
        InterceptorConfig(
            name="system_message",
            config={
                "system_message": "You are an expert AI assistant."
            }
        ),
        # Reasoning processing
        InterceptorConfig(
            name="reasoning",
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>",
                "enable_reasoning_tracking": True
            }
        ),
        # Request logging
        InterceptorConfig(
            name="request_logging",
            config={
                "max_requests": 1000,
                "log_request_body": True
            }
        ),
        # Response logging
        InterceptorConfig(
            name="response_logging",
            config={
                "max_responses": 1000,
                "log_response_body": True
            }
        ),
        # Caching
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./production_cache",
                "reuse_cached_responses": True
            }
        ),
        # Progress tracking
        InterceptorConfig(
            name="progress_tracking",
            config={
                "progress_tracking_url": "http://monitoring:3828/progress",
                "progress_tracking_interval": 10
            }
        )
    ],
    endpoint_type="chat"
)
```
### YAML Configuration
You can also configure adapters through YAML files in your evaluation configuration:
```yaml
target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions
    type: chat
    model_id: megatron_model
    adapter_config:
      interceptors:
        - name: system_message
          config:
            system_message: "You are a helpful assistant."
            strategy: "prepend"  # Optional: replace, append, or prepend (default)
        - name: reasoning
          config:
            start_reasoning_token: "<think>"
            end_reasoning_token: "</think>"
        - name: request_logging
          config:
            max_requests: 50
        - name: response_logging
          config:
            max_responses: 50
        - name: caching
          config:
            cache_dir: ./cache
            reuse_cached_responses: true
## Interceptor Order
Interceptors are executed in the order they appear in the `interceptors` list:
1. **Request interceptors** process the request in list order
2. The **endpoint interceptor** makes the actual API call (automatically added if not present)
3. **Response interceptors** process the response in reverse list order
For example, with interceptors `[system_message, request_logging, caching, response_logging, reasoning]`:
- Request flow: `system_message` → `request_logging` → `caching` (check cache) → API call (if cache miss)
- Response flow: API call → `caching` (save to cache) → `response_logging` → `reasoning`
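The forward/reverse ordering can be demonstrated with a toy chain of plain functions (illustrative only, not the adapter's implementation):

```python
# Toy chain: request hooks run in list order, response hooks in reverse order.
order = []

def make_interceptor(name):
    def on_request(req):
        order.append(f"req:{name}")
        return req
    def on_response(resp):
        order.append(f"resp:{name}")
        return resp
    return on_request, on_response

chain = [make_interceptor(n) for n in ("system_message", "logging", "caching")]

req = {"prompt": "hi"}
for on_request, _ in chain:             # forward pass over the request
    req = on_request(req)
resp = {"answer": "ok"}                 # stand-in for the endpoint call
for _, on_response in reversed(chain):  # reverse pass over the response
    resp = on_response(resp)

assert order == ["req:system_message", "req:logging", "req:caching",
                 "resp:caching", "resp:logging", "resp:system_message"]
```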
## Shorthand Syntax
You can use string names as shorthand for interceptors with default configuration:
```python
adapter_config = AdapterConfig(
    interceptors=["request_logging", "caching", "response_logging"]
)
```
This is equivalent to:
```python
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(name="request_logging"),
        InterceptorConfig(name="caching"),
        InterceptorConfig(name="response_logging")
    ]
)
```
---
orphan: true
---
(adapters)=
# Evaluation Adapters
Evaluation adapters provide a flexible mechanism for intercepting and modifying requests/responses between the evaluation harness and the model endpoint. This allows for custom processing, logging, and transformation of data during the evaluation process.
## Concepts
For a conceptual overview and architecture diagram of adapters and interceptor chains, refer to {ref}`adapters-concepts`.
## Topics
Explore the following pages to use and configure adapters.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} Usage
:link: adapters-usage
:link-type: ref
Learn how to enable adapters and pass `AdapterConfig` to `evaluate`.
:::
:::{grid-item-card} Recipes
:link: deployment-adapters-recipes
:link-type: ref
Reasoning cleanup, system prompt override, response shaping, logging caps.
:::
:::{grid-item-card} Configuration
:link: adapters-configuration
:link-type: ref
View available `AdapterConfig` options and defaults.
:::
::::
```{toctree}
:maxdepth: 1
:hidden:
Usage
Recipes
Configuration
```
(adapters-usage)=
# Usage
Configure the adapter system using the `AdapterConfig` class with interceptors. Pass the configuration through the `ApiEndpoint.adapter_config` parameter:
```python
from nemo_evaluator import (
    ApiEndpoint,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure adapter with multiple interceptors
adapter_config = AdapterConfig(
    interceptors=[
        # Reasoning interceptor
        InterceptorConfig(
            name="reasoning",
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>"
            }
        ),
        # System message interceptor
        InterceptorConfig(
            name="system_message",
            config={
                "system_message": "You are a helpful assistant that thinks step by step."
            }
        ),
        # Logging interceptors
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 50}
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 50}
        ),
        # Caching interceptor
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache"
            }
        ),
        # Progress tracking
        InterceptorConfig(
            name="progress_tracking"
        )
    ]
)

# Configure evaluation target
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="megatron_model",
    adapter_config=adapter_config
)
target_config = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    params={"limit_samples": 10},
    output_dir="./results/mmlu",
)

# Run evaluation with adapter system
results = evaluate(
    eval_cfg=eval_config,
    target_cfg=target_config
)
```
## YAML Configuration
You can also configure adapters through YAML configuration files:
```yaml
target:
  api_endpoint:
    url: http://localhost:8080/v1/completions/
    type: completions
    model_id: megatron_model
    adapter_config:
      interceptors:
        - name: reasoning
          config:
            start_reasoning_token: "<think>"
            end_reasoning_token: "</think>"
        - name: system_message
          config:
            system_message: "You are a helpful assistant that thinks step by step."
        - name: request_logging
          config:
            max_requests: 50
        - name: response_logging
          config:
            max_responses: 50
        - name: caching
          config:
            cache_dir: ./cache
        - name: progress_tracking
config:
  type: mmlu_pro
  output_dir: ./results
  params:
    limit_samples: 10
(adapters-recipe-system-prompt)=
# Custom System Prompt (Chat)
Apply a standard system message to chat endpoints for consistent behavior.
```python
from nemo_evaluator import (
    ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure chat endpoint
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"
api_endpoint = ApiEndpoint(url=chat_url, type=EndpointType.CHAT, model_id="megatron_model")

# Configure adapter with custom system prompt using interceptor
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            config={
                "system_message": "You are a precise, concise assistant. Answer questions directly and accurately.",
                "strategy": "prepend"  # Optional: "replace", "append", or "prepend" (default)
            }
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="mmlu_pro", output_dir="results")
results = evaluate(target_cfg=target, eval_cfg=config)
```
## How It Works
The `system_message` interceptor modifies chat-format requests based on the configured strategy:
- **`prepend` (default)**: Prepends the configured system message before any existing system message
- **`replace`**: Removes any existing system messages and replaces with the configured message
- **`append`**: Appends the configured system message after any existing system message
All strategies:
1. Insert or modify the system message as the first message with `role: "system"`
2. Preserve all other request parameters
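The three strategies can be sketched as a pure function over the messages list. This is an illustration of the documented behavior, not the interceptor's source:

```python
def apply_system_message(messages, system_message, strategy="prepend"):
    """Illustrative sketch of the three system_message strategies."""
    existing = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if strategy == "replace" or not existing:
        content = system_message
    elif strategy == "prepend":
        content = system_message + existing[0]["content"]
    else:  # "append"
        content = existing[0]["content"] + system_message
    # The system message always ends up as the first message
    return [{"role": "system", "content": content}] + rest

msgs = [{"role": "system", "content": "Base."}, {"role": "user", "content": "Hi"}]
out = apply_system_message(msgs, "Important: ", strategy="prepend")
assert out[0] == {"role": "system", "content": "Important: Base."}
assert out[-1]["role"] == "user"
```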
## Strategy Examples
```python
# Replace existing system messages (ignore any existing ones)
InterceptorConfig(
    name="system_message",
    config={
        "system_message": "You are a helpful assistant.",
        "strategy": "replace"
    }
)

# Prepend to existing system messages (default behavior)
InterceptorConfig(
    name="system_message",
    config={
        "system_message": "Important: ",
        "strategy": "prepend"
    }
)

# Append to existing system messages
InterceptorConfig(
    name="system_message",
    config={
        "system_message": "\nRemember to be concise.",
        "strategy": "append"
    }
)
```
Refer to {ref}`adapters-configuration` for more configuration options.
(deployment-adapters-recipes)=
# Recipes
Practical, focused examples for common adapter scenarios.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} Reasoning Cleanup
:link: adapters-recipe-reasoning
:link-type: ref
Strip intermediate reasoning tokens before scoring.
:::
:::{grid-item-card} Custom System Prompt (Chat)
:link: adapters-recipe-system-prompt
:link-type: ref
Enforce a standard system prompt for chat endpoints.
:::
:::{grid-item-card} Request Parameter Modification
:link: adapters-recipe-response-shaping
:link-type: ref
Standardize request parameters across endpoint providers.
:::
:::{grid-item-card} Logging Caps
:link: adapters-recipe-logging
:link-type: ref
Control logging volume for requests and responses.
:::
::::
```{toctree}
:maxdepth: 1
:hidden:
Reasoning Cleanup
Custom System Prompt (Chat)
Request Parameter Modification
Logging Caps
```
(adapters-recipe-reasoning)=
# Reasoning Cleanup
Use the reasoning adapter to remove intermediate thoughts from model outputs before scoring.
```python
from nemo_evaluator import (
    ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure completions endpoint
completions_url = "http://0.0.0.0:8080/v1/completions/"
api_endpoint = ApiEndpoint(url=completions_url, type=EndpointType.COMPLETIONS, model_id="megatron_model")

# Configure adapter with reasoning extraction
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="reasoning",
            enabled=True,
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>"
            }
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="gsm8k", output_dir="results")
results = evaluate(target_cfg=target, eval_cfg=config)
```
## Configuration Parameters
Set both `start_reasoning_token` and `end_reasoning_token` to match your model's delimiters. The reasoning interceptor removes content between these tokens from the final response before scoring.
Optional parameters:
- `include_if_not_finished` (default: `True`): Include reasoning content if reasoning is not finished (end token not found)
- `enable_reasoning_tracking` (default: `True`): Enable reasoning tracking and logging
- `add_reasoning` (default: `True`): Whether to add reasoning information to the response
- `migrate_reasoning_content` (default: `False`): Migrate `reasoning_content` field to `content` field with tokens
Reasoning statistics (word counts, token counts, completion status) are automatically tracked and logged when enabled.
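The stripping step can be sketched with a regular expression. This mirrors the cleanup the interceptor performs but is not its actual implementation; the `<think>` delimiters are an assumed example:

```python
import re

def strip_reasoning(text: str, start: str = "<think>", end: str = "</think>") -> str:
    """Remove reasoning spans delimited by start/end tokens (illustrative)."""
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", text, flags=re.DOTALL).strip()

cleaned = strip_reasoning("<think>2+2 is 4</think>The answer is 4.")
assert cleaned == "The answer is 4."
```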
Refer to {ref}`adapters-configuration` for all interceptor options and defaults.
(adapters-recipe-logging)=
# Logging Caps
Limit logging volume during evaluations to control overhead.
```python
from nemo_evaluator import (
    ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure completions endpoint
completions_url = "http://0.0.0.0:8080/v1/completions/"
api_endpoint = ApiEndpoint(url=completions_url, type=EndpointType.COMPLETIONS, model_id="megatron_model")

# Configure adapter with logging limits
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="request_logging",
            enabled=True,
            config={"max_requests": 5}  # Limit request logging
        ),
        InterceptorConfig(
            name="response_logging",
            enabled=True,
            config={"max_responses": 5}  # Limit response logging
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="hellaswag", output_dir="results")
results = evaluate(target_cfg=target, eval_cfg=config)
```
Use the following tips to control logging caps:
- Include `request_logging` and `response_logging` interceptors to enable logging
- Set `max_requests` and `max_responses` in the interceptor config to limit volume
- Omit or disable interceptors to turn off logging for that direction
- Use low limits for quick debugging, and increase when needed
Refer to {ref}`adapters-configuration` for all `AdapterConfig` options and defaults.
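The caps amount to a per-direction counter that stops recording once the limit is hit; a minimal sketch of that behavior (illustrative only, not the SDK's interceptor class):

```python
class CappedLog:
    """Record at most `max_items` entries, then silently drop the rest."""

    def __init__(self, max_items: int):
        self.max_items = max_items
        self.entries = []

    def record(self, entry) -> bool:
        if len(self.entries) >= self.max_items:
            return False  # cap reached, nothing logged
        self.entries.append(entry)
        return True

log = CappedLog(max_items=5)
logged = sum(log.record({"request_id": i}) for i in range(100))
print(logged)  # -> 5
```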
(adapters-recipe-response-shaping)=
# Request Parameter Modification
Standardize request parameters across different endpoint providers.
```python
from nemo_evaluator import (
    ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure completions endpoint
completions_url = "http://0.0.0.0:8080/v1/completions/"
api_endpoint = ApiEndpoint(url=completions_url, type=EndpointType.COMPLETIONS, model_id="megatron_model")

# Configure adapter with payload modification for response shaping
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="payload_modifier",
            enabled=True,
            config={
                "params_to_add": {"temperature": 0.0, "max_new_tokens": 100},
                "params_to_remove": ["max_tokens"]  # Remove conflicting parameters
            }
        ),
        InterceptorConfig(
            name="request_logging",
            enabled=True,
            config={"max_requests": 10}
        ),
        InterceptorConfig(
            name="response_logging",
            enabled=True,
            config={"max_responses": 10}
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="lambada", output_dir="results")
results = evaluate(target_cfg=target, eval_cfg=config)
```
Guidance:
- Use the `payload_modifier` interceptor to standardize request parameters across different endpoints
- Configure `params_to_add` in the interceptor config to add or override parameters
- Configure `params_to_remove` in the interceptor config to eliminate conflicting or unsupported parameters
- Configure `params_to_rename` in the interceptor config to map parameter names between different API formats
- Use `request_logging` and `response_logging` interceptors to monitor transformations
- Keep transformations minimal to avoid masking upstream issues
- The payload modifier interceptor works with both chat and completions endpoints
(bring-your-own-endpoint-hosted)=
# Hosted Services
Use existing hosted model APIs from cloud providers without managing your own infrastructure. This approach offers the fastest path to evaluation with minimal setup requirements.
## Overview
Hosted services provide:
- Pre-deployed models accessible via API
- No infrastructure management required
- Pay-per-use pricing models
- Instant availability and global access
- Professional SLA and support
## NVIDIA Build
NVIDIA's catalog of ready-to-use AI models with OpenAI-compatible APIs.
### NVIDIA Build Setup and Authentication
```bash
# Get your NGC API key from https://build.nvidia.com
export NGC_API_KEY="nvapi-your-ngc-api-key"
# Test authentication
curl -H "Authorization: Bearer $NGC_API_KEY" \
  "https://integrate.api.nvidia.com/v1/models"
```
Refer to the [NVIDIA Build catalog](https://build.nvidia.com) for available models.
### NVIDIA Build Configuration
#### Basic NVIDIA Build Evaluation
```yaml
# config/nvidia_build_basic.yaml
defaults:
  - execution: local
  - deployment: none  # No deployment needed
  - _self_

target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.2-3b-instruct
    api_key_name: NGC_API_KEY  # Name of environment variable

execution:
  output_dir: ./results

evaluation:
  overrides:
    config.params.limit_samples: 100
  tasks:
    - name: ifeval
```
#### Multi-Model Comparison
For multi-model comparison, run separate evaluations for each model and compare results:
```bash
# Evaluate first model
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
  -o execution.output_dir=./results/llama-3.2-3b

# Evaluate second model
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o target.api_endpoint.model_id=meta/llama-3.1-70b-instruct \
  -o execution.output_dir=./results/llama-3.1-70b

# Gather the results
nemo-evaluator-launcher export --dest local --format json
```
## OpenAI API
Direct integration with OpenAI's GPT models for comparison and benchmarking.
### OpenAI Setup and Authentication
```bash
# Get API key from https://platform.openai.com/api-keys
export OPENAI_API_KEY="your-openai-api-key"
# Test authentication
curl -H "Authorization: Bearer $OPENAI_API_KEY" \
  "https://api.openai.com/v1/models"
```
Refer to the [OpenAI model documentation](https://platform.openai.com/docs/models) for available models.
### OpenAI Configuration
#### Basic OpenAI Evaluation
```yaml
# config/openai_basic.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

target:
  api_endpoint:
    url: https://api.openai.com/v1/chat/completions
    model_id: gpt-4
    api_key_name: OPENAI_API_KEY  # Name of environment variable

execution:
  output_dir: ./results

evaluation:
  overrides:
    config.params.limit_samples: 100
  tasks:
    - name: ifeval
```
#### Cost-Optimized Configuration
```yaml
# config/openai_cost_optimized.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

target:
  api_endpoint:
    url: https://api.openai.com/v1/chat/completions
    model_id: gpt-3.5-turbo  # Less expensive model
    api_key_name: OPENAI_API_KEY

execution:
  output_dir: ./results

evaluation:
  overrides:
    config.params.limit_samples: 50  # Smaller sample size
    config.params.parallelism: 2  # Lower parallelism to respect rate limits
  tasks:
    - name: mmlu_pro
```
## Troubleshooting
### Authentication Errors
Verify that your API key has the correct value:
```bash
# Verify NVIDIA Build API key
curl -H "Authorization: Bearer $NGC_API_KEY" \
  "https://integrate.api.nvidia.com/v1/models"

# Verify OpenAI API key
curl -H "Authorization: Bearer $OPENAI_API_KEY" \
  "https://api.openai.com/v1/models"
```
### Rate Limiting
If you encounter rate limit errors (HTTP 429), reduce the `parallelism` parameter in your configuration:
```yaml
evaluation:
  overrides:
    config.params.parallelism: 2  # Lower parallelism to respect rate limits
```
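If you drive an endpoint from your own scripts, the standard client-side remedy is exponential backoff with jitter. The sketch below shows the general pattern, not launcher functionality; inside NeMo Evaluator configs, lowering `parallelism` is the intended knob:

```python
import random
import time

def call_with_backoff(call, max_retries: int = 5, base_delay: float = 0.1):
    """Retry `call` with exponential backoff plus jitter.

    Illustrative only: RuntimeError stands in for an HTTP 429
    "Too Many Requests" error from the endpoint.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:
            # Wait 2^attempt * base_delay plus a random jitter, then retry
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("rate limited: retries exhausted")
```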
(bring-your-own-endpoint)=
# Bring-Your-Own-Endpoint
Deploy and manage model serving yourself, then point NeMo Evaluator to your endpoint. This approach gives you full control over deployment infrastructure while still leveraging NeMo Evaluator's evaluation capabilities.
## Overview
With bring-your-own-endpoint, you:
- Handle model deployment and serving independently
- Provide an OpenAI-compatible API endpoint
- Use either the launcher or core library for evaluations
- Maintain full control over infrastructure and scaling
## When to Use This Approach
**Choose bring-your-own-endpoint when you:**
- Have existing model serving infrastructure
- Need custom deployment configurations
- Want to deploy once and run many evaluations
- Have specific security or compliance requirements
- Use enterprise Kubernetes or MLOps pipelines
## Quick Examples
### Using Launcher with Existing Endpoint
```bash
# Point launcher to your deployed model
URL=http://your-endpoint:8000/v1/completions
MODEL=your-model-name

nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o target.api_endpoint.url=$URL \
  -o target.api_endpoint.model_id=$MODEL \
  -o deployment.type=none  # No launcher deployment
```
### Using Core Library
```python
from nemo_evaluator import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, evaluate
)

# Configure your endpoint
api_endpoint = ApiEndpoint(
    url="http://your-endpoint:8000/v1/completions",
    model_id="your-model-name"
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Run evaluation
config = EvaluationConfig(type="gsm8k", output_dir="results")
results = evaluate(eval_cfg=config, target_cfg=target)
```
## Endpoint Requirements
Your endpoint must provide OpenAI-compatible APIs:
### Required Endpoints
- **Completions**: `/v1/completions` (POST) - For text completion tasks
- **Chat Completions**: `/v1/chat/completions` (POST) - For conversational tasks
- **Health Check**: `/v1/health` (GET) - For monitoring (recommended)
### Request/Response Format
Requests and responses must follow the OpenAI API specification for compatibility with evaluation frameworks. See the {ref}`deployment-testing-compatibility` guide to verify your endpoint's OpenAI compatibility.
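As a quick sanity check before wiring an endpoint in, you can verify that its JSON responses have the minimal OpenAI chat-completions shape. A sketch (the full `curl` tests in the compatibility guide are more thorough):

```python
def is_openai_chat_response(payload: dict) -> bool:
    # Minimal structural check for an OpenAI-style chat completion:
    # a non-empty "choices" list whose first entry carries a "message"
    choices = payload.get("choices")
    if not isinstance(choices, list) or not choices:
        return False
    first = choices[0]
    return isinstance(first, dict) and "message" in first

sample = {"choices": [{"message": {"role": "assistant", "content": "Hi"}}]}
print(is_openai_chat_response(sample))  # -> True
print(is_openai_chat_response({"error": "bad request"}))  # -> False
```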
## Configuration Management
### Basic Configuration
```yaml
# config/bring_your_own.yaml
deployment:
  type: none  # No launcher deployment

target:
  api_endpoint:
    url: http://your-endpoint:8000/v1/completions
    model_id: your-model-name
    api_key_name: API_KEY  # Optional, needed for gated endpoints

evaluation:
  tasks:
    - name: mmlu
    - name: gsm8k
```
## Key Benefits
### Infrastructure Control
- **Custom configurations**: Tailor deployment to your specific needs
- **Resource optimization**: Optimize for your hardware and workloads
- **Security compliance**: Meet your organization's security requirements
- **Cost management**: Control costs through efficient resource usage
### Operational Flexibility
- **Deploy once, evaluate many**: Reuse deployments across multiple evaluations
- **Integration ready**: Works with existing infrastructure and workflows
- **Technology choice**: Use any serving framework or cloud provider
- **Scaling control**: Scale according to your requirements
## Getting Started
1. **Choose your approach**: Select from manual deployment, hosted services, or enterprise integration
2. **Deploy your model**: Set up your OpenAI-compatible endpoint
3. **Configure NeMo Evaluator**: Point to your endpoint with proper configuration
4. **Run evaluations**: Use launcher or core library to run benchmarks
5. **Monitor and optimize**: Track performance and optimize as needed
```{toctree}
:maxdepth: 1
:hidden:
Hosted Services
Testing Endpoint Compatibility
```
(deployment-testing-compatibility)=
# Testing Endpoint Compatibility
This guide helps you test your hosted endpoint's compatibility with the OpenAI API using `curl` requests for different task types. Models deployed using `nemo-evaluator-launcher` should pass these tests.
To test your endpoint, run the provided command and check the model's response. Make sure to set `FULL_ENDPOINT_URL`, `API_KEY`, and `MODEL_NAME` to your own values.
:::{tip}
If your model is not gated, omit the line with the authorization header:
```bash
-H "Authorization: Bearer ${API_KEY}"
```
from the commands below.
:::
## General Requirements
Your endpoint should support the following parameters:
- `top_p`
- `temperature`
- `max_tokens`
## Chat endpoint testing
```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"
curl -X POST ${FULL_ENDPOINT_URL} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Write Python code that can add a list of numbers together."
      }
    ],
    "model": "'"$MODEL_NAME"'",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
```
## Completions endpoint testing
```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"
curl -X POST ${FULL_ENDPOINT_URL} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "prompt": "Write Python code that can add a list of numbers together.",
    "model": "'"$MODEL_NAME"'",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
```
## VLM chat endpoint testing
NeMo Evaluator supports the **OpenAI Images API** ([docs](https://platform.openai.com/docs/guides/images-vision#giving-a-model-images-as-input)) and **vLLM** ([docs](https://docs.vllm.ai/en/stable/features/multimodal_inputs.html)) formats, with the image provided as a **base64-encoded image** and the following content types:
- `image_url`
- `text`
```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"
curl -X POST ${FULL_ENDPOINT_URL} \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${API_KEY}" \
-H "Accept: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAQABADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigAooooA//9k="
}
},
{
"type": "text",
"text": "Describe the image:"
}
]
}
],
"model": "'"$MODEL_NAME"'",
"stream": false,
"max_tokens": 16,
"temperature": 0.0,
"top_p": 1.0
}'
```
## Function calling testing
We support OpenAI-compatible function calling ([docs](https://platform.openai.com/docs/guides/function-calling?api-mode=responses)).
Function calling request:
```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"
curl -X POST ${FULL_ENDPOINT_URL} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Accept: application/json" \
  -d '{
    "model": "'"$MODEL_NAME"'",
    "stream": false,
    "max_tokens": 16,
    "temperature": 0.0,
    "top_p": 1.0,
    "messages": [
      {
        "role": "user",
        "content": "What is the slope of the line which is perpendicular to the line with the equation y = 3x + 2?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "find_critical_points",
          "description": "Finds the critical points of the function. Note that the provided function is in Python 3 syntax.",
          "parameters": {
            "type": "object",
            "properties": {
              "function": {
                "type": "string",
                "description": "The function to find the critical points for."
              },
              "variable": {
                "type": "string",
                "description": "The variable in the function."
              },
              "range": {
                "type": "array",
                "items": {
                  "type": "number"
                },
                "description": "The range to consider for finding critical points. Optional. Default is [0.0, 3.4]."
              }
            },
            "required": ["function", "variable"]
          }
        }
      }
    ]
  }'
```
## Audio endpoint testing
We support audio input with the following content types:
- `audio_url`
Example:
```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"
curl -X POST ${FULL_ENDPOINT_URL} \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${API_KEY}" \
-H "Accept: application/json" \
-d '{
"max_tokens": 256,
"model": "'"$MODEL_NAME"'",
"messages": [
{
"content": [
{
"audio_url": {
"url": "data:audio/wav;base64,UklGRqQlAABXQVZFZm10IBAAAAABAAEAgLsAAAB3AQACABAAZGF0YYAlAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA////////////////////////////////////////////////////////AAD/////////////////////AAD//wAAAAAAAAAA//////////////////8AAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAA////////AAAAAAAAAAAAAAAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAD/////////////AAAAAAAA////////////////////////////////////////////////////////AAAAAAAAAAD/////////////AAD//wAAAAAAAAAA//////////////////8AAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAA/////wAA////////AAAAAAAAAAAAAP//////////////////AAAAAAAAAAD///////////////////////8AAAAAAAD/////////////////////AAAAAP//////////////////////////AAAAAAAAAAAAAAAA/////wAAAAAAAAAAAAAAAAAA//////////8AAAAAAAAAAAAAAAAAAAAAAAD//wAA////////AAAAAAAAAAAAAAAAAAAAAP////8AAAAA////////AAAAAAAAAAAAAP//////////////////AAAAAAAAAAD//wAAAAD///////8AAAAAAAAAAAAA//////////////////////////8AAAAA/////////////////////wAAAAAAAP///////////////////////wAAAAAAAAAA////////////////AAAAAAAAAAAAAP//////////AAAAAP////8AAAAA////////AAAAAAAA/////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAD///////8AAAAAAAAAAAAAAAD///////8AAP////8AAAAAAAAAAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAAAAAAAP////8AAAAA////////AAAAAAAAAAAAAAAAAAAAAAAAAAD/////////////////////AAAAAP//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA//8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD///////8AAAAA/////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD/////////////////////AAAAAAAAAAAAAAAA//////////////////////////////////8AAAAAAAAAAAAAAAAAAP//AAAAAAAAAAAAAAAAAAAAAP///////wAAAAAAAAAAAAAAAAAAAAD///////8AAAAAAAD//////////////////////////////////////////////////////////wAA//8AAAAAAAAAAP///////wAAAA
AAAAAAAAAAAAAAAAD///////////////////////////////////////8AAAAAAAAAAAAAAAAAAP///////wAAAAAAAAAA//8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP//AAAAAAAAAAAAAAAA//////////8AAAAAAAAAAP//////////////////////////AAAAAAAAAAAAAAAAAAAAAP////////////////////8AAAAA//////////////////////////////////////////8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD//wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP///////////////////////////////wAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAD///////8AAAAAAAD/////////////////////////////////////////////////////////////////////AAAAAP////////////////////8AAAAAAAD/////AAAAAAAAAAAAAAAAAAD/////AAAAAP///////////////////////////////wAAAAD///////////////////////////////////////8AAP///////wAAAAD/////////////////////////////////////////////////////AAAAAP//////////////////////////AAAAAAAAAAAAAAAAAAAAAP//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAP/////+////////////AAAAAAAA//8AAAAAAAAAAP////8AAP//////////////////AAAAAAAAAAAAAAAAAAAAAP//AAAAAP///////wAAAAD/////AAAAAAAA/////wAA//////////8AAAAA//8AAAAA/////////////////////wAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAD//////////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA//8AAAAAAAAAAP///////wAAAAAAAAAAAAAAAAAAAAD//////////wAAAAAAAP//AAAAAP/////////////////////////////////////////////////////////////////////+//7//v//////////////////////////////////////AAD/////////////////////////////////////////////////////AAAAAAAA//8AAAAA/////////v/+//7//v//////AAAAAAAAAAAAAAAAAAD//////v/+//7//v/+//3//f/+///////+//7//f/+//7//f/8//3//f/+//7///8AAAAAAAAAAAAAAQAAAP7//f/+//7//f/8//z//f/9//z/+//8//z//P/7//v/+//7//v/+v/6//r/+v/7//z//f/9//z//P/8//z/+//5//n/+P/4//j/+P/5//r/+v/6//r/+//8//z/+v/5//n/+f/4//f/9v/1//X/9v/4//n/+f/4//n/+v/7//v/+f/4//f/9v/2//f/+f/6//n/+P/4//n/+f/5//j/+P/5//n/+v/6//n/+P/4//n/+//7//r/+P/4//n/+f/4//b/9f/2//j/+P/4//f/9//3//b/9P/z//P/9P/0//T/9f/1//b/9v/2//f/+P/3//b/9v/2//b/9v/1//X/9v/4//n/+f/5//n/+P/3//f/9//2//X/9f/2//f/9//5//z///8AAAAAAAD///7//P/8//7///8AAAEAAgAFAAUAAwABAAIAAwACAP///f/9////AQADAAQABgAHAAcABgAFAAQAAgACAAMABAAEAAQABQAJAAsADAALAAsACg
AJAAYABAADAAIAAQAAAP///v/9//7/AQAEAAUAAgD/////AAABAAAAAAAAAAEAAQAAAAAAAgAEAAUAAwABAAAAAQABAAEAAAD//wAAAQABAP7//P/6//r/+v/6//n/9//2//f/+v/9//3/+//6//7/AgACAP///v8CAAYABgACAAAAAwAFAAEA/P/6//7/AAD+//z//f8BAAMAAgAAAAIABAADAAAA/v8AAAMAAwABAP///v8AAAIAAQD+//3//f/9//3/+//7//z//f/6//j/+P/6//z/+f/1//b/+f/7//f/9P/0//j/+//6//j/+P/5//z//P/8//v/+f/4//j/+v/8//z/+//5//j/+f/8//3//P/6//f/9v/1//P/8f/v/+//8f/0//X/9v/2//f//P8BAAMAAAD8//z/AQAEAAIA+//4//v///////3//f8AAAMAAQD9//v//v8DAAUABQAFAAcACQAMAAwACwAJAAgABwAIAAgABwAFAAUABgAHAAcABwAHAAgACwANAA8AEAAPAA8AEAASABIADQAIAAcADAAPAA4ACgAJAA0ADgAJAAQABAALABAADgAJAAkADwAVABYAEwAPAA4ADwAOAAwACAAEAAEA//////7//v/+/wEABAAHAAcABgAEAAQABQAIAAkABwAFAAMAAwAEAAUABwAIAAYAAwADAAYACQAHAAEA/f8BAAUABQD///z//v8BAAAA+//5//r//P/7//n/+v/7//v/+f/3//f/+v/9//z/+f/4//v/AAACAP//+//8/wAAAwABAAAAAgAHAAYAAQD9/wAAAwD///j/9//9/wMAAQD9/wAACQAOAAsABgAIAA8AEQAMAAcACAALAAwACAADAP///f/9//7//v/8//n/+P/7//7/AAD+//3//f///wAA///9//3/AAAEAAYAAwD///7/AQAEAAMAAgACAAQAAwAAAP7/AgAIAAsACAADAP///f/6//n/+f/5//j/9//4//z/AAADAAYADAARABIAEAAOAA8AEAAPAA0ADAANABAAFAAYABsAHAAdAB4AHwAbABQADwAQABQAFQASABEAFwAgACYAJgAkACUAJgAjAB4AHAAeACMAJwArADAAMwAyAC8ALAAqACcAIwAgAB8AHgAcABkAGAAZABsAHgAhACIAIQAcABgAGAAZABcAEwARABQAGAAaABgAGAAYABgAFgAUABQAEwAOAAcABAAHAAsACgAHAAgADgATABEACAACAAMACQANAAoABAAFAA4AFgAWAA8ACAAJAA4AEgAUABQAEwASABIAFAAWABUAEgASABYAFgAMAPj/4f/S/83/z//T/9P/zv/H/8L/wP+6/6z/nP+Q/4r/hf99/3j/d/93/3X/dv9//4j/hv98/3X/ef98/3T/Zv9j/27/d/9x/2b/Y/9o/2r/Zf9h/2X/a/9r/2j/Zf9i/1n/UP9P/1X/Vv9P/0b/RP9H/0n/Sv9O/1P/U/9S/1j/YP9h/1f/VP9l/37/iv+H/4r/nf+v/6//ov+g/6v/sP+j/5P/lf+j/6n/nv+S/4//kP+K/37/d/93/3z/gP+G/5D/mv+h/6b/qv+r/6f/pP+n/7D/u//I/9f/5f/u//X/AAAPABkAGwAbAB4AJQAnACUAJQAtADYAOQA2ADIALgAmABgACQD9//H/4v/U/83/z//V/9r/3v/j/+r/8/8AABIAJAAxADoAQABEAEkATwBVAFYAUABKAEsAUgBWAFIATgBQAFQAUQBJAEQARgBIAEMAPAA5ADcALwAiABgAFgAXABQADQAMABEAFQATAA0ABQD7/+z/2f/J/7//vP+8/73/vP+5/7b/tf+y/6r/mv+G/2//Wv9J/z7/NP8k/xL/Bv8F/wP/+v7s/uf+7f7x/uf+2P7U/uD+8P76/gH/DP8X/xz/Gv8U/w//Cv8J/wz/Df8E//H+4/7n/vL+9P7p/t7+3f7i/uL+3P7Y/tj+3P
7g/uf+7/7z/vL+8f73/v7+Av8B/wH/Cv8a/yz/OP88/zv/Pv9J/1b/Wf9U/1b/Zv93/3b/aP9k/3X/iP+H/3T/Zv9o/3T/fP+B/4n/k/+X/5f/lv+V/5P/jf+L/5H/nf+p/7T/wf/N/9r/6P/6/wwAGQAkADMARQBPAE4ATABUAGIAZwBiAF4AYQBmAGQAYABlAHEAfAB9AHwAfgCGAJEAowC8ANUA5gDuAPcAAwENARABDAEKAQsBCgEEAf0A/wASATIBUwFnAWsBZgFhAV0BVAFDATIBJwEgARkBDwEFAf4A+gD6APoA8ADdAMsAygDaAO4A/wAVATABRAFBATMBNQFMAVgBQwEfAREBGAEWAf4A6gDzAAMB9wDSAL0AxQDIAKYAdQBkAHcAggBjADYAJwA7AEcALgAGAPb/CwApADEAHgAGAAEACQAIAO//0v/L/9j/1/+x/3//cf+M/6r/q/+e/6X/vv/J/7z/tv/O/+z/6f/K/7r/zf/i/9D/ov+E/4v/mf+P/3T/bP+K/7j/1P/U/8z/z//i//j/BwARABsAKQA2AD4ARQBPAFsAYwBmAGsAdAB4AG4AXABUAF8AdgCLAJoAqQC8AMoAzwDNAMwA0ADWAN8A6QDxAPQA9gD9AAcBDwETARkBJAEuAS0BJQEhASMBHwEOAfoA8gD1APIA4ADMAMgA0ADTAMcAswCfAIsAbgBKACwAHAATAAQA7v/a/8//z//S/9P/zf++/6r/mP+O/4b/e/9t/2b/aP9t/2r/Yf9a/1f/T/86/xb/5f6r/mv+Mf4K/vT94/3L/a39l/2Q/ZT9mP2U/Yb9cP1W/Tn9Fv3v/M78vvy+/MX8xPy8/Lf8tvyw/KD8jPx//Hn8cfxf/Eb8LPwX/An8BfwI/An8BfwE/A38GfwT/Pr74vvg++z76fvL+6T7kPuU+5z7mPuM+4j7kvui+677svu6+8376fsD/A/8D/wK/AP88vvW+7z7s/u4+7n7rfui+6n7vPvJ+8r7zPvY++H71/u9+6r7qvu4+8X7z/vZ++D73fvP+8H7uPuz+7L7uPvH+9b73/vp+/z7Gvwz/EH8Svxa/Gr8cPxq/Gf8dPyO/Kb8uvzO/On8CP0o/Ur9c/2e/cL92v3s/QT+JP5F/mP+gf6n/tb+Cf86/2f/k//B//D/HwBQAIQAugDtAB8BVQGTAdcBGwJbApgC0gIHAzcDYgOMA7UD4AMMBDYEWARxBIkEqQTRBPkEHQVCBWsFkQWuBcUF4wUMBjQGVAZvBpMGwgbxBhsHQAdnB40HrQfIB94H6gfpB+YH7Qf+BwoIBgj+BwAICwgMCP8H8QfyB/wH/QfsB9QHwAevB5cHewdiB0sHMAcOB+kGxQaiBoEGZAZQBj0GIgb+BdsFwQWwBaAFjAVwBUwFIgX+BOIExwShBHQETAQrBAMEzAOVA3IDZANUAzADAwPfAsUCpQJ4AkUCGQL0AdEBqgGCAVkBLwEGAeEAvQCUAGQANQAKAN7/rf97/07/Kf8A/83+kv5V/hv+4/2s/Xf9Pf3//MX8lfxt/EH8Dvzc+7P7jftg+yz7+/rT+q/6ifpi+kH6JPoF+uT5wvmd+W/5O/kK+eb4xfib+Gb4NPgP+PL30/ew95D3c/dU9y33Avfe9sX2tPaf9oX2aPZS9kj2R/ZM9lb2afaB9pP2m/aj9rr24vYM9yr3RPdp95r3xvfl9wD4J/ha+If4pPi6+N34E/lQ+YX5r/nU+QH6Nvpr+pn6xfr8+kT7kfvU+w38Tfyg/Pj8QP1x/Z791/0b/lr+kP7G/gb/Tf+U/9v/IgBoAKkA5QAfAVcBhwGvAdcBBAIzAlkCdwKWArwC5wIIAxoDHwMeAx4DIAMjAyIDIAMlAzgDUANfA2ADXwNlA2sDYQNCAyADCwP9AuMCvQKbAowChwJ5AlwCOAIXAvYB0wGwAZIBdgFWATUBGQEBAeQAwwCqAJ0AkgB5AFgAQgBAAEMAPQAwACwAMg
A4ADkAOwBDAE4AWABpAIcArQDRAPEAFQFAAWgBhAGdAcAB6wEOAiUCOgJaAoQCsALcAg0DQQNyA58DzAP/AzIEZASbBNoEGAVKBXgFrgXuBSgGUQZ0BqAG0Qb4BhMHLwdVB3sHlgesB8oH7gcNCCIINwhSCGgIawhiCGAIawh6CH0IdAhnCF0IVAhJCD0ILQgaCAcI+QfoB80HqAeGB28HXAc+BxYH8QbZBscGrwaMBmkGSwYzBhkG/QXjBcsFsgWXBXsFZAVPBTcFGQX2BNcEwASwBJ4EhgRtBFsEUQRKBEIEPAQ8BEMESQRMBE0EUgRfBHUEkQSrBL0ExgTOBNwE7gT8BAQFBwUNBRcFJQU0BUEFSgVWBWkFfwWQBZcFnwWzBdIF7AX8BQsGIQY6BksGVQZiBncGiAaNBosGjQaUBpsGpAa6BtsG+gYJBw0HDwcPBwkH/gb4BvUG6QbQBrcGrAamBpUGdgZbBk8GRAYqBgUG6gXdBc8FtQWTBXYFXQU+BRoF+QTgBMkErASLBG0ETgQqBAME3QO4A5ADaANHAywDCwPdAqwChAJjAjsCBQLNAaUBigFqAToBBAHWALUAmQB3AFEAKwAHAOf/yP+r/47/cv9W/zn/Gf/6/t3+w/6o/oj+Zf5F/if+Cf7p/dH9yf3G/bT9hv1H/RD96vzE/Iv8RfwQ/Pj76/vO+577cftb+0/7NPsC+8n6mvp0+k/6KPoF+uj5z/m4+aX5k/l7+VX5Kfn/+Nv4sviB+E/4Ifj298f3k/de9yz3/vbS9qj2f/ZS9iD27/XF9aH1fvVe9UT1M/Uj9RL1AvX69Pn09/Tw9Ov07fT09Pn0+/T+9AX1DfUS9RT1G/Um9TP1O/VA9Uf1VfVp9X31ivWU9aD1svXG9db14/X29RL2MvZO9mb2iPa59vH2JPdM93P3offT9wD4I/hD+Gn4k/i/+On4E/k/+Wr5kPmz+dn5A/op+kf6YPqB+qf6xvrZ+ub6/Pob+zn7Tvti+377nvu1+8D7xfvM+9P71fvS+837yfvE+737ufu6+7j7rvue+5D7h/t++3L7Zvte+1r7VftQ+0r7Rfs/+zj7L/sg+wv7+Prq+t/6zfqz+qH6oPqm+p/6jfqD+o76o/qu+q36r/q7+sj6yvrJ+tb68voO+yD7Kvs3+0r7YPt5+5j7tvvM+9v77vsJ/Cn8Rvxi/IP8qPzJ/OH89vwN/Sf9P/1W/W39gf2Q/Z39sf3P/er9+v0H/hv+OP5O/lX+Vf5g/nj+k/6m/rP+vf7H/tH+3/7w/v3+/f70/u7+8P7t/tr+vf6q/qr+r/6i/n/+VP4z/h7+B/7k/bj9kv19/XX9bP1Y/Tz9Kf0m/Sj9Hv0I/fP86/zr/OL8zfy3/Kz8rvyz/Lb8vPzG/ND82fzm/Pr8D/0i/TP9Sv1n/YP9l/2p/b/92v30/Q3+KP5F/mL+gf6j/sr+8/4Z/zr/Wf94/5v/yP///zgAaQCSAL8A+AA1AWgBkwHDAf4BNwJjAogCsgLkAhMDOwNiA48DvQPmAw4EOwRqBJEEsgTaBA0FPgVdBW8FhAWiBbwFxwXMBdcF6gX5BQAGBAYIBgoGDAYRBhwGJAYjBhwGGwYlBi4GLgYnBiQGJgYpBicGHQYKBu0F0QW9Ba8FmAVyBU8FQgVGBUAFJQUIBQIFDgUSBQIF7QToBO4E6wTYBMIEuwTABMcEywTQBNcE2QTUBM8EzwTSBM0EvwSyBKsEqASeBI0EgAR6BHkEcwRmBFYERwQ7BDIELQQqBCcEIgQbBBgEGAQaBBoEGAQVBBIEDwQOBBgELQRHBFwEaQRwBHQEdQRzBHUEgQSUBKYEsgS4BLsEuwS7BL8ExQTCBK0EjwR6BHgEfAR0BF0ERgQ/BEAEOgQmBBAEBQQBBPoD6wPXA8cDvgO5A7QDrQOlA6ADoQOkA58DjQN2A2UDXANSA0MDMgMlAxkDBgPuAtsC1ALRAscCtwKpAp8CkgJ+AmsCYw
JfAlICNgIWAvsB4wHFAaUBjAF5AWUBTwE/ATwBPAEvARUB/gD1APEA5gDYANUA3wDnAOIA1gDOAMwAxgC3AKoAqACrAKUAjgB1AGgAbgB+AIoAjwCTAJoApACpAKYAngCdAKIApQCcAIsAfgB8AH4AfQB7AH4AhgCKAIcAfwB7AHwAfgB8AHgAdQBzAHIAcgB1AHYAcwBsAGcAaABsAG8AbQBpAGYAZQBmAGgAawBvAHIAcgBwAHAAeQCMAKQAtgC9AL4AvwDDAMcAzQDWAN0A2QDJALsAugDBAMAAtACjAJkAlACKAH0AeAB+AIcAigCGAIQAhgCKAIwAjwCSAJIAjwCSAJwApAChAJkAlwCfAKYAowCfAKQAswC8ALsAtgCzAK4AnwCPAIgAjQCOAIQAdgBwAHIAcQBoAF8AXABbAFUATABBADQAHAD6/9b/u/+p/5n/iP91/17/RP8n/w3/+v7r/tz+yv64/qX+kP54/mL+Tf44/iH+DP78/e/93f3H/bP9pP2Z/Yv9ff11/XD9aP1W/UP9OP00/TH9Lv0s/S79L/0o/R39F/0b/R/9IP0i/S79QP1N/U39SP1F/Ur9Tv1P/Uv9SP1E/T79Nv0v/S39L/0y/TP9Mv0w/TD9MP0t/Sf9H/0b/Rb9EP0L/Q79Hf0t/TX9Mv0x/Tr9TP1a/V79Wv1W/Vb9WP1a/Vn9U/1I/Tn9KP0Z/Q39Av36/PP86fzZ/MX8sfyl/J/8nfya/Jb8kvyT/JX8lvyU/JT8lfyY/Jr8nfyg/J78lfyJ/Ib8kPyf/Kb8pPyk/Kz8uPzA/MH8wfzF/Mv8z/zR/ND80PzT/N387Pz3/Pr8/PwJ/Rv9KP0q/Sv9Nv1G/U39S/1N/Vv9cf2B/Yr9lf2m/bX9vv3F/dP95/34/QD+Af79/ff98v30/fz9Av4B/vz9+f34/fP97f3u/fn9A/4E/v79/P0C/gf+Bv4F/g/+I/40/jr+O/48/j7+PP44/jn+QP5H/k3+U/5e/mj+b/50/nv+hv6P/pX+n/6u/rz+wf6//sT+1/7w/gP/Dv8c/zb/U/9p/3n/i/+n/8X/2v/o//b/DgAvAFIAcQCMAKYAwwDlAAkBKgFFAVwBcwGKAZ0BqwG6Ac8B5AH0Af4BCAIXAikCOgJIAlMCWwJdAmACawKAApUCogKoArACvwLTAugC+AIEAwwDFQMhAy8DOwNDA0sDVgNmA3UDgAOHA5ADmwOlA64DuAPEA9QD5APxA/gD/gMHBBUEIAQlBCkEMwREBFQEWwRcBGEEagR0BH0EhwSSBJgElgSVBJwEqQSzBLcEvQTHBM4EywTBBLoEtQSsBJwEkASMBIsEhAR4BG4EaAReBFAEQgQ6BDQEKwQgBBoEGAQSBAYE/AP4A/kD9APpA9wD0QPIA7wDsAOlA5kDiwN8A2sDWANBAyoDFwMIA/gC4wLOAr4CswKpAp4ClQKMAoECcwJjAlcCTAJBAjICIQIUAgkC/QHwAeAB1AHOAcwBygHIAckBzgHWAd0B4QHlAewB8QHwAekB4gHgAeEB5AHoAe4B8QHuAegB6AHxAfsBAQIEAgkCDgINAgUC/wH/AQICAAL+AQECCwIQAgwCCgITAicCNQI1Ai4CLAIxAjQCMgIuAjECNwI3AjECLQIvAjMCMwIzAjUCOgI6AjYCNQI5AjoCMgImAiICJwIpAiECFAIMAggC/wHxAegB6wH0AfkB+gH7AfsB9AHpAeUB7gH3AfYB7QHpAekB5gHgAdwB4AHgAdcBygHHAc0BzgHIAcUBygHKAbsBqAGkAa0BrwGhAZMBlAGcAZsBkAGJAY0BjgGAAWsBXwFbAVMBQgE1ATIBMQEmARMBAwH4AOgA0QC9ALMAsAClAJAAegBuAGwAawBmAFkASQA5ACoAGQAGAPL/4P/P/77/rP+a/4v/ff9u/17/UP8//yz/GP8J/wD/9v7j/s7+vv62/q3+oP6T/or+hv59/nH+Zf5e/ln+UP
5E/jv+Nv4x/in+HP4N/vv95v3Q/bz9rP2d/ZD9g/11/Wf9Wf1J/Tj9Jv0X/Q79Cv0D/ff85fzT/Mb8vfy3/LL8rvyv/LD8r/yo/J/8m/yc/J78m/yT/Ib8dvxj/E/8Pfwu/CL8FPwH/Pf74/vL+7j7s/uz+6z7l/uC+3r7fPt4+2z7Zftp+2/7aPtY+037T/tR+0r7QPs++0H7QPs5+zb7P/tL+1H7T/tP+1P7Wftc+1/7Y/tn+2j7aPtn+2j7aPtn+2b7Zvtn+2j7bPtx+3j7fvuG+477kvuT+5P7lPuP+4T7eft7+4b7kPuU+5n7p/u0+7P7pfuf+6j7tfuz+6X7nfuj+7H7vfvK+9779PsF/BD8H/wz/En8Wfxk/G78evyG/I/8l/yc/J/8ofyn/LL8uvy7/Lj8ufy//Mb8y/zQ/Nj84/zu/Pj8A/0Q/R/9L/0//Uz9Vv1e/Wj9dP2A/Yj9kP2b/a/9x/3c/en99P0F/hr+Lf46/kX+Uf5e/mX+av51/oj+nf6r/rn+0P7v/gr/Fv8d/y3/Qv9L/0P/Ov89/0b/SP9F/0n/Wf9r/3b/fP+F/47/kP+N/5D/mv+g/5z/l/+e/7L/w//H/8T/x//R/9z/4//t//7/FAAmAC4AMQA4AEYAVgBjAG4AegCLAJsAqQC3AMYA1gDiAOgA8AD+AAsBEgEWASIBNwFHAUsBSQFNAVcBXwFfAWEBawF2AXkBdQF0AXsBhQGPAZ8BswHEAcwB0QHeAfEB/wEEAgYCCgIPAhACEwIcAigCMQI1AjsCQwJCAjsCOQJGAlgCXQJWAlECWwJoAmoCYgJfAmQCaQJiAlYCTwJNAkkCQQI8Aj8CRwJNAlICXAJlAmQCWQJQAlICWgJaAlICTgI="
},
"type": "audio_url"
},
{
"text": "Please recognize the speech and only output the recognized content:",
"type": "text"
}
],
"role": "user"
}
],
"temperature": 0.0,
"top_p": 1.0
}'
```
(compatibility-log-probs)=
## Log-probabilities testing
For evaluation with log-probabilities, your `completions` endpoint must support the `logprobs` and `echo` parameters:
```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"
curl -X POST ${FULL_ENDPOINT_URL} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "prompt": "3 + 3 = 6",
    "model": "'"$MODEL_NAME"'",
    "max_tokens": 1,
    "logprobs": 1,
    "echo": true
  }'
```
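A successful response carries per-token log-probabilities for the echoed prompt. The sketch below sums them from a hypothetical response fragment; the field names follow the OpenAI completions `logprobs` format, and the numbers are made up:

```python
# Hypothetical response fragment in the OpenAI completions `logprobs` format
response = {
    "choices": [{
        "text": "3 + 3 = 6",
        "logprobs": {
            "tokens": ["3", " +", " 3", " =", " 6"],
            # The first token has no conditioning context, hence None
            "token_logprobs": [None, -0.5, -0.1, -0.05, -0.02],
        },
    }]
}

lp = response["choices"][0]["logprobs"]
total = sum(x for x in lp["token_logprobs"] if x is not None)
print(f"{total:.2f}")  # -> -0.67
```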
## Next Steps
- **Run your first evaluation**: Choose your path with {ref}`gs-quickstart`
- **Select benchmarks**: Explore available evaluation tasks
---
orphan: true
---
(bring-your-own-endpoint-manual)=
# Manual Deployment
Deploy models yourself using popular serving frameworks, then point NeMo Evaluator to your endpoints. This approach gives you full control over deployment infrastructure and serving configuration.
## Overview
Manual deployment involves:
- Setting up model serving using frameworks like vLLM, TensorRT-LLM, or custom solutions
- Configuring OpenAI-compatible endpoints
- Managing infrastructure, scaling, and monitoring yourself
- Using either the launcher or core library to run evaluations against your endpoints
:::{note}
This guide focuses on NeMo Evaluator configuration. For specific serving framework installation and deployment instructions, refer to their official documentation:
- [vLLM Documentation](https://docs.vllm.ai/)
- [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
- [Hugging Face TGI Documentation](https://huggingface.co/docs/text-generation-inference/)
:::
## Using Manual Deployments with NeMo Evaluator
Before connecting to your manual deployment, verify it's properly configured using our {ref}`deployment-testing-compatibility` guide.
### With Launcher
Once your manual deployment is running, use the launcher to evaluate:
```bash
# Basic evaluation against manual deployment
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o target.api_endpoint.url=http://localhost:8080/v1/completions \
  -o target.api_endpoint.model_id=your-model-name
```
#### Configuration File Approach
```yaml
# config/manual_deployment.yaml
defaults:
  - execution: local
  - deployment: none  # No deployment by launcher
  - _self_

target:
  api_endpoint:
    url: http://localhost:8080/v1/completions
    model_id: llama-3.1-8b
    # Optional authentication (name of environment variable holding API key)
    api_key_name: API_KEY

execution:
  output_dir: ./results

evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 100
    - name: gsm8k
      overrides:
        config.params.limit_samples: 50
```
### With Core Library
Direct API usage for manual deployments:
```python
from nemo_evaluator import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate
)

# Configure your manual deployment endpoint
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions",
    type=EndpointType.COMPLETIONS,
    model_id="llama-3.1-8b",
    api_key="API_KEY"  # Name of environment variable holding API key
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        parallelism=4
    )
)

# Run evaluation
results = evaluate(eval_cfg=config, target_cfg=target)
print(f"Results: {results}")
```
#### With Adapter Configuration
```python
from nemo_evaluator import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate,
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure the adapter with interceptors
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./cache",
                "reuse_cached_responses": True,
            },
        ),
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 10},
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 10},
        ),
    ]
)

# Configure the endpoint with the adapter
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions",
    type=EndpointType.COMPLETIONS,
    model_id="llama-3.1-8b",
    api_key="API_KEY",
    adapter_config=adapter_config,
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        parallelism=4,
    ),
)

# Run evaluation
results = evaluate(eval_cfg=config, target_cfg=target)
print(f"Results: {results}")
```
## Prerequisites
Before using a manually deployed endpoint with NeMo Evaluator, ensure:
- Your model endpoint is running and accessible
- The endpoint supports OpenAI-compatible API format
- You have any required API keys or authentication credentials
- Your endpoint supports the required generation parameters (see below)
### Endpoint Requirements
Your endpoint must support the following generation parameters for compatibility with NeMo Evaluator:
- `temperature`: Controls randomness in generation (0.0 to 1.0)
- `top_p`: Nucleus sampling threshold (0.0 to 1.0)
- `max_tokens`: Maximum tokens to generate
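As an illustrative sketch (the helper below is hypothetical and not part of NeMo Evaluator), you can range-check these generation parameters while building a request payload, before sending it to your endpoint:

```python
# Hypothetical helper -- not a NeMo Evaluator API -- that builds an
# OpenAI-style completions payload and validates the generation
# parameters an endpoint must accept.
def build_completion_payload(model, prompt, temperature=0.7, top_p=0.9, max_tokens=100):
    if not 0.0 <= temperature <= 1.0:
        raise ValueError(f"temperature must be in [0.0, 1.0], got {temperature}")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p must be in [0.0, 1.0], got {top_p}")
    if max_tokens < 1:
        raise ValueError(f"max_tokens must be positive, got {max_tokens}")
    return {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": False,
    }

payload = build_completion_payload(
    "llama-3.1-8b", "What is machine learning?",
    temperature=0.6, top_p=0.95, max_tokens=256,
)
```

You can then send `payload` to `POST /v1/completions` with any HTTP client to confirm the parameters are honored.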
## Testing Your Endpoint
Before running evaluations, verify your endpoint is working as expected.
::::{dropdown} Test Completions Endpoint
:icon: code-square
```bash
# Basic test (no authentication)
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "prompt": "What is machine learning?",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'

# With authentication
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-model-name",
    "prompt": "What is machine learning?",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
```
::::
::::{dropdown} Test Chat Completions Endpoint
:icon: code-square
```bash
# Basic test (no authentication)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {
        "role": "user",
        "content": "What is machine learning?"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'

# With authentication
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {
        "role": "user",
        "content": "What is machine learning?"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
```
::::
:::{note}
Each evaluation task requires a specific endpoint type. Verify your endpoint supports the correct type for your chosen tasks. Use `nemo-evaluator-launcher ls tasks` to see which endpoint type each task requires.
:::
## OpenAI API Compatibility
Your endpoint must implement the OpenAI API format:
::::{dropdown} Completions Endpoint Format
:icon: code-square
**Request**: `POST /v1/completions`
```json
{
  "model": "model-name",
  "prompt": "string",
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9
}
```
**Response**:
```json
{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [{
    "text": "generated text",
    "index": 0,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}
```
::::
::::{dropdown} Chat Completions Endpoint Format
:icon: code-square
**Request**: `POST /v1/chat/completions`
```json
{
  "model": "model-name",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}
```
**Response**:
```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "index": 0,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 10,
    "total_tokens": 25
  }
}
```
::::
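Before wiring an endpoint into an evaluation, it can help to sanity-check that its responses match the shapes above. The validator below is an illustrative sketch (hypothetical helper, not a NeMo Evaluator API):

```python
# Hypothetical helper: report missing fields in an OpenAI-style
# completions or chat-completions response dict.
def validate_openai_response(resp, chat=False):
    """Return a list of problems found in an OpenAI-style response dict."""
    problems = []
    for key in ("id", "object", "model", "choices", "usage"):
        if key not in resp:
            problems.append(f"missing top-level field: {key}")
    for i, choice in enumerate(resp.get("choices", [])):
        if chat:
            # Chat responses nest the text under choices[i].message.content
            if "content" not in choice.get("message", {}):
                problems.append(f"choices[{i}].message.content missing")
        elif "text" not in choice:
            problems.append(f"choices[{i}].text missing")
        if "finish_reason" not in choice:
            problems.append(f"choices[{i}].finish_reason missing")
    usage = resp.get("usage", {})
    for key in ("prompt_tokens", "completion_tokens", "total_tokens"):
        if key not in usage:
            problems.append(f"usage.{key} missing")
    return problems
```

An empty list means the payload has the fields evaluations typically rely on; any entries point at what your server needs to add.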
## Troubleshooting
### Connection Issues
If you encounter connection errors:
1. Verify the endpoint is running and accessible. Check the health endpoint (path varies by framework):
```bash
# For vLLM, SGLang, NIM
curl http://localhost:8080/health
# For NeMo/Triton deployments
curl http://localhost:8080/v1/triton_health
```
2. Check that the URL in your configuration matches your deployment:
- Include the full path (e.g., `/v1/completions` or `/v1/chat/completions`)
- Verify the port number matches your server configuration
- Ensure no firewall rules are blocking connections
3. Test with a simple curl command before running full evaluations
### Authentication Errors
If you see authentication failures:
1. Verify the environment variable has a value:
```bash
echo $API_KEY
```
2. Ensure the `api_key_name` in your YAML configuration matches the environment variable name
3. Check that your endpoint requires the same authentication method
### Timeout Errors
If requests are timing out:
1. Increase the timeout in your configuration:
```yaml
evaluation:
  overrides:
    config.params.request_timeout: 300  # 5 minutes
```
2. Reduce parallelism to avoid overwhelming your endpoint:
```yaml
evaluation:
  overrides:
    config.params.parallelism: 1
```
3. Check your endpoint's logs for performance issues
## Next Steps
- **Hosted services**: Compare with [hosted services](hosted-services.md) for managed solutions
- **Launcher-orchestrated deployment**: [Deploy](../launcher-orchestrated/index.md) models for evaluation with `nemo-evaluator-launcher`
---
orphan: true
---
(launcher-orchestrated-deployment)=
# Launcher-Orchestrated Deployment
Let NeMo Evaluator Launcher handle both model deployment and evaluation orchestration automatically. This is the recommended approach for most users, providing automated lifecycle management, multi-backend support, and integrated monitoring.
## Overview
Launcher-orchestrated deployment means the launcher:
- Deploys your model using the specified deployment type
- Manages the model serving lifecycle
- Runs evaluations against the deployed model
- Handles cleanup and resource management
The launcher supports multiple deployment backends and execution environments.
## Quick Start
```bash
# Deploy model and run evaluation in one command (Slurm example)
HOSTNAME=cluster-login-node.com
ACCOUNT=my_account
OUT_DIR=/absolute/path/on/login/node
nemo-evaluator-launcher run \
  -o execution.hostname=$HOSTNAME \
  -o execution.account=$ACCOUNT \
  -o execution.output_dir=$OUT_DIR \
  --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml
```
## Execution Backends
Choose the execution backend that matches your infrastructure:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`desktop-download;1.5em;sd-mr-1` Local Execution
:link: local
:link-type: doc
Run evaluations on your local machine against existing endpoints. **Note**: Local executor does **not** deploy models. Use Slurm or Lepton for deployment.
:::
:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Slurm Deployment
:link: slurm
:link-type: doc
Deploy on HPC clusters with Slurm workload manager. Ideal for large-scale evaluations with multi-node parallelism.
:::
:::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton Deployment
:link: lepton
:link-type: doc
Deploy on Lepton AI cloud platform. Best for cloud-native deployments with managed infrastructure and auto-scaling.
:::
::::
## Deployment Types
The launcher supports multiple deployment types:
### vLLM Deployment
- **Fast inference** with optimized attention mechanisms
- **Continuous batching** for high throughput
- **Tensor parallelism** support for large models
- **Memory optimization** with configurable GPU utilization
### NIM Deployment
- **Production-grade reliability** with enterprise features
- **NVIDIA optimized containers** for maximum performance
- **Built-in monitoring** and logging capabilities
- **Enterprise security** features
### SGLang Deployment
- **Structured generation** support for complex tasks
- **Function calling** capabilities
- **JSON mode** for structured outputs
- **Efficient batching** for high throughput
### No Deployment
- **Use existing endpoints** without launcher deployment
- **Bring-your-own-endpoint** integration
- **Flexible configuration** for any OpenAI-compatible API
## Configuration Overview
Basic configuration structure for launcher-orchestrated deployment:
```yaml
# Use Hydra defaults to compose config
defaults:
  - execution: slurm/default  # or lepton/default; local does not deploy
  - deployment: vllm          # or nim, sglang, none
  - _self_

# Deployment configuration
deployment:
  checkpoint_path: /path/to/model  # Or Hugging Face model ID
  served_model_name: my-model
  # ... deployment-specific options

# Execution backend configuration
execution:
  account: my-account
  output_dir: /path/to/results
  # ... backend-specific options

# Evaluation tasks
evaluation:
  tasks:
    - name: mmlu_pro
    - name: gsm8k
```
## Key Benefits
### Automated Lifecycle Management
- **Deployment automation**: No manual setup required
- **Resource management**: Automatic allocation and cleanup
- **Error handling**: Built-in retry and recovery mechanisms
- **Monitoring integration**: Real-time status and logging
### Multi-Backend Support
- **Consistent interface**: Same commands work across all backends
- **Environment flexibility**: Local development to production clusters
- **Resource optimization**: Backend-specific optimizations
- **Scalability**: From single GPU to multi-node deployments
### Integrated Workflows
- **End-to-end automation**: From model to results in one command
- **Configuration management**: Version-controlled, reproducible configs
- **Result integration**: Built-in export and analysis tools
- **Monitoring and debugging**: Comprehensive logging and status tracking
## Getting Started
1. **Choose your backend**: Start with {ref}`launcher-orchestrated-local` for development
2. **Configure your model**: Set deployment type and model path
3. **Run evaluation**: Use the launcher to deploy and evaluate
4. **Monitor progress**: Check status and logs during execution
5. **Analyze results**: Export and analyze evaluation outcomes
## Next Steps
- **Local Development**: Start with {ref}`launcher-orchestrated-local` for testing
- **Scale Up**: Move to {ref}`launcher-orchestrated-slurm` for production workloads
- **Cloud Native**: Try {ref}`launcher-orchestrated-lepton` for managed infrastructure
- **Configure Adapters**: Set up {ref}`adapters` for custom processing
```{toctree}
:maxdepth: 1
:hidden:
Local Deployment <local>
Slurm Deployment <slurm>
Lepton Deployment <lepton>
```
(launcher-orchestrated-local)=
# Local Execution
Run evaluations on your local machine using Docker containers. The local executor connects to existing model endpoints and orchestrates evaluation tasks locally.
:::{important}
The local executor does **not** deploy models. You must have an existing model endpoint running before starting evaluation. For launcher-orchestrated model deployment, use {ref}`launcher-orchestrated-slurm` or {ref}`launcher-orchestrated-lepton`.
:::
## Overview
Local execution:
- Runs evaluation containers locally using Docker
- Connects to existing model endpoints (local or remote)
- Suitable for development, testing, and small-scale evaluations
- Supports parallel or sequential task execution
## Quick Start
```bash
# Run evaluation against existing endpoint
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
```
## Configuration
### Basic Configuration
```yaml
# examples/local_basic.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_3_1_8b_instruct_results
  # mode: sequential  # Optional: run tasks sequentially instead of in parallel

target:
  api_endpoint:
    model_id: meta/llama-3.2-3b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
```
**Required fields:**
- `execution.output_dir`: Directory for results
- `target.api_endpoint.url`: Model endpoint URL
- `evaluation.tasks`: List of evaluation tasks
### Execution Modes
```yaml
execution:
  output_dir: ./results
  mode: parallel  # Default: run tasks in parallel
  # mode: sequential  # Run tasks one at a time
```
### Multi-Task Evaluation
```yaml
evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 200
    - name: gsm8k
      overrides:
        config.params.limit_samples: 100
    - name: humaneval
      overrides:
        config.params.limit_samples: 50
```
### Task-Specific Configuration
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      overrides:
        config.params.temperature: 0.6
        config.params.top_p: 0.95
        config.params.max_new_tokens: 8192
        config.params.parallelism: 4
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND
```
### With Adapter Configuration
Configure adapters using evaluation overrides:
```yaml
target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions
    model_id: my-model

evaluation:
  overrides:
    target.api_endpoint.adapter_config.use_reasoning: true
    target.api_endpoint.adapter_config.use_system_prompt: true
    target.api_endpoint.adapter_config.custom_system_prompt: "Think step by step."
```
For detailed adapter configuration options, refer to {ref}`adapters`.
### Tasks Requiring Dataset Mounting
Some tasks require access to local datasets. For these tasks, specify the dataset location:
```yaml
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /path/to/your/techqa/dataset
```
The system will automatically:
- Mount the dataset directory into the evaluation container at `/datasets` (or a custom path if specified)
- Set the `NEMO_EVALUATOR_DATASET_DIR` environment variable
- Validate that all required environment variables are configured
**Custom mount path example:**
```yaml
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /mnt/data/techqa
      dataset_mount_path: /custom/path  # Optional: customize container mount point
```
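As an illustrative sketch (the helper is hypothetical, not a NeMo Evaluator API), code running inside the evaluation container could resolve the dataset location from the environment variable the launcher sets, falling back to the default mount point:

```python
import os

# Hypothetical helper showing how task code could locate the dataset
# directory mounted by the launcher (NEMO_EVALUATOR_DATASET_DIR).
def resolve_dataset_dir(env=None, default="/datasets"):
    """Return the dataset directory set by the launcher, or the default mount point."""
    env = os.environ if env is None else env
    return env.get("NEMO_EVALUATOR_DATASET_DIR", default)

print(resolve_dataset_dir({"NEMO_EVALUATOR_DATASET_DIR": "/data/techqa"}))  # /data/techqa
```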
### Advanced Settings
If you deploy the model locally with Docker, you can use a dedicated Docker network.
This provides a secure connection between the deployment and evaluation containers.
```shell
docker network create my-custom-network
docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \
  --model microsoft/Phi-4-mini-instruct
```
Then use the same network in the evaluator config:
```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: my_phi_test
  extra_docker_args: "--network my-custom-network"

target:
  api_endpoint:
    model_id: microsoft/Phi-4-mini-instruct
    url: http://my-phi-container:8000/v1/chat/completions
    api_key_name: null

evaluation:
  tasks:
    - name: simple_evals.mmlu_pro
      overrides:
        config.params.limit_samples: 10  # TEST ONLY: limits to 10 samples for quick testing
        config.params.parallelism: 1
```
Alternatively, you can expose the port and use the host network:
```shell
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model microsoft/Phi-4-mini-instruct
```
```yaml
execution:
  extra_docker_args: "--network host"
```
## Command-Line Usage
### Basic Commands
```bash
# Run evaluation
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml

# Dry run to preview configuration
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  --dry-run

# Override endpoint URL
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o target.api_endpoint.url=http://localhost:8080/v1/chat/completions
```
### Job Management
```bash
# Check the status of a single job
nemo-evaluator-launcher status <job_id>

# Check the status of an entire invocation
nemo-evaluator-launcher status <invocation_id>

# Kill a running job
nemo-evaluator-launcher kill <job_id>

# List available tasks
nemo-evaluator-launcher ls tasks

# List recent runs
nemo-evaluator-launcher ls runs
```
## Requirements
### System Requirements
- **Docker**: Docker Engine installed and running
- **Storage**: Adequate space for evaluation containers and results
- **Network**: Internet access to pull Docker images
### Model Endpoint
You must have a model endpoint running and accessible before starting evaluation. Options include:
- {ref}`bring-your-own-endpoint-manual` using vLLM, TensorRT-LLM, or other frameworks
- {ref}`bring-your-own-endpoint-hosted` like NVIDIA API Catalog or OpenAI
- Custom deployment solutions
## Troubleshooting
### Docker Issues
**Docker not running:**
```bash
# Check Docker status
docker ps
# Start Docker daemon (varies by platform)
sudo systemctl start docker # Linux
# Or open Docker Desktop on macOS/Windows
```
**Permission denied:**
```bash
# Add user to docker group (Linux)
sudo usermod -aG docker $USER
# Log out and back in for changes to take effect
```
### Endpoint Connectivity
**Cannot connect to endpoint:**
```bash
# Test endpoint availability
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "test", "messages": [{"role": "user", "content": "Hi"}]}'
```
**API authentication errors:**
- Verify `api_key_name` matches your environment variable
- Check that the environment variable has a value: `echo $NGC_API_KEY`
- Check API key has proper permissions
### Evaluation Issues
**Job hangs or shows no progress:**
Check logs in the output directory:
```bash
# Track logs in real time
tail -f <output_dir>/<task_name>/logs/stdout.log

# Kill and restart if needed
nemo-evaluator-launcher kill <invocation_id>
```
**Tasks fail with errors:**
- Check logs in `<output_dir>/<task_name>/logs/stdout.log`
- Verify model endpoint supports required request format
- Ensure adequate disk space for results
### Configuration Validation
```bash
# Validate configuration before running
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  --dry-run
```
## Next Steps
- **Deploy your own model**: See {ref}`bring-your-own-endpoint-manual` for local model serving
- **Scale to HPC**: Use {ref}`launcher-orchestrated-slurm` for cluster deployments
- **Cloud execution**: Try {ref}`launcher-orchestrated-lepton` for cloud-based evaluation
- **Configure adapters**: Add interceptors with {ref}`adapters`
(launcher-orchestrated-lepton)=
# Lepton AI Deployment via Launcher
Deploy and evaluate models on Lepton AI cloud platform using NeMo Evaluator Launcher orchestration. This approach provides scalable cloud inference with managed infrastructure.
## Overview
Lepton launcher-orchestrated deployment:
- Deploys models on Lepton AI cloud platform
- Provides managed infrastructure and scaling
- Supports various resource shapes and configurations
- Handles deployment lifecycle in the cloud
## Quick Start
```bash
# Deploy and evaluate on Lepton AI
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/lepton_vllm.yaml \
  -o deployment.checkpoint_path=meta-llama/Llama-3.1-8B-Instruct \
  -o deployment.lepton_config.resource_shape=gpu.1xh200
```
This command:
1. Deploys a vLLM endpoint on Lepton AI
2. Runs the configured evaluation tasks
3. Returns an invocation ID for monitoring
The launcher handles endpoint creation, evaluation execution, and provides cleanup commands.
## Prerequisites
### Lepton AI Setup
```bash
# Install Lepton AI CLI
pip install leptonai
# Authenticate with Lepton AI
lep login
```
Refer to the [Lepton AI documentation](https://docs.nvidia.com/dgx-cloud/lepton/get-started) for authentication and workspace configuration.
## Deployment Types
### vLLM Lepton Deployment
High-performance inference with cloud scaling:
Refer to the complete working configuration in `packages/nemo-evaluator-launcher/examples/lepton_vllm.yaml`. Key configuration sections:
```yaml
deployment:
  type: vllm
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  tensor_parallel_size: 1
  lepton_config:
    resource_shape: gpu.1xh200
    min_replicas: 1
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600

execution:
  type: lepton
  evaluation_tasks:
    timeout: 3600

evaluation:
  tasks:
    - name: ifeval
```
The launcher automatically retrieves the endpoint URL after deployment, eliminating the need for manual URL configuration.
### NIM Lepton Deployment
Enterprise-grade serving in the cloud. Refer to the complete working configuration in `packages/nemo-evaluator-launcher/examples/lepton_nim.yaml`:
```yaml
deployment:
  type: nim
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
  lepton_config:
    resource_shape: gpu.1xh200
    min_replicas: 1
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600

execution:
  type: lepton

evaluation:
  tasks:
    - name: ifeval
```
### SGLang Deployment
SGLang is also supported as a deployment type. Use `deployment.type: sglang` with similar configuration to vLLM.
## Resource Shapes
Resource shapes are Lepton platform-specific identifiers that determine the compute resources allocated to your deployment. Available shapes depend on your Lepton workspace configuration and quota.
Configure in your deployment:
```yaml
deployment:
  lepton_config:
    resource_shape: gpu.1xh200  # Example: check your Lepton workspace for available shapes
```
Refer to the [Lepton AI documentation](https://docs.nvidia.com/dgx-cloud/lepton/get-started) or check your workspace settings for available resource shapes in your environment.
## Configuration Examples
### Auto-Scaling Configuration
Configure auto-scaling behavior through the `lepton_config.auto_scaler` section:
```yaml
deployment:
  lepton_config:
    min_replicas: 1
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600  # Seconds before scaling down
      scale_from_zero: false
```
### Using Existing Endpoints
To evaluate against an already-deployed Lepton endpoint without creating a new deployment, use `deployment.type: none` and provide the endpoint URL in the `target.api_endpoint` section.
Refer to `packages/nemo-evaluator-launcher/examples/lepton_basic.yaml` for a complete example.
### Tasks Requiring Dataset Mounting
Some tasks require access to local datasets that must be mounted into the evaluation container:
```yaml
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /path/to/shared/storage/techqa
```
The system will automatically:
- Mount the dataset directory into the evaluation container
- Set the `NEMO_EVALUATOR_DATASET_DIR` environment variable
- Validate that all required environment variables are configured
**Custom mount path example:**
```yaml
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /lepton/shared/datasets/techqa
      dataset_mount_path: /data/techqa  # Optional: customize container mount point
```
:::{note}
Ensure the dataset directory is accessible from the Lepton platform's shared storage configured in your workspace.
:::
## Advanced Configuration
### Environment Variables
Pass environment variables to deployment containers through `lepton_config.envs`:
```yaml
deployment:
  lepton_config:
    envs:
      HF_TOKEN:
        value_from:
          secret_name_ref: "HUGGING_FACE_HUB_TOKEN"
      CUSTOM_VAR: "direct_value"
```
### Storage Mounts
Configure persistent storage for model caching:
```yaml
deployment:
  lepton_config:
    mounts:
      enabled: true
      cache_path: "/path/to/storage"
      mount_path: "/opt/nim/.cache"
```
## Monitoring and Management
### Check Evaluation Status
Use NeMo Evaluator Launcher commands to monitor your evaluations:
```bash
# Check status using the invocation ID
nemo-evaluator-launcher status <invocation_id>

# Kill running evaluations and clean up endpoints
nemo-evaluator-launcher kill <invocation_id>
```
### Monitor Lepton Resources
Use Lepton AI CLI commands to monitor platform resources:
```bash
# List all deployments in your workspace
lep deployment list

# Get details about a specific deployment
lep deployment get <deployment-name>

# View deployment logs
lep deployment logs <deployment-name>

# Check resource availability
lep resource list --available
```
Refer to the [Lepton AI CLI documentation](https://docs.nvidia.com/dgx-cloud/lepton/reference/cli/get-started/) for the complete command reference.
## Exporting Results
After evaluation completes, export results using the export command:
```bash
# Export results to MLflow
nemo-evaluator-launcher export <invocation_id> --dest mlflow
```
Refer to the {ref}`exporters-overview` for additional export options and configurations.
## Troubleshooting
### Common Issues
**Deployment Timeout:**
If endpoints take too long to become ready, check deployment logs:
```bash
# Check deployment logs via the Lepton CLI
lep deployment logs <deployment-name>

# Increase the readiness timeout in your configuration
# (execution.lepton_platform.deployment.endpoint_readiness_timeout)
```
**Resource Unavailable:**
If your requested resource shape is unavailable:
```bash
# Check available resources in your workspace
lep resource list --available

# Try a different resource shape in your config
```
**Authentication Issues:**
```bash
# Re-authenticate with Lepton
lep login
```
**Endpoint Not Found:**
If evaluation jobs cannot connect to the endpoint:
1. Verify the endpoint is in the "Ready" state using `lep deployment get <deployment-name>`
2. Confirm the endpoint URL is accessible
3. Verify API tokens are properly set in Lepton secrets
## Next Steps
- Compare with {ref}`launcher-orchestrated-slurm` for HPC cluster deployments
- Explore {ref}`launcher-orchestrated-local` for local development and testing
- Review complete configuration examples in the `examples/` directory
(launcher-orchestrated-slurm)=
# Slurm Deployment via Launcher
Deploy and evaluate models on HPC clusters using Slurm workload manager through NeMo Evaluator Launcher orchestration.
## Overview
Slurm launcher-orchestrated deployment:
- Submits jobs to Slurm-managed HPC clusters
- Supports multi-node evaluation runs
- Handles resource allocation and job scheduling
- Manages model deployment lifecycle within Slurm jobs
## Quick Start
```bash
# Deploy and evaluate on Slurm cluster
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/slurm_vllm_checkpoint_path.yaml \
  -o deployment.checkpoint_path=/shared/models/llama-3.1-8b-instruct \
  -o execution.partition=gpu
```
## vLLM Deployment
```yaml
# Slurm with vLLM deployment
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

deployment:
  type: vllm
  checkpoint_path: /shared/models/llama-3.1-8b-instruct
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8
  port: 8000

execution:
  account: my-account
  output_dir: /shared/results
  partition: gpu
  num_nodes: 1
  ntasks_per_node: 1
  gres: gpu:8
  walltime: "02:00:00"

target:
  api_endpoint:
    url: http://localhost:8000/v1/chat/completions
    model_id: meta-llama/Llama-3.1-8B-Instruct

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
    - name: mbpp
```
## Slurm Configuration
### Supported Parameters
The following execution parameters are supported for Slurm deployments. See `configs/execution/slurm/default.yaml` in the launcher package for the base configuration:
```yaml
execution:
  # Required parameters
  hostname: ???                # Slurm cluster hostname
  username: ${oc.env:USER}     # SSH username (defaults to USER environment variable)
  account: ???                 # Slurm account for billing
  output_dir: ???              # Results directory

  # Resource allocation
  partition: batch             # Slurm partition/queue
  num_nodes: 1                 # Number of nodes
  ntasks_per_node: 1           # Tasks per node
  gres: gpu:8                  # GPU resources
  walltime: "01:00:00"         # Wall-time limit (HH:MM:SS)

  # Environment variables and mounts
  env_vars:
    deployment: {}             # Environment variables for the deployment container
    evaluation: {}             # Environment variables for the evaluation container
  mounts:
    deployment: {}             # Mount points for the deployment container (source:target)
    evaluation: {}             # Mount points for the evaluation container (source:target)
  mount_home: true             # Whether to mount the home directory
```
:::{note}
The `gpus_per_node` parameter can be used as an alternative to `gres` for specifying GPU resources. However, `gres` is the default in the base configuration.
:::
## Configuration Examples
### Benchmark Suite Evaluation
```yaml
# Run multiple benchmarks on a single model
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

deployment:
  type: vllm
  checkpoint_path: /shared/models/llama-3.1-8b-instruct
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8
  port: 8000

execution:
  account: my-account
  output_dir: /shared/results
  hostname: slurm.example.com
  partition: gpu
  num_nodes: 1
  ntasks_per_node: 1
  gres: gpu:8
  walltime: "06:00:00"

target:
  api_endpoint:
    url: http://localhost:8000/v1/chat/completions
    model_id: meta-llama/Llama-3.1-8B-Instruct

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
    - name: mbpp
    - name: hellaswag
```
### Tasks Requiring Dataset Mounting
Some tasks require access to local datasets stored on the cluster's shared filesystem:
```yaml
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /shared/datasets/techqa  # Path on the shared filesystem
```
The system will automatically:
- Mount the dataset directory into the evaluation container
- Set the `NEMO_EVALUATOR_DATASET_DIR` environment variable
- Validate that all required environment variables are configured
**Custom mount path example:**
```yaml
evaluation:
  tasks:
    - name: mteb.techqa
      dataset_dir: /shared/datasets/techqa
      dataset_mount_path: /data/techqa  # Optional: customize container mount point
```
:::{note}
Ensure the dataset directory is accessible from all cluster nodes via shared storage (e.g., NFS, Lustre).
:::
## Job Management
### Submitting Jobs
```bash
# Submit a job with a configuration file
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml

# Submit with configuration overrides
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml \
  -o execution.walltime="04:00:00" \
  -o execution.partition=gpu-long
```
### Monitoring Jobs
```bash
# Check job status
nemo-evaluator-launcher status <job_id>

# List all runs (optionally filter by executor)
nemo-evaluator-launcher ls runs --executor slurm
```
### Managing Jobs
```bash
# Cancel a job
nemo-evaluator-launcher kill <job_id>
```
### Native Slurm Commands
You can also use native Slurm commands to manage jobs directly:
```bash
# View job details
squeue -j <job_id> -o "%.18i %.9P %.50j %.8u %.2t %.10M %.6D %R"

# Check job efficiency
seff <job_id>

# Cancel a Slurm job directly
scancel <job_id>

# Hold/release a job
scontrol hold <job_id>
scontrol release <job_id>

# View detailed job information
scontrol show job <job_id>
```
## Shared Storage
Slurm evaluations require shared storage accessible from all cluster nodes:
### Model Storage
Store models in a shared filesystem accessible to all compute nodes:
```bash
# Example shared model directory
/shared/models/
├── llama-3.1-8b-instruct/
├── llama-3.1-70b-instruct/
└── custom-model.nemo
```
Specify the model path in your configuration:
```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b-instruct
```
### Results Storage
Evaluation results are written to the configured output directory:
```yaml
execution:
output_dir: /shared/results
```
Results are organized by timestamp and invocation ID in subdirectories.
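The exact subdirectory names are produced by the launcher; as a rough sketch (assuming only that each run gets its own subdirectory under `output_dir`), you can locate the most recent results programmatically:

```python
from pathlib import Path
from typing import Optional

def latest_results_dir(output_dir: str) -> Optional[Path]:
    """Return the most recently modified run subdirectory under output_dir, or None."""
    subdirs = [p for p in Path(output_dir).iterdir() if p.is_dir()]
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)
```

This is a generic helper, not part of the launcher API; inspect your own `output_dir` to see the actual timestamp and invocation-ID layout.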
## Troubleshooting
### Common Issues
**Job Pending:**
```bash
# Check node availability
sinfo -p gpu
# Try different partition
-o execution.partition="gpu-shared"
```
**Job Failed:**
```bash
# Check job status
nemo-evaluator-launcher status
# View Slurm job details
scontrol show job <job_id>
# Check job output logs (location shown in status output)
```
**Job Timeout:**
```bash
# Increase walltime
-o execution.walltime="08:00:00"
# Check current walltime limit for partition
sinfo -p <partition> -o "%P %l"
```
**Resource Allocation:**
```bash
# Adjust GPU allocation via gres
-o execution.gres=gpu:4
-o deployment.tensor_parallel_size=4
```
### Debugging with Slurm Commands
```bash
# View job details
scontrol show job <job_id>
# Monitor resource usage
sstat -j <job_id> --format=AveCPU,AveRSS,MaxRSS,AveVMSize
# Job accounting information
sacct -j <job_id> --format=JobID,JobName,State,ExitCode,DerivedExitCode
# Check job efficiency after completion
seff <job_id>
```
# Evaluate Automodel Checkpoints Trained by NeMo Framework
This guide provides step-by-step instructions for evaluating checkpoints trained using the NeMo Framework with the Automodel backend. This section specifically covers evaluation with [nvidia-lm-eval](https://pypi.org/project/nvidia-lm-eval/), a wrapper around the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) tool.
Here, we focus on benchmarks within the `lm-evaluation-harness` that depend on text generation. Evaluation on log-probability-based benchmarks is available in [Evaluate Automodel Checkpoints on Log-probability benchmarks](#evaluate-automodel-checkpoints-on-log-probability-benchmarks).
## Deploy Automodel Checkpoints
This section outlines the steps to deploy Automodel checkpoints using Python commands.
Automodel checkpoint deployment uses Ray Serve as the serving backend. It also offers an OpenAI API (OAI)-compatible endpoint, similar to deployments of checkpoints trained with the Megatron Core backend. An example deployment command is shown below.
```{literalinclude} _snippets/deploy_hf.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
The `--model_path` argument can refer to either a local checkpoint path or a Hugging Face model ID. In the example above, checkpoint deployment uses the `vLLM` backend; to enable accelerated inference, install `vLLM` in your environment. To install `vLLM` inside the NeMo Framework container, follow the steps below, as shared in [Export-Deploy's README](https://github.com/NVIDIA-NeMo/Export-Deploy?tab=readme-ov-file#install-tensorrt-llm-vllm-or-trt-onnx-backend:~:text=cd%20/opt/export%2ddeploy%0auv%20sync%20%2d%2dinexact%20%2d%2dlink%2dmode%20symlink%20%2d%2dlocked%20%2d%2dextra%20vllm%20%24(cat%20/opt/uv_args.txt)):
```shell
cd /opt/Export-Deploy
uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
```
To install `vLLM` outside of the NeMo Framework container, follow the steps mentioned [here](https://github.com/NVIDIA-NeMo/Export-Deploy?tab=readme-ov-file#install-tensorrt-llm-vllm-or-trt-onnx-backend:~:text=install%20transformerengine%20%2b%20vllm).
:::{note}
The 25.11 release of the NeMo Framework container comes with `vLLM` pre-installed, so it is not necessary to install it explicitly. For all previous releases, refer to the instructions above to install `vLLM` inside the NeMo Framework container.
:::
If you prefer to evaluate the Automodel checkpoint without using the `vLLM` backend, remove the `--use_vllm_backend` flag from the command above.
:::{note}
To speed up evaluation using multiple instances, increase the `num_replicas` parameter.
For additional guidance, refer to {ref}`nemo-fw-ray`.
:::
## Evaluate Automodel Checkpoints
This section outlines the steps to evaluate Automodel checkpoints using Python commands. This method is quick and easy, making it ideal for interactive evaluations.
Once deployment is successful, you can run evaluations using the {ref}`lib-core` API.
Before starting the evaluation, it’s recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests.
```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```
## Evaluate Automodel Checkpoints on Log-probability Benchmarks
To evaluate Automodel checkpoints on benchmarks that require log-probabilities, use the same deployment command provided in [Deploy Automodel Checkpoints](#deploy-automodel-checkpoints). These benchmarks are supported by both the `vLLM` backend (enabled via the `--use_vllm_backend` flag) and by directly deploying the Automodel checkpoint.
For evaluation, you must specify the path to the `tokenizer` and set the `tokenizer_backend` parameter as shown below. The `tokenizer` files are located within the checkpoint directory.
```{literalinclude} _snippets/arc_challenge_hf.py
:language: python
:start-after: "## Run the evaluation"
```
## Evaluate Automodel Checkpoints on Chat Benchmarks
To evaluate Automodel checkpoints on chat benchmarks, you need the chat endpoint (`/v1/chat/completions/`). The deployment command provided in [Deploy Automodel Checkpoints](#deploy-automodel-checkpoints) also exposes the chat endpoint, and the same command can be used for evaluating on chat benchmarks.
For evaluation, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/` as shown below. Additionally, set the `type` field to `"chat"` to indicate a chat benchmark.
```{literalinclude} _snippets/ifeval.py
:language: python
:start-after: "## Run the evaluation"
```
(deployment-nemo-fw)=
# Deploy and Evaluate Checkpoints Trained by NeMo Framework
The NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training platform for large language models (LLMs), multimodal models, and speech models. It enables seamless scaling of both pretraining and post-training workloads, from a single GPU to clusters with thousands of nodes, supporting Hugging Face/PyTorch and Megatron models. NeMo includes a suite of libraries and curated training recipes to help users build models from start to finish.
The NeMo Evaluator is integrated within NeMo Framework, offering streamlined deployment and advanced evaluation capabilities for models trained using NeMo, leveraging state-of-the-art evaluation harnesses.
## Features
- **Multi-Backend Deployment**: Supports PyTriton and multi-instance evaluations using the Ray Serve deployment backend
- **Production-Ready**: Supports high-performance inference with CUDA graphs and flash decoding for Megatron models, vLLM backend for Hugging Face models and TRTLLM engine for TRTLLM models
- **Multi-GPU and Multi-Node Support**: Enables distributed inference across multiple GPUs and compute nodes
- **OpenAI-Compatible API**: Provides RESTful endpoints aligned with OpenAI API specifications
## Architecture
### 1. Deployment Layer
- **PyTriton Backend**: Provides high-performance inference through the NVIDIA Triton Inference Server, with OpenAI API compatibility via a FastAPI interface. Supports model parallelism across single-node and multi-node configurations. Note: Multi-instance evaluation is not supported.
- **Ray Backend**: Enables multi-instance evaluation with model parallelism on a single node using Ray Serve, while maintaining OpenAI API compatibility. Multi-node support is coming soon.
For more information on the deployment, please see [NeMo Export-Deploy](https://github.com/NVIDIA-NeMo/Export-Deploy).
### 2. Evaluation Layer
- **NeMo Evaluator**: Provides standardized benchmark evaluations using packages from NVIDIA Eval Factory, bundled in the NeMo Framework container. The `lm-evaluation-harness` is pre-installed by default, and additional evaluation packages can be added as needed. For more information, see {ref}`core-wheels` and {ref}`lib-core`.
```{toctree}
:maxdepth: 1
:hidden:
Introduction
PyTriton Serving Backend
Ray Serving Backend
Evaluate Megatron Bridge Checkpoints
Evaluate Automodel Checkpoints
Evaluate TRTLLM Checkpoints
```
# Use PyTriton Server for Evaluations
This guide explains how to deploy and evaluate NeMo Framework models, trained with the Megatron-Core backend, using PyTriton to serve the model.
## Introduction
Deployment with the PyTriton serving backend provides high-performance inference through the NVIDIA Triton Inference Server, with OpenAI API compatibility via a FastAPI interface.
It supports model parallelism across single-node and multi-node configurations, facilitating deployment of large models that cannot fit into a single device.
## Key Benefits of PyTriton Deployment
- **Multi-Node Support**: Deploy large models across multiple nodes using pipeline, tensor, context, or expert parallelism.
- **Automatic Request Batching**: PyTriton automatically groups incoming requests into batches for efficient inference.
## Deploy Models Using PyTriton
The deployment scripts are available inside [`/opt/Export-Deploy/scripts/deploy/nlp/`](https://github.com/NVIDIA-NeMo/Export-Deploy/tree/main/scripts/deploy/nlp) directory.
The example command below uses a Hugging Face LLaMA 3 8B checkpoint that has been converted to the Megatron Bridge format. To evaluate a checkpoint saved during [pre-training or fine-tuning](https://docs.nvidia.com/nemo/megatron-bridge/latest/recipe-usage.html), provide the path to the saved checkpoint using the `--megatron_checkpoint` flag in the command below.
```{literalinclude} _snippets/deploy_pytriton.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
When working with a larger model, you can use model parallelism to distribute the model across available devices.
The example below deploys the [Llama-3_3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) model (converted to the Megatron Bridge format) with 8 devices and tensor parallelism:
```bash
python \
/opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \
--megatron_checkpoint "/workspace/Llama-3_3-Nemotron-Super-49B-v1/iter_0000000" \
--triton_model_name "megatron_model" \
--server_port 8080 \
--num_gpus 8 \
--tensor_model_parallel_size 8 \
--max_batch_size 4 \
--inference_max_seq_length 4096
```
Make sure to adjust the parameters to match your available resources and model architecture.
## Run Evaluations on PyTriton-Deployed Models
The entry point for evaluation is the [`evaluate`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/evaluate.py) function. To run evaluations on the deployed model, use the following command. Make sure to open a new terminal within the same container to execute it. For longer evaluations, it is advisable to run both the deploy and evaluate commands in tmux sessions to prevent the processes from being terminated unexpectedly and aborting the runs.
It is recommended to use [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests before starting the evaluation.
```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```
To evaluate the chat endpoint, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/`. Additionally, set the `type` field to `"chat"` to indicate a chat benchmark.
To evaluate log-probability benchmarks (e.g., `arc_challenge`), run the following code snippet after deployment.
Make sure to open a new terminal within the same container to execute it.
```{literalinclude} _snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```
Note that in the example above, you must provide a path to the tokenizer:
```python
extra={
"tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer",
"tokenizer_backend": "huggingface",
},
```
Please refer to [`deploy_inframework_triton.py`](https://github.com/NVIDIA-NeMo/Export-Deploy/blob/main/scripts/deploy/nlp/deploy_inframework_triton.py) script and [`evaluate`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/evaluate.py) function to review all available argument options, as the provided commands are only examples and do not include all arguments or their default values. For more detailed information on the arguments used in the `ApiEndpoint` and `ConfigParams` classes for evaluation, see [`api_dataclasses`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/api/api_dataclasses.py) submodule.
:::{tip}
If you encounter a `TimeoutError` on the evaluation client side, increase the `request_timeout` parameter in the `ConfigParams` class to a larger value, such as `1000` or `1200` seconds (the default is 300).
:::
(nemo-fw-ray)=
# Use Ray Serve for Multi-Instance Evaluations
This guide explains how to deploy and evaluate NeMo Framework models, trained with the Megatron-Core backend, using Ray Serve to enable multi-instance evaluation across available GPUs.
## Introduction
Deployment with Ray Serve provides support for multiple replicas of your model across available GPUs, enabling higher throughput and better resource utilization during evaluation. This approach is particularly beneficial for evaluation scenarios where you need to process large datasets efficiently and would like to accelerate evaluation.
## Key Benefits of Ray Deployment
- **Multiple Model Replicas**: Deploy multiple instances of your model to handle concurrent requests.
- **Automatic Load Balancing**: Ray automatically distributes requests across available replicas.
- **Scalable Architecture**: Easily scale up or down based on your hardware resources.
- **Resource Optimization**: Better utilization of available GPUs.
## Deploy Models Using Ray Serve
To deploy your model using Ray, use the `deploy_ray_inframework.py` script from [NeMo Export-Deploy](https://github.com/NVIDIA-NeMo/Export-Deploy):
```shell
# --megatron_checkpoint: Llama3 8B HF checkpoint converted to Megatron Bridge format
# --port: Ray server port
# --num_gpus: total GPUs available
# --num_replicas: number of model replicas
# --tensor_model_parallel_size, --pipeline_model_parallel_size,
# --context_parallel_size: parallelism per replica
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint "/workspace/mbridge_llama3_8b/iter_0000000/" \
    --model_id "megatron_model" \
    --port 8080 \
    --num_gpus 4 \
    --num_replicas 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1
```
:::{note}
Adjust `num_replicas` based on the number of instances/replicas needed. Ensure that the total `num_gpus` equals `num_replicas` times the model-parallelism product (i.e., `tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size`).
:::
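The constraint above can be checked before launching the deployment; a minimal sketch (`required_gpus` is an illustrative helper, not part of the launcher or NeMo API):

```python
def required_gpus(num_replicas: int,
                  tensor_model_parallel_size: int,
                  pipeline_model_parallel_size: int = 1,
                  context_parallel_size: int = 1) -> int:
    """GPUs a Ray deployment occupies: replicas times model parallelism per replica."""
    return (num_replicas
            * tensor_model_parallel_size
            * pipeline_model_parallel_size
            * context_parallel_size)

# The example deployment above: 2 replicas with tensor parallelism 2 -> 4 GPUs total
```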
## Run Evaluations on Ray-Deployed Models
Once your model is deployed with Ray, you can run evaluations using the same evaluation API as with PyTriton deployment. It is recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests before starting the evaluation.
To evaluate on generation benchmarks use the code snippet below:
```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```
To evaluate the chat endpoint, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/`. Additionally, set the `type` field to `"chat"` in both `ApiEndpoint` and `EvaluationConfig` to indicate a chat benchmark.
To evaluate log-probability benchmarks (e.g., `arc_challenge`), run the following code snippet after deployment.
Make sure to open a new terminal within the same container to execute it.
```{literalinclude} _snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```
Note that in the example above, you must provide a path to the tokenizer:
```python
extra={
"tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer",
"tokenizer_backend": "huggingface",
},
```
:::{tip}
To get a performance boost from multiple replicas in Ray, increase the parallelism value in your `EvaluationConfig`. You won't see any speed improvement if `parallelism=1`. Try setting it to a higher value, such as 4 or 8.
:::
# Evaluate Megatron Bridge Checkpoints Trained by NeMo Framework
This guide provides step-by-step instructions for evaluating [Megatron Bridge](https://docs.nvidia.com/nemo/megatron-bridge/latest/index.html) checkpoints trained using the NeMo Framework with the Megatron Core backend. This section specifically covers evaluation with [nvidia-lm-eval](https://pypi.org/project/nvidia-lm-eval/), a wrapper around the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) tool.
First, we focus on benchmarks within the `lm-evaluation-harness` that depend on text generation. Evaluation on log-probability-based benchmarks is available in the subsequent section [Evaluate Megatron Bridge Checkpoints on Log-probability benchmarks](#evaluate-megatron-bridge-checkpoints-on-log-probability-benchmarks).
## Deploy Megatron Bridge Checkpoints
To evaluate a checkpoint saved during pretraining or fine-tuning with [Megatron-Bridge](https://docs.nvidia.com/nemo/megatron-bridge/latest/recipe-usage.html), provide the path to the saved checkpoint using the `--megatron_checkpoint` flag in the deployment command below. Otherwise, Hugging Face checkpoints can be converted to Megatron Bridge using the single shell command:
```bash
huggingface-cli login --token <your_hf_token>
python -c "from megatron.bridge import AutoBridge; AutoBridge.import_ckpt('meta-llama/Meta-Llama-3-8B','/workspace/mbridge_llama3_8b/')"
```
The deployment scripts are available inside the [`/opt/Export-Deploy/scripts/deploy/nlp/`](https://github.com/NVIDIA-NeMo/Export-Deploy/tree/main/scripts/deploy/nlp) directory. Below is an example command for deployment. It uses a Hugging Face LLaMA 3 8B checkpoint that has been converted to Megatron Bridge format using the command shared above.
```{literalinclude} _snippets/deploy_mbridge.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
:::{note}
Megatron Bridge creates checkpoints in directories named `iter_N`, where *N* is the iteration number. Each `iter_N` directory contains model weights and related artifacts. When using a checkpoint, make sure to provide the path to the appropriate `iter_N` directory. Hugging Face checkpoints converted for Megatron Bridge are typically stored in a directory named `iter_0000000`, as shown in the command above.
:::
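Because checkpoints accumulate as `iter_N` directories, a small helper can resolve the latest iteration automatically (an illustrative sketch, not a NeMo utility; it only relies on the `iter_N` naming convention described above):

```python
import re
from pathlib import Path
from typing import Optional

def latest_iter_dir(checkpoint_root: str) -> Optional[Path]:
    """Return the iter_N subdirectory with the highest iteration number, or None."""
    candidates = []
    for p in Path(checkpoint_root).iterdir():
        m = re.fullmatch(r"iter_(\d+)", p.name)
        if p.is_dir() and m:
            candidates.append((int(m.group(1)), p))
    return max(candidates, default=(None, None))[1]
```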
:::{note}
Megatron Bridge deployment for evaluation is supported only with Ray Serve and not PyTriton.
:::
## Evaluate Megatron Bridge Checkpoints
Once deployment is successful, you can run evaluations using the NeMo Evaluator API. See {ref}`lib-core` for more details.
Before starting the evaluation, it’s recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests.
```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```
## Evaluate Megatron Bridge Checkpoints on Log-probability Benchmarks
To evaluate Megatron Bridge checkpoints on benchmarks that require log-probabilities, use the same deployment command provided in [Deploy Megatron Bridge Checkpoints](#deploy-megatron-bridge-checkpoints).
For evaluation, you must specify the path to the `tokenizer` and set the `tokenizer_backend` parameter as shown below. The `tokenizer` files are located within the `tokenizer` directory of the checkpoint.
```{literalinclude} _snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```
## Evaluate Megatron Bridge Checkpoints on Chat Benchmarks
To evaluate Megatron Bridge checkpoints on chat benchmarks, you need the chat endpoint (`/v1/chat/completions/`). The deployment command provided in [Deploy Megatron Bridge Checkpoints](#deploy-megatron-bridge-checkpoints) also exposes the chat endpoint, and the same command can be used for evaluating on chat benchmarks.
For evaluation, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/` as shown below. Additionally, set the `type` field to `"chat"` to indicate a chat benchmark.
```{literalinclude} _snippets/ifeval.py
:language: python
:start-after: "## Run the evaluation"
```
# Evaluate TensorRT-LLM Checkpoints with NeMo Framework
This guide provides step-by-step instructions for evaluating TensorRT-LLM (TRTLLM) checkpoints or models inside NeMo Framework.
This guide focuses on benchmarks within the `lm-evaluation-harness` that depend on text generation. For a detailed comparison between generation-based and log-probability-based benchmarks, refer to {ref}`eval-run`.
:::{note}
Evaluation on log-probability-based benchmarks for TRTLLM models is currently planned for a future release.
:::
## Deploy TRTLLM Checkpoints
This section outlines the steps to deploy TRTLLM checkpoints using Python commands.
TRTLLM checkpoint deployment uses Ray Serve as the serving backend. It also offers an OpenAI API (OAI)-compatible endpoint, similar to deployments of checkpoints trained with the Megatron Core backend. An example deployment command is shown below.
```{literalinclude} _snippets/deploy_trtllm.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```
## Evaluate TRTLLM Checkpoints
This section outlines the steps to evaluate TRTLLM checkpoints using Python commands. This method is quick and easy, making it ideal for interactive evaluations.
Once deployment is successful, you can run evaluations using the same evaluation API described in other sections.
Before starting the evaluation, it’s recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests.
```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```
(lib)=
# NeMo Evaluator Libraries
Select a library for your evaluation workflow:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` NeMo Evaluator Launcher
:link: nemo-evaluator-launcher/index
:link-type: doc
**Start here** - Unified CLI and Python API for running evaluations across local, cluster, and hosted environments.
:::
:::{grid-item-card} {octicon}`beaker;1.5em;sd-mr-1` NeMo Evaluator
:link: nemo-evaluator/index
:link-type: doc
**Advanced usage** - Direct access to core evaluation logic for custom integrations.
:::
::::
The Launcher orchestrates the NeMo Evaluator containers using identical underlying code to ensure consistent results.
(deployment-generic)=
# Generic Deployment
Generic deployment provides flexible configuration for deploying any custom server that isn't covered by built-in deployment configurations.
## Configuration
See `configs/deployment/generic.yaml` for all available parameters.
### Basic Settings
Key arguments:
- **`image`**: Docker image to use for deployment (required)
- **`command`**: Command to run the server with template variables (required)
- **`served_model_name`**: Name of the served model (required)
- **`endpoints`**: API endpoint paths (chat, completions, health)
- **`checkpoint_path`**: Path to model checkpoint for mounting (default: null)
- **`extra_args`**: Additional command line arguments
- **`env_vars`**: Environment variables as {name: value} dict
## Best Practices
- Ensure the server responds to the health check endpoint (verify that the health endpoint is correctly parameterized)
- Test configuration with `--dry_run`
## Contributing Permanent Configurations
If you've successfully applied the generic deployment to serve a specific model or framework, contributions are welcome! We'll turn your working configuration into a permanent config file for the community.
# Deployment Configuration
Deployment configurations define how to provision and host model endpoints for evaluation.
## Deployment Types
Choose the deployment type for your evaluation:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`globe;1.5em;sd-mr-1` None (External)
:link: none
:link-type: doc
Use existing API endpoints. No model deployment needed.
:::
:::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` vLLM
:link: vllm
:link-type: doc
Deploy models using the vLLM serving framework.
:::
:::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` SGLang
:link: sglang
:link-type: doc
Deploy models using the SGLang serving framework.
:::
:::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` NIM
:link: nim
:link-type: doc
Deploy models using NVIDIA Inference Microservices.
:::
:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` TRT-LLM
:link: trtllm
:link-type: doc
Deploy models using NVIDIA TensorRT LLM.
:::
:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Generic
:link: generic
:link-type: doc
Deploy models using a fully custom setup.
:::
::::
## Quick Reference
```yaml
deployment:
  type: vllm # or sglang, nim, trtllm, generic, none
# ... deployment-specific settings
```
```{toctree}
:caption: Deployment Types
:hidden:
vLLM
SGLang
NIM
TensorRT-LLM
Generic
None (External)
```
(deployment-vllm)=
# vLLM Deployment
Configure vLLM as the deployment backend for serving models during evaluation.
## Configuration Parameters
### Basic Settings
```yaml
deployment:
type: vllm
image: vllm/vllm-openai:latest
hf_model_handle: hf-model/handle # HuggingFace ID
checkpoint_path: null # or provide a path to the stored checkpoint
served_model_name: your-model-name
port: 8000
```
**Required Fields:**
- `checkpoint_path` or `hf_model_handle`: Model path or HuggingFace model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct`)
- `served_model_name`: Name for the served model
### Performance Settings
```yaml
deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 1
data_parallel_size: 1
gpu_memory_utilization: 0.95
```
- **tensor_parallel_size**: Number of GPUs to split the model across (default: 8)
- **pipeline_parallel_size**: Number of pipeline stages (default: 1)
- **data_parallel_size**: Number of model replicas (default: 1)
- **gpu_memory_utilization**: Fraction of GPU memory to use for the model (default: 0.95)
### Extra Arguments and Endpoints
```yaml
deployment:
extra_args: "--max-model-len 4096"
endpoints:
chat: /v1/chat/completions
completions: /v1/completions
health: /health
```
The `extra_args` field passes extra arguments to the `vllm serve` command.
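Conceptually, `extra_args` is split shell-style and appended to the generated `vllm serve` invocation. A hypothetical sketch of that assembly (the launcher's actual command construction may differ):

```python
import shlex

def build_serve_command(model: str, served_model_name: str,
                        port: int, extra_args: str = "") -> list:
    """Compose a vllm serve command line, appending user-supplied extra_args."""
    cmd = [
        "vllm", "serve", model,
        "--served-model-name", served_model_name,
        "--port", str(port),
    ]
    # extra_args is a single string in the YAML; split it like a shell would
    cmd += shlex.split(extra_args)
    return cmd
```

For example, `extra_args: "--max-model-len 4096"` contributes the two tokens `--max-model-len` and `4096` to the final command.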
## Complete Example
```yaml
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
deployment:
checkpoint_path: Qwen/Qwen3-4B-Instruct-2507
served_model_name: qwen3-4b-instruct-2507
tensor_parallel_size: 1
data_parallel_size: 8
extra_args: "--max-model-len 4096"
execution:
hostname: your-cluster-headnode
account: your-account
output_dir: /path/to/output
walltime: 02:00:00
evaluation:
tasks:
- name: ifeval
- name: gpqa_diamond
env_vars:
HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```
## Reference
The following example configuration files are available in the `examples/` directory:
- `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment on Lepton platform
- `slurm_llama_3_1_8b_instruct.yaml` - vLLM deployment on SLURM cluster
- `slurm_llama_3_1_8b_instruct_hf.yaml` - vLLM deployment using HuggingFace model ID
Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.
(deployment-nim)=
# NIM Deployment
NIM (NVIDIA Inference Microservices) provides optimized inference microservices with OpenAI-compatible application programming interfaces. NIM deployments automatically handle model optimization, scaling, and resource management on supported platforms.
## Configuration
### Basic Settings
```yaml
deployment:
image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
served_model_name: meta/llama-3.1-8b-instruct
port: 8000
```
- **`image`**: NIM container image from [NVIDIA NIM Containers](https://catalog.ngc.nvidia.com/containers?filters=nvidia_nim) (required)
- **`served_model_name`**: Name used for serving the model (required)
- **`port`**: Port for the NIM server (default: 8000)
### Endpoints
```yaml
endpoints:
chat: /v1/chat/completions
completions: /v1/completions
health: /health
```
## Integration with Lepton
NIM deployment with Lepton executor:
```yaml
defaults:
- execution: lepton/default
- deployment: nim
- _self_
deployment:
image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
served_model_name: meta/llama-3.1-8b-instruct
# Platform-specific settings
lepton_config:
endpoint_name: nim-llama-3-1-8b-eval
resource_shape: gpu.1xh200
# ... additional platform settings
```
### Environment Variables
Configure environment variables for NIM container operation:
```yaml
deployment:
lepton_config:
envs:
HF_TOKEN:
value_from:
secret_name_ref: "HUGGING_FACE_HUB_TOKEN"
```
**Auto-populated Variables:**
The launcher automatically sets these environment variables from your deployment configuration:
- `SERVED_MODEL_NAME`: Set from `deployment.served_model_name`
- `NIM_MODEL_NAME`: Set from `deployment.served_model_name`
- `MODEL_PORT`: Set from `deployment.port` (default: 8000)
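The mapping above can be pictured as a simple transformation from the deployment config to the container environment (a hypothetical illustration; the launcher performs this internally):

```python
def nim_env_from_deployment(deployment: dict) -> dict:
    """Derive the auto-populated NIM environment variables from a deployment config."""
    return {
        "SERVED_MODEL_NAME": deployment["served_model_name"],
        "NIM_MODEL_NAME": deployment["served_model_name"],
        "MODEL_PORT": str(deployment.get("port", 8000)),  # default port is 8000
    }
```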
### Resource Management
#### Auto-scaling Configuration
```yaml
deployment:
lepton_config:
min_replicas: 1
max_replicas: 3
auto_scaler:
scale_down:
no_traffic_timeout: 3600
scale_from_zero: false
target_gpu_utilization_percentage: 0
target_throughput:
qpm: 2.5
```
#### Storage Mounts
Enable model caching for faster startup:
```yaml
deployment:
lepton_config:
mounts:
enabled: true
cache_path: "/path/to/model/cache"
mount_path: "/opt/nim/.cache"
```
### Security Configuration
#### API Tokens
```yaml
deployment:
lepton_config:
api_tokens:
- value: "UNIQUE_ENDPOINT_TOKEN"
```
#### Image Pull Secrets
```yaml
execution:
lepton_platform:
tasks:
image_pull_secrets:
- "lepton-nvidia-registry-secret"
```
### Complete Example
```yaml
defaults:
- execution: lepton/default
- deployment: nim
- _self_
execution:
output_dir: lepton_nim_llama_3_1_8b_results
deployment:
image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
served_model_name: meta/llama-3.1-8b-instruct
lepton_config:
endpoint_name: llama-3-1-8b
resource_shape: gpu.1xh200
min_replicas: 1
max_replicas: 3
api_tokens:
- value_from:
token_name_ref: "ENDPOINT_API_KEY"
envs:
HF_TOKEN:
value_from:
secret_name_ref: "HUGGING_FACE_HUB_TOKEN"
mounts:
enabled: true
cache_path: "/path/to/model/cache"
mount_path: "/opt/nim/.cache"
evaluation:
tasks:
- name: ifeval
```
## Examples
Refer to `packages/nemo-evaluator-launcher/examples/lepton_nim.yaml` for a complete NIM deployment example.
## Reference
- [NIM Documentation](https://docs.nvidia.com/nim/)
- [NIM Deployment Guide](https://docs.nvidia.com/nim/large-language-models/latest/deployment-guide.html)
(deployment-none)=
# None Deployment
The "none" deployment option means **no model deployment is performed**. Instead, you provide an existing OpenAI-compatible endpoint. The launcher handles running evaluation tasks while connecting to your existing endpoint.
## When to Use None Deployment
- **Existing Endpoints**: You have a running model endpoint to evaluate
- **Third-Party Services**: Testing models from NVIDIA API Catalog, OpenAI, or other providers
- **Custom Infrastructure**: Using your own deployment solution outside the launcher
- **Cost Optimization**: Reusing existing deployments across multiple evaluation runs
- **Separation of Concerns**: Keeping model deployment and evaluation as separate processes
## Key Benefits
- **No Resource Management**: No need to provision or manage model deployment resources
- **Platform Flexibility**: Works with Local, Lepton, and SLURM execution platforms
- **Quick Setup**: Minimal configuration required - just point to your endpoint
- **Cost Effective**: Leverage existing deployments without additional infrastructure
## Universal Configuration
These configuration patterns apply to all execution platforms when using "none" deployment.
### Target Endpoint Setup
```yaml
target:
api_endpoint:
model_id: meta/llama-3.1-8b-instruct # Model identifier (required)
url: https://your-endpoint.com/v1/chat/completions # Endpoint URL (required)
api_key_name: API_KEY # Environment variable name (recommended)
```
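Because `api_key_name` names an environment variable rather than holding the key itself, export that variable in the shell that launches the evaluation. A minimal sketch (the key value is a placeholder):

```shell
# API_KEY is the variable named by api_key_name above; the value is a placeholder.
export API_KEY="nvapi-example-key"
# Then launch (or dry-run) the evaluation from this same shell, e.g.:
# nemo-evaluator-launcher run --config your_config.yaml --dry-run
echo "$API_KEY"
```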
## Platform Examples
Choose your execution platform and see the specific configuration needed:
::::{tab-set}
:::{tab-item} Local
**Best for**: Development, testing, small-scale evaluations
```yaml
defaults:
- execution: local
- deployment: none
- _self_
execution:
output_dir: results
target:
api_endpoint:
model_id: meta/llama-3.2-3b-instruct
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY
evaluation:
tasks:
- name: gpqa_diamond
env_vars:
HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```
**Key Points:**
- Minimal configuration required
- Set environment variables in your shell
- Limited by local machine resources
:::
:::{tab-item} Lepton
**Best for**: Production evaluations, team environments, scalable workloads
```yaml
defaults:
- execution: lepton/default
- deployment: none
- _self_
execution:
output_dir: results
lepton_platform:
tasks:
api_tokens:
- value_from:
token_name_ref: "ENDPOINT_API_KEY"
env_vars:
HF_TOKEN:
value_from:
secret_name_ref: "HUGGING_FACE_HUB_TOKEN"
API_KEY:
value_from:
secret_name_ref: "ENDPOINT_API_KEY"
node_group: "your-node-group"
mounts:
- from: "node-nfs:shared-fs"
path: "/workspace/path"
mount_path: "/workspace"
target:
api_endpoint:
model_id: meta/llama-3.1-8b-instruct
url: https://your-endpoint.lepton.run/v1/chat/completions
api_key_name: API_KEY
evaluation:
tasks:
- name: gpqa_diamond
```
**Key Points:**
- Requires Lepton credentials (`lep login`)
- Use `secret_name_ref` for secure credential storage
- Configure node groups and storage mounts
- Handles larger evaluation workloads
:::
:::{tab-item} SLURM
**Best for**: HPC environments, large-scale evaluations, batch processing
```yaml
defaults:
- execution: slurm/default
- deployment: none
- _self_
execution:
account: your-slurm-account
output_dir: /shared/filesystem/results
walltime: "02:00:00"
partition: cpu_short
gpus_per_node: null # No GPUs needed
target:
api_endpoint:
model_id: meta/llama-3.2-3b-instruct
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com
evaluation:
tasks:
- name: gpqa_diamond
env_vars:
HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```
**Key Points:**
- Requires SLURM account and accessible output directory
- Creates one job per benchmark evaluation
- Uses CPU partitions (no GPUs needed for none deployment)
:::
::::
(deployment-sglang)=
# SGLang Deployment
SGLang is a serving framework for large language models. This deployment type launches SGLang servers using the `lmsysorg/sglang` Docker image.
## Configuration
### Required Settings
See the complete configuration structure in `configs/deployment/sglang.yaml`.
```yaml
deployment:
type: sglang
image: lmsysorg/sglang:latest
hf_model_handle: hf-model/handle # HuggingFace ID
checkpoint_path: null # or provide a path to the stored checkpoint
served_model_name: your-model-name
port: 8000
```
**Required Fields:**
- `checkpoint_path` or `hf_model_handle`: Model path or HuggingFace model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct`)
- `served_model_name`: Name for the served model
### Optional Settings
```yaml
deployment:
tensor_parallel_size: 8 # Default: 8
data_parallel_size: 1 # Default: 1
extra_args: "" # Extra SGLang server arguments
env_vars: {} # Environment variables (key: value dict)
```
**Configuration Fields:**
- `tensor_parallel_size`: Number of GPUs for tensor parallelism (default: 8)
- `data_parallel_size`: Number of data parallel replicas (default: 1)
- `extra_args`: Extra command-line arguments to pass to SGLang server
- `env_vars`: Environment variables for the container
### API Endpoints
The SGLang deployment exposes OpenAI-compatible endpoints:
```yaml
endpoints:
chat: /v1/chat/completions
completions: /v1/completions
health: /health
```
## Example Configuration
```yaml
defaults:
- execution: slurm/default
- deployment: sglang
- _self_
deployment:
checkpoint_path: Qwen/Qwen3-4B-Instruct-2507
served_model_name: qwen3-4b-instruct-2507
tensor_parallel_size: 8
data_parallel_size: 8
extra_args: ""
execution:
hostname: your-cluster-headnode
account: your-account
output_dir: /path/to/output
walltime: 02:00:00
evaluation:
tasks:
- name: gpqa_diamond
- name: ifeval
env_vars:
HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # or use HF_HOME
```
## Command Template
The launcher uses the following command template to start the SGLang server (from `configs/deployment/sglang.yaml`):
```bash
python3 -m sglang.launch_server \
--model-path ${oc.select:deployment.hf_model_handle,/checkpoint} \
--host 0.0.0.0 \
--port ${deployment.port} \
--served-model-name ${deployment.served_model_name} \
--tp ${deployment.tensor_parallel_size} \
--dp ${deployment.data_parallel_size} \
${deployment.extra_args}
```
:::{note}
The `${oc.select:deployment.hf_model_handle,/checkpoint}` syntax uses OmegaConf's select resolver: it uses `hf_model_handle` when set and otherwise falls back to the mounted `/checkpoint` path. In practice, set either `hf_model_handle` (a HuggingFace model ID) or `checkpoint_path` (a stored checkpoint).
:::
## Reference
**Configuration File:**
- Source: `packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/sglang.yaml`
**Related Documentation:**
- [Deployment Configuration Overview](index.md)
- [Execution Platform Configuration](../executors/index.md)
- [SGLang Documentation](https://docs.sglang.ai/)
(deployment-trtllm)=
# TensorRT LLM (TRT-LLM) Deployment
Configure TRT-LLM as the deployment backend for serving models during evaluation.
## Configuration Parameters
### Basic Settings
```yaml
deployment:
type: trtllm
image: nvcr.io/nvidia/tensorrt-llm/release:1.0.0
checkpoint_path: /path/to/model
served_model_name: your-model-name
port: 8000
```
### Parallelism Configuration
```yaml
deployment:
tensor_parallel_size: 4
pipeline_parallel_size: 1
```
- **tensor_parallel_size**: Number of GPUs to split the model across (default: 4)
- **pipeline_parallel_size**: Number of pipeline stages (default: 1)
### Extra Arguments and Endpoints
```yaml
deployment:
extra_args: "--ep_size 2"
endpoints:
chat: /v1/chat/completions
completions: /v1/completions
health: /health
```
The `extra_args` field passes additional command-line arguments to the `trtllm-serve serve` command.
## Complete Example
```yaml
defaults:
- execution: slurm/default
- deployment: trtllm
- _self_
deployment:
checkpoint_path: /path/to/checkpoint
served_model_name: llama-3.1-8b-instruct
tensor_parallel_size: 1
extra_args: ""
execution:
account: your-account
output_dir: /path/to/output
walltime: 02:00:00
evaluation:
tasks:
- name: ifeval
- name: gpqa_diamond
env_vars:
HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```
Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.
(evaluation-configuration)=
# Evaluation Configuration
Evaluation configuration defines which benchmarks to run and how they are configured. It is shared across all executors and can be reused between them to launch exactly the same tasks.
**Important**: Each task has its own default values that you can override. For comprehensive override options, see {ref}`parameter-overrides`.
## Configuration Structure
```yaml
evaluation:
nemo_evaluator_config: # Global overrides for all tasks
config:
params:
request_timeout: 3600
tasks:
- name: task_name # Use default benchmark configuration
- name: another_task
nemo_evaluator_config: # Task-specific overrides
config:
params: # Task-specific overrides
temperature: 0.6
top_p: 0.95
env_vars: # Task-specific environment variables
HF_TOKEN: MY_HF_TOKEN
```
## Key Components
### Global Overrides
- **`nemo_evaluator_config`**: Parameter overrides that apply to all tasks
- **`env_vars`**: Environment variables that apply to all tasks
### Task Configuration
- **`tasks`**: List of evaluation tasks to run
- **`name`**: Name of the benchmark task
- **`nemo_evaluator_config`**: Task-specific parameter overrides
- **`env_vars`**: Task-specific environment variables
For a comprehensive list of available tasks, their descriptions, and task-specific parameters, see {ref}`nemo-evaluator-containers`.
## Advanced Task Configuration
### Parameter Overrides
The overrides system is crucial for leveraging the full flexibility of the common endpoint interceptors and task configuration layer. This is where nemo-evaluator intersects with nemo-evaluator-launcher, providing a unified configuration interface.
#### Global Overrides
Settings applied to all tasks listed in the config.
```yaml
evaluation:
nemo_evaluator_config:
config:
params:
request_timeout: 3600
temperature: 0.7
```
#### Task-Specific Overrides
Parameters passed to a job for a single task. They take precedence over global evaluation settings.
```yaml
evaluation:
tasks:
- name: gpqa_diamond
nemo_evaluator_config:
config:
params:
temperature: 0.6
top_p: 0.95
max_new_tokens: 8192
parallelism: 32
- name: mbpp
nemo_evaluator_config:
config:
params:
temperature: 0.2
top_p: 0.95
max_new_tokens: 2048
extra:
n_samples: 5
```
### Environment Variables
Task-specific environment variables. These parameters are set for a single job and don't affect other tasks:
```yaml
evaluation:
tasks:
- name: task_name1
# HF_TOKEN and CUSTOM_VAR are available for task_name1
env_vars:
HF_TOKEN: MY_HF_TOKEN
CUSTOM_VAR: CUSTOM_VALUE
- name: task_name2 # HF_TOKEN and CUSTOM_VAR are not set for task_name2
```
### Dataset Directory Mounting
Some evaluation tasks require access to local datasets that must be mounted into the evaluation container. Tasks that require dataset mounting will have `NEMO_EVALUATOR_DATASET_DIR` in their `required_env_vars`.
When using such tasks, you must specify:
- **`dataset_dir`**: Path to the dataset on the host machine
- **`dataset_mount_path`** (optional): Path where the dataset should be mounted inside the container (defaults to `/datasets`)
```yaml
evaluation:
tasks:
- name: mteb.techqa
dataset_dir: /path/to/your/techqa/dataset
# dataset_mount_path: /datasets # Optional, defaults to /datasets
```
The system will:
1. Mount the host path (`dataset_dir`) to the container path (`dataset_mount_path`)
2. Automatically set the `NEMO_EVALUATOR_DATASET_DIR` environment variable to point to the mounted path inside the container
3. Validate that the required environment variable is properly configured
**Example with custom mount path:**
```yaml
evaluation:
tasks:
- name: mteb.techqa
dataset_dir: /mnt/data/techqa
dataset_mount_path: /data/techqa # Custom container path
```
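For local (Docker) execution, the mount above corresponds conceptually to a volume flag plus the injected environment variable. This is a hedged sketch of that expansion, not the launcher's literal command construction:

```shell
# Hypothetical expansion of the dataset mount for a Docker-based run:
dataset_dir="/mnt/data/techqa"
dataset_mount_path="/data/techqa"
volume_flag="--volume $dataset_dir:$dataset_mount_path"
env_flag="-e NEMO_EVALUATOR_DATASET_DIR=$dataset_mount_path"
echo "$volume_flag $env_flag"
```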
## When to Use
Use evaluation configuration when you want to:
- **Change Default Sampling Parameters**: Adjust temperature, top_p, max_new_tokens for different tasks
- **Change Default Task Values**: Override benchmark-specific default configurations
- **Configure Task-Specific Parameters**: Set custom parameters for individual benchmarks (e.g., n_samples for code generation tasks)
- **Debug and Test**: Launch with limited samples for validation
- **Adjust Endpoint Capabilities**: Configure request timeouts, max retries, and parallel request limits
:::{tip}
For overriding long strings, use YAML multiline syntax with `>-`:
```yaml
config.params.extra.custom_field: >-
This is a long string that spans multiple lines
and will be passed as a single value with spaces
replacing the newlines.
```
This preserves formatting and allows for complex multi-line configurations.
:::
## Reference
- **Parameter Overrides**: {ref}`parameter-overrides` - Complete guide to available parameters and override syntax
- **Adapter Configuration**: For advanced request/response modification (system prompts, payload modification, reasoning handling), see {ref}`nemo-evaluator-interceptors`
- **Task Configuration**: {ref}`lib-core` - Complete nemo-evaluator documentation
- **Available Tasks**: {ref}`nemo-evaluator-containers` - Browse all available evaluation tasks and benchmarks
(executors-overview)=
# Executors
Executors run evaluations by orchestrating containerized benchmarks in different environments. They handle resource management, IO paths, and job scheduling across various execution backends, from local development to large-scale cluster deployments.
**Core concepts**:
- Your model is separate from the evaluation container; communication is via an OpenAI‑compatible API
- Each benchmark runs in a Docker container pulled from the NVIDIA NGC catalog
- Execution backends can optionally manage model deployment
## Choosing an Executor
Select the executor that best matches your environment and requirements:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`desktop-download;1.5em;sd-mr-1` Local Executor
:link: local
:link-type: doc
Run evaluations on your local machine using Docker for rapid iteration and development workflows.
:::
:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Slurm Executor
:link: slurm
:link-type: doc
Execute large-scale evaluations on Slurm-managed high-performance computing clusters with optional model deployment.
:::
:::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton Executor
:link: lepton
:link-type: doc
Run evaluations on Lepton AI's hosted infrastructure with automatic model deployment and scaling.
:::
::::
:::{toctree}
:caption: Executors
:hidden:
Local Executor
Slurm Executor
Lepton Executor
:::
(executor-local)=
# Local Executor
The Local executor runs evaluations on your machine using Docker. It provides a fast way to iterate when evaluating existing endpoints; the only requirement is a working Docker installation.
See common concepts and commands in {ref}`executors-overview`.
## Prerequisites
- Docker
- Python environment with the NeMo Evaluator Launcher CLI available (install the launcher by following {ref}`gs-install`)
## Quick Start
For detailed step-by-step instructions on evaluating existing endpoints, refer to the {ref}`gs-quickstart-launcher` guide, which covers:
- Choosing models and tasks
- Setting up API keys (for NVIDIA APIs, see [Setting up API Keys](https://docs.omniverse.nvidia.com/guide-sdg/latest/setup.html#preview-and-set-up-an-api-key))
- Creating configuration files
- Running evaluations
Here's a quick overview for the Local executor:
### Run evaluation for existing endpoint
```bash
# Run evaluation
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o target.api_endpoint.api_key_name=NGC_API_KEY
```
## Environment Variables
The Local executor supports passing environment variables from your local machine to evaluation containers:
### How It Works
The executor passes environment variables to evaluation containers using `docker run -e KEY=VALUE` flags. Each value you list under `env_vars` is treated as the name of a variable in your local environment: the executor prefixes it with `$` (for example, `OPENAI_API_KEY` becomes `$OPENAI_API_KEY`) so the host value is substituted when the container starts.
### Configuration
```yaml
evaluation:
env_vars:
API_KEY: YOUR_API_KEY_ENV_VAR_NAME
CUSTOM_VAR: YOUR_CUSTOM_ENV_VAR_NAME
tasks:
- name: my_task
env_vars:
TASK_SPECIFIC_VAR: TASK_ENV_VAR_NAME
```
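As a concrete sketch of the dereferencing described above (names and values here are placeholders): the config value `YOUR_API_KEY_ENV_VAR_NAME` names a host variable, so exporting that variable is what actually supplies the secret.

```shell
# The config value names a host variable; export it before running the launcher.
export YOUR_API_KEY_ENV_VAR_NAME="sk-example-token"
# The executor effectively builds a flag like this for `docker run`:
effective_flag="-e API_KEY=$YOUR_API_KEY_ENV_VAR_NAME"
echo "$effective_flag"
```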
## Secrets and API Keys
The executor handles API keys the same way as environment variables - store them as environment variables on your machine and reference them in the `env_vars` configuration.
## Mounting and Storage
The Local executor uses Docker volume mounts for data persistence:
### Docker Volumes
- **Results Mount**: Each task's artifacts directory mounts as `/results` in evaluation containers
- **Custom Mounts**: Use the `extra_docker_args` field to define custom volume mounts (see [Advanced configuration](#advanced-configuration))
## Advanced configuration
You can customize your local executor by specifying `extra_docker_args`.
This parameter allows you to pass any flag to the `docker run` command that is executed by the NeMo Evaluator Launcher.
You can use it to mount additional volumes, set environment variables, or customize network settings.
For example, if you would like your job to use a specific docker network, you can specify:
```yaml
execution:
extra_docker_args: "--network my-custom-network"
```
Replace `my-custom-network` with `host` to access the host network.
To mount additional custom volumes, do:
```yaml
execution:
extra_docker_args: "--volume /my/local/path:/my/container/path"
```
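Multiple flags can be combined in a single string; for example, a sketch combining the two snippets above:

```yaml
execution:
  extra_docker_args: "--network my-custom-network --volume /my/local/path:/my/container/path"
```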
## Rerunning Evaluations
The Local executor generates reusable scripts for rerunning evaluations:
### Script Generation
The Local executor automatically generates scripts:
- **`run_all.sequential.sh`**: Script to run all evaluation tasks sequentially (in output directory)
- **`run.sh`**: Individual scripts for each task (in each task subdirectory)
- **Reproducible**: Scripts contain all necessary commands and configurations
### Manual Rerun
```bash
# Rerun all tasks
cd /path/to/output_dir/2024-01-15-10-30-45-abc12345/
bash run_all.sequential.sh
# Rerun specific task
cd /path/to/output_dir/2024-01-15-10-30-45-abc12345/task1/
bash run.sh
```
## Key Features
- **Docker-based execution**: Isolated, reproducible runs
- **Script generation**: Reusable scripts for rerunning evaluations
- **Real-time logs**: Status tracking via log files
## Monitoring and Job Management
For monitoring jobs, checking status, and managing evaluations, see {ref}`executors-overview`.
(executor-lepton)=
# Lepton Executor
The Lepton executor deploys endpoints and runs evaluations on Lepton AI. It's designed for fast, isolated, parallel evaluations using hosted or deployed endpoints.
See common concepts and commands in {ref}`executors-overview`.
## Prerequisites
- Lepton AI account and workspace access
- Lepton AI credentials configured
- Appropriate container images and permissions (for deployment flows)
## Install Lepton AI SDK
Install the Lepton AI SDK:
```bash
pip install leptonai
```
## Authenticate with Your Lepton Workspace
Log in to your Lepton AI workspace:
```bash
lep login
```
Follow the prompts to authenticate with your Lepton AI credentials.
## Quick Start
Run a Lepton evaluation using the provided examples:
```bash
# Deploy NIM model and run evaluation
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_nim.yaml
# Deploy vLLM model and run evaluation
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_vllm.yaml
# Use an existing endpoint (no deployment)
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_basic.yaml
```
## Parallel Deployment Strategy
- Dedicated endpoints: Each task gets its own endpoint of the same model
- Parallel deployment: All endpoints are created simultaneously (~3x faster)
- Resource isolation: Independent tasks avoid mutual interference
- Storage isolation: Per-invocation subdirectories are created in your configured mount paths
- Simple cleanup: Single command tears down endpoints and storage
```{mermaid}
graph TD
A["nemo-evaluator-launcher run"] --> B["Load Tasks"]
B --> D["Endpoints Deployment"]
D --> E1["Deployment 1: Create Endpoint 1"]
D --> E2["Deployment 2: Create Endpoint 2"]
D --> E3["Deployment 3: Create Endpoint 3"]
E1 --> F["Wait for All Ready"]
E2 --> F
E3 --> F
F --> G["Mount Storage per Task"]
G --> H["Parallel Tasks Creation as Jobs in Lepton"]
H --> J1["Task 1: Job 1 Evaluation"]
H --> J2["Task 2: Job 2 Evaluation"]
H --> J3["Task 3: Job 3 Evaluation"]
J1 --> K["Execute in Parallel"]
J2 --> K
J3 --> K
K --> L["Finish"]
```
## Configuration
Lepton executor configurations require:
- **Execution backend**: `execution: lepton/default`
- **Lepton platform settings**: Node groups, resource shapes, secrets, and storage mounts
Refer to the complete working examples in the `examples/` directory:
- `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment
- `lepton_nim_llama_3_1_8b_instruct.yaml` - NIM container deployment
- `lepton_none_llama_3_1_8b_instruct.yaml` - Use existing endpoint
These example files include:
- Lepton-specific resource configuration (`lepton_config.resource_shape`, node groups)
- Environment variable references to secrets (HuggingFace tokens, API keys)
- Storage mount configurations for model caching
- Auto-scaling settings for deployments
## Monitoring and Troubleshooting
Check the status of your evaluation runs:
```bash
# Check status of a specific invocation
nemo-evaluator-launcher status <invocation_id>
# Kill running jobs and clean up endpoints
nemo-evaluator-launcher kill <invocation_id>
```
Common issues:
- Ensure Lepton credentials are valid (`lep login`)
- Verify container images are accessible from your Lepton workspace
- Check that endpoints reach Ready state before jobs start
- Confirm secrets are configured in Lepton UI (Settings → Secrets)
(executor-slurm)=
# Slurm Executor
The Slurm executor runs evaluations on high‑performance computing (HPC) clusters managed by Slurm, an open‑source workload manager widely used in research and enterprise environments. It schedules and executes jobs across cluster nodes, enabling parallel, large‑scale evaluation runs while preserving reproducibility via containerized benchmarks.
See common concepts and commands in {ref}`executors-overview`.
Slurm can optionally host your model for the scope of an evaluation by deploying a serving container on the cluster and pointing the benchmark to that temporary endpoint. In this mode, two containers are used: one for the evaluation harness and one for the model server. The evaluation configuration includes a deployment section when this is enabled. See the examples in the examples/ directory for ready‑to‑use configurations.
If you do not require deployment on Slurm, simply omit the deployment section from your configuration and set the model's endpoint URL directly (any OpenAI‑compatible endpoint that you host elsewhere).
## Prerequisites
- Access to a Slurm cluster (with appropriate partitions/queues)
- [Pyxis SPANK plugin](https://github.com/NVIDIA/pyxis) installed on the cluster
## Configuration Overview
### Connecting to Your Slurm Cluster
To run evaluations on Slurm, specify how to connect to your cluster:
```yaml
execution:
hostname: your-cluster-headnode # Slurm headnode (login node)
username: your_username # Cluster username (defaults to $USER env var)
account: your_allocation # Slurm account or project name
output_dir: /shared/scratch/your_username/eval_results # Absolute, shared path
```
:::{note}
When specifying the parameters, make sure to provide:
- `hostname`: Slurm headnode (login node) where you normally SSH to submit jobs.
- `output_dir`: must be an **absolute path** on a shared filesystem (e.g., /shared/scratch/ or /home/) accessible to both the headnode and compute nodes.
:::
### Model Deployment Options
When deploying models on Slurm, you have two options for specifying your model source:
#### Option 1: HuggingFace Models (Recommended - Automatic Download)
- Use valid Hugging Face model IDs for `hf_model_handle` (for example, `meta-llama/Llama-3.1-8B-Instruct`).
- Browse model IDs: [Hugging Face Models](https://huggingface.co/models).
```yaml
deployment:
checkpoint_path: null # Set to null when using hf_model_handle
hf_model_handle: meta-llama/Llama-3.1-8B-Instruct # HuggingFace model ID
```
**Benefits:**
- Model is automatically downloaded during deployment
- No need to pre-download or manage model files
- Works with any HuggingFace model (public or private with valid access tokens)
**Requirements:**
- Set `HF_TOKEN` environment variable if accessing gated models
- Internet access from compute nodes (or model cached locally)
#### Option 2: Local Model Files (Manual Setup Required)
If you work with a checkpoint stored locally on the cluster, use `checkpoint_path`:
```yaml
deployment:
checkpoint_path: /shared/models/llama-3.1-8b-instruct # model directory accessible to compute nodes
# Do NOT set hf_model_handle when using checkpoint_path
```
**Note:**
- The directory must exist, be accessible from compute nodes, and contain model files
- Slurm does not automatically download models when using `checkpoint_path`
### Environment Variables
The Slurm executor supports environment variables through `execution.env_vars`:
```yaml
execution:
env_vars:
deployment:
CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
USER: ${oc.env:USER} # References host environment variable
evaluation:
CUSTOM_VAR: "YOUR_CUSTOM_ENV_VAR_VALUE" # Set the value directly
evaluation:
env_vars:
CUSTOM_VAR: CUSTOM_ENV_VAR_NAME # Please note, this is an env var name!
tasks:
- name: my_task
env_vars:
TASK_SPECIFIC_VAR: TASK_ENV_VAR_NAME # Please note, this is an env var name!
```
**How to use environment variables:**
- **Deployment Variables**: Use `execution.env_vars.deployment` for model serving containers
- **Evaluation Variables**: Use `execution.env_vars.evaluation` for evaluation containers
- **Direct Values**: Use quoted strings for direct values
- **Hydra Environment Variables**: Use `${oc.env:VARIABLE_NAME}` to reference host environment variables
:::{note}
The `${oc.env:VARIABLE_NAME}` syntax reads variables defined in your local environment (the one you use to execute `nemo-evaluator-launcher run` command), not the environment on the SLURM cluster.
:::
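A quick way to confirm the resolver will find a value is to check the variable in the launching shell before submitting. A sketch (the token value is a placeholder):

```shell
# ${oc.env:HF_TOKEN} resolves against this local environment at compose time.
export HF_TOKEN="hf_example_token"
# Fail fast if the variable the config references is missing:
: "${HF_TOKEN:?HF_TOKEN must be set before running nemo-evaluator-launcher}"
echo "$HF_TOKEN"
```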
### Secrets and API Keys
API keys are handled the same way as environment variables - store them as environment variables on your machine and reference them in the `execution.env_vars` configuration.
**Security Considerations:**
- **No Hardcoding**: Never put API keys directly in configuration files, use `${oc.env:ENV_VAR_NAME}` instead.
- **SSH Security**: Ensure secure SSH configuration for key transmission to the cluster.
- **File Permissions**: Ensure configuration files have appropriate permissions (not world-readable).
- **Public Clusters**: Secrets in `execution.env_vars` are stored in plain text in the batch script and saved under `output_dir` on the login node. Use caution when handling sensitive data on public clusters.
### Mounting and Storage
The Slurm executor provides sophisticated mounting capabilities:
```yaml
execution:
mounts:
deployment:
/path/to/checkpoints: /checkpoint
/path/to/cache: /cache
evaluation:
/path/to/data: /data
/path/to/results: /results
mount_home: true # Mount user home directory
```
**Mount Types:**
- **Deployment Mounts**: For model checkpoints, cache directories, and model data
- **Evaluation Mounts**: For input data, additional artifacts, and evaluation-specific files
- **Home Mount**: Optional mounting of user home directory (enabled by default)
## Complete Configuration Example
Here's a complete Slurm executor configuration using HuggingFace models:
```yaml
# examples/slurm_llama_3_1_8b_instruct.yaml
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: your-cluster-headnode
account: your_account
output_dir: /shared/scratch/your_username/eval_results
partition: gpu
walltime: "04:00:00"
gpus_per_node: 8
env_vars:
deployment:
HF_TOKEN: ${oc.env:HF_TOKEN} # Needed to access meta-llama/Llama-3.1-8B-Instruct gated model
deployment:
hf_model_handle: meta-llama/Llama-3.1-8B-Instruct
checkpoint_path: null
served_model_name: meta-llama/Llama-3.1-8B-Instruct
tensor_parallel_size: 1
data_parallel_size: 8
evaluation:
tasks:
- name: hellaswag
- name: arc_challenge
- name: winogrande
```
This configuration:
- Uses the Slurm execution backend
- Deploys a vLLM model server on the cluster
- Requests GPU resources (8 GPUs per node, 4-hour time limit)
- Runs three benchmark tasks in parallel
- Saves benchmark artifacts to `output_dir`
## Resuming
The Slurm executor includes advanced auto-resume capabilities:
### Automatic Resumption
- **Timeout Handling**: Jobs automatically resume after timeout
- **Preemption Recovery**: Automatic resumption after job preemption
- **Node Failure Recovery**: Jobs resume after node failures
- **Dependency Management**: Uses Slurm job dependencies for resumption
### How It Works
1. **Initial Submission**: Job is submitted with auto-resume handler
2. **Failure Detection**: Script detects timeout/preemption/failure
3. **Automatic Resubmission**: New job is submitted with dependency on previous job
4. **Progress Preservation**: Evaluation continues from where it left off
### Maximum Total Walltime
To prevent jobs from resuming indefinitely, you can configure a maximum total wall-clock time across all resumes using the `max_walltime` parameter:
```yaml
execution:
walltime: "04:00:00" # Time limit per job submission
max_walltime: "24:00:00" # Maximum total time across all resumes (optional)
```
**How it works:**
- The actual runtime of each job is tracked using SLURM's `sacct` command
- When a job resumes, the previous job's actual elapsed time is added to the accumulated total
- Before starting each resumed job, the accumulated runtime is checked against `max_walltime`
- If the accumulated runtime exceeds `max_walltime`, the job chain stops with an error
- This prevents runaway jobs that might otherwise resume indefinitely
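The accounting above can be sketched in a few lines of shell (illustrative only; the launcher's actual implementation reads elapsed times from `sacct`, and the attempt durations here are hypothetical):

```shell
# Convert HH:MM:SS to seconds, sum per-attempt elapsed times, compare to max.
to_seconds() {
  h="${1%%:*}"; rest="${1#*:}"; m="${rest%%:*}"; s="${rest#*:}"
  echo $(( h * 3600 + m * 60 + s ))
}
# Two hypothetical attempts: one full 4-hour slice, then a partial run.
accumulated=$(( $(to_seconds "04:00:00") + $(to_seconds "03:12:30") ))
max_walltime=$(to_seconds "24:00:00")
if [ "$accumulated" -lt "$max_walltime" ]; then may_resume=yes; else may_resume=no; fi
echo "$accumulated $may_resume"
```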
**Configuration:**
- `max_walltime`: Maximum total runtime in `HH:MM:SS` format (e.g., `"24:00:00"` for 24 hours)
- Defaults to `"120:00:00"` (120 hours). Set to `null` for unlimited resuming
:::{note}
The `max_walltime` tracks **actual job execution time only**, excluding time spent waiting in the queue. This ensures accurate runtime accounting even when jobs are repeatedly preempted or must wait for resources.
:::
## Monitoring and Job Management
For monitoring jobs, checking status, and managing evaluations, see {ref}`executors-overview`.
(configuration-overview)=
# Configuration
The nemo-evaluator-launcher uses [Hydra](https://hydra.cc/docs/intro/) for configuration management, enabling flexible composition and command-line overrides.
## How it Works
1. **Choose your deployment**: Start with `deployment: none` to use existing endpoints
2. **Set your execution platform**: Use `execution: local` for development
3. **Configure your target**: Point to your API endpoint
4. **Select benchmarks**: Add evaluation tasks
5. **Test first**: Always use `--dry-run` to verify
```bash
# Verify configuration
nemo-evaluator-launcher run --config your_config.yaml --dry-run
# Run evaluation
nemo-evaluator-launcher run --config your_config.yaml
```
### Basic Structure
Every configuration has four main sections:
```yaml
defaults:
- execution: local # Where to run: local, lepton, slurm
- deployment: none # How to deploy: none, vllm, sglang, nim, trtllm, generic
- _self_
execution:
output_dir: results # Required: where to save results
target: # Required for deployment: none
api_endpoint:
model_id: meta/llama-3.2-3b-instruct
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY
evaluation: # Required: what benchmarks to run
  tasks:
    - name: gpqa_diamond
    - name: ifeval
```
## Deployment Options
Choose how to serve your model for evaluation:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`globe;1.5em;sd-mr-1` None (External)
:link: deployment/none
:link-type: doc
Use existing API endpoints like NVIDIA API Catalog, OpenAI, or custom deployments. No model deployment needed.
:::
:::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` vLLM
:link: deployment/vllm
:link-type: doc
High-performance LLM serving with advanced parallelism strategies. Best for production workloads and large models.
:::
:::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` SGLang
:link: deployment/sglang
:link-type: doc
Fast serving framework optimized for structured generation and high-throughput inference with efficient memory usage.
:::
:::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` NIM
:link: deployment/nim
:link-type: doc
NVIDIA-optimized inference microservices with automatic scaling, optimization, and enterprise-grade features.
:::
:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` TRT-LLM
:link: deployment/trtllm
:link-type: doc
High-performance inference with NVIDIA TensorRT-LLM optimized engines.
:::
:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Generic
:link: deployment/generic
:link-type: doc
Deploy models using a fully custom setup.
:::
::::
## Execution Platforms
Choose where to run your evaluations:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`desktop-download;1.5em;sd-mr-1` Local
:link: executors/local
:link-type: doc
Docker-based evaluation on your local machine. Perfect for development, testing, and small-scale evaluations.
:::
:::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton
:link: executors/lepton
:link-type: doc
Cloud execution with on-demand GPU provisioning. Ideal for production evaluations and scalable workloads.
:::
:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` SLURM
:link: executors/slurm
:link-type: doc
HPC cluster execution with resource management. Best for large-scale evaluations and batch processing.
:::
::::
## Evaluation Configuration
::::{grid} 1 1 1 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`checklist;1.5em;sd-mr-1` Tasks & Benchmarks
:link: evaluation/index
:link-type: doc
Configure evaluation tasks, parameter overrides, and environment variables for your benchmarks.
:::
::::
## Command Line Overrides
Override any configuration value using the `-o` flag:
```bash
# Basic override
nemo-evaluator-launcher run --config your_config.yaml \
-o execution.output_dir=my_results
# Multiple overrides
nemo-evaluator-launcher run --config your_config.yaml \
-o execution.output_dir=my_results \
-o target.api_endpoint.url="https://new-endpoint.com/v1/chat/completions"
```
```{toctree}
:caption: Configuration
:hidden:
Deployment
Executors
Evaluation
```
(exporter-gsheets)=
# Google Sheets Exporter (`gsheets`)
Exports accuracy metrics to a Google Sheet. Dynamically creates/extends header columns based on observed metrics and appends one row per job.
- **Purpose**: Centralized spreadsheet for tracking results across runs
- **Requirements**: `gspread` installed and a Google service account with access
## Usage
Export evaluation results to a Google Sheets spreadsheet for easy sharing and analysis.
::::{tab-set}
:::{tab-item} CLI
Export results from a specific evaluation run to Google Sheets:
```bash
# Export results using default spreadsheet name
nemo-evaluator-launcher export 8abcd123 --dest gsheets
# Export with custom spreadsheet name and ID
nemo-evaluator-launcher export 8abcd123 --dest gsheets \
-o export.gsheets.spreadsheet_name="My Results" \
-o export.gsheets.spreadsheet_id=1ABC...XYZ
```
:::
:::{tab-item} Python
Export results programmatically with custom configuration:
```python
from nemo_evaluator_launcher.api.functional import export_results
# Basic export to Google Sheets
export_results(
invocation_ids=["8abcd123"],
dest="gsheets",
config={
"spreadsheet_name": "NeMo Evaluator Launcher Results"
}
)
# Export with service account and filtered metrics
export_results(
invocation_ids=["8abcd123", "9def4567"],
dest="gsheets",
config={
"spreadsheet_name": "Model Comparison Results",
"service_account_file": "/path/to/service-account.json",
"log_metrics": ["accuracy", "f1_score"]
}
)
```
:::
:::{tab-item} YAML Config
Configure Google Sheets export in your evaluation YAML file for automatic export on completion:
```yaml
execution:
auto_export:
destinations: ["gsheets"]
export:
gsheets:
spreadsheet_name: "LLM Evaluation Results"
spreadsheet_id: "1ABC...XYZ" # optional: use existing sheet
service_account_file: "/path/to/service-account.json"
log_metrics: ["accuracy", "pass@1"]
```
::::
## Key Configuration
```{list-table}
:header-rows: 1
:widths: 25 25 25 25
* - Parameter
- Type
- Description
- Default/Notes
* - `service_account_file`
- str, optional
- Path to service account JSON
- Uses default credentials if omitted
* - `spreadsheet_name`
- str, optional
- Target spreadsheet name. Used to open existing sheets or name new ones.
- Default: "NeMo Evaluator Launcher Results"
* - `spreadsheet_id`
- str, optional
  - Target spreadsheet ID. Find it in the spreadsheet URL: `https://docs.google.com/spreadsheets/d/<spreadsheet_id>/edit`
- Required if your service account can't create sheets due to quota limits.
* - `log_metrics`
- list[str], optional
- Filter metrics to log
- All metrics if omitted
```
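The dynamic header behavior described above can be sketched in plain Python. This is an illustrative model of the append-one-row-per-job logic, not the exporter's actual code; `append_job_row` is a hypothetical helper:

```python
def append_job_row(header: list[str], rows: list[list], job_metrics: dict) -> None:
    """Extend the header with newly observed metric names, then append one aligned row."""
    for name in job_metrics:
        if name not in header:
            header.append(name)  # create a column for a metric seen for the first time
    # Align this job's values to the (possibly extended) header; blank cells for missing metrics.
    rows.append([job_metrics.get(col, "") for col in header])

header = ["invocation_id", "accuracy"]
rows: list[list] = []
append_job_row(header, rows, {"invocation_id": "8abcd123", "accuracy": 0.81, "f1_score": 0.77})
print(header)  # ['invocation_id', 'accuracy', 'f1_score']
```

Rows appended before a new metric appeared simply stay shorter than the extended header, which matches how a spreadsheet leaves those cells empty.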
(exporters-overview)=
# Exporters
Exporters move evaluation results and artifacts from completed runs to external destinations for analysis, sharing, and reporting. They provide flexible options for integrating evaluation results into your existing workflows and tools.
## How to Set an Exporter
::::{tab-set}
:::{tab-item} CLI
```bash
nemo-evaluator-launcher export <invocation_id> [<invocation_id> ...] \
  --dest <destination> \
  [options]
```
:::
:::{tab-item} Python
```python
from nemo_evaluator_launcher.api.functional import export_results
export_results(
invocation_ids=["8abcd123"],
dest="local",
config={
"format": "json",
"output_dir": "./out"
}
)
```
:::
::::
## Choosing an Exporter
Select exporters based on your analysis and reporting needs:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`file-directory;1.5em;sd-mr-1` Local Files
:link: local
:link-type: doc
Export results and artifacts to local or network file systems for custom analysis and archival.
:::
:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Weights & Biases
:link: wandb
:link-type: doc
Track metrics, artifacts, and run metadata in W&B for comprehensive experiment management.
:::
:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` MLflow
:link: mlflow
:link-type: doc
Export metrics and artifacts to MLflow Tracking Server for centralized ML lifecycle management.
:::
:::{grid-item-card} {octicon}`table;1.5em;sd-mr-1` Google Sheets
:link: gsheets
:link-type: doc
Export metrics to Google Sheets for easy sharing, reporting, and collaborative analysis.
:::
::::
You can configure multiple exporters simultaneously to support different stakeholder needs and workflow integration points.
:::{toctree}
:caption: Exporters
:hidden:
Local Files
Weights & Biases
MLflow
Google Sheets
:::
(exporter-local)=
# Local Exporter (`local`)
Exports artifacts and optional summaries to the local filesystem. When used with remote executors, stages artifacts from remote locations. Can produce consolidated JSON or CSV summaries of metrics.
## Usage
Export evaluation results and artifacts to your local filesystem with optional summary reports.
::::{tab-set}
:::{tab-item} CLI
Export artifacts and generate summary reports locally:
```bash
# Basic export to current directory
nemo-evaluator-launcher export 8abcd123 --dest local
# Export with JSON summary to custom directory
nemo-evaluator-launcher export 8abcd123 --dest local --format json --output-dir ./evaluation-results/
# Export multiple runs with CSV summary and logs included
nemo-evaluator-launcher export 8abcd123 9def4567 --dest local --format csv --copy-logs --output-dir ./results
# Export only specific metrics to a custom filename
nemo-evaluator-launcher export 8abcd123 --dest local --format json --log-metrics accuracy --log-metrics bleu --output-filename model_metrics.json
```
:::
:::{tab-item} Python
Export results programmatically with flexible configuration:
```python
from nemo_evaluator_launcher.api.functional import export_results
# Basic local export with JSON summary
export_results(
invocation_ids=["8abcd123"],
dest="local",
config={
"format": "json",
"output_dir": "./results"
}
)
# Export multiple runs with comprehensive configuration
export_results(
invocation_ids=["8abcd123", "9def4567"],
dest="local",
config={
"output_dir": "./evaluation-outputs",
"format": "csv",
"copy_logs": True,
"only_required": False, # Include all artifacts
"log_metrics": ["accuracy", "f1_score", "perplexity"],
"output_filename": "comprehensive_results.csv"
}
)
# Export artifacts only (no summary)
export_results(
invocation_ids=["8abcd123"],
dest="local",
config={
"output_dir": "./artifacts-only",
"format": None, # No summary file
"copy_logs": True
}
)
```
:::
:::{tab-item} YAML Config
Configure local export in your evaluation YAML file for automatic export on completion:
```yaml
execution:
auto_export:
destinations: ["local"]
export:
local:
format: "json"
output_dir: "./results"
```
::::
## Key Configuration
```{list-table}
:header-rows: 1
:widths: 25 15 45 15
* - Parameter
- Type
- Description
- Default
* - `output_dir`
- str
- Destination directory for exported results
- `.` (CLI), `./nemo-evaluator-launcher-results` (Python API)
* - `copy_logs`
- bool
- Include logs alongside artifacts
- `false`
* - `only_required`
- bool
- Copy only required and optional artifacts; excludes other files
- `true`
* - `format`
- str | null
- Summary format: `json`, `csv`, or `null` for artifacts only
- `null`
* - `log_metrics`
- list[str]
- Filter metrics by name (exact or substring match)
- All metrics
* - `output_filename`
- str
- Override default summary filename (`processed_results.json` or `processed_results.csv`)
  - `processed_results.<format>`
```
(exporter-mlflow)=
# MLflow Exporter (`mlflow`)
Exports accuracy metrics and artifacts to an MLflow Tracking Server.
- **Purpose**: Centralize metrics, parameters, and artifacts in MLflow for experiment tracking
- **Requirements**: `mlflow` package installed and a reachable MLflow tracking server
:::{dropdown} **Prerequisites: MLflow Server Setup**
:open:
Before exporting results, ensure that an **MLflow Tracking Server** is running and reachable.
If no server is active, export attempts will fail with connection errors.
### Quick Start: Local Tracking Server
For local development or testing:
```bash
# Install the launcher with MLflow support
pip install nemo-evaluator-launcher[mlflow]
# Start a local tracking server (runs on: http://127.0.0.1:5000)
mlflow server --host 127.0.0.1 --port 5000
```
This starts MLflow with a local SQLite backend and a file-based artifact store under the current directory.
### Production Deployments
For production or multi-user setups:
* **Remote MLflow Server**: Deploy MLflow on a dedicated VM or container.
* **Docker**:
```bash
docker run -p 5000:5000 ghcr.io/mlflow/mlflow:latest \
mlflow server --host 0.0.0.0
```
* **Cloud-Managed Services**: Use hosted options such as **Databricks MLflow** or **AWS SageMaker MLflow**.
For detailed deployment and configuration options, see the
[official MLflow Tracking Server documentation](https://mlflow.org/docs/latest/tracking/server.html).
:::
## Usage
Export evaluation results to MLflow Tracking Server for centralized experiment management.
::::{tab-set}
:::{tab-item} Auto-Export (Recommended)
Configure MLflow export to run automatically after evaluation completes. Add MLflow configuration to your run config YAML file:
```yaml
execution:
auto_export:
destinations: ["mlflow"]
export:
mlflow:
tracking_uri: "http://mlflow.example.com:5000"
target:
api_endpoint:
model_id: meta/llama-3.2-3b-instruct
url: https://integrate.api.nvidia.com/v1/chat/completions
evaluation:
tasks:
- name: simple_evals.mmlu
```
Alternatively, you can use the `MLFLOW_TRACKING_URI` environment variable:
```yaml
execution:
auto_export:
destinations: ["mlflow"]
# Export-related env vars (placeholders expanded at runtime)
env_vars:
export:
# you can skip export.mlflow.tracking_uri if you set this var
MLFLOW_TRACKING_URI: MLFLOW_TRACKING_URI
```
Set optional fields to customize your export:
```yaml
execution:
auto_export:
destinations: ["mlflow"]
export:
mlflow:
tracking_uri: "http://mlflow.example.com:5000"
experiment_name: "llm-evaluation"
description: "Llama 3.1 8B evaluation"
log_metrics: ["mmlu_score_macro", "mmlu_score_micro"]
tags:
model_family: "llama"
version: "3.1"
extra_metadata:
hardware: "A100"
batch_size: 32
log_artifacts: true
```
Run the evaluation with auto-export enabled:
```bash
nemo-evaluator-launcher run --config ./my_config.yaml
```
:::
:::{tab-item} Manual Export (Python API)
Export results programmatically after evaluation completes:
```python
from nemo_evaluator_launcher.api.functional import export_results
# Basic MLflow export
export_results(
invocation_ids=["8abcd123"],
dest="mlflow",
config={
"tracking_uri": "http://mlflow:5000",
"experiment_name": "model-evaluation"
}
)
# Export with metadata and tags
export_results(
invocation_ids=["8abcd123"],
dest="mlflow",
config={
"tracking_uri": "http://mlflow:5000",
"experiment_name": "llm-benchmarks",
"run_name": "llama-3.1-8b-mmlu",
"description": "Evaluation of Llama 3.1 8B on MMLU",
"tags": {
"model_family": "llama",
"model_version": "3.1",
"benchmark": "mmlu"
},
"log_metrics": ["accuracy"],
"extra_metadata": {
"hardware": "A100-80GB",
"batch_size": 32
}
}
)
# Export with artifacts disabled
export_results(
invocation_ids=["8abcd123"],
dest="mlflow",
config={
"tracking_uri": "http://mlflow:5000",
"experiment_name": "model-comparison",
"log_artifacts": False
}
)
# Skip if run already exists
export_results(
invocation_ids=["8abcd123"],
dest="mlflow",
config={
"tracking_uri": "http://mlflow:5000",
"experiment_name": "nightly-evals",
"skip_existing": True
}
)
```
:::
:::{tab-item} Manual Export (CLI)
Export results after evaluation completes:
```shell
# Default export
nemo-evaluator-launcher export 8abcd123 --dest mlflow
# With overrides
nemo-evaluator-launcher export 8abcd123 --dest mlflow \
-o export.mlflow.tracking_uri=http://mlflow:5000 \
-o export.mlflow.experiment_name=my-exp
# With metric filtering
nemo-evaluator-launcher export 8abcd123 --dest mlflow --log-metrics accuracy pass@1
```
:::
::::
## Configuration Parameters
```{list-table}
:header-rows: 1
:widths: 25 15 45 15
* - Parameter
- Type
- Description
- Default
* - `tracking_uri`
- str
- MLflow tracking server URI
- Required if env var `MLFLOW_TRACKING_URI` is not set
* - `experiment_name`
- str
- MLflow experiment name
- `"nemo-evaluator-launcher"`
* - `run_name`
- str
- Run display name
- Auto-generated
* - `description`
- str
- Run description
- None
* - `tags`
- dict[str, str]
- Custom tags for the run
- None
* - `extra_metadata`
- dict
- Additional parameters logged to MLflow
- None
* - `skip_existing`
- bool
- Skip export if run exists for invocation. Useful to avoid creating duplicate runs when re-exporting.
- `false`
* - `log_metrics`
- list[str]
- Filter metrics by substring match
- All metrics
* - `log_artifacts`
- bool
- Upload evaluation artifacts
- `true`
* - `log_logs`
- bool
- Upload execution logs
- `false`
* - `only_required`
- bool
- Copy only required artifacts
- `true`
```
(exporter-wandb)=
# Weights & Biases Exporter (`wandb`)
Exports accuracy metrics and artifacts to W&B. Supports either per-task runs or a single multi-task run per invocation, with artifact logging and run metadata.
- **Purpose**: Track runs, metrics, and artifacts in W&B
- **Requirements**: `wandb` installed and credentials configured
## Usage
Export evaluation results to Weights & Biases for experiment tracking, visualization, and collaboration.
::::{tab-set}
:::{tab-item} CLI
Basic export to W&B using credentials and project settings from your evaluation configuration:
```bash
# Export to W&B (uses config from evaluation run)
nemo-evaluator-launcher export 8abcd123 --dest wandb
# Filter metrics to export specific measurements
nemo-evaluator-launcher export 8abcd123 --dest wandb --log-metrics accuracy f1_score
```
```{note}
Specify W&B configuration (entity, project, tags, etc.) in your evaluation YAML configuration file under `export.wandb`. The CLI export command reads these settings from the stored job configuration.
```
:::
:::{tab-item} Python
Export results programmatically with W&B configuration:
```python
from nemo_evaluator_launcher.api.functional import export_results
# Basic W&B export
export_results(
invocation_ids=["8abcd123"],
dest="wandb",
config={
"entity": "myorg",
"project": "model-evaluations"
}
)
# Export with metadata and organization
export_results(
invocation_ids=["8abcd123"],
dest="wandb",
config={
"entity": "myorg",
"project": "llm-benchmarks",
"name": "llama-3.1-8b-eval",
"group": "llama-family-comparison",
"description": "Evaluation of Llama 3.1 8B on benchmarks",
"tags": ["llama-3.1", "8b"],
"log_mode": "per_task",
"log_metrics": ["accuracy"],
"log_artifacts": True,
"extra_metadata": {
"hardware": "A100-80GB"
}
}
)
# Multi-task mode: single run for all tasks
export_results(
invocation_ids=["8abcd123"],
dest="wandb",
config={
"entity": "myorg",
"project": "model-comparison",
"log_mode": "multi_task",
"log_artifacts": False
}
)
```
:::
:::{tab-item} YAML Config
Configure W&B export in your evaluation YAML file for automatic export on completion:
```yaml
execution:
auto_export:
destinations: ["wandb"]
# Export-related env vars (placeholders expanded at runtime)
env_vars:
export:
WANDB_API_KEY: WANDB_API_KEY
export:
wandb:
entity: "myorg"
project: "llm-benchmarks"
name: "llama-3.1-8b-instruct-v1"
group: "baseline-evals"
tags: ["llama-3.1", "baseline"]
description: "Baseline evaluation"
log_mode: "multi_task"
log_metrics: ["accuracy"]
log_artifacts: true
log_logs: true
only_required: false
extra_metadata:
hardware: "H100"
checkpoint: "path/to/checkpoint"
```
:::
::::
## Configuration Parameters
```{list-table}
:header-rows: 1
:widths: 20 15 50 15
* - Parameter
- Type
- Description
- Default
* - `entity`
- str
- W&B entity (organization or username)
- Required
* - `project`
- str
- W&B project name
- Required
* - `log_mode`
- str
- Logging mode: `per_task` creates separate runs for each evaluation task; `multi_task` creates a single run for all tasks
- `per_task`
* - `name`
- str
- Run display name. If not specified, auto-generated as `eval-{invocation_id}-{benchmark}` (per_task) or `eval-{invocation_id}` (multi_task)
- Auto-generated
* - `group`
- str
- Run group for organizing related runs
- Invocation ID
* - `tags`
- list[str]
- Tags for categorizing the run
- None
* - `description`
- str
- Run description (stored as W&B notes)
- None
* - `log_metrics`
- list[str]
- Metric name patterns to filter (e.g., `["accuracy", "f1"]`). Logs only metrics containing these substrings
- All metrics
* - `log_artifacts`
- bool
- Whether to upload evaluation artifacts (results files, configs) to W&B
- `true`
* - `log_logs`
- bool
- Upload execution logs
- `false`
* - `only_required`
- bool
- Copy only required artifacts
- `true`
* - `extra_metadata`
- dict
- Additional metadata stored in run config (e.g., hardware, hyperparameters)
- `{}`
* - `job_type`
- str
- W&B job type classification
- `evaluation`
```
(lib-launcher)=
# NeMo Evaluator Launcher
The *Orchestration Layer* empowers you to run AI model evaluations at scale. Use the unified CLI and programmatic interfaces to discover benchmarks, configure runs, submit jobs, monitor progress, and export results.
:::{tip}
**New to evaluation?** Start with {ref}`gs-quickstart-launcher` for a step-by-step walkthrough.
:::
## Get Started
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Quickstart
:link: ../../get-started/quickstart/launcher
:link-type: doc
Step-by-step guide to install, configure, and run your first evaluation in minutes.
:::
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration
:link: configuration/index
:link-type: doc
Complete configuration schema, examples, and advanced patterns for all use cases.
:::
::::
## Execution
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Executors
:link: configuration/executors/index
:link-type: doc
Execute evaluations on your local machine, HPC cluster (Slurm), or cloud platform (Lepton AI).
:::
:::{grid-item-card} {octicon}`device-desktop;1.5em;sd-mr-1` Local Executor
:link: configuration/executors/local
:link-type: doc
Docker-based evaluation on your workstation. Perfect for development and testing.
:::
:::{grid-item-card} {octicon}`organization;1.5em;sd-mr-1` Slurm Executor
:link: configuration/executors/slurm
:link-type: doc
HPC cluster execution with automatic resource management and job scheduling.
:::
:::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton Executor
:link: configuration/executors/lepton
:link-type: doc
Cloud execution with on-demand GPU provisioning and automatic scaling.
:::
::::
## Export
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`upload;1.5em;sd-mr-1` Exporters
:link: exporters/index
:link-type: doc
Export results to MLflow, Weights & Biases, Google Sheets, or local files with one command.
:::
:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` MLflow Export
:link: exporters/mlflow
:link-type: doc
Export evaluation results and metrics to MLflow for experiment tracking.
:::
:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` W&B Export
:link: exporters/wandb
:link-type: doc
Integrate with Weights & Biases for advanced visualization and collaboration.
:::
:::{grid-item-card} {octicon}`table;1.5em;sd-mr-1` Sheets Export
:link: exporters/gsheets
:link-type: doc
Export to Google Sheets for easy sharing and analysis with stakeholders.
:::
::::
## Typical Workflow
1. **Choose execution backend** (local, Slurm, Lepton AI)
2. **Select example configuration** from the examples directory
3. **Point to your model endpoint** (OpenAI-compatible API)
4. **Launch evaluation** via CLI or Python API
5. **Monitor progress** and export results to your preferred platform
## When to Use the Launcher
Use the launcher whenever you want:
- **Unified interface** for running evaluations across different backends
- **Multi-benchmark coordination** with concurrent execution
- **Turnkey reproducibility** with saved configurations
- **Easy result export** to MLOps platforms and dashboards
- **Production-ready orchestration** with monitoring and lifecycle management
:::{toctree}
:caption: NeMo Evaluator Launcher
:hidden:
About NeMo Evaluator Launcher
Configuration
Exporters
:::
# API Reference
## Available Data Classes
The API provides several dataclasses for configuration:
```python
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig, # Main evaluation configuration
EvaluationTarget, # Target model configuration
ConfigParams, # Evaluation parameters
ApiEndpoint, # API endpoint configuration
EvaluationResult, # Evaluation results
TaskResult, # Individual task results
MetricResult, # Metric scores
Score, # Score representation
ScoreStats, # Score statistics
GroupResult, # Grouped results
EndpointType, # Endpoint type enum
Evaluation # Complete evaluation object
)
```
## `run_eval`
The main CLI entry point for running evaluations. It parses command-line arguments rather than taking parameters directly.
```python
from nemo_evaluator.api.run import run_eval
def run_eval() -> None:
"""
CLI entry point for running evaluations.
This function parses command line arguments and executes evaluations.
It does not take parameters directly - all configuration is passed through CLI arguments.
CLI Arguments:
--eval_type: Type of evaluation to run (such as "mmlu_pro", "gpqa_diamond")
--model_id: Model identifier (such as "meta/llama-3.2-3b-instruct")
        --model_url: API endpoint URL (such as "https://integrate.api.nvidia.com/v1/chat/completions" for the chat endpoint type)
--model_type: Endpoint type ("chat", "completions", "vlm", "embedding")
--api_key_name: Environment variable name for API key integration with endpoints (optional)
--output_dir: Output directory for results
--run_config: Path to YAML Run Configuration file (optional)
--overrides: Comma-separated dot-style parameter overrides (optional)
--dry_run: Show rendered config without running (optional)
--debug: Enable debug logging (optional, deprecated, use NV_LOG_LEVEL=DEBUG env var)
Usage:
run_eval() # Parses sys.argv automatically
"""
```
:::{note}
The `run_eval()` function is designed as a CLI entry point. For programmatic usage, use the underlying configuration objects and the `evaluate()` function directly.
:::
## `evaluate`
The core evaluation function for programmatic usage.
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, EvaluationTarget
def evaluate(
eval_cfg: EvaluationConfig,
target_cfg: EvaluationTarget
) -> EvaluationResult:
"""
Run an evaluation using configuration objects.
Args:
eval_cfg: Evaluation configuration object
target_cfg: Target configuration object
Returns:
EvaluationResult: Evaluation results and metadata
"""
```
**Example Programmatic Usage:**
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
EvaluationConfig,
EvaluationTarget,
ConfigParams,
ApiEndpoint
)
# Create evaluation configuration
eval_config = EvaluationConfig(
type="simple_evals.mmlu_pro",
output_dir="./results",
params=ConfigParams(
limit_samples=100,
temperature=0.1
)
)
# Create target configuration
target_config = EvaluationTarget(
api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
model_id="meta/llama-3.2-3b-instruct",
type="chat",
api_key="MY_API_KEY" # Name of the environment variable that stores api_key
)
)
# Run evaluation
result = evaluate(eval_config, target_config)
```
## Data Structures
### `EvaluationConfig`
Configuration for evaluation runs, defined in `api_dataclasses.py`.
```python
from nemo_evaluator.api.api_dataclasses import EvaluationConfig
class EvaluationConfig:
"""Configuration for evaluation runs."""
type: str # Type of evaluation - benchmark to be run
output_dir: str # Output directory
    params: ConfigParams  # Evaluation parameters (sampling, limits, overrides)
```
### `EvaluationTarget`
Target configuration for API endpoints, defined in `api_dataclasses.py`.
```python
from nemo_evaluator.api.api_dataclasses import EvaluationTarget, EndpointType
class EvaluationTarget:
"""Target configuration for API endpoints."""
api_endpoint: ApiEndpoint # API endpoint to be used for evaluation
class ApiEndpoint:
url: str # API endpoint URL
model_id: str # Model name or identifier
type: str # Endpoint type (chat, completions, vlm, or embedding)
api_key: str # Name of the env variable that stores API key
adapter_config: AdapterConfig # Adapter configuration
```
In the ApiEndpoint dataclass, `type` should be one of: `EndpointType.CHAT`, `EndpointType.COMPLETIONS`, `EndpointType.VLM`, `EndpointType.EMBEDDING`:
- `CHAT` endpoint accepts structured input as a sequence of messages (such as system, user, assistant roles). It returns a model-generated message, enabling controlled multi-turn interactions.
- `COMPLETIONS` endpoint takes a single prompt string and returns a text continuation, typically used for one-shot or single-turn tasks without conversational structure.
- `VLM` endpoint hosts a model that has vision capabilities.
- `EMBEDDING` endpoint hosts an embedding model.
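The practical difference between the two text endpoint types shows up in the request body. A sketch of the OpenAI-compatible payload shapes (`build_payload` is an illustrative helper, not part of the SDK):

```python
def build_payload(endpoint_type: str, model_id: str, prompt: str) -> dict:
    """Build an OpenAI-compatible request body for a chat or completions endpoint."""
    if endpoint_type == "chat":
        # Chat endpoints take structured messages with roles.
        return {"model": model_id,
                "messages": [{"role": "user", "content": prompt}]}
    if endpoint_type == "completions":
        # Completions endpoints take a single prompt string.
        return {"model": model_id, "prompt": prompt}
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

payload = build_payload("chat", "meta/llama-3.2-3b-instruct", "What is 2+2?")
print(payload["messages"][0]["role"])  # user
```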
## Adapter System
### `AdapterConfig`
Configuration for the adapter system, defined in `adapter_config.py`.
```python
from nemo_evaluator.adapters.adapter_config import AdapterConfig
class AdapterConfig:
"""Configuration for the adapter system."""
discovery: DiscoveryConfig # Module discovery configuration
interceptors: list[InterceptorConfig] # List of interceptors
post_eval_hooks: list[PostEvalHookConfig] # Post-evaluation hooks
endpoint_type: str # Default endpoint type
caching_dir: str | None # Legacy caching directory
```
### `InterceptorConfig`
Configuration for individual interceptors.
```python
from nemo_evaluator.adapters.adapter_config import InterceptorConfig
class InterceptorConfig:
"""Configuration for a single interceptor."""
name: str # Interceptor name
enabled: bool # Whether enabled
config: dict[str, Any] # Interceptor-specific configuration
```
### `DiscoveryConfig`
Configuration for discovering third-party modules and directories.
```python
from nemo_evaluator.adapters.adapter_config import DiscoveryConfig
class DiscoveryConfig:
"""Configuration for discovering 3rd party modules and directories."""
modules: list[str] # List of module paths to discover
dirs: list[str] # List of directory paths to discover
```
## Available Interceptors
### 1. Request Logging Interceptor
```python
from nemo_evaluator.adapters.interceptors.logging_interceptor import LoggingInterceptor
# Configuration
interceptor_config = {
"name": "request_logging",
"enabled": True,
"config": {
"output_dir": "/tmp/logs",
"max_requests": 1000,
"log_failed_requests": True
}
}
```
**Features:**
- Logs all API requests and responses
- Configurable output directory
- Request/response count limits
- Failed request logging
### 2. Caching Interceptor
```python
from nemo_evaluator.adapters.interceptors.caching_interceptor import CachingInterceptor
# Configuration
interceptor_config = {
"name": "caching",
"enabled": True,
"config": {
"cache_dir": "/tmp/cache",
"reuse_cached_responses": True,
"save_requests": True,
"save_responses": True,
"max_saved_requests": 1000,
"max_saved_responses": 1000
}
}
```
**Features:**
- Response caching for performance
- Persistent storage - responses are saved to disk, allowing resumption after process termination
- Configurable cache directory
- Request/response persistence
- Cache size limits
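The reuse behavior can be sketched as a lookup keyed on the request payload. This is a minimal model, assuming a deterministic key derived from the canonicalized JSON body, not the interceptor's actual internals:

```python
import hashlib
import json

def cache_key(payload: dict) -> str:
    """Derive a stable key from the request body by canonicalizing its JSON form."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

cache: dict[str, dict] = {}

def fetch(payload: dict, send) -> dict:
    key = cache_key(payload)
    if key in cache:
        return cache[key]   # reuse cached response, skip the API call
    response = send(payload)
    cache[key] = response   # persist for later reuse / resumption
    return response

calls = []
def send(p):
    calls.append(p)
    return {"text": "4"}

fetch({"prompt": "2+2?"}, send)
fetch({"prompt": "2+2?"}, send)  # identical payload: served from cache
print(len(calls))  # 1
```

In the real interceptor the cache lives on disk under `cache_dir`, which is what allows an interrupted evaluation to resume without re-issuing completed requests.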
### 3. Reasoning Interceptor
```python
from nemo_evaluator.adapters.interceptors.reasoning_interceptor import ReasoningInterceptor
# Configuration
interceptor_config = {
"name": "reasoning",
"enabled": True,
"config": {
        "start_reasoning_token": "<think>",
        "end_reasoning_token": "</think>",
"add_reasoning": True,
"enable_reasoning_tracking": True
}
}
```
**Features:**
- Reasoning chain support
- Custom reasoning tokens
- Reasoning tracking and analysis
- Chain-of-thought prompting
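The core transformation is removing the span between the configured tokens from a response before scoring. An illustrative sketch, not the interceptor's code:

```python
def strip_reasoning(text: str, start: str = "<think>", end: str = "</think>") -> str:
    """Drop the first start..end reasoning span; pass text through unchanged otherwise."""
    i, j = text.find(start), text.find(end)
    if i != -1 and j != -1 and j > i:
        return (text[:i] + text[j + len(end):]).strip()
    return text

print(strip_reasoning("<think>4 = 2 + 2</think>The answer is 4."))  # The answer is 4.
```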
### 4. System Message Interceptor
```python
from nemo_evaluator.adapters.interceptors.system_message_interceptor import SystemMessageInterceptor
# Configuration
interceptor_config = {
"name": "system_message",
"enabled": True,
"config": {
"system_message": "You are a helpful AI assistant.",
"strategy": "prepend" # Optional: "replace", "append", or "prepend" (default)
}
}
```
**Features:**
- Custom system prompt injection
- Multiple strategies for handling existing system messages (replace, prepend, append)
- Consistent system behavior
- Flexible system message composition
**Use Cases:**
- Modify system prompts for different evaluation scenarios
- Test different prompt variations without code changes
- Replace existing system messages for consistent behavior
- Prepend or append instructions to existing system messages
- A/B testing of different prompt strategies
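The three strategies can be sketched against an OpenAI-style messages list. This is an illustrative model whose strategy names mirror the `strategy` values above, not the interceptor's implementation:

```python
def apply_system_message(messages: list[dict], text: str, strategy: str = "prepend") -> list[dict]:
    """Inject a system message using the replace / prepend / append strategies."""
    existing = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if strategy == "replace" or not existing:
        return [{"role": "system", "content": text}] + rest
    old = existing[0]["content"]
    # prepend puts the injected text before the existing prompt; append, after it.
    content = f"{text}\n{old}" if strategy == "prepend" else f"{old}\n{text}"
    return [{"role": "system", "content": content}] + rest

msgs = [{"role": "system", "content": "Be brief."}, {"role": "user", "content": "Hi"}]
out = apply_system_message(msgs, "You are a helpful AI assistant.", "prepend")
print(out[0]["content"].splitlines())  # ['You are a helpful AI assistant.', 'Be brief.']
```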
### 5. Endpoint Interceptor
```python
from nemo_evaluator.adapters.interceptors.endpoint_interceptor import EndpointInterceptor
# Configuration
interceptor_config = {
"name": "endpoint",
"enabled": True,
"config": {
"endpoint_url": "https://api.example.com/v1/chat/completions",
"timeout": 30
}
}
```
**Features:**
- Endpoint URL management
- Request timeout configuration
- Endpoint validation
### 6. Payload Modifier Interceptor
```python
from nemo_evaluator.adapters.interceptors.payload_modifier_interceptor import PayloadModifierInterceptor
# Configuration
interceptor_config = {
"name": "payload_modifier",
"enabled": True,
"config": {
"params_to_add": {
"extra_body": {
"chat_template_kwargs": {
"enable_thinking": False
}
}
},
"params_to_remove": ["field_in_msgs_to_remove"],
"params_to_rename": {"max_tokens": "max_completion_tokens"}
}
}
```
**Explanation:**
This interceptor is useful when an endpoint expects custom, non-standard request fields. In this example, the `enable_thinking` parameter is a custom key that controls the reasoning mode of the model. When set to `False`, it disables the model's internal reasoning/thinking process, which is useful when you want direct responses without the model's step-by-step reasoning output.
The `field_in_msgs_to_remove` field would be removed recursively from all messages in the payload.
**Features:**
- Custom parameter injection
- Remove fields recursively at all levels of the payload
- Rename top-level payload keys
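The three transformations can be sketched together: recursive removal at every level, parameter injection, and top-level key renaming. Illustrative only, not the interceptor's actual implementation:

```python
def modify_payload(payload: dict, add: dict, remove: list[str], rename: dict) -> dict:
    """Apply params_to_remove (recursive), params_to_add, and params_to_rename (top level)."""
    def strip(obj):
        if isinstance(obj, dict):
            return {k: strip(v) for k, v in obj.items() if k not in remove}
        if isinstance(obj, list):
            return [strip(v) for v in obj]
        return obj

    out = strip(payload)             # remove fields at every nesting level
    out.update(add)                  # inject extra parameters
    for old, new in rename.items():  # rename top-level payload keys only
        if old in out:
            out[new] = out.pop(old)
    return out

payload = {"max_tokens": 64,
           "messages": [{"role": "user", "content": "Hi", "field_in_msgs_to_remove": 1}]}
out = modify_payload(payload,
                     add={"extra_body": {"chat_template_kwargs": {"enable_thinking": False}}},
                     remove=["field_in_msgs_to_remove"],
                     rename={"max_tokens": "max_completion_tokens"})
print(sorted(out))  # ['extra_body', 'max_completion_tokens', 'messages']
```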
### 7. Client Error Interceptor
```python
from nemo_evaluator.adapters.interceptors.raise_client_error_interceptor import RaiseClientErrorInterceptor
# Configuration
interceptor_config = {
"name": "raise_client_error",
"enabled": True,
"config": {
"raise_on_error": True,
"error_threshold": 400
}
}
```
**Features:**
- Error handling and propagation
- Configurable error thresholds
- Client error management
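The threshold logic amounts to raising when the upstream status code reaches `error_threshold`. A minimal sketch, with illustrative names (`ClientError`, `check_response` are not SDK symbols):

```python
# Minimal sketch of the documented threshold behavior.
class ClientError(Exception):
    pass

def check_response(status_code, raise_on_error=True, error_threshold=400):
    """Raise when the status code is at or above the configured threshold."""
    if raise_on_error and status_code >= error_threshold:
        raise ClientError(f"upstream returned HTTP {status_code}")
    return status_code
```

With `raise_on_error: False`, errors pass through for downstream handling instead of aborting the run.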
# Code Generation Containers
Containers specialized for evaluating code generation models and programming language capabilities.
---
## BigCode Evaluation Harness Container
**NGC Catalog**: [bigcode-evaluation-harness](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness)
Container specialized for evaluating code generation models and programming language models.
**Use Cases:**
- Code generation quality assessment
- Programming problem solving
- Code completion evaluation
- Software engineering task assessment
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `512` |
| `temperature` | `1e-07` |
| `top_p` | `0.9999999` |
| `parallelism` | `10` |
| `max_retries` | `5` |
| `request_timeout` | `30` |
| `do_sample` | `True` |
| `n_samples` | `1` |
---
## Compute Eval Container
**NGC Catalog**: [compute-eval](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/compute-eval)
Container specialized for evaluating CUDA code generation.
**Use Cases:**
- CUDA code generation
- CCCL programming problems
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/compute-eval:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `2048` |
| `temperature` | `0` |
| `top_p` | `0.00001` |
| `parallelism` | `1` |
| `max_retries` | `2` |
| `request_timeout` | `3600` |
---
## LiveCodeBench Container
**NGC Catalog**: [livecodebench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench)
LiveCodeBench provides holistic and contamination-free evaluation of the coding capabilities of LLMs. It continuously collects new problems from contests on three competition platforms: LeetCode, AtCoder, and CodeForces.
**Use Cases:**
- Holistic coding capability evaluation
- Contamination-free assessment
- Contest-style problem solving
- Code generation and execution
- Test output prediction
- Self-repair capabilities
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/livecodebench:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `4096` |
| `temperature` | `0.0` |
| `top_p` | `1e-05` |
| `parallelism` | `10` |
| `max_retries` | `5` |
| `request_timeout` | `60` |
| `n_samples` | `10` |
| `num_process_evaluate` | `5` |
| `cache_batch_size` | `10` |
| `support_system_role` | `False` |
| `cot_code_execution` | `False` |
**Supported Versions:** v1-v6, 0724_0125, 0824_0225
---
## SciCode Container
**NGC Catalog**: [scicode](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode)
SciCode is a challenging benchmark designed to evaluate the capabilities of language models in generating code for solving realistic scientific research problems with diverse coverage across 16 subdomains from six domains.
**Use Cases:**
- Scientific research code generation
- Multi-domain scientific programming
- Research workflow automation
- Scientific computation evaluation
- Domain-specific coding tasks
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/scicode:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `temperature` | `0` |
| `max_new_tokens` | `2048` |
| `top_p` | `1e-05` |
| `request_timeout` | `60` |
| `max_retries` | `2` |
| `with_background` | `False` |
| `include_dev` | `False` |
| `n_samples` | `1` |
| `eval_threads` | `None` |
**Supported Domains:** Physics, Math, Material Science, Biology, and Chemistry
# Model Efficiency
Containers specialized in evaluating Large Language Model efficiency.
---
## GenAIPerf Container
**NGC Catalog**: [genai-perf](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf)
Container for assessing how quickly the server processes requests.
**Use Cases:**
- Analysis of time to first token (TTFT) and inter-token latency (ITL)
- Assessment of server efficiency under load
- Summarization scenario: long input, short output
- Generation scenario: short input, long output
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/genai-perf:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `parallelism` | `1` |
Benchmark-specific parameters (passed via `extra` field):
| Parameter | Description |
|-----------|-------|
| `tokenizer` | HuggingFace tokenizer to use for calculating the number of tokens. **Required parameter** (default: `None`)|
| `warmup` | Whether to run warmup (default: `True`) |
| `isl` | Input sequence length (default: task-specific, see below) |
| `osl` | Output sequence length (default: task-specific, see below) |
**Supported Benchmarks:**
- `genai_perf_summarization` - Speed analysis with `isl: 5000` and `osl: 500`.
- `genai_perf_generation` - Speed analysis with `isl: 500` and `osl: 5000`.
(nemo-evaluator-containers)=
# NeMo Evaluator Containers
NeMo Evaluator provides a collection of specialized containers for different evaluation frameworks and tasks. Each container is optimized and tested to work seamlessly with the NVIDIA hardware and software stack, providing consistent, reproducible environments for AI model evaluation.
## Container Categories
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` Language Models
:link: language-models
:link-type: doc
Containers for evaluating large language models across academic benchmarks and custom tasks.
:::
:::{grid-item-card} {octicon}`file-code;1.5em;sd-mr-1` Code Generation
:link: code-generation
:link-type: doc
Specialized containers for evaluating code generation and programming capabilities.
:::
:::{grid-item-card} {octicon}`eye;1.5em;sd-mr-1` Vision-Language
:link: vision-language
:link-type: doc
Multimodal evaluation containers for vision-language understanding and reasoning.
:::
:::{grid-item-card} {octicon}`shield-check;1.5em;sd-mr-1` Safety & Security
:link: safety-security
:link-type: doc
Containers focused on safety evaluation, bias detection, and security testing.
:::
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Specialized Tools
:link: specialized-tools
:link-type: doc
Containers focused on agentic AI capabilities and advanced reasoning.
:::
:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Efficiency
:link: efficiency
:link-type: doc
Containers for evaluating speed of input processing and output generation.
:::
::::
---
## Quick Start
### Basic Container Usage
```bash
# Pull a container
docker pull nvcr.io/nvidia/eval-factory/<container-name>:<tag>
# Example: Pull simple-evals container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
# Run with GPU support
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/<container-name>:<tag>
```
### Prerequisites
- Docker and NVIDIA Container Toolkit (for GPU support)
- NVIDIA GPU (for GPU-accelerated evaluation)
- Sufficient disk space for models and datasets
For detailed usage instructions, refer to the {ref}`cli-workflows` guide.
:::{toctree}
:caption: Container Reference
:hidden:
Language Models <language-models>
Code Generation <code-generation>
Vision-Language <vision-language>
Safety & Security <safety-security>
Specialized Tools <specialized-tools>
Efficiency <efficiency>
:::
# Language Model Containers
Containers specialized for evaluating large language models across academic benchmarks, custom tasks, and conversation scenarios.
---
## Simple-Evals Container
**NGC Catalog**: [simple-evals](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals)
Container for lightweight evaluation tasks and simple model assessments.
**Use Cases:**
- Simple question-answering evaluation
- Math and reasoning capabilities
- Basic Python coding
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `4096` |
| `temperature` | `0` |
| `top_p` | `1e-05` |
| `parallelism` | `10` |
| `max_retries` | `5` |
| `request_timeout` | `60` |
| `downsampling_ratio` | `None` |
| `add_system_prompt` | `False` |
| `custom_config` | `None` |
| `judge` | `{'url': None, 'model_id': None, 'api_key': None, 'backend': 'openai', 'request_timeout': 600, 'max_retries': 16, 'temperature': 0.0, 'top_p': 0.0001, 'max_tokens': 1024, 'max_concurrent_requests': None}` |
---
## LM-Evaluation-Harness Container
**NGC Catalog**: [lm-evaluation-harness](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness)
Container based on the Language Model Evaluation Harness framework for comprehensive language model evaluation.
**Use Cases:**
- Standard NLP benchmarks
- Language model performance evaluation
- Multi-task assessment
- Academic benchmark evaluation
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `None` |
| `temperature` | `1e-07` |
| `top_p` | `0.9999999` |
| `parallelism` | `10` |
| `max_retries` | `5` |
| `request_timeout` | `30` |
| `tokenizer` | `None` |
| `tokenizer_backend` | `None` |
| `downsampling_ratio` | `None` |
| `tokenized_requests` | `False` |
---
## MT-Bench Container
**NGC Catalog**: [mtbench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench)
Container for MT-Bench evaluation framework, designed for multi-turn conversation evaluation.
**Use Cases:**
- Multi-turn dialogue evaluation
- Conversation quality assessment
- Context maintenance evaluation
- Interactive AI system testing
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/mtbench:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `max_new_tokens` | `1024` |
| `parallelism` | `10` |
| `max_retries` | `5` |
| `request_timeout` | `30` |
| `judge` | `{'url': None, 'model_id': 'gpt-4', 'api_key': None, 'request_timeout': 60, 'max_retries': 16, 'temperature': 0.0, 'top_p': 0.0001, 'max_tokens': 2048}` |
---
## HELM Container
**NGC Catalog**: [helm](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm)
Container for the Holistic Evaluation of Language Models (HELM) framework, with a focus on MedHELM - an extensible evaluation framework for assessing LLM performance for medical tasks.
**Use Cases:**
- Medical AI model evaluation
- Clinical task assessment
- Healthcare-specific benchmarking
- Diagnostic decision-making evaluation
- Patient communication assessment
- Medical knowledge evaluation
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/helm:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `parallelism` | `1` |
| `data_path` | `None` |
| `num_output_tokens` | `None` |
| `subject` | `None` |
| `condition` | `None` |
| `max_length` | `None` |
| `num_train_trials` | `None` |
| `subset` | `None` |
| `gpt_judge_api_key` | `GPT_JUDGE_API_KEY` |
| `llama_judge_api_key` | `LLAMA_JUDGE_API_KEY` |
| `claude_judge_api_key` | `CLAUDE_JUDGE_API_KEY` |
---
## RAG Retriever Evaluation Container
**NGC Catalog**: [rag_retriever_eval](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval)
Container for evaluating Retrieval-Augmented Generation (RAG) systems and their retrieval capabilities.
**Use Cases:**
- Document retrieval accuracy
- Context relevance assessment
- RAG pipeline evaluation
- Information retrieval performance
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/rag_retriever_eval:{{ docker_compose_latest }}
```
---
## HLE Container
**NGC Catalog**: [hle](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle)
Container for Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark with broad subject coverage.
**Use Cases:**
- Academic knowledge and problem solving evaluation
- Multi-modal benchmark testing
- Frontier knowledge assessment
- Subject-matter expertise evaluation
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/hle:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `4096` |
| `temperature` | `0.0` |
| `top_p` | `1.0` |
| `parallelism` | `100` |
| `max_retries` | `30` |
| `request_timeout` | `600.0` |
---
## IFBench Container
**NGC Catalog**: [ifbench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench)
Container for IFBench, a challenging benchmark for precise instruction-following evaluation.
**Use Cases:**
- Precise instruction following evaluation
- Out-of-distribution constraint verification
- Multi-turn constraint isolation testing
- Instruction following robustness assessment
- Verifiable instruction compliance testing
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/ifbench:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `4096` |
| `temperature` | `0.01` |
| `top_p` | `0.95` |
| `parallelism` | `8` |
| `max_retries` | `5` |
---
## MMATH Container
**NGC Catalog**: [mmath](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath)
Container for multilingual mathematical reasoning evaluation across multiple languages.
**Use Cases:**
- Multilingual mathematical reasoning evaluation
- Cross-lingual mathematical problem solving assessment
- Mathematical reasoning robustness across languages
- Complex mathematical reasoning capability testing
- Translation quality validation for mathematical content
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/mmath:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `32768` |
| `temperature` | `0.6` |
| `top_p` | `0.95` |
| `parallelism` | `8` |
| `max_retries` | `5` |
| `language` | `en` |
**Supported Languages:** EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI
---
## ProfBench Container
**NGC Catalog**: [profbench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench)
Container for assessing performance across professional domains in business and scientific research.
**Use Cases:**
- Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA domains
- Report generation capabilities
- Quality assessment of LLM judges
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/profbench:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `4096` |
| `temperature` | `0.0` |
| `top_p` | `0.00001` |
| `parallelism` | `10` |
| `max_retries` | `5` |
| `request_timeout` | `600` |
---
## NeMo Skills Container
**NGC Catalog**: [nemo-skills](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo-skills)
Container for assessing LLM capabilities in science, math, and agentic workflows.
**Use Cases:**
- Evaluation of reasoning capabilities
- Advanced math and coding skills
- Agentic workflows
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/nemo-skills:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `65536` |
| `temperature` | `None` |
| `top_p` | `None` |
| `parallelism` | `16` |
# Safety and Security Containers
Containers specialized for evaluating AI model safety, security, and robustness against various threats and biases.
---
## Garak Container
**NGC Catalog**: [garak](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak)
Container for security and robustness evaluation of AI models.
**Use Cases:**
- Security testing
- Adversarial attack evaluation
- Robustness assessment
- Safety evaluation
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/garak:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `max_new_tokens` | `150` |
| `temperature` | `0.1` |
| `top_p` | `0.7` |
| `parallelism` | `32` |
| `probes` | `None` |
**Key Features:**
- Automated security testing
- Vulnerability detection
- Prompt injection testing
- Adversarial robustness evaluation
- Comprehensive security reporting
**Security Test Categories:**
- Prompt Injection Attacks
- Data Extraction Attempts
- Jailbreak Techniques
- Adversarial Prompts
- Social Engineering Tests
---
## Safety Harness Container
**NGC Catalog**: [safety-harness](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness)
Container for comprehensive safety evaluation of AI models.
**Use Cases:**
- Safety alignment evaluation
- Harmful content detection
- Bias and fairness assessment
- Ethical AI evaluation
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }}
```
**Required Environment Variables:**
- `HF_TOKEN`: Required for aegis_v2 safety evaluation tasks
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `6144` |
| `temperature` | `0.6` |
| `top_p` | `0.95` |
| `parallelism` | `8` |
| `max_retries` | `5` |
| `request_timeout` | `30` |
| `judge` | `{'url': None, 'model_id': None, 'api_key': None, 'parallelism': 32, 'request_timeout': 60, 'max_retries': 16}` |
**Key Features:**
- Comprehensive safety benchmarks
- Bias detection and measurement
- Harmful content classification
- Ethical alignment assessment
- Detailed safety reporting
**Safety Evaluation Areas:**
- Bias and Fairness
- Harmful Content Generation
- Toxicity Detection
- Hate Speech Identification
- Ethical Decision Making
- Social Impact Assessment
# Specialized Tools Containers
Containers for specialized evaluation tasks including agentic AI capabilities and advanced reasoning assessments.
---
## Agentic Evaluation Container
**NGC Catalog**: [agentic_eval](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval)
Container for evaluating agentic AI models on tool usage and planning tasks.
**Use Cases:**
- Tool usage evaluation
- Planning tasks assessment
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/agentic_eval:{{ docker_compose_latest }}
```
**Supported Benchmarks:**
- `agentic_eval_answer_accuracy`
- `agentic_eval_goal_accuracy_with_reference`
- `agentic_eval_goal_accuracy_without_reference`
- `agentic_eval_topic_adherence`
- `agentic_eval_tool_call_accuracy`
---
## BFCL Container
**NGC Catalog**: [bfcl](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl)
Container for Berkeley Function-Calling Leaderboard evaluation framework.
**Use Cases:**
- Tool usage evaluation
- Multi-turn interactions
- Native support for function/tool calling
- Function calling evaluation
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/bfcl:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `parallelism` | `10` |
| `native_calling` | `False` |
| `custom_dataset` | `{'path': None, 'format': None, 'data_template_path': None}` |
---
## ToolTalk Container
**NGC Catalog**: [tooltalk](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk)
Container for evaluating AI models' ability to use tools and APIs effectively.
**Use Cases:**
- Tool usage evaluation
- API interaction assessment
- Function calling evaluation
- External tool integration testing
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/tooltalk:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
# Vision-Language Containers
Containers specialized for evaluating multimodal models that process both visual and textual information.
---
## VLMEvalKit Container
**NGC Catalog**: [vlmevalkit](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit)
Container for Vision-Language Model evaluation toolkit.
**Use Cases:**
- Multimodal model evaluation
- Image-text understanding assessment
- Visual reasoning evaluation
- Cross-modal performance testing
**Pull Command:**
```bash
docker pull nvcr.io/nvidia/eval-factory/vlmevalkit:{{ docker_compose_latest }}
```
**Default Parameters:**
| Parameter | Value |
|-----------|-------|
| `limit_samples` | `None` |
| `max_new_tokens` | `2048` |
| `temperature` | `0` |
| `top_p` | `None` |
| `parallelism` | `4` |
| `max_retries` | `5` |
| `request_timeout` | `60` |
**Supported Benchmarks:**
- `ocrbench` - Optical character recognition and text understanding
- `slidevqa` - Slide-based visual question answering (requires `OPENAI_CLIENT_ID`, `OPENAI_CLIENT_SECRET`)
- `chartqa` - Chart and graph interpretation
- `ai2d_judge` - AI2 Diagram understanding (requires `OPENAI_CLIENT_ID`, `OPENAI_CLIENT_SECRET`)
(advanced-features)=
# Advanced Features
This section covers advanced FDF features including conditional parameter handling, parameter inheritance, and dynamic configuration.
## Conditional Parameter Handling
Use Jinja2 conditionals to handle optional parameters:
```yaml
command: >-
example_eval --model {{target.api_endpoint.model_id}}
{% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}
{% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %}
{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
### Common Conditional Patterns
**Check for null/none values**:
```jinja
{% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}
```
**Check for boolean flags**:
```jinja
{% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %}
```
**Check if variable is defined**:
```jinja
{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %}
```
**Check for specific values**:
```jinja
{% if target.api_endpoint.type == "chat" %} --use_chat_format {% endif %}
```
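The conditional patterns above can be checked locally by rendering the template with the `jinja2` package against a context that mimics the FDF variables. The command name `example_eval` is a placeholder, as in the rest of this guide.

```python
from jinja2 import Template

# Render a command template the way the launcher would; the context dict
# below stands in for the FDF's target/config variables.
template = Template(
    "example_eval --model {{target.api_endpoint.model_id}}"
    "{% if config.params.limit_samples is not none %}"
    " --first_n {{config.params.limit_samples}}{% endif %}"
    "{% if config.params.extra.add_system_prompt %} --add_system_prompt{% endif %}"
)
context = {
    "target": {"api_endpoint": {"model_id": "my-model"}},
    "config": {"params": {"limit_samples": 100,
                          "extra": {"add_system_prompt": False}}},
}
command = template.render(**context)
# command == "example_eval --model my-model --first_n 100"
```

Setting `limit_samples` to `None` drops the `--first_n` flag entirely, which is why `is not none` (rather than a truthiness check) is the right test for nullable parameters.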
## Parameter Inheritance
Parameters follow a hierarchical override system:
1. **Framework defaults** (4th priority) - Lowest priority
2. **Evaluation defaults** (3rd priority)
3. **User configuration** (2nd priority)
4. **CLI overrides** (1st priority) - Highest priority
### Inheritance Example
**Framework defaults (framework.yml)**:
```yaml
defaults:
config:
params:
temperature: 0.0
max_new_tokens: 4096
```
**Evaluation defaults (framework.yml)**:
```yaml
evaluations:
- name: humaneval
defaults:
config:
params:
max_new_tokens: 1024 # Overrides framework default
```
**User configuration (config.yaml)**:
```yaml
config:
params:
max_new_tokens: 512 # Overrides evaluation default
temperature: 0.7 # Overrides framework default
```
**CLI overrides**:
```bash
nemo-evaluator run_eval --overrides config.params.temperature=1.0
# Overrides all previous values
```
For more information on how to use these overrides, refer to the {ref}`parameter-overrides` documentation.
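The four-layer precedence can be modeled as successive deep merges where later layers win. This sketch only illustrates the override order described above; the launcher's actual merge logic may differ in details.

```python
# Sketch of four-layer precedence: later (higher-priority) layers win,
# while keys they do not set survive from lower layers.
def deep_merge(base, override):
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

framework = {"temperature": 0.0, "max_new_tokens": 4096}   # framework.yml defaults
evaluation = {"max_new_tokens": 1024}                      # per-evaluation defaults
user = {"max_new_tokens": 512, "temperature": 0.7}         # config.yaml
cli = {"temperature": 1.0}                                 # --overrides

params = framework
for layer in (evaluation, user, cli):
    params = deep_merge(params, layer)
# params == {"temperature": 1.0, "max_new_tokens": 512}
```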
## Dynamic Configuration
Use template variables to reference other configuration sections. For example, re-use `config.output_dir` for `--cache` input argument:
```yaml
command: >-
example_eval --output {{config.output_dir}} --cache {{config.output_dir}}/cache
```
### Dynamic Configuration Patterns
**Reference output directory**:
```yaml
--results {{config.output_dir}}/results.json
--logs {{config.output_dir}}/logs
```
**Compose complex paths**:
```yaml
--data_dir {{config.output_dir}}/data/{{config.params.task}}
```
**Use task type in paths**:
```yaml
--cache {{config.output_dir}}/cache/{{config.type}}
```
**Reference model information**:
```yaml
--model_name {{target.api_endpoint.model_id}}
--endpoint {{target.api_endpoint.url}}
```
## Environment Variable Handling
**Export API keys conditionally**:
```jinja
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
```
**Set multiple environment variables**:
```jinja
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
{% if config.params.extra.custom_env is defined %}export CUSTOM_VAR={{config.params.extra.custom_env}} && {% endif %}
```
## Complex Command Templates
**Multi-line commands with conditionals**:
```yaml
command: >-
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
example_eval
--model {{target.api_endpoint.model_id}}
--task {{config.params.task}}
--url {{target.api_endpoint.url}}
{% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %}
{% if config.params.extra.add_system_prompt %}--add_system_prompt{% endif %}
{% if target.api_endpoint.type == "chat" %}--use_chat_format{% endif %}
--output {{config.output_dir}}
{% if config.params.extra.args is defined %}{{ config.params.extra.args }}{% endif %}
```
## Best Practices
- Always check if optional parameters are defined before using them
- Use `is not none` for nullable parameters with default values
- Use `is defined` for truly optional parameters that may not exist
- Keep conditional logic simple and readable
- Document custom parameters in the framework's README
- Test all conditional branches with different configurations
- Use parameter inheritance to avoid duplication
- Leverage dynamic paths to organize output files
(defaults-section)=
# Defaults Section
The `defaults` section defines the default configuration and execution command that will be used across all evaluations unless overridden. Overriding is supported either through `--overrides` flag (refer to {ref}`parameter-overrides`) or {ref}`run-configuration`.
## Command Template
The `command` field uses Jinja2 templating to dynamically generate execution commands based on configuration parameters.
```yaml
defaults:
command: >-
{% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
example_eval --model {{target.api_endpoint.model_id}}
--task {{config.params.task}}
--url {{target.api_endpoint.url}}
--temperature {{config.params.temperature}}
# ... additional parameters
```
**Important Note**: `example_eval` is a placeholder representing your actual CLI command. When onboarding your harness, replace this with your real command (e.g., `lm-eval`, `bigcode-eval`, `gorilla-eval`, etc.).
## Template Variables
### Target API Endpoint Variables
- **`{{target.api_endpoint.api_key}}`**: Name of the environment variable storing the API key
- **`{{target.api_endpoint.model_id}}`**: Target model identifier
- **`{{target.api_endpoint.stream}}`**: Whether responses should be streamed
- **`{{target.api_endpoint.type}}`**: The type of the target endpoint
- **`{{target.api_endpoint.url}}`**: URL of the model
- **`{{target.api_endpoint.adapter_config}}`**: Adapter configuration
### Evaluation Configuration Variables
- **`{{config.output_dir}}`**: Output directory for results
- **`{{config.type}}`**: Type of the task
- **`{{config.supported_endpoint_types}}`**: Supported endpoint types (chat/completions)
### Configuration Parameters
- **`{{config.params.task}}`**: Evaluation task type
- **`{{config.params.temperature}}`**: Model temperature setting
- **`{{config.params.limit_samples}}`**: Sample limit for evaluation
- **`{{config.params.max_new_tokens}}`**: Maximum tokens to generate
- **`{{config.params.max_retries}}`**: Number of REST request retries
- **`{{config.params.parallelism}}`**: Parallelism to be used
- **`{{config.params.request_timeout}}`**: REST response timeout
- **`{{config.params.top_p}}`**: Top-p sampling parameter
- **`{{config.params.extra}}`**: Framework-specific parameters
## Configuration Defaults
The following example shows common parameter defaults. Each framework defines its own default values in the framework.yml file.
```yaml
defaults:
config:
params:
limit_samples: null # No limit on samples by default
max_new_tokens: 4096 # Maximum tokens to generate
temperature: 0.0 # Deterministic generation
top_p: 0.00001 # Nucleus sampling parameter
parallelism: 10 # Number of parallel requests
max_retries: 5 # Maximum API retry attempts
request_timeout: 60 # Request timeout in seconds
extra: # Framework-specific parameters
n_samples: null # Number of sampled responses per input
downsampling_ratio: null # Data downsampling ratio
add_system_prompt: false # Include system prompt
```
## Parameter Categories
### Core Parameters
Basic evaluation settings that control model behavior:
- `temperature`: Controls randomness in generation (0.0 = deterministic)
- `max_new_tokens`: Maximum length of generated output
- `top_p`: Nucleus sampling parameter for diversity
### Performance Parameters
Settings that affect execution speed and reliability:
- `parallelism`: Number of parallel API requests
- `request_timeout`: Maximum wait time for API responses
- `max_retries`: Number of retry attempts for failed requests
### Framework Parameters
Task-specific configuration options:
- `task`: Specific evaluation task to run
- `limit_samples`: Limit number of samples for testing
### Extra Parameters
Custom parameters specific to your framework. Use the `extra` field for:
- specifying number of sampled responses per input query
- judge configuration
- configuring few-shot settings
## Target Configuration
```yaml
defaults:
target:
api_endpoint:
type: chat # Default endpoint type
supported_endpoint_types: # All supported types
- chat
- completions
- vlm
- embedding
```
### Endpoint Types
**chat**: Multi-turn conversation format following the OpenAI chat completions API (`/v1/chat/completions`). Use this for models that support conversational interactions with role-based messages (system, user, assistant).
**completions**: Single-turn text completion format following the OpenAI completions API (`/v1/completions`). Use this for models that generate text based on a single prompt without conversation context. Often used for log-probability evaluations.
**vlm**: Vision-language model endpoints that support image inputs alongside text (`/v1/chat/completions`). Use this for multimodal evaluations that include visual content.
**embedding**: Embedding generation endpoints for retrieval and similarity evaluations (`/v1/embeddings`). Use this for tasks that require vector representations of text.
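For the two most common endpoint types, the request bodies follow the OpenAI shapes described above. The model name below is a placeholder; only the structural difference matters here.

```python
# Minimal request bodies for the two most common endpoint types.
chat_request = {  # POST /v1/chat/completions: role-based message list
    "model": "my-model",
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
}
completions_request = {  # POST /v1/completions: single prompt string
    "model": "my-model",
    "prompt": "Q: What is 2 + 2?\nA:",
    "logprobs": 1,  # completions endpoints are often used for log-prob evals
}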
(evaluations-section)=
# Evaluations Section
The `evaluations` section defines the specific evaluation types available in your framework, each with its own configuration defaults.
## Structure
```yaml
evaluations:
  - name: example_task_1                  # Evaluation name
    description: Basic functionality demo # Human-readable description
    defaults:
      config:
        type: "example_task_1"            # Evaluation identifier
        supported_endpoint_types:         # Supported endpoints for this task
          - chat
          - completions
        params:
          task: "example_task_1"          # Task identifier used by the harness
          temperature: 0.0                # Task-specific temperature
          max_new_tokens: 1024            # Task-specific token limit
          extra:
            custom_key: "custom_value"    # Task-specific custom param
```
## Fields
### name
**Type**: String
**Required**: Yes
Name for the evaluation type.
**Example**:
```yaml
name: HumanEval
```
### description
**Type**: String
**Required**: Yes
Clear description of what the evaluation measures. This helps users understand the purpose and scope of the evaluation.
**Example**:
```yaml
description: Evaluates code generation capabilities using the HumanEval benchmark dataset
```
### type
**Type**: String
**Required**: Yes
Unique configuration identifier used by the framework. This is used to reference the evaluation in CLI commands and configurations. This typically matches the `name` field but may differ based on your framework's conventions.
**Example**:
```yaml
type: "humaneval"
```
### supported_endpoint_types
**Type**: List of strings
**Required**: Yes
API endpoint types compatible with this evaluation. Specify which endpoint types work with this evaluation task:
- `chat` - Conversational format with role-based messages
- `completions` - Single-turn text completion
- `vlm` - Vision-language model with image support
- `embedding` - Embedding generation for retrieval tasks
**Example**:
```yaml
supported_endpoint_types:
  - chat
  - completions
```
### params
**Type**: Object
**Required**: No
Task-specific parameter overrides that differ from the framework-level defaults. Use this to customize settings for individual evaluation types.
**Example**:
```yaml
params:
  task: "humaneval"
  temperature: 0.0
  max_new_tokens: 1024
  extra:
    custom_key: "custom_value"
```
## Multiple Evaluations
You can define multiple evaluation types in a single FDF:
```yaml
evaluations:
  - name: humaneval
    description: Code generation evaluation
    defaults:
      config:
        type: "humaneval"
        supported_endpoint_types:
          - chat
          - completions
        params:
          task: "humaneval"
          max_new_tokens: 1024
  - name: mbpp
    description: Python programming evaluation
    defaults:
      config:
        type: "mbpp"
        supported_endpoint_types:
          - chat
        params:
          task: "mbpp"
          max_new_tokens: 512
```
## Best Practices
- Use descriptive names that indicate the evaluation purpose
- Provide comprehensive descriptions for each evaluation type
- List endpoint types that are actually supported and tested
- Override parameters when they differ from framework defaults
- Use the `extra` object for framework-specific custom parameters
- Group related evaluations together in the same FDF
- Test each evaluation type with all specified endpoint types
(fdf-troubleshooting)=
# Troubleshooting
This section covers common issues encountered when creating and using Framework Definition Files.
## Common Issues
::::{dropdown} Template Errors
:icon: code-square
**Symptom**: Template rendering fails with syntax errors.
**Causes**:
- Missing closing braces in Jinja2 templates
- Invalid variable references
- Incorrect conditional syntax
**Solutions**:
Check that all template variables use correct syntax:
```yaml
# Correct
{{target.api_endpoint.model_id}}
# Incorrect
{{target.api_endpoint.model_id}
{target.api_endpoint.model_id}}
```
Verify conditional statements are properly formatted:
```jinja
# Correct
{% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}
# Incorrect
{% if config.params.limit_samples != none %} --first_n {{config.params.limit_samples}}{% end %}
```
::::
::::{dropdown} Parameter Conflicts
:icon: code-square
**Symptom**: Parameters are not overriding as expected.
**Causes**:
- Incorrect parameter paths in overrides
- Type mismatches between default and override values
- Missing parameter definitions in defaults section
- Incorrect indentation in the YAML config
**Solutions**:
Ensure parameter paths are correct:
```bash
# Correct
--overrides config.params.temperature=0.7
# Incorrect
--overrides params.temperature=0.7
--overrides config.temperature=0.7
```
Verify parameter types match:
```yaml
# Correct
temperature: 0.7 # Float
# Incorrect
temperature: "0.7" # String
```
Make sure to use the correct indentation:
```yaml
# Correct
defaults:
  config:
    params:
      limit_samples: null
      max_new_tokens: 4096  # max_new_tokens belongs to params

# Incorrect
defaults:
  config:
    params:
      limit_samples: null
    max_new_tokens: 4096    # max_new_tokens is outside of params
```
::::
::::{dropdown} Type Mismatches
:icon: code-square
**Symptom**: Validation errors about incorrect parameter types.
**Causes**:
- String values used for numeric parameters
- Missing quotes for string values
- Boolean values as strings
**Solutions**:
Use correct types for each parameter:
```yaml
# Correct
temperature: 0.7 # Float
max_new_tokens: 1024 # Integer
add_system_prompt: false # Boolean
task: "humaneval" # String
# Incorrect
temperature: "0.7" # String instead of float
max_new_tokens: "1024" # String instead of integer
add_system_prompt: "false" # String instead of boolean
```
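A quick way to check which types YAML actually parses is to load the snippet yourself (a sketch using PyYAML, assuming it is installed):

```python
import yaml

snippet = """
temperature: 0.7
max_new_tokens: 1024
add_system_prompt: false
task: "humaneval"
"""
parsed = yaml.safe_load(snippet)

# Unquoted numbers and booleans parse to native types;
# quoting them would turn them into strings.
print(type(parsed["temperature"]).__name__)        # float
print(type(parsed["max_new_tokens"]).__name__)     # int
print(type(parsed["add_system_prompt"]).__name__)  # bool
```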
::::
::::{dropdown} Missing Fields
:icon: code-square
**Symptom**: Validation fails with "required field missing" errors.
**Causes**:
- Incomplete framework section
- Missing required parameters
- Omitted evaluation configurations
**Solutions**:
Ensure all required framework fields are present:
```yaml
framework:
  name: your-framework          # Required
  pkg_name: your_framework      # Required
  full_name: Your Framework     # Required
  description: Description...   # Required
  url: https://github.com/...   # Required
```
Include all required evaluation fields:
```yaml
evaluations:
  - name: task_name                 # Required
    description: Task description   # Required
    defaults:
      config:
        type: "task_type"           # Required
        supported_endpoint_types:   # Required
          - chat
```
::::
## Debug Mode
Enable debug logging to see how your FDF is processed. Use the `--debug` flag or set the logging level:
```bash
# Using debug flag
nemo-evaluator run_eval --eval_type your_evaluation --debug
# Or set log level environment variable
export LOG_LEVEL=DEBUG
nemo-evaluator run_eval --eval_type your_evaluation
```
### Debug Output
Debug mode provides detailed information about:
- FDF discovery and loading
- Template variable resolution
- Parameter inheritance and overrides
- Command generation
- Validation errors with stack traces
### Interpreting Debug Logs
Debug logs show the FDF loading and processing workflow. Key information includes:
**FDF Loading**: Shows which framework.yml files are discovered and loaded
**Template Rendering**: Displays template variable substitution and final rendered commands
**Parameter Overrides**: Shows how configuration values cascade through the inheritance hierarchy
**Validation Errors**: Provides detailed error messages when FDF structure or templates are invalid
## Validation Tips
**Test incrementally**: Start with a minimal FDF and add sections progressively.
**Validate templates separately**: Test Jinja2 templates in isolation before adding to FDF.
**Check references**: Ensure all template variables reference existing configuration paths.
**Use examples**: Base your FDF on existing, working examples from the NeMo Evaluator repository.
**Verify syntax**: Use a YAML validator to catch formatting errors.
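To validate templates separately, you can render them in isolation with Jinja2's `StrictUndefined` (the same mode the loader uses), so a typo in a variable path fails immediately instead of rendering as empty text. A minimal sketch, assuming Jinja2 is installed:

```python
from jinja2 import Environment, StrictUndefined

env = Environment(undefined=StrictUndefined)
template = env.from_string(
    "my-eval-cli --model {{target.api_endpoint.model_id}}"
    "{% if config.params.limit_samples is not none %}"
    " --first_n {{config.params.limit_samples}}{% endif %}"
)

# Render against dicts that mirror the configuration structure.
rendered = template.render(
    target={"api_endpoint": {"model_id": "meta/llama-3.2-3b-instruct"}},
    config={"params": {"limit_samples": 10}},
)
print(rendered)
# A misspelled variable path (e.g. {{target.api_endpint}}) raises
# UndefinedError during render instead of silently producing "".
```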
## Getting Help
If you encounter issues not covered here:
1. Check the FDF examples in the NeMo Evaluator repository
2. Review debug logs for specific error messages
3. Verify your framework's CLI works independently
4. Consult the {ref}`extending-evaluator` documentation
5. Search for similar issues in the project's issue tracker
(framework-section)=
# Framework Section
The `framework` section contains basic identification and metadata for your evaluation framework.
## Structure
```yaml
framework:
  name: example-evaluation-framework       # Internal framework identifier
  pkg_name: example_evaluation_framework   # Python package name
  full_name: Example Evaluation Framework  # Human-readable display name
  description: A comprehensive example...  # Detailed description
  url: https://github.com/example/...      # Original repository URL
```
## Fields
### name
**Type**: String
**Required**: Yes
Unique identifier used internally by the system. This should be a lowercase, hyphenated string that identifies your framework.
**Example**:
```yaml
name: bigcode-evaluation-harness
```
### pkg_name
**Type**: String
**Required**: Yes
Python package name for your framework. This typically matches the `name` field but uses underscores instead of hyphens to follow Python naming conventions.
**Example**:
```yaml
pkg_name: bigcode_evaluation_harness
```
### full_name
**Type**: String
**Required**: Recommended
Human-readable name displayed in the UI and documentation. Use proper capitalization and spacing.
**Example**:
```yaml
full_name: BigCode Evaluation Harness
```
### description
**Type**: String
**Required**: Recommended
Comprehensive description of the framework's purpose, capabilities, and use cases. This helps users understand when to use your framework.
**Example**:
```yaml
description: A comprehensive evaluation harness for code generation models, supporting multiple programming languages and diverse coding tasks.
```
### url
**Type**: String (URL)
**Required**: Recommended
Link to the original benchmark or framework repository. This provides users with access to more documentation and source code.
**Example**:
```yaml
url: https://github.com/bigcode-project/bigcode-evaluation-harness
```
## Best Practices
- Use consistent naming across `name`, `pkg_name`, and `full_name`
- Keep the `name` field URL-friendly (lowercase, hyphens)
- Write clear, concise descriptions that highlight unique features
- Link to the canonical upstream repository when available
- Verify that the URL is accessible and up-to-date
## Minimal Requirements
At minimum, an FDF requires the `name` and `pkg_name` fields. However, including `full_name`, `description`, and `url` is strongly recommended for better documentation and user experience.
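The smallest valid framework section is therefore just:

```yaml
framework:
  name: my-eval
  pkg_name: my_eval
```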
(framework-definition-file)=
# Framework Definition File (FDF)
Framework Definition Files are YAML configuration files that integrate evaluation frameworks into NeMo Evaluator. They define framework metadata, execution commands, and evaluation tasks.
**New to FDFs?** Learn about {ref}`the concepts and architecture ` before creating one.
## Prerequisites
Before creating an FDF, you should:
- Understand YAML syntax and structure
- Be familiar with your evaluation framework's CLI interface
- Have basic knowledge of Jinja2 templating
- Know the API endpoint types your framework supports
## Getting Started
**Creating your first FDF?** Follow this sequence:
1. {ref}`framework-section` - Define framework metadata
2. {ref}`defaults-section` - Configure command templates and parameters
3. {ref}`evaluations-section` - Define evaluation tasks
4. {ref}`integration` - Integrate with Eval Factory
**Need help?** Refer to {ref}`fdf-troubleshooting` for debugging common issues.
## Complete Example
The FDF follows a hierarchical structure with three main sections. Here's a minimal but complete example:
```yaml
# 1. Framework Identification
framework:
  name: my-custom-eval
  pkg_name: my_custom_eval
  full_name: My Custom Evaluation Framework
  description: Evaluates domain-specific capabilities
  url: https://github.com/example/my-eval

# 2. Default Command and Configuration
defaults:
  command: >-
    {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}
    my-eval-cli --model {{target.api_endpoint.model_id}}
    --task {{config.params.task}}
    --output {{config.output_dir}}
  config:
    params:
      temperature: 0.0
      max_new_tokens: 1024
  target:
    api_endpoint:
      type: chat
      supported_endpoint_types:
        - chat
        - completions

# 3. Evaluation Types
evaluations:
  - name: my_task_1
    description: First evaluation task
    defaults:
      config:
        type: my_task_1
        supported_endpoint_types:
          - chat
        params:
          task: my_task_1
```
## Reference Documentation
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Framework Section
:link: framework-section
:link-type: ref
Define framework metadata including name, package information, and repository URL.
:::
:::{grid-item-card} {octicon}`list-unordered;1.5em;sd-mr-1` Defaults Section
:link: defaults-section
:link-type: ref
Configure default parameters, command templates, and target endpoint settings.
:::
:::{grid-item-card} {octicon}`checklist;1.5em;sd-mr-1` Evaluations Section
:link: evaluations-section
:link-type: ref
Define specific evaluation types with task-specific configurations and parameters.
:::
:::{grid-item-card} {octicon}`telescope;1.5em;sd-mr-1` Advanced Features
:link: advanced-features
:link-type: ref
Use conditionals, parameter inheritance, and dynamic configuration in your FDF.
:::
:::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` Integration
:link: integration
:link-type: ref
Learn how to integrate your FDF with the Eval Factory system.
:::
:::{grid-item-card} {octicon}`question;1.5em;sd-mr-1` Troubleshooting
:link: fdf-troubleshooting
:link-type: ref
Debug common issues with template errors, parameters, and validation.
:::
::::
## Related Documentation
- {ref}`eval-custom-tasks` - Learn how to create custom evaluation tasks
- {ref}`extending-evaluator` - Overview of extending the NeMo Evaluator
- {ref}`parameter-overrides` - Using parameter overrides in evaluations
:::{toctree}
:maxdepth: 1
:hidden:
Framework Section
Defaults Section
Evaluations Section
Advanced Features
Integration
Troubleshooting
:::
(integration)=
# Integration with Eval Factory
This section describes how to integrate your Framework Definition File with the Eval Factory system.
## File Location
Place your FDF in the `core_evals/<your_framework>/` directory of your framework package:
```
your-framework/
  core_evals/
    your_framework/
      framework.yml    # This is your FDF
      output.py        # Output parser (custom)
      __init__.py      # Empty init file
  setup.py             # Package configuration
  README.md            # Framework documentation
```
### Directory Structure Explanation
**core_evals/**: Root directory for evaluation framework definitions. This directory name is required by the Eval Factory system.
**your_framework/**: Subdirectory named after your framework (must match `framework.name` from your FDF).
**framework.yml**: Your Framework Definition File. This exact filename is required.
**output.py**: Custom output parser for processing evaluation results. This file should implement the parsing logic specific to your framework's output format.
**__init__.py**: Empty initialization file to make the directory a Python package.
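The parser interface itself is defined by the NeMo Evaluator package; purely as a hypothetical illustration of the kind of logic `output.py` contains (reading a harness's raw results file and reducing it to aggregate metrics), consider:

```python
import json
from pathlib import Path


def summarize_results(output_dir: str) -> dict:
    """Hypothetical helper: aggregate per-sample results into metrics.

    Assumes the harness wrote a results.json of the form
    {"samples": [{"correct": true}, ...]}. Adapt both the file format
    and the function signature to your framework's actual output and
    to the parser interface nemo-evaluator expects.
    """
    raw = json.loads(Path(output_dir, "results.json").read_text())
    samples = raw["samples"]
    passed = sum(1 for s in samples if s["correct"])
    return {"accuracy": passed / len(samples), "count": len(samples)}
```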
## Validation
The FDF is validated by the NeMo Evaluator system when loaded. Validation occurs through Pydantic models that ensure:
- Required fields are present (`name`, `pkg_name`, `command`)
- Parameter types are correct (strings, integers, floats, lists)
- Template syntax is valid (Jinja2 parsing)
- Configuration consistency (endpoint types, parameter references)
### Validation Checks
**Schema Validation**: Pydantic models ensure required fields exist and have correct types when the FDF is parsed.
**Template Validation**: Jinja2 templates are rendered with `StrictUndefined`, which raises errors for undefined variables.
**Reference Validation**: Template variables must reference valid fields in the `Evaluation` model (`config`, `target`, `framework_name`, `pkg_name`).
**Consistency Validation**: Endpoint types and parameters should be consistent across framework defaults and evaluation-specific configurations.
## Registration
Once your FDF is properly located and validated, the Eval Factory system automatically:
1. Discovers your framework during initialization
2. Parses the FDF and validates its structure
3. Registers available evaluation types
4. Makes your framework available via CLI commands
## Using Your Framework
After successful integration, you can use your framework with the Eval Factory CLI:
```bash
# List available frameworks and tasks
nemo-evaluator ls
# Run an evaluation
nemo-evaluator run_eval --eval_type your_evaluation --model_id my-model ...
```
## Package Configuration
Ensure your `setup.py` or `pyproject.toml` includes the FDF in package data:
```python
from setuptools import setup, find_packages

setup(
    name="your-framework",
    packages=find_packages(),
    package_data={
        "core_evals": ["**/*.yml"],
    },
    include_package_data=True,
)
```
Or, equivalently, in `pyproject.toml`:
```toml
[tool.setuptools.package-data]
core_evals = ["**/*.yml"]
```
## Best Practices
- Follow the exact directory structure and naming conventions
- Test your FDF validation locally before deployment
- Document your framework's output format in README.md
- Include example configurations in your documentation
- Provide sample commands for common use cases
- Version your FDF changes alongside framework updates
- Keep the FDF synchronized with your framework's capabilities
(extending-evaluator)=
# Extending NeMo Evaluator
Extend NeMo Evaluator with custom benchmarks, evaluation frameworks, and integrations. Learn how to define new evaluation frameworks and integrate them into the NeMo Evaluator ecosystem using standardized configuration patterns.
::::{grid} 1 1 1 1
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Framework Definition File
:link: framework-definition-file
:link-type: ref
Learn how to create Framework Definition Files (FDF) to integrate custom evaluation frameworks and benchmarks into the NeMo Evaluator ecosystem.
:::
::::
## Extension Patterns
NeMo Evaluator supports several patterns for extending functionality:
### Framework Definition Files (FDF)
The primary extension mechanism uses YAML configuration files to define:
- Framework metadata and dependencies
- Default configurations and parameters
- Evaluation types and task definitions
- Container integration specifications
### Integration Benefits
- **Standardization**: Follow established patterns for configuration and execution
- **Reproducibility**: Leverage the same deterministic configuration system
- **Compatibility**: Work seamlessly with existing launchers and exporters
- **Community**: Share frameworks through the standard FDF format
## Start with Extensions
**Building a production framework?** Follow these steps:
1. **Review Existing Frameworks**: Study existing FDF files to understand the structure
2. **Define Your Framework**: Create an FDF that describes your evaluation framework
3. **Test Integration**: Validate that your framework works with NeMo Evaluator workflows
4. **Container Packaging**: Package your framework as a container for distribution
For detailed reference documentation, refer to {ref}`framework-definition-file`.
:::{toctree}
:caption: Extending NeMo Evaluator
:hidden:
Framework Definition File
:::
(lib-core)=
# NeMo Evaluator
The *Core Evaluation Engine* delivers standardized, reproducible AI model evaluation through containerized benchmarks and a flexible adapter architecture.
:::{tip}
**Need orchestration?** For CLI and multi-backend execution, use the [NeMo Evaluator Launcher](../nemo-evaluator-launcher/index.md).
:::
## Get Started
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Workflows
:link: workflows/index
:link-type: doc
Run evaluations using pre-built containers directly or integrate them through the Python API.
:::
:::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Containers
:link: containers/index
:link-type: doc
Ready-to-use evaluation containers with curated benchmarks and frameworks.
:::
::::
## Reference and Customization
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` Interceptors
:link: interceptors/index
:link-type: doc
Set up interceptors to handle requests, responses, logging, caching, and custom processing.
:::
:::{grid-item-card} {octicon}`log;1.5em;sd-mr-1` Logging
:link: logging
:link-type: doc
Comprehensive logging setup for evaluation runs, debugging, and audit trails.
:::
:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Extending
:link: extending/index
:link-type: doc
Add custom benchmarks and frameworks by defining configuration and interfaces.
:::
:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Python API Reference
:link: ../../references/api/nemo-evaluator/api/index
:link-type: doc
Python API documentation for programmatic evaluation control and integration.
:::
:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` CLI Reference
:link: ../../references/api/nemo-evaluator/cli
:link-type: doc
Command-line interface for direct container and evaluation execution.
:::
::::
:::{toctree}
:caption: NeMo Evaluator Core
:hidden:
About NeMo Evaluator
Workflows
Benchmark Containers
Interceptors
Logging
Extending
:::
(workflows-overview)=
# Workflows
Learn how to use NeMo Evaluator through different workflow patterns. Whether you prefer programmatic control through Python APIs or CLI, these guides provide practical examples for integrating evaluations into your ML pipelines.
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`command-palette;1.5em;sd-mr-1` CLI
:link: cli
:link-type: doc
Run evaluations using the pre-built NGC containers and command line interface.
:::
:::{grid-item-card} {octicon}`file-code;1.5em;sd-mr-1` Python API
:link: python-api
:link-type: doc
Use the NeMo Evaluator Python API to integrate evaluations directly into your existing ML pipelines and applications.
:::
::::
## Choose Your Workflow
- **Python API**: Integrate evaluations directly into your existing Python applications when you need dynamic configuration management or programmatic control.
- **CLI**: Use the CLI with CI/CD systems, container orchestration platforms, or other non-interactive workflows.
Both approaches use the same underlying evaluation package and produce identical, reproducible results. Choose based on your integration requirements and preferred level of abstraction.
:::{toctree}
:caption: Workflows
:hidden:
CLI
Python API
:::
(cli-workflows)=
# CLI Workflows
This document explains how to use evaluation containers within NeMo Evaluator workflows, focusing on command execution and configuration.
## Overview
Evaluation containers provide consistent, reproducible environments for running AI model evaluations. For a comprehensive list of all available containers, refer to {ref}`nemo-evaluator-containers`.
## Basic CLI
### Using YAML Configuration
Define your config:
```yaml
config:
  type: mmlu_pro
  output_dir: /workspace/results
  params:
    limit_samples: 10
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.2-3b-instruct
    type: chat
    api_key: NGC_API_KEY
```
Run evaluation:
```bash
export HF_TOKEN=hf_xxx
export NGC_API_KEY=nvapi-xxx
nemo-evaluator run_eval \
    --run_config /workspace/my_config.yml
```
### Using CLI overrides
Provide all arguments through CLI:
```bash
export HF_TOKEN=hf_xxx
export NGC_API_KEY=nvapi-xxx
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.2-3b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name NGC_API_KEY \
    --output_dir /workspace/results \
    --overrides 'config.params.limit_samples=10'
```
## Interceptor Configuration
The adapter system uses interceptors to modify requests and responses. Configure interceptors in the YAML run config (`adapter_config.interceptors`) or, for legacy parameters, through the `--overrides` flag.
For detailed interceptor configuration, refer to {ref}`nemo-evaluator-interceptors`.
:::{note}
Always include the `endpoint` interceptor at the end of your custom interceptor chain.
:::
### Enable Request Logging
```yaml
config:
  type: mmlu_pro
  output_dir: /workspace/results
  params:
    limit_samples: 10
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.2-3b-instruct
    type: chat
    api_key: NGC_API_KEY
    adapter_config:
      interceptors:
        - name: "request_logging"
          enabled: true
          config:
            max_requests: 1000
        - name: "endpoint"
          enabled: true
          config: {}
```
```bash
export HF_TOKEN=hf_xxx
export NGC_API_KEY=nvapi-xxx
nemo-evaluator run_eval \
    --run_config /workspace/my_config.yml
```
### Enable Caching
```yaml
config:
  type: mmlu_pro
  output_dir: /workspace/results
  params:
    limit_samples: 10
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.2-3b-instruct
    type: chat
    api_key: NGC_API_KEY
    adapter_config:
      interceptors:
        - name: "caching"
          enabled: true
          config:
            cache_dir: "./evaluation_cache"
            reuse_cached_responses: true
            save_requests: true
            save_responses: true
            max_saved_requests: 1000
            max_saved_responses: 1000
        - name: "endpoint"
          enabled: true
          config: {}
```
```bash
export HF_TOKEN=hf_xxx
export NGC_API_KEY=nvapi-xxx
nemo-evaluator run_eval \
    --run_config /workspace/my_config.yml
```
### Multiple Interceptors
```yaml
config:
  type: mmlu_pro
  output_dir: /workspace/results
  params:
    limit_samples: 10
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.2-3b-instruct
    type: chat
    api_key: NGC_API_KEY
    adapter_config:
      interceptors:
        - name: "caching"
          enabled: true
          config:
            cache_dir: "./evaluation_cache"
            reuse_cached_responses: true
            save_requests: true
            save_responses: true
            max_saved_requests: 1000
            max_saved_responses: 1000
        - name: "request_logging"
          enabled: true
          config:
            max_requests: 1000
        - name: "reasoning"
          config:
            start_reasoning_token: "<think>"
            end_reasoning_token: "</think>"
            add_reasoning: true
            enable_reasoning_tracking: true
        - name: "endpoint"
          enabled: true
          config: {}
```
```bash
export HF_TOKEN=hf_xxx
export NGC_API_KEY=nvapi-xxx
nemo-evaluator run_eval \
    --run_config /workspace/my_config.yml
```
### Legacy Configuration Support
Provide Interceptor configuration with `--overrides` flag:
```bash
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.2-3b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name NGC_API_KEY \
    --output_dir ./results \
    --overrides 'target.api_endpoint.adapter_config.use_request_logging=True,target.api_endpoint.adapter_config.max_saved_requests=1000,target.api_endpoint.adapter_config.use_caching=True,target.api_endpoint.adapter_config.caching_dir=./cache,target.api_endpoint.adapter_config.reuse_cached_responses=True'
```
:::{note}
Legacy parameters are automatically converted to the modern interceptor-based configuration. For new projects, use the YAML interceptor configuration shown above.
:::
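For reference, the legacy flags in the command above map roughly to this modern YAML (field names taken from the interceptor examples earlier on this page):

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: "request_logging"
          enabled: true
          config:
            max_requests: 1000
        - name: "caching"
          enabled: true
          config:
            cache_dir: ./cache
            reuse_cached_responses: true
        - name: "endpoint"
          enabled: true
          config: {}
```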
## Troubleshooting
### Port Conflicts
If you manually specify the adapter server port, you may encounter port conflicts. Try selecting a different port:
```bash
export ADAPTER_PORT=3828
export ADAPTER_HOST=localhost
```
:::{note}
You can also rely on NeMo Evaluator's dynamic port binding feature.
:::
### API Key Issues
Verify your API key environment variable:
```bash
echo $MY_API_KEY
```
## Environment Variables
### Adapter Server Configuration
```bash
export ADAPTER_PORT=3828 # Default: 3825
export ADAPTER_HOST=localhost
```
### API Key Management
```bash
export MY_API_KEY=your_api_key_here
export HF_TOKEN=your_hf_token_here
```
(python-api-workflows)=
# Python API
The NeMo Evaluator Python API provides programmatic access to evaluation capabilities through the `nemo-evaluator` package, allowing you to integrate evaluations into existing ML pipelines, automate workflows, and build custom evaluation applications.
## Overview
The Python API is built on top of NeMo Evaluator and provides:
- **Programmatic Evaluation**: Run evaluations from Python code using `evaluate`
- **Configuration Management**: Dynamic configuration and parameter management
- **Adapter Integration**: Access to the full adapter system capabilities
- **Result Processing**: Programmatic access to evaluation results
- **Pipeline Integration**: Seamless integration with existing ML workflows
## Basic Usage
### Basic Evaluation
Run a simple evaluation with minimal configuration:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=3,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=1
    )
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.2-3b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here"
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")
```
### Evaluation With Adapter Interceptors
Use interceptors for advanced features such as caching, logging, and reasoning:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=1
    )
)

# Configure adapter with interceptors
adapter_config = AdapterConfig(
    interceptors=[
        # Add custom system message
        InterceptorConfig(
            name="system_message",
            config={
                "system_message": "You are a helpful AI assistant. Please provide accurate and detailed answers."
            }
        ),
        # Enable request logging
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 50}
        ),
        # Enable caching
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache",
                "reuse_cached_responses": True
            }
        ),
        # Enable response logging
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 50}
        ),
        # Enable reasoning extraction
        InterceptorConfig(
            name="reasoning",
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>"
            }
        ),
        # Enable progress tracking
        InterceptorConfig(
            name="progress_tracking"
        ),
        InterceptorConfig(
            name="endpoint"
        ),
    ]
)

# Configure target with adapter
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.2-3b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
        adapter_config=adapter_config
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
print(f"Evaluation completed: {result}")
```
## Related Documentation
- **API Reference**: For complete API documentation, refer to the [API Reference](../api.md) page
- **Adapter Configuration**: For detailed interceptor configuration options, refer to the {ref}`adapters-usage` page
- **Interceptor Documentation**: For information about available interceptors, refer to the [Interceptors](../interceptors/index.md) page
(nemo-evaluator-interceptors)=
# Interceptors
Interceptors provide fine-grained control over request and response processing during model evaluation through a configurable pipeline architecture.
## Overview
The adapter system processes model API calls through a configurable pipeline of interceptors. Each interceptor can inspect, modify, or augment requests and responses as they flow through the evaluation process.
```{mermaid}
graph LR
    A[Evaluation Request] --> B[Adapter System]
    B --> C[Interceptor Pipeline]
    C --> D[Model API]
    D --> E[Response Pipeline]
    E --> F[Processed Response]

    subgraph "Request Processing"
        C --> G[System Message]
        G --> H[Payload Modifier]
        H --> I[Request Logging]
        I --> J[Caching Check]
        J --> K[Endpoint Call]
    end

    subgraph "Response Processing"
        E --> L[Response Logging]
        L --> M[Reasoning Extraction]
        M --> N[Progress Tracking]
        N --> O[Cache Storage]
    end

    style B fill:#f3e5f5
    style C fill:#e1f5fe
    style E fill:#e8f5e8
```
## Request Interceptors
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`comment;1.5em;sd-mr-1` System Messages
:link: system-messages
:link-type: doc
Modify system messages in requests.
:::
:::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Payload Modification
:link: payload-modification
:link-type: doc
Add, remove, or modify request parameters.
:::
:::{grid-item-card} {octicon}`sign-in;1.5em;sd-mr-1` Request Logging
:link: request-logging
:link-type: doc
Logs requests for debugging, analysis, and audit purposes.
:::
::::
## Request-Response Interceptors
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`cache;1.5em;sd-mr-1` Caching
:link: caching
:link-type: doc
Cache requests and responses to improve performance and reduce API calls.
:::
:::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` Endpoint
:link: endpoint
:link-type: doc
Communicates with the model endpoint.
:::
::::
## Response Interceptors
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`sign-out;1.5em;sd-mr-1` Response Logging
:link: response-logging
:link-type: doc
Logs responses for debugging, analysis, and audit purposes.
:::
:::{grid-item-card} {octicon}`pulse;1.5em;sd-mr-1` Progress Tracking
:link: progress-tracking
:link-type: doc
Track evaluation progress and status updates.
:::
:::{grid-item-card} {octicon}`alert;1.5em;sd-mr-1` Raising on Client Errors
:link: raise-client-error
:link-type: doc
Fail fast on non-retryable client errors.
:::
:::{grid-item-card} {octicon}`comment-discussion;1.5em;sd-mr-1` Reasoning
:link: reasoning
:link-type: doc
Handle reasoning tokens and track reasoning metrics.
:::
:::{grid-item-card} {octicon}`meter;1.5em;sd-mr-1` Response Statistics
:link: response-stats
:link-type: doc
Collects statistics from API responses for metrics collection and analysis.
:::
::::
## Process Post-Evaluation Results
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`report;1.5em;sd-mr-1` Post-Evaluation Hooks
:link: post-evaluation-hooks
:link-type: doc
Run additional processing, reporting, or cleanup after evaluations complete.
:::
::::
:::{toctree}
:caption: Interceptors
:hidden:
System Messages
Payload Modification
Request Logging
Caching
Endpoint
Response Logging
Progress Tracking
Raising on Client Errors
Reasoning
Response Statistics
Post-Evaluation Hooks
:::
(interceptor-system-messages)=
# System Messages
## Overview
The `SystemMessageInterceptor` modifies incoming requests to include custom system messages. This interceptor works with chat-format requests and, depending on the configured strategy, replaces, prepends to, or appends to any existing system message.
:::{tip}
Add {ref}`interceptor-request-logging` to your interceptor chain to verify if your requests are modified correctly.
:::
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.use_system_prompt=True,target.api_endpoint.adapter_config.custom_system_prompt="You are a helpful assistant."'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: system_message
config:
system_message: "You are a helpful AI assistant."
strategy: "prepend" # Optional: "replace", "append", or "prepend" (default)
- name: "endpoint"
enabled: true
config: {}
```
**Example with different strategies:**
```yaml
# Replace existing system message
- name: system_message
config:
system_message: "You are a precise assistant."
strategy: "replace"
# Prepend to existing system message (default)
- name: system_message
config:
system_message: "Important: "
strategy: "prepend"
# Append to existing system message
- name: system_message
config:
system_message: "Remember to be concise."
strategy: "append"
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
## Behavior
With the `replace` strategy, the interceptor modifies chat-format requests by:
1. Removing any existing system messages from the messages array
2. Inserting the configured system message as the first message
3. Preserving all other request parameters
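The steps above can be sketched in plain Python. This is an illustrative re-implementation under assumptions, not the interceptor's actual code, and the function name `apply_system_message` is invented:

```python
def apply_system_message(payload: dict, system_message: str,
                         strategy: str = "prepend") -> dict:
    """Illustrative sketch of the system-message transformation."""
    existing = [m for m in payload.get("messages", []) if m["role"] == "system"]
    others = [m for m in payload.get("messages", []) if m["role"] != "system"]
    if strategy == "replace" or not existing:
        content = system_message
    elif strategy == "prepend":
        content = system_message + existing[0]["content"]
    else:  # "append"
        content = existing[0]["content"] + system_message
    # The (possibly merged) system message always becomes the first message.
    return {**payload, "messages": [{"role": "system", "content": content}] + others}
```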
### Example Request Transformation
```python
# Original request
{
"messages": [
{"role": "user", "content": "What is 2+2?"}
]
}
# After system message interceptor
{
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is 2+2?"}
]
}
```
If an existing system message is present, the `replace` strategy substitutes the configured message for it:
```python
# Original request with existing system message
{
"messages": [
{"role": "system", "content": "Old system message"},
{"role": "user", "content": "What is 2+2?"}
]
}
# After system message interceptor
{
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is 2+2?"}
]
}
```
(interceptor-caching)=
# Caching
## Overview
The `CachingInterceptor` implements a caching system that can store responses based on request content, enabling faster re-runs of evaluations and reducing costs when using paid APIs.
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.use_caching=True,target.api_endpoint.adapter_config.caching_dir=./cache,target.api_endpoint.adapter_config.reuse_cached_responses=True'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "caching"
enabled: true
config:
cache_dir: "./evaluation_cache"
reuse_cached_responses: true
save_requests: true
save_responses: true
max_saved_requests: 1000
max_saved_responses: 1000
- name: "endpoint"
enabled: true
config: {}
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
## Cache Key Generation
The interceptor generates the cache key by creating a SHA256 hash of the JSON-serialized request data using `json.dumps()` with `sort_keys=True` for consistent ordering.
```python
import hashlib
import json
# Request data
request_data = {
"messages": [{"role": "user", "content": "What is 2+2?"}],
"temperature": 0.0,
"max_new_tokens": 512
}
# Generate cache key
data_str = json.dumps(request_data, sort_keys=True)
cache_key = hashlib.sha256(data_str.encode("utf-8")).hexdigest()
# Result: "abc123def456..." (64-character hex string)
```
## Cache Storage Format
The caching interceptor stores data in three separate disk-backed key-value stores within the configured cache directory:
- **Response Cache** (`{cache_dir}/responses/`): Stores raw response content (bytes) keyed by cache key (when `save_responses=True` or `reuse_cached_responses=True`)
- **Headers Cache** (`{cache_dir}/headers/`): Stores response headers (dictionary) keyed by cache key (when `save_requests=True`)
- **Request Cache** (`{cache_dir}/requests/`): Stores request data (dictionary) keyed by cache key (when `save_requests=True`)
Each cache uses a SHA256 hash of the request data as the lookup key. When a cache hit occurs, the interceptor retrieves both the response content and headers using the same cache key.
## Cache Behavior
### Cache Hit Process
1. **Request arrives** at the caching interceptor
2. **Generate cache key** from request parameters
3. **Check cache** for existing response
4. **Return cached response** if found (sets `cache_hit=True`)
5. **Skip API call** and continue to next interceptor
### Cache Miss Process
1. **Request continues** to endpoint interceptor
2. **Response received** from model API
3. **Store response** in cache with generated key
4. **Continue processing** with response interceptors
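The two flows above amount to a content-addressed lookup using the cache key described earlier. A minimal sketch, with an in-memory dict standing in for the disk-backed store and `call_endpoint` as a hypothetical stand-in for the real API call:

```python
import hashlib
import json

cache: dict[str, bytes] = {}  # stands in for the disk-backed response store

def cached_call(request_data: dict, call_endpoint) -> tuple[bytes, bool]:
    """Return (response, cache_hit) for a request, caching on miss."""
    key = hashlib.sha256(
        json.dumps(request_data, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if key in cache:
        return cache[key], True             # cache hit: skip the API call
    response = call_endpoint(request_data)  # cache miss: call the model API
    cache[key] = response                   # store for future re-runs
    return response, False
```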
(interceptor-endpoint)=
# Endpoint Interceptor
## Overview
**Required interceptor** that handles the actual API communication. This interceptor must be present in every configuration as it performs the final request to the target API endpoint.
**Important**: This interceptor should always be placed after the last request interceptor and before the first response interceptor.
## Configuration
### CLI Configuration
```bash
# The endpoint interceptor is automatically enabled and requires no additional CLI configuration
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "endpoint"
enabled: true
config: {}
```
## Configuration Options
The Endpoint Interceptor is configured automatically. No configuration is required.
(interceptor-payload-modification)=
# Payload Modification
## Overview
`PayloadParamsModifierInterceptor` adds, removes, or modifies request parameters before sending them to the model endpoint.
:::{tip}
Add {ref}`interceptor-request-logging` to your interceptor chain to verify if your requests are modified correctly.
:::
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.params_to_add={"temperature":0.7},target.api_endpoint.adapter_config.params_to_remove=["max_tokens"]'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "payload_modifier"
enabled: true
config:
params_to_add:
temperature: 0.7
top_p: 0.9
params_to_remove:
- "top_k" # top-level field in the payload to remove
- "reasoning_content" # field in the message to remove
params_to_rename:
old_param: "new_param"
- name: "endpoint"
enabled: true
config: {}
```
:::{note}
In the example above, the `reasoning_content` field will be removed recursively from all messages in the payload.
:::
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
:::{note}
The interceptor applies operations in the following order: remove → add → rename. This means you can remove a parameter and then add a different value for the same parameter name.
:::
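The remove → add → rename ordering can be sketched for top-level parameters (the real interceptor also removes fields recursively from messages; the function name here is illustrative):

```python
def modify_payload(payload: dict, params_to_add=None,
                   params_to_remove=None, params_to_rename=None) -> dict:
    """Illustrative sketch of the documented operation order."""
    out = dict(payload)
    for key in (params_to_remove or []):               # 1. remove
        out.pop(key, None)
    out.update(params_to_add or {})                    # 2. add
    for old, new in (params_to_rename or {}).items():  # 3. rename
        if old in out:
            out[new] = out.pop(old)
    return out
```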
## Use Cases
### Parameter Standardization
Ensure consistent parameters across evaluations by adding or removing parameters:
```yaml
config:
params_to_add:
temperature: 0.7
top_p: 0.9
params_to_remove:
- "frequency_penalty"
- "presence_penalty"
```
### Model-Specific Configuration
Add parameters required by specific model endpoints, such as chat template configuration:
```yaml
config:
params_to_add:
extra_body:
chat_template_kwargs:
enable_thinking: false
```
### API Compatibility
Rename parameters for compatibility with different API versions or endpoint specifications:
```yaml
config:
params_to_rename:
max_new_tokens: "max_tokens"
num_return_sequences: "n"
```
# Post-Evaluation Hooks
Run processing or reporting tasks after evaluations complete.
Post-evaluation hooks execute after the main evaluation finishes. The built-in `PostEvalReportHook` generates HTML and JSON reports from cached request-response pairs.
## Report Generation
Generate HTML and JSON reports with evaluation request-response examples.
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
post_eval_hooks:
- name: "post_eval_report"
enabled: true
config:
report_types: ["html", "json"]
html_report_size: 10
```
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.generate_html_report=True'
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
## Report Output
The hook generates reports in the evaluation output directory:
- **HTML Report**: `{output_dir}/report.html` - Interactive report with request-response pairs and curl commands
- **JSON Report**: `{output_dir}/report.json` - Machine-readable report with structured data
# Progress Tracking
## Overview
`ProgressTrackingInterceptor` tracks evaluation progress by counting processed samples and optionally sending updates to a webhook endpoint.
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.use_progress_tracking=True,target.api_endpoint.adapter_config.progress_tracking_url=http://monitoring:3828/progress'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "endpoint"
enabled: true
config: {}
- name: "progress_tracking"
enabled: true
config:
progress_tracking_url: "http://monitoring:3828/progress"
progress_tracking_interval: 10
request_method: "PATCH"
output_dir: "/tmp/output"
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
## Behavior
The interceptor tracks the number of responses processed and:
1. **Sends webhook updates**: Posts progress updates to the configured URL at the specified interval
2. **Saves progress to disk**: If `output_dir` is configured, writes progress count to a `progress` file in that directory
3. **Resumes from checkpoint**: If a progress file exists on initialization, resumes counting from that value
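The checkpoint-and-resume behavior can be sketched with a plain `progress` file (an illustrative sketch under assumptions, not the interceptor's source; the class name is invented):

```python
import os

class ProgressCounter:
    """Count processed samples, persisting to <output_dir>/progress for resume."""
    def __init__(self, output_dir: str):
        self.path = os.path.join(output_dir, "progress")
        self.count = 0
        if os.path.exists(self.path):         # resume from an existing checkpoint
            with open(self.path) as f:
                self.count = int(f.read().strip() or 0)

    def increment(self) -> int:
        self.count += 1
        with open(self.path, "w") as f:       # save progress to disk
            f.write(str(self.count))
        return self.count
```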
(interceptor-raise-client-error)=
# Raise Client Error Interceptor
## Overview
The `RaiseClientErrorInterceptor` handles non-retryable client errors by raising exceptions instead of continuing the benchmark evaluation. By default, it raises exceptions on 4xx HTTP status codes (excluding 408 Request Timeout and 429 Too Many Requests, which are typically retryable).
This interceptor is useful when you want to fail fast on client errors that indicate configuration issues, authentication problems, or other non-recoverable errors rather than continuing the evaluation with failed requests.
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.use_raise_client_errors=True'
```
### YAML Configuration
::::{tab-set}
:::{tab-item} Default Configuration
Raises on 4xx status codes except 408 (Request Timeout) and 429 (Too Many Requests).
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "endpoint"
enabled: true
config: {}
- name: "raise_client_errors"
enabled: true
config:
# Default configuration - raises on 4xx except 408, 429
exclude_status_codes: [408, 429]
status_code_range_start: 400
status_code_range_end: 499
```
:::
:::{tab-item} Specific Status Codes
Raises only on specific status codes rather than a range.
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "raise_client_errors"
enabled: true
config:
# Custom configuration - only specific status codes
status_codes: [400, 401, 403, 404]
- name: "endpoint"
enabled: true
config: {}
```
:::
:::{tab-item} Custom Exclusions
Uses a status code range with custom exclusions, including 404 Not Found.
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "raise_client_errors"
enabled: true
config:
# Custom range with different exclusions
status_code_range_start: 400
status_code_range_end: 499
exclude_status_codes: [408, 429, 404] # Also exclude 404 not found
- name: "endpoint"
enabled: true
config: {}
```
:::
::::
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
## Behavior
### Default Behavior
- Raises exceptions on HTTP status codes 400-499
- Excludes 408 (Request Timeout) and 429 (Too Many Requests) as these are typically retryable
- Logs critical errors before raising the exception
### Configuration Logic
1. If `status_codes` is specified, only those exact status codes will trigger exceptions
2. If `status_codes` is not specified, the range defined by `status_code_range_start` and `status_code_range_end` is used
3. `exclude_status_codes` are always excluded from raising exceptions
4. A status code cannot appear in both `status_codes` and `exclude_status_codes`
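The decision logic above can be expressed as a predicate (an illustrative sketch with the documented defaults, not the interceptor's source):

```python
def should_raise(status_code: int, status_codes=None,
                 range_start: int = 400, range_end: int = 499,
                 exclude_status_codes=(408, 429)) -> bool:
    """Return True if this status code should trigger a fatal exception."""
    if status_code in exclude_status_codes:
        return False                          # exclusions always win
    if status_codes is not None:
        return status_code in status_codes    # exact-match mode
    return range_start <= status_code <= range_end  # range mode
```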
### Error Handling
- Raises `FatalErrorException` when a matching status code is encountered
- Logs critical error messages with status code and URL information
- Stops the evaluation process immediately
## Examples
::::{tab-set}
:::{tab-item} Auth Failures Only
Raises exceptions only on authentication and authorization failures.
```yaml
config:
status_codes: [401, 403]
```
:::
:::{tab-item} All Client Errors Except Rate Limiting
Raises on all 4xx errors except timeout and rate limit errors.
```yaml
config:
status_code_range_start: 400
status_code_range_end: 499
exclude_status_codes: [408, 429]
```
:::
:::{tab-item} Strict Mode - All Client Errors
Raises exceptions on any 4xx status code without exclusions.
```yaml
config:
status_code_range_start: 400
status_code_range_end: 499
exclude_status_codes: []
```
:::
::::
## Common Use Cases
- **API Configuration Validation**: Fail immediately on authentication errors (401, 403)
- **Input Validation**: Stop evaluation on bad request errors (400)
- **Resource Existence**: Fail on not found errors (404) for critical resources
- **Development/Testing**: Use strict mode to catch all client-side issues
- **Production**: Use default settings to allow retryable errors while catching configuration issues
(interceptor-reasoning)=
# Reasoning
## Overview
The `ResponseReasoningInterceptor` handles models that generate explicit reasoning steps, typically enclosed in special tokens. It removes reasoning content from the final response and tracks reasoning metrics for analysis.
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.process_reasoning_traces=True,target.api_endpoint.adapter_config.end_reasoning_token="</think>",target.api_endpoint.adapter_config.start_reasoning_token="<think>"'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "endpoint"
enabled: true
config: {}
- name: reasoning
config:
start_reasoning_token: "<think>"
end_reasoning_token: "</think>"
add_reasoning: true
enable_reasoning_tracking: true
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
## Processing Examples
### Basic Reasoning Stripping
```python
# Original response from model
original_content = "<think>Let me solve this step by step. 2+2 is basic addition. 2 plus 2 equals 4.</think>The answer is 4."
# After reasoning interceptor processing
# The content field has reasoning removed
processed_content = "The answer is 4."
```
### Multi-Step Reasoning
```python
# Original response with multi-line reasoning
original_content = """<think>
This is a word problem. Let me break it down:
1. John has 5 apples
2. He gives away 2 apples
3. So he has 5 - 2 = 3 apples left
</think>
John has 3 apples remaining."""
# After processing: reasoning tokens and content are removed
processed_content = "John has 3 apples remaining."
```
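The stripping step in both examples can be sketched as follows, assuming the commonly used `<think>`/`</think>` reasoning tokens (the function name is illustrative, and the real interceptor's handling of unfinished reasoning may differ):

```python
def strip_reasoning(content: str, start_token: str = "<think>",
                    end_token: str = "</think>") -> tuple[str, bool]:
    """Return (visible_content, reasoning_finished) for one response."""
    start = content.find(start_token)
    if start == -1:
        return content, False          # no reasoning present
    end = content.find(end_token, start)
    if end == -1:
        return content, False          # reasoning started but never finished
    stripped = content[:start] + content[end + len(end_token):]
    return stripped.strip(), True
```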
## Tracked Metrics
The interceptor automatically tracks the following statistics:
| Metric | Description |
|--------|-------------|
| `total_responses` | Total number of responses processed |
| `responses_with_reasoning` | Number of responses containing reasoning content |
| `reasoning_finished_count` | Number of responses where reasoning completed (end token found) |
| `reasoning_finished_ratio` | Ratio (0-1) of responses with completed reasoning to all responses with reasoning |
| `reasoning_started_count` | Number of responses where reasoning started |
| `reasoning_unfinished_count` | Number of responses where reasoning started but did not complete (end token not found) |
| `avg_reasoning_words` | Average word count in reasoning content |
| `avg_reasoning_tokens` | Average token count in reasoning content |
| `avg_original_content_words` | Average word count in original content (before processing) |
| `avg_updated_content_words` | Average word count in updated content (after processing) |
| `avg_updated_content_tokens` | Average token count in updated content |
| `max_reasoning_words` | Maximum word count in reasoning content |
| `max_reasoning_tokens` | Maximum token count in reasoning content |
| `max_original_content_words` | Maximum word count in original content (before processing) |
| `max_updated_content_words` | Maximum word count in updated content (after processing) |
| `max_updated_content_tokens` | Maximum token count in updated content |
| `total_reasoning_words` | Total word count across all reasoning content |
| `total_reasoning_tokens` | Total token count across all reasoning content |
| `total_original_content_words` | Total word count in original content (before processing) |
| `total_updated_content_words` | Total word count in updated content (after processing) |
| `total_updated_content_tokens` | Total token count in updated content |
These statistics are saved to `eval_factory_metrics.json` under the `reasoning` key after evaluation completes.
## Example: Custom Reasoning Tokens
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: reasoning
config:
start_reasoning_token: "[REASONING]"
end_reasoning_token: "[/REASONING]"
add_reasoning: true
enable_reasoning_tracking: true
- name: "endpoint"
enabled: true
config: {}
```
(interceptor-request-logging)=
# Request Logging Interceptor
## Overview
The `RequestLoggingInterceptor` captures and logs incoming API requests for debugging, analysis, and audit purposes. This interceptor is essential for troubleshooting evaluation issues and understanding request patterns.
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.use_request_logging=True,target.api_endpoint.adapter_config.max_saved_requests=1000'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "request_logging"
enabled: true
config:
max_requests: 1000
- name: "endpoint"
enabled: true
config: {}
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
(interceptor-response-logging)=
# Response Logging Interceptor
## Overview
The `ResponseLoggingInterceptor` captures and logs API responses for analysis and debugging. Use this interceptor to examine model outputs and identify response patterns.
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.use_response_logging=True,target.api_endpoint.adapter_config.max_saved_responses=1000'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "endpoint"
enabled: true
config: {}
- name: "response_logging"
enabled: true
config:
max_responses: 1000
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
(interceptor-response-stats)=
# Response Stats Interceptor
## Overview
The `ResponseStatsInterceptor` collects comprehensive aggregated statistics from API responses for metrics collection and analysis. It tracks detailed metrics about token usage, response patterns, performance characteristics, and API behavior throughout the evaluation process.
This interceptor is essential for understanding API performance, analyzing costs, and monitoring evaluation runs. It provides both real-time aggregated statistics and detailed per-request tracking capabilities.
**Key Statistics Tracked:**
- Token usage (prompt, completion, total) with averages and maximums
- Response status codes and counts
- Finish reasons and stop reasons
- Tool calls and function calls counts
- Response latency (average and maximum)
- Total response count and successful responses
- Inference run times and timing analysis
## Configuration
### CLI Configuration
```bash
--overrides 'target.api_endpoint.adapter_config.tracking_requests_stats=True,target.api_endpoint.adapter_config.response_stats_cache=/tmp/response_stats_interceptor,target.api_endpoint.adapter_config.logging_aggregated_stats_interval=100'
```
### YAML Configuration
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "response_stats"
enabled: true
config:
# Default configuration - collect all statistics
collect_token_stats: true
collect_finish_reasons: true
collect_tool_calls: true
save_individuals: true
cache_dir: "/tmp/response_stats_interceptor"
logging_aggregated_stats_interval: 100
- name: "endpoint"
enabled: true
config: {}
```
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "response_stats"
enabled: true
config:
# Minimal configuration - only basic stats
collect_token_stats: false
collect_finish_reasons: false
collect_tool_calls: false
save_individuals: false
logging_aggregated_stats_interval: 50
- name: "endpoint"
enabled: true
config: {}
```
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "endpoint"
enabled: true
config: {}
- name: "response_stats"
enabled: true
config:
# Custom configuration with periodic saving
collect_token_stats: true
collect_finish_reasons: true
collect_tool_calls: true
stats_file_saving_interval: 100
save_individuals: true
cache_dir: "/custom/stats/cache"
logging_aggregated_stats_interval: 25
```
## Configuration Options
For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
## Behavior
### Statistics Collection
The interceptor automatically collects statistics from successful API responses (HTTP 200) and tracks basic information for all responses regardless of status code.
**For Successful Responses (200):**
- Parses JSON response body
- Extracts token usage from `usage` field
- Collects finish reasons from `choices[].finish_reason`
- Counts tool calls and function calls
- Calculates running averages and maximums
**For All Responses:**
- Tracks status code distribution
- Measures response latency
- Records response timestamps
- Maintains response counts
### Data Storage
- **Aggregated Stats**: Continuously updated running statistics stored in cache
- **Individual Stats**: Per-request details stored with request IDs (if enabled)
- **Metrics File**: Final statistics saved to `eval_factory_metrics.json`
- **Thread Safety**: All operations are thread-safe using locks
### Timing Analysis
- Tracks inference run times across multiple evaluation runs
- Calculates time from first to last request per run
- Estimates time to first request from adapter initialization
- Provides detailed timing breakdowns for performance analysis
## Statistics Output
### Aggregated Statistics
```json
{
"response_stats": {
"description": "Response statistics saved during processing",
"avg_prompt_tokens": 150.5,
"avg_total_tokens": 200.3,
"avg_completion_tokens": 49.8,
"avg_latency_ms": 1250.2,
"max_prompt_tokens": 300,
"max_total_tokens": 450,
"max_completion_tokens": 150,
"max_latency_ms": 3000,
"count": 1000,
"successful_count": 995,
"tool_calls_count": 50,
"function_calls_count": 25,
"finish_reason": {
"stop": 800,
"length": 150,
"tool_calls": 45
},
"status_codes": {
"200": 995,
"429": 3,
"500": 2
},
"inference_time": 45.6,
"run_id": 0
}
}
```
### Individual Request Statistics (if enabled)
```json
{
"request_id": "req_123",
"timestamp": 1698765432.123,
"status_code": 200,
"prompt_tokens": 150,
"total_tokens": 200,
"completion_tokens": 50,
"finish_reason": "stop",
"tool_calls_count": 0,
"function_calls_count": 0,
"run_id": 0
}
```
## Common Use Cases
- **Cost Analysis**: Track token usage patterns to estimate API costs
- **Performance Monitoring**: Monitor response times and throughput
- **Quality Assessment**: Analyze finish reasons and response patterns
- **Tool Usage Analysis**: Track function and tool call frequencies
- **Debugging**: Individual request tracking for troubleshooting
- **Capacity Planning**: Understand API usage patterns and limits
- **A/B Testing**: Compare statistics across different configurations
- **Production Monitoring**: Real-time visibility into API behavior
## Integration Notes
- **Post-Evaluation Hook**: Automatically saves final statistics after evaluation completes
- **Cache Persistence**: Statistics survive across runs and can be aggregated
- **Thread Safety**: Safe for concurrent request processing
- **Memory Efficient**: Uses running averages to avoid storing all individual values
- **Caching Strategy**: Handles cache hits by skipping statistics collection to avoid double-counting
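The memory-efficient running averages mentioned above can be maintained incrementally, one value at a time (an illustrative sketch, not the interceptor's code):

```python
class RunningStat:
    """Track average and maximum without storing every individual value."""
    def __init__(self):
        self.count = 0
        self.avg = 0.0
        self.max = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.avg += (value - self.avg) / self.count  # incremental mean
        self.max = max(self.max, value)
```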
(nemo-evaluator-logging)=
# Logging Configuration
This document describes how to configure and use logging in the NVIDIA NeMo Evaluator framework.
## Log Levels
Set these environment variables for logging configuration:
```bash
# Set log level (INFO, DEBUG, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=DEBUG
# or (legacy, still supported)
export NEMO_EVALUATOR_LOG_LEVEL=DEBUG
```
```{list-table}
:header-rows: 1
:widths: 15 35 50
* - Level
- Description
- Use Case
* - `INFO`
- General information
- Normal operation logs
* - `DEBUG`
- Detailed debugging
- Development and troubleshooting
* - `WARNING`
- Warning messages
- Potential issues
* - `ERROR`
- Error messages
- Problems that need attention
* - `CRITICAL`
- Critical errors
- Severe problems requiring immediate action
```
## Log Output
### Console Output
Logs appear in the console (stderr) with color coding:
- **Green**: INFO messages
- **Yellow**: WARNING messages
- **Red**: ERROR messages
- **Red background**: CRITICAL messages
- **Gray**: DEBUG messages
### Custom Log Directory
Specify a custom log directory using the `NEMO_EVALUATOR_LOG_DIR` environment variable:
```bash
# Set custom log directory
export NEMO_EVALUATOR_LOG_DIR=/path/to/logs/
# Run evaluation (logs will be written to the specified directory)
nemo-evaluator run_eval ...
```
If `NEMO_EVALUATOR_LOG_DIR` is not set, logs appear in the console (stderr) without file output.
## Using Logging Interceptors
NeMo Evaluator supports dedicated interceptors for request and response logging. Add logging to your adapter configuration:
```yaml
target:
api_endpoint:
adapter_config:
interceptors:
- name: "request_logging"
config:
log_request_body: true
log_request_headers: true
- name: "response_logging"
config:
log_response_body: true
log_response_headers: true
```
## Request Tracking
Each request automatically gets a unique UUID that appears in all related log messages. This helps trace requests through the system.
## Troubleshooting
### No logs appearing
- Enable logging interceptors in your configuration
- Verify log level with `LOG_LEVEL=INFO` or `NEMO_EVALUATOR_LOG_LEVEL=INFO`
### Missing DEBUG logs
- Set `LOG_LEVEL=DEBUG` or `NEMO_EVALUATOR_LOG_LEVEL=DEBUG`
### Logs not going to files
- Check directory permissions
- Verify log directory path with `NEMO_EVALUATOR_LOG_DIR`
### Debug mode
```bash
export LOG_LEVEL=DEBUG
```
## Examples
### Basic logging
```bash
# Enable DEBUG logging
export LOG_LEVEL=DEBUG
# Run evaluation with logging
nemo-evaluator run_eval --eval_type mmlu_pro --model_id gpt-4 ...
```
### Custom log directory
```bash
# Specify custom log location using environment variable
export NEMO_EVALUATOR_LOG_DIR=./my_logs/
# Run evaluation with logging to custom directory
nemo-evaluator run_eval --eval_type mmlu_pro ...
```
### Environment verification
```bash
echo "LOG_LEVEL: $LOG_LEVEL"
echo "NEMO_EVALUATOR_LOG_DIR: $NEMO_EVALUATOR_LOG_DIR"
```
(references-overview)=
# References
Comprehensive reference documentation for NeMo Evaluator APIs, functions, and configuration options.
## CLI vs. Programmatic Usage
The NeMo Evaluator SDK supports two usage patterns:
1. **CLI Usage** (Recommended): Use the `nemo-evaluator` and `nemo-evaluator-launcher` binaries, which parse command-line arguments
2. **Programmatic Usage**: Use Python API with configuration objects
**When to Use Which:**
- **CLI**: For command-line tools, scripts, and simple automation
- **Programmatic**: For building custom applications, workflows, and integration with other systems
## API References
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`command-palette;1.5em;sd-mr-1` NeMo Evaluator Launcher CLI
:link: api/nemo-evaluator-launcher/cli
:link-type: doc
Comprehensive command-line interface reference with all commands, options, and examples.
:::
:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` NeMo Evaluator Launcher API
:link: api/nemo-evaluator-launcher/api
:link-type: doc
Complete Python API reference for programmatic evaluation workflows and job management.
:::
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration Schema
:link: ../libraries/nemo-evaluator-launcher/configuration/index
:link-type: doc
Configuration reference for NeMo Evaluator Launcher with examples for all executors and deployment types.
:::
:::{grid-item-card} {octicon}`command-palette;1.5em;sd-mr-1` NeMo Evaluator CLI
:link: api/nemo-evaluator/cli
:link-type: doc
Comprehensive command-line interface reference with all commands, options, and examples.
:::
:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` NeMo Evaluator Python API
:link: api/nemo-evaluator/api/index
:link-type: doc
Complete Python API reference for programmatic evaluation workflows and job management.
:::
::::
# Python API
The NeMo Evaluator Launcher provides a Python API for programmatic access to evaluation functionality. This allows you to integrate evaluations into your Python workflows, Jupyter notebooks, and automated pipelines.
## Installation
```bash
pip install nemo-evaluator-launcher
# With optional exporters
pip install nemo-evaluator-launcher[mlflow,wandb,gsheets]
```
## Core Functions
### Running Evaluations
```python
from nemo_evaluator_launcher.api import RunConfig, run_eval
# Run evaluation with configuration
config = RunConfig.from_hydra(
    config="examples/local_basic.yaml",
    hydra_overrides=[
        "execution.output_dir=my_results"
    ]
)
invocation_id = run_eval(config)
# Returns invocation ID for tracking
print(f"Started evaluation: {invocation_id}")
```
### Listing Available Tasks
```python
from nemo_evaluator_launcher.api import get_tasks_list
# Get all available evaluation tasks
tasks = get_tasks_list()
# Each task contains: [task_name, endpoint_type, harness, container]
for task in tasks[:5]:
    task_name, endpoint_type, harness, container = task
    print(f"Task: {task_name}, Type: {endpoint_type}")
```
### Checking Job Status
```python
from nemo_evaluator_launcher.api import get_status
# Check status of a specific invocation or job
status = get_status(["abc12345"])
# Returns list of status dictionaries with keys: invocation, job_id, status, progress, data
for job_status in status:
    print(f"Job {job_status['job_id']}: {job_status['status']}")
```
## Configuration Management
### Creating Configuration with Hydra
```python
from nemo_evaluator_launcher.api import RunConfig
from omegaconf import OmegaConf
# Load default configuration
config = RunConfig.from_hydra()
print(OmegaConf.to_yaml(config))
```
### Loading Existing Configuration
```python
from nemo_evaluator_launcher.api import RunConfig
# Load a specific configuration file
config = RunConfig.from_hydra(
    config="examples/local_basic.yaml"
)
```
### Configuration with Overrides
```python
import tempfile
from nemo_evaluator_launcher.api import RunConfig, run_eval
# Create configuration with both Hydra overrides and dictionary overrides
config = RunConfig.from_hydra(
    hydra_overrides=[
        "execution.output_dir=" + tempfile.mkdtemp()
    ],
    dict_overrides={
        "target": {
            "api_endpoint": {
                "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                "model_id": "meta/llama-3.2-3b-instruct",
                "api_key_name": "NGC_API_KEY"
            }
        },
        "evaluation": [
            {
                "name": "ifeval",
                "overrides": {
                    "config.params.limit_samples": 10
                }
            }
        ]
    }
)
# Run evaluation
invocation_id = run_eval(config)
```
### Exploring Deployment Options
```python
from nemo_evaluator_launcher.api import RunConfig
from omegaconf import OmegaConf
# Load configuration with different deployment backend
config = RunConfig.from_hydra(
    hydra_overrides=["deployment=vllm"]
)
print(OmegaConf.to_yaml(config))
```
## Jupyter Notebook Integration
```python
# Cell 1: Setup
import tempfile
from omegaconf import OmegaConf
from nemo_evaluator_launcher.api import RunConfig, get_status, get_tasks_list, run_eval
# Cell 2: List available tasks
tasks = get_tasks_list()
print("Available tasks:")
for task in tasks[:10]:  # Show first 10
    print(f" - {task[0]} ({task[1]})")
# Cell 3: Create and run evaluation
config = RunConfig.from_hydra(
    hydra_overrides=[
        "execution.output_dir=" + tempfile.mkdtemp()
    ],
    dict_overrides={
        "target": {
            "api_endpoint": {
                "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                "model_id": "meta/llama-3.2-3b-instruct",
                "api_key_name": "NGC_API_KEY"
            }
        },
        "evaluation": [
            {
                "name": "ifeval",
                "overrides": {
                    "config.params.limit_samples": 10
                }
            }
        ]
    }
)
invocation_id = run_eval(config)
print(f"Started evaluation: {invocation_id}")
# Cell 4: Check status
status_list = get_status([invocation_id])
status = status_list[0]
print(f"Status: {status['status']}")
print(f"Output directory: {status['data']['output_dir']}")
```
## See Also
- [CLI Reference](index.md) - Command-line interface documentation
- [Configuration](configuration/index.md) - Configuration system overview
- [Exporters](exporters/index.md) - Result export options
# NeMo Evaluator Launcher CLI Reference (nemo-evaluator-launcher)
The NeMo Evaluator Launcher provides a command-line interface for running evaluations, managing jobs, and exporting results. The CLI is available through the `nemo-evaluator-launcher` command.
## Global Options
```bash
nemo-evaluator-launcher --help # Show help
nemo-evaluator-launcher --version # Show version information
```
## Commands Overview
```{list-table}
:header-rows: 1
:widths: 20 80
* - Command
- Description
* - `run`
- Run evaluations with specified configuration
* - `status`
- Check status of jobs or invocations
* - `info`
- Show detailed job(s) information
* - `kill`
- Kill a job or invocation
* - `ls`
- List tasks or runs
* - `export`
- Export evaluation results to various destinations
* - `version`
- Show version information
```
## run - Run Evaluations
Execute evaluations using Hydra configuration management.
### Basic Usage
```bash
# Using example configurations
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
# With output directory override
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o execution.output_dir=/path/to/results
```
### Configuration Options
```bash
# Using custom config directory
nemo-evaluator-launcher run --config my_configs/my_evaluation.yaml
# Multiple overrides (Hydra syntax)
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o execution.output_dir=results \
-o target.api_endpoint.model_id=my-model \
-o +config.params.limit_samples=10
```
### Config Loading Modes
The `--config-mode` parameter controls how configuration files are loaded:
- **`hydra`** (default): Uses Hydra configuration system. Hydra handles configuration composition, overrides, and validation.
- **`raw`**: Loads the config file directly without Hydra processing. Useful for loading pre-generated complete configuration files.
```bash
# Default: Hydra mode (config file is processed by Hydra)
nemo-evaluator-launcher run --config my_config.yaml
# Explicit Hydra mode
nemo-evaluator-launcher run --config my_config.yaml --config-mode=hydra
# Raw mode: load config file directly (bypasses Hydra)
nemo-evaluator-launcher run --config complete_config.yaml --config-mode=raw
```
**Note:** When using `--config-mode=raw`, the `--config` parameter is required, and `-o/--override` cannot be used.
(launcher-cli-dry-run)=
### Dry Run
Preview the full resolved configuration without executing:
```bash
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run
```
### Test Runs
Run with limited samples for testing:
```bash
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o +config.params.limit_samples=10
```
### Task Filtering
Run only specific tasks from your configuration using the `-t` flag:
```bash
# Run a single task (local_basic.yaml has ifeval, gpqa_diamond, mbpp)
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml -t ifeval
# Run multiple specific tasks
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml -t ifeval -t mbpp
# Combine with other options
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml -t ifeval -t mbpp --dry-run
```
**Notes:**
- Tasks must be defined in your configuration file under `evaluation.tasks`
- If any requested task is not found in the configuration, the command will fail with an error listing available tasks
- Task filtering preserves all task-specific overrides and `nemo_evaluator_config` settings
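The fail-fast validation described in the notes can be sketched with a small stdlib-only helper (hypothetical; the launcher's own implementation may differ):

```python
def validate_task_filter(requested, configured):
    """Return the configured tasks matching the filter, or fail with the available list."""
    missing = [t for t in requested if t not in configured]
    if missing:
        raise ValueError(
            f"Unknown task(s) {missing}; available tasks: {sorted(configured)}"
        )
    # Preserve the order in which tasks appear in the configuration
    return [t for t in configured if t in requested]

# Mirrors local_basic.yaml's task list
print(validate_task_filter(["ifeval", "mbpp"], ["ifeval", "gpqa_diamond", "mbpp"]))
# ['ifeval', 'mbpp']
```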
### Examples by Executor
**Local Execution:**
```bash
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o execution.output_dir=./local_results
```
**Slurm Execution:**
```bash
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml \
-o execution.output_dir=/shared/results
```
**Lepton AI Execution:**
```bash
# With model deployment
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_nim.yaml
# Using existing endpoint
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_basic.yaml
```
## status - Check Job Status
Check the status of running or completed evaluations.
### Status Basic Usage
```bash
# Check status of specific invocation (returns all jobs in that invocation)
nemo-evaluator-launcher status abc12345
# Check status of specific job
nemo-evaluator-launcher status abc12345.0
# Output as JSON
nemo-evaluator-launcher status abc12345 --json
```
### Output Formats
**Table Format (default):**
```text
Job ID | Status | Executor Info | Location
abc12345.0 | running | container123 | /task1/...
abc12345.1 | success | container124 | /task2/...
```
**JSON Format (with --json flag):**
```json
[
  {
    "invocation": "abc12345",
    "job_id": "abc12345.0",
    "status": "running",
    "data": {
      "container": "eval-container",
      "output_dir": "/path/to/results"
    }
  },
  {
    "invocation": "abc12345",
    "job_id": "abc12345.1",
    "status": "success",
    "data": {
      "container": "eval-container",
      "output_dir": "/path/to/results"
    }
  }
]
```
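Because `--json` emits a plain list of dictionaries, the output pipes cleanly into scripts. A stdlib-only sketch that pulls out the jobs that have not yet succeeded (field names taken from the example above):

```python
import json

# JSON as produced by `nemo-evaluator-launcher status <id> --json`
status_json = """
[
  {"invocation": "abc12345", "job_id": "abc12345.0", "status": "running",
   "data": {"container": "eval-container", "output_dir": "/path/to/results"}},
  {"invocation": "abc12345", "job_id": "abc12345.1", "status": "success",
   "data": {"container": "eval-container", "output_dir": "/path/to/results"}}
]
"""

jobs = json.loads(status_json)
# Collect every job that has not finished successfully
pending = [j["job_id"] for j in jobs if j["status"] != "success"]
print(pending)  # ['abc12345.0']
```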
## info - Job information and navigation
Display detailed job information, including metadata, configuration, and paths to logs/artifacts with descriptions of key result files. Supports copying results locally from both local and remote jobs.
### Basic usage
```bash
# Show job info for one or more IDs (job or invocation)
nemo-evaluator-launcher info <invocation_id>
nemo-evaluator-launcher info <invocation_id>.<job_index>
```
### Show configuration
```bash
nemo-evaluator-launcher info <invocation_id> --config
```
### Show paths
```bash
# Show artifact locations
nemo-evaluator-launcher info <invocation_id> --artifacts
# Show log locations
nemo-evaluator-launcher info <invocation_id> --logs
```
### Copy files locally
```bash
# Copy logs
nemo-evaluator-launcher info <invocation_id> --copy-logs [DIR]
# Copy artifacts
nemo-evaluator-launcher info <invocation_id> --copy-artifacts [DIR]
```
### Example (Slurm)
```text
nemo-evaluator-launcher info <invocation_id>
Job <invocation_id>.0
├── Executor: slurm
├── Created: <timestamp>
├── Task: <task_name>
├── Artifacts: user@host:/shared/.../<invocation_id>/task_name/artifacts (remote)
│   └── Key files:
│       ├── results.yml - Benchmark scores, task results, and resolved run configuration
│       ├── eval_factory_metrics.json - Response and runtime stats (latency, token counts, memory)
│       ├── metrics.json - Harness/benchmark metrics and configuration
│       ├── report.html - Request-response pair samples in HTML format (if enabled)
│       └── report.json - Report data in JSON format (if enabled)
├── Logs: user@host:/shared/.../<invocation_id>/task_name/logs (remote)
│   └── Key files:
│       ├── client-{SLURM_JOB_ID}.out - Evaluation container/process output
│       ├── slurm-{SLURM_JOB_ID}.out - Slurm scheduler stdout/stderr (batch submission, export steps)
│       └── server-{SLURM_JOB_ID}.out - Model server logs when a deployment is used
└── Slurm Job ID: <slurm_job_id>
```
## kill - Kill Jobs
Stop running evaluations.
### Kill Basic Usage
```bash
# Kill entire invocation
nemo-evaluator-launcher kill abc12345
# Kill specific job
nemo-evaluator-launcher kill abc12345.0
```
The command outputs JSON with the results of the kill operation.
## ls - List Resources
List available tasks or runs.
### List Tasks
```bash
# List all available evaluation tasks
nemo-evaluator-launcher ls tasks
# List tasks with JSON output
nemo-evaluator-launcher ls tasks --json
```
**Output Format:**
Tasks are displayed grouped by harness and container, showing the task name and required endpoint type:
```text
===================================================
harness: lm_eval
container: nvcr.io/nvidia/nemo:24.01
task endpoint_type
---------------------------------------------------
arc_challenge chat
hellaswag completions
winogrande completions
---------------------------------------------------
3 tasks available
===================================================
```
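The grouped layout above can be reproduced from the flat `(task, endpoint_type, harness, container)` rows that the Python API returns (a sketch, not the launcher's own formatter):

```python
# Flat rows in the shape returned by get_tasks_list()
tasks = [
    ("arc_challenge", "chat", "lm_eval", "nvcr.io/nvidia/nemo:24.01"),
    ("hellaswag", "completions", "lm_eval", "nvcr.io/nvidia/nemo:24.01"),
    ("winogrande", "completions", "lm_eval", "nvcr.io/nvidia/nemo:24.01"),
]

# Bucket rows by (harness, container), preserving task order
groups = {}
for name, endpoint_type, harness, container in tasks:
    groups.setdefault((harness, container), []).append((name, endpoint_type))

for (harness, container), members in groups.items():
    print(f"harness: {harness}")
    print(f"container: {container}")
    for name, endpoint_type in members:
        print(f"  {name:<16} {endpoint_type}")
    print(f"{len(members)} tasks available")
```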
### List Runs
```bash
# List recent evaluation runs
nemo-evaluator-launcher ls runs
# Limit number of results
nemo-evaluator-launcher ls runs --limit 10
# Filter by executor
nemo-evaluator-launcher ls runs --executor local
# Filter by date
nemo-evaluator-launcher ls runs --since "2024-01-01"
nemo-evaluator-launcher ls runs --since "2024-01-01T12:00:00"
# Filter by retrospective period
# - days
nemo-evaluator-launcher ls runs --since 2d
# - hours
nemo-evaluator-launcher ls runs --since 6h
```
**Output Format:**
```text
invocation_id earliest_job_ts num_jobs executor benchmarks
abc12345 2024-01-01T10:00:00 3 local ifeval,gpqa_diamond,mbpp
def67890 2024-01-02T14:30:00 2 slurm hellaswag,winogrande
```
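The relative `--since` values (`2d`, `6h`) can be read as offsets from the current time; a stdlib sketch of that parsing (an illustration of the accepted formats, not the launcher's code):

```python
from datetime import datetime, timedelta

def parse_since(value: str) -> datetime:
    """Accept ISO timestamps ('2024-01-01') or relative periods ('2d', '6h')."""
    units = {"d": "days", "h": "hours"}
    if value and value[-1] in units and value[:-1].isdigit():
        # Relative period: subtract the offset from now
        return datetime.now() - timedelta(**{units[value[-1]]: int(value[:-1])})
    # Absolute timestamp: let the stdlib parse the ISO form
    return datetime.fromisoformat(value)

print(parse_since("2024-01-01T12:00:00"))  # 2024-01-01 12:00:00
print(parse_since("2d") < datetime.now())  # True
```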
## export - Export Results
Export evaluation results to various destinations.
### Export Basic Usage
```bash
# Export to local files (JSON format)
nemo-evaluator-launcher export abc12345 --dest local --format json
# Export to specific directory
nemo-evaluator-launcher export abc12345 --dest local --format json --output-dir ./results
# Specify custom filename
nemo-evaluator-launcher export abc12345 --dest local --format json --output-filename results.json
```
### Export Options
```bash
# Available destinations
nemo-evaluator-launcher export abc12345 --dest local # Local file system
nemo-evaluator-launcher export abc12345 --dest mlflow # MLflow tracking
nemo-evaluator-launcher export abc12345 --dest wandb # Weights & Biases
nemo-evaluator-launcher export abc12345 --dest gsheets # Google Sheets
# Format options (for local destination only)
nemo-evaluator-launcher export abc12345 --dest local --format json
nemo-evaluator-launcher export abc12345 --dest local --format csv
# Include logs when exporting
nemo-evaluator-launcher export abc12345 --dest local --format json --copy-logs
# Filter metrics by name
nemo-evaluator-launcher export abc12345 --dest local --format json --log-metrics score --log-metrics accuracy
# Copy all artifacts (not just required ones)
nemo-evaluator-launcher export abc12345 --dest local --only-required False
```
### Exporting Multiple Invocations
```bash
# Export several runs together
nemo-evaluator-launcher export abc12345 def67890 ghi11111 --dest local --format json
# Export several runs with custom output
nemo-evaluator-launcher export abc12345 def67890 --dest local --format csv \
--output-dir ./all-results --output-filename combined.csv
```
### Cloud Exporters
For cloud destinations like MLflow, W&B, and Google Sheets, configure credentials through environment variables or their respective configuration files before using the export command. Refer to each exporter's documentation for setup instructions.
## version - Version Information
Display version and build information.
```bash
# Show version
nemo-evaluator-launcher version
# Alternative
nemo-evaluator-launcher --version
```
## Environment Variables
The CLI respects environment variables for logging and task-specific authentication:
```{list-table}
:header-rows: 1
:widths: 30 50 20
* - Variable
- Description
- Default
* - `LOG_LEVEL`
- Logging level for the launcher (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `WARNING`
* - `LOG_DISABLE_REDACTION`
- Disable credential redaction in logs (set to 1, true, or yes)
- Not set
```
### Task-Specific Environment Variables
Some evaluation tasks require API keys or tokens. These are configured in your evaluation YAML file under `env_vars` and must be set before running:
```bash
# Set task-specific environment variables
export HF_TOKEN="hf_..." # For Hugging Face datasets
export NGC_API_KEY="nvapi-..." # For NVIDIA API endpoints
# Run evaluation
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
```
The specific environment variables required depend on the tasks and endpoints you're using. Refer to the example configuration files for details on which variables are needed.
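A small preflight check can catch missing credentials before a run is submitted. A stdlib-only sketch (hypothetical helper; the variable names are the ones used in the examples above):

```python
import os

def check_required_env(names):
    """Return the subset of variable names that are missing or empty."""
    return [n for n in names if not os.environ.get(n)]

missing = check_required_env(["HF_TOKEN", "NGC_API_KEY"])
if missing:
    print(f"Set these variables before running: {missing}")
```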
## Configuration File Examples
The NeMo Evaluator Launcher includes several example configuration files that demonstrate different use cases. These files are located in the `examples/` directory of the package.
To use these examples:
```bash
# Copy an example to your local directory
cp examples/local_basic.yaml my_config.yaml
# Edit the configuration as needed
# Then run with your config
nemo-evaluator-launcher run --config ./my_config.yaml
```
Refer to the [configuration documentation](configuration/index.md) for detailed information on all available configuration options.
## Troubleshooting
### Configuration Issues
**Configuration Errors:**
```bash
# Validate configuration without running
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/my_config.yaml --dry-run
```
**Permission Errors:**
```bash
# Check file permissions
ls -la examples/my_config.yaml
# Use absolute paths
nemo-evaluator-launcher run --config /absolute/path/to/configs/my_config.yaml
```
**Network Issues:**
```bash
# Test endpoint connectivity
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "test", "messages": [{"role": "user", "content": "Hello"}]}'
```
### Debug Mode
```bash
# Set log level to DEBUG for detailed output
export LOG_LEVEL=DEBUG
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
# Or use single-letter shorthand
export LOG_LEVEL=D
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml
# Logs are written to ~/.nemo-evaluator/logs/
```
### Getting Help
```bash
# Command-specific help
nemo-evaluator-launcher run --help
nemo-evaluator-launcher info --help
nemo-evaluator-launcher ls --help
nemo-evaluator-launcher export --help
# General help
nemo-evaluator-launcher --help
```
## See Also
- [Python API](api.md) - Programmatic interface
- {ref}`gs-quickstart-launcher` - Getting started guide
- {ref}`executors-overview` - Execution backends
- {ref}`exporters-overview` - Export destinations
``nemo_evaluator.adapters.adapter_config``
==========================================
.. currentmodule:: nemo_evaluator.adapters.adapter_config

.. automodule:: nemo_evaluator.adapters.adapter_config
   :members:
   :undoc-members:
``nemo_evaluator.adapters``
===========================
Interceptors and post-eval hooks are an important part of the NeMo Evaluator SDK. They extend the functionality of each harness, providing a standardized way of enabling features in your evaluation runs.
Behind each interceptor and post-eval hook stands a specific class that implements its logic.
However, these classes are documented here only to show their configuration options, reflected in the ``Params`` of each class, and to indicate which classes to inherit from.
From a usage perspective, always use the configuration classes (see :ref:`configuration`) to add interceptors to evaluations; no interceptor should be instantiated directly.
Interceptors are defined in a chain. They go under ``target.api_endpoint.adapter_config`` and can be defined as follows::
    adapter_config = AdapterConfig(
        interceptors=[
            InterceptorConfig(
                name="system_message",
                enabled=True,
                config={"system_message": "You are a helpful assistant."}
            ),
            InterceptorConfig(name="request_logging", enabled=True),
            InterceptorConfig(
                name="caching",
                enabled=True,
                config={"cache_dir": "./cache", "reuse_cached_responses": True}
            ),
            InterceptorConfig(name="reasoning", enabled=True),
            InterceptorConfig(name="endpoint")
        ]
    )
.. _configuration:
Configuration
--------------
.. currentmodule:: nemo_evaluator.adapters.adapter_config
.. autosummary::
   :nosignatures:
   :recursive:

   DiscoveryConfig
   InterceptorConfig
   PostEvalHookConfig
   AdapterConfig

.. .. automodule:: nemo_evaluator.adapters.adapter_config
..    :members:
..    :undoc-members:
Interceptors
-------------
.. currentmodule:: nemo_evaluator.adapters.interceptors
.. autosummary::
   :nosignatures:
   :recursive:

   CachingInterceptor
   EndpointInterceptor
   PayloadParamsModifierInterceptor
   ProgressTrackingInterceptor
   RaiseClientErrorInterceptor
   RequestLoggingInterceptor
   ResponseLoggingInterceptor
   ResponseReasoningInterceptor
   ResponseStatsInterceptor
   SystemMessageInterceptor
PostEvalHooks
-------------
.. currentmodule:: nemo_evaluator.adapters.reports
.. autosummary::
   :nosignatures:
   :recursive:

   PostEvalReportHook
Interfaces
--------------
.. currentmodule:: nemo_evaluator.adapters.types
.. autosummary::
   :nosignatures:
   :recursive:

   RequestInterceptor
   ResponseInterceptor
   RequestToResponseInterceptor
   PostEvalHook

.. .. automodule:: nemo_evaluator.adapters
..    :members:
..    :undoc-members:
.. toctree::
   :hidden:

   adapter-config
   interceptors
   types
.. _interceptor_reference:
``nemo_evaluator.adapters.interceptors``
========================================
.. currentmodule:: nemo_evaluator.adapters.interceptors
.. automodule:: nemo_evaluator.adapters.interceptors
   :members:
   :undoc-members:
``nemo_evaluator.adapters.types``
=================================
Interceptor Interfaces
----------------------
.. currentmodule:: nemo_evaluator.adapters.types
.. automodule:: nemo_evaluator.adapters.types
   :members:
   :undoc-members:
.. _modelling-inout:
``nemo_evaluator.api.api_dataclasses``
======================================
NeMo Evaluator Core operates on strictly defined input and output, which are modelled through Pydantic dataclasses. Whether you use the Python API or the CLI, the reference below serves as a map of configuration options and the output format.
.. currentmodule:: nemo_evaluator.api.api_dataclasses
Modeling Target
---------------
.. autosummary::
   :nosignatures:
   :recursive:

   ApiEndpoint
   EndpointType
   EvaluationTarget
Modeling Evaluation
-------------------
.. autosummary::
   :nosignatures:
   :recursive:

   EvaluationConfig
   ConfigParams
Modeling Result
---------------
.. autosummary::
   :nosignatures:
   :recursive:

   EvaluationResult
   GroupResult
   MetricResult
   Score
   ScoreStats
   TaskResult
.. automodule:: nemo_evaluator.api.api_dataclasses
   :members:
   :undoc-members:
``nemo_evaluator.api``
======================================
The central point of evaluation is the ``evaluate()`` function, which takes standardized input and returns standardized output. See :ref:`modelling-inout` to learn how to instantiate the standardized input and consume the standardized output. Below is an example of how one might configure and run an evaluation via the Python API::
    from nemo_evaluator.core.evaluate import evaluate
    from nemo_evaluator.api.api_dataclasses import (
        EvaluationConfig,
        EvaluationTarget,
        ConfigParams,
        ApiEndpoint
    )

    # Create evaluation configuration
    eval_config = EvaluationConfig(
        type="simple_evals.mmlu_pro",
        output_dir="./results",
        params=ConfigParams(
            limit_samples=100,
            temperature=0.1
        )
    )

    # Create target configuration
    target_config = EvaluationTarget(
        api_endpoint=ApiEndpoint(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            model_id="meta/llama-3.2-3b-instruct",
            type="chat",
            api_key="MY_API_KEY"  # Name of the environment variable that stores the API key
        )
    )

    # Run evaluation
    result = evaluate(eval_config, target_config)
.. automodule:: nemo_evaluator.api
   :members:
   :undoc-members:
   :member-order: bysource
.. toctree::
   :hidden:

   api-dataclasses
   nemo-evaluator.adapters <../adapters/adapters>
   nemo-evaluator.sandbox <../sandbox/index>
``nemo_evaluator.sandbox``
======================================
Sandbox implementations used by evaluation harnesses that need a tmux-like interactive session.
This module is designed to keep dependencies **optional**:
- The ECS Fargate implementation only imports AWS SDKs (``boto3``/``botocore``) when actually used.
- Using the ECS sandbox also requires the AWS CLI (``aws``) and ``session-manager-plugin`` on the host.
Usage (ECS Fargate)
-------------------
Typical usage is:
- configure :class:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateConfig`
- :meth:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateSandbox.spin_up` a sandbox context
- create an interactive :class:`~nemo_evaluator.sandbox.base.NemoSandboxSession`
Example::
    from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox

    cfg = EcsFargateConfig(
        region="us-west-2",
        cluster="my-ecs-cluster",
        task_definition="my-task-def:1",
        container_name="eval",
        subnets=["subnet-abc"],
        security_groups=["sg-xyz"],
        s3_bucket="my-staging-bucket",
    )

    with EcsFargateSandbox.spin_up(
        cfg=cfg,
        task_id="task-001",
        trial_name="trial-0001",
        run_id="run-2026-01-12",
    ) as sandbox:
        session = sandbox.create_session("main")
        session.send_keys(["echo hello", "Enter"], block=True)
        print(session.capture_pane())
Prerequisites / Notes
---------------------
- The harness host must have **AWS CLI** and **session-manager-plugin** installed.
- If you use S3-based fallbacks (large uploads / long commands), configure ``s3_bucket``.
.. automodule:: nemo_evaluator.sandbox
   :members:
   :undoc-members:
   :member-order: bysource
(nemo-evaluator-cli)=
# NeMo Evaluator CLI Reference (nemo-evaluator)
This document provides a comprehensive reference for the `nemo-evaluator` command-line interface, which is the primary way to interact with NeMo Evaluator from the terminal.
## Prerequisites
- **Container way**: Use evaluation containers mentioned in {ref}`nemo-evaluator-containers`
- **Package way**:
```bash
pip install nemo-evaluator
```
To run evaluations, you also need to install an evaluation framework package (for example, `nvidia-simple-evals`):
```bash
pip install nvidia-simple-evals
```
## Overview
The CLI provides a unified interface for managing evaluations and frameworks. It's built on top of the Python API and provides full feature parity with it.
## Command Structure
```bash
nemo-evaluator [command] [options]
```
## Available Commands
### `ls` - List Available Evaluations
List all available evaluation types and frameworks.
```bash
nemo-evaluator ls
```
**Output Example:**
```
nvidia-simple-evals:
* mmlu_pro
...
human_eval:
* human_eval
```
### `run_eval` - Run Evaluation
Execute an evaluation with the specified configuration.
```bash
nemo-evaluator run_eval [options]
```
To see the list of options, run:
```bash
nemo-evaluator run_eval --help
```
**Required Options:**
- `--eval_type`: Type of evaluation to run
- `--model_id`: Model identifier
- `--model_url`: API endpoint URL
- `--model_type`: Endpoint type (chat, completions, vlm, embedding)
- `--output_dir`: Output directory for results
**Optional Options:**
- `--api_key_name`: Environment variable name for API key
- `--run_config`: Path to YAML configuration file
- `--overrides`: Comma-separated parameter overrides
- `--dry_run`: Show configuration without running
- `--debug`: Enable debug mode (deprecated, use NEMO_EVALUATOR_LOG_LEVEL)
**Example Usage:**
```bash
# Basic evaluation
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_id "meta/llama-3.2-3b-instruct" \
--model_url "https://integrate.api.nvidia.com/v1/chat/completions" \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir ./results
# With parameter overrides
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_id "meta/llama-3.2-3b-instruct" \
--model_url "https://integrate.api.nvidia.com/v1/chat/completions" \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir ./results \
--overrides "config.params.limit_samples=100,config.params.temperature=0.1"
# Dry run to see configuration
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_id "meta/llama-3.2-3b-instruct" \
--model_url "https://integrate.api.nvidia.com/v1/chat/completions" \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir ./results \
--dry_run
```
For execution with run configuration:
```bash
# Using YAML configuration file
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--output_dir ./results \
--run_config ./config/eval_config.yml
```
To check the structure of the run configuration, refer to the [Run Configuration](#run-configuration) section below.
(run-configuration)=
## Run Configuration
Run configurations are stored in YAML files with the following structure:
```yaml
config:
  type: mmlu_pro
  params:
    limit_samples: 10
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.2-3b-instruct
    type: chat
    api_key: MY_API_KEY
    adapter_config:
      interceptors:
        - name: "request_logging"
        - name: "caching"
          enabled: true
          config:
            cache_dir: "./cache"
        - name: "endpoint"
        - name: "response_logging"
          enabled: true
          config:
            output_dir: "./cache/responses"
```
Run configurations can be specified in YAML files and executed with the following syntax:
```bash
nemo-evaluator run_eval \
--run_config config.yml \
--output_dir `mktemp -d`
```
(parameter-overrides)=
## Parameter Overrides
Parameter overrides use a dot-notation format to specify configuration paths:
```bash
# Basic parameter overrides
--overrides "config.params.limit_samples=100,config.params.temperature=0.1"
# Adapter configuration overrides
--overrides "target.api_endpoint.adapter_config.interceptors.0.config.output_dir=./logs"
# Multiple complex overrides
--overrides "config.params.limit_samples=100,config.params.max_tokens=512,target.api_endpoint.adapter_config.use_caching=true"
```
### Override Format
```
section.subsection.parameter=value
```
**Examples:**
- `config.params.limit_samples=100`
- `target.api_endpoint.adapter_config.use_caching=true`
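The dot notation can be read as a path into the nested run configuration. A minimal sketch of how one `key=value` pair maps onto a nested dictionary (an illustration of the format, not the evaluator's actual parser):

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    path, _, raw = override.partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Best-effort conversion for common value types
    value = {"true": True, "false": False}.get(raw.lower(), raw)
    if isinstance(value, str) and value.isdigit():
        value = int(value)
    node[keys[-1]] = value
    return config

cfg = {}
apply_override(cfg, "config.params.limit_samples=100")
apply_override(cfg, "target.api_endpoint.adapter_config.use_caching=true")
print(cfg)
# {'config': {'params': {'limit_samples': 100}}, 'target': {'api_endpoint': {'adapter_config': {'use_caching': True}}}}
```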
## Handle Errors
### Debug Mode
Enable debug mode for detailed error information:
```bash
# Set environment variable (recommended)
export NEMO_EVALUATOR_LOG_LEVEL=DEBUG
# Or use deprecated debug flag
nemo-evaluator run_eval --debug [options]
```
## Examples
### Complete Evaluation Workflow
```bash
# 1. List available evaluations
nemo-evaluator ls
# 2. Run evaluation
nemo-evaluator run_eval \
--eval_type mmlu_pro \
--model_id "meta/llama-3.2-3b-instruct" \
--model_url "https://integrate.api.nvidia.com/v1/chat/completions" \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir ./results \
--overrides "config.params.limit_samples=100"
# 3. Show results
ls -la ./results/
```
### Batch Evaluation Script
```bash
#!/bin/bash
# Batch evaluation script
models=("meta/llama-3.1-8b-instruct" "meta/llama-3.1-70b-instruct")
eval_types=("mmlu_pro" "gsm8k")
for model in "${models[@]}"; do
    for eval_type in "${eval_types[@]}"; do
        echo "Running $eval_type on $model..."
        nemo-evaluator run_eval \
            --eval_type "$eval_type" \
            --model_id "$model" \
            --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \
            --model_type chat \
            --api_key_name MY_API_KEY \
            --output_dir "./results/${model//\//_}_${eval_type}" \
            --overrides "config.params.limit_samples=50"
        echo "Completed $eval_type on $model"
    done
done
echo "All evaluations completed!"
```
### Framework Development
```bash
# Setup new framework
nemo-evaluator-example my_custom_eval .
# This creates the basic structure:
# core_evals/my_custom_eval/
# ├── framework.yml
# ├── output.py
# └── __init__.py
# Edit framework.yml to configure your evaluation
# Edit output.py to implement result parsing
# Test your framework
nemo-evaluator run_eval \
--eval_type my_custom_eval \
--model_id "test-model" \
--model_url "https://api.example.com/v1/chat/completions" \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir ./results
```
## Framework Setup Command
### `nemo-evaluator-example` - Setup Framework
Set up NVIDIA framework files in a destination folder.
```bash
nemo-evaluator-example [package_name] [destination]
```
**Arguments:**
- `package_name`: Python package-like name for the framework
- `destination`: Destination folder where to create framework files
**Example Usage:**
```bash
# Setup framework in specific directory
nemo-evaluator-example my_package /path/to/destination
# Setup framework in current directory
nemo-evaluator-example my_package .
```
**What it creates:**
- `core_evals/my_package/framework.yml` - Framework configuration
- `core_evals/my_package/output.py` - Output parsing logic
- `core_evals/my_package/__init__.py` - Package initialization
## Environment Variables
### Logging Configuration
```bash
# Set log level (recommended over --debug flag)
export NEMO_EVALUATOR_LOG_LEVEL=DEBUG
```
---
orphan: true
---
(evaluation-utils-reference)=
# Evaluation Utilities Reference
Complete reference for evaluation discovery and utility functions in NeMo Evaluator.
## nemo_evaluator.show_available_tasks()
Discovers and displays all available evaluation tasks across installed evaluation frameworks.
### Function Signature
```python
def show_available_tasks() -> None
```
### Returns
| Type | Description |
|------|-------------|
| `None` | Prints available tasks to stdout |
### Description
This function scans all installed `core_evals` packages and prints a hierarchical list of available evaluation tasks organized by framework. Use this function to discover which benchmarks and tasks are available in your environment.
The function automatically detects:
- **Installed frameworks**: lm-evaluation-harness, simple-evals, bigcode, BFCL
- **Available tasks**: All tasks defined in each framework's configuration
- **Installation status**: Displays message if no evaluation packages are installed
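Because `show_available_tasks()` prints rather than returns, its output can still be captured for lightweight scripting with the standard library. A sketch (the stub below stands in for the real function, whose exact output format may differ):

```python
import io
from contextlib import redirect_stdout

def show_available_tasks() -> None:
    # Hypothetical stub standing in for nemo_evaluator.show_available_tasks;
    # it prints the same kind of framework/task hierarchy.
    print("lm-evaluation-harness:")
    print("  * mmlu")
    print("  * gsm8k")

# Capture the printed hierarchy into a string.
buf = io.StringIO()
with redirect_stdout(buf):
    show_available_tasks()

# Pull out the task lines (those starting with "*" after stripping indent).
tasks = [line.strip().lstrip("* ") for line in buf.getvalue().splitlines()
         if line.strip().startswith("*")]
```

For structured access, prefer the launcher's `get_tasks_list()` instead of parsing printed output.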
### Usage Examples
#### Basic Task Discovery
```python
from nemo_evaluator import show_available_tasks
# Display all available evaluations
show_available_tasks()
# Example output:
# lm-evaluation-harness:
# * mmlu
# * gsm8k
# * arc_challenge
# * hellaswag
# simple-evals:
# * AIME_2025
# * humaneval
# * drop
# bigcode:
# * mbpp
# * humaneval
# * apps
```
#### Programmatic Task Discovery
For programmatic access to task information, use the launcher API:
```python
from nemo_evaluator_launcher.api.functional import get_tasks_list
# Get structured task information
tasks = get_tasks_list()
for task in tasks:
    task_name, endpoint_type, harness, container = task
    print(f"Task: {task_name}, Type: {endpoint_type}, Framework: {harness}")
```
To filter tasks using the CLI:
```bash
# List all tasks
nemo-evaluator-launcher ls tasks
# Filter for specific tasks
nemo-evaluator-launcher ls tasks | grep mmlu
```
#### Check Installation Status
```python
from nemo_evaluator import show_available_tasks
# Check if evaluation packages are installed
print("Available evaluation frameworks:")
show_available_tasks()
# If no packages installed, you'll see:
# NO evaluation packages are installed.
```
### Installation Requirements
To use this function, install evaluation framework packages:
```bash
# Install all frameworks
pip install nvidia-lm-eval nvidia-simple-evals nvidia-bigcode-eval nvidia-bfcl
# Or install selectively
pip install nvidia-lm-eval # LM Evaluation Harness
pip install nvidia-simple-evals # Simple Evals
pip install nvidia-bigcode-eval # BigCode benchmarks
pip install nvidia-bfcl # Berkeley Function Calling Leaderboard
```
### Error Handling
Handle missing evaluation packages defensively:
```python
from nemo_evaluator import show_available_tasks
# Safely check for available tasks
try:
    show_available_tasks()
except ImportError as e:
    print(f"Error: {e}")
    print("Install evaluation frameworks: pip install nvidia-lm-eval")
```
---
## Integration with Evaluation Workflows
### Pre-Flight Task Verification
Verify task availability before running evaluations:
```python
from nemo_evaluator_launcher.api.functional import get_tasks_list
def verify_task_available(task_name: str) -> bool:
    """Check if a specific task is available."""
    tasks = get_tasks_list()
    return any(task[0] == task_name for task in tasks)
# Usage
if verify_task_available("mmlu"):
    print("✓ MMLU is available")
else:
    print("✗ MMLU not found. Install evaluation framework packages")
```
### Filter Tasks by Endpoint Type
Use task discovery to filter by endpoint type:
```python
from nemo_evaluator_launcher.api.functional import get_tasks_list
# Get all chat endpoint tasks
tasks = get_tasks_list()
chat_tasks = [task[0] for task in tasks if task[1] == "chat"]
completions_tasks = [task[0] for task in tasks if task[1] == "completions"]
print(f"Chat tasks: {chat_tasks[:5]}") # Show first five
print(f"Completions tasks: {completions_tasks[:5]}")
```
### Framework Selection
When a task is provided by more than one framework, use explicit framework specification in your configuration:
```python
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, ConfigParams
# Explicit framework specification
config = EvaluationConfig(
    type="lm-evaluation-harness.mmlu",  # Instead of just "mmlu"
    params=ConfigParams(task="mmlu")
)
```
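To see which task names are offered by more than one framework before choosing, you can group the tuples from `get_tasks_list()` by task name. A sketch over sample data (replace `sample_tasks` with the real `get_tasks_list()` call; the rows here are illustrative):

```python
from collections import defaultdict

# Illustrative rows in the (task_name, endpoint_type, framework, container)
# shape returned by get_tasks_list().
sample_tasks = [
    ("mmlu", "completions", "lm-evaluation-harness", "lm-eval"),
    ("mmlu", "chat", "simple-evals", "simple-evals"),
    ("gsm8k", "completions", "lm-evaluation-harness", "lm-eval"),
]

by_name = defaultdict(set)
for task_name, _endpoint, framework, _container in sample_tasks:
    by_name[task_name].add(framework)

# Tasks provided by more than one framework need the explicit
# "<framework>.<task>" form.
conflicts = {name: sorted(fws) for name, fws in by_name.items() if len(fws) > 1}
```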
---
## Troubleshooting
### Problem: "NO evaluation packages are installed"
**Solution**:
```bash
# Install evaluation frameworks
pip install nvidia-lm-eval nvidia-simple-evals nvidia-bigcode-eval nvidia-bfcl
# Verify installation
python -c "from nemo_evaluator import show_available_tasks; show_available_tasks()"
```
### Problem: Task not appearing in list
**Solution**:
```bash
# Install the required framework package
pip install nvidia-lm-eval
# Verify installation
python -c "from nemo_evaluator import show_available_tasks; show_available_tasks()"
```
### Problem: Task conflicts between frameworks
When a task name is provided by more than one framework (for example, both `lm-evaluation-harness` and `simple-evals` provide `mmlu`), use explicit framework specification:
**Solution**:
```bash
# Use explicit framework.task format in your configuration overrides
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o 'evaluation.tasks=["lm-evaluation-harness.mmlu"]'
```
---
## Related Functions
### NeMo Evaluator Launcher API
For programmatic access with structured results:
```python
from nemo_evaluator_launcher.api.functional import get_tasks_list
# Returns list of tuples: (task_name, endpoint_type, framework, container)
tasks = get_tasks_list()
```
### CLI Commands
```bash
# List all tasks
nemo-evaluator-launcher ls tasks
# List recent evaluation runs
nemo-evaluator-launcher ls runs
# Get detailed help
nemo-evaluator-launcher --help
```
---
**Source**: `packages/nemo-evaluator/src/nemo_evaluator/core/entrypoint.py:105-123`
**API Export**: `nemo_evaluator/__init__.py` exports `show_available_tasks` for public use
**Related**: See {ref}`gs-quickstart` for evaluation setup and {ref}`eval-benchmarks` for task descriptions
# Frequently Asked Questions
## **What benchmarks and harnesses are supported?**
The docs list hundreds of benchmarks across multiple harnesses, available via curated NGC evaluation containers and the unified Launcher.
Reference: {ref}`eval-benchmarks`
:::{tip}
Discover available tasks with
```bash
nemo-evaluator-launcher ls tasks
```
:::
---
## **How do I set log dir and verbose logging?**
Set these environment variables for logging configuration:
```bash
# Set log level (INFO, DEBUG, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=DEBUG
# or (legacy, still supported)
export NEMO_EVALUATOR_LOG_LEVEL=DEBUG
```
Reference: {ref}`nemo-evaluator-logging`.
---
## **Can I run distributed or on a scheduler?**
Yes. The launcher supports multiple executors. For large-scale runs, the Slurm executor is recommended: it schedules and executes jobs across cluster nodes, enabling parallel evaluation while preserving reproducibility via containerized benchmarks.
See {ref}`executor-slurm` for details.
---
## **Can I point Evaluator at my own endpoint?**
Yes. Use the `none` deployment option: no model is deployed as part of the evaluation job; instead, the launcher runs evaluation tasks against your existing OpenAI-compatible endpoint.
```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct  # Model identifier (required)
    url: https://your-endpoint.com/v1/chat/completions  # Endpoint URL (required)
    api_key_name: API_KEY  # Environment variable name (recommended)
```
Reference: {ref}`deployment-none`.
---
## **Can I test my endpoint for OpenAI compatibility?**
Yes. Preview the full resolved configuration without executing by using `--dry-run`:
```bash
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run
```
Reference: {ref}`launcher-cli-dry-run`.
---
## **Can I store and retrieve per-sample results, not just the summary?**
Yes. Capture full request/response artifacts and retrieve them from the run's artifacts folder.
Enable detailed logging with `nemo_evaluator_config`:
```yaml
evaluation:
  # Request + response logging (example at 1k each)
  nemo_evaluator_config:
    target:
      api_endpoint:
        adapter_config:
          use_request_logging: True
          max_saved_requests: 1000
          use_response_logging: True
          max_saved_responses: 1000
```
These enable the **RequestLoggingInterceptor** and **ResponseLoggingInterceptor** so each prompt/response pair is saved alongside the evaluation job.
Retrieve artifacts after the run:
```bash
nemo-evaluator-launcher export --dest local --output-dir ./artifacts --copy-logs
```
Look under `./artifacts/` for `results.yml`, reports, logs, and saved request/response files.
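A small helper can locate the standard artifact files after export; a sketch (the file names are taken from the list above; your runs may contain additional files):

```python
import os

def find_artifacts(root: str,
                   names=("results.yml", "report.html", "report.json")) -> list:
    """Walk an exported artifacts directory and collect known result files."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if fname in names:
                found.append(os.path.join(dirpath, fname))
    return sorted(found)
```

Call it as `find_artifacts("./artifacts")` after the export completes.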
Reference: {ref}`interceptor-request-logging`.
---
## **Where do I find evaluation results?**
After a run completes, copy artifacts locally:
```bash
nemo-evaluator-launcher info --copy-artifacts ./artifacts
```
Inside `./artifacts/` you'll see the run config, `results.yml` (main output file), HTML/JSON reports, logs, and cached request/response files, if caching was used.
The output is structured as follows:
```bash
/
│ ├── eval_factory_metrics.json
│ ├── report.html
│ ├── report.json
│ ├── results.yml
│ ├── run_config.yml
│ └── /
```
Reference: {ref}`evaluation-output`.
---
## **Can I export a consolidated JSON of scores?**
Yes. JSON output is produced by the standard local exporter; automatic exporters for MLflow, Weights & Biases, and Google Sheets are also available.
```bash
nemo-evaluator-launcher export --dest local --format json
```
This creates `processed_results.json` (you can also pass multiple invocation IDs to merge).
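If you export several invocations separately, the resulting JSON files can also be combined with a few lines of Python. A sketch (it assumes each file holds a single JSON object and merges top-level keys, with later files winning on collisions; verify against your actual `processed_results.json` schema):

```python
import json
from pathlib import Path

def merge_result_files(paths: list) -> dict:
    """Merge several exported JSON result files into one dict."""
    merged = {}
    for path in paths:
        merged.update(json.loads(Path(path).read_text()))
    return merged
```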
**Exporter docs:** Local files, W&B, MLflow, GSheets are listed under **Launcher → Exporters** in the docs.
Reference: {ref}`exporters-overview`.
---
## **What's the difference between Launcher and Core?**
* **Launcher (`nemo-evaluator-launcher`)**: Unified CLI with config/exec backends (local/Slurm/Lepton), container orchestration, and exporters. Best for most users. See {ref}`lib-launcher`.
* **Core (`nemo-evaluator`)**: Direct access to the evaluation engine and adapters—useful for custom programmatic pipelines and advanced interceptor use. See {ref}`lib-core`.
---
## **Can I add a new benchmark?**
Yes. Use a **Framework Definition File (FDF)**—a YAML that declares framework metadata, default commands/params, and one or more evaluation tasks. Minimal flow:
1. Create an FDF with `framework`, `defaults`, and `evaluations` sections.
2. Point the launcher/Core at your FDF and run.
3. (Recommended) Package as a container for reproducibility and shareability. See {ref}`extending-evaluator`.
**Skeleton FDF (excerpt):**
```yaml
framework:
  name: my-custom-eval
  pkg_name: my_custom_eval
defaults:
  command: >-
    my-eval-cli --model {{target.api_endpoint.model_id}}
    --task {{config.params.task}}
    --output {{config.output_dir}}
evaluations:
  - name: my_task_1
    defaults:
      config:
        params:
          task: my_task_1
See the "Framework Definition File (FDF)" page for the full example and field reference.
Reference: {ref}`framework-definition-file`.
---
## **Why aren't exporters included in the main wheel?**
Exporters target **external systems** (e.g., W&B, MLflow, Google Sheets). Each of those adds heavy/optional dependencies and auth integrations. To keep the base install lightweight and avoid forcing unused deps on every user, exporters ship as **optional extras**:
```bash
# Only what you need
pip install "nemo-evaluator-launcher[wandb]"
pip install "nemo-evaluator-launcher[mlflow]"
pip install "nemo-evaluator-launcher[gsheets]"
# Or everything
pip install "nemo-evaluator-launcher[all]"
```
**Exporter docs:** Local files, W&B, MLflow, GSheets are listed under {ref}`exporters-overview`.
---
## **How is input configuration managed?**
NeMo Evaluator uses **Hydra** for configuration management, allowing flexible composition, inheritance, and command-line overrides.
Each evaluation is defined by a YAML configuration file that includes four primary sections:
```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: results

target:
  api_endpoint:
    model_id: meta/llama-3.2-3b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  - name: gpqa_diamond
  - name: ifeval
This structure defines **where to run**, **how to serve the model**, **which model or endpoint to evaluate**, and **what benchmarks to execute**.
You can start from a provided example config or compose your own using Hydra's `defaults` list to combine deployment, execution, and benchmark modules.
Reference: {ref}`configuration-overview`.
---
## **Can I customize or override configuration values?**
Yes. You can override any field in the YAML file directly from the command line using the `-o` flag:
```bash
# Override output directory
nemo-evaluator-launcher run --config your_config.yaml \
-o execution.output_dir=my_results
# Override multiple fields
nemo-evaluator-launcher run --config your_config.yaml \
-o target.api_endpoint.url="https://new-endpoint.com/v1/chat/completions" \
-o target.api_endpoint.model_id=openai/gpt-4o
```
Overrides are merged dynamically at runtime—ideal for testing new endpoints, swapping models, or changing output destinations without editing your base config.
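Mechanically, a `-o key.path=value` override sets one dotted path in the nested config. A pure-Python sketch of that behavior (illustrative only; the launcher itself relies on Hydra/OmegaConf for merging and type handling):

```python
from copy import deepcopy

def apply_override(cfg: dict, dotted_key: str, value) -> dict:
    """Return a copy of cfg with one dotted path set, e.g.
    'execution.output_dir' -> cfg['execution']['output_dir']."""
    out = deepcopy(cfg)
    node = out
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return out

base = {"execution": {"output_dir": "results"}}
overridden = apply_override(base, "execution.output_dir", "my_results")
```

Note that the original config is left untouched, mirroring how overrides apply at runtime without editing your base YAML.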
:::{tip}
Always start with a dry run to validate your configuration before launching a full evaluation:
```bash
nemo-evaluator-launcher run --config your_config.yaml --dry-run
```
:::
Reference: {ref}`configuration-overview`.
---
## **How do I choose the right deployment and execution configuration?**
NeMo Evaluator separates **deployment** (how your model is served) from **execution** (where your evaluations are run). These are configured in the `defaults` section of your YAML file:
```yaml
defaults:
  - execution: local    # Where to run: local, lepton, or slurm
  - deployment: none    # How to serve the model: none, vllm, sglang, nim, trtllm, generic
```
**Deployment Options — How your model is served**
| Option | Description | Best for |
| ----- | ----- | ----- |
| `none` | Uses an existing API endpoint (e.g., NVIDIA API Catalog, OpenAI, Anthropic). No deployment needed. | External APIs or already-deployed services |
| `vllm` | High-performance inference server for LLMs with tensor parallelism and caching. | Fast local/cluster inference, production workloads |
| `sglang` | Lightweight structured generation server optimized for throughput. | Evaluating structured or long-form text generation |
| `nim` | NVIDIA Inference Microservice (NIM) – optimized for enterprise-grade serving with autoscaling and telemetry. | Enterprise, production, and reproducible benchmarks |
| `trtllm` | TensorRT-LLM backend using GPU-optimized kernels. | Lowest latency and highest GPU efficiency |
| `generic` | Use a custom serving stack of your choice. | Custom frameworks or experimental endpoints |
**Execution Platforms — Where evaluations run**
| Platform | Description | Use case |
| ----- | ----- | ----- |
| `local` | Runs Docker-based evaluation locally. | Development, testing, or small-scale benchmarking |
| `lepton` | Runs on NVIDIA Lepton for on-demand GPU execution. | Scalable, production-grade evaluations |
| `slurm` | Uses your HPC cluster's job scheduler. | Research clusters or large batch evaluations |
**Example:**
```yaml
defaults:
  - execution: lepton
  - deployment: vllm
```
This configuration launches the model with **vLLM serving** and runs benchmarks remotely on **Lepton GPUs**.
When in doubt:
* Use `deployment: none` + `execution: local` for your **first run** (quickest setup).
* Use `vllm` or `nim` once you need **scalability and speed**.
Always test first:
```bash
nemo-evaluator-launcher run --config your_config.yaml --dry-run
```
Reference: {ref}`configuration-overview`.
---
## **Can I use Evaluator without internet access?**
Yes. NeMo Evaluator uses datasets and model checkpoints from [Hugging Face Hub](https://huggingface.co/docs/hub/en/index). If a requested dataset or model is not available locally, it is downloaded from the Hub at runtime.
When working in an environment without internet access, configure a cache directory and pre-populate it with all required data before launching the evaluation.
See the [example configuration](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator-launcher/examples/slurm_vllm_advanced_hf_caching.yaml) with HF caching:
```{literalinclude} ../../packages/nemo-evaluator-launcher/examples/slurm_vllm_advanced_hf_caching.yaml
:language: yaml
:start-after: "# [docs-start-snippet]"
:end-before: "# [docs-end-snippet]"
```
Modify the example with actual paths for the mounts and run:
```bash
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_vllm_advanced_hf_caching.yaml
```
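Before launching in the air-gapped environment, point the Hugging Face libraries at the pre-populated cache and force offline mode so nothing is fetched at runtime. `HF_HOME`, `HF_HUB_OFFLINE`, and `HF_DATASETS_OFFLINE` are standard Hugging Face environment variables; the cache path below is a placeholder:

```python
import os

# Placeholder path to a cache pre-populated while online.
os.environ["HF_HOME"] = "/shared/hf_cache"
# Fail fast instead of attempting downloads at runtime.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
```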
---
orphan: true
---
(troubleshooting-index)=
# Troubleshooting
Comprehensive troubleshooting guide for {{ product_name_short }} evaluations, organized by problem type and complexity level.
This section provides systematic approaches to diagnose and resolve evaluation issues. Start with the quick diagnostics below to verify your basic setup, then navigate to the appropriate troubleshooting category based on where your issue occurs in the evaluation workflow.
## Quick Start
Before diving into specific problem areas, run these basic checks to verify your evaluation environment:
::::{tab-set}
:::{tab-item} Launcher Quick Check
```bash
# Verify launcher installation and basic functionality
nemo-evaluator-launcher --version
# List available tasks
nemo-evaluator-launcher ls tasks
# Validate configuration without running
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run
# Check recent runs
nemo-evaluator-launcher ls runs
```
:::
:::{tab-item} Model Endpoint Check
```python
import requests
# Check health endpoint (adjust based on your deployment)
# vLLM/SGLang/NIM: use /health
# NeMo/Triton: use /v1/triton_health
health_response = requests.get("http://0.0.0.0:8080/health", timeout=5)
print(f"Health Status: {health_response.status_code}")
# Test completions endpoint
test_payload = {
    "prompt": "Hello",
    "model": "megatron_model",
    "max_tokens": 5
}
response = requests.post("http://0.0.0.0:8080/v1/completions/", json=test_payload)
print(f"Completions Status: {response.status_code}")
```
:::
:::{tab-item} Core API Check
```python
from nemo_evaluator import show_available_tasks
try:
    print("Available frameworks and tasks:")
    show_available_tasks()
except ImportError as e:
    print(f"Missing dependency: {e}")
```
:::
::::
## Troubleshooting Categories
Choose the category that best matches your issue for targeted solutions and debugging steps.
::::{grid} 1 1 1 1
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Setup & Installation
:link: setup-issues/index
:link-type: doc
Installation problems, authentication setup, and model deployment issues to get {{ product_name_short }} running.
:::
:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Runtime & Execution
:link: runtime-issues/index
:link-type: doc
Configuration validation and launcher management during evaluation execution.
:::
::::
## Getting Help
### Log Collection
When reporting issues, include:
1. System Information:
```bash
python --version
pip list | grep nvidia
nvidia-smi
```
2. Configuration Details:
```python
print(f"Task: {eval_cfg.type}")
print(f"Endpoint: {target_cfg.api_endpoint.url}")
print(f"Model: {target_cfg.api_endpoint.model_id}")
```
3. Error Messages: Full stack traces and error logs
### Community Resources
- **GitHub Issues**: [{{ product_name_short }} Issues](https://github.com/NVIDIA-NeMo/Eval/issues)
- **Discussions**: [GitHub Discussions](https://github.com/NVIDIA-NeMo/Eval/discussions)
- **Documentation**: {ref}`template-home`
### Professional Support
For enterprise support, contact: [nemo-toolkit@nvidia.com](mailto:nemo-toolkit@nvidia.com)
(configuration-issues)=
# Configuration Issues
Solutions for configuration parameters, tokenizer setup, and endpoint configuration problems.
## Log-Probability Evaluation Issues
### Problem: Log-probability evaluation fails
**Required Configuration**:
```python
from nemo_evaluator import EvaluationConfig, ConfigParams
config = EvaluationConfig(
    type="arc_challenge",
    params=ConfigParams(
        extra={
            "tokenizer": "/path/to/checkpoint/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface"
        }
    )
)
```
**Common Issues**:
- Missing tokenizer path
- Incorrect tokenizer backend
- Tokenizer version mismatch
### Tokenizer Configuration
**Verify Tokenizer Path**:
```python
import os
tokenizer_path = "/path/to/checkpoint/context/nemo_tokenizer"
if os.path.exists(tokenizer_path):
    print("✓ Tokenizer path exists")
else:
    print("✗ Tokenizer path not found")
    # Check alternative locations
```
## Chat vs. Completions Configuration
Before troubleshooting endpoint issues, verify your endpoint supports the required OpenAI API format using our {ref}`deployment-testing-compatibility` guide.
### Problem: Chat evaluation fails with base model
:::{admonition} Issue
:class: error
Base models don't have chat templates
:::
:::{admonition} Solution
:class: tip
Use completions endpoint instead:
```python
from nemo_evaluator import ApiEndpoint, EvaluationConfig, EndpointType
# Change from chat to completions
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS
)
# Use completion-based tasks
config = EvaluationConfig(type="mmlu")
```
:::
### Endpoint Configuration Examples
**For Completions (Base Models)**:
```python
from nemo_evaluator import EvaluationTarget, ApiEndpoint, EndpointType
target_cfg = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        type=EndpointType.COMPLETIONS,
        model_id="megatron_model"
    )
)
```
**For Chat (Instruct Models)**:
```python
from nemo_evaluator import EvaluationTarget, ApiEndpoint, EndpointType
target_cfg = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/chat/completions/",
        type=EndpointType.CHAT,
        model_id="megatron_model"
    )
)
```
## Timeout and Parallelism Issues
### Problem: Evaluation hangs, times out or crashes with "Too many requests" error
**Diagnosis**:
- Check `parallelism` setting (start with 1)
- Monitor resource usage
- Verify network connectivity
**Solutions**:
```python
from nemo_evaluator import ConfigParams
# Reduce concurrency
params = ConfigParams(
    parallelism=1,        # Start with single-threaded
    limit_samples=10,     # Test with small sample
    request_timeout=600   # Increase timeout for large models (seconds)
)
```
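If lowering `parallelism` is not enough and the endpoint still returns HTTP 429, a client-side retry with exponential backoff can smooth bursts. The helper below is an illustrative sketch, not part of the SDK:

```python
import random
import time

def call_with_backoff(send, max_attempts=5, base_delay=1.0):
    """Call `send()` (which returns a (status, body) pair), retrying on
    HTTP 429 with exponentially growing, slightly jittered delays."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return status, body
```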
## Configuration Validation
### Pre-Evaluation Checks
```python
from nemo_evaluator import show_available_tasks
# Verify task exists
print("Available tasks:")
show_available_tasks()
# Test endpoint connectivity with curl before running evaluation:
# curl -X POST http://0.0.0.0:8080/v1/completions/ \
# -H "Content-Type: application/json" \
# -d '{"prompt": "test", "model": "megatron_model", "max_tokens": 1}'
```
### Common Configuration Issues
- Wrong endpoint type (using `EndpointType.CHAT` for base models or `EndpointType.COMPLETIONS` for instruct models)
- Missing tokenizer (log-probability tasks require explicit tokenizer configuration in `params.extra`)
- High parallelism (starting with `parallelism > 1` can mask underlying issues; use `parallelism=1` for initial debugging)
- Incorrect model ID (model ID must match what the deployment expects)
- Missing output directory (ensure output path exists and is writable)
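The checklist above can be mechanized; a sketch of an illustrative pre-flight helper (not part of the SDK, and the heuristics are deliberately simple):

```python
import os

def preflight(endpoint_url: str, endpoint_type: str,
              is_base_model: bool, output_dir: str) -> list:
    """Return a list of likely configuration problems (empty = looks OK)."""
    problems = []
    if is_base_model and endpoint_type == "chat":
        problems.append("base model with chat endpoint: use completions instead")
    if endpoint_type == "chat" and "/chat/" not in endpoint_url:
        problems.append("endpoint type 'chat' but URL is not a chat route")
    if not os.path.isdir(output_dir):
        problems.append("output directory does not exist: " + output_dir)
    return problems
```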
### Task-Specific Configuration
**MMLU (Choice-Based)**:
```python
from nemo_evaluator import EvaluationConfig, ConfigParams
config = EvaluationConfig(
    type="mmlu",
    params=ConfigParams(
        extra={
            "tokenizer": "/path/to/tokenizer",
            "tokenizer_backend": "huggingface"
        }
    )
)
```
**Generation Tasks**:
```python
from nemo_evaluator import EvaluationConfig, ConfigParams
config = EvaluationConfig(
    type="hellaswag",
    params=ConfigParams(
        max_new_tokens=100,
        limit_samples=50
    )
)
```
---
orphan: true
---
# Runtime and Execution Issues
Solutions for problems that occur during evaluation execution, including configuration validation and launcher management.
## Common Runtime Problems
When evaluations fail during execution, start with these diagnostic steps:
::::{tab-set}
:::{tab-item} Configuration Check
```bash
# Validate configuration before running
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run
# Test minimal configuration
python -c "
from nemo_evaluator import EvaluationConfig, ConfigParams
config = EvaluationConfig(type='mmlu', params=ConfigParams(limit_samples=1))
print('Configuration valid')
"
```
:::
:::{tab-item} Endpoint Test
```python
import requests
# Test model endpoint connectivity
response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={"prompt": "test", "model": "megatron_model", "max_tokens": 1}
)
print(f"Endpoint status: {response.status_code}")
```
:::
:::{tab-item} Resource Monitor
```bash
# Monitor system resources during evaluation
nvidia-smi -l 1 # GPU usage
htop # CPU/Memory usage
```
:::
::::
## Runtime Categories
Choose the category that matches your runtime issue:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration Issues
:link: configuration
:link-type: doc
Config parameter validation, tokenizer setup, and endpoint configuration problems.
:::
:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher Issues
:link: launcher
:link-type: doc
NeMo Evaluator Launcher-specific problems including job management and multi-backend execution.
:::
::::
:::{toctree}
:caption: Runtime Issues
:hidden:
Configuration
Launcher
:::
# Launcher Issues
Troubleshooting guide for NeMo Evaluator Launcher-specific problems including configuration validation, job management, and multi-backend execution issues.
## Configuration Issues
### Configuration Validation Errors
**Problem**: Configuration fails validation before execution
**Solution**: Use dry-run to validate configuration:
```bash
# Validate configuration without running
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run
```
**Common Issues**:
::::{dropdown} Missing Required Fields
:icon: code-square
```
Error: Missing required field 'execution.output_dir'
```
**Fix**: Add output directory to config or override:
```bash
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
-o execution.output_dir=./results
```
::::
::::{dropdown} Invalid Task Names
:icon: code-square
```
Error: Unknown task 'invalid_task'. Available tasks: hellaswag, arc_challenge, ...
```
**Fix**: List available tasks and use correct names:
```bash
nemo-evaluator-launcher ls tasks
```
::::
::::{dropdown} Configuration Conflicts
:icon: code-square
```
Error: Cannot specify both 'api_key' and 'api_key_name' in target.api_endpoint
```
**Fix**: Use only one authentication method in configuration.
::::
### Hydra Configuration Errors
**Problem**: Hydra fails to resolve configuration composition
**Common Errors**:
```
MissingConfigException: Cannot find primary config 'missing_config'
```
**Solutions**:
1. **Verify Config Directory**:
```bash
# List available configs
ls examples/
# Ensure config file exists
ls examples/local_basic.yaml
```
2. **Check Config Composition**:
```yaml
# Verify defaults section in config file
defaults:
  - execution: local
  - deployment: none
  - _self_
```
3. **Use Absolute Paths**:
```bash
nemo-evaluator-launcher run --config /absolute/path/to/configs/my_config.yaml
```
## Job Management Issues
### Job Status Problems
**Problem**: Cannot check job status or jobs appear stuck
**Diagnosis**:
```bash
# Check job status
nemo-evaluator-launcher status
# List all runs
nemo-evaluator-launcher ls runs
# Check specific job
nemo-evaluator-launcher status
```
**Common Issues**:
1. **Invalid Invocation ID**:
```
Error: Invocation 'abc123' not found
```
**Fix**: Use correct invocation ID from run output or list recent runs:
```bash
nemo-evaluator-launcher ls runs
```
2. **Stale Job Database**:
**Fix**: Check execution database location and permissions:
```bash
# Database location
ls -la ~/.nemo-evaluator/exec-db/exec.v1.jsonl
```
### Job Termination Issues
**Problem**: Cannot kill running jobs
**Solutions**:
```bash
# Kill entire invocation
nemo-evaluator-launcher kill
# Kill specific job
nemo-evaluator-launcher kill
```
**Executor-Specific Issues**:
- **Local**: Jobs run in Docker containers; ensure the Docker daemon is running
- **Slurm**: Check Slurm queue status with `squeue`
- **Lepton**: Verify Lepton workspace connectivity
## Multi-Backend Execution Issues
::::{dropdown} Local Executor Problems
:icon: code-square
**Problem**: Docker-related execution failures
**Common Issues**:
1. **Docker Not Running**:
```
Error: Cannot connect to Docker daemon
```
**Fix**: Start Docker daemon:
```bash
# macOS/Windows: Start Docker Desktop
# Linux:
sudo systemctl start docker
```
2. **Container Pull Failures**:
```
Error: Failed to pull container image
```
**Fix**: Check network connectivity and container registry access.
::::
::::{dropdown} Slurm Executor Problems
:icon: code-square
**Problem**: Jobs fail to submit to Slurm cluster
**Diagnosis**:
```bash
# Check Slurm cluster status
sinfo
squeue -u $USER
# Check partition availability
sinfo -p
```
**Common Issues**:
1. **Invalid Partition**:
```
Error: Invalid partition name 'gpu'
```
**Fix**: Use correct partition name:
```bash
# List available partitions
sinfo -s
```
2. **Resource Unavailable**:
```
Error: Insufficient resources for job
```
**Fix**: Adjust resource requirements:
```yaml
execution:
  num_nodes: 1
  gpus_per_node: 2
  walltime: "2:00:00"
```
::::
::::{dropdown} Lepton Executor Problems
:icon: code-square
**Problem**: Lepton deployment or execution failures
**Diagnosis**:
```bash
# Check Lepton authentication
lep workspace list
# Test connection
lep deployment list
```
**Common Issues**:
1. **Authentication Failure**:
```
Error: Invalid Lepton credentials
```
**Fix**: Re-authenticate with Lepton:
```bash
lep login -c :
```
2. **Deployment Timeout**:
```
Error: Deployment failed to reach Ready state
```
**Fix**: Check Lepton workspace capacity and deployment status.
::::
## Export Issues
### Export Failures
**Problem**: Results export fails to destination
**Diagnosis**:
```bash
# List completed runs
nemo-evaluator-launcher ls runs
# Try export
nemo-evaluator-launcher export --dest local --format json
```
**Common Issues**:
1. **Missing Dependencies**:
```
Error: MLflow not installed
```
**Fix**: Install required exporter dependencies:
```bash
pip install nemo-evaluator-launcher[mlflow]
```
2. **Authentication Issues**:
```
Error: Invalid W&B credentials
```
**Fix**: Configure authentication for export destination:
```bash
# W&B
wandb login
```
## Advanced Debugging Techniques
### Injecting Custom Command Into Evaluation Container
:::{note}
Do not use this functionality at scale: it (a) reduces the reproducibility of evaluations and (b) introduces security risks (remote command execution).
:::
For debugging or testing, you can supply a `pre_cmd` field at the following configuration positions:
```yaml
...
evaluation:
  pre_cmd: |
    any script that will be executed inside of
    the container before running evaluation
    it can be multiline
  tasks:
    - name:
      pre_cmd: one can override this command
```
For security (configs may come from untrusted sources), if `pre_cmd` is
non-empty, `nemo-evaluator-launcher` fails unless the `NEMO_EVALUATOR_TRUST_PRE_CMD=1` environment
variable is set.
## Getting Help
### Debug Information Collection
When reporting launcher issues, include:
1. **Configuration Details**:
```bash
# Show resolved configuration
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/.yaml --dry-run
```
2. **System Information**:
```bash
# Launcher version
nemo-evaluator-launcher --version
# System info
python --version
docker --version # For local executor
sinfo # For Slurm executor
lep workspace list # For Lepton executor
```
3. **Job Information**:
```bash
# Job status
nemo-evaluator-launcher status
# Recent runs
nemo-evaluator-launcher ls runs
```
4. **Log Files**:
- Local executor: Check `//logs/stdout.log`
- Slurm executor: Check job output files in output directory
- Lepton executor: Check Lepton job logs via Lepton CLI
For complex issues, see the [Python API documentation](../../libraries/nemo-evaluator-launcher/api).
(authentication)=
# Authentication
Solutions for HuggingFace token issues and dataset access permissions.
## Common Authentication Issues
### Problem: `401 Unauthorized` for Gated Datasets
**Solution**:
```bash
# Set HuggingFace token
export HF_TOKEN=your_huggingface_token
# Or authenticate using CLI
huggingface-cli login
# Verify authentication
huggingface-cli whoami
```
**In Python**:
```python
import os
os.environ["HF_TOKEN"] = "your_token_here"
```
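In scripted workflows it can help to validate the token up front rather than hitting a `401` mid-download. The helper below is a generic sketch, not part of the SDK:

```python
import os

def require_hf_token() -> str:
    # Fail fast with a clear message instead of a 401 Unauthorized
    # surfacing deep inside a gated-dataset download.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise EnvironmentError(
            "HF_TOKEN is not set; gated datasets will return 401 Unauthorized"
        )
    return token
```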
### Problem: `403 Forbidden` for Specific Datasets
**Solution**:
1. Request access to the gated dataset on HuggingFace
2. Wait for approval from dataset maintainers
3. Ensure your token has the required permissions
## Datasets Requiring Authentication
The following datasets require `HF_TOKEN` and access approval:
- **GPQA Diamond** (and variants): [Request access](https://huggingface.co/datasets/Idavidrein/gpqa)
- **Aegis v2**: Required for safety evaluation tasks
- **HLE**: Humanity's Last Exam benchmark
:::{note}
Most standard benchmarks (MMLU, HellaSwag, ARC, etc.) do not require authentication.
:::
---
orphan: true
---
# Setup and Installation Issues
Solutions for getting {{ product_name_short }} up and running, including installation problems, authentication setup, and model deployment issues.
## Common Setup Problems
Before diving into specific issues, verify your basic setup with these quick checks:
::::{tab-set}
:::{tab-item} Installation Check
```bash
# Verify core packages are installed
pip list | grep nvidia
# Check for missing evaluation frameworks
python -c "from nemo_evaluator import show_available_tasks; show_available_tasks()"
```
:::
:::{tab-item} Authentication Check
```bash
# Verify HuggingFace token
huggingface-cli whoami
# Test token access
python -c "import os; print('HF_TOKEN set:', bool(os.environ.get('HF_TOKEN')))"
```
:::
:::{tab-item} Deployment Check
```bash
# Check if deployment server is running
# Use /health for vLLM, SGLang, NIM deployments
# Use /v1/triton_health for NeMo/Triton deployments
curl -I http://0.0.0.0:8080/health
# Verify GPU availability
nvidia-smi
```
:::
::::
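The deployment check above can also be done from Python, for example inside a CI script. This sketch mirrors the `curl` probe (it issues a GET rather than a HEAD request) and assumes only the standard library:

```python
import urllib.request

def endpoint_healthy(url: str, timeout: float = 5.0) -> bool:
    # True when the health route answers with a 2xx status;
    # False on 4xx/5xx responses, connection errors, or timeouts.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
```

Swap in `/v1/triton_health` for NeMo/Triton deployments, as noted above.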
## Setup Categories
Choose the category that matches your setup issue:
::::{grid} 1 2 2 2
:gutter: 1 1 1 2
:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Installation Issues
:link: installation
:link-type: doc
Module import errors, missing dependencies, and framework installation problems.
:::
:::{grid-item-card} {octicon}`key;1.5em;sd-mr-1` Authentication Setup
:link: authentication
:link-type: doc
HuggingFace tokens, dataset access permissions, and gated model authentication.
:::
::::
:::{toctree}
:caption: Setup Issues
:hidden:
Installation
Authentication
:::
(installation-issues)=
# Installation Issues
Solutions for import errors, missing dependencies, and framework installation problems.
## Common Import and Installation Problems
### Problem: `ModuleNotFoundError: No module named 'core_evals'`
**Solution**:
```bash
# Install missing core evaluation framework
pip install nvidia-lm-eval
# For additional frameworks
pip install nvidia-simple-evals nvidia-bigcode-eval nvidia-bfcl
```
### Problem: `Framework for task X not found`
**Diagnosis**:
```python
from nemo_evaluator import show_available_tasks
# Display all available tasks
print("Available tasks:")
show_available_tasks()
```
Or use the CLI:
```bash
nemo-evaluator-launcher ls tasks
```
**Solution**:
```bash
# Install the framework containing the missing task
pip install nvidia-
# Restart Python session to reload frameworks
```
### Problem: `Multiple frameworks found for task X`
**Solution**:
```python
# Assumes the import path used by the nemo-evaluator package
from nemo_evaluator.api.api_dataclasses import EvaluationConfig

# Specify the framework explicitly to disambiguate the task
config = EvaluationConfig(
    type="lm-evaluation-harness.mmlu",  # instead of just "mmlu"
    # ... other config
)
```
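The disambiguation rule can be illustrated with a small helper. `qualify_task` is hypothetical (not a launcher API); it shows why an explicit `<framework>.<task>` type resolves the error:

```python
def qualify_task(task: str, providers: list[str]) -> str:
    # When several installed frameworks provide the same task name,
    # require an explicit "<framework>.<task>" type string.
    if "." in task:
        return task  # already fully qualified
    if len(providers) == 1:
        return f"{providers[0]}.{task}"
    raise ValueError(
        f"Multiple frameworks provide {task!r}: {providers}; "
        f"use an explicit '<framework>.{task}' type"
    )

print(qualify_task("mmlu", ["lm-evaluation-harness"]))
# -> lm-evaluation-harness.mmlu
```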