(template-home)= # NeMo Evaluator SDK Documentation Welcome to the NeMo Evaluator SDK Documentation. ````{div} sd-d-flex-row ```{button-ref} get-started/install :ref-type: doc :color: primary :class: sd-rounded-pill sd-mr-3 Install ``` ```{button-ref} get-started/quickstart/launcher :ref-type: doc :color: secondary :class: sd-rounded-pill sd-mr-3 Quickstart Evaluations ``` ```{raw} html

Download Docs for LLM Context

``` ```` --- ## Introduction to NeMo Evaluator SDK Discover how NeMo Evaluator SDK works and explore its key features. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`info;1.5em;sd-mr-1` About NeMo Evaluator SDK :link: about/index :link-type: doc Explore the NeMo Evaluator Core and Launcher architecture ::: :::{grid-item-card} {octicon}`star;1.5em;sd-mr-1` Key Features :link: about/key-features :link-type: doc Discover NeMo Evaluator SDK's powerful capabilities. ::: :::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Concepts :link: about/concepts/index :link-type: doc Master core concepts powering NeMo Evaluator SDK. ::: :::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Release Notes :link: about/release-notes/index :link-type: doc Release notes for the NeMo Evaluator SDK. ::: :::: ## Choose a Quickstart Select the evaluation approach that best fits your workflow and technical requirements. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher :link: gs-quickstart-launcher :link-type: ref Use the CLI to orchestrate evaluations with automated container management. +++ {bdg-secondary}`cli` ::: :::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Core :link: gs-quickstart-core :link-type: ref Get direct Python API access with full adapter features, custom configurations, and workflow integration capabilities. +++ {bdg-secondary}`api` ::: :::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Container :link: gs-quickstart-container :link-type: ref Gain full control over the container environment with volume mounting, environment variable management, and integration into Docker-based CI/CD pipelines. +++ {bdg-secondary}`Docker` ::: :::: ## Libraries ### Launcher Orchestrate evaluations across different execution backends with unified CLI and programmatic interfaces. 
::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration :link: libraries/nemo-evaluator-launcher/configuration/index :link-type: doc Complete configuration schema, examples, and advanced patterns for all use cases. +++ {bdg-secondary}`Setup` ::: :::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Executors :link: libraries/nemo-evaluator-launcher/configuration/executors/index :link-type: doc Run evaluations on local machines, HPC clusters (Slurm), or cloud platforms (Lepton AI). +++ {bdg-secondary}`Execution` ::: :::{grid-item-card} {octicon}`upload;1.5em;sd-mr-1` Exporters :link: libraries/nemo-evaluator-launcher/exporters/index :link-type: doc Export results to MLflow, Weights & Biases, Google Sheets, or local files with one command. +++ {bdg-secondary}`Export` ::: :::: ### Core Access the core evaluation engine directly with containerized benchmarks and flexible adapter architecture. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Workflows :link: libraries/nemo-evaluator/workflows/index :link-type: doc Use the evaluation engine through Python API, containers, or programmatic workflows. +++ {bdg-secondary}`Integration` ::: :::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Containers :link: libraries/nemo-evaluator/containers/index :link-type: doc Ready-to-use evaluation containers with curated benchmarks and frameworks. +++ {bdg-secondary}`Containers` ::: :::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` Interceptors :link: libraries/nemo-evaluator/interceptors/index :link-type: doc Configure request/response interceptors for logging, caching, and custom processing. +++ {bdg-secondary}`Customization` ::: :::{grid-item-card} {octicon}`log;1.5em;sd-mr-1` Logging :link: libraries/nemo-evaluator/logging :link-type: doc Comprehensive logging setup for evaluation runs, debugging, and audit trails. 
+++ {bdg-secondary}`Monitoring` ::: :::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Extending :link: libraries/nemo-evaluator/extending/index :link-type: doc Add custom benchmarks and frameworks by defining configuration and interfaces. +++ {bdg-secondary}`Extension` ::: :::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` API Reference :link: libraries/nemo-evaluator/api :link-type: doc Python API documentation for programmatic evaluation control and integration. +++ {bdg-secondary}`API` ::: :::: :::{toctree} :hidden: Home ::: :::{toctree} :caption: About :hidden: Overview Key Features Concepts Release Notes ::: :::{toctree} :caption: Get Started :hidden: Getting Started Install SDK Quickstart ::: :::{toctree} :caption: Tutorials :hidden: About Tutorials How-To Guides Tutorials for NeMo Framework Evaluate an Existing Endpoint ::: :::{toctree} :caption: Evaluation :hidden: About Model Evaluation Benchmark Catalog Tasks Not Explicitly Defined by FDF Evaluation Techniques Add Evaluation Packages to NeMo Framework ::: :::{toctree} :caption: Model Deployment :hidden: About Model Deployment Bring-Your-Own-Endpoint Use NeMo Framework ::: :::{toctree} :caption: Libraries :hidden: About NeMo Evaluator Libraries Launcher Core ::: :::{toctree} :caption: References :hidden: About References FAQ NeMo Evaluator Core Python API NeMo Evaluator Launcher Python API nemo-evaluator CLI nemo-evaluator-launcher CLI ::: (about-overview)= # About NeMo Evaluator SDK NeMo Evaluator SDK is NVIDIA's comprehensive platform for AI model evaluation and benchmarking. It consists of two core libraries that work together to enable consistent, scalable, and reproducible evaluation of large language models across diverse capabilities including reasoning, code generation, function calling, and safety. 
## System Architecture NeMo Evaluator SDK consists of two main libraries: ```{list-table} NeMo Evaluator SDK Components :header-rows: 1 :widths: 30 70 * - Component - Key Capabilities * - **nemo-evaluator** (*Core Evaluation Engine*) - • {ref}`interceptors-concepts` for request and response processing • Standardized evaluation workflows and containerized frameworks • Deterministic configuration and reproducible results • Consistent result schemas and artifact layouts * - **nemo-evaluator-launcher** (*Orchestration Layer*) - • Unified CLI and programmatic entry points • Multi-backend execution (local, Slurm, cloud) • Job monitoring and lifecycle management • Result export to multiple destinations (MLflow, W&B, Google Sheets) ``` ## Target Users ```{list-table} Target User Personas :header-rows: 1 :widths: 30 70 * - User Type - Key Benefits * - **Researchers** - Access 100+ benchmarks across multiple evaluation harnesses with containerized reproducibility. Run evaluations locally or on HPC clusters. * - **ML Engineers** - Integrate evaluations into ML pipelines with programmatic APIs. Deploy models and run evaluations across multiple backends. * - **Organizations** - Scale evaluation across teams with unified CLI, multi-backend execution, and result tracking. Export results to MLflow, Weights & Biases, or Google Sheets. * - **AI Safety Teams** - Conduct safety assessments using specialized containers for security testing and bias evaluation with detailed logging. * - **Model Developers** - Evaluate custom models against standard benchmarks using OpenAI-compatible APIs. ``` --- orphan: true --- (adapters-concepts)= # Adapters Adapters in NeMo Evaluator provide sophisticated request and response processing through a configurable interceptor pipeline. They enable advanced evaluation capabilities like caching, logging, reasoning extraction, and custom prompt injection. 
## Architecture Overview The adapter system transforms simple API calls into sophisticated evaluation workflows through a two-phase pipeline: 1. **Request Processing**: Interceptors modify outgoing requests (system prompts, parameters) before they reach the endpoint 2. **Response Processing**: Interceptors extract reasoning, log data, cache results, and track statistics after receiving responses The endpoint interceptor bridges these phases by handling HTTP communication with the model API. ## Core Components - **AdapterConfig**: Configuration class for all interceptor settings - **Interceptor Pipeline**: Modular components for request/response processing - **Endpoint Management**: HTTP communication with error handling and retries - **Resource Management**: Caching, logging, and progress tracking ## Available Interceptors The adapter system includes several built-in interceptors: - **System Message**: Inject custom system prompts - **Payload Modifier**: Transform request parameters - **Request/Response Logging**: Capture detailed interaction data - **Caching**: Store and retrieve responses for efficiency - **Reasoning**: Extract chain-of-thought reasoning - **Response Stats**: Collect aggregated statistics from API responses - **Progress Tracking**: Monitor evaluation progress - **Endpoint**: Handle HTTP communication with the model API - **Raise Client Errors**: Handle and raise exceptions for client errors ## Integration The adapter system integrates seamlessly with: - **Evaluation Frameworks**: Works with any OpenAI-compatible API - **NeMo Evaluator Core**: Direct integration via `AdapterConfig` - **NeMo Evaluator Launcher**: YAML configuration support ## Configuration ### Modern Interceptor-Based Configuration The recommended approach uses the interceptor-based API: :::{code-block} python from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="system_message", 
enabled=True, config={"system_message": "You are a helpful assistant."} ), InterceptorConfig(name="request_logging", enabled=True), InterceptorConfig( name="caching", enabled=True, config={"cache_dir": "./cache", "reuse_cached_responses": True} ), InterceptorConfig(name="reasoning", enabled=True), InterceptorConfig(name="endpoint") ] ) ::: For detailed usage and configuration examples, see {ref}`interceptors-concepts`. # Architecture Overview NeMo Evaluator provides a **two-tier architecture** for comprehensive model evaluation: ```{mermaid} graph TB subgraph Tier2[" Orchestration Layer"] Launcher["nemo-evaluator-launcher
• CLI orchestration
• Multi-backend execution (local, Slurm, Lepton)
• Deployment management (vLLM, NIM, SGLang)
• Result export (MLflow, W&B, Google Sheets)"] end subgraph Tier1[" Evaluation Engine"] Evaluator["nemo-evaluator
• Adapter system
• Interceptor pipeline
• Containerized evaluation execution
• Result aggregation"] end subgraph External["NVIDIA Eval Factory Containers"] Containers["Evaluation Frameworks
• nvidia-lm-eval (lm-evaluation-harness)
• nvidia-simple-evals
• nvidia-bfcl, nvidia-bigcode-eval
• nvidia-eval-factory-garak
• nvidia-safety-harness"] end Launcher --> Evaluator Evaluator --> Containers style Tier2 fill:#e1f5fe style Tier1 fill:#f3e5f5 style External fill:#fff3e0 ``` ## Component Overview ### **Orchestration Layer** (`nemo-evaluator-launcher`) High-level orchestration for complete evaluation workflows. **Key Features:** - CLI and YAML configuration management - Multi-backend execution (local, Slurm, Lepton) - Deployment management (vLLM, NIM, SGLang, or bring-your-own-endpoint) - Result export to MLflow, Weights & Biases, and Google Sheets - Job monitoring and lifecycle management **Use Cases:** - Automated evaluation pipelines - HPC cluster evaluations with Slurm - Cloud deployments with Lepton AI - Multi-model comparative studies ### **Evaluation Engine** (`nemo-evaluator`) Core evaluation capabilities with request/response processing. **Key Features:** - **Adapter System**: Request/response processing layer for API endpoints - **Interceptor Pipeline**: Modular components for logging, caching, and reasoning - **Containerized Execution**: Evaluation harnesses run in Docker containers - **Result Aggregation**: Standardized result schemas and metrics **Use Cases:** - Programmatic evaluation integration - Request/response transformation and logging - Custom interceptor development - Direct Python API usage ## Interceptor Pipeline The evaluation engine provides an interceptor system for request/response processing. Interceptors are configurable components that process API requests and responses in a pipeline. 
```{mermaid} graph LR A[Request] --> B[System Message] B --> C[Payload Modifier] C --> D[Request Logging] D --> E[Caching] E --> F[API Endpoint] F --> G[Response Logging] G --> H[Reasoning] H --> I[Response Stats] I --> J[Response] style E fill:#e1f5fe style F fill:#f3e5f5 ``` **Available Interceptors:** - **System Message**: Inject system prompts into chat requests - **Payload Modifier**: Transform request parameters - **Request/Response Logging**: Log requests and responses to files - **Caching**: Cache responses to avoid redundant API calls - **Reasoning**: Extract chain-of-thought from responses - **Response Stats**: Track token usage and latency metrics - **Progress Tracking**: Monitor evaluation progress ## Integration Patterns ### **Pattern 1: Launcher with Deployment** Use the launcher to handle both model deployment and evaluation: ```bash nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_vllm_logprobs.yaml \ -o deployment=vllm \ -o ++deployment.hf_handle=meta-llama/Llama-3.1-8B \ -o ++deployment.served_model_name=meta-llama/Llama-3.1-8B ``` ### **Pattern 2: Launcher with Existing Endpoint** Point the launcher to an existing API endpoint: ```bash export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o target.api_endpoint.url=http://localhost:8080/v1/completions \ -o target.api_endpoint.api_key_name=null \ -o deployment=none ``` ### **Pattern 3: Python API** Use the Python API for programmatic integration: ```python from nemo_evaluator import evaluate, EvaluationConfig, EvaluationTarget, ApiEndpoint, EndpointType # Configure target endpoint api_endpoint = ApiEndpoint( url="http://localhost:8080/v1/completions", type=EndpointType.COMPLETIONS ) target = EvaluationTarget(api_endpoint=api_endpoint) # Configure evaluation eval_config = EvaluationConfig( type="mmlu_pro", output_dir="./results" ) # Run evaluation results = 
evaluate(eval_cfg=eval_config, target_cfg=target) ``` (evaluation-model)= # Evaluation Model NeMo Evaluator provides evaluation approaches and endpoint compatibility for comprehensive AI model assessment. ## Evaluation Approaches NeMo Evaluator supports several evaluation approaches through containerized harnesses: - **Text Generation**: Models generate responses to prompts, assessed for correctness or quality against reference answers or rubrics. - **Log Probability**: Models assign probabilities to token sequences, enabling confidence measurement without text generation. Effective for choice-based tasks and base model evaluation. - **Code Generation**: Models generate code from natural language descriptions, evaluated for correctness through test execution. - **Function Calling**: Models generate structured outputs for tool use and API interaction scenarios. - **Retrieval Augmented Generation**: Models retrieve content based on context, evaluated for content relevance and coverage, as well as answer correctness. - **Visual Understanding**: Models generate responses to prompts with images and videos, assessed for correctness or quality against reference answers or rubrics. - **Agentic Workflows**: Models are tasked with complex problems and must select and engage tools autonomously. - **Safety & Security**: Evaluation against adversarial prompts and safety benchmarks to test model alignment and robustness. One or more evaluation harnesses implement each approach. To discover available tasks for each approach, use `nemo-evaluator-launcher ls tasks`. ## Endpoint Compatibility NeMo Evaluator targets OpenAI-compatible API endpoints. The platform supports the following endpoint types: - **`completions`**: Direct text completion without chat formatting (`/v1/completions`). Used for base models and academic benchmarks. - **`chat`**: Conversational interface with role-based messages (`/v1/chat/completions`). Used for instruction-tuned and chat models.
- **`vlm`**: Vision-language model endpoints supporting image inputs. - **`embedding`**: Embedding generation endpoints for retrieval evaluation. Each evaluation task specifies which endpoint types it supports. Verify compatibility using `nemo-evaluator-launcher ls tasks`. ## Metrics Individual evaluation harnesses define metrics that vary by task. Common metric types include: - **Accuracy metrics**: Exact match, normalized accuracy, F1 scores - **Generative metrics**: BLEU, ROUGE, code execution pass rates - **Probability metrics**: Perplexity, log-likelihood scores - **Safety metrics**: Refusal rates, toxicity scores, vulnerability detection The platform returns results in a standardized schema regardless of the source harness. To see metrics for a specific task, refer to {ref}`eval-benchmarks` or run an evaluation and inspect the results. For hands-on guides, refer to {ref}`eval-run`. (evaluation-output)= # Evaluation Output This page describes the structure and content of evaluation output files generated by NVIDIA NeMo Evaluator. The evaluation output provides comprehensive information about the evaluation run, including configuration details, results, and metadata. ## Input Configuration The input configuration comes from the command described in the [Launcher Quickstart Guide](../../get-started/quickstart/launcher.md#quick-start), namely: ```{literalinclude} ../../get-started/_snippets/launcher_full_example.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` :::{note} For local execution, all artifacts are already present on your machine.
When working with remote executors such as `slurm`, you can download the artifacts with the following command: ```bash nemo-evaluator-launcher info <invocation_id> --copy-artifacts ``` ::: For reference, here is the launcher config used in the command: ```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_basic.yaml :language: yaml :start-after: "[docs-start-snippet]" ``` ## Output Structure After running an evaluation, NeMo Evaluator creates a structured output directory containing various artifacts. If you run the command provided above, it creates the following directory structure inside `execution.output_dir` (`./results` in our case): ```bash ./results/ ├── <invocation_id> │   ├── gpqa_diamond │   │   ├── artifacts │   │   ├── logs │   │   └── run.sh │   ├── ifeval │   │   ├── artifacts │   │   ├── logs │   │   └── run.sh │   ├── mbpp │   │   ├── artifacts │   │   ├── logs │   │   └── run.sh │   └── run_all.sequential.sh ``` Each `artifacts` directory contains output produced by the evaluation job.
Such a directory is also created if you use `nemo-evaluator` directly or run the container yourself (see {ref}`gs-quickstart` for a comparison of the different ways of using the NeMo Evaluator SDK). Regardless of the chosen path, the generated artifacts directory has the following content: ```text <output_dir>/ │ ├── run_config.yml # Task-specific configuration used during execution │ ├── eval_factory_metrics.json # Evaluation metrics and performance statistics │ ├── results.yml # Detailed results in YAML format │ ├── report.html # Human-readable HTML report │ ├── report.json # JSON format report │ └── <task_artifacts>/ # Task-specific artifacts ``` These files are standardized and always follow the same structure regardless of the underlying evaluation harness: ```{list-table} :header-rows: 1 :widths: 20 30 30 50 * - File Name - Description - Content - Usage * - `results.yml` - Evaluation results in YAML format - * Final evaluation scores and metrics * Evaluation configuration used - The main results file for programmatic analysis and integration with downstream tools. * - `run_config.yml` - Complete evaluation configuration (all parameters, overrides, and settings) used for the run. - * Task and model settings * Endpoint configuration * Interceptor config * Evaluation-specific overrides - Enables full reproducibility of evaluations and configuration auditing. * - `eval_factory_metrics.json` - Detailed metrics and performance statistics for the evaluation execution. - * Request/response timings * Token usage * Error rates * Resource utilization - Performance analysis and failure pattern identification. * - `report.html` and `report.json` - Example request-response pairs collected during benchmark execution. - * Human-readable HTML report * Machine-readable JSON version with the same content - For sharing, quick review, analysis, and debugging. * - Task-specific artifacts - Artifacts produced by the underlying benchmark (e.g., caches, raw outputs, error logs).
- * Cached queries & responses * Source/context data * Special task outputs or logs - Advanced troubleshooting, debugging, or domain-specific analysis. ``` ### Results file The primary evaluation output is stored in `results.yml`. It is standardized across all evaluation benchmarks and follows the [API dataclasses](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/api/api_dataclasses.py) specification. Below we give the output of the command from the [Launcher Quickstart Section](../../get-started/quickstart/launcher.md#quick-start) for the `GPQA-Diamond` task. ```{literalinclude} ./_snippets/results.yaml :language: yaml ``` :::{note} It is instructive to compare the configuration file cited above with the resulting one. ::: The evaluation output contains the following general sections: | Section | Description | |---------|-------------| | `command` | The exact command executed to run the evaluation | | `config` | Evaluation configuration including parameters and settings | | `results` | Evaluation metrics and scores organized by groups and tasks | | `target` | Model and API endpoint configuration details | | `git_hash` | Git commit hash (if available) | The evaluation metrics are available under the `results` key and are stored in the following structure: ```yaml metrics: metric_name: scores: score_name: stats: # optional set of statistics, e.g.: count: 10 # number of values used for computing the score min: 0 # minimum of all values used for computing the score max: 1 # maximum of all values used for computing the score stderr: 0.13 # standard error value: 0.42 # score value ``` In the example output above, the metric is a micro-average across the samples (hence the `micro` key in the structure), and the standard deviation (`stddev`) and standard error (`stderr`) statistics are reported.
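Because this layout is standardized, scores can be extracted programmatically. The following is a minimal sketch that flattens the `metrics` tree shown above into `metric/score` pairs; it assumes `results.yml` has already been parsed into a Python dict (for example with `yaml.safe_load`), and the metric and score names below are illustrative rather than real benchmark output:

```python
# Sketch: flatten the standardized metrics tree into {"metric/score": value}
# pairs. Assumes results.yml was parsed into a dict beforehand (e.g. with
# yaml.safe_load); the sample values are illustrative only.

def extract_scores(metrics: dict) -> dict:
    """Return a flat {"metric_name/score_name": value} mapping."""
    flat = {}
    for metric_name, metric in metrics.items():
        for score_name, score in metric.get("scores", {}).items():
            flat[f"{metric_name}/{score_name}"] = score["value"]
    return flat

# Fragment mirroring the structure documented above
metrics = {
    "accuracy": {
        "scores": {
            "micro": {"value": 0.42, "stats": {"count": 10, "stderr": 0.13}},
        },
    },
}

print(extract_scores(metrics))  # {'accuracy/micro': 0.42}
```

The same traversal works for any task, since every harness reports through this schema.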
The types of metrics available in the results vary by evaluation harness and task, but they are always presented using the same structure as shown above. ## Exporting the Results Once the evaluation has finished and the `results.yml` file has been produced, the scores can be exported. In this example we show how the local export works. For information on other exporters, see {ref}`exporters-overview`. The results can be exported using the following command: ```bash nemo-evaluator-launcher export <invocation_id> --dest local --format json ``` This command extracts the scores from `results.yml` and creates a `processed_results.json` file with the following content: ```{literalinclude} ./_snippets/processed_results.json :language: json ``` The `nemo-evaluator-launcher export` command can accept multiple invocation IDs and gather results across different invocations, regardless of whether they were run locally or using remote executors (see {ref}`executors-overview`), e.g.: ```bash nemo-evaluator-launcher export <invocation_id_1> <invocation_id_2> --dest local --format json --output_dir combined-results ``` will create `combined-results/processed_results.json` with the same structure as in the example above. (execution-backend)= # Execution Backend NeMo Evaluator can run in various environments: locally, on a cluster, or within NVIDIA Lepton. We refer to these environments as **execution backends**, and NeMo Evaluator Launcher provides corresponding **Executors** that take care of evaluation orchestration in the designated backend. Each executor uses NVIDIA-built Docker containers with pre-installed harnesses, and the Launcher automatically selects and runs the right container for you. ## Executors Different environments require slightly different setup for, among other things, submitting jobs, launching evaluations, and collecting results. Refer to the list below for the executors available today. - **Local Executor**: orchestrates evaluations on a local machine using the Docker daemon.
- **SLURM Executor**: orchestrates evaluations through a SLURM workload manager. - **Lepton Executor**: submits jobs via [Lepton AI](https://www.nvidia.com/en-us/data-center/dgx-cloud-lepton/). :::{note} The SLURM executor operates only on a SLURM cluster with the Pyxis SPANK plugin installed. Pyxis allows unprivileged cluster users to run containerized tasks through the `srun` command. Visit the [Pyxis GitHub repository](https://github.com/NVIDIA/pyxis) to learn more. ::: Naturally, each executor might require additional configuration. For example, NeMo Evaluator Launcher needs information on the partition, account, and selected nodes for the SLURM execution backend. (fdf-concept)= # Framework Definition Files ::::{note} **Who needs this?** This documentation is for framework developers and organizations creating custom evaluation frameworks. If you're running existing evaluation tasks using {ref}`nemo-evaluator-launcher` (NeMo Evaluator Launcher CLI) or {ref}`nemo-evaluator` (NeMo Evaluator CLI), you don't need to create FDFs—they're already provided by framework packages. :::: A Framework Definition File (FDF) is a YAML configuration file that serves as the single source of truth for integrating evaluation frameworks into the NeMo Evaluator ecosystem. FDFs define how evaluation frameworks are configured, executed, and integrated with the Eval Factory system.
## What an FDF Defines An FDF specifies five key aspects of an evaluation framework: - **Framework metadata**: Name, description, package information, and repository URL - **Default configurations**: Parameters, commands, and settings that apply across all evaluations within that framework - **Evaluation types**: Available evaluation tasks and their specific configurations - **Execution commands**: Jinja2-templated commands for running evaluations with dynamic parameter injection - **API compatibility**: Supported endpoint types (chat, completions, vlm, embedding) and their configurations ## How FDFs Integrate with NeMo Evaluator FDFs sit at the integration point between your evaluation framework's CLI and NeMo Evaluator's orchestration system: ```{mermaid} graph LR A[User runs
nemo-evaluator] --> B[System loads
framework.yml] B --> C[Merges defaults +
user evaluation config] C --> D[Renders Jinja2
command template] D --> E[Executes your
CLI command] E --> F[Parses output] style B fill:#e1f5fe style D fill:#fff3e0 style E fill:#f3e5f5 ``` **The workflow:** 1. When you run `nemo-evaluator` (see {ref}`nemo-evaluator-cli`), the system discovers and loads your FDF (`framework.yml`) 2. Configuration values are merged from framework defaults, evaluation-specific settings, and user overrides (see {ref}`parameter-overrides`) 3. The system renders the Jinja2 command template with the merged configuration 4. Your framework's CLI is executed with the generated command 5. Results are parsed and processed by the system This architecture allows you to integrate any evaluation framework that exposes a CLI interface, without modifying NeMo Evaluator's core code. ## Key Concepts ### Jinja2 Templating FDFs use Jinja2 template syntax to inject configuration values dynamically into command strings. Variables are referenced using `{{variable}}` syntax: ```yaml command: >- my-eval-cli --model {{target.api_endpoint.model_id}} --task {{config.params.task}} --output {{config.output_dir}} ``` At runtime, these variables are replaced with actual values from the configuration. ### Parameter Inheritance Configuration values cascade through multiple layers, with later layers overriding earlier ones: 1. **Framework defaults**: Base configuration in the FDF's `defaults` section 2. **Evaluation defaults**: Task-specific overrides in the `evaluations` section 3. **User configuration**: Values from run configuration files 4. **CLI overrides**: Command-line arguments passed at runtime This inheritance model allows you to define sensible defaults while giving users full control over specific runs. For detailed examples and patterns, see {ref}`advanced-features`. ### Endpoint Types Evaluations declare which API endpoint types they support (see {ref}`evaluation-model` for details). 
NeMo Evaluator uses adapters to translate between different API formats: - **`chat`**: OpenAI-compatible chat completions (messages with roles) - **`completions`**: Text completion endpoints (prompt in, text out) - **`vlm`**: Vision-language models (text + image inputs) - **`embedding`**: Embedding generation endpoints Your FDF specifies which types each evaluation supports, and the system validates compatibility at runtime. ### Validation FDFs are validated when loaded to catch configuration errors early: - **Schema validation**: Pydantic models ensure required fields exist and have correct types - **Template validation**: Jinja2 templates are parsed with `StrictUndefined` to catch undefined variables - **Reference validation**: Template variables must reference valid fields in the configuration model - **Consistency validation**: Endpoint types and parameters are checked for consistency Validation failures produce clear error messages that help you fix configuration issues before runtime. For common validation errors and solutions, see {ref}`fdf-troubleshooting`. ## File Structure An FDF follows a three-section hierarchical structure: ```yaml framework: # Framework identification and metadata name: my-eval pkg_name: my_eval full_name: My Evaluation Framework description: Evaluates specific capabilities url: https://github.com/example/my-eval defaults: # Default configurations and commands command: >- my-eval-cli --model {{target.api_endpoint.model_id}} config: params: temperature: 0.0 target: api_endpoint: type: chat evaluations: # Available evaluation types - name: task_1 description: First task defaults: config: params: task: task_1 ``` ## Next Steps Ready to create your own FDF? Refer to {ref}`framework-definition-file` for detailed reference documentation and practical guidance on building Framework Definition Files. (about-concepts)= # Concepts Use this section to understand how {{ product_name_short }} works at a high level. 
Start with the evaluation model, then read about adapters and deployment choices. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} Evaluation Model :link: evaluation-model :link-type: ref Core evaluation types, OpenAI-compatible endpoints, and metrics. ::: :::{grid-item-card} Execution Backend :link: execution-backend :link-type: ref Your runtime execution environment. ::: :::{grid-item-card} Evaluation Output :link: evaluation-output :link-type: doc Standardized output structure across all harnesses and tasks is what makes Evaluator powerful. ::: :::{grid-item-card} Framework Definition Files :link: fdf-concept :link-type: ref YAML configuration files that integrate evaluation frameworks into NeMo Evaluator. ::: :::{grid-item-card} Interceptors :link: interceptors :link-type: doc Advanced request/response processing with configurable interceptor pipelines. ::: :::: ```{toctree} :hidden: Architecture Evaluation Model Evaluation Output Execution Backend Framework Definition Files Interceptors ``` (interceptors-concepts)= # Interceptors The **interceptor system** is a core architectural concept in NeMo Evaluator that enables sophisticated request and response processing during model evaluation. The key takeaway is that interceptors provide functionality that can be plugged into many evaluation harnesses without modifying their underlying code. If you configure at least one interceptor in your evaluation pipeline, a lightweight middleware server is started next to the evaluation runtime. This server transforms simple API calls through a two-phase pipeline: 1. **Request Processing**: Interceptors modify outgoing requests (system prompts, parameters) before they reach the endpoint 2.
**Response Processing**: Interceptors extract reasoning, log data, cache results, and track statistics after receiving responses :::{note} You might notice in the evaluation logs that the evaluation harness sends requests to `localhost` on a port near `3825` instead of the URL you provided. This is the middleware server at work. ::: ## Conceptual Overview The interceptor system transforms simple model API calls into sophisticated evaluation workflows through a configurable pipeline of **interceptors**. This design provides: - **Modularity**: Each interceptor handles a specific concern (logging, caching, reasoning) - **Composability**: Multiple interceptors can be chained together - **Configurability**: Interceptors can be enabled/disabled and configured independently - **Extensibility**: Custom interceptors can be added for specialized processing The following diagram shows a typical interceptor pipeline configuration. Note that interceptors must follow the order: Request → RequestToResponse → Response, but the specific interceptors and their configuration are flexible: ```{mermaid} graph TB A[Evaluation Request] --> B[Adapter System] B --> C[Interceptor Pipeline] C --> D[Model API] D --> E[Response Pipeline] E --> F[Processed Response] subgraph "Request Processing" C --> G[System Message] G --> H[Payload Modifier] H --> I[Request Logging] I --> J[Caching Check] J --> K[Endpoint Call] end subgraph "Response Processing" E --> L[Response Logging] L --> M[Reasoning Extraction] M --> N[Progress Tracking] N --> O[Cache Storage] end style B fill:#f3e5f5 style C fill:#e1f5fe style E fill:#e8f5e8 ``` ## Core Concepts ### Adapter Server The **Adapter Server** is a lightweight server that handles communication between the evaluation harness and the endpoint under test.
It provides: - **Configuration Management**: Unified interface for interceptor settings - **Pipeline Coordination**: Manages the flow of requests through interceptors - **Resource Management**: Handles shared resources like caches and logs - **Error Handling**: Provides consistent error handling across interceptors ### Interceptors **Interceptors** are modular components that process requests and responses. Key characteristics: - **Dual Interface**: Each interceptor can process both requests and responses - **Context Awareness**: Access to evaluation context (benchmark type, model info) - **Stateful Processing**: Can maintain state across requests - **Chainable**: Multiple interceptors work together in sequence ## Interceptor Categories ### Processing Interceptors Transform or augment requests and responses: - **System Message**: Inject custom system prompts - **Payload Modifier**: Modify request parameters - **Reasoning**: Extract chain-of-thought reasoning ### Infrastructure Interceptors Provide supporting capabilities: - **Caching**: Store and retrieve responses - **Logging**: Capture request/response data - **Progress Tracking**: Monitor evaluation progress - **Response Stats**: Track request statistics and metrics - **Raise Client Error**: Raise exceptions for client errors (4xx status codes, excluding 408 and 429) ### Integration Interceptors Handle external system integration: - **Endpoint**: Route requests to model APIs ## Configuration Philosophy The adapter system follows a **configuration-over-code** philosophy: ### Simple Configuration Enable basic features with minimal configuration: :::{code-block} python from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig adapter_config = AdapterConfig( interceptors=[ InterceptorConfig(name="caching", enabled=True), InterceptorConfig(name="request_logging", enabled=True), InterceptorConfig(name="endpoint") ] ) ::: ### Advanced Configuration Full control over interceptor behavior: 
:::{code-block} python adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="system_message", enabled=True, config={"system_message": "You are an expert."} ), InterceptorConfig( name="caching", enabled=True, config={"cache_dir": "./cache"} ), InterceptorConfig( name="request_logging", enabled=True, config={"max_requests": 1000} ), InterceptorConfig( name="reasoning", enabled=True, config={ "start_reasoning_token": "", "end_reasoning_token": "" } ), InterceptorConfig(name="endpoint") ] ) ::: ### YAML Configuration Declarative configuration for reproducibility: ```yaml adapter_config: interceptors: - name: system_message enabled: true config: system_message: "Think step by step." - name: caching enabled: true - name: reasoning enabled: true - name: endpoint ``` ## Design Benefits 1. **Separation of Concerns**: Each interceptor handles a single responsibility, making the system easier to understand and maintain. 2. **Reusability**: Interceptors can be reused across different evaluation scenarios and benchmarks. 3. **Testability**: Individual interceptors can be tested in isolation, improving reliability. 4. **Performance**: Interceptors can be optimized independently and disabled when not needed. 5. **Extensibility**: New interceptors can be added without modifying existing code. 
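The two-phase flow described above can be made concrete with a small chaining sketch. This is *not* the NeMo Evaluator implementation — the class names and the `on_request`/`on_response` hooks below are hypothetical, chosen only to illustrate how request-phase and response-phase processing compose around an endpoint call:

```python
# Illustrative sketch of the interceptor chaining pattern (hypothetical API,
# not the NeMo Evaluator implementation).
from dataclasses import dataclass, field


@dataclass
class Request:
    messages: list
    params: dict = field(default_factory=dict)


@dataclass
class Response:
    text: str
    metadata: dict = field(default_factory=dict)


class SystemMessage:
    """Request phase: inject a system prompt before the endpoint is called."""

    def __init__(self, system_message: str):
        self.system_message = system_message

    def on_request(self, request: Request) -> Request:
        request.messages.insert(0, {"role": "system", "content": self.system_message})
        return request

    def on_response(self, response: Response) -> Response:
        return response  # no response-phase behavior


class Reasoning:
    """Response phase: move a <think>...</think> span into metadata."""

    def on_request(self, request: Request) -> Request:
        return request

    def on_response(self, response: Response) -> Response:
        start = response.text.find("<think>")
        end = response.text.find("</think>")
        if start != -1 and end != -1:
            response.metadata["reasoning"] = response.text[start + len("<think>"):end]
            response.text = response.text[:start] + response.text[end + len("</think>"):]
        return response


class Pipeline:
    """Chain interceptors around a callable endpoint (Request -> Response)."""

    def __init__(self, interceptors, endpoint):
        self.interceptors = interceptors
        self.endpoint = endpoint

    def __call__(self, request: Request) -> Response:
        for ic in self.interceptors:            # request phase, in order
            request = ic.on_request(request)
        response = self.endpoint(request)
        for ic in reversed(self.interceptors):  # response phase, in reverse
            response = ic.on_response(response)
        return response


if __name__ == "__main__":
    # A fake endpoint standing in for the model API.
    fake_endpoint = lambda req: Response(text="<think>2+2=4</think>The answer is 4.")
    pipe = Pipeline([SystemMessage("Think step by step."), Reasoning()], fake_endpoint)
    out = pipe(Request(messages=[{"role": "user", "content": "What is 2+2?"}]))
    print(out.text)                   # The answer is 4.
    print(out.metadata["reasoning"])  # 2+2=4
```

Unwinding the response phase in reverse order mirrors how middleware stacks typically behave; the actual ordering rules the system enforces are the ones stated above (Request → RequestToResponse → Response).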
## Common Use Cases ### Research Workflows - **Reasoning Analysis**: Extract and analyze model reasoning patterns - **Prompt Engineering**: Test different system prompts systematically - **Behavior Studies**: Log detailed interactions for analysis ### Production Evaluations - **Performance Optimization**: Cache responses to reduce API costs - **Monitoring**: Track evaluation progress and performance metrics - **Compliance**: Maintain audit trails of all interactions ### Development and Debugging - **Request Inspection**: Log requests to debug evaluation issues - **Response Analysis**: Capture detailed response data - **Error Tracking**: Monitor and handle evaluation failures ## Integration with Evaluation Frameworks The adapter system integrates seamlessly with evaluation frameworks: - **Framework Agnostic**: Works with any OpenAI-compatible API - **Benchmark Independent**: Same interceptors work across different benchmarks - **Container Compatible**: Integrates with containerized evaluation frameworks ## Next Steps For detailed implementation information, see: - **{ref}`nemo-evaluator-interceptors`**: Individual interceptor guides with configuration examples - **{ref}`interceptor-caching`**: Response caching setup and optimization - **{ref}`interceptor-reasoning`**: Chain-of-thought processing configuration The adapter and interceptor system represents a fundamental shift from simple API calls to sophisticated, configurable evaluation workflows that can adapt to diverse research and production needs. (about-key-features)= # Key Features NeMo Evaluator SDK delivers comprehensive AI model evaluation through a dual-library architecture that scales from local development to enterprise production. Experience container-first reproducibility, multi-backend execution, and a comprehensive set of state-of-the-art benchmarks.
## **Unified Orchestration (NeMo Evaluator Launcher)** ### Multi-Backend Execution Run evaluations anywhere with unified configuration and monitoring: - **Local Execution**: Docker-based evaluation on your workstation - **HPC Clusters**: Slurm integration for large-scale parallel evaluation - **Cloud Platforms**: Lepton AI and custom cloud backend support - **Hybrid Workflows**: Mix local development with cloud production ```bash # Single command, multiple backends nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_basic.yaml ``` ### Evaluation Benchmarks & Tasks Access a comprehensive benchmark suite with a single CLI: ```bash # Discover available benchmarks nemo-evaluator-launcher ls tasks # Run academic benchmarks nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml ``` ### Built-in Result Export First-class integration with MLOps platforms: ```bash # Export to MLflow nemo-evaluator-launcher export --dest mlflow # Export to Weights & Biases nemo-evaluator-launcher export --dest wandb # Export to Google Sheets nemo-evaluator-launcher export --dest gsheets ``` ## **Core Evaluation Engine (NeMo Evaluator Core)** ### Container-First Architecture Pre-built NGC containers guarantee reproducible results across environments: ```bash # Pull and run any evaluation container docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} ``` ### Advanced Adapter System Sophisticated request/response processing pipeline with interceptor architecture: ```yaml # Configure adapter system in YAML configuration target: api_endpoint: url: "http://localhost:8080/v1/completions/" model_id: "my-model"
adapter_config: interceptors: # System message interceptor - name: system_message config: system_message: "You are a helpful AI assistant. Think step by step." # Request logging interceptor - name: request_logging config: max_requests: 1000 # Caching interceptor - name: caching config: cache_dir: "./evaluation_cache" # Communication with http://localhost:8080/v1/completions/ - name: endpoint # Processing of reasoning traces - name: reasoning config: start_reasoning_token: "" end_reasoning_token: "" # Response logging interceptor - name: response_logging config: max_responses: 1000 # Progress tracking interceptor - name: progress_tracking ``` ### Programmatic API Full Python API for integration into ML pipelines: ```python from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import EvaluationConfig, EvaluationTarget # Configure and run evaluation programmatically result = evaluate( eval_cfg=EvaluationConfig(type="mmlu_pro", output_dir="./results"), target_cfg=EvaluationTarget(api_endpoint=endpoint_config) ) ``` ## **Container Direct Access** ### NGC Container Catalog Direct access to specialized evaluation containers through [NGC Catalog](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=label%3A%22NSPECT-JL1B-TVGU%22): ```bash # Academic benchmarks docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} # Code generation evaluation docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} # Safety and security testing docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }} # Vision-language model evaluation docker run --rm -it nvcr.io/nvidia/eval-factory/vlmevalkit:{{ docker_compose_latest }} ``` ### Reproducible Evaluation Environments Every container provides: - **Fixed dependencies**: Locked versions for consistent results - **Pre-configured frameworks**: Ready-to-run evaluation harnesses - **Isolated 
execution**: No dependency conflicts between evaluations - **Version tracking**: Tagged releases for exact reproducibility ## **Enterprise Features** ### Multi-Backend Scalability Scale from laptop to datacenter with unified configuration: - **Local Development**: Quick iteration with Docker - **HPC Clusters**: Slurm integration for large-scale evaluation - **Cloud Platforms**: Lepton AI and custom backend support - **Hybrid Workflows**: Seamless transition between environments ### Advanced Configuration Management Hydra-based configuration with full reproducibility: ```yaml # Evaluation configuration with custom parameters evaluation: tasks: - name: mmlu_pro nemo_evaluator_config: config: params: limit_samples: 1000 - name: gsm8k nemo_evaluator_config: config: params: temperature: 0.0 execution: output_dir: results target: api_endpoint: url: https://my-model-endpoint.com/v1/chat/completions model_id: my-custom-model ``` ## **OpenAI API Compatibility** ### Universal Model Support NeMo Evaluator supports OpenAI-compatible API endpoints: - **Hosted Models**: NVIDIA Build, OpenAI, Anthropic, Cohere - **Self-Hosted**: vLLM, TRT-LLM, NeMo Framework - **Custom Endpoints**: Any service implementing OpenAI API spec (test compatibility with our {ref}`deployment-testing-compatibility` guide) The platform supports the following endpoint types: - **`completions`**: Direct text completion without chat formatting (`/v1/completions`). Used for base models and academic benchmarks. - **`chat`**: Conversational interface with role-based messages (`/v1/chat/completions`). Used for instruction-tuned and chat models. - **`vlm`**: Vision-language model endpoints supporting image inputs. - **`embedding`**: Embedding generation endpoints for retrieval evaluation. 
### Endpoint Type Support Support for diverse evaluation endpoint types through the evaluation configuration: ```yaml # Text generation evaluation (chat endpoint) target: api_endpoint: type: chat url: https://api.example.com/v1/chat/completions # Log-probability evaluation (completions endpoint) target: api_endpoint: type: completions url: https://api.example.com/v1/completions # Vision-language evaluation (vlm endpoint) target: api_endpoint: type: vlm url: https://api.example.com/v1/chat/completions # Retrieval evaluation (embedding endpoint) target: api_endpoint: type: embedding url: https://api.example.com/v1/embeddings ``` ## **Extensibility and Customization** ### Custom Framework Support Add your own evaluation frameworks using Framework Definition Files: ```yaml # custom_framework.yml framework: name: my_custom_eval description: Custom evaluation for domain-specific tasks defaults: command: >- python custom_eval.py --model {{target.api_endpoint.model_id}} --task {{config.params.task}} --output {{config.output_dir}} evaluations: - name: domain_specific_task description: Evaluate domain-specific capabilities defaults: config: params: task: domain_task temperature: 0.0 ``` ### Advanced Interceptor Configuration Fine-tune request/response processing with the adapter system through YAML configuration: ```yaml # Production-ready adapter configuration in framework YAML target: api_endpoint: url: "https://production-api.com/v1/completions" model_id: "production-model" adapter_config: log_failed_requests: true interceptors: # System message interceptor - name: system_message config: system_message: "You are an expert AI assistant specialized in this domain."
# Request logging interceptor - name: request_logging config: max_requests: 5000 # Caching interceptor - name: caching config: cache_dir: "./production_cache" # Reasoning interceptor - name: reasoning config: start_reasoning_token: "" end_reasoning_token: "" # Response logging interceptor - name: response_logging config: max_responses: 5000 # Progress tracking interceptor - name: progress_tracking config: progress_tracking_url: "http://monitoring.internal:3828/progress" ``` ## **Security and Safety** ### Comprehensive Safety Evaluation Built-in safety assessment through specialized containers: ```bash # Run Aegis and Garak evaluations export JUDGE_API_KEY=your-judge-api-key # token with access to your judge endpoint export HF_TOKEN_FOR_AEGIS_V2=hf_your-token # HF token with access to the gated Aegis dataset and meta-llama/Llama-3.1-8B-Instruct export NGC_API_KEY=nvapi-your-key # token with access to build.nvidia.com # set judge.url in the config or pass with -o nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_safety.yaml ``` **Safety Containers Available:** - **safety-harness**: Content safety evaluation using NemoGuard judge models - **garak**: Security vulnerability scanning and prompt injection detection ## **Monitoring and Observability** ### Real-Time Progress Tracking Monitor evaluation progress across all backends: ```bash # Check evaluation status nemo-evaluator-launcher status # Kill running evaluations nemo-evaluator-launcher kill ``` ### Result Export and Analysis Export evaluation results to MLOps platforms for downstream analysis: ```bash # Export to MLflow for experiment tracking nemo-evaluator-launcher export --dest mlflow # Export to Weights & Biases for visualization nemo-evaluator-launcher export --dest wandb # Export to Google Sheets for sharing nemo-evaluator-launcher export --dest gsheets ``` (about-release-notes)= # Release Notes ## NeMo Evaluator SDK — General Availability (0.1.0) NVIDIA is excited to announce
the general availability of NeMo Evaluator SDK, an open-source platform for robust, reproducible, and scalable evaluation of large language models. ### Overview NeMo Evaluator SDK provides a comprehensive solution for AI model evaluation and benchmarking, enabling researchers, ML engineers, and organizations to assess model performance across diverse capabilities including reasoning, code generation, function calling, and safety. The platform consists of two core libraries: - **{ref}`nemo-evaluator `**: The core evaluation engine that manages interactions between evaluation harnesses and models being tested - **{ref}`nemo-evaluator-launcher `**: The orchestration layer providing unified CLI and programmatic interfaces for multi-backend execution ### Key Features **Reproducibility by Default**: All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations. **Scale Anywhere**: Run evaluations from a local machine to a Slurm cluster or cloud-native backends without changing your workflow. **State-of-the-Art Benchmarking**: Access a comprehensive suite of over 100 benchmarks from 21+ popular open-source evaluation harnesses, including frameworks such as lm-evaluation-harness, bigcode-evaluation-harness, simple-evals, and specialized tools for safety, function calling, and agentic AI evaluation. **Extensible and Customizable**: Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling. **OpenAI-Compatible API Support**: Evaluate any model that exposes an OpenAI-compatible endpoint, including hosted services (build.nvidia.com), self-hosted solutions (NVIDIA NIM, vLLM, TensorRT-LLM), and models trained with NeMo framework. **Containerized Execution**: All evaluations run in open-source Docker containers for auditable and trustworthy results, with pre-built containers available through the NVIDIA NGC catalog.
(get-started-overview)= # Get Started ## Before You Start Before you begin, make sure you have: - **Python Environment**: Python 3.10 or higher (up to 3.13) - **OpenAI-Compatible Endpoint**: Hosted or self-deployed model API - **Docker**: For container-based evaluation workflows (optional) - **NVIDIA GPU**: For local model deployment (optional) --- ## Quick Start Path ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Installation :link: gs-install :link-type: ref Install {{ product_name_short }} and set up your evaluation environment with all necessary dependencies. ::: :::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Quick Start :link: gs-quickstart :link-type: ref Deploy your first model and run a simple evaluation in just a few minutes. ::: :::: ## Entry Point Decision Guide NeMo Evaluator provides three primary entry points, each designed for different user needs and workflows. Use this guide to choose the right approach for your use case. ```{mermaid} flowchart TD A[I need to evaluate AI models] --> B{What's your primary goal?} B -->|Quick evaluations with minimal setup| C[NeMo Evaluator Launcher] B -->|Custom integrations and workflows| D[NeMo Evaluator Core] B -->|Direct container control| E[Direct Container Usage] C --> C1[ Unified CLI interface
Multi-backend execution
Built-in result export
100+ benchmarks ready] D --> D1[ Programmatic API control
Custom evaluation workflows
Adapter/interceptor system
Framework extensions] E --> E1[ Maximum flexibility
Custom container workflows
Direct framework access
Advanced users only] C1 --> F[Start with Launcher Quickstart] D1 --> G[Start with Core API Guide] E1 --> H[Start with Container Reference] style C fill:#e1f5fe style D fill:#f3e5f5 style E fill:#fff3e0 ``` ## What You'll Learn By the end of this section, you'll be able to: 1. **Install and configure** NeMo Evaluator components for your needs 2. **Choose the right approach** from the three-tier architecture 3. **Run your first evaluation** using hosted or self-deployed endpoints 4. **Configure advanced features** like adapters and interceptors 5. **Integrate evaluations** into your ML workflows ## Typical Workflows ### **Launcher Workflow** (Most Users) 1. **Install** NeMo Evaluator Launcher 2. **Configure** endpoint and benchmarks in YAML 3. **Run** evaluations with single CLI command 4. **Export** results to MLflow, W&B, or local files ### **Core API Workflow** (Developers) 1. **Install** NeMo Evaluator Core library 2. **Configure** adapters and interceptors programmatically 3. **Integrate** into existing ML pipelines 4. **Customize** evaluation logic and processing ### **Container Workflow** (Container Users) 1. **Pull** pre-built evaluation containers 2. **Run** evaluations directly in isolated environments 3. **Mount** data and results for persistence 4. **Combine** with existing container orchestration (gs-install)= # Installation Guide NeMo Evaluator provides multiple installation paths depending on your needs. Choose the approach that best fits your use case. ## Choose Your Installation Path ```{list-table} Installation Path Comparison :header-rows: 1 :widths: 25 25 50 * - **Installation Path** - **Best For** - **Key Features** * - **NeMo Evaluator Launcher** (Recommended) - Most users who want unified CLI and orchestration across backends - • Unified CLI for 100+ benchmarks • Multi-backend execution (local, Slurm, cloud) • Built-in result export to MLflow, W&B, etc. 
• Configuration management with examples * - **NeMo Evaluator Core** - Developers building custom evaluation pipelines - • Programmatic Python API • Direct container access • Custom framework integration • Advanced adapter configuration * - **Container Direct** - Users who prefer container-based workflows - • Pre-built NGC evaluation containers • Guaranteed reproducibility • No local installation required • Isolated evaluation environments ``` --- ## Prerequisites ### System Requirements - Python 3.10 or higher (supports 3.10, 3.11, 3.12, and 3.13) - CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100) - Docker (for container-based workflows) ### Recommended Environment - Python 3.12 - CUDA 12.9 - Ubuntu 24.04 --- ## Installation Methods ::::{tab-set} :::{tab-item} Launcher (Recommended) Install NeMo Evaluator Launcher for unified CLI and orchestration: ```{literalinclude} _snippets/install_launcher.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` Quick verification: ```{literalinclude} _snippets/verify_launcher.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ::: :::{tab-item} Core Library Install NeMo Evaluator Core for programmatic access: ```{literalinclude} _snippets/install_core.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` Quick verification: ```{literalinclude} _snippets/verify_core.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ::: :::{tab-item} NGC Containers Use pre-built evaluation containers from NVIDIA NGC for guaranteed reproducibility: ```bash # Pull evaluation containers (no local installation needed) docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }} docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} ``` ```bash # Run 
container interactively docker run --rm -it \ -v $(pwd)/results:/workspace/results \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash # Or run evaluation directly docker run --rm \ -v $(pwd)/results:/workspace/results \ -e NGC_API_KEY=nvapi-xxx \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_id meta/llama-3.2-3b-instruct \ --api_key_name NGC_API_KEY \ --output_dir /workspace/results ``` Quick verification: ```bash # Test container access docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ nemo-evaluator ls | head -5 echo " Container access verified" ``` ::: :::: --- ## Clone the Repository Clone the NeMo Evaluator repository to get easy access to our ready-to-use examples: ```bash git clone https://github.com/NVIDIA-NeMo/Evaluator.git ``` Run the example: ```bash cd Evaluator/ export NGC_API_KEY=nvapi-... 
# API Key with access to build.nvidia.com nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_reasoning.yaml \ --override execution.output_dir=nemotron-eval ``` ## Add Evaluation Harnesses to Your Environment Build your custom evaluation pipeline by adding evaluation harness packages to your environment of choice: ```bash pip install nemo-evaluator ``` (core-wheels)= ### Available PyPI Packages ```{list-table} :header-rows: 1 :widths: 30 70 * - Package Name - PyPI URL * - nvidia-bfcl - * - nvidia-bigcode-eval - * - nvidia-compute-eval - * - nvidia-eval-factory-garak - * - nvidia-genai-perf-eval - * - nvidia-crfm-helm - * - nvidia-hle - * - nvidia-ifbench - * - nvidia-livecodebench - * - nvidia-lm-eval - * - nvidia-mmath - * - nvidia-mtbench-evaluator - * - nvidia-eval-factory-nemo-skills - * - nvidia-safety-harness - * - nvidia-scicode - * - nvidia-simple-evals - * - nvidia-tooltalk - * - nvidia-vlmeval - ``` :::{note} Evaluation harnesses that require complex environments are not available as packages but only as containers. ::: (gs-quickstart-container)= # Container Direct **Best for**: Users who prefer container-based workflows The Container Direct approach gives you full control over the container environment with volume mounting, environment variable management, and integration into Docker-based CI/CD pipelines. ## Prerequisites - Docker with GPU support - OpenAI-compatible endpoint ## Quick Start ```bash # 1. Pull evaluation container docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} # 2. Run container interactively docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash # 3. Inside container - set up environment export NGC_API_KEY=nvapi-your-key-here export HF_TOKEN=hf_your-token-here # If using gated datasets # 4.
Run evaluation nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name NGC_API_KEY \ --output_dir /tmp/results \ --overrides 'config.params.limit_samples=10' # Remove to run on full benchmark ``` ## Complete Container Workflow Here's a complete example with volume mounting and advanced configuration: ```bash # 1. Create local directories for persistent storage mkdir -p ./results ./cache ./logs # 2. Run container with volume mounts docker run --rm -it \ -v $(pwd)/results:/workspace/results \ -v $(pwd)/cache:/workspace/cache \ -v $(pwd)/logs:/workspace/logs \ -e NGC_API_KEY=nvapi-your-key-here \ -e HF_TOKEN=hf_your-token-here \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash # 3. Inside container - run evaluation nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name NGC_API_KEY \ --output_dir /workspace/results \ --overrides 'config.params.limit_samples=3' # Remove to run on full benchmark # 4. 
Exit container and check results exit ls -la ./results/ ``` ## One-Liner Container Execution For automated workflows, you can run everything in a single command: ```bash NGC_API_KEY=nvapi-your-key-here # Run evaluation directly in container docker run --rm \ -v $(pwd)/results:/workspace/results \ -e NGC_API_KEY="${NGC_API_KEY}" \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --model_id meta/llama-3.2-3b-instruct \ --api_key_name NGC_API_KEY \ --output_dir /workspace/results ``` ## Key Features ### Full Container Control - Direct access to container environment - Custom volume mounting strategies - Environment variable management - GPU resource allocation ### CI/CD Integration - Single-command execution for automation - Docker Compose compatibility - Kubernetes deployment ready - Pipeline integration capabilities ### Persistent Storage - Volume mounting for results persistence - Cache directory management - Log file preservation - Custom configuration mounting ### Environment Isolation - Clean, reproducible environments - Dependency management handled - Version pinning through container tags - No local Python environment conflicts ## Advanced Container Patterns ### Docker Compose Integration ```yaml # docker-compose.yml version: '3.8' services: nemo-eval: image: nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} volumes: - ./results:/workspace/results - ./cache:/workspace/cache - ./configs:/workspace/configs environment: - MY_API_KEY=${NGC_API_KEY} command: | nemo-evaluator run_eval --eval_type mmlu_pro --model_id meta/llama-3.2-3b-instruct --model_url https://integrate.api.nvidia.com/v1/chat/completions --model_type chat --api_key_name MY_API_KEY --output_dir /workspace/results ``` ### Batch Processing Script ```bash #!/bin/bash # batch_eval.sh BENCHMARKS=("mmlu_pro" "gpqa_diamond" 
"humaneval") NGC_API_KEY=nvapi-your-key-here HF_TOKEN=hf_your-token-here # Needed for GPQA-Diamond (gated dataset) for benchmark in "${BENCHMARKS[@]}"; do echo "Running evaluation for $benchmark..." docker run --rm \ -v $(pwd)/results:/workspace/results \ -e MY_API_KEY=$NGC_API_KEY \ -e HF_TOKEN=$HF_TOKEN \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ nemo-evaluator run_eval \ --eval_type $benchmark \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir /workspace/results/$benchmark \ --overrides 'config.params.limit_samples=10' echo "Completed $benchmark evaluation" done echo "All evaluations completed. Results in ./results/" ``` ## Next Steps - Integrate into your CI/CD pipelines - Explore Docker Compose for multi-service setups - Consider Kubernetes deployment for scale - Try {ref}`gs-quickstart-launcher` for simplified workflows - See {ref}`gs-quickstart-core` for programmatic API and advanced adapter features (gs-quickstart-core)= # NeMo Evaluator Core **Best for**: Developers who need programmatic control The NeMo Evaluator Core provides direct Python API access for custom configurations and integration into existing Python workflows. ## Prerequisites - Python environment - OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated) - Verify endpoint compatibility using our {ref}`deployment-testing-compatibility` guide ## Quick Start ```bash # 1. Install the nemo-evaluator and nvidia-simple-evals pip install nemo-evaluator nvidia-simple-evals # 2. List available benchmarks and tasks nemo-evaluator ls # 3. Run evaluation # Prerequisites: Set your API key export NGC_API_KEY="nvapi-..." 
# Launch using python: ``` ```{literalinclude} ../_snippets/core_basic.py :language: python :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ## Complete Working Example ### Using Python API ```{literalinclude} ../_snippets/core_full_example.py :language: python :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ### Using CLI ```{literalinclude} ../_snippets/core_full_cli.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ## Key Features ### Programmatic Integration - Direct Python API access - Pydantic-based configuration with type hints - Integration with existing Python workflows ### Evaluation Configuration - Fine-grained parameter control via `ConfigParams` - Multiple evaluation types: `mmlu_pro`, `gsm8k`, `hellaswag`, and more - Configurable sampling, temperature, and token limits ### Endpoint Support - Chat endpoints (`EndpointType.CHAT`) - Completion endpoints (`EndpointType.COMPLETIONS`) - VLM endpoints (`EndpointType.VLM`) - Embedding endpoints (`EndpointType.EMBEDDING`) ## Advanced Usage Patterns ### Multi-Benchmark Evaluation ```{literalinclude} ../_snippets/core_multi_benchmark.py :language: python :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ### Discovering Installed Benchmarks ```python from nemo_evaluator import show_available_tasks # List all installed evaluation tasks show_available_tasks() ``` :::{tip} To extend the list of benchmarks install additional harnesses. See the list of evaluation harnesses available as PyPI wheels: {ref}`core-wheels`. 
:::

### Using Adapters and Interceptors

For advanced evaluation scenarios, configure the adapter system with interceptors for request/response processing, caching, logging, and more:

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure evaluation target with adapter
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type=EndpointType.COMPLETIONS,
    model_id="my_model"
)

# Create adapter configuration with interceptors
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="system_message",
            config={"system_message": "You are a helpful AI assistant. Think step by step."}
        ),
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 50}
        ),
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./evaluation_cache",
                "reuse_cached_responses": True
            }
        ),
        InterceptorConfig(
            name="endpoint",
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 50}
        ),
        InterceptorConfig(
            name="reasoning",
            config={
                "start_reasoning_token": "<think>",
                "end_reasoning_token": "</think>"
            }
        ),
        InterceptorConfig(
            name="progress_tracking",
            config={"progress_tracking_url": "http://localhost:3828/progress"}
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)

# Run evaluation with full adapter pipeline
config = EvaluationConfig(
    type="gsm8k",
    output_dir="./results/gsm8k",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=512,
        parallelism=1
    )
)

if __name__ == "__main__":
    result = evaluate(eval_cfg=config, target_cfg=target)
    print(f"Evaluation completed: {result}")
```

**Available Interceptors:**

- `system_message`: Add custom system prompts to chat requests
- `request_logging`: Log incoming requests for debugging
- `response_logging`: Log outgoing responses for debugging
- `caching`: Cache
responses to reduce API costs and speed up reruns
- `reasoning`: Extract chain-of-thought reasoning from model responses
- `progress_tracking`: Track evaluation progress and send updates

For complete adapter documentation, refer to {ref}`adapters-usage`.

## Next Steps

- Integrate into your existing Python workflows
- Run multiple benchmarks in sequence
- Explore available evaluation types with `show_available_tasks()`
- Configure adapters and interceptors for advanced evaluation scenarios
- Consider {ref}`gs-quickstart-launcher` for CLI workflows
- Try {ref}`gs-quickstart-container` for containerized environments

(gs-quickstart)=

# Quickstart

Get up and running with NeMo Evaluator in minutes. Choose your preferred approach based on your needs and experience level.

## Prerequisites

All paths require:

- OpenAI-compatible endpoint (hosted or self-deployed)
- Valid API key for your chosen endpoint

## Quick Reference

| Task | Command |
|------|---------|
| List benchmarks | `nemo-evaluator-launcher ls tasks` |
| Run evaluation | `nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/<config_name>.yaml` |
| Check status | `nemo-evaluator-launcher status <invocation_id>` |
| Job info | `nemo-evaluator-launcher info <invocation_id>` |
| Export results | `nemo-evaluator-launcher export <invocation_id> --dest local --format json` |
| Dry run | Add `--dry-run` to any run command |
| Test with limited samples | Add `-o +config.params.limit_samples=3` |

## Choose Your Path

Select the approach that best matches your workflow and technical requirements:

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` NeMo Evaluator Launcher
:link: gs-quickstart-launcher
:link-type: ref
**Recommended for most users**

Unified CLI experience with automated container management, built-in orchestration, and result export capabilities.
:::

:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` NeMo Evaluator Core
:link: gs-quickstart-core
:link-type: ref
**For Python developers**

Programmatic control with full adapter features, custom configurations, and direct API access for integration into existing workflows.
:::

:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` NeMo Framework Container
:link: gs-quickstart-nemo-fw
:link-type: ref
**For NeMo Framework Users**

End-to-end training and evaluation of large language models (LLMs).
:::

:::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Container Direct
:link: gs-quickstart-container
:link-type: ref
**For container workflows**

Direct container execution with volume mounting, environment control, and integration into Docker-based CI/CD pipelines.
:::

::::

## Model Endpoints

NeMo Evaluator works with any OpenAI-compatible endpoint. You have several options:

### **Hosted Endpoints** (Recommended)

- **NVIDIA Build**: [build.nvidia.com](https://build.nvidia.com) - Ready-to-use hosted models
- **OpenAI**: Standard OpenAI API endpoints
- **Other providers**: Anthropic, Cohere, or any OpenAI-compatible API

### **Self-Hosted Options**

If you prefer to host your own models, verify OpenAI compatibility using our {ref}`deployment-testing-compatibility` guide.

If you deploy the model locally with Docker, you can use a dedicated Docker network, which provides a secure connection between the deployment and evaluation containers.

```bash
# create a dedicated docker network
docker network create my-custom-network

# launch deployment
docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \
    --model microsoft/Phi-4-mini-instruct --max-model-len 8192

# Or use other serving frameworks
# TRT-LLM, NeMo Framework, etc.
```

Create an evaluation config:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: my_phi_test
  extra_docker_args: "--network my-custom-network"  # same network as used for deployment

target:
  api_endpoint:
    model_id: microsoft/Phi-4-mini-instruct
    url: http://my-phi-container:8000/v1/chat/completions
    api_key_name: null

evaluation:
  tasks:
    - name: simple_evals.mmlu_pro
      nemo_evaluator_config:
        config:
          params:
            limit_samples: 10  # TEST ONLY: Limits to 10 samples for quick testing
            parallelism: 1
```

Save the config to a file (e.g. `phi-eval.yaml`) and launch the evaluation:

```bash
nemo-evaluator-launcher run \
  --config ./phi-eval.yaml \
  -o execution.output_dir=./phi-results
```

## Validation and Troubleshooting

### Quick Validation Steps

Before running full evaluations, verify your setup:

```bash
# 1. Test your endpoint connectivity
export NGC_API_KEY=nvapi-...
curl -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.2-3b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 10
  }'

# 2. Run a dry-run to validate configuration
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  --dry-run

# 3.
Run a minimal test with very few samples
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o +config.params.limit_samples=1 \
  -o execution.output_dir=./test_results
```

### Common Issues and Solutions

::::{tab-set}

:::{tab-item} API Key Issues
:sync: api-key

```bash
# Verify your API key is set correctly
echo $NGC_API_KEY

# Test with a simple curl request (see above)
```
:::

:::{tab-item} Container Issues
:sync: container

```bash
# Check Docker is running and has GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

# Pull the latest container if you have issues
docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }}
```
:::

:::{tab-item} Configuration Issues
:sync: config

```bash
# Enable debug logging
export LOG_LEVEL=DEBUG

# Check available evaluation types
nemo-evaluator-launcher ls tasks
```
:::

:::{tab-item} Result Validation
:sync: results

```bash
# Check if results were generated
find ./results -name "*.yml" -type f

# View task results
cat ./results/<invocation_id>/<task_name>/artifacts/results.yml

# Or export and view processed results
nemo-evaluator-launcher export <invocation_id> --dest local --format json
cat ./results/<invocation_id>/processed_results.json
```
:::

::::

## Next Steps

After completing your quickstart:

::::{tab-set}

:::{tab-item} Explore More Benchmarks
:sync: benchmarks

```bash
# List all available tasks
nemo-evaluator-launcher ls tasks

# Run with limited samples for quick testing
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o +config.params.limit_samples=3
```
:::

:::{tab-item} Export Results
:sync: export

```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id> --dest mlflow

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb

# Export to Google Sheets
nemo-evaluator-launcher export <invocation_id> --dest gsheets

# Export to local files
nemo-evaluator-launcher export <invocation_id> --dest local --format json
```
:::

:::{tab-item} Scale to Clusters
:sync: scale

```bash
cd \
packages/nemo-evaluator-launcher

# Run on Slurm cluster
nemo-evaluator-launcher run --config examples/slurm_vllm_basic.yaml

# Run on Lepton AI
nemo-evaluator-launcher run --config examples/lepton_vllm.yaml
```
:::

::::

```{toctree}
:maxdepth: 1
:hidden:

NeMo Evaluator Launcher
NeMo Evaluator Core
NeMo Framework Container
Container Direct
```

(gs-quickstart-launcher)=

# NeMo Evaluator Launcher

**Best for**: Most users who want a unified CLI experience

The NeMo Evaluator Launcher provides the simplest way to run evaluations with automated container management, built-in orchestration, and comprehensive result export capabilities.

## Prerequisites

- OpenAI-compatible endpoint (hosted or self-deployed) and an API key (if the endpoint is gated), referred to below as `NGC_API_KEY` when using models hosted on [NVIDIA's serving platform](https://build.nvidia.com)
- Docker installed (for local execution)
- NeMo Evaluator repository cloned (for access to [examples](https://github.com/NVIDIA-NeMo/Evaluator/tree/main/packages/nemo-evaluator-launcher/examples))

  ```bash
  git clone https://github.com/NVIDIA-NeMo/Evaluator.git
  ```

- A Hugging Face token with access to the GPQA-Diamond dataset (request access [here](https://huggingface.co/datasets/Idavidrein/gpqa)), referred to below as `HF_TOKEN_FOR_GPQA_DIAMOND`.

## Quick Start

```bash
# 1. Install the launcher
pip install nemo-evaluator-launcher

# Optional: Install with specific exporters
pip install "nemo-evaluator-launcher[all]"      # All exporters
pip install "nemo-evaluator-launcher[mlflow]"   # MLflow only
pip install "nemo-evaluator-launcher[wandb]"    # W&B only
pip install "nemo-evaluator-launcher[gsheets]"  # Google Sheets only

# 2. List available benchmarks
nemo-evaluator-launcher ls tasks

# 3. Run evaluation against a hosted endpoint
# Prerequisites: Set your API key and HF token.
# Visit https://huggingface.co/datasets/Idavidrein/gpqa
# to get access to the gated GPQA dataset for this task.
export NGC_API_KEY=nvapi-...
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_...

# Move into the cloned directory (see above).
cd Evaluator
```

```{literalinclude} ../_snippets/launcher_basic.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```

```bash
# 4. Check status
nemo-evaluator-launcher status <invocation_id> --json  # use the ID printed by the run command

# 5. Find all the recent runs you launched
nemo-evaluator-launcher ls runs --since 2h  # list runs from last 2 hours
```

:::{note}
You can use short versions of IDs in the `status` command, for example `abcd` instead of the full `abcdef0123456`, or `ab.0` instead of `abcdef0123456.0`, as long as there are no collisions. This syntactic sugar makes the commands slightly easier to type.
:::

```bash
# 6a. Check the results
cat <output_dir>/artifacts/results.yml  # use the output_dir printed by the run command

# 6b. Check the running logs
tail -f <output_dir>/*/logs/stdout.log  # use the output_dir printed by the run command

# 7a. Export your results (JSON/CSV)
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# 7b. Or see the job details, with lots of useful subcommands inside
nemo-evaluator-launcher info <invocation_id>  # use the ID printed by the run command

# 8. Kill the running job(s)
nemo-evaluator-launcher kill <invocation_id>  # use the ID printed by the run command
```

## Complete Working Example

Here's a complete example using NVIDIA Build (build.nvidia.com):

```bash
# Prerequisites: Set your API key and HF token
export NGC_API_KEY=nvapi-...
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_...
``` ```{literalinclude} ../_snippets/launcher_full_example.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` **What happens:** - Pulls appropriate evaluation container - Runs benchmark against your endpoint - Saves results to specified directory - Provides monitoring and status updates ## Key Features ### Automated Container Management - Automatically pulls and manages evaluation containers - Handles volume mounting for results - No manual Docker commands required ### Built-in Orchestration - Job queuing and parallel execution - Progress monitoring and status tracking ### Result Export - Export to MLflow, Weights & Biases, or local formats - Structured result formatting - Integration with experiment tracking platforms ### Configuration Management - YAML-based configuration system - Override parameters via command line - Template configurations for common scenarios ## Next Steps - Explore different evaluation types: `nemo-evaluator-launcher ls tasks` - Try advanced configurations in the `packages/nemo-evaluator-launcher/examples/` directory - Export results to your preferred tracking platform - Scale to cluster execution with Slurm or cloud providers For more advanced control, consider the {ref}`gs-quickstart-core` Python API or {ref}`gs-quickstart-container` approaches. (gs-quickstart-nemo-fw)= # Evaluate checkpoints trained by NeMo Framework The NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training platform for large language models (LLMs), multimodal models, and speech models. It enables seamless scaling of both pretraining and post-training workloads, from a single GPU to clusters with thousands of nodes, supporting Hugging Face/PyTorch and Megatron models. NeMo includes a suite of libraries and curated training recipes to help users build models from start to finish. 
The NeMo Evaluator is integrated within NeMo Framework, offering streamlined deployment and advanced evaluation capabilities for models trained using NeMo, leveraging state-of-the-art evaluation harnesses.

## Prerequisites

- Docker installed
- CUDA-compatible GPU
- [NeMo Framework docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
- Access to a Megatron Bridge checkpoint

## Quick Start

```bash
# 1. Start NeMo Framework Container
TAG=...
CHECKPOINT_PATH="/path/to/checkpoint/mbridge_llama3_8b/iter_0000000"  # use absolute path
docker run --rm -it -w /workdir -v $(pwd):/workdir -v $CHECKPOINT_PATH:/checkpoint/ \
    --entrypoint bash \
    --gpus all \
    nvcr.io/nvidia/nemo:${TAG}
```

```bash
# Run inside the container:
# 2. Deploy a Model
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /checkpoint \
    --model_id megatron_model \
    --port 8080 \
    --host 0.0.0.0
```

```{literalinclude} ../_snippets/nemo_fw_basic.py
:language: python
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```

## Key Features

- **Multi-Backend Deployment**: Supports PyTriton and multi-instance evaluations using the Ray Serve deployment backend
- **Production-Ready**: Supports high-performance inference with CUDA graphs and flash decoding
- **Multi-GPU and Multi-Node Support**: Enables distributed inference across multiple GPUs and compute nodes
- **OpenAI-Compatible API**: Provides RESTful endpoints aligned with OpenAI API specifications
- **Comprehensive Evaluation**: Includes state-of-the-art evaluation harnesses for academic benchmarks, reasoning benchmarks, code generation, and safety testing
- **Adapter System**: Benefits from NeMo Evaluator's Adapter System for customizable request and response processing

## Advanced Usage Patterns

### Evaluate LLMs Using Log-Probabilities

```{literalinclude} ../../deployment/nemo-fw/_snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```
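Log-probability benchmarks such as ARC-Challenge score a model by comparing the total log-likelihood the endpoint assigns to each candidate answer, rather than by parsing generated text. The following is an illustrative sketch of that scoring rule, not the harness's actual implementation; the function name and data shapes are hypothetical:

```python
def pick_answer(choice_token_logprobs: dict) -> str:
    """Return the candidate continuation the model finds most likely.

    choice_token_logprobs maps each answer choice to the list of
    per-token log-probabilities the model assigned to that continuation.
    """
    # Sum per-token log-probs to get the log-likelihood of each full
    # choice, then take the argmax over choices.
    totals = {
        choice: sum(logprobs)
        for choice, logprobs in choice_token_logprobs.items()
    }
    return max(totals, key=totals.get)

# Example: the model assigns much higher likelihood to choice "B"
scores = {
    "A": [-2.1, -3.4],
    "B": [-0.3, -0.2],
    "C": [-1.9, -2.8],
}
print(pick_answer(scores))  # → B
```

Real harnesses may additionally length-normalize the sums to avoid biasing the comparison toward shorter answer choices.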
### Multi-Instance Deployment with Ray

Deploy multiple instances of your model. `--port` is the Ray server port, `--num_gpus` the total number of GPUs available, `--num_replicas` the number of model replicas, and the `*_parallel_size` flags set the tensor, pipeline, and context parallelism per replica:

```shell
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /checkpoint \
    --model_id "megatron_model" \
    --port 8080 \
    --num_gpus 4 \
    --num_replicas 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1
```

Run evaluations with increased parallelism:

```python
from nemo_evaluator.api import check_endpoint, evaluate
from nemo_evaluator.api.api_dataclasses import EvaluationConfig, ApiEndpoint, EvaluationTarget, ConfigParams

# Configure the evaluation target
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type="completions",
    model_id="megatron_model",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_params = ConfigParams(top_p=0, temperature=0, parallelism=2)
eval_config = EvaluationConfig(type='mmlu', params=eval_params, output_dir="results")

if __name__ == "__main__":
    check_endpoint(
        endpoint_url=eval_target.api_endpoint.url,
        endpoint_type=eval_target.api_endpoint.type,
        model_name=eval_target.api_endpoint.model_id,
    )
    evaluate(target_cfg=eval_target, eval_cfg=eval_config)
```

## Next Steps

- Integrate evaluation into your training pipeline
- Run deployment and evaluation with NeMo Run
- Configure adapters and interceptors for advanced evaluation scenarios
- Explore {ref}`tutorials-overview`

(tutorials-overview)=

# Tutorials

Master {{ product_name_short }} with hands-on tutorials and practical examples.

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`goal;1.5em;sd-mr-1` How-To
:link: how-to/index
:link-type: doc
Hands-on, step-by-step guides showcasing a single feature or use case.
::: :::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Evaluation with NeMo Framework :link: nemo-fw/index :link-type: doc Deploy models and run evaluations using NeMo Framework container. ::: :::{grid-item-card} {octicon}`light-bulb;1.5em;sd-mr-1` Evaluate an existing endpoint using local executor :link: local-evaluation-of-existing-endpoint :link-type: doc ::: :::: --- orphan: true --- (create-framework-definition-file)= # Tutorial: Create a Framework Definition File Learn by building a complete FDF for a simple evaluation framework. **What you'll build**: An FDF that wraps a hypothetical CLI tool called `domain-eval` **Time**: 20 minutes **Prerequisites**: - Python evaluation framework with a CLI - Basic YAML knowledge - Understanding of your framework's parameters ## What You're Creating By the end, you'll have integrated your evaluation framework with {{ product_name_short }}, allowing users to run: ```bash nemo-evaluator run_eval \ --eval_type domain_specific_task \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat ``` --- ## Step 1: Understand Your Framework First, document your framework's CLI interface. For our example `domain-eval`: ```bash # How your CLI currently works domain-eval \ --model-name gpt-4 \ --api-url https://api.example.com/v1/chat/completions \ --task medical_qa \ --temperature 0.0 \ --output-dir ./results ``` **Action**: Write down your own framework's command structure. --- ## Step 2: Create the Directory Structure ```bash mkdir -p my-framework/core_evals/domain_eval cd my-framework/core_evals/domain_eval touch framework.yml output.py __init__.py ``` **Why this structure?** {{ product_name_short }} discovers frameworks by scanning `core_evals/` directories. 
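As an illustrative sketch only (the real discovery logic lives inside {{ product_name_short }} and scans installed Python packages), you can picture discovery as a search for `core_evals/*/framework.yml` files:

```python
from pathlib import Path

def discover_frameworks(search_root: str) -> list:
    """Find framework definition files under core_evals/ directories.

    Hypothetical helper for illustration: each integrated framework
    ships a core_evals/<pkg_name>/framework.yml file.
    """
    root = Path(search_root)
    # Return the package directory names that contain a framework.yml
    return sorted(
        fdf.parent.name for fdf in root.glob("*/core_evals/*/framework.yml")
    )

# After Step 2, a scan of your working directory would surface the new
# `domain_eval` package once framework.yml exists.
print(discover_frameworks("."))
```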
--- ## Step 3: Add Framework Identification Create `framework.yml` and start with the identification section: ```yaml framework: name: domain-eval # Lowercase, hyphenated pkg_name: domain_eval # Python package name full_name: Domain Evaluation Framework description: Evaluates models on domain-specific medical and legal tasks url: https://github.com/example/domain-eval ``` **Why these fields?** - `name`: Used in CLI commands (`--framework domain-eval`) - `pkg_name`: Used for Python imports - `full_name`: Shows in documentation - `url`: Links users to your source code **Test**: This minimal FDF should now be discoverable (but not runnable yet). --- ## Step 4: Map CLI Parameters to Template Variables Now map your CLI to {{ product_name_short }}'s configuration structure: | Your CLI Flag | Maps To | FDF Template Variable | |---------------|---------|----------------------| | `--model-name` | Model ID | `{{target.api_endpoint.model_id}}` | | `--api-url` | Endpoint URL | `{{target.api_endpoint.url}}` | | `--task` | Task name | `{{config.params.task}}` | | `--temperature` | Temperature | `{{config.params.temperature}}` | | `--output-dir` | Output path | `{{config.output_dir}}` | **Action**: Create this mapping for your own framework. --- ## Step 5: Write the Command Template Add the `defaults` section with your command template: ```yaml defaults: command: >- {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} domain-eval --model-name {{target.api_endpoint.model_id}} --api-url {{target.api_endpoint.url}} --task {{config.params.task}} --temperature {{config.params.temperature}} --output-dir {{config.output_dir}} ``` **Understanding the template**: - `{% if ... %}`: Conditional - exports API key if provided - `{{variable}}`: Placeholder filled with actual values at runtime - Line breaks are optional (using `>-` makes it readable) **Common pattern**: Export environment variables before the command runs. 
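To see what the runner will actually execute, it helps to render the template with sample values. Below is a stdlib-only sketch of the `{{variable}}` substitution; the real system uses Jinja2, which also handles the `{% if %}` conditionals:

```python
import re

def render(template: str, context: dict) -> str:
    """Substitute {{dotted.path}} placeholders from a nested context dict.

    Simplified stand-in for Jinja2 rendering; conditionals are not handled.
    """
    def lookup(match):
        value = context
        # Walk the dotted path, e.g. target.api_endpoint.model_id
        for part in match.group(1).strip().split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{([^}]+)\}\}", lookup, template)

command = (
    "domain-eval --model-name {{target.api_endpoint.model_id}} "
    "--task {{config.params.task}} --output-dir {{config.output_dir}}"
)
context = {
    "target": {"api_endpoint": {"model_id": "meta/llama-3.2-3b-instruct"}},
    "config": {"params": {"task": "medical_qa"}, "output_dir": "./results"},
}
print(render(command, context))
# → domain-eval --model-name meta/llama-3.2-3b-instruct --task medical_qa --output-dir ./results
```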
--- ## Step 6: Define Default Parameters Add default configuration values: ```yaml defaults: command: >- # ... command from previous step ... config: params: temperature: 0.0 # Deterministic by default max_new_tokens: 1024 # Token limit parallelism: 10 # Concurrent requests max_retries: 5 # API retry attempts request_timeout: 60 # Seconds target: api_endpoint: type: chat # Default to chat endpoint ``` **Why defaults?** Users can run evaluations without specifying every parameter. --- ## Step 7: Define Your Evaluation Tasks Add the specific tasks your framework supports: ```yaml evaluations: - name: medical_qa description: Medical question answering evaluation defaults: config: type: medical_qa supported_endpoint_types: - chat params: task: medical_qa # Passed to --task flag - name: legal_reasoning description: Legal reasoning and case analysis defaults: config: type: legal_reasoning supported_endpoint_types: - chat - completions # Supports both endpoint types params: task: legal_reasoning temperature: 0.0 # Override for deterministic reasoning ``` **Key points**: - Each evaluation has a unique `name` (used in CLI) - `supported_endpoint_types` declares API compatibility - Task-specific `params` override framework defaults --- ## Step 8: Create the Output Parser Create `output.py` to parse your framework's results: ```python def parse_output(output_dir: str) -> dict: """Parse evaluation results from your framework's output format.""" import json from pathlib import Path # Adapt this to your framework's output format results_file = Path(output_dir) / "results.json" with open(results_file) as f: raw_results = json.load(f) # Convert to {{ product_name_short }} format return { "tasks": { "medical_qa": { "name": "medical_qa", "metrics": { "accuracy": raw_results["accuracy"], "f1_score": raw_results["f1"] } } } } ``` **What this does**: Translates your framework's output format into {{ product_name_short }}'s standard schema. 
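Before wiring the parser into your package, you can sanity-check it locally by writing a fake results file in the format the hypothetical `domain-eval` tool would produce and inspecting the returned schema:

```python
import json
import tempfile
from pathlib import Path

def parse_output(output_dir: str) -> dict:
    """Same parser as above, reproduced so this check is self-contained."""
    results_file = Path(output_dir) / "results.json"
    with open(results_file) as f:
        raw_results = json.load(f)
    return {
        "tasks": {
            "medical_qa": {
                "name": "medical_qa",
                "metrics": {
                    "accuracy": raw_results["accuracy"],
                    "f1_score": raw_results["f1"],
                },
            }
        }
    }

with tempfile.TemporaryDirectory() as tmp:
    # Fake output in the format domain-eval writes
    (Path(tmp) / "results.json").write_text(json.dumps({"accuracy": 0.91, "f1": 0.88}))
    parsed = parse_output(tmp)
    print("parser output:", parsed)
```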
--- ## Step 9: Test Your FDF Install your framework package and test: ```bash # From your-framework/ directory pip install -e . # List available evaluations (should show your tasks) eval-factory list_evals --framework domain-eval # Run a test evaluation nemo-evaluator run_eval \ --eval_type medical_qa \ --model_id gpt-3.5-turbo \ --model_url https://api.openai.com/v1/chat/completions \ --model_type chat \ --api_key_name OPENAI_API_KEY \ --output_dir ./test_results \ --overrides "config.params.limit_samples=5" ``` **Expected output**: Your CLI should execute with substituted parameters. --- ## Step 10: Add Conditional Logic (Advanced) Make parameters optional with Jinja2 conditionals: ```yaml defaults: command: >- domain-eval --model-name {{target.api_endpoint.model_id}} --api-url {{target.api_endpoint.url}} {% if config.params.task is not none %}--task {{config.params.task}}{% endif %} {% if config.params.temperature is not none %}--temperature {{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--num-samples {{config.params.limit_samples}}{% endif %} --output-dir {{config.output_dir}} ``` **When to use conditionals**: For optional flags that shouldn't appear if not specified. 
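The conditionals above keep unset flags out of the rendered command. The same idea in plain Python can be handy when exercising your CLI by hand; this is an illustrative sketch whose parameter names follow the hypothetical `domain-eval` tool:

```python
from typing import List, Optional

def build_args(model_id: str, url: str, output_dir: str,
               task: Optional[str] = None,
               temperature: Optional[float] = None,
               limit_samples: Optional[int] = None) -> List[str]:
    """Assemble a domain-eval argv, skipping flags whose value is None."""
    args = ["domain-eval", "--model-name", model_id, "--api-url", url]
    if task is not None:
        args += ["--task", task]
    if temperature is not None:
        args += ["--temperature", str(temperature)]
    if limit_samples is not None:
        args += ["--num-samples", str(limit_samples)]
    return args + ["--output-dir", output_dir]

# Only the flags that were set appear in the final command line
print(build_args("gpt-3.5-turbo", "https://api.example.com/v1", "./out", task="medical_qa"))
```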
--- ## Complete Example Here's your full `framework.yml`: ```yaml framework: name: domain-eval pkg_name: domain_eval full_name: Domain Evaluation Framework description: Evaluates models on domain-specific tasks url: https://github.com/example/domain-eval defaults: command: >- {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} domain-eval --model-name {{target.api_endpoint.model_id}} --api-url {{target.api_endpoint.url}} --task {{config.params.task}} --temperature {{config.params.temperature}} --output-dir {{config.output_dir}} config: params: temperature: 0.0 max_new_tokens: 1024 parallelism: 10 max_retries: 5 request_timeout: 60 target: api_endpoint: type: chat evaluations: - name: medical_qa description: Medical question answering defaults: config: type: medical_qa supported_endpoint_types: - chat params: task: medical_qa - name: legal_reasoning description: Legal reasoning tasks defaults: config: type: legal_reasoning supported_endpoint_types: - chat - completions params: task: legal_reasoning ``` --- ## Next Steps **Dive deeper into FDF features**: {ref}`framework-definition-file` **Learn about advanced templating**: {ref}`advanced-features` **Share your framework**: Package and distribute via PyPI **Troubleshooting**: {ref}`fdf-troubleshooting` --- ## Common Patterns ### Pattern 1: Framework with Custom CLI Flags ```yaml command: >- my-eval --custom-flag {{config.params.extra.custom_value}} ``` Use `extra` dict for framework-specific parameters. ### Pattern 2: Multiple Output Files ```yaml command: >- my-eval --results {{config.output_dir}}/results.json --logs {{config.output_dir}}/logs.txt ``` Organize outputs in subdirectories using `output_dir`. ### Pattern 3: Environment Variable Setup ```yaml command: >- export HF_TOKEN=${{target.api_endpoint.api_key}} && export TOKENIZERS_PARALLELISM=false && my-eval ... ``` Set environment variables before execution. 
---

## Summary

You've learned how to:

✅ Create the FDF directory structure
✅ Map your CLI to template variables
✅ Write Jinja2 command templates
✅ Define default parameters
✅ Declare evaluation tasks
✅ Create output parsers
✅ Test your integration

**Your framework is now integrated with {{ product_name_short }}!**

# How-To Guides

These practical, task-oriented guides walk you through specific configurations and workflows in NeMo Evaluator. Each guide focuses on a single feature or use case, providing clear instructions to help you accomplish common tasks efficiently.

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`comment;1.5em;sd-mr-1` Remove Reasoning Traces
:link: how-to-reasoning
:link-type: ref
Configure NeMo Evaluator Launcher for evaluating reasoning models.
:::

:::{grid-item-card} {octicon}`arrow-switch;1.5em;sd-mr-1` Switch Executor
:link: how-to-switch-executors
:link-type: ref
Learn how to switch between execution backends (e.g. from local to Slurm).
:::

::::

:::{toctree}
:caption: How-To Guides
:hidden:

reasoning
local-to-slurm
:::

(how-to-reasoning)=

# Remove Reasoning Traces

This guide walks you through configuring NeMo Evaluator Launcher for evaluating reasoning models, ensuring accurate benchmark results. It shows how to:

- adjust sampling parameters
- remove reasoning traces from the answer
- control the thinking budget

:::{tip}
Need a more in-depth explanation? See the {ref}`run-eval-reasoning` guide.
:::

## Before You Start

Ensure you have:

- **Model Endpoint**: An OpenAI-compatible API reasoning endpoint for your model (completions or chat). See {ref}`deployment-testing-compatibility` for snippets you can use to test your endpoint and {ref}`run-eval-reasoning` for details on reasoning models.
- **API Access**: Valid API key if your endpoint requires authentication
- **Installed Packages**: NeMo Evaluator or access to evaluation containers

## Prepare your config file

### Configure the Evaluation

1.
Select tasks:

   ```yaml
   evaluation:
     tasks:
       - name: simple_evals.mmlu_pro
       - name: mgsm
   ```

2. Adjust sampling parameters for a reasoning model, e.g.:

   ```yaml
   evaluation:
     tasks:
       - name: simple_evals.mmlu_pro
       - name: mgsm
     nemo_evaluator_config:
       config:
         params:
           temperature: 0.6
           top_p: 0.95
           max_new_tokens: 32768  # for reasoning + final answer
           request_timeout: 3600  # long timeout to account for thinking time
           parallelism: 1  # single parallel request to avoid overloading the server
   ```

3. Enable the Reasoning Interceptor to remove reasoning traces from the model's responses:

   ```yaml
   evaluation:
     tasks:
       - name: simple_evals.mmlu_pro
       - name: mgsm
     nemo_evaluator_config:
       config:
         params:
           temperature: 0.6
           top_p: 0.95
           max_new_tokens: 32768  # for reasoning + final answer
           request_timeout: 3600  # long timeout to account for thinking time
           parallelism: 1  # single parallel request to avoid overloading the server
       target:
         api_endpoint:
           adapter_config:
             interceptors:
               - name: endpoint
               - name: reasoning
   ```

   In this example we will use [NVIDIA-Nemotron-Nano-9B-v2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2), which produces its reasoning trace in a `<think>...</think>` format. If your model uses different formatting, make sure to configure the interceptor as shown in {ref}`run-eval-reasoning`.

4. (Optional) Modify the request to turn reasoning on. In this example we work with an endpoint that requires "/think" to be present in the system message to enable reasoning. We will use the Interceptor to add it to the request. Adjust the example below to match your endpoint (see detailed instructions in {ref}`run-eval-reasoning`).
```yaml evaluation: tasks: - name: simple_evals.mmlu_pro - name: mgsm nemo_evaluator_config: target: api_endpoint: adapter_config: interceptors: - name: system_message config: system_message: "/think" - name: endpoint - name: reasoning ``` ### Select your execution backend and deployment specification For the purpose of this example, we will use local execution without deployment. See other How-to guides to adjust this example to your needs. 1. Configure local executor ```yaml defaults: - execution: local - _self_ execution: output_dir: nel-results ``` 2. Configure target endpoint ```yaml defaults: - execution: local - deployment: none - _self_ execution: output_dir: nel-results target: api_endpoint: # see https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2 for endpoint details model_id: nvidia/nvidia-nemotron-nano-9b-v2 url: https://integrate.api.nvidia.com/v1/chat/completions api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com ``` ### The Full Config Combine all components into a config file for your experiment: ```yaml defaults: - execution: local - deployment: none - _self_ execution: output_dir: nel-results target: api_endpoint: # see https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2 for endpoint details model_id: nvidia/nvidia-nemotron-nano-9b-v2 url: https://integrate.api.nvidia.com/v1/chat/completions api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com evaluation: tasks: - name: simple_evals.mmlu_pro - name: mgsm nemo_evaluator_config: config: params: temperature: 0.6 top_p: 0.95 max_new_tokens: 32768 # for reasoning + final answer request_timeout: 3600 # long timeout to account for thinking time parallelism: 1 # single parallel request to avoid overloading the server target: api_endpoint: adapter_config: interceptors: - name: system_message config: system_message: "/think" - name: endpoint - name: reasoning ``` ## Verify and execute your experiment 1. Save the prepared config in a file, e.g. 
`nemotron_eval.yaml` 2. (Recommended) Inspect the configuration with `--dry_run` ```bash export NGC_API_KEY=nvapi-your-key nemo-evaluator-launcher run --config nemotron_eval.yaml --dry_run ``` 3. (Recommended) Run a short experiment with 10 samples per benchmark to verify your config ```bash export NGC_API_KEY=nvapi-your-key nemo-evaluator-launcher run --config nemotron_eval.yaml \ -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10 ``` :::{tip} If everything works correctly, you should see logs from the `ResponseReasoningInterceptor` similar to the ones below: ```bash [I 2025-12-02T16:14:28.257] Reasoning tracking information reasoning_words=1905 original_content_words=85 updated_content_words=85 reasoning_finished=True reasoning_started=True reasoning_tokens=unknown updated_content_tokens=unknown logger=ResponseReasoningInterceptor request_id=ccff76b2-2b85-4eed-a9d0-2363b533ae58 ``` ::: 4. Run the full experiment ```bash export NGC_API_KEY=nvapi-your-key nemo-evaluator-launcher run --config nemotron_eval.yaml ``` 5. Analyze the metrics and reasoning statistics After evaluation completes, check these key artifacts: - **`results.yaml`**: Contains the benchmark metrics (see {ref}`evaluation-output`) - **`eval_factory_metrics.json`**: Contains reasoning statistics under the `reasoning` key, including: - `responses_with_reasoning`: How many responses included reasoning traces - `reasoning_finished_count` vs `reasoning_started_count`: If these match, your `max_new_tokens` was sufficient - `reasoning_unfinished_count`: Number of responses where reasoning started but was truncated (didn't reach the end token) - `reasoning_finished_ratio`: Ratio (between 0 and 1) of responses whose reasoning completed to all responses containing reasoning - `avg_reasoning_words` and other word- and token-count metrics: Use these for cost analysis :::{tip} For a detailed explanation of reasoning statistics and artifacts, see {ref}`run-eval-reasoning`. 
::: (how-to-switch-executors)= # Switch Executor With NeMo Evaluator, you can choose how your evaluations run: locally using Docker, on clusters with Slurm, or other options - all managed through _executors_. In this guide, you will learn how to switch from one executor to another. For the purpose of this exercise, we will use the `local` and `slurm` executors with `vllm` model deployment. :::{tip} Learn more about the {ref}`execution-backend` concept and the {ref}`executors-overview` overview for details on available executors and their configuration. ::: ## Before You Start Ensure you have: - A NeMo Evaluator Launcher config that you would like to run. You can use the config shown below, choose one of our [example configs](https://github.com/NVIDIA-NeMo/Evaluator/tree/main/packages/nemo-evaluator-launcher/examples), or prepare your own configuration. - NeMo Evaluator Launcher installed in your environment. - Access to a Slurm cluster (with appropriate partitions/queues) - [Pyxis SPANK plugin](https://github.com/NVIDIA/pyxis) installed on the cluster ## Starting Point: Config to Modify We will use the following config as our starting point: ```yaml defaults: - execution: local - deployment: vllm - _self_ # set required execution arguments execution: output_dir: local_results deployment: checkpoint_path: null hf_model_handle: microsoft/Phi-4-mini-instruct served_model_name: microsoft/Phi-4-mini-instruct tensor_parallel_size: 1 data_parallel_size: 1 evaluation: tasks: - name: ifeval # chat benchmark will automatically use v1/chat/completions endpoint - name: gsm8k # completions benchmark will automatically use v1/completions endpoint ``` This config will run a deployment of Phi-4-mini-instruct with vLLM and evaluation on the IFEval and GSM8k benchmarks. The workflow is executed locally on your machine when you launch it. 
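The `defaults:` list in the config above is composed Hydra-style: each entry is merged in order, and `_self_` marks the point where the file's own keys are applied, so later layers override earlier ones. The sketch below illustrates those merge semantics with the standard library only; it is not the launcher's actual implementation, and the config-group contents are simplified stand-ins:

```python
# Illustrative sketch of Hydra-style "defaults" composition.
# The group contents below are made-up placeholders, not the real launcher groups.
config_groups = {
    "execution/local": {"execution": {"type": "local"}},
    "deployment/vllm": {"deployment": {"image": "vllm", "tensor_parallel_size": 1}},
}

def deep_merge(base, override):
    """Recursively merge `override` into `base`; `override` wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def compose(defaults, self_config):
    """Merge each defaults entry in order; `_self_` injects the file's own keys."""
    result = {}
    for entry in defaults:
        layer = self_config if entry == "_self_" else config_groups[entry]
        result = deep_merge(result, layer)
    return result

# The file's own keys, mirroring (part of) the YAML body above
self_config = {
    "execution": {"output_dir": "local_results"},
    "deployment": {"hf_model_handle": "microsoft/Phi-4-mini-instruct"},
}

composed = compose(["execution/local", "deployment/vllm", "_self_"], self_config)
print(composed["execution"])  # the group's keys and the file's keys are both present
```

This is why a `-o execution=slurm/default` override (shown later in this guide) can swap the whole execution layer while the rest of the config stays untouched.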
## Modify the Config To permanently switch to a different execution backend, replace the execution section of your config: ```yaml defaults: - execution: local # old executor: local - deployment: vllm - _self_ execution: output_dir: local_results # path on your local machine ``` with a different one: ```yaml defaults: - execution: slurm/default # new executor: slurm - deployment: vllm - _self_ execution: hostname: my-cluster.login.com # SLURM headnode (login) hostname account: my_account # SLURM account allocation output_dir: /absolute/path/on/remote # ABSOLUTE path accessible to SLURM compute nodes ``` This will allow you to run the same deployment and evaluation workflow on a remote Slurm cluster. If you only want to change the executor, there's no need to update other sections of your config. ## Dynamically switch executor with CLI overrides You can also specify a different execution backend at runtime to dynamically switch from one executor to another: ```bash export CLUSTER_HOSTNAME=my-cluster.login.com # SLURM headnode (login) hostname export ACCOUNT=my_account # SLURM account allocation export OUT_DIR=/absolute/path/on/remote # ABSOLUTE path accessible to SLURM compute nodes nel run --config local_config.yaml \ -o execution=slurm/default \ -o execution.hostname=$CLUSTER_HOSTNAME \ -o execution.account=$ACCOUNT \ -o execution.output_dir=$OUT_DIR ``` This also allows you to easily switch from one Slurm cluster to another. (tutorials-local-eval-existing-endpoint)= # Local Evaluation of Existing Endpoint This tutorial shows how to evaluate an existing API endpoint using the Local executor. ## Prerequisites - Docker - Python environment with the NeMo Evaluator Launcher CLI available (install the launcher by following {ref}`gs-install`) ## Step-by-Step Guide ### 1. 
Select a Model You have the following options: #### Option I: Use the NVIDIA Build API - **URL**: `https://integrate.api.nvidia.com/v1/chat/completions` - **Models**: Choose any endpoint from NVIDIA Build's extensive catalog - **API Key**: Get from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). See [Setting up API Keys](https://docs.omniverse.nvidia.com/guide-sdg/latest/setup.html#preview-and-set-up-an-api-key). Make sure to export the API key: ```bash export NGC_API_KEY=nvapi-... ``` #### Option II: Another Hosted Endpoint - **URL**: Your model's endpoint URL - **Models**: Any OpenAI-compatible endpoint - **API_KEY**: If your endpoint is gated, get an API key from your provider and export it: ```bash export API_KEY=... ``` #### Option III: Deploy Your Own Endpoint Deploy an OpenAI-compatible endpoint using frameworks like vLLM, SGLang, TRT-LLM, or NIM. :::{note} For this tutorial, we will use `meta/llama-3.2-3b-instruct` from [build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-8b-instruct). You will need to export your `NGC_API_KEY` to access this endpoint. ::: ### 2. Select Tasks Choose which benchmarks to evaluate. You can list all available tasks with the following command: ```bash nemo-evaluator-launcher ls tasks ``` For a comprehensive list of supported tasks and descriptions, see {ref}`nemo-evaluator-containers`. **Important**: Each task has a dedicated endpoint type (e.g., `/v1/chat/completions`, `/v1/completions`). Ensure that your model provides the correct endpoint type for the tasks you want to evaluate. Use our {ref}`deployment-testing-compatibility` guide to verify your endpoint supports the required formats. :::{note} For this tutorial, we will pick `ifeval` and `humaneval_instruct`, as these are fast. They both use the chat endpoint. ::: ### 3. 
Create a Configuration File Create a `configs` directory: ```bash mkdir configs ``` Create a configuration file with a descriptive name (e.g., `configs/local_endpoint.yaml`) and populate it with the following content: ```yaml defaults: - execution: local # The evaluation will run locally on your machine using Docker - deployment: none # Since we are evaluating an existing endpoint, we don't need to deploy the model - _self_ execution: output_dir: results/${target.api_endpoint.model_id} # Logs and artifacts will be saved here mode: sequential # Default: run tasks sequentially. You can also use the mode 'parallel' target: api_endpoint: model_id: meta/llama-3.2-3b-instruct # TODO: update to the model you want to evaluate url: https://integrate.api.nvidia.com/v1/chat/completions # TODO: update to the endpoint you want to evaluate api_key_name: NGC_API_KEY # Name of the env variable that stores the API Key with access to build.nvidia.com (or model of your choice) # specify the benchmarks to evaluate evaluation: # Optional: Global evaluation overrides - these apply to all benchmarks below nemo_evaluator_config: config: params: parallelism: 2 request_timeout: 1600 tasks: - name: ifeval # use the default benchmark configuration - name: humaneval_instruct # Optional: Task overrides - here they apply only to the task `humaneval_instruct` nemo_evaluator_config: config: params: max_new_tokens: 1024 temperature: 0.3 ``` This configuration will create evaluations for two tasks: `ifeval` and `humaneval_instruct`. You can display the full configuration and the scripts that will be executed using `--dry-run`: ```bash nemo-evaluator-launcher run --config configs/local_endpoint.yaml --dry-run ``` ### 4. Run the Evaluation Once your configuration file is complete, you can run the evaluations: ```bash nemo-evaluator-launcher run --config configs/local_endpoint.yaml ``` ### 5. 
Run the Same Evaluation for a Different Model (Using CLI Overrides) You can override the values from your configuration file using CLI overrides: ```bash export API_KEY= MODEL_NAME= URL= # Note: endpoint URL needs to be FULL (e.g., https://api.example.com/v1/chat/completions) nemo-evaluator-launcher run --config configs/local_endpoint.yaml \ -o target.api_endpoint.model_id=$MODEL_NAME \ -o target.api_endpoint.url=$URL \ -o target.api_endpoint.api_key_name=API_KEY ``` ### 6. Check the Job Status and Results List the runs from the last 2 hours to see the invocation IDs of the two evaluation jobs: ```bash nemo-evaluator-launcher ls runs --since 2h # list runs from the last 2 hours ``` Use the IDs to check the job statuses: ```bash nemo-evaluator-launcher status --json ``` When the jobs finish, you can display results and export them using the available exporters: ```bash # Check the results cat results/*/artifacts/results.yml # Check the running logs tail -f results/*/*/logs/stdout.log # use the output_dir printed by the run command # Export metrics and metadata from both runs to JSON nemo-evaluator-launcher export --dest local --format json cat processed_results.json ``` Refer to {ref}`exporters-overview` for available export options. ## Next Steps - **{ref}`evaluation-configuration`**: Customize evaluation parameters and prompts - **{ref}`executors-overview`**: Try Slurm or Lepton for different environments - **{ref}`exporters-overview`**: Send results to W&B, MLFlow, or other platforms # Tutorials for NeMo Framework ## Before You Start Before starting the tutorials, ensure you have: - **NeMo Framework Container**: Running the latest [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) - **Model Checkpoint**: Access to a Megatron Bridge checkpoint (tutorials use Llama 3.2 1B Instruct converted from a Hugging Face format). 
- **GPU Resources**: CUDA-compatible GPU with sufficient memory - **Jupyter Environment**: Ability to run Jupyter notebooks --- ## Available Tutorials Build your expertise with these progressive tutorials: ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Orchestrating evaluations with NeMo Run :link: nemo-run :link-type: doc Launch deployment and evaluation jobs using NeMo Run. ::: :::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Basic evaluation with MMLU :link: https://github.com/NVIDIA-NeMo/Eval/tree/main/tutorials/mmlu.ipynb :link-type: url Deploy models and run evaluations with the MMLU benchmark for both completions and chat endpoints. ::: :::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Enable additional evaluation harnesses :link: https://github.com/NVIDIA-NeMo/Eval/tree/main/tutorials/simple-evals.ipynb :link-type: url Discover how to extend evaluation capabilities by installing additional harnesses and running HumanEval coding assessments. ::: :::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Configure custom tasks :link: https://github.com/NVIDIA-NeMo/Eval/tree/main/tutorials/wikitext.ipynb :link-type: url Master custom evaluation workflows by running the WikiText benchmark with advanced configuration and log-probability analysis. ::: :::: ## Run the Notebook Tutorials 1. Start the NeMo Framework container: ```bash # set your Hugging Face token for access to gated datasets and checkpoints export HF_TOKEN=hf_... docker run --rm -it -w /workdir -v $(pwd):/workdir \ -e HF_TOKEN \ --entrypoint bash --gpus all \ nvcr.io/nvidia/nemo:${TAG} ``` 2. Launch Jupyter: ```bash jupyter lab --ip=0.0.0.0 --port=8888 --allow-root ``` 3. Navigate to the `tutorials/` directory and open the desired notebook. :::{toctree} :caption: Tutorials :hidden: nemo-run ::: # Run Evaluations with NeMo Run This tutorial explains how to run evaluations inside the NeMo Framework container with NeMo Run. 
For detailed information about [NeMo Run](https://github.com/NVIDIA/NeMo-Run), please refer to its documentation. Below is a concise guide focused on using NeMo Run to perform evaluations in NeMo. ## Prerequisites - Docker installed - [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) - Access to a NeMo 2.0 checkpoint (tutorials use Llama 3.2 1B Instruct) - CUDA-compatible GPU with sufficient memory (for running locally) or access to a Slurm-based cluster (for running on a cluster). - NeMo Evaluator repository cloned (for access to [scripts](https://github.com/NVIDIA-NeMo/Evaluator/tree/main/scripts)) ```bash git clone https://github.com/NVIDIA-NeMo/Evaluator.git ``` - (Optional) Your Hugging Face token if you are using gated datasets (e.g. [GPQA-Diamond dataset](https://huggingface.co/datasets/Idavidrein/gpqa)). ## How it works The [evaluation_with_nemo_run.py](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/scripts/evaluation_with_nemo_run.py) script serves as a reference for launching evaluations with NeMo Run. This script demonstrates how to use NeMo Run with both local executors (your local workstation) and Slurm-based executors (clusters). In this setup, the deploy and evaluate processes are launched as two separate jobs with NeMo Run. The evaluate method waits until the PyTriton server is accessible and the model is deployed before starting the evaluations. For this purpose, we define a helper function: ```{literalinclude} ../../../scripts/helpers.py :language: python :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` The script supports two types of serving: with Triton (default) and with Ray (pass the `--serving_backend ray` flag). 
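The "wait until the server is up" behavior described above boils down to polling a readiness check until it succeeds or a deadline passes. The sketch below is a simplified illustration of that pattern, not the actual helper from `helpers.py`; the health-check URL is a placeholder:

```python
import time
import urllib.request
import urllib.error

def wait_for(check, timeout_s=600.0, interval_s=5.0):
    """Poll `check()` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

def server_ready(url="http://localhost:8080/health"):  # placeholder URL
    """Readiness probe: True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Evaluation would start only once the deployment responds, e.g.:
# ready = wait_for(server_ready, timeout_s=3600, interval_s=10)
```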
User-provided arguments are mapped onto flags expected by the scripts: ```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py :language: python :start-after: "# [snippet-deploy-start]" :end-before: "# [snippet-deploy-end]" ``` The script supports two modes of running the experiment: - locally, using your environment - remotely, sending the job to a Slurm-based cluster First, an executor is selected based on the arguments provided by the user, either a local one: ```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py :language: python :start-after: "# [snippet-local-executor-start]" :end-before: "# [snippet-local-executor-end]" ``` or a Slurm one: ```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py :language: python :start-after: "# [snippet-slurm-executor-start]" :end-before: "# [snippet-slurm-executor-end]" ``` :::{note} Please make sure to update `HF_TOKEN` with your token: - in the NeMo Run script's [local_executor env_vars](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/scripts/evaluation_with_nemo_run.py#L274) if using the local executor - in the [slurm_executor's env_vars](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/scripts/evaluation_with_nemo_run.py#L237) if using the slurm_executor. 
::: Then, the two jobs are configured: ```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py :language: python :start-after: "# [snippet-config-start]" :end-before: "# [snippet-config-end]" ``` Finally, the experiment is started: ```{literalinclude} ../../../scripts/evaluation_with_nemo_run.py :language: python :start-after: "# [snippet-experiment-start]" :end-before: "# [snippet-experiment-end]" ``` ## Run Locally To run evaluations on your local workstation, use the following command: ```bash cd Evaluator/scripts python evaluation_with_nemo_run.py \ --nemo_checkpoint '/workspace/llama3_8b_nemo2/' \ --eval_task 'gsm8k' \ --devices 2 ``` :::{note} When running locally with NeMo Run, you will need to manually terminate the deploy process once evaluations are complete. ::: ## Run on Slurm-based Clusters To run evaluations on Slurm-based clusters, add the `--slurm` flag to your command and specify any custom parameters such as user, host, remote_job_dir, account, mounts, etc. Refer to the `evaluation_with_nemo_run.py` script for further details. Below is an example command: ```bash cd Evaluator/scripts python evaluation_with_nemo_run.py \ --nemo_checkpoint='/workspace/llama3_8b_nemo2' \ --slurm --nodes 1 \ --devices 8 \ --container_image "nvcr.io/nvidia/nemo:25.11" \ --tensor_parallelism_size 8 ``` (evaluation-overview)= # About Evaluation Evaluate LLMs, VLMs, agentic systems, and retrieval models across 100+ benchmarks using unified workflows. ## Before You Start Before you run evaluations, ensure you have: 1. **Chosen your approach**: See {ref}`get-started-overview` for installation and setup guidance 2. **Deployed your model**: See {ref}`deployment-overview` for deployment options 3. **OpenAI-compatible endpoint**: Your model must expose a compatible API (see {ref}`deployment-testing-compatibility`). 4. **API credentials**: Access tokens for your model endpoint and Hugging Face Hub. 
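A quick way to sanity-check the "OpenAI-compatible endpoint" requirement above is to build a minimal chat-completions request and confirm your endpoint accepts it. The sketch below shows the expected request shape; the model ID, URL, and key are placeholders, and the request is constructed but not sent:

```python
import json
import urllib.request

# Minimal OpenAI-style chat-completions payload; model and URL are placeholders.
payload = {
    "model": "meta/llama-3.2-3b-instruct",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "max_tokens": 32,
}

request = urllib.request.Request(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer $NGC_API_KEY",  # substitute your real key
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it; a compatible endpoint
# returns JSON with a "choices" list containing the model's reply.
```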
--- ## Quick Start: Academic Benchmarks :::{admonition} Fastest path to evaluate academic benchmarks :class: tip **For researchers and data scientists**: Evaluate your model on standard academic benchmarks in 3 steps. **Step 1: Choose Your Approach** - **Launcher CLI** (Recommended): `nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml` - **Python API**: Direct programmatic control with `evaluate()` function **Step 2: Select Benchmarks** Common academic suites: - **General Knowledge**: `mmlu_pro`, `gpqa_diamond` - **Mathematical Reasoning**: `AIME_2025`, `mgsm` - **Instruction Following**: `ifbench`, `mtbench` Discover all available tasks: ```bash nemo-evaluator-launcher ls tasks ``` **Step 3: Run Evaluation** Create `config.yml`: ```yaml defaults: - execution: local - deployment: none - _self_ evaluation: tasks: - name: mmlu_pro - name: ifbench ``` Launch the job: ```bash export NGC_API_KEY=nvapi-... nemo-evaluator-launcher run \ --config ./config.yml \ -o execution.output_dir=results \ -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \ -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ::: --- ## Evaluation Workflows Select a workflow based on your environment and desired level of control. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher Workflows :link: ../get-started/quickstart/launcher :link-type: doc Unified CLI for running evaluations across local, Slurm, and cloud backends with built-in result export. ::: :::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` Core API Workflows :link: ../libraries/nemo-evaluator/workflows/python-api :link-type: doc Programmatic evaluation using Python API for integration into ML pipelines and custom workflows. 
::: :::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Container Workflows :link: ../libraries/nemo-evaluator/containers/index :link-type: doc Direct container access for specialized use cases and custom evaluation environments. ::: :::: ## Configuration and Customization Configure your evaluations, create custom tasks, explore benchmarks, and extend the framework with these guides. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`list-unordered;1.5em;sd-mr-1` Benchmark Catalog :link: eval-benchmarks :link-type: ref Explore 100+ available benchmarks across 18 evaluation harnesses and their specific use cases. ::: :::{grid-item-card} {octicon}`plus;1.5em;sd-mr-1` Extend Framework :link: ../libraries/nemo-evaluator/extending/framework-definition-file/index :link-type: doc Add custom evaluation frameworks using Framework Definition Files for specialized benchmarks. ::: :::: ## Advanced Features Scale your evaluations, export results, customize adapters, and resolve issues with these advanced features. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Multi-Backend Execution :link: ../libraries/nemo-evaluator-launcher/configuration/executors/index :link-type: doc Run evaluations on local machines, HPC clusters, or cloud platforms with unified configuration. ::: :::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Result Export :link: ../libraries/nemo-evaluator-launcher/exporters/index :link-type: doc Export evaluation results to MLflow, Weights & Biases, Google Sheets, and other platforms. ::: :::{grid-item-card} {octicon}`shield;1.5em;sd-mr-1` Adapter System :link: ../libraries/nemo-evaluator/interceptors/index :link-type: doc Configure request/response processing, logging, caching, and custom interceptors. ::: :::: ## Core Evaluation Concepts - For architectural details and core concepts, refer to {ref}`evaluation-model`. - For container specifications, refer to {ref}`nemo-evaluator-containers`. 
(eval-benchmarks)= # About Selecting Benchmarks NeMo Evaluator provides a comprehensive suite of benchmarks spanning academic reasoning, code generation, safety testing, and domain-specific evaluations. Whether you're validating a new model's capabilities or conducting rigorous academic research, you'll find the right benchmarks to assess your AI system's performance. See {ref}`benchmarks-full-list` for the complete catalog of available benchmarks. ## Available via Launcher ```{literalinclude} ../_snippets/commands/list_tasks.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ## Available via Direct Container Access ```{literalinclude} ../_snippets/commands/list_tasks_core.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ## Choosing Benchmarks for Academic Research :::{admonition} Benchmark Selection Guide :class: tip **For General Knowledge**: - `mmlu_pro` - Expert-level knowledge across 14 domains - `gpqa_diamond` - Graduate-level science questions **For Mathematical & Quantitative Reasoning**: - `AIME_2025` - American Invitational Mathematics Examination (AIME) 2025 questions - `mgsm` - Multilingual math reasoning **For Instruction Following & Alignment**: - `ifbench` - Precise instruction following - `mtbench` - Multi-turn conversation quality See benchmark categories below and {ref}`benchmarks-full-list` for more details. 
::: ## Benchmark Categories ### **Academic and Reasoning** ```{list-table} :header-rows: 1 :widths: 20 30 30 50 * - Container - Description - NGC Catalog - Benchmarks * - **simple-evals** - Common evaluation tasks - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) - GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench * - **lm-evaluation-harness** - Language model benchmarks - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) - ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande * - **hle** - Academic knowledge and problem solving - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) - HLE * - **ifbench** - Instruction following - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) - IFBench * - **mtbench** - Multi-turn conversation evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) - MT-Bench * - **nemo-skills** - Language model benchmarks (science, math, agentic) - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo_skills) - AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro * - **profbench** - Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) - Report Generation, LLM Judge ``` :::{note} BFCL tasks from the nemo-skills container require 
function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible. ::: **Example Usage:** Create `config.yml`: ```yaml defaults: - execution: local - deployment: none - _self_ evaluation: tasks: - name: ifeval - name: gsm8k_cot_instruct - name: gpqa_diamond ``` Run evaluation: ```bash export NGC_API_KEY=nvapi-... export HF_TOKEN=hf_... nemo-evaluator-launcher run \ --config ./config.yml \ -o execution.output_dir=results \ -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \ -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Code Generation** ```{list-table} :header-rows: 1 :widths: 20 30 30 50 * - Container - Description - NGC Catalog - Benchmarks * - **bigcode-evaluation-harness** - Code generation evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) - MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) * - **livecodebench** - Coding - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) - LiveCodeBench (v1-v6, 0724_0125, 0824_0225) * - **scicode** - Coding for scientific research - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) - SciCode ``` **Example Usage:** Create `config.yml`: ```yaml defaults: - execution: local - deployment: none - _self_ evaluation: tasks: - name: humaneval_instruct - name: mbpp ``` Run evaluation: ```bash export NGC_API_KEY=nvapi-... 
nemo-evaluator-launcher run \ --config ./config.yml \ -o execution.output_dir=results \ -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \ -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Safety and Security** ```{list-table} :header-rows: 1 :widths: 20 30 30 50 * - Container - Description - NGC Catalog - Benchmarks * - **garak** - Safety and vulnerability testing - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) - Garak * - **safety-harness** - Safety and bias evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) - Aegis v2, WildGuard ``` **Example Usage:** Create `config.yml`: ```yaml defaults: - execution: local - deployment: none - _self_ evaluation: tasks: - name: aegis_v2 - name: garak ``` Run evaluation: ```bash export NGC_API_KEY=nvapi-... export HF_TOKEN=hf_... nemo-evaluator-launcher run \ --config ./config.yml \ -o execution.output_dir=results \ -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \ -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Function Calling** ```{list-table} :header-rows: 1 :widths: 20 30 30 50 * - Container - Description - NGC Catalog - Benchmarks * - **bfcl** - Function calling - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) - BFCL v2 and v3 * - **tooltalk** - Tool usage evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) - ToolTalk ``` :::{note} Some of the tasks in this category require function calling capabilities. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible. 
::: **Example Usage:** Create `config.yml`: ```yaml defaults: - execution: local - deployment: none - _self_ evaluation: tasks: - name: bfclv2_ast_prompting - name: tooltalk ``` Run evaluation: ```bash export NGC_API_KEY=nvapi-... nemo-evaluator-launcher run \ --config ./config.yml \ -o execution.output_dir=results \ -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \ -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Vision-Language Models** ```{list-table} :header-rows: 1 :widths: 20 30 30 50 * - Container - Description - NGC Catalog - Benchmarks * - **vlmevalkit** - Vision-language model evaluation - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) - AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA ``` :::{note} The tasks in this category require a VLM chat endpoint. See {ref}`deployment-testing-compatibility` for checking if your endpoint is compatible. ::: **Example Usage:** Create `config.yml`: ```yaml defaults: - execution: local - deployment: none - _self_ evaluation: tasks: - name: ocrbench - name: chartqa ``` Run evaluation: ```bash export NGC_API_KEY=nvapi-... 
nemo-evaluator-launcher run \ --config ./config.yml \ -o execution.output_dir=results \ -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \ -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ### **Domain-Specific** ```{list-table} :header-rows: 1 :widths: 20 30 30 50 * - Container - Description - NGC Catalog - Benchmarks * - **helm** - Holistic evaluation framework - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) - MedHelm ``` **Example Usage:** Create `config.yml`: ```yaml defaults: - execution: local - deployment: none - _self_ evaluation: tasks: - name: pubmed_qa - name: medcalc_bench ``` Run evaluation: ```bash export NGC_API_KEY=nvapi-... nemo-evaluator-launcher run \ --config ./config.yml \ -o execution.output_dir=results \ -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \ -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o +target.api_endpoint.api_key_name=NGC_API_KEY ``` ## Container Details For detailed specifications of each container, see {ref}`nemo-evaluator-containers`. 
### Quick Container Access Pull and run any evaluation container directly: ```bash # Academic benchmarks docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} # Code generation docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} # Safety evaluation docker pull nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }} docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }} ``` ### Available Tasks by Container For a complete list of available tasks in each container: ```bash # List tasks in any container docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls # Or use the launcher for unified access nemo-evaluator-launcher ls tasks ``` ## Integration Patterns NeMo Evaluator provides multiple integration options to fit your workflow: ```bash # Launcher CLI (recommended for most users) nemo-evaluator-launcher ls tasks nemo-evaluator-launcher run --config ./local_mmlu_evaluation.yaml # Container direct execution docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls # Python API (for programmatic control) # See the Python API documentation for details ``` ## Benchmark Selection Best Practices ### For Model Development **Iterative Testing**: - Start with `limit_samples=100` for quick feedback during development - Run full evaluations before major releases - Track metrics over time to measure improvement **Configuration**: ```python # Development testing params = ConfigParams( limit_samples=100, # Quick iteration temperature=0.01, # Deterministic parallelism=4 ) # Production evaluation params = ConfigParams( limit_samples=None, # Full dataset temperature=0.01, # 
Deterministic parallelism=8 # Higher throughput ) ``` ### For Specialized Domains - **Code Models**: Focus on `humaneval`, `mbpp`, `livecodebench` - **Instruction Models**: Emphasize `ifbench`, `mtbench` - **Multilingual Models**: Include `arc_multilingual`, `hellaswag_multilingual`, `mgsm` - **Safety-Critical**: Prioritize `safety-harness` and `garak` evaluations ## Next Steps - **Container Details**: Browse {ref}`nemo-evaluator-containers` for complete specifications - **Custom Benchmarks**: Learn {ref}`framework-definition-file` for custom evaluations :::{toctree} :caption: Harnesses :hidden: AA-LCR bfcl bigcode-evaluation-harness codec garak genai_perf_eval helm hle ifbench livecodebench lm-evaluation-harness mmath mtbench mteb nemo_skills profbench ruler safety_eval scicode simple_evals tau2_bench tooltalk vlmevalkit ::: ```{list-table} :header-rows: 1 :widths: 18 30 18 8 26 * - Container - Description - Container Ref - Arch - Tasks * - AA-LCR - A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer). - `26.01` - `multiarch` - aa_lcr * - bfcl - The Berkeley Function Calling Leaderboard V3 (also called Berkeley Tool Calling Leaderboard V3) evaluates the LLM's ability to call functions (aka tools) accurately. - `26.01` - `multiarch` - bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting * - bigcode-evaluation-harness - A framework for the evaluation of autoregressive code generation language models. 
- `26.01` - `multiarch` - humaneval, humaneval_instruct, humanevalplus, mbpp-chat, mbpp-completions, mbppplus-chat, mbppplus-completions, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts * - codec - Contamination detection framework for evaluating language models - `26.01` - `amd` - aime_2024, aime_2025, bbq, bfcl_v3, frames, gpqa_diamond, gsm8k_test, gsm8k_train, hellaswag_test, hellaswag_train, hle, ifbench, ifeval, livecodebench_v1, livecodebench_v5, math_500_problem, math_500_solution, mmlu_pro_test, mmlu_test, openai_humaneval, reward_bench_v1, reward_bench_v2, scicode, swebench_test, swebench_train, taubench, terminalbench * - garak - Garak is an LLM vulnerability scanner. - `26.01` - `multiarch` - garak, garak-completions * - genai_perf_eval - GenAI Perf is a tool to evaluate the performance of LLM endpoints, based on GenAI Perf. - `26.01` - `amd` - genai_perf_generation, genai_perf_generation_completions, genai_perf_summarization, genai_perf_summarization_completions * - helm - A framework for evaluating large language models in medical applications across various healthcare tasks - `26.01` - `amd` - aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med * - hle - Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. 
HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. - `26.01` - `multiarch` - hle, hle_aa_v2 * - ifbench - IFBench is a new, challenging benchmark for precise instruction following. - `26.01` - `multiarch` - ifbench, ifbench_aa_v2 * - livecodebench - Holistic and Contamination Free Evaluation of Large Language Models for Code. - `26.01` - `multiarch` - codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225, livecodebench_aa_v2, testoutputprediction * - lm-evaluation-harness - This project provides a unified framework to test generative language models on a large number of different evaluation tasks. - `26.01` - `multiarch` - adlr_agieval_en_cot, adlr_arc_challenge_llama_25_shot, adlr_commonsense_qa_7_shot, adlr_global_mmlu_lite_5_shot, adlr_gpqa_diamond_cot_5_shot, adlr_gsm8k_cot_8_shot, adlr_humaneval_greedy, adlr_humaneval_sampled, adlr_math_500_4_shot_sampled, adlr_mbpp_sanitized_3_shot_greedy, adlr_mbpp_sanitized_3_shot_sampled, adlr_mgsm_native_cot_8_shot, adlr_minerva_math_nemo_4_shot, adlr_mmlu, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, adlr_winogrande_5_shot, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq_chat, bbq_completions, commonsense_qa, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, 
global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str_chat, m_mmlu_id_str_completions, mbpp_plus_chat, mbpp_plus_completions, mgsm, mgsm_cot_chat, mgsm_cot_completions, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_instruct_completions, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox_chat, mmlu_prox_completions, mmlu_prox_de_chat, mmlu_prox_de_completions, mmlu_prox_es_chat, mmlu_prox_es_completions, mmlu_prox_fr_chat, mmlu_prox_fr_completions, mmlu_prox_it_chat, mmlu_prox_it_completions, mmlu_prox_ja_chat, mmlu_prox_ja_completions, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, wikitext, winogrande * - mmath - MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency. 
- `26.01` - `multiarch` - mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh * - mtbench - MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models. - `26.01` - `multiarch` - mtbench, mtbench-cor1 * - mteb - The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark designed to evaluate the performance of text embedding models across a wide range of tasks and datasets. It includes 58 datasets covering 8 tasks and 112 languages. - `26.01` - `multiarch` - MMTEB, MTEB, MTEB_NL_RETRIEVAL, MTEB_VDR, RTEB, ViDoReV1, ViDoReV2, ViDoReV3, ViDoReV3_Text, ViDoReV3_Text_Image, custom_beir_task, fiqa, hotpotqa, miracl, miracl_lite, mldr, mlqa, nano_fiqa, nq * - nemo_skills - NeMo Skills - a project to improve skills of LLMs - `26.01` - `multiarch` - ns_aa_lcr, ns_aime2024, ns_aime2025, ns_bfcl_v3, ns_bfcl_v4, ns_gpqa, ns_hle, ns_hle_aa, ns_hmmt_feb2025, ns_ifbench, ns_ifeval, ns_livecodebench, ns_livecodebench_aa, ns_livecodebench_v5, ns_mmlu, ns_mmlu_pro, ns_mmlu_prox, ns_ruler, ns_scicode, ns_wmt24pp * - profbench - Professional domain benchmark for evaluating LLMs on Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks - `26.01` - `multiarch` - llm_judge, report_generation * - ruler - RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. 
- `26.01` - `multiarch` - ruler-128k-chat, ruler-128k-completions, ruler-16k-chat, ruler-16k-completions, ruler-1m-chat, ruler-1m-completions, ruler-256k-chat, ruler-256k-completions, ruler-32k-chat, ruler-32k-completions, ruler-4k-chat, ruler-4k-completions, ruler-512k-chat, ruler-512k-completions, ruler-64k-chat, ruler-64k-completions, ruler-8k-chat, ruler-8k-completions, ruler-chat, ruler-completions * - safety_eval - Harness for Safety evaluations - `25.11` - `multiarch` - aegis_v2, aegis_v2_reasoning, wildguard * - scicode - SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. - `26.01` - `multiarch` - scicode, scicode_aa_v2, scicode_background * - simple_evals - simple-evals - a lightweight library for evaluating language models. - `26.01` - `multiarch` - AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, AIME_2025_aa_v2, aime_2024_nemo, aime_2025_nemo, browsecomp, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_aa_v3, gpqa_diamond_nemo, gpqa_extended, gpqa_main, healthbench, healthbench_consensus, healthbench_hard, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mgsm_aa_v2, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_aa_v2, mmlu_pro_aa_v3, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa * - tau2_bench - Evaluating Conversational Agents in a Dual-Control 
Environment - `26.01` - `multiarch` - tau2_bench_airline, tau2_bench_retail, tau2_bench_telecom * - tooltalk - ToolTalk is designed to evaluate tool-augmented LLMs as a chatbot. ToolTalk contains a handcrafted dataset of 28 easy conversations and 50 hard conversations. - `26.01` - `multiarch` - tooltalk * - vlmevalkit - VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction. - `26.01` - `amd` - ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocr_reasoning, ocrbench, slidevqa ``` # AA-LCR This page contains all evaluation tasks for the **AA-LCR** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [aa_lcr](#aa-lcr-aa-lcr) - A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer). ``` (aa-lcr-aa-lcr)= ## aa_lcr A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer). 
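As with the launcher examples earlier on this page, a minimal `config.yml` is enough to run this task; endpoint details are supplied with `-o` overrides at run time. Note that `aa_lcr` also calls a judge model, whose API key is read from the `JUDGE_API_KEY` environment variable by default (see the Defaults tab):

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: aa_lcr
```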
::::{tab-set} :::{tab-item} Container **Harness:** `AA-LCR` **Container:** ``` nvcr.io/nvidia/eval-factory/aa-lcr:26.01 ``` **Container Digest:** ``` sha256:67dd35302ed15610afc9471a2ff4f515d95a235753f1b259db60748249366939 ``` **Container Arch:** `multiarch` **Task Type:** `aa_lcr` ::: :::{tab-item} Command ```bash aa_lcr --model={{target.api_endpoint.model_id}} --endpoint_url={{target.api_endpoint.url}} --temperature={{config.params.temperature}} --top_p={{config.params.top_p}} --request_timeout={{config.params.request_timeout}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --output_dir={{config.output_dir}} {% if target.api_endpoint.api_key_name is not none %}--api_key_name={{target.api_endpoint.api_key_name}}{% endif %} --max_retries={{config.params.max_retries}} --max_new_tokens={{config.params.max_new_tokens}} --async_limit={{config.params.parallelism}} --num_repeats={{config.params.extra.n_samples}} --seed={{config.params.extra.seed}} --judge_model={{config.params.extra.judge.model_id}} --judge_url={{config.params.extra.judge.url}} --judge_temperature={{config.params.extra.judge.temperature}} --judge_top_p={{config.params.extra.judge.top_p}} --judge_max_new_tokens={{config.params.extra.judge.max_new_tokens}} --judge_async_limit={{config.params.extra.judge.parallelism}} {% if config.params.extra.judge.api_key is defined %}--judge_api_key_name={{config.params.extra.judge.api_key}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: AA-LCR pkg_name: aa_lcr config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 temperature: 0.0 request_timeout: 600 top_p: 1.0 extra: n_samples: 3 seed: 42 judge: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: nvdev/qwen/qwen-235b request_timeout: 600 max_retries: 30 temperature: 0.0 top_p: 1.0 max_new_tokens: 1024 parallelism: 10 api_key: JUDGE_API_KEY supported_endpoint_types: - chat type: aa_lcr target: api_endpoint: {} ``` ::: 
:::: # bfcl This page contains all evaluation tasks for the **bfcl** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [bfclv2](#bfcl-bfclv2) - BFCL v2 with Single-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling. * - [bfclv2_ast](#bfcl-bfclv2-ast) - BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Uses native function calling. * - [bfclv2_ast_prompting](#bfcl-bfclv2-ast-prompting) - BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Not using native function calling. * - [bfclv3](#bfcl-bfclv3) - BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling. * - [bfclv3_ast](#bfcl-bfclv3-ast) - BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Uses native function calling. * - [bfclv3_ast_prompting](#bfcl-bfclv3-ast-prompting) - BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Not using native function calling. ``` (bfcl-bfclv2)= ## bfclv2 BFCL v2 with Single-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling. ::::{tab-set} :::{tab-item} Container **Harness:** `bfcl` **Container:** ``` nvcr.io/nvidia/eval-factory/bfcl:26.01 ``` **Container Digest:** ``` sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9 ``` **Container Arch:** `multiarch` **Task Type:** `bfclv2` ::: :::{tab-item} Command ```bash {%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." 
&& export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \ --dataset_format {{config.params.extra.custom_dataset.format}} \ --dataset_path {{config.params.extra.custom_dataset.path}} \ --test_category {{config.params.task}} \ --processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \ {% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \ echo "Using custom dataset at ${BFCL_DATA_DIR}" && \ {% endif -%} {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \ {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bfcl pkg_name: bfcl config: params: parallelism: 10 task: single_turn extra: native_calling: false custom_dataset: path: null format: null data_template_path: null supported_endpoint_types: - chat - vlm type: bfclv2 target: api_endpoint: {} ``` ::: :::: --- (bfcl-bfclv2-ast)= ## bfclv2_ast BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Uses native function calling. 
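The `_ast` and `_ast_prompting` variants differ only in how tool schemas reach the model. With native function calling, the schemas travel in the API's `tools` field; the prompting variants serialize the same schemas into the prompt text instead. A minimal sketch of a native-calling request body in the OpenAI-style chat format (the function, model, and values here are illustrative, not part of BFCL):

```python
import json

# Tool schema delivered natively via the "tools" field (OpenAI-style chat API).
# The get_weather function below is invented for this sketch; BFCL supplies
# its own schemas per test sample.
request_body = {
    "model": "meta/llama-3.2-3b-instruct",
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

# A prompting-based variant would instead serialize the same schema into the
# message text and parse the tool call out of the model's free-form reply.
print(json.dumps(request_body, indent=2))
```

The `native_calling` value passed through `--model-args` in the Command tab is what toggles between these two request styles.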
::::{tab-set} :::{tab-item} Container **Harness:** `bfcl` **Container:** ``` nvcr.io/nvidia/eval-factory/bfcl:26.01 ``` **Container Digest:** ``` sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9 ``` **Container Arch:** `multiarch` **Task Type:** `bfclv2_ast` ::: :::{tab-item} Command ```bash {%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \ --dataset_format {{config.params.extra.custom_dataset.format}} \ --dataset_path {{config.params.extra.custom_dataset.path}} \ --test_category {{config.params.task}} \ --processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \ {% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \ echo "Using custom dataset at ${BFCL_DATA_DIR}" && \ {% endif -%} {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \ {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bfcl pkg_name: 
bfcl config: params: parallelism: 10 task: ast extra: native_calling: true custom_dataset: path: null format: null data_template_path: null supported_endpoint_types: - chat - vlm type: bfclv2_ast target: api_endpoint: {} ``` ::: :::: --- (bfcl-bfclv2-ast-prompting)= ## bfclv2_ast_prompting BFCL v2 with Single-turn, Live and Non-Live, AST evaluation only. Not using native function calling. ::::{tab-set} :::{tab-item} Container **Harness:** `bfcl` **Container:** ``` nvcr.io/nvidia/eval-factory/bfcl:26.01 ``` **Container Digest:** ``` sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9 ``` **Container Arch:** `multiarch` **Task Type:** `bfclv2_ast_prompting` ::: :::{tab-item} Command ```bash {%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \ --dataset_format {{config.params.extra.custom_dataset.format}} \ --dataset_path {{config.params.extra.custom_dataset.path}} \ --test_category {{config.params.task}} \ --processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \ {% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \ echo "Using custom dataset at ${BFCL_DATA_DIR}" && \ {% endif -%} {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \ {% if target.api_endpoint.api_key_name is not none 
%}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bfcl pkg_name: bfcl config: params: parallelism: 10 task: ast extra: native_calling: false custom_dataset: path: null format: null data_template_path: null supported_endpoint_types: - chat - vlm type: bfclv2_ast_prompting target: api_endpoint: {} ``` ::: :::: --- (bfcl-bfclv3)= ## bfclv3 BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST and Exec evaluation. Not using native function calling. ::::{tab-set} :::{tab-item} Container **Harness:** `bfcl` **Container:** ``` nvcr.io/nvidia/eval-factory/bfcl:26.01 ``` **Container Digest:** ``` sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9 ``` **Container Arch:** `multiarch` **Task Type:** `bfclv3` ::: :::{tab-item} Command ```bash {%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." 
&& export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \ --dataset_format {{config.params.extra.custom_dataset.format}} \ --dataset_path {{config.params.extra.custom_dataset.path}} \ --test_category {{config.params.task}} \ --processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \ {% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \ echo "Using custom dataset at ${BFCL_DATA_DIR}" && \ {% endif -%} {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \ {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bfcl pkg_name: bfcl config: params: parallelism: 10 task: all extra: native_calling: false custom_dataset: path: null format: null data_template_path: null supported_endpoint_types: - chat - vlm type: bfclv3 target: api_endpoint: {} ``` ::: :::: --- (bfcl-bfclv3-ast)= ## bfclv3_ast BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Uses native function calling. 
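The multi-turn categories added in BFCL v3 require the model to chain tool calls across turns, using each tool's result to decide the next call. In OpenAI-style chat APIs, such an exchange is carried as a growing message list along these lines (a hedged sketch; the tool names, call IDs, and values are invented for illustration):

```python
# Illustrative multi-turn tool-calling transcript (OpenAI-style roles).
# Tool names, call IDs, and results are made up for this sketch.
history = [
    {"role": "user", "content": "Book the cheapest flight from SFO to JFK."},
    {  # Turn 1: the model requests data via a tool call.
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "search_flights",
                         "arguments": '{"origin": "SFO", "dest": "JFK"}'},
        }],
    },
    {  # The harness executes the tool and feeds its result back.
        "role": "tool",
        "tool_call_id": "call_1",
        "content": '{"flights": [{"id": "F42", "price": 129}]}',
    },
    {  # Turn 2: the model acts on that result with a second call.
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_2",
            "type": "function",
            "function": {"name": "book_flight",
                         "arguments": '{"flight_id": "F42"}'},
        }],
    },
]

# Multi-turn evaluation checks each call in the context of prior results,
# not just the final answer.
tool_calls = [c["function"]["name"]
              for m in history if m["role"] == "assistant"
              for c in m.get("tool_calls", [])]
print(tool_calls)  # ['search_flights', 'book_flight']
```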
::::{tab-set} :::{tab-item} Container **Harness:** `bfcl` **Container:** ``` nvcr.io/nvidia/eval-factory/bfcl:26.01 ``` **Container Digest:** ``` sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9 ``` **Container Arch:** `multiarch` **Task Type:** `bfclv3_ast` ::: :::{tab-item} Command ```bash {%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \ --dataset_format {{config.params.extra.custom_dataset.format}} \ --dataset_path {{config.params.extra.custom_dataset.path}} \ --test_category {{config.params.task}} \ --processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \ {% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \ echo "Using custom dataset at ${BFCL_DATA_DIR}" && \ {% endif -%} {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \ {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bfcl pkg_name: 
bfcl config: params: parallelism: 10 task: multi_turn,ast extra: native_calling: true custom_dataset: path: null format: null data_template_path: null supported_endpoint_types: - chat - vlm type: bfclv3_ast target: api_endpoint: {} ``` ::: :::: --- (bfcl-bfclv3-ast-prompting)= ## bfclv3_ast_prompting BFCL v3 with Single-turn and Multi-turn, Live and Non-Live, AST evaluation. Not using native function calling. ::::{tab-set} :::{tab-item} Container **Harness:** `bfcl` **Container:** ``` nvcr.io/nvidia/eval-factory/bfcl:26.01 ``` **Container Digest:** ``` sha256:5016e1f2b9984f5d348ac3806974d7b5d6ff6f550605f3220a3f08318e0c60c9 ``` **Container Arch:** `multiarch` **Task Type:** `bfclv3_ast_prompting` ::: :::{tab-item} Command ```bash {%- if config.params.extra.custom_dataset.path is not none and config.params.extra.custom_dataset.format is not none -%} echo "Processing custom dataset..." && export BFCL_DATA_DIR=$(core-evals-process-custom-dataset \ --dataset_format {{config.params.extra.custom_dataset.format}} \ --dataset_path {{config.params.extra.custom_dataset.path}} \ --test_category {{config.params.task}} \ --processing_output_dir {{config.output_dir ~ "/custom_dataset_processing"}} \ {% if config.params.extra.custom_dataset.data_template_path %}--data_template_path {{config.params.extra.custom_dataset.data_template_path}}{% endif %}) && \ echo "Using custom dataset at ${BFCL_DATA_DIR}" && \ {% endif -%} {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} --num-threads {{config.params.parallelism}} && \ {% if target.api_endpoint.api_key_name is not none 
%}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir {{config.output_dir}} --model-args base_url={{target.api_endpoint.url}},native_calling={{config.params.extra.native_calling}}
```
:::
:::{tab-item} Defaults
```yaml
framework_name: bfcl
pkg_name: bfcl
config:
  params:
    parallelism: 10
    task: multi_turn,ast
    extra:
      native_calling: false
      custom_dataset:
        path: null
        format: null
        data_template_path: null
  supported_endpoint_types:
    - chat
    - vlm
  type: bfclv3_ast_prompting
target:
  api_endpoint: {}
```
:::
::::

# bigcode-evaluation-harness

This page contains all evaluation tasks for the **bigcode-evaluation-harness** harness.

```{list-table}
:header-rows: 1
:widths: 30 70

* - Task
  - Description
* - [humaneval](#bigcode-evaluation-harness-humaneval)
  - HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
* - [humaneval_instruct](#bigcode-evaluation-harness-humaneval-instruct)
  - InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, the signature, docstring, and header are extracted to create a flexible setting for evaluating instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that elicits the model's best capabilities.
* - [humanevalplus](#bigcode-evaluation-harness-humanevalplus)
  - HumanEvalPlus is a modified version of HumanEval containing 80x more test cases.
* - [mbpp-chat](#bigcode-evaluation-harness-mbpp-chat) - MBPP consists of Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the chat endpoint. * - [mbpp-completions](#bigcode-evaluation-harness-mbpp-completions) - MBPP consists of Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the completions endpoint. * - [mbppplus-chat](#bigcode-evaluation-harness-mbppplus-chat) - MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint. * - [mbppplus-completions](#bigcode-evaluation-harness-mbppplus-completions) - MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the completions endpoint. * - [mbppplus_nemo](#bigcode-evaluation-harness-mbppplus-nemo) - MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template. * - [multiple-cpp](#bigcode-evaluation-harness-multiple-cpp) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cpp" subset. * - [multiple-cs](#bigcode-evaluation-harness-multiple-cs) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cs" subset. * - [multiple-d](#bigcode-evaluation-harness-multiple-d) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "d" subset. * - [multiple-go](#bigcode-evaluation-harness-multiple-go) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "go" subset. 
* - [multiple-java](#bigcode-evaluation-harness-multiple-java) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "java" subset. * - [multiple-jl](#bigcode-evaluation-harness-multiple-jl) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "jl" subset. * - [multiple-js](#bigcode-evaluation-harness-multiple-js) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "js" subset. * - [multiple-lua](#bigcode-evaluation-harness-multiple-lua) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "lua" subset. * - [multiple-php](#bigcode-evaluation-harness-multiple-php) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "php" subset. * - [multiple-pl](#bigcode-evaluation-harness-multiple-pl) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "pl" subset. * - [multiple-py](#bigcode-evaluation-harness-multiple-py) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "py" subset. * - [multiple-r](#bigcode-evaluation-harness-multiple-r) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "r" subset. * - [multiple-rb](#bigcode-evaluation-harness-multiple-rb) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rb" subset. * - [multiple-rkt](#bigcode-evaluation-harness-multiple-rkt) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rkt" subset. * - [multiple-rs](#bigcode-evaluation-harness-multiple-rs) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rs" subset. * - [multiple-scala](#bigcode-evaluation-harness-multiple-scala) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "scala" subset. 
* - [multiple-sh](#bigcode-evaluation-harness-multiple-sh) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "sh" subset. * - [multiple-swift](#bigcode-evaluation-harness-multiple-swift) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "swift" subset. * - [multiple-ts](#bigcode-evaluation-harness-multiple-ts) - MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "ts" subset. ``` (bigcode-evaluation-harness-humaneval)= ## humaneval HumanEval is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. ::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `humaneval` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p 
{{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: humaneval temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 20 supported_endpoint_types: - completions type: humaneval target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-humaneval-instruct)= ## humaneval_instruct InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring, and its header to create a flexible setting that allows evaluating instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that allows the model to elicit its best capabilities. 
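The `n_samples` value in the defaults below controls how many completions are drawn per problem; functional-correctness metrics such as pass@k are then estimated from those samples. As a sketch of the standard unbiased estimator (not the harness's internal code), with `n` generations of which `c` pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn without replacement from n generations, of which c are
    correct, passes all tests."""
    if n - c < k:
        # Fewer incorrect generations than k: every draw of k must
        # contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n_samples=20 generations and 5 correct solutions, pass@1 is the
# fraction of correct generations:
print(round(pass_at_k(20, 5, 1), 2))  # 0.25
```

Averaging this quantity over all problems in the benchmark yields the reported pass@k score.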
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `humaneval_instruct` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: instruct-humaneval-nocontext-py temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 20 supported_endpoint_types: - chat type: humaneval_instruct target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-humanevalplus)= ## humanevalplus HumanEvalPlus is a modified version of HumanEval containing 80x more test cases. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `humanevalplus` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: humanevalplus temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: humanevalplus target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-mbpp-chat)= ## mbpp-chat MBPP consists of Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. 
Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the chat endpoint. ::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `mbpp-chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 10 task: mbpp temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 10 supported_endpoint_types: - chat type: mbpp-chat target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-mbpp-completions)= ## mbpp-completions MBPP consists of Python programming problems, designed to be solvable by entry 
level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. This variant uses the completions endpoint. ::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `mbpp-completions` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 10 task: mbpp temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 10 supported_endpoint_types: - completions type: mbpp-completions target: api_endpoint: {} ``` ::: :::: --- 
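The Defaults tabs above list each task's baseline configuration; at run time, user-supplied overrides are layered on top of these values, with nested keys (such as `config.params.extra`) merged rather than replaced wholesale. As a sketch of that general pattern (not the SDK's actual implementation), a recursive dictionary merge behaves like this:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict.
    Nested dicts are merged key by key; scalars in override win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Abbreviated mbpp-completions defaults from the tab above.
defaults = {
    "config": {
        "params": {
            "max_new_tokens": 2048,
            "temperature": 0.1,
            "task": "mbpp",
            "extra": {"do_sample": True, "n_samples": 10},
        },
        "type": "mbpp-completions",
    },
}

# A partial user override: only the keys being changed are supplied.
user_override = {
    "config": {"params": {"temperature": 0.0, "extra": {"n_samples": 1}}}
}

resolved = deep_merge(defaults, user_override)
print(resolved["config"]["params"]["temperature"])          # 0.0
print(resolved["config"]["params"]["extra"]["n_samples"])   # 1
print(resolved["config"]["params"]["max_new_tokens"])       # 2048
```

Note that untouched sibling keys (`max_new_tokens`, `do_sample`, `type`) survive the merge, which is why an override file only needs to state the parameters that differ from the defaults.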
(bigcode-evaluation-harness-mbppplus-chat)= ## mbppplus-chat MBPP+ is a modified version of MBPP containing 35x more test cases. This variant uses the chat endpoint. ::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `mbppplus-chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 10 task: mbppplus temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - chat type: mbppplus-chat target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-mbppplus-completions)= ## mbppplus-completions MBPP+ is a modified 
version of MBPP containing 35x more test cases. This variant uses the completions endpoint. ::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `mbppplus-completions` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 10 task: mbppplus temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: mbppplus-completions target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-mbppplus-nemo)= ## mbppplus_nemo MBPP+NeMo is a modified version of MBPP+ that uses the NeMo alignment prompt template. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `mbppplus_nemo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 10 task: mbppplus_nemo temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - chat type: mbppplus_nemo target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-cpp)= ## multiple-cpp MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cpp" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-cpp` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-cpp temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-cpp target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-cs)= ## multiple-cs MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "cs" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-cs` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-cs temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-cs target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-d)= ## multiple-d MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "d" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-d` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-d temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-d target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-go)= ## multiple-go MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "go" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-go` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-go temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-go target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-java)= ## multiple-java MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "java" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-java` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-java temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-java target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-jl)= ## multiple-jl MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "jl" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-jl` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-jl temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-jl target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-js)= ## multiple-js MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "js" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-js` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-js temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-js target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-lua)= ## multiple-lua MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "lua" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-lua` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-lua temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-lua target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-php)= ## multiple-php MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "php" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-php` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-php temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-php target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-pl)= ## multiple-pl MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "pl" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-pl` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-pl temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-pl target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-py)= ## multiple-py MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "py" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-py` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-py temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-py target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-r)= ## multiple-r MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "r" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-r` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-r temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-r target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-rb)= ## multiple-rb MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rb" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-rb` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-rb temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-rb target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-rkt)= ## multiple-rkt MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rkt" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-rkt` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-rkt temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-rkt target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-rs)= ## multiple-rs MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "rs" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-rs` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-rs temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-rs target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-scala)= ## multiple-scala MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "scala" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-scala` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-scala temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-scala target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-sh)= ## multiple-sh MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "sh" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-sh` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-sh temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-sh target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-swift)= ## multiple-swift MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "swift" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-swift` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-swift temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-swift target: api_endpoint: {} ``` ::: :::: --- (bigcode-evaluation-harness-multiple-ts)= ## multiple-ts MultiPL-E is a suite of coding tasks for many programming languages. This task covers the "ts" subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `bigcode-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:4efb7706a525e248347773d7368c8305ca311cbed2fdafc837a9315164170acd ``` **Container Arch:** `multiarch` **Task Type:** `multiple-ts` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}NVCF_TOKEN=${{target.api_endpoint.api_key_name}}{% endif %} bigcode-eval --model_type {% if target.api_endpoint.type == "completions" %}nim-base{% elif target.api_endpoint.type == "chat" %}nim-chat{% endif %} --url {{target.api_endpoint.url}} --model_kwargs '{"model_name": "{{target.api_endpoint.model_id}}", "timeout": {{config.params.request_timeout}}, "connection_retries": {{config.params.max_retries}}}' --out_dir {{config.output_dir}} --task {{config.params.task}} --allow_code_execution --n_samples={{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --max_new_tokens={{config.params.max_new_tokens}} --do_sample={{config.params.extra.do_sample}} --top_p {{config.params.top_p}} --temperature {{config.params.temperature}} --async_limit {{config.params.parallelism}}{% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: bigcode-evaluation-harness pkg_name: bigcode_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: multiple-ts temperature: 0.1 request_timeout: 30 top_p: 0.95 extra: do_sample: true n_samples: 5 supported_endpoint_types: - completions type: multiple-ts target: api_endpoint: {} ``` ::: :::: # codec This page contains all evaluation tasks for the **codec** harness. 
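The Command tab on each task page is a Jinja2 template that the evaluator fills in from the task's Defaults and the target endpoint at run time. As a rough illustration of how those substitutions resolve, the following stdlib-only Python sketch assembles the codec command line from the `aime_2024` defaults; the `model_id` and `url` values are placeholders, not real endpoints:

```python
# Assemble the codec command from a task's "Defaults" config (stdlib only).
# The real pipeline renders a Jinja2 template; this mirrors the same substitutions.

defaults = {  # values from the aime_2024 "Defaults" tab
    "params": {
        "task": "aime_2024", "temperature": 0.0, "top_p": 1.0,
        "parallelism": 20, "max_retries": 10, "request_timeout": 120,
        "limit_samples": 1000,
        "extra": {"contamination_type": "in_context", "n_context_seeds": 5,
                  "min_length": 100, "max_length": 2048},
    },
}
endpoint = {  # placeholder target.api_endpoint values
    "model_id": "my-model",
    "url": "http://localhost:8000/v1/completions",
}

p, x = defaults["params"], defaults["params"]["extra"]
cmd = [
    "codec", "--model", endpoint["model_id"], "--eval_name", p["task"],
    "--contamination_type", x["contamination_type"], "--url", endpoint["url"],
    "--temperature", str(p["temperature"]), "--top_p", str(p["top_p"]),
    "--num_threads", str(p["parallelism"]), "--max_retries", str(p["max_retries"]),
    "--timeout", str(p["request_timeout"]),
    "--min_length", str(x["min_length"]), "--max_length", str(x["max_length"]),
]
if p["limit_samples"] is not None:  # mirrors the {% if %} guard in the template
    cmd += ["--n_samples", str(p["limit_samples"])]

print(" ".join(cmd))
```

The `{% if config.params.limit_samples is not none %}` guard in the real template corresponds to the `is not None` check here: setting `limit_samples: null` in the config simply drops the `--n_samples` flag.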
```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [aime_2024](#codec-aime-2024) - Task for detecting contamination with the AIME 2024 dataset * - [aime_2025](#codec-aime-2025) - Task for detecting contamination with the AIME 2025 dataset * - [bbq](#codec-bbq) - Task for detecting contamination with the BBQ dataset * - [bfcl_v3](#codec-bfcl-v3) - Task for detecting contamination with the BFCL v3 dataset * - [frames](#codec-frames) - Task for detecting contamination with the FRAMES dataset * - [gpqa_diamond](#codec-gpqa-diamond) - Task for detecting contamination with the GPQA Diamond subset * - [gsm8k_test](#codec-gsm8k-test) - Task for detecting contamination with the GSM8K test set * - [gsm8k_train](#codec-gsm8k-train) - Task for detecting contamination with the GSM8K train set * - [hellaswag_test](#codec-hellaswag-test) - Task for detecting contamination with the HellaSwag test set * - [hellaswag_train](#codec-hellaswag-train) - Task for detecting contamination with the HellaSwag train set * - [hle](#codec-hle) - Task for detecting contamination with the HLE dataset * - [ifbench](#codec-ifbench) - Task for detecting contamination with the IFBench dataset * - [ifeval](#codec-ifeval) - Task for detecting contamination with the IFEval dataset * - [livecodebench_v1](#codec-livecodebench-v1) - Task for detecting contamination with the LiveCodeBench v1 dataset * - [livecodebench_v5](#codec-livecodebench-v5) - Task for detecting contamination with the LiveCodeBench v5 dataset * - [math_500_problem](#codec-math-500-problem) - Task for detecting contamination with the MATH-500 dataset (problem statements) * - [math_500_solution](#codec-math-500-solution) - Task for detecting contamination with the MATH-500 dataset (solutions) * - [mmlu_pro_test](#codec-mmlu-pro-test) - Task for detecting contamination with the MMLU-Pro test set * - [mmlu_test](#codec-mmlu-test) - Task for detecting contamination with the MMLU test set * - 
[openai_humaneval](#codec-openai-humaneval) - Task for detecting contamination with the OpenAI HumanEval dataset * - [reward_bench_v1](#codec-reward-bench-v1) - Task for detecting contamination with the Reward Bench v1 dataset * - [reward_bench_v2](#codec-reward-bench-v2) - Task for detecting contamination with the Reward Bench v2 dataset * - [scicode](#codec-scicode) - Task for detecting contamination with the SciCode dataset * - [swebench_test](#codec-swebench-test) - Task for detecting contamination with the SWE-bench dataset (test split) * - [swebench_train](#codec-swebench-train) - Task for detecting contamination with the SWE-bench dataset (train split) * - [taubench](#codec-taubench) - Task for detecting contamination with the Tau-bench dataset * - [terminalbench](#codec-terminalbench) - Task for detecting contamination with the Terminal-Bench dataset ``` (codec-aime-2024)= ## aime_2024 Task for detecting contamination with the AIME 2024 dataset ::::{tab-set} :::{tab-item} Container **Harness:** `codec` **Container:** ``` nvcr.io/nvidia/eval-factory/contamination-detection:26.01 ``` **Container Digest:** ``` sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2 ``` **Container Arch:** `amd` **Task Type:** `aime_2024` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} 
--max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: codec pkg_name: codec config: params: limit_samples: 1000 max_retries: 10 parallelism: 20 task: aime_2024 temperature: 0.0 request_timeout: 120 top_p: 1.0 extra: contamination_type: in_context n_context_seeds: 5 min_length: 100 max_length: 2048 supported_endpoint_types: - completions type: aime_2024 target: api_endpoint: {} ``` ::: :::: --- (codec-aime-2025)= ## aime_2025 Task for detecting contamination with the AIME 2025 dataset ::::{tab-set} :::{tab-item} Container **Harness:** `codec` **Container:** ``` nvcr.io/nvidia/eval-factory/contamination-detection:26.01 ``` **Container Digest:** ``` sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2 ``` **Container Arch:** `amd` **Task Type:** `aime_2025` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: codec pkg_name: codec config: params: limit_samples: 1000 max_retries: 10 parallelism: 20 task: aime_2025 temperature: 0.0 
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: aime_2025
target:
  api_endpoint: {}
```

:::
::::

---

(codec-bbq)=
## bbq

Task for detecting contamination with the BBQ dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `bbq`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bbq
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bbq
target:
  api_endpoint: {}
```

:::
::::

---

(codec-bfcl-v3)=
## bfcl_v3

Task for detecting contamination with the BFCL v3 dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `bfcl_v3`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: bfcl_v3
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: bfcl_v3
target:
  api_endpoint: {}
```

:::
::::

---

(codec-frames)=
## frames

Task for detecting contamination with the FRAMES dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `frames`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: frames
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: frames
target:
  api_endpoint: {}
```

:::
::::

---

(codec-gpqa-diamond)=
## gpqa_diamond

Task for detecting contamination with the GPQA Diamond dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `gpqa_diamond`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gpqa_diamond
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gpqa_diamond
target:
  api_endpoint: {}
```

:::
::::

---

(codec-gsm8k-test)=
## gsm8k_test

Task for detecting contamination with the GSM8K test set

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `gsm8k_test`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_test
target:
  api_endpoint: {}
```

:::
::::

---

(codec-gsm8k-train)=
## gsm8k_train

Task for detecting contamination with the GSM8K train set

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `gsm8k_train`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: gsm8k_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: gsm8k_train
target:
  api_endpoint: {}
```

:::
::::

---

(codec-hellaswag-test)=
## hellaswag_test

Task for detecting contamination with the HellaSwag test set

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `hellaswag_test`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_test
target:
  api_endpoint: {}
```

:::
::::

---

(codec-hellaswag-train)=
## hellaswag_train

Task for detecting contamination with the HellaSwag train set

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `hellaswag_train`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hellaswag_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hellaswag_train
target:
  api_endpoint: {}
```

:::
::::

---

(codec-hle)=
## hle

Task for detecting contamination with the HLE dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `hle`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: hle
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: hle
target:
  api_endpoint: {}
```

:::
::::

---

(codec-ifbench)=
## ifbench

Task for detecting contamination with the IFBench dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `ifbench`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifbench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifbench
target:
  api_endpoint: {}
```

:::
::::

---

(codec-ifeval)=
## ifeval

Task for detecting contamination with the IFEval dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `ifeval`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: ifeval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: ifeval
target:
  api_endpoint: {}
```

:::
::::

---

(codec-livecodebench-v1)=
## livecodebench_v1

Task for detecting contamination with the LiveCodeBench v1 dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `livecodebench_v1`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v1
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v1
target:
  api_endpoint: {}
```

:::
::::

---

(codec-livecodebench-v5)=
## livecodebench_v5

Task for detecting contamination with the LiveCodeBench v5 dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `livecodebench_v5`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: livecodebench_v5
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: livecodebench_v5
target:
  api_endpoint: {}
```

:::
::::

---

(codec-math-500-problem)=
## math_500_problem

Task for detecting contamination with the MATH-500 dataset (problem statements)

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `math_500_problem`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_problem
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_problem
target:
  api_endpoint: {}
```

:::
::::

---

(codec-math-500-solution)=
## math_500_solution

Task for detecting contamination with the MATH-500 dataset (solutions)

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `math_500_solution`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: math_500_solution
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: math_500_solution
target:
  api_endpoint: {}
```

:::
::::

---

(codec-mmlu-pro-test)=
## mmlu_pro_test

Task for detecting contamination with the MMLU-Pro test set

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `mmlu_pro_test`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_pro_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_pro_test
target:
  api_endpoint: {}
```

:::
::::

---

(codec-mmlu-test)=
## mmlu_test

Task for detecting contamination with the MMLU test set

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `mmlu_test`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: mmlu_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: mmlu_test
target:
  api_endpoint: {}
```

:::
::::

---

(codec-openai-humaneval)=
## openai_humaneval

Task for detecting contamination with the OpenAI HumanEval dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `openai_humaneval`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: openai_humaneval
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: openai_humaneval
target:
  api_endpoint: {}
```

:::
::::

---

(codec-reward-bench-v1)=
## reward_bench_v1

Task for detecting contamination with the RewardBench v1 dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `reward_bench_v1`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: reward_bench_v1
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: reward_bench_v1
target:
  api_endpoint: {}
```

:::
::::

---

(codec-reward-bench-v2)=
## reward_bench_v2

Task for detecting contamination with the RewardBench v2 dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `reward_bench_v2`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: reward_bench_v2
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: reward_bench_v2
target:
  api_endpoint: {}
```

:::
::::

---

(codec-scicode)=
## scicode

Task for detecting contamination with the SciCode dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `scicode`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: scicode
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: scicode
target:
  api_endpoint: {}
```

:::
::::

---

(codec-swebench-test)=
## swebench_test

Task for detecting contamination with the SWE-bench dataset (test split)

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `swebench_test`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: swebench_test
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: swebench_test
target:
  api_endpoint: {}
```

:::
::::

---

(codec-swebench-train)=
## swebench_train

Task for detecting contamination with the SWE-bench dataset (train split)

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `swebench_train`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: swebench_train
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: swebench_train
target:
  api_endpoint: {}
```

:::
::::

---

(codec-taubench)=
## taubench

Task for detecting contamination with the Tau-bench dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `taubench`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: codec
pkg_name: codec
config:
  params:
    limit_samples: 1000
    max_retries: 10
    parallelism: 20
    task: taubench
    temperature: 0.0
    request_timeout: 120
    top_p: 1.0
    extra:
      contamination_type: in_context
      n_context_seeds: 5
      min_length: 100
      max_length: 2048
  supported_endpoint_types:
  - completions
  type: taubench
target:
  api_endpoint: {}
```

:::
::::

---

(codec-terminalbench)=
## terminalbench

Task for detecting contamination with the Terminal-Bench dataset

::::{tab-set}

:::{tab-item} Container
**Harness:** `codec`

**Container:**

```
nvcr.io/nvidia/eval-factory/contamination-detection:26.01
```

**Container Digest:**

```
sha256:e16f56f78f4b36b3b1b6ce6da3d6bef7937ea578b1b0ba4595a1c71f927550e2
```

**Container Arch:** `amd`

**Task Type:** `terminalbench`
:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} codec --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --contamination_type {{config.params.extra.contamination_type}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --n_context_seeds {{config.params.extra.n_context_seeds}} --out_dir {{config.output_dir}}/results --cache_dir {{config.output_dir}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} --min_length {{config.params.extra.min_length}} --max_length {{config.params.extra.max_length}} {% if config.params.limit_samples is not none %} --n_samples {{config.params.limit_samples}}{% endif %}
``` ::: :::{tab-item} Defaults ```yaml framework_name: codec pkg_name: codec config: params: limit_samples: 1000 max_retries: 10 parallelism: 20 task: terminalbench temperature: 0.0 request_timeout: 120 top_p: 1.0 extra: contamination_type: in_context n_context_seeds: 5 min_length: 100 max_length: 2048 supported_endpoint_types: - completions type: terminalbench target: api_endpoint: {} ``` ::: :::: # garak This page contains all evaluation tasks for the **garak** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [garak](#garak-garak) - Task for running the default set of Garak probes. This variant uses the chat endpoint. * - [garak-completions](#garak-garak-completions) - Task for running the default set of Garak probes. This variant uses the completions endpoint. ``` (garak-garak)= ## garak Task for running the default set of Garak probes. This variant uses the chat endpoint. ::::{tab-set} :::{tab-item} Container **Harness:** `garak` **Container:** ``` nvcr.io/nvidia/eval-factory/garak:26.01 ``` **Container Digest:** ``` sha256:72514ac2c35f76fdb139b02f1c1d4159103969946a121592e50b129087dd455e ``` **Container Arch:** `multiarch` **Task Type:** `garak` ::: :::{tab-item} Command ```bash cat > garak_config.yaml << 'EOF' {% if config.params.extra.seed is not none %}run: seed: {{config.params.extra.seed}}{% endif %} plugins: {% if config.params.extra.probes is not none %}probe_spec: {{config.params.extra.probes}}{% endif %} extended_detectors: true target_type: {% if target.api_endpoint.type == "completions" %}nim.NVOpenAICompletion{% elif target.api_endpoint.type == "chat" %}nim.NVOpenAIChat{% endif %} target_name: {{target.api_endpoint.model_id}} generators: nim: uri: {{target.api_endpoint.url | replace('/chat/completions', '') | replace('/completions', '')}} {% if config.params.temperature is not none %}temperature: {{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}top_p: {{config.params.top_p}}{% endif 
%} {% if config.params.max_new_tokens is not none %}max_tokens: {{config.params.max_new_tokens}}{% endif %} skip_seq_start: {{config.params.extra.skip_seq_start}} skip_seq_end: {{config.params.extra.skip_seq_end}} system: parallel_attempts: {{config.params.parallelism}} lite: false EOF {% if target.api_endpoint.api_key_name is not none %} export NIM_API_KEY=${{target.api_endpoint.api_key_name}} && {% else %} export NIM_API_KEY=dummy && {% endif %} export XDG_DATA_HOME={{config.output_dir}} && garak --config garak_config.yaml --report_prefix=results ``` ::: :::{tab-item} Defaults ```yaml framework_name: garak pkg_name: garak config: params: max_new_tokens: 150 parallelism: 32 temperature: 0.1 top_p: 0.7 extra: probes: null seed: 42 skip_seq_start: skip_seq_end: supported_endpoint_types: - chat type: garak target: api_endpoint: api_key_name: API_KEY ``` ::: :::: --- (garak-garak-completions)= ## garak-completions Task for running the default set of Garak probes. This variant uses the completions endpoint. 
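The chat and completions variants share one command template; only the `target.api_endpoint.type` branch changes which Garak generator class the rendered config targets. A minimal shell sketch of how that branch resolves — the endpoint type below is an illustrative value, not part of the template:

```bash
# Illustrative only: mirrors the {% if target.api_endpoint.type == ... %}
# branch in the Command tab. endpoint_type is a hypothetical example value.
endpoint_type="completions"   # target.api_endpoint.type

if [ "$endpoint_type" = "completions" ]; then
  target_type="nim.NVOpenAICompletion"
elif [ "$endpoint_type" = "chat" ]; then
  target_type="nim.NVOpenAIChat"
fi

echo "target_type: $target_type"   # -> target_type: nim.NVOpenAICompletion
```

Everything else in the generated `garak_config.yaml` is identical between the two variants.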
::::{tab-set} :::{tab-item} Container **Harness:** `garak` **Container:** ``` nvcr.io/nvidia/eval-factory/garak:26.01 ``` **Container Digest:** ``` sha256:72514ac2c35f76fdb139b02f1c1d4159103969946a121592e50b129087dd455e ``` **Container Arch:** `multiarch` **Task Type:** `garak-completions` ::: :::{tab-item} Command ```bash cat > garak_config.yaml << 'EOF' {% if config.params.extra.seed is not none %}run: seed: {{config.params.extra.seed}}{% endif %} plugins: {% if config.params.extra.probes is not none %}probe_spec: {{config.params.extra.probes}}{% endif %} extended_detectors: true target_type: {% if target.api_endpoint.type == "completions" %}nim.NVOpenAICompletion{% elif target.api_endpoint.type == "chat" %}nim.NVOpenAIChat{% endif %} target_name: {{target.api_endpoint.model_id}} generators: nim: uri: {{target.api_endpoint.url | replace('/chat/completions', '') | replace('/completions', '')}} {% if config.params.temperature is not none %}temperature: {{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}top_p: {{config.params.top_p}}{% endif %} {% if config.params.max_new_tokens is not none %}max_tokens: {{config.params.max_new_tokens}}{% endif %} skip_seq_start: {{config.params.extra.skip_seq_start}} skip_seq_end: {{config.params.extra.skip_seq_end}} system: parallel_attempts: {{config.params.parallelism}} lite: false EOF {% if target.api_endpoint.api_key_name is not none %} export NIM_API_KEY=${{target.api_endpoint.api_key_name}} && {% else %} export NIM_API_KEY=dummy && {% endif %} export XDG_DATA_HOME={{config.output_dir}} && garak --config garak_config.yaml --report_prefix=results ``` ::: :::{tab-item} Defaults ```yaml framework_name: garak pkg_name: garak config: params: max_new_tokens: 150 parallelism: 32 temperature: 0.1 top_p: 0.7 extra: probes: null seed: 42 skip_seq_start: skip_seq_end: supported_endpoint_types: - completions type: garak-completions target: api_endpoint: api_key_name: API_KEY ``` ::: :::: # genai_perf_eval 
This page contains all evaluation tasks for the **genai_perf_eval** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [genai_perf_generation](#genai-perf-eval-genai-perf-generation) - GenAI Perf speed evaluation for chat endpoint, generation task - short input, long output * - [genai_perf_generation_completions](#genai-perf-eval-genai-perf-generation-completions) - GenAI Perf speed evaluation for completions endpoint, generation task - short input, long output * - [genai_perf_summarization](#genai-perf-eval-genai-perf-summarization) - GenAI Perf speed evaluation for chat endpoint, summarization task - long input, short output * - [genai_perf_summarization_completions](#genai-perf-eval-genai-perf-summarization-completions) - GenAI Perf speed evaluation for completions endpoint, summarization task - long input, short output ``` (genai-perf-eval-genai-perf-generation)= ## genai_perf_generation GenAI Perf speed evaluation for chat endpoint, generation task - short input, long output ::::{tab-set} :::{tab-item} Container **Harness:** `genai_perf_eval` **Container:** ``` nvcr.io/nvidia/eval-factory/genai-perf:26.01 ``` **Container Digest:** ``` sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6 ``` **Container Arch:** `amd` **Task Type:** `genai_perf_generation` ::: :::{tab-item} Command ```bash genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir {{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: genai_perf_eval 
pkg_name: genai_perf config: params: parallelism: 1 extra: tokenizer: null warmup: true isl: 500 osl: 5000 supported_endpoint_types: - chat type: genai_perf_generation target: api_endpoint: {} ``` ::: :::: --- (genai-perf-eval-genai-perf-generation-completions)= ## genai_perf_generation_completions GenAI Perf speed evaluation for completions endpoint, generation task - short input, long output ::::{tab-set} :::{tab-item} Container **Harness:** `genai_perf_eval` **Container:** ``` nvcr.io/nvidia/eval-factory/genai-perf:26.01 ``` **Container Digest:** ``` sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6 ``` **Container Arch:** `amd` **Task Type:** `genai_perf_generation_completions` ::: :::{tab-item} Command ```bash genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir {{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: genai_perf_eval pkg_name: genai_perf config: params: parallelism: 1 task: genai_perf_generation extra: tokenizer: null warmup: true isl: 500 osl: 5000 supported_endpoint_types: - completions type: genai_perf_generation_completions target: api_endpoint: {} ``` ::: :::: --- (genai-perf-eval-genai-perf-summarization)= ## genai_perf_summarization GenAI Perf speed evaluation for chat endpoint, summarization task - long input, short output ::::{tab-set} :::{tab-item} Container **Harness:** `genai_perf_eval` **Container:** ``` nvcr.io/nvidia/eval-factory/genai-perf:26.01 ``` **Container Digest:** ``` 
sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6 ``` **Container Arch:** `amd` **Task Type:** `genai_perf_summarization` ::: :::{tab-item} Command ```bash genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir {{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: genai_perf_eval pkg_name: genai_perf config: params: parallelism: 1 extra: tokenizer: null warmup: true isl: 5000 osl: 500 supported_endpoint_types: - chat type: genai_perf_summarization target: api_endpoint: {} ``` ::: :::: --- (genai-perf-eval-genai-perf-summarization-completions)= ## genai_perf_summarization_completions GenAI Perf speed evaluation for completions endpoint, summarization task - long input, short output ::::{tab-set} :::{tab-item} Container **Harness:** `genai_perf_eval` **Container:** ``` nvcr.io/nvidia/eval-factory/genai-perf:26.01 ``` **Container Digest:** ``` sha256:ab3f8b34a6cb63f7e48e8847fb069be71a3b73eb4f924bcf274cb02c6cc975b6 ``` **Container Arch:** `amd` **Task Type:** `genai_perf_summarization_completions` ::: :::{tab-item} Command ```bash genai_perf_eval --model_id {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} {% if target.api_endpoint.api_key_name is not none %}--api-key {{target.api_endpoint.api_key_name}} {% endif %} --concurrencies {{config.params.parallelism}} --isl {{config.params.extra.isl}} --osl {{config.params.extra.osl}} --tokenizer {{config.params.extra.tokenizer}} --endpoint-type {{target.api_endpoint.type}} --artifact-dir 
{{config.output_dir}} {% if target.api_endpoint.stream %}--streaming {% endif %}{% if config.params.extra.warmup %}--warmup{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: genai_perf_eval pkg_name: genai_perf config: params: parallelism: 1 task: genai_perf_summarization extra: tokenizer: null warmup: true isl: 5000 osl: 500 supported_endpoint_types: - completions type: genai_perf_summarization_completions target: api_endpoint: {} ``` ::: :::: # helm This page contains all evaluation tasks for the **helm** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [aci_bench](#helm-aci-bench) - Extract and structure information from patient-doctor conversations * - [ehr_sql](#helm-ehr-sql) - Given a natural language instruction, generate an SQL query that would be used in clinical research. * - [head_qa](#helm-head-qa) - A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019). * - [med_dialog_healthcaremagic](#helm-med-dialog-healthcaremagic) - Generate summaries of doctor-patient conversations, healthcaremagic version * - [med_dialog_icliniq](#helm-med-dialog-icliniq) - Generate summaries of doctor-patient conversations, icliniq version * - [medbullets](#helm-medbullets) - A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025). * - [medcalc_bench](#helm-medcalc-bench) - A dataset consisting of a patient note, a question asking for a specific medical value to be computed, and a ground-truth answer (Khandekar et al., 2024). * - [medec](#helm-medec) - A dataset containing medical narratives with error detection and correction pairs (Abacha et al., 2025). * - [medhallu](#helm-medhallu) - A dataset of PubMed articles and associated questions; the task is to classify whether each answer is factual or hallucinated.
* - [medi_qa](#helm-medi-qa) - Retrieve and rank answers based on medical question understanding * - [medication_qa](#helm-medication-qa) - Answer consumer medication-related questions * - [mtsamples_procedures](#helm-mtsamples-procedures) - Document and extract information about medical procedures * - [mtsamples_replicate](#helm-mtsamples-replicate) - Generate treatment plans based on clinical notes * - [pubmed_qa](#helm-pubmed-qa) - A dataset that provides PubMed abstracts and asks associated questions (yes/no/maybe format). * - [race_based_med](#helm-race-based-med) - A collection of LLM outputs in response to medical questions exhibiting race-based biases; the task is to classify whether the output contains racially biased content. ``` (helm-aci-bench)= ## aci_bench Extract and structure information from patient-doctor conversations ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `aci_bench` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if
config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: aci_bench extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: aci_bench target: api_endpoint: {} ``` ::: :::: --- (helm-ehr-sql)= ## ehr_sql Given a natural language instruction, generate an SQL query that would be used in clinical research. 
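As with the other helm tasks, the Command tab assembles a `helm-run` entry of the form `task:subset=<subset>,model=<model>`, dropping the `subset=` segment when `config.params.extra.subset` is null (as it is for `ehr_sql`). A rough shell sketch of that assembly, using placeholder values rather than anything from a real run:

```bash
# Illustrative only: mimics how the template composes --run-entries.
# The model value is a placeholder, not a default from this page.
task="ehr_sql"        # config.params.task
subset=""             # config.params.extra.subset (null for this task)
model="my-model"      # target.api_endpoint.model_id (hypothetical)

if [ -n "$subset" ]; then
  run_entry="${task}:subset=${subset},model=${model}"
else
  run_entry="${task}:model=${model}"
fi

echo "$run_entry"   # -> ehr_sql:model=my-model
```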
::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `ehr_sql` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if 
config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: ehr_sql extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: ehr_sql target: api_endpoint: {} ``` ::: :::: --- (helm-head-qa)= ## head_qa A collection of biomedical multiple-choice questions for testing medical knowledge (Vilares et al., 2019). ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `head_qa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries 
{{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: head_qa extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: head_qa target: api_endpoint: {} ``` ::: :::: --- (helm-med-dialog-healthcaremagic)= ## med_dialog_healthcaremagic Generate summaries of doctor-patient conversations, healthcaremagic version ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` 
**Task Type:** `med_dialog_healthcaremagic` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} 
--local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: med_dialog extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: healthcaremagic gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: med_dialog_healthcaremagic target: api_endpoint: {} ``` ::: :::: --- (helm-med-dialog-icliniq)= ## med_dialog_icliniq Generate summaries of doctor-patient conversations, icliniq version ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `med_dialog_icliniq` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances 
{{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: med_dialog extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: icliniq gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: med_dialog_icliniq target: api_endpoint: {} ``` ::: :::: --- (helm-medbullets)= ## medbullets A USMLE-style medical question dataset with multiple-choice answers and explanations (MedBullets, 2025). 
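Like the other helm tasks, optional parameters such as `limit_samples` and `parallelism` are rendered into the command only when set; a null value simply drops its flag. A hedged sketch of that conditional flag emission, with placeholder values (an empty variable stands in for null):

```bash
# Illustrative only: shows how unset (null) params drop their flags
# in the rendered helm-run command. Values are hypothetical.
limit_samples="100"   # config.params.limit_samples
parallelism="1"       # config.params.parallelism

args=""
[ -n "$limit_samples" ] && args="$args --max-eval-instances $limit_samples"
[ -n "$parallelism" ] && args="$args -n $parallelism"

echo "helm-run$args"   # -> helm-run --max-eval-instances 100 -n 1
```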
::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `medbullets` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if 
config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: medbullets extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: medbullets target: api_endpoint: {} ``` ::: :::: --- (helm-medcalc-bench)= ## medcalc_bench A dataset which consists of a patient note, a question requesting to compute a specific medical value, and a ground truth answer (Khandekar et al., 2024). ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `medcalc_bench` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} 
--output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: medcalc_bench extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: medcalc_bench target: api_endpoint: {} ``` ::: :::: --- (helm-medec)= ## medec A dataset containing medical narratives with error detection and correction pairs (Abacha et al., 2025). 
::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `medec` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if 
config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: medec extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: medec target: api_endpoint: {} ``` ::: :::: --- (helm-medhallu)= ## medhallu A dataset of PubMed articles and associated questions, with the objective being to classify whether the answer is factual or hallucinated. ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `medhallu` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run 
--run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: medhallu extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: medhallu target: api_endpoint: {} ``` ::: :::: --- (helm-medi-qa)= ## medi_qa Retrieve and rank answers based on medical question understanding ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `medi_qa` ::: 
:::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: 
:::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: medi_qa extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: medi_qa target: api_endpoint: {} ``` ::: :::: --- (helm-medication-qa)= ## medication_qa Answer consumer medication-related questions ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `medication_qa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n 
{{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: medication_qa extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: medication_qa target: api_endpoint: {} ``` ::: :::: --- (helm-mtsamples-procedures)= ## mtsamples_procedures Document and extract information about medical procedures ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `mtsamples_procedures` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif 
%} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: mtsamples_procedures extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY 
llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: mtsamples_procedures target: api_endpoint: {} ``` ::: :::: --- (helm-mtsamples-replicate)= ## mtsamples_replicate Generate treatment plans based on clinical notes ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `mtsamples_replicate` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path 
{{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: mtsamples_replicate extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: mtsamples_replicate target: api_endpoint: {} ``` ::: :::: --- (helm-pubmed-qa)= ## pubmed_qa A dataset that provides PubMed abstracts and asks associated questions (yes/no/maybe format). 
::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `pubmed_qa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name {{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if 
config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: pubmed_qa extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: pubmed_qa target: api_endpoint: {} ``` ::: :::: --- (helm-race-based-med)= ## race_based_med A collection of LLM outputs in response to medical questions with race-based biases, with the objective being to classify whether the output contains racially biased content. ::::{tab-set} :::{tab-item} Container **Harness:** `helm` **Container:** ``` nvcr.io/nvidia/eval-factory/helm:26.01 ``` **Container Digest:** ``` sha256:58be32aed07b94d104b9b72130bf94ee03dc16b16ded14416e21c97b62970589 ``` **Container Arch:** `amd` **Task Type:** `race_based_med` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.gpt_judge_api_key is not none %}export GPT_JUDGE_API_KEY=${{config.params.extra.gpt_judge_api_key}} && {% endif %} {% if config.params.extra.llama_judge_api_key is not none %}export LLAMA_JUDGE_API_KEY=${{config.params.extra.llama_judge_api_key}} && {% endif %} {% if config.params.extra.claude_judge_api_key is not none %}export CLAUDE_JUDGE_API_KEY=${{config.params.extra.claude_judge_api_key}} && {% endif %} helm-generate-dynamic-model-configs --model-name {{target.api_endpoint.model_id}} --base-url {{target.api_endpoint.url}} --openai-model-name 
{{target.api_endpoint.model_id}} --output-dir {{config.output_dir}} && helm-run --run-entries {{config.params.task}}:{% if config.params.extra.subset is not none %}subset={{config.params.extra.subset}},{% endif %}model={{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --max-eval-instances {{config.params.limit_samples}} {% endif %} {% if config.params.parallelism is not none %} -n {{config.params.parallelism}} {% endif %} --suite {{config.params.task}} {% if config.params.extra.num_train_trials is not none %} --num-train-trials {{config.params.extra.num_train_trials}} {% endif %} {% if config.params.extra.data_path is not none %} --data-path {{config.params.extra.data_path}} {% endif %} {% if config.params.extra.num_output_tokens is not none %} --num-output-tokens {{config.params.extra.num_output_tokens}} {% endif %} {% if config.params.extra.subject is not none %} --subject {{config.params.extra.subject}} {% endif %} {% if config.params.extra.condition is not none %} --condition {{config.params.extra.condition}} {% endif %} {% if config.params.extra.max_length is not none %} --max-length {{config.params.extra.max_length}} {% endif %} -o {{config.output_dir}} --local-path {{config.output_dir}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: helm pkg_name: helm config: params: parallelism: 1 task: race_based_med extra: data_path: null num_output_tokens: null subject: null condition: null max_length: null num_train_trials: null subset: null gpt_judge_api_key: GPT_JUDGE_API_KEY llama_judge_api_key: LLAMA_JUDGE_API_KEY claude_judge_api_key: CLAUDE_JUDGE_API_KEY supported_endpoint_types: - chat type: race_based_med target: api_endpoint: {} ``` ::: :::: # hle This page contains all evaluation tasks for the **hle** harness. 
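The templated commands for these tasks emit optional flags, such as `--limit`, only when the corresponding parameter is non-null. A minimal sketch of that pattern (a hypothetical helper for illustration, not an SDK API):

```python
# Hypothetical sketch (not SDK code): in the templated commands on this page,
# optional flags such as --limit or --api_key_name are emitted only when the
# corresponding parameter (e.g. config.params.limit_samples) is non-null.
def optional_flag(name: str, value) -> list[str]:
    """Return [name, value] when the parameter is set, else nothing."""
    return [] if value is None else [name, str(value)]

args = ["hle_eval", "--dataset=cais/hle"]
args += optional_flag("--limit", 200)          # limit_samples: 200 -> emitted
args += optional_flag("--api_key_name", None)  # null -> flag omitted entirely
print(" ".join(args))  # hle_eval --dataset=cais/hle --limit 200
```

This is why a task's Defaults tab can leave fields like `limit_samples` unset: the rendered command simply omits the flag rather than passing an empty value.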
```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [hle](#hle-hle) - Text-only questions from Humanity's Last Exam * - [hle_aa_v2](#hle-hle-aa-v2) - Text-only questions from Humanity's Last Exam and params aligned with Artificial Analysis Index v2 ``` (hle-hle)= ## hle Text-only questions from Humanity's Last Exam ::::{tab-set} :::{tab-item} Container **Harness:** `hle` **Container:** ``` nvcr.io/nvidia/eval-factory/hle:26.01 ``` **Container Digest:** ``` sha256:59fa69e20bbaaa251effa5f9d440d60920bc601cfb26f9e03866f1b6aff6dc33 ``` **Container Arch:** `multiarch` **Task Type:** `hle` ::: :::{tab-item} Command ```bash hle_eval --dataset=cais/hle --model_name={{target.api_endpoint.model_id}} --model_url={{target.api_endpoint.url}} --temperature={{config.params.temperature}} --top_p={{config.params.top_p}} --timeout={{config.params.request_timeout}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --output_dir={{config.output_dir}} {% if target.api_endpoint.api_key_name is not none %}--api_key_name={{target.api_endpoint.api_key_name}}{% endif %} --max_retries={{config.params.max_retries}} --num_workers={{config.params.parallelism}} --max_new_tokens={{config.params.max_new_tokens}} --text_only --generate --judge ``` ::: :::{tab-item} Defaults ```yaml framework_name: hle pkg_name: hle config: params: max_new_tokens: 8192 max_retries: 30 parallelism: 10 temperature: 0.0 request_timeout: 600 top_p: 1.0 extra: {} supported_endpoint_types: - chat type: hle target: api_endpoint: {} ``` ::: :::: --- (hle-hle-aa-v2)= ## hle_aa_v2 Text-only questions from Humanity's Last Exam and params aligned with Artificial Analysis Index v2 ::::{tab-set} :::{tab-item} Container **Harness:** `hle` **Container:** ``` nvcr.io/nvidia/eval-factory/hle:26.01 ``` **Container Digest:** ``` sha256:59fa69e20bbaaa251effa5f9d440d60920bc601cfb26f9e03866f1b6aff6dc33 ``` **Container Arch:** `multiarch` **Task Type:** `hle_aa_v2` ::: 
:::{tab-item} Command ```bash hle_eval --dataset=cais/hle --model_name={{target.api_endpoint.model_id}} --model_url={{target.api_endpoint.url}} --temperature={{config.params.temperature}} --top_p={{config.params.top_p}} --timeout={{config.params.request_timeout}} {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} --output_dir={{config.output_dir}} {% if target.api_endpoint.api_key_name is not none %}--api_key_name={{target.api_endpoint.api_key_name}}{% endif %} --max_retries={{config.params.max_retries}} --num_workers={{config.params.parallelism}} --max_new_tokens={{config.params.max_new_tokens}} --text_only --generate --judge ``` ::: :::{tab-item} Defaults ```yaml framework_name: hle pkg_name: hle config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 temperature: 0.0 request_timeout: 600 top_p: 1.0 extra: {} supported_endpoint_types: - chat type: hle_aa_v2 target: api_endpoint: {} ``` ::: :::: # ifbench This page contains all evaluation tasks for the **ifbench** harness. 
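Each task's Defaults tab shows the full parameter set a run starts from, so a user config typically overrides only the fields that differ. A minimal sketch of layering overrides onto defaults (assuming dict-style configs and a recursive merge; this is an illustration, not the SDK's actual merge implementation):

```python
# Hypothetical sketch (not SDK code): layer a small user override on top of
# the task defaults shown in the Defaults tab, merging nested dicts
# recursively and replacing scalar values.
def deep_merge(defaults: dict, overrides: dict) -> dict:
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"params": {"parallelism": 8, "temperature": 0.01, "top_p": 0.95}}
run_cfg = {"params": {"temperature": 0.0}}
print(deep_merge(defaults, run_cfg))
# {'params': {'parallelism': 8, 'temperature': 0.0, 'top_p': 0.95}}
```

Only `temperature` changes; `parallelism` and `top_p` keep their task defaults.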
```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [ifbench](#ifbench-ifbench) - IFBench with vanilla settings * - [ifbench_aa_v2](#ifbench-ifbench-aa-v2) - IFBench - params aligned with Artificial Analysis Index v2 ``` (ifbench-ifbench)= ## ifbench IFBench with vanilla settings ::::{tab-set} :::{tab-item} Container **Harness:** `ifbench` **Container:** ``` nvcr.io/nvidia/eval-factory/ifbench:26.01 ``` **Container Digest:** ``` sha256:e99059d2e334ef97826629a004c888f7daed1adb9d724ca73274e1b93c743ac1 ``` **Container Arch:** `multiarch` **Task Type:** `ifbench` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} ifbench --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --results-dir {{config.output_dir}} --inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ifbench pkg_name: ifbench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 8 temperature: 0.01 top_p: 0.95 extra: {} supported_endpoint_types: - chat type: ifbench target: api_endpoint: stream: false ``` ::: :::: --- (ifbench-ifbench-aa-v2)= ## ifbench_aa_v2 IFBench - params aligned with Artificial Analysis Index v2 ::::{tab-set} :::{tab-item} Container **Harness:** `ifbench` **Container:** ``` nvcr.io/nvidia/eval-factory/ifbench:26.01 ``` **Container Digest:** ``` sha256:e99059d2e334ef97826629a004c888f7daed1adb9d724ca73274e1b93c743ac1 ``` **Container Arch:** `multiarch` **Task Type:** `ifbench_aa_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export 
OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} ifbench --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --results-dir {{config.output_dir}} --inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ifbench pkg_name: ifbench config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 8 temperature: 0.0 top_p: 0.95 extra: {} supported_endpoint_types: - chat type: ifbench_aa_v2 target: api_endpoint: stream: false ``` ::: :::: # livecodebench This page contains all evaluation tasks for the **livecodebench** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [codeexecution_v2](#livecodebench-codeexecution-v2) - “Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. * - [codeexecution_v2_cot](#livecodebench-codeexecution-v2-cot) - “CoT. Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Chain-of-Thought version of the task. * - [codegeneration_notfast](#livecodebench-codegeneration-notfast) - Not fast version of code generation (v2). * - [codegeneration_release_latest](#livecodebench-codegeneration-release-latest) - Code generation latest version * - [codegeneration_release_v1](#livecodebench-codegeneration-release-v1) - The initial release of the dataset (v1) with problems released between May 2023 and Mar 2024 containing 400 problems. 
* - [codegeneration_release_v2](#livecodebench-codegeneration-release-v2) - The updated release of the dataset (v2) with problems released between May 2023 and May 2024 containing 511 problems. * - [codegeneration_release_v3](#livecodebench-codegeneration-release-v3) - The updated release of the dataset (v3) with problems released between May 2023 and Jul 2024 containing 612 problems. * - [codegeneration_release_v4](#livecodebench-codegeneration-release-v4) - The updated release of the dataset (v4) with problems released between May 2023 and Sep 2024 containing 713 problems. * - [codegeneration_release_v5](#livecodebench-codegeneration-release-v5) - The updated release of the dataset (v5) with problems released between May 2023 and Jan 2025 containing 880 problems. * - [codegeneration_release_v6](#livecodebench-codegeneration-release-v6) - The updated release of the dataset (v6) with problems released between May 2023 and Apr 2025 containing 1055 problems. * - [livecodebench_0724_0125](#livecodebench-livecodebench-0724-0125) - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Uses the data period and sampling parameters from Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking). * - [livecodebench_0824_0225](#livecodebench-livecodebench-0824-0225) - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Uses the data period and sampling parameters from the NeMo Alignment team. * - [livecodebench_aa_v2](#livecodebench-livecodebench-aa-v2) - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Uses the data period and sampling parameters from Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking). * - [testoutputprediction](#livecodebench-testoutputprediction) - Solve the natural language task on a specified input, evaluating the ability to generate testing outputs. The model is given the natural language problem description and an input, and the output should be the output for the problem. ``` (livecodebench-codeexecution-v2)= ## codeexecution_v2 “Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codeexecution_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not
none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codeexecution temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_v2 supported_endpoint_types: - chat type: codeexecution_v2 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codeexecution-v2-cot)= ## codeexecution_v2_cot “CoT. Execute” a program on an input, evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. Chain-of-Thought version of the task. 
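Both code-execution variants score the model on the program's true output; the Chain-of-Thought variant additionally prompts the model to reason step by step before giving its final answer. A schematic sketch of the task shape (an invented toy item, not an actual LiveCodeBench sample):

```python
# Toy illustration of a code-execution item: the model sees a program and an
# input, and must predict what the program returns. (Invented example, not a
# real dataset item.)
program = "def f(nums):\n    return sorted(set(nums))[-2]\n"
test_input = [4, 1, 4, 7, 2]

namespace = {}
exec(program, namespace)               # define f from the program text
ground_truth = namespace["f"](test_input)
print(ground_truth)  # 4 -- the second-largest distinct value

# A model response is marked correct if it matches the executed result.
model_answer = "4"
is_correct = model_answer == str(ground_truth)
```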
::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codeexecution_v2_cot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codeexecution temperature: 0.0 request_timeout: 60 
top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: true release_version: release_v2 supported_endpoint_types: - chat type: codeexecution_v2_cot target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-notfast)= ## codegeneration_notfast Not fast version of code generation (v2). ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_notfast` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% 
endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false args: --not_fast supported_endpoint_types: - chat type: codegeneration_notfast target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-release-latest)= ## codegeneration_release_latest Code generation latest version ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_release_latest` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout 
{{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_latest supported_endpoint_types: - chat type: codegeneration_release_latest target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-release-v1)= ## codegeneration_release_v1 The initial release of the dataset (v1) with problems released between May 2023 and Mar 2024 containing 400 problems. 
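The values in each Defaults tab apply when nothing is overridden; a run configuration can override individual parameters by mirroring the same nesting. A hypothetical fragment (field names are taken from the Defaults shown on this page, but the surrounding layout of a full run config may differ):

```yaml
config:
  params:
    temperature: 0.2      # overrides the 0.0 default
    limit_samples: 50     # enables the --first_n flag in the rendered command
    extra:
      release_version: release_v1
```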
::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_release_v1` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 
60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_v1 supported_endpoint_types: - chat type: codegeneration_release_v1 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-release-v2)= ## codegeneration_release_v2 The updated release of the dataset (v2) with problems released between May 2023 and May 2024 containing 511 problems. ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_release_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} 
{% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_v2 supported_endpoint_types: - chat type: codegeneration_release_v2 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-release-v3)= ## codegeneration_release_v3 The updated release of the dataset (v3) with problems released between May 2023 and Jul 2024 containing 612 problems. ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_release_v3` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens 
{{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_v3 supported_endpoint_types: - chat type: codegeneration_release_v3 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-release-v4)= ## codegeneration_release_v4 The updated release of the dataset (v4) with problems released between May 2023 and Sep 2024 containing 713 problems. 
::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_release_v4` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 
60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_v4 supported_endpoint_types: - chat type: codegeneration_release_v4 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-release-v5)= ## codegeneration_release_v5 The updated release of the dataset (v5) with problems released between May 2023 and Jan 2025 containing 880 problems. ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_release_v5` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} 
{% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_v5 supported_endpoint_types: - chat type: codegeneration_release_v5 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-codegeneration-release-v6)= ## codegeneration_release_v6 The updated release of the dataset (v6) with problems released between May 2023 and Apr 2025 containing 1055 problems. ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `codegeneration_release_v6` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens 
{{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_v6 supported_endpoint_types: - chat type: codegeneration_release_v6 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-livecodebench-0724-0125)= ## livecodebench_0724_0125 - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. 
- The data period and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking) ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `livecodebench_0724_0125` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench
pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 3 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: 2024-07-01 end_date: 2025-01-01 cot_code_execution: false release_version: release_v5 supported_endpoint_types: - chat type: livecodebench_0724_0125 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-livecodebench-0824-0225)= ## livecodebench_0824_0225 - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. - The data period and sampling parameters used by the NeMo Alignment team. ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `livecodebench_0824_0225` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %}
--start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 3 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: 2024-08-01 end_date: 2025-02-01 cot_code_execution: false release_version: release_v5 supported_endpoint_types: - chat type: livecodebench_0824_0225 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-livecodebench-aa-v2)= ## livecodebench_aa_v2 - Code generation evaluating code comprehension ability. The model is given a program and an input, and the output should be the result. 
- The data period and sampling parameters used by Artificial Analysis (https://artificialanalysis.ai/methodology/intelligence-benchmarking) ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `livecodebench_aa_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %} --start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name:
livecodebench config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: codegeneration temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 3 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: 2024-07-01 end_date: 2025-01-01 cot_code_execution: false release_version: release_v5 supported_endpoint_types: - chat type: livecodebench_aa_v2 target: api_endpoint: {} ``` ::: :::: --- (livecodebench-testoutputprediction)= ## testoutputprediction Solve the natural language task on a specified input, evaluating the ability to generate test outputs. The model is given the natural language problem description and an input, and must produce the expected output for the problem. ::::{tab-set} :::{tab-item} Container **Harness:** `livecodebench` **Container:** ``` nvcr.io/nvidia/eval-factory/livecodebench:26.01 ``` **Container Digest:** ``` sha256:76b4ce10b3e0f839bb5f86d11319d62dfc94fc49ac72c2cb126c27c021f7f69e ``` **Container Arch:** `multiarch` **Task Type:** `testoutputprediction` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} livecodebench --model {{target.api_endpoint.model_id}} \ --scenario {{config.params.task}} \ --release_version {{config.params.extra.release_version}} \ --url {{target.api_endpoint.url}} \ --temperature {{config.params.temperature}} \ --top_p {{config.params.top_p}} \ --evaluate \ --codegen_n {{config.params.extra.n_samples}} \ --use_cache \ --cache_batch_size {{config.params.extra.cache_batch_size}} \ --num_process_evaluate {{config.params.extra.num_process_evaluate}} \ --n {{config.params.extra.n_samples}} \ --max_tokens {{config.params.max_new_tokens}} \ --out_dir {{config.output_dir}} \ --multiprocess {{config.params.parallelism}} \ --max_retries {{config.params.max_retries}} \ --timeout {{config.params.request_timeout}}{% if config.params.extra.start_date is not none %}
--start_date {{config.params.extra.start_date}} {% endif %} {% if config.params.extra.end_date is not none %} --end_date {{config.params.extra.end_date}} {% endif %} {% if config.params.extra.support_system_role %} --support_system_role {% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %}{% if config.params.extra.cot_code_execution == true %} --cot_code_execution {% endif %}{% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: livecodebench pkg_name: livecodebench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 task: testoutputprediction temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 num_process_evaluate: 5 cache_batch_size: 10 support_system_role: false start_date: null end_date: null cot_code_execution: false release_version: release_latest supported_endpoint_types: - chat type: testoutputprediction target: api_endpoint: {} ``` ::: :::: # lm-evaluation-harness This page contains all evaluation tasks for the **lm-evaluation-harness** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [adlr_agieval_en_cot](#lm-evaluation-harness-adlr-agieval-en-cot) - Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_arc_challenge_llama_25_shot](#lm-evaluation-harness-adlr-arc-challenge-llama-25-shot) - ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_commonsense_qa_7_shot](#lm-evaluation-harness-adlr-commonsense-qa-7-shot) - CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_global_mmlu_lite_5_shot](#lm-evaluation-harness-adlr-global-mmlu-lite-5-shot) - Global-MMLU subset (8 languages - es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR). 
* - [adlr_gpqa_diamond_cot_5_shot](#lm-evaluation-harness-adlr-gpqa-diamond-cot-5-shot) - Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_gsm8k_cot_8_shot](#lm-evaluation-harness-adlr-gsm8k-cot-8-shot) - GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_humaneval_greedy](#lm-evaluation-harness-adlr-humaneval-greedy) - HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_humaneval_sampled](#lm-evaluation-harness-adlr-humaneval-sampled) - HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_math_500_4_shot_sampled](#lm-evaluation-harness-adlr-math-500-4-shot-sampled) - MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_mbpp_sanitized_3_shot_greedy](#lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-greedy) - MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_mbpp_sanitized_3_shot_sampled](#lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-sampled) - MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_mgsm_native_cot_8_shot](#lm-evaluation-harness-adlr-mgsm-native-cot-8-shot) - MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_minerva_math_nemo_4_shot](#lm-evaluation-harness-adlr-minerva-math-nemo-4-shot) - Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_mmlu](#lm-evaluation-harness-adlr-mmlu) - MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_mmlu_pro_5_shot_base](#lm-evaluation-harness-adlr-mmlu-pro-5-shot-base) - MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR). 
* - [adlr_race](#lm-evaluation-harness-adlr-race) - RACE version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_truthfulqa_mc2](#lm-evaluation-harness-adlr-truthfulqa-mc2) - TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [adlr_winogrande_5_shot](#lm-evaluation-harness-adlr-winogrande-5-shot) - Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR). * - [agieval](#lm-evaluation-harness-agieval) - AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models * - [arc_challenge](#lm-evaluation-harness-arc-challenge) - The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. * - [arc_challenge_chat](#lm-evaluation-harness-arc-challenge-chat) - - The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. - This variant applies a chat template and defaults to zero-shot evaluation. * - [arc_multilingual](#lm-evaluation-harness-arc-multilingual) - The multilingual versions of the ARC challenge dataset. * - [bbh](#lm-evaluation-harness-bbh) - The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. * - [bbh_instruct](#lm-evaluation-harness-bbh-instruct) - - The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. - This variant applies a chat template and defaults to zero-shot evaluation. * - [bbq_chat](#lm-evaluation-harness-bbq-chat) - The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint).
* - [bbq_completions](#lm-evaluation-harness-bbq-completions) - The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint). * - [commonsense_qa](#lm-evaluation-harness-commonsense-qa) - - CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. - It contains 12,102 questions with one correct answer and four distractor answers. * - [global_mmlu](#lm-evaluation-harness-global-mmlu) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - It is designed for efficient evaluation of multilingual models in 15 languages (including English). * - [global_mmlu_ar](#lm-evaluation-harness-global-mmlu-ar) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the AR subset. * - [global_mmlu_bn](#lm-evaluation-harness-global-mmlu-bn) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the BN subset. * - [global_mmlu_de](#lm-evaluation-harness-global-mmlu-de) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the DE subset. * - [global_mmlu_en](#lm-evaluation-harness-global-mmlu-en) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the EN subset. * - [global_mmlu_es](#lm-evaluation-harness-global-mmlu-es) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ES subset. 
* - [global_mmlu_fr](#lm-evaluation-harness-global-mmlu-fr) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the FR subset. * - [global_mmlu_full](#lm-evaluation-harness-global-mmlu-full) - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. * - [global_mmlu_full_am](#lm-evaluation-harness-global-mmlu-full-am) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AM subset. * - [global_mmlu_full_ar](#lm-evaluation-harness-global-mmlu-full-ar) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AR subset. * - [global_mmlu_full_bn](#lm-evaluation-harness-global-mmlu-full-bn) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the BN subset. * - [global_mmlu_full_cs](#lm-evaluation-harness-global-mmlu-full-cs) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the CS subset. * - [global_mmlu_full_de](#lm-evaluation-harness-global-mmlu-full-de) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the DE subset. * - [global_mmlu_full_el](#lm-evaluation-harness-global-mmlu-full-el) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EL subset. * - [global_mmlu_full_en](#lm-evaluation-harness-global-mmlu-full-en) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EN subset. * - [global_mmlu_full_es](#lm-evaluation-harness-global-mmlu-full-es) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ES subset. 
* - [global_mmlu_full_fa](#lm-evaluation-harness-global-mmlu-full-fa) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FA subset. * - [global_mmlu_full_fil](#lm-evaluation-harness-global-mmlu-full-fil) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FIL subset. * - [global_mmlu_full_fr](#lm-evaluation-harness-global-mmlu-full-fr) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FR subset. * - [global_mmlu_full_ha](#lm-evaluation-harness-global-mmlu-full-ha) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HA subset. * - [global_mmlu_full_he](#lm-evaluation-harness-global-mmlu-full-he) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HE subset. * - [global_mmlu_full_hi](#lm-evaluation-harness-global-mmlu-full-hi) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HI subset. * - [global_mmlu_full_id](#lm-evaluation-harness-global-mmlu-full-id) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ID subset. * - [global_mmlu_full_ig](#lm-evaluation-harness-global-mmlu-full-ig) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IG subset. * - [global_mmlu_full_it](#lm-evaluation-harness-global-mmlu-full-it) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IT subset. * - [global_mmlu_full_ja](#lm-evaluation-harness-global-mmlu-full-ja) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the JA subset. 
* - [global_mmlu_full_ko](#lm-evaluation-harness-global-mmlu-full-ko) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KO subset. * - [global_mmlu_full_ky](#lm-evaluation-harness-global-mmlu-full-ky) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KY subset. * - [global_mmlu_full_lt](#lm-evaluation-harness-global-mmlu-full-lt) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the LT subset. * - [global_mmlu_full_mg](#lm-evaluation-harness-global-mmlu-full-mg) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MG subset. * - [global_mmlu_full_ms](#lm-evaluation-harness-global-mmlu-full-ms) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MS subset. * - [global_mmlu_full_ne](#lm-evaluation-harness-global-mmlu-full-ne) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NE subset. * - [global_mmlu_full_nl](#lm-evaluation-harness-global-mmlu-full-nl) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NL subset. * - [global_mmlu_full_ny](#lm-evaluation-harness-global-mmlu-full-ny) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NY subset. * - [global_mmlu_full_pl](#lm-evaluation-harness-global-mmlu-full-pl) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PL subset. * - [global_mmlu_full_pt](#lm-evaluation-harness-global-mmlu-full-pt) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PT subset. 
* - [global_mmlu_full_ro](#lm-evaluation-harness-global-mmlu-full-ro) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RO subset. * - [global_mmlu_full_ru](#lm-evaluation-harness-global-mmlu-full-ru) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RU subset. * - [global_mmlu_full_si](#lm-evaluation-harness-global-mmlu-full-si) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SI subset. * - [global_mmlu_full_sn](#lm-evaluation-harness-global-mmlu-full-sn) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SN subset. * - [global_mmlu_full_so](#lm-evaluation-harness-global-mmlu-full-so) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SO subset. * - [global_mmlu_full_sr](#lm-evaluation-harness-global-mmlu-full-sr) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SR subset. * - [global_mmlu_full_sv](#lm-evaluation-harness-global-mmlu-full-sv) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SV subset. * - [global_mmlu_full_sw](#lm-evaluation-harness-global-mmlu-full-sw) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SW subset. * - [global_mmlu_full_te](#lm-evaluation-harness-global-mmlu-full-te) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TE subset. * - [global_mmlu_full_tr](#lm-evaluation-harness-global-mmlu-full-tr) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TR subset. 
* - [global_mmlu_full_uk](#lm-evaluation-harness-global-mmlu-full-uk) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the UK subset. * - [global_mmlu_full_vi](#lm-evaluation-harness-global-mmlu-full-vi) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the VI subset. * - [global_mmlu_full_yo](#lm-evaluation-harness-global-mmlu-full-yo) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the YO subset. * - [global_mmlu_full_zh](#lm-evaluation-harness-global-mmlu-full-zh) - - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ZH subset. * - [global_mmlu_hi](#lm-evaluation-harness-global-mmlu-hi) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the HI subset. * - [global_mmlu_id](#lm-evaluation-harness-global-mmlu-id) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ID subset. * - [global_mmlu_it](#lm-evaluation-harness-global-mmlu-it) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the IT subset. * - [global_mmlu_ja](#lm-evaluation-harness-global-mmlu-ja) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the JA subset. * - [global_mmlu_ko](#lm-evaluation-harness-global-mmlu-ko) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the KO subset. * - [global_mmlu_pt](#lm-evaluation-harness-global-mmlu-pt) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the PT subset. 
* - [global_mmlu_sw](#lm-evaluation-harness-global-mmlu-sw) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the SW subset. * - [global_mmlu_yo](#lm-evaluation-harness-global-mmlu-yo) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the YO subset. * - [global_mmlu_zh](#lm-evaluation-harness-global-mmlu-zh) - - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ZH subset. * - [gpqa](#lm-evaluation-harness-gpqa) - The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. * - [gpqa_diamond_cot](#lm-evaluation-harness-gpqa-diamond-cot) - - The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. - This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation. * - [gsm8k](#lm-evaluation-harness-gsm8k) - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. * - [gsm8k_cot_instruct](#lm-evaluation-harness-gsm8k-cot-instruct) - - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation with custom instructions. * - [gsm8k_cot_llama](#lm-evaluation-harness-gsm8k-cot-llama) - - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought evaluation - implementation taken from llama. 
* - [gsm8k_cot_zeroshot](#lm-evaluation-harness-gsm8k-cot-zeroshot) - - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation. * - [gsm8k_cot_zeroshot_llama](#lm-evaluation-harness-gsm8k-cot-zeroshot-llama) - - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation - implementation taken from llama. * - [hellaswag](#lm-evaluation-harness-hellaswag) - The HellaSwag benchmark tests a language model's commonsense reasoning by having it choose the most logical ending for a given story. * - [hellaswag_multilingual](#lm-evaluation-harness-hellaswag-multilingual) - The multilingual versions of the HellaSwag benchmark. * - [humaneval_instruct](#lm-evaluation-harness-humaneval-instruct) - - The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. - Implementation taken from llama. * - [ifeval](#lm-evaluation-harness-ifeval) - IFEval is a dataset designed to test a model's ability to follow explicit instructions, such as "include keyword x" or "use format y." The focus is on the model's adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics. * - [m_mmlu_id_str_chat](#lm-evaluation-harness-m-mmlu-id-str-chat) - - The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint). * - [m_mmlu_id_str_completions](#lm-evaluation-harness-m-mmlu-id-str-completions) - - The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint). 
* - [mbpp_plus_chat](#lm-evaluation-harness-mbpp-plus-chat) - MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint). * - [mbpp_plus_completions](#lm-evaluation-harness-mbpp-plus-completions) - MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint). * - [mgsm](#lm-evaluation-harness-mgsm) - - The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. * - [mgsm_cot_chat](#lm-evaluation-harness-mgsm-cot-chat) - - The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the chat endpoint and defaults to chain-of-thought evaluation. * - [mgsm_cot_completions](#lm-evaluation-harness-mgsm-cot-completions) - - The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the completions endpoint and defaults to chain-of-thought evaluation. * - [mmlu](#lm-evaluation-harness-mmlu) - - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses text generation. * - [mmlu_cot_0_shot_chat](#lm-evaluation-harness-mmlu-cot-0-shot-chat) - - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant defaults to chain-of-thought zero-shot evaluation. * - [mmlu_instruct](#lm-evaluation-harness-mmlu-instruct) - - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. 
- This variant uses the chat endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response. * - [mmlu_instruct_completions](#lm-evaluation-harness-mmlu-instruct-completions) - - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the completions endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response. * - [mmlu_logits](#lm-evaluation-harness-mmlu-logits) - - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the logits of the model to evaluate the accuracy. * - [mmlu_pro](#lm-evaluation-harness-mmlu-pro) - MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint). * - [mmlu_pro_instruct](#lm-evaluation-harness-mmlu-pro-instruct) - - MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4. - This variant applies a chat template and defaults to zero-shot evaluation. 
* - [mmlu_prox_chat](#lm-evaluation-harness-mmlu-prox-chat) - A Multilingual Benchmark for Advanced Large Language Model Evaluation (chat endpoint) * - [mmlu_prox_completions](#lm-evaluation-harness-mmlu-prox-completions) - A Multilingual Benchmark for Advanced Large Language Model Evaluation (completions endpoint) * - [mmlu_prox_de_chat](#lm-evaluation-harness-mmlu-prox-de-chat) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (chat endpoint) * - [mmlu_prox_de_completions](#lm-evaluation-harness-mmlu-prox-de-completions) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (completions endpoint) * - [mmlu_prox_es_chat](#lm-evaluation-harness-mmlu-prox-es-chat) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (chat endpoint) * - [mmlu_prox_es_completions](#lm-evaluation-harness-mmlu-prox-es-completions) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (completions endpoint) * - [mmlu_prox_fr_chat](#lm-evaluation-harness-mmlu-prox-fr-chat) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (chat endpoint) * - [mmlu_prox_fr_completions](#lm-evaluation-harness-mmlu-prox-fr-completions) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (completions endpoint) * - [mmlu_prox_it_chat](#lm-evaluation-harness-mmlu-prox-it-chat) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (chat endpoint) * - [mmlu_prox_it_completions](#lm-evaluation-harness-mmlu-prox-it-completions) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (completions endpoint) * - [mmlu_prox_ja_chat](#lm-evaluation-harness-mmlu-prox-ja-chat) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (chat endpoint) * - 
[mmlu_prox_ja_completions](#lm-evaluation-harness-mmlu-prox-ja-completions) - A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (completions endpoint) * - [mmlu_redux](#lm-evaluation-harness-mmlu-redux) - MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. * - [mmlu_redux_instruct](#lm-evaluation-harness-mmlu-redux-instruct) - - MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. - This variant applies a chat template and defaults to zero-shot evaluation. * - [musr](#lm-evaluation-harness-musr) - The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives. * - [openbookqa](#lm-evaluation-harness-openbookqa) - - OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding of a subject. - Answering OpenBookQA questions requires additional broad common knowledge, not contained in the book. - The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. * - [piqa](#lm-evaluation-harness-piqa) - - Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models. * - [social_iqa](#lm-evaluation-harness-social-iqa) - - Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations. * - [truthfulqa](#lm-evaluation-harness-truthfulqa) - - The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions. - It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions. 
* - [wikilingua](#lm-evaluation-harness-wikilingua) - - The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems. * - [wikitext](#lm-evaluation-harness-wikitext) - - The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. - This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods. * - [winogrande](#lm-evaluation-harness-winogrande) - WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options testing commonsense reasoning. ``` (lm-evaluation-harness-adlr-agieval-en-cot)= ## adlr_agieval_en_cot Version of the AGIEval-EN-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_agieval_en_cot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ 
config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_agieval_en_cot temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: adlr_agieval_en_cot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-arc-challenge-llama-25-shot)= ## adlr_arc_challenge_llama_25_shot ARC-Challenge-Llama version used by NVIDIA Applied Deep Learning Research team (ADLR). 
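Each task's Command tab assembles `--gen_kwargs` conditionally: `temperature`, `top_p`, and `max_new_tokens` are included only when they are not `None`, and `max_new_tokens` is passed to lm-eval under its `max_gen_toks` name. A simplified sketch of that logic in plain Python (a hypothetical helper, not part of the SDK):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the template's conditional --gen_kwargs assembly (simplified)."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval names this generation knob max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    # The flag is omitted entirely when no sampling parameter is set
    return f'--gen_kwargs="{",".join(parts)}"' if parts else ""

# With greedy-style defaults (temperature=0.0, top_p=1.0e-05):
print(build_gen_kwargs(temperature=0.0, top_p=1e-05))
# → --gen_kwargs="temperature=0.0,top_p=1e-05"
```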
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_arc_challenge_llama_25_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_arc_challenge_llama temperature: 1.0 request_timeout: 30 top_p: 1.0 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 25 supported_endpoint_types: - completions type: adlr_arc_challenge_llama_25_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-commonsense-qa-7-shot)= ## adlr_commonsense_qa_7_shot CommonsenseQA version used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_commonsense_qa_7_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: commonsense_qa temperature: 1.0 request_timeout: 30 top_p: 1.0 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 7 supported_endpoint_types: - completions type: adlr_commonsense_qa_7_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-global-mmlu-lite-5-shot)= ## adlr_global_mmlu_lite_5_shot Global-MMLU subset (8 languages - es, de, fr, zh, it, ja, pt, ko) used by NVIDIA Applied Deep Learning Research team (ADLR). 
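In the Command tab below, the endpoint type selects both the lm-eval backend and the chat-only flags: `completions` maps to `local-completions`, while `chat` maps to `local-chat-completions` and additionally enables `--fewshot_as_multiturn --apply_chat_template`. A minimal sketch of that dispatch (hypothetical helper, not SDK code):

```python
def select_backend(endpoint_type):
    """Map an API endpoint type to the lm-eval model name and extra flags."""
    if endpoint_type == "completions":
        return "local-completions", []
    if endpoint_type == "chat":
        # Chat endpoints send few-shot examples as multi-turn messages
        # and apply the model's chat template.
        return "local-chat-completions", ["--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

model, flags = select_backend("completions")
```

Note that this task declares `completions` as its only supported endpoint type in the Defaults below.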
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_global_mmlu_lite_5_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_global_mmlu temperature: 1.0 request_timeout: 30 top_p: 1.0 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 supported_endpoint_types: - completions type: adlr_global_mmlu_lite_5_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-gpqa-diamond-cot-5-shot)= ## adlr_gpqa_diamond_cot_5_shot Version of the GPQA-Diamond-CoT benchmark used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_gpqa_diamond_cot_5_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_gpqa_diamond_cot_5_shot temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 supported_endpoint_types: - completions type: adlr_gpqa_diamond_cot_5_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-gsm8k-cot-8-shot)= ## adlr_gsm8k_cot_8_shot GSM8K-CoT version used by NVIDIA Applied Deep Learning Research team (ADLR). 
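The `--model_args` value in the Command tab below is a comma-separated string built from the target endpoint and the parameters in the Defaults tab: `base_url`, `model`, `tokenized_requests`, an optional `tokenizer` (skipped when `null`), `tokenizer_backend`, `num_concurrent` (from `parallelism`), `timeout` (from `request_timeout`), `max_retries`, and `stream`. A simplified sketch of that assembly; the endpoint URL and model name are placeholders, and `stream` is folded into the params dict here for brevity:

```python
def build_model_args(base_url, model_id, params):
    """Assemble the comma-separated --model_args value (simplified)."""
    extra = params["extra"]
    fields = [
        f"base_url={base_url}",
        f"model={model_id}",
        f"tokenized_requests={extra['tokenized_requests']}",
    ]
    if extra["tokenizer"] is not None:
        fields.append(f"tokenizer={extra['tokenizer']}")
    fields += [
        f"tokenizer_backend={extra['tokenizer_backend']}",
        f"num_concurrent={params['parallelism']}",
        f"timeout={params['request_timeout']}",
        f"max_retries={params['max_retries']}",
        f"stream={params['stream']}",
    ]
    return ",".join(fields)

defaults = {
    "parallelism": 10, "request_timeout": 30, "max_retries": 5, "stream": False,
    "extra": {"tokenized_requests": False, "tokenizer": None, "tokenizer_backend": "None"},
}
args = build_model_args("http://localhost:8000/v1/completions", "my-model", defaults)
```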
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_gsm8k_cot_8_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_gsm8k_fewshot_cot temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 8 supported_endpoint_types: - completions type: adlr_gsm8k_cot_8_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-humaneval-greedy)= ## adlr_humaneval_greedy HumanEval Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_humaneval_greedy` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_humaneval_greedy temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: adlr_humaneval_greedy target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-humaneval-sampled)= ## adlr_humaneval_sampled HumanEval Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). 
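Unlike the shot-pinned tasks above (25-shot ARC, 7-shot CommonsenseQA, and so on), the HumanEval variants do not set `num_fewshot` under `extra`, so the template's `is defined` check skips the `--num_fewshot` flag and the task's own default shot count applies. A sketch of that conditional (hypothetical helper):

```python
def fewshot_flag(extra):
    """Emit --num_fewshot only when the task defaults pin a shot count."""
    if "num_fewshot" in extra:  # template: `is defined`
        return f" --num_fewshot {extra['num_fewshot']}"
    return ""

fewshot_flag({"num_fewshot": 25})   # → ' --num_fewshot 25' (e.g. 25-shot ARC)
fewshot_flag({"tokenizer": None})   # HumanEval: flag omitted, returns ''
```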
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_humaneval_sampled` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_humaneval_sampled temperature: 0.6 request_timeout: 30 top_p: 0.95 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: adlr_humaneval_sampled target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-math-500-4-shot-sampled)= ## adlr_math_500_4_shot_sampled MATH-500 Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_math_500_4_shot_sampled` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_math_500_4_shot_sampled temperature: 0.7 request_timeout: 30 top_p: 1.0 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 4 supported_endpoint_types: - completions type: adlr_math_500_4_shot_sampled target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-greedy)= ## adlr_mbpp_sanitized_3_shot_greedy MBPP Greedy version used by NVIDIA Applied Deep Learning Research team (ADLR). 
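Two optional flags close out the Command template below: `--limit` is added when `limit_samples` is set, and `--downsampling_ratio` when `extra.downsampling_ratio` is set. Both default to `null` in the Defaults, so the full benchmark runs unless you override them. A sketch of that logic (hypothetical helper):

```python
def trailing_flags(limit_samples=None, downsampling_ratio=None):
    """Optional sample-limiting flags; both default to null in the YAML."""
    flags = []
    if limit_samples is not None:
        flags.append(f"--limit {limit_samples}")
    if downsampling_ratio is not None:
        flags.append(f"--downsampling_ratio {downsampling_ratio}")
    return " ".join(flags)

trailing_flags(limit_samples=100)  # → '--limit 100' (smoke-test on 100 samples)
```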
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_mbpp_sanitized_3_shot_greedy` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_mbpp_sanitized_3_shot_greedy temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 3 supported_endpoint_types: - completions type: adlr_mbpp_sanitized_3_shot_greedy target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-mbpp-sanitized-3-shot-sampled)= ## adlr_mbpp_sanitized_3_shot_sampled MBPP Sampled version used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_mbpp_sanitized_3_shot_sampled` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_mbpp_sanitized_3shot_sampled temperature: 0.6 request_timeout: 30 top_p: 0.95 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 3 supported_endpoint_types: - completions type: adlr_mbpp_sanitized_3_shot_sampled target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-mgsm-native-cot-8-shot)= ## adlr_mgsm_native_cot_8_shot MGSM native CoT subset (6 languages - es, de, fr, zh, ja, ru) used by NVIDIA Applied Deep Learning Research team (ADLR). 
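The command below is prefixed with `OPENAI_API_KEY=$<name>` only when the target defines `api_key_name`. Note that the rendered value is a shell dereference: `api_key_name` names an environment variable that holds the key, rather than being the key itself. A sketch of how the template renders this prefix (hypothetical helper):

```python
def api_key_prefix(api_key_name=None):
    """Render the OPENAI_API_KEY=$VAR prefix; $VAR is a shell dereference
    of the environment variable named by api_key_name."""
    if api_key_name is None:
        return ""
    return f"OPENAI_API_KEY=${api_key_name} "

api_key_prefix("MY_API_KEY")  # → 'OPENAI_API_KEY=$MY_API_KEY '
```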
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_mgsm_native_cot_8_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_mgsm_native_cot_8_shot temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 8 supported_endpoint_types: - completions type: adlr_mgsm_native_cot_8_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-minerva-math-nemo-4-shot)= ## adlr_minerva_math_nemo_4_shot Minerva-Math version used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_minerva_math_nemo_4_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_minerva_math_nemo temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 4 supported_endpoint_types: - completions type: adlr_minerva_math_nemo_4_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-mmlu)= ## adlr_mmlu MMLU version used by NVIDIA Applied Deep Learning Research team (ADLR). 
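`adlr_mmlu` is the one task in this group whose defaults set `config.params.extra.args` (`--trust_remote_code`), which the Command template splices verbatim into the `lm-eval` invocation; the `extra.tokenizer` and `extra.tokenizer_backend` fields flow into `--model_args` the same way. A hedged override sketch — the field names follow the Defaults tab, the tokenizer id is a hypothetical example, and the exact override mechanism depends on how you launch the evaluation:

```yaml
config:
  params:
    extra:
      tokenizer: my-org/my-tokenizer   # hypothetical HF tokenizer id, rendered into --model_args
      tokenizer_backend: huggingface   # assumption: a backend value accepted by lm-eval
      args: --trust_remote_code        # kept from the defaults; appended verbatim to the command
```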
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_mmlu` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens 
is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu_str temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 args: --trust_remote_code supported_endpoint_types: - completions type: adlr_mmlu target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-mmlu-pro-5-shot-base)= ## adlr_mmlu_pro_5_shot_base MMLU-Pro 5-shot base version used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_mmlu_pro_5_shot_base` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_mmlu_pro_5_shot_base temperature: 0.0 request_timeout: 30 top_p: 1.0e-05 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 supported_endpoint_types: - completions type: adlr_mmlu_pro_5_shot_base target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-race)= ## adlr_race RACE version used by NVIDIA Applied Deep Learning Research team (ADLR). 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_race` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens 
is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_race temperature: 1.0 request_timeout: 30 top_p: 1.0 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: adlr_race target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-truthfulqa-mc2)= ## adlr_truthfulqa_mc2 TruthfulQA-MC2 version used by NVIDIA Applied Deep Learning Research team (ADLR). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_truthfulqa_mc2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout 
}},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: adlr_truthfulqa_mc2 temperature: 1.0 request_timeout: 30 top_p: 1.0 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: adlr_truthfulqa_mc2 target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-adlr-winogrande-5-shot)= ## adlr_winogrande_5_shot Winogrande version used by NVIDIA Applied Deep Learning Research team (ADLR). 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `adlr_winogrande_5_shot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: winogrande temperature: 1.0 request_timeout: 30 top_p: 1.0 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 supported_endpoint_types: - completions type: adlr_winogrande_5_shot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-agieval)= ## agieval AGIEval - A Human-Centric Benchmark for Evaluating Foundation Models ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `agieval` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout 
}},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: agieval temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: agieval target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-arc-challenge)= ## arc_challenge The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. 
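Across this catalog, the `--gen_kwargs` portion of the Command template appears only when at least one of `config.params.temperature`, `config.params.top_p`, or `config.params.max_new_tokens` is set, and each value is appended in that fixed order. A minimal shell sketch of the same assembly (illustrative only, not part of the harness):

```bash
# Mirror of the --gen_kwargs assembly in the Command tab (illustrative only;
# the real command is rendered by the Jinja template, not by this function).
# Pass "" for any parameter that is unset (none).
build_gen_kwargs() {
  temperature="$1"; top_p="$2"; max_new_tokens="$3"
  s=""
  if [ -n "$temperature" ]; then s="temperature=$temperature"; fi
  if [ -n "$top_p" ]; then s="$s,top_p=$top_p"; fi
  if [ -n "$max_new_tokens" ]; then s="$s,max_gen_toks=$max_new_tokens"; fi
  if [ -n "$s" ]; then printf -- '--gen_kwargs="%s"\n' "$s"; fi
}

# With this task's defaults (temperature 1.0e-07, top_p 0.9999999, no max_new_tokens):
build_gen_kwargs 1.0e-07 0.9999999 ""
# prints: --gen_kwargs="temperature=1.0e-07,top_p=0.9999999"
```

Note that, mirroring the template's literal output, the rendered string begins with a comma when `temperature` is unset but `top_p` is not.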
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `arc_challenge` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: arc_challenge temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: arc_challenge target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-arc-challenge-chat)= ## arc_challenge_chat - The ARC challenge dataset consists of 2,590 multiple-choice science exam questions. - This variant applies a chat template and defaults to zero-shot evaluation. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `arc_challenge_chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: arc_challenge_chat temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 supported_endpoint_types: - chat type: arc_challenge_chat target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-arc-multilingual)= ## arc_multilingual The multilingual versions of the ARC challenge dataset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `arc_multilingual` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: arc_multilingual temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: arc_multilingual target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-bbh)= ## bbh The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `bbh` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: leaderboard_bbh temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: bbh target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-bbh-instruct)= ## bbh_instruct - The BIG-Bench Hard (BBH) benchmark is a part of the BIG-Bench evaluation suite, focusing on 23 particularly difficult tasks that current language models struggle with. - This variant applies a chat template and defaults to zero-shot evaluation.
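In the command templates above, `--gen_kwargs` is emitted only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and each parameter is appended conditionally. A minimal Python sketch of that assembly (the helper name is illustrative, not part of the SDK; the sketch joins parameters with commas rather than reproducing the template's hard-coded leading commas):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    """Mirror the Jinja conditionals that assemble --gen_kwargs."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval names this generation parameter max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    # The flag is omitted entirely when no parameter is set
    return f'--gen_kwargs="{",".join(parts)}"' if parts else ""

# With the default temperature and top_p shown in the Defaults tabs:
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999))
# → --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```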
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `bbh_instruct` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: bbh_zeroshot temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: bbh_instruct target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-bbq-chat)= ## bbq_chat The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (chat endpoint). 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `bbq_chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is 
not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: bbq_generate temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: bbq_chat target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-bbq-completions)= ## bbq_completions The BBQ (Bias Benchmark for QA) is a benchmark designed to measure social biases in question answering systems. It contains ambiguous questions spanning 9 categories - disability, gender, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status, and age (completions endpoint). 
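The templates above pick the lm-eval backend from the target's endpoint type: `completions` maps to `local-completions`, while `chat` maps to `local-chat-completions` and additionally enables `--fewshot_as_multiturn --apply_chat_template`. This is why `bbq_chat` and `bbq_completions` exist as separate task types over the same dataset. A sketch of that branch (function name is illustrative):

```python
def endpoint_flags(endpoint_type):
    """Mirror the endpoint-type branches in the command template."""
    if endpoint_type == "completions":
        return ["--model", "local-completions"]
    if endpoint_type == "chat":
        # Chat endpoints also render few-shot examples as chat turns
        return ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")

print(endpoint_flags("chat"))
```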
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `bbq_completions` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: bbq_generate temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: bbq_completions target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-commonsense-qa)= ## commonsense_qa - CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. - It contains 12,102 questions with one correct answer and four distractor answers. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `commonsense_qa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none 
%}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: commonsense_qa temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 7 supported_endpoint_types: - completions type: commonsense_qa target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu)= ## global_mmlu - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - It is designed for efficient evaluation of multilingual models in 15 languages (including English). 
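Across these templates, `--num_fewshot` is gated on Jinja's `is defined` test rather than `is not none`: the flag appears only when the task's `extra` block contains a `num_fewshot` key at all (for example, `commonsense_qa` above sets it to 7, while most other entries omit the key). A sketch of that distinction (helper name is illustrative):

```python
def num_fewshot_flag(extra):
    """--num_fewshot is gated on key presence (Jinja 'is defined'),
    not on a non-null value like the other parameters."""
    if "num_fewshot" in extra:
        return f"--num_fewshot {extra['num_fewshot']}"
    return ""

# commonsense_qa sets extra.num_fewshot: 7; most tasks omit the key
print(num_fewshot_flag({"num_fewshot": 7}))   # → --num_fewshot 7
print(num_fewshot_flag({"tokenizer": None}))  # → (empty string)
```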
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens 
is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-ar)= ## global_mmlu_ar - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the AR subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_ar` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_ar temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_ar target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-bn)= ## global_mmlu_bn - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the BN subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_bn` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_bn temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_bn target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-de)= ## global_mmlu_de - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the DE subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_de` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_de temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_de target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-en)= ## global_mmlu_en - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the EN subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_en` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_en temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_en target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-es)= ## global_mmlu_es - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ES subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_es` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_es temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_es target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-fr)= ## global_mmlu_fr - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the FR subset. 
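A note on the sampling defaults repeated in every entry above: `temperature: 1.0e-07` with `top_p: 0.9999999` effectively requests greedy decoding while keeping both values strictly inside (0, 1), presumably to avoid boundary values that some OpenAI-compatible servers special-case or reject. The near-zero temperature makes the softmax over logits essentially one-hot, as this quick check illustrates (logit values are arbitrary):

```python
import math

def softmax(logits, temperature):
    # Scale by temperature, subtract the max for numerical stability
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# At temperature=1e-07, a 1.0 logit gap becomes a 1e7 gap after scaling,
# so the largest logit takes essentially all probability mass:
print(softmax([2.0, 1.0, 0.5], temperature=1.0e-07))  # → [1.0, 0.0, 0.0]
```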
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_fr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_fr temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_fr target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full)= ## global_mmlu_full Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ 
config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-am)= ## global_mmlu_full_am - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AM subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_am` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_am temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_am target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ar)= ## global_mmlu_full_ar - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the AR subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ar` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ar temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ar target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-bn)= ## global_mmlu_full_bn - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the BN subset. 
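Every entry on this page runs in the same `lm-evaluation-harness` container. To be certain you execute the exact build whose digest the Container tabs record, you can address the image by digest instead of by tag; a minimal sketch (the `docker pull` line is commented out, and your actual run flags will differ):

```shell
# Pin the eval container by digest: a tag like 26.01 can be re-pushed,
# but a sha256 digest always resolves to the same image content.
IMAGE="nvcr.io/nvidia/eval-factory/lm-evaluation-harness"
DIGEST="sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26"
PINNED="${IMAGE}@${DIGEST}"
echo "${PINNED}"
# docker pull "${PINNED}"   # run flags and volume mounts depend on your setup
```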
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_bn` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_bn temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_bn target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-cs)= ## global_mmlu_full_cs - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the CS subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_cs` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_cs temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_cs target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-de)= ## global_mmlu_full_de - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the DE subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_de` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_de temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_de target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-el)= ## global_mmlu_full_el - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EL subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_el` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_el temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_el target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-en)= ## global_mmlu_full_en - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the EN subset. 
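The Defaults tab below shows the shipped configuration. For a quick smoke test you can override the fields the command template consumes; the sketch below reuses the same YAML schema with hypothetical values (the tokenizer ID is a placeholder, and how you supply such a file depends on your entry point):

```yaml
# Hypothetical override for a small smoke-test run of global_mmlu_full_en.
config:
  params:
    task: global_mmlu_full_en
    limit_samples: 50                 # rendered as lm-eval's --limit 50
    parallelism: 10
    request_timeout: 30
    max_retries: 5
    extra:
      tokenizer: my-org/my-model      # placeholder Hugging Face tokenizer ID
      tokenizer_backend: huggingface
target:
  api_endpoint:
    stream: false
```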
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_en` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_en temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_en target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-es)= ## global_mmlu_full_es - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ES subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_es` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_es temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_es target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-fa)= ## global_mmlu_full_fa - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FA subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_fa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_fa temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_fa target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-fil)= ## global_mmlu_full_fil - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FIL subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_fil` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_fil temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_fil target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-fr)= ## global_mmlu_full_fr - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the FR subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_fr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_fr temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_fr target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ha)= ## global_mmlu_full_ha - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HA subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ha` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ha temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ha target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-he)= ## global_mmlu_full_he - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HE subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_he` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_he temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_he target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-hi)= ## global_mmlu_full_hi - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the HI subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_hi` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_hi temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_hi target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-id)= ## global_mmlu_full_id - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ID subset. 
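The Command template also dispatches on the target endpoint type: `completions` endpoints use lm-eval's `local-completions` model backend, while `chat` endpoints use `local-chat-completions` and additionally receive `--fewshot_as_multiturn --apply_chat_template`, so few-shot examples are sent as multi-turn chat messages. A minimal sketch of that dispatch (illustrative only; helper names are hypothetical):

```python
def pick_model_backend(endpoint_type):
    """Map the target endpoint type to the lm-eval model backend,
    as selected by the command template."""
    return {
        "completions": "local-completions",
        "chat": "local-chat-completions",
    }[endpoint_type]

def chat_only_flags(endpoint_type):
    """Flags the template appends only for chat endpoints."""
    if endpoint_type == "chat":
        return ["--fewshot_as_multiturn", "--apply_chat_template"]
    return []
```

Note that these tasks list only `completions` under `supported_endpoint_types` in their defaults.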
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_id` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_id temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_id target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ig)= ## global_mmlu_full_ig - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IG subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ig` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ig temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ig target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-it)= ## global_mmlu_full_it - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the IT subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_it` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_it temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_it target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ja)= ## global_mmlu_full_ja - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the JA subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ja` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ja temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ja target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ko)= ## global_mmlu_full_ko - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KO subset. 
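Two optional sampling controls appear at the end of the Command template: `--limit` caps the number of evaluated samples when `limit_samples` is set, and `--downsampling_ratio` (default `null`, i.e. off) subsamples the task instead. A small sketch of how those flags are resolved (illustrative only, not SDK code):

```python
def sampling_flags(limit_samples=None, downsampling_ratio=None):
    """Mirror the template's optional sampling flags: each is
    emitted only when its parameter is not None."""
    flags = []
    if limit_samples is not None:
        flags += ["--limit", str(limit_samples)]
    if downsampling_ratio is not None:
        flags += ["--downsampling_ratio", str(downsampling_ratio)]
    return flags
```

With the documented defaults both are unset, so neither flag appears in the rendered command.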
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ko` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ko temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ko target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ky)= ## global_mmlu_full_ky - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the KY subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ky` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ky temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ky target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-lt)= ## global_mmlu_full_lt - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the LT subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_lt` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_lt temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_lt target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-mg)= ## global_mmlu_full_mg - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MG subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_mg` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_mg temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_mg target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ms)= ## global_mmlu_full_ms - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the MS subset. 
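Every variant in this family uses the same command template; only the `task`/`type` values differ. The densest part of that template is the `--model_args` string. The following plain-Python sketch (not the actual Jinja rendering the harness performs; the endpoint values are illustrative placeholders) shows how it is assembled from the Defaults:

```python
def build_model_args(params, endpoint):
    """Assemble the --model_args string the way the command template does.

    `params` mirrors config.params from the Defaults tab; `endpoint`
    mirrors target.api_endpoint. Sketch only: the real template is Jinja.
    """
    extra = params["extra"]
    parts = [
        f"base_url={endpoint['url']}",
        f"model={endpoint['model_id']}",
        f"tokenized_requests={extra['tokenized_requests']}",
    ]
    # tokenizer is only included when explicitly set (template: `is not none`)
    if extra["tokenizer"] is not None:
        parts.append(f"tokenizer={extra['tokenizer']}")
    parts += [
        f"tokenizer_backend={extra['tokenizer_backend']}",
        f"num_concurrent={params['parallelism']}",
        f"timeout={params['request_timeout']}",
        f"max_retries={params['max_retries']}",
        f"stream={endpoint['stream']}",
    ]
    return ",".join(parts)

# Values taken from the Defaults tab; URL and model id are placeholders.
params = {
    "max_retries": 5, "parallelism": 10, "request_timeout": 30,
    "extra": {"tokenizer": None, "tokenizer_backend": "None",
              "tokenized_requests": False},
}
endpoint = {"url": "http://localhost:8000/v1/completions",
            "model_id": "my-model", "stream": False}
print(build_model_args(params, endpoint))
```

The `parallelism` default of 10 becomes `num_concurrent=10`, which controls how many requests lm-eval keeps in flight against the endpoint at once.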
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ms` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ms temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ms target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ne)= ## global_mmlu_full_ne - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NE subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ne` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ne temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ne target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-nl)= ## global_mmlu_full_nl - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NL subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_nl` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_nl temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_nl target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ny)= ## global_mmlu_full_ny - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the NY subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ny` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ny temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ny target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-pl)= ## global_mmlu_full_pl - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PL subset. 
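The trailing `--gen_kwargs` portion of the command template is conditional: the flag is emitted only when at least one of `temperature`, `top_p`, or `max_new_tokens` is set, and `max_new_tokens` is mapped to lm-eval's `max_gen_toks` key. A small Python sketch of that logic (illustrative, not the harness's actual Jinja renderer):

```python
def build_gen_kwargs(temperature, top_p, max_new_tokens):
    """Mirror the template's --gen_kwargs logic: omit the flag entirely
    when no sampling parameter is set; map max_new_tokens -> max_gen_toks."""
    if temperature is None and top_p is None and max_new_tokens is None:
        return ""  # flag omitted entirely
    fields = []
    if temperature is not None:
        fields.append(f"temperature={temperature}")
    if top_p is not None:
        fields.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        fields.append(f"max_gen_toks={max_new_tokens}")
    return '--gen_kwargs="' + ",".join(fields) + '"'

# With the Defaults' near-zero temperature and near-one top_p:
print(build_gen_kwargs(1.0e-07, 0.9999999, None))
```

With the defaults shown (temperature `1.0e-07`, top_p `0.9999999`), generation is effectively greedy while still going through the sampling code path.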
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_pl` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_pl temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_pl target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-pt)= ## global_mmlu_full_pt - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the PT subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_pt` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_pt temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_pt target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ro)= ## global_mmlu_full_ro - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RO subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ro` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ro temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ro target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-ru)= ## global_mmlu_full_ru - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the RU subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_ru` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_ru temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_ru target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-si)= ## global_mmlu_full_si - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SI subset. 
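To run with different settings, individual fields can be overridden in a run configuration shaped like the Defaults tab. The keys below all appear in the Defaults and command template above; the values are illustrative, and the exact override mechanism depends on how you invoke the launcher or Core API:

```yaml
config:
  params:
    task: global_mmlu_full_si
    limit_samples: 50      # smoke-test on a small sample (--limit)
    parallelism: 4         # lower concurrency for a local endpoint
    request_timeout: 120   # allow slower responses
    extra:
      downsampling_ratio: 0.1
```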
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_si` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_si temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_si target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-sn)= ## global_mmlu_full_sn - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SN subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_sn` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_sn temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_sn target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-so)= ## global_mmlu_full_so - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SO subset. 
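Note that `api_key_name` in the command template names an environment variable rather than holding the key itself: the rendered prefix is shell indirection of the form `OPENAI_API_KEY=$<name>`, so the key's value is read from your environment at run time. A one-line Python sketch of that substitution (`MY_API_KEY` is an illustrative variable name):

```python
def render_key_prefix(api_key_name):
    """Render the template's OPENAI_API_KEY=${{...}} prefix.

    api_key_name is a variable *name*; the shell expands $<name> when
    the rendered command runs, so the key itself never enters the config.
    """
    return f"OPENAI_API_KEY=${api_key_name} " if api_key_name else ""

print(render_key_prefix("MY_API_KEY"))  # prefix ending in a trailing space
```

So exporting your key under the variable named by `api_key_name` is all that is required; the rendered `lm-eval` command picks it up via `OPENAI_API_KEY`.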
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_so` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_so temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_so target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-sr)= ## global_mmlu_full_sr - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SR subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_sr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
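One subtlety in the Defaults tabs: `tokenizer: null` is a YAML null, but `tokenizer_backend: None` is the unquoted *string* `"None"`, because YAML only resolves tokens such as `null` and `~` to null. A minimal sketch of the distinction, assuming PyYAML is available:

```python
# YAML treats the bare token `None` as an ordinary string, not as null.
import yaml  # PyYAML, assumed available

defaults = yaml.safe_load(
    "tokenizer: null\n"
    "tokenizer_backend: None\n"
)
print(defaults["tokenizer"])          # None (a real null)
print(defaults["tokenizer_backend"])  # the string 'None'
```

Presumably the string form exists so the rendered `--model_args` contains the literal token `None` for the harness to parse; that is an inference from the template, not documented behavior.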
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_sr temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_sr target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-sv)= ## global_mmlu_full_sv - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SV subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_sv` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_sv temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_sv target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-sw)= ## global_mmlu_full_sw - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the SW subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_sw` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_sw temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_sw target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-te)= ## global_mmlu_full_te - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TE subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_te` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_te temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_te target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-tr)= ## global_mmlu_full_tr - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the TR subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_tr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_tr temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_tr target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-uk)= ## global_mmlu_full_uk - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the UK subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_uk` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_uk temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_uk target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-vi)= ## global_mmlu_full_vi - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the VI subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_vi` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_vi temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_vi target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-yo)= ## global_mmlu_full_yo - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the YO subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_yo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_yo temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_yo target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-full-zh)= ## global_mmlu_full_zh - Global-MMLU is a multilingual evaluation set spanning 42 languages, including English. - This variant uses the ZH subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_full_zh` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_full_zh temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_full_zh target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-hi)= ## global_mmlu_hi - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the HI subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_hi` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_hi temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_hi target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-id)= ## global_mmlu_id - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ID subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_id` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_id temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_id target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-it)= ## global_mmlu_it - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the IT subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_it` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_it temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_it target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-ja)= ## global_mmlu_ja - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the JA subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_ja` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_ja temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_ja target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-ko)= ## global_mmlu_ko - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the KO subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_ko` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_ko temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_ko target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-pt)= ## global_mmlu_pt - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the PT subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_pt` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_pt temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_pt target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-sw)= ## global_mmlu_sw - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the SW subset. 
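As the Command templates show, the target endpoint type selects the lm-eval model backend: `completions` endpoints use `local-completions`, while `chat` endpoints use `local-chat-completions` (and additionally get `--fewshot_as_multiturn --apply_chat_template`). A plain-Python sketch of that selection (the helper name is hypothetical, purely for illustration):

```python
# Sketch of the backend selection performed by the Command templates.
# This helper is illustrative only; it is not part of the SDK.
def lm_eval_model(endpoint_type: str) -> str:
    """Map an API endpoint type to the lm-eval model backend name."""
    backends = {
        "completions": "local-completions",
        "chat": "local-chat-completions",
    }
    return backends[endpoint_type]

print(lm_eval_model("completions"))  # -> local-completions
print(lm_eval_model("chat"))         # -> local-chat-completions
```

Note that each task's Defaults list the endpoint types it supports under `supported_endpoint_types`; the Global-MMLU variants, for example, declare `completions` only.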
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_sw` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_sw temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_sw target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-yo)= ## global_mmlu_yo - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the YO subset. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_yo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_yo temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_yo target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-global-mmlu-zh)= ## global_mmlu_zh - Global-MMLU-Lite is a balanced collection of culturally sensitive and culturally agnostic MMLU tasks. - This variant uses the ZH subset. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `global_mmlu_zh` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: global_mmlu_zh temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: global_mmlu_zh target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-gpqa)= ## gpqa The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: leaderboard_gpqa temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: gpqa target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-gpqa-diamond-cot)= ## gpqa_diamond_cot - The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. - This variant uses the Diamond subset and defaults to zero-shot chain-of-thought evaluation. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_diamond_cot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: gpqa_diamond_cot_zeroshot temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: gpqa_diamond_cot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-gsm8k)= ## gsm8k The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `gsm8k` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: gsm8k temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: gsm8k target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-gsm8k-cot-instruct)= ## gsm8k_cot_instruct - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation with custom instructions. 
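In the Command templates, `temperature`, `top_p`, and `max_new_tokens` are forwarded to lm-eval through a single `--gen_kwargs` flag, with `max_new_tokens` renamed to lm-eval's `max_gen_toks`; the flag is omitted entirely when all three are unset. A simplified plain-Python sketch of that assembly (illustrative only, not SDK code):

```python
# Sketch: mirror how the Command templates assemble --gen_kwargs.
# Simplified and illustrative; the authoritative logic is the Jinja
# template shown in each Command tab.
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    if temperature is None and top_p is None and max_new_tokens is None:
        return ""  # the flag is omitted entirely
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval expects the generation token limit as max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    return '--gen_kwargs="' + ",".join(parts) + '"'

# With the documented defaults for a chain-of-thought task:
print(build_gen_kwargs(temperature=1e-07, top_p=0.9999999, max_new_tokens=1024))
# -> --gen_kwargs="temperature=1e-07,top_p=0.9999999,max_gen_toks=1024"
```

This rename matters when overriding defaults: set `max_new_tokens` in the config, not `max_gen_toks`, and the template performs the translation.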
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `gsm8k_cot_instruct` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: gsm8k_zeroshot_cot temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false args: --add_instruction supported_endpoint_types: - chat type: gsm8k_cot_instruct target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-gsm8k-cot-llama)= ## gsm8k_cot_llama - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought evaluation, with the implementation taken from Llama. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `gsm8k_cot_llama` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none 
%}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: gsm8k_cot_llama temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: gsm8k_cot_llama target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-gsm8k-cot-zeroshot)= ## gsm8k_cot_zeroshot - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation. 
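The Command tab for each of these tasks assembles `--gen_kwargs` from `temperature`, `top_p`, and `max_new_tokens`, emitting each parameter only when it is set. A minimal Python sketch of that assembly logic (`build_gen_kwargs` is illustrative, not an SDK function):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None):
    # Each parameter is included only when it is not None, matching the
    # {% if ... is not none %} guards in the Jinja command template.
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        # lm-eval names this generation knob max_gen_toks
        parts.append(f"max_gen_toks={max_new_tokens}")
    return f'--gen_kwargs="{",".join(parts)}"' if parts else ""

# With this task's defaults:
print(build_gen_kwargs(temperature=1.0e-07, top_p=0.9999999, max_new_tokens=1024))
# → --gen_kwargs="temperature=1e-07,top_p=0.9999999,max_gen_toks=1024"
```

If none of the three parameters is set, no `--gen_kwargs` flag is emitted at all, which matches the outer guard in the template.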
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `gsm8k_cot_zeroshot` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: gsm8k_cot_zeroshot temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: gsm8k_cot_zeroshot target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-gsm8k-cot-zeroshot-llama)= ## gsm8k_cot_zeroshot_llama - The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. - This variant defaults to chain-of-thought zero-shot evaluation; the implementation is taken from Llama. 
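In the command template, the target's endpoint type selects which lm-eval model backend is used. A small sketch of that mapping (`select_model_backend` is a hypothetical helper, shown only to make the branch explicit):

```python
def select_model_backend(endpoint_type: str) -> str:
    # Mirrors the {% if target.api_endpoint.type == ... %} branch that
    # chooses the value passed to lm-eval's --model flag.
    backends = {
        "completions": "local-completions",
        "chat": "local-chat-completions",
    }
    if endpoint_type not in backends:
        raise ValueError(f"unsupported endpoint type: {endpoint_type}")
    return backends[endpoint_type]
```

Because this task lists only `chat` under `supported_endpoint_types`, the rendered command uses `--model local-chat-completions`.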
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `gsm8k_cot_zeroshot_llama` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: gsm8k_cot_llama temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 supported_endpoint_types: - chat type: gsm8k_cot_zeroshot_llama target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-hellaswag)= ## hellaswag The HellaSwag benchmark tests a language model's commonsense reasoning by having it choose the most logical ending for a given story. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `hellaswag` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: hellaswag temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 10 supported_endpoint_types: - completions type: hellaswag target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-hellaswag-multilingual)= ## hellaswag_multilingual The multilingual versions of the HellaSwag benchmark. 
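Every task in this reference renders a single `--model_args` string from the target endpoint and the config parameters. A simplified sketch of how those fields could be joined (the optional `tokenizer`/`tokenizer_backend` fields are omitted, and the URL and model name below are placeholders):

```python
def build_model_args(url, model_id, parallelism=10, request_timeout=30,
                     max_retries=5, stream=False, tokenized_requests=False):
    # Simplified version of the --model_args value in the command template.
    fields = {
        "base_url": url,
        "model": model_id,
        "tokenized_requests": tokenized_requests,
        "num_concurrent": parallelism,
        "timeout": request_timeout,
        "max_retries": max_retries,
        "stream": stream,
    }
    return ",".join(f"{k}={v}" for k, v in fields.items())

print(build_model_args("http://localhost:8000/v1/completions", "my-model"))
```

Note how `parallelism` in the Defaults tab surfaces as `num_concurrent`, and `request_timeout` as `timeout`, in the rendered string.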
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `hellaswag_multilingual` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: hellaswag_multilingual temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 10 supported_endpoint_types: - completions type: hellaswag_multilingual target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-humaneval-instruct)= ## humaneval_instruct - The HumanEval benchmark measures functional correctness for synthesizing programs from docstrings. - Implementation taken from llama. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `humaneval_instruct` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: humaneval_instruct temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: humaneval_instruct target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-ifeval)= ## ifeval IFEval is a dataset designed to test a model's ability to follow explicit instructions, such as "include keyword x" or "use format y." The focus is on the model's adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics. 
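The `max_retries: 5` and `request_timeout: 30` defaults below are forwarded to lm-eval's HTTP client, which retries failed requests up to that many times. Conceptually, each scored request is bounded like this hedged sketch (`retry_call` is illustrative, not the harness's actual implementation):

```python
import time

def retry_call(fn, max_retries=5, backoff_s=0.0):
    # Retry fn() up to max_retries additional times, re-raising the last error.
    last_exc = None
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:  # real clients narrow this to transport errors
            last_exc = exc
            if attempt < max_retries:
                time.sleep(backoff_s * (2 ** attempt))
    raise last_exc
```

With the defaults above, a request that fails transiently is attempted up to six times in total before the sample is abandoned.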
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `ifeval` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is 
not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: ifeval temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: ifeval target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-m-mmlu-id-str-chat)= ## m_mmlu_id_str_chat - The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (chat endpoint). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `m_mmlu_id_str_chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ 
config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: m_mmlu_id_str temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 args: --trust_remote_code supported_endpoint_types: - chat type: m_mmlu_id_str_chat target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-m-mmlu-id-str-completions)= ## m_mmlu_id_str_completions - The MMLU (Massive Multitask Language Understanding) benchmark translated to Indonesian with string-based evaluation (completions endpoint). 
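The `parallelism: 10` default becomes `num_concurrent` in `--model_args`, meaning at most ten requests are in flight against the endpoint at once. A stdlib sketch of the same bounded-concurrency idea (the handler below is a stand-in for one scoring request):

```python
from concurrent.futures import ThreadPoolExecutor

def run_bounded(requests, handler, num_concurrent=10):
    # Issue requests with at most num_concurrent in flight, preserving order.
    with ThreadPoolExecutor(max_workers=num_concurrent) as pool:
        return list(pool.map(handler, requests))

# Hypothetical handler standing in for one request to the target endpoint:
results = run_bounded(range(25), lambda i: i * i, num_concurrent=10)
```

Raising `parallelism` speeds up large benchmarks but increases load on the serving endpoint; tune it together with `request_timeout`.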
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `m_mmlu_id_str_completions` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: m_mmlu_id_str temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 args: --trust_remote_code supported_endpoint_types: - completions type: m_mmlu_id_str_completions target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mbpp-plus-chat)= ## mbpp_plus_chat MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (chat endpoint). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mbpp_plus_chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mbpp_plus temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false args: --confirm_run_unsafe_code supported_endpoint_types: - chat type: mbpp_plus_chat target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mbpp-plus-completions)= ## mbpp_plus_completions MBPP EvalPlus is an extension of the MBPP benchmark with 35x more test cases (completions endpoint). 
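When `extra.downsampling_ratio` is set, the rendered command gains `--downsampling_ratio`, so only a fraction of the benchmark is scored. The sketch below shows one deterministic way such a subset could be chosen (a seeded shuffle is an assumption here; the harness's exact selection strategy may differ):

```python
import math
import random

def downsample(samples, ratio, seed=1234):
    # Keep ceil(len * ratio) items, chosen via a seeded shuffle so that
    # repeated runs score the same subset.
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    k = math.ceil(len(samples) * ratio)
    rng = random.Random(seed)
    picked = rng.sample(range(len(samples)), k)
    return [samples[i] for i in sorted(picked)]
```

Downsampling trades accuracy of the reported score for faster runs; scores on a subset are noisier than full-benchmark results.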
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mbpp_plus_completions` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mbpp_plus temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false args: --confirm_run_unsafe_code supported_endpoint_types: - completions type: mbpp_plus_completions target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mgsm)= ## mgsm - The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mgsm` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mgsm_direct temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: mgsm target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mgsm-cot-chat)= ## mgsm_cot_chat - The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the chat endpoint and defaults to chain-of-thought evaluation. 
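This chat-only variant expects an OpenAI-compatible chat endpoint in the run target. A hypothetical target block (field names taken from the Command and Defaults tabs; the URL, model ID, and key name are placeholders):

```yaml
target:
  api_endpoint:
    type: chat                                       # this variant supports only chat endpoints
    url: http://localhost:8000/v1/chat/completions   # placeholder endpoint URL
    model_id: my-model                               # placeholder model identifier
    api_key_name: MY_API_KEY                         # exported as OPENAI_API_KEY by the command template
    stream: false
```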
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mgsm_cot_chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: mgsm_cot_native temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 supported_endpoint_types: - chat type: mgsm_cot_chat target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mgsm-cot-completions)= ## mgsm_cot_completions - The Multilingual Grade School Math (MGSM) benchmark consists of 250 grade-school math problems from the GSM8K dataset, translated into ten languages. - This variant uses the completions endpoint and defaults to chain-of-thought evaluation. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mgsm_cot_completions` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: mgsm_cot_native temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 supported_endpoint_types: - completions type: mgsm_cot_completions target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu)= ## mmlu - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses text generation. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% 
endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu_str temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 args: --trust_remote_code supported_endpoint_types: - completions type: mmlu target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-cot-0-shot-chat)= ## mmlu_cot_0_shot_chat - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant defaults to chain-of-thought zero-shot evaluation. 
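In the Command tab below, the target endpoint type selects the lm-eval backend: `completions` maps to `local-completions`, while `chat` maps to `local-chat-completions` and additionally enables `--fewshot_as_multiturn` and `--apply_chat_template`. A minimal sketch of that selection logic (hypothetical helper, not part of the SDK):

```python
def select_model_args(endpoint_type: str) -> list[str]:
    """Mirror the Jinja conditionals in the Command tab: pick the
    lm-eval model backend from the target endpoint type."""
    if endpoint_type == "completions":
        args = ["--model", "local-completions"]
    elif endpoint_type == "chat":
        # Chat endpoints also apply the chat template and send
        # few-shot examples as multi-turn messages.
        args = ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    else:
        raise ValueError(f"unsupported endpoint type: {endpoint_type}")
    return args

print(select_model_args("chat"))
```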
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_cot_0_shot_chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu_cot_0_shot_chat temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false args: --trust_remote_code supported_endpoint_types: - chat type: mmlu_cot_0_shot_chat target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-instruct)= ## mmlu_instruct - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the chat endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_instruct` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu_str temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 args: --trust_remote_code --add_instruction supported_endpoint_types: - chat type: mmlu_instruct target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-instruct-completions)= ## mmlu_instruct_completions - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the completions endpoint, defaults to zero-shot evaluation and instructs the model to produce a single letter response. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_instruct_completions` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu_str temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 args: --trust_remote_code --add_instruction supported_endpoint_types: - completions type: mmlu_instruct_completions target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-logits)= ## mmlu_logits - The MMLU (Massive Multitask Language Understanding) benchmark covers 57 subjects across various fields, testing both world knowledge and problem-solving abilities. - This variant uses the logits of the model to evaluate the accuracy. 
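As with the other variants, the defaults below can be narrowed for a quick smoke test. A hypothetical override (key names from the Command template and Defaults tab; values illustrative, assuming the run configuration nests under `config.params`):

```yaml
config:
  params:
    limit_samples: 200   # illustrative: passed to lm-eval as --limit
    parallelism: 4       # illustrative: lowers num_concurrent in model_args
```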
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_logits` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens 
is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 supported_endpoint_types: - completions type: mmlu_logits target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-pro)= ## mmlu_pro MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4 (completions endpoint). ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_pro` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout 
}},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu_pro temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 supported_endpoint_types: - completions type: mmlu_pro target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-pro-instruct)= ## mmlu_pro_instruct - MMLU-Pro is a refined version of the MMLU dataset with 10 choices instead of 4. - This variant applies a chat template and defaults to zero-shot evaluation. 
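This variant sets `max_new_tokens: 1024`, which the Command template folds into `--gen_kwargs` as `max_gen_toks`, joining only the sampling parameters that are not null. A small sketch of that assembly (hypothetical helper mirroring the Jinja conditionals, not part of the SDK):

```python
def build_gen_kwargs(temperature=None, top_p=None, max_new_tokens=None) -> str:
    """Join only the non-null sampling parameters, as the Jinja
    template does; max_new_tokens maps to lm-eval's max_gen_toks."""
    parts = []
    if temperature is not None:
        parts.append(f"temperature={temperature}")
    if top_p is not None:
        parts.append(f"top_p={top_p}")
    if max_new_tokens is not None:
        parts.append(f"max_gen_toks={max_new_tokens}")
    return ",".join(parts)

print(build_gen_kwargs(temperature=1e-07, top_p=0.9999999, max_new_tokens=1024))
```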
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_pro_instruct` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if 
config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 task: mmlu_pro temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 0 supported_endpoint_types: - chat type: mmlu_pro_instruct target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-prox-chat)= ## mmlu_prox_chat A Multilingual Benchmark for Advanced Large Language Model Evaluation (chat endpoint) ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_prox_chat` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: mmlu_prox temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - chat type: mmlu_prox_chat target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-mmlu-prox-completions)= ## mmlu_prox_completions A Multilingual Benchmark for Advanced Large Language Model Evaluation (completions endpoint) ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` 
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_completions`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_completions
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-de-chat)=

## mmlu_prox_de_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (chat endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_de_chat`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_de_chat
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-de-completions)=

## mmlu_prox_de_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - German dataset (completions endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_de_completions`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_de
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_de_completions
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-es-chat)=

## mmlu_prox_es_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (chat endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_es_chat`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_es_chat
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-es-completions)=

## mmlu_prox_es_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Spanish dataset (completions endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_es_completions`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_es
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_es_completions
target:
  api_endpoint:
    stream: false
```

:::

::::

---
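As the command templates above show, the target's endpoint type selects both the lm-eval model backend and the chat-specific flags: `completions` renders `--model local-completions`, while `chat` renders `--model local-chat-completions` plus `--fewshot_as_multiturn --apply_chat_template`. A minimal sketch of that branching as a hypothetical Python helper (illustrative only, not part of the SDK API):

```python
def lmeval_model_flags(endpoint_type: str) -> list[str]:
    """Mirror the Jinja template's endpoint-type branching (illustrative)."""
    if endpoint_type == "completions":
        return ["--model", "local-completions"]
    if endpoint_type == "chat":
        # Chat endpoints additionally apply the model's chat template and
        # send few-shot examples as multi-turn messages.
        return ["--model", "local-chat-completions",
                "--fewshot_as_multiturn", "--apply_chat_template"]
    raise ValueError(f"unsupported endpoint type: {endpoint_type}")
```

This is why each benchmark is published in paired `_chat` and `_completions` variants: the task logic is identical, but the rendered lm-eval invocation differs in backend and prompt handling.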
(lm-evaluation-harness-mmlu-prox-fr-chat)=

## mmlu_prox_fr_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (chat endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_fr_chat`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_fr_chat
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-fr-completions)=

## mmlu_prox_fr_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - French dataset (completions endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_fr_completions`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_fr
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_fr_completions
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-it-chat)=

## mmlu_prox_it_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (chat endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_it_chat`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_it_chat
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-it-completions)=

## mmlu_prox_it_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Italian dataset (completions endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_it_completions`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_it
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_it_completions
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-ja-chat)=

## mmlu_prox_ja_chat

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (chat endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_ja_chat`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - chat
  type: mmlu_prox_ja_chat
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-prox-ja-completions)=

## mmlu_prox_ja_completions

A Multilingual Benchmark for Advanced Large Language Model Evaluation - Japanese dataset (completions endpoint)

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_prox_ja_completions`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_prox_ja
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_prox_ja_completions
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-redux)=

## mmlu_redux

MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
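Each entry's Defaults tab lists the configuration that applies when you run the task without overrides; parameters you supply are overlaid on top of these defaults. A hedged sketch of such a recursive overlay (illustrative only; the SDK's actual merge semantics may differ):

```python
def merge_defaults(defaults: dict, overrides: dict) -> dict:
    """Overlay user-supplied values on a task's default config.

    Nested dicts are merged key by key; scalars and lists in `overrides`
    replace the default value outright. The inputs are not mutated.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = value
    return merged
```

For example, overriding only `config.params.parallelism` leaves `task`, `temperature`, and the `extra` block at the values shown in the Defaults tab.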
::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_redux`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_retries: 5
    parallelism: 10
    task: mmlu_redux
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
  supported_endpoint_types:
  - completions
  type: mmlu_redux
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-mmlu-redux-instruct)=

## mmlu_redux_instruct

- MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
- This variant applies a chat template and defaults to zero-shot evaluation.

::::{tab-set}

:::{tab-item} Container

**Harness:** `lm-evaluation-harness`

**Container:**

```
nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01
```

**Container Digest:**

```
sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_redux_instruct`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: lm-evaluation-harness
pkg_name: lm_evaluation_harness
config:
  params:
    max_new_tokens: 8192
    max_retries: 5
    parallelism: 10
    task: mmlu_redux
    temperature: 1.0e-07
    request_timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: null
      tokenizer_backend: None
      downsampling_ratio: null
      tokenized_requests: false
      num_fewshot: 0
      args: --add_instruction
  supported_endpoint_types:
  - chat
  type: mmlu_redux_instruct
target:
  api_endpoint:
    stream: false
```

:::

::::

---

(lm-evaluation-harness-musr)=

## musr

The MuSR (Multistep Soft Reasoning) benchmark evaluates the reasoning capabilities of large language models through complex, multistep tasks specified in natural language narratives.
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `musr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not 
none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: leaderboard_musr temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: musr target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-openbookqa)= ## openbookqa - OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding of a subject. - Answering OpenBookQA questions requires additional broad common knowledge, not contained in the book. - The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `openbookqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens 
is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: openbookqa temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: openbookqa target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-piqa)= ## piqa - Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of large language models. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `piqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: piqa temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: piqa target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-social-iqa)= ## social_iqa - Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations. 
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `social_iqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens 
is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: social_iqa temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false args: --trust_remote_code supported_endpoint_types: - completions type: social_iqa target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-truthfulqa)= ## truthfulqa - The TruthfulQA benchmark measures the truthfulness of language models in generating answers to questions. - It consists of 817 questions across 38 categories, such as health, law, finance, and politics, designed to test whether models can avoid generating false answers that mimic common human misconceptions. 
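Every lm-evaluation-harness task on this page renders the same Jinja2 command template shown in each Command tab. As an illustration of how that template assembles the `--model_args` string from a target endpoint and the Defaults YAML, here is a minimal Python sketch (a hypothetical helper, not part of the SDK; the real value is rendered by the harness):

```python
def build_model_args(endpoint: dict, params: dict) -> str:
    """Sketch of the --model_args string the command template produces.

    Illustrative only; field names follow the Jinja2 template on this page.
    """
    extra = params["extra"]
    parts = [
        f"base_url={endpoint['url']}",
        f"model={endpoint['model_id']}",
        f"tokenized_requests={extra['tokenized_requests']}",
    ]
    if extra.get("tokenizer") is not None:  # template emits tokenizer= only when set
        parts.append(f"tokenizer={extra['tokenizer']}")
    parts += [
        f"tokenizer_backend={extra['tokenizer_backend']}",
        f"num_concurrent={params['parallelism']}",
        f"timeout={params['request_timeout']}",
        f"max_retries={params['max_retries']}",
        f"stream={endpoint['stream']}",
    ]
    return ",".join(parts)

# Hypothetical local endpoint combined with typical defaults from this page
args = build_model_args(
    {"url": "http://localhost:8000/v1/completions", "model_id": "my-model", "stream": False},
    {"parallelism": 10, "request_timeout": 30, "max_retries": 5,
     "extra": {"tokenizer": None, "tokenizer_backend": "None", "tokenized_requests": False}},
)
print(args)
```

The endpoint URL and model name above are placeholders; in a real run they come from `target.api_endpoint` in your configuration.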
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `truthfulqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens 
is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: truthfulqa temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false supported_endpoint_types: - completions type: truthfulqa target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-wikilingua)= ## wikilingua - The WikiLingua benchmark is a large-scale, multilingual dataset designed for evaluating cross-lingual abstractive summarization systems. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `wikilingua` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ 
config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: wikilingua temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false args: --trust_remote_code supported_endpoint_types: - chat type: wikilingua target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-wikitext)= ## wikitext - The WikiText language modeling dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. - This task measures perplexity on the WikiText-2 dataset via rolling loglikelihoods. 
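As a reminder of the metric (an illustrative formula, not the harness implementation), perplexity is the exponentiated negative mean log-likelihood over the evaluated text:

```python
import math

def perplexity(token_logliks: list[float], num_words: int) -> float:
    """Word-level perplexity from summed token log-likelihoods (illustrative)."""
    return math.exp(-sum(token_logliks) / num_words)

# e.g. four model tokens spanning three words of text
ppl = perplexity([-1.2, -0.8, -2.0, -0.5], num_words=3)
```

The rolling-loglikelihood requests mentioned above supply the per-token log-likelihoods; the harness aggregates them internally.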
::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `wikitext` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif %},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is 
not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: wikitext temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false args: --trust_remote_code supported_endpoint_types: - completions type: wikitext target: api_endpoint: stream: false ``` ::: :::: --- (lm-evaluation-harness-winogrande)= ## winogrande WinoGrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options testing commonsense reasoning. ::::{tab-set} :::{tab-item} Container **Harness:** `lm-evaluation-harness` **Container:** ``` nvcr.io/nvidia/eval-factory/lm-evaluation-harness:26.01 ``` **Container Digest:** ``` sha256:f5e5b59b2893e48ce113c4e163b0f9baadadf80824384bcfc84591e0664aba26 ``` **Container Arch:** `multiarch` **Task Type:** `winogrande` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests={{config.params.extra.tokenized_requests}},{% if config.params.extra.tokenizer is not none %}tokenizer={{config.params.extra.tokenizer}}{% endif 
%},tokenizer_backend={{config.params.extra.tokenizer_backend}},num_concurrent={{config.params.parallelism}},timeout={{ config.params.request_timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{% endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.temperature is not none or config.params.top_p is not none or config.params.max_new_tokens is not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{ config.params.temperature }}{% endif %}{% if config.params.top_p is not none %},top_p={{ config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %}"{% endif %} {% if config.params.extra.downsampling_ratio is not none %}--downsampling_ratio {{ config.params.extra.downsampling_ratio }}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: lm-evaluation-harness pkg_name: lm_evaluation_harness config: params: max_retries: 5 parallelism: 10 task: winogrande temperature: 1.0e-07 request_timeout: 30 top_p: 0.9999999 extra: tokenizer: null tokenizer_backend: None downsampling_ratio: null tokenized_requests: false num_fewshot: 5 supported_endpoint_types: - completions type: winogrande target: api_endpoint: stream: false ``` ::: :::: # mmath This page contains all evaluation tasks for the **mmath** harness. 
```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [mmath_ar](#mmath-mmath-ar) - Arabic mmath * - [mmath_en](#mmath-mmath-en) - English mmath * - [mmath_es](#mmath-mmath-es) - Spanish mmath * - [mmath_fr](#mmath-mmath-fr) - French mmath * - [mmath_ja](#mmath-mmath-ja) - Japanese mmath * - [mmath_ko](#mmath-mmath-ko) - Korean mmath * - [mmath_pt](#mmath-mmath-pt) - Portuguese mmath * - [mmath_th](#mmath-mmath-th) - Thai mmath * - [mmath_vi](#mmath-mmath-vi) - Vietnamese mmath * - [mmath_zh](#mmath-mmath-zh) - Chinese mmath ``` (mmath-mmath-ar)= ## mmath_ar Arabic mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_ar` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: ar n_samples: 4 supported_endpoint_types: - chat type: mmath_ar target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-en)= ## mmath_en English mmath 
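All mmath tasks share the command template shown in each Command tab. The way its configuration fields map onto CLI flags can be sketched as follows (a hypothetical helper for illustration only; the real command is rendered from the Jinja2 template):

```python
def render_mmath_argv(endpoint: dict, params: dict, output_dir: str) -> list[str]:
    """Mirror the mmath command template's flag mapping (illustrative only)."""
    argv = [
        "mmath",
        "--model-url", endpoint["url"],
        "--model-name", endpoint["model_id"],
        "--lang", params["extra"]["language"],
        "--output-dir", output_dir,
        "--parallelism", str(params["parallelism"]),
        "--retries", str(params["max_retries"]),
        "--max-tokens", str(params["max_new_tokens"]),
        "--temperature", str(params["temperature"]),
        "--top-p", str(params["top_p"]),
        "--request-timeout", str(params["request_timeout"]),
        "--n-samples", str(params["extra"]["n_samples"]),
    ]
    if params.get("limit_samples") is not None:  # --limit only when samples are capped
        argv += ["--limit", str(params["limit_samples"])]
    return argv

# The mmath_en defaults with a hypothetical local endpoint and output directory
argv = render_mmath_argv(
    {"url": "http://localhost:8000/v1/chat/completions", "model_id": "my-model"},
    {"parallelism": 8, "max_retries": 5, "max_new_tokens": 32768,
     "temperature": 0.6, "top_p": 0.95, "request_timeout": 3600,
     "extra": {"language": "en", "n_samples": 4}},
    output_dir="./results",
)
```

The endpoint URL, model name, and output directory are placeholders; in a real run they come from `target.api_endpoint` and `config.output_dir`.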
::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_en` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: en n_samples: 4 supported_endpoint_types: - chat type: mmath_en target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-es)= ## mmath_es Spanish mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_es` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} 
--parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: es n_samples: 4 supported_endpoint_types: - chat type: mmath_es target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-fr)= ## mmath_fr French mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_fr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: 
language: fr n_samples: 4 supported_endpoint_types: - chat type: mmath_fr target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-ja)= ## mmath_ja Japanese mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_ja` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: ja n_samples: 4 supported_endpoint_types: - chat type: mmath_ja target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-ko)= ## mmath_ko Korean mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_ko` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {%
endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: ko n_samples: 4 supported_endpoint_types: - chat type: mmath_ko target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-pt)= ## mmath_pt Portuguese mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_pt` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} 
Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: pt n_samples: 4 supported_endpoint_types: - chat type: mmath_pt target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-th)= ## mmath_th Thai mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_th` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: th n_samples: 4 supported_endpoint_types: - chat type: mmath_th target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-vi)= ## mmath_vi Vietnamese mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` 
**Task Type:** `mmath_vi` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout {{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: vi n_samples: 4 supported_endpoint_types: - chat type: mmath_vi target: api_endpoint: stream: false ``` ::: :::: --- (mmath-mmath-zh)= ## mmath_zh Chinese mmath ::::{tab-set} :::{tab-item} Container **Harness:** `mmath` **Container:** ``` nvcr.io/nvidia/eval-factory/mmath:26.01 ``` **Container Digest:** ``` sha256:da033bf95efd05af58d2ab06feb2344dbca60678f3075a4bf7f53899901c5efc ``` **Container Arch:** `multiarch` **Task Type:** `mmath_zh` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export OPENAI_API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} mmath --model-url {{target.api_endpoint.url}} --model-name {{target.api_endpoint.model_id}} --lang {{config.params.extra.language}} --output-dir {{config.output_dir}} --parallelism {{config.params.parallelism}} --retries {{config.params.max_retries}} --max-tokens {{config.params.max_new_tokens}} --temperature {{config.params.temperature}} --top-p {{config.params.top_p}} --request-timeout 
{{config.params.request_timeout}} --n-samples {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mmath pkg_name: mmath config: params: max_new_tokens: 32768 max_retries: 5 parallelism: 8 temperature: 0.6 request_timeout: 3600 top_p: 0.95 extra: language: zh n_samples: 4 supported_endpoint_types: - chat type: mmath_zh target: api_endpoint: stream: false ``` ::: :::: # mtbench This page contains all evaluation tasks for the **mtbench** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [mtbench](#mtbench-mtbench) - Standard MT-Bench * - [mtbench-cor1](#mtbench-mtbench-cor1) - Corrected MT-Bench ``` (mtbench-mtbench)= ## mtbench Standard MT-Bench ::::{tab-set} :::{tab-item} Container **Harness:** `mtbench` **Container:** ``` nvcr.io/nvidia/eval-factory/mtbench:26.01 ``` **Container Digest:** ``` sha256:69c930de81fdc8d3a55824fc7ebee9632c858ba56234f43ad9d1590e7fc861b1 ``` **Container Arch:** `multiarch` **Task Type:** `mtbench` ::: :::{tab-item} Command ```bash mtbench-evaluator {% if target.api_endpoint.model_id is not none %} --model {{target.api_endpoint.model_id}}{% endif %} {% if target.api_endpoint.url is not none %} --url {{target.api_endpoint.url}}{% endif %} {% if target.api_endpoint.api_key_name is not none %} --api_key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.request_timeout is not none %} --timeout {{config.params.request_timeout}}{% endif %} {% if config.params.max_retries is not none %} --max_retries {{config.params.max_retries}}{% endif %} {% if config.params.parallelism is not none %} --parallelism {{config.params.parallelism}}{% endif %} {% if config.params.max_new_tokens is not none %} --max_tokens {{config.params.max_new_tokens}}{% endif %} --workdir {{config.output_dir}} {% if config.params.temperature is not none %} --temperature 
{{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %} --top_p {{config.params.top_p}}{% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %} --generate --judge {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key_name is not none %} --judge_api_key_name {{config.params.extra.judge.api_key_name}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mtbench pkg_name: mtbench config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 request_timeout: 30 extra: judge: url: null model_id: gpt-4 api_key_name: null request_timeout: 60 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 2048 supported_endpoint_types: - chat type: mtbench target: api_endpoint: {} ``` ::: :::: --- (mtbench-mtbench-cor1)= ## mtbench-cor1 Corrected MT-Bench ::::{tab-set} :::{tab-item} Container **Harness:** `mtbench` **Container:** ``` nvcr.io/nvidia/eval-factory/mtbench:26.01 ``` **Container Digest:** ``` 
sha256:69c930de81fdc8d3a55824fc7ebee9632c858ba56234f43ad9d1590e7fc861b1 ``` **Container Arch:** `multiarch` **Task Type:** `mtbench-cor1` ::: :::{tab-item} Command ```bash mtbench-evaluator {% if target.api_endpoint.model_id is not none %} --model {{target.api_endpoint.model_id}}{% endif %} {% if target.api_endpoint.url is not none %} --url {{target.api_endpoint.url}}{% endif %} {% if target.api_endpoint.api_key_name is not none %} --api_key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.request_timeout is not none %} --timeout {{config.params.request_timeout}}{% endif %} {% if config.params.max_retries is not none %} --max_retries {{config.params.max_retries}}{% endif %} {% if config.params.parallelism is not none %} --parallelism {{config.params.parallelism}}{% endif %} {% if config.params.max_new_tokens is not none %} --max_tokens {{config.params.max_new_tokens}}{% endif %} --workdir {{config.output_dir}} {% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %} --top_p {{config.params.top_p}}{% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}} {% endif %} {% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %} --generate --judge {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key_name is not none %} --judge_api_key_name {{config.params.extra.judge.api_key_name}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if 
config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mtbench pkg_name: mtbench config: params: max_new_tokens: 1024 max_retries: 5 parallelism: 10 request_timeout: 30 extra: judge: url: null model_id: gpt-4 api_key_name: null request_timeout: 60 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 2048 args: --judge_reference_model gpt-4-0125-preview supported_endpoint_types: - chat type: mtbench-cor1 target: api_endpoint: {} ``` ::: :::: # mteb This page contains all evaluation tasks for the **mteb** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [MMTEB](#mteb-mmteb) - MMTEB * - [MTEB](#mteb-mteb) - MTEB * - [MTEB_NL_RETRIEVAL](#mteb-mteb-nl-retrieval) - MTEB_NL_RETRIEVAL * - [MTEB_VDR](#mteb-mteb-vdr) - MTEB Visual Document Retrieval benchmark * - [RTEB](#mteb-rteb) - RTEB * - [ViDoReV1](#mteb-vidorev1) - ViDoReV1 * - [ViDoReV2](#mteb-vidorev2) - ViDoReV2 * - [ViDoReV3](#mteb-vidorev3) - ViDoReV3 * - [ViDoReV3_Text](#mteb-vidorev3-text) - ViDoReV3 Text (text_image markdown only) * - [ViDoReV3_Text_Image](#mteb-vidorev3-text-image) - ViDoReV3 Text+Image (text_image markdown + images) * - [custom_beir_task](#mteb-custom-beir-task) - Custom BEIR-formatted text retrieval benchmark * - [fiqa](#mteb-fiqa) - Financial Opinion Mining and Question Answering * - [hotpotqa](#mteb-hotpotqa) - HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. 
* - [miracl](#mteb-miracl) - MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages. * - [miracl_lite](#mteb-miracl-lite) - MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages. * - [mldr](#mteb-mldr) - MLDR * - [mlqa](#mteb-mlqa) - MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. * - [nano_fiqa](#mteb-nano-fiqa) - NanoFiQA2018 is a smaller subset of the Financial Opinion Mining and Question Answering dataset. * - [nq](#mteb-nq) - Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems. ``` (mteb-mmteb)= ## MMTEB MMTEB ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `MMTEB` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout 
{{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: MTEB(Multilingual, v2) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: MMTEB target: api_endpoint: {} ``` ::: :::: --- (mteb-mteb)= ## MTEB MTEB ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` 
**Task Type:** `MTEB` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml 
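# Defaults for the MTEB task. `task` holds the benchmark suite name that the
# Command template quotes into --task; any `extra` key left null (ranker,
# cache_path, args, ...) sits behind an `is not none` guard and is omitted
# from the rendered command.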
framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: MTEB(eng, v2) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: MTEB target: api_endpoint: {} ``` ::: :::: --- (mteb-mteb-nl-retrieval)= ## MTEB_NL_RETRIEVAL MTEB_NL_RETRIEVAL ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `MTEB_NL_RETRIEVAL` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} 
--query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: MTEB(nld, v1, retrieval) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: MTEB_NL_RETRIEVAL target: api_endpoint: {} ``` ::: :::: --- (mteb-mteb-vdr)= ## MTEB_VDR MTEB Visual Document Retrieval benchmark ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `MTEB_VDR` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if 
config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: VisualDocumentRetrieval request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test 
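      # Each null entry below sits behind an `is not none` guard in the
      # Command template, so leaving it unset simply drops the matching flag.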
dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: MTEB_VDR target: api_endpoint: {} ``` ::: :::: --- (mteb-rteb)= ## RTEB RTEB ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `RTEB` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url 
{{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: RTEB(beta) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: RTEB target: api_endpoint: {} ``` ::: :::: --- (mteb-vidorev1)= ## ViDoReV1 ViDoReV1 ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `ViDoReV1` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} 
--request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: ViDoRe(v1) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: ViDoReV1 target: api_endpoint: {} ``` ::: :::: --- (mteb-vidorev2)= ## ViDoReV2 ViDoReV2 ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` 
**Container Arch:** `multiarch` **Task Type:** `ViDoReV2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: 
:::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: ViDoRe(v2) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: ViDoReV2 target: api_endpoint: {} ``` ::: :::: --- (mteb-vidorev3)= ## ViDoReV3 ViDoRe v3 visual document retrieval benchmark ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `ViDoReV3` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template
"{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: ViDoRe(v3) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: ViDoReV3 target: api_endpoint: {} ``` ::: :::: --- (mteb-vidorev3-text)= ## ViDoReV3_Text ViDoReV3 Text (text_image markdown only) ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `ViDoReV3_Text` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not 
none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: ViDoRe(v3, Text) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: 
null language: null supported_endpoint_types: - embedding type: ViDoReV3_Text target: api_endpoint: {} ``` ::: :::: --- (mteb-vidorev3-text-image)= ## ViDoReV3_Text_Image ViDoReV3 Text+Image (text_image markdown + images) ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `ViDoReV3_Text_Image` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name 
{{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: ViDoRe(v3, Text+Image) request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: ViDoReV3_Text_Image target: api_endpoint: {} ``` ::: :::: --- (mteb-custom-beir-task)= ## custom_beir_task Custom BEIR-formatted text retrieval benchmark ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `custom_beir_task` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} 
--batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: custom_beir_task request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: custom_beir_task target: api_endpoint: {} ``` ::: :::: --- (mteb-fiqa)= ## fiqa Financial Opinion Mining and Question Answering ::::{tab-set} :::{tab-item} Container **Harness:** 
`mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `fiqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite 
{{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: FiQA2018 request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: fiqa target: api_endpoint: {} ``` ::: :::: --- (mteb-hotpotqa)= ## hotpotqa HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `hotpotqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} 
--cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: HotpotQA request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: hotpotqa target: api_endpoint: {} ``` ::: :::: --- (mteb-miracl)= ## miracl MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages. 
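Every `mteb` command template on this page follows the same pattern: required flags are always emitted, while optional flags sit behind `is not none` guards so that `null` defaults are simply omitted from the final command line. As a rough illustration only (not the launcher's actual implementation), the same guard logic can be sketched in Python over a hypothetical config dict shaped like the Defaults tabs:

```python
# Sketch of the optional-flag logic in the mteb command templates above.
# The config dict and function are illustrative stand-ins, not launcher APIs.

def build_mteb_command(model_id, url, output_dir, params):
    extra = params["extra"]
    # Required flags: always emitted, mirroring the unguarded template parts.
    cmd = [
        "mteb",
        "--encoder_name", model_id,
        "--encoder_url", url,
        "--task", params["task"],
        "--workdir", output_dir,
        "--batch_size", str(extra["batch_size"]),
        "--async_limit", str(params["parallelism"]),
        "--max_retries", str(params["max_retries"]),
        "--request_timeout", str(params["request_timeout"]),
    ]
    # Optional flags: emitted only when the value is not None,
    # mirroring the template's `{% if ... is not none %}` guards.
    if extra.get("language") is not None:
        cmd += ["--langs", extra["language"]]
    if extra["ranker"].get("model_id") is not None:
        cmd += [
            "--ranker_name", extra["ranker"]["model_id"],
            "--ranker_url", extra["ranker"]["url"],
            "--ranker_endpoint_type", extra["ranker"]["endpoint_type"],
        ]
    cmd += ["--truncate", extra["truncate"], "--top_k", str(extra["top_k"])]
    return cmd

params = {
    "task": "MIRACLRetrieval",
    "parallelism": 20,
    "max_retries": 10,
    "request_timeout": 300,
    "extra": {
        "batch_size": 128,
        "language": "de",               # non-null, so --langs is emitted
        "ranker": {"model_id": None},   # null, so all ranker flags are skipped
        "truncate": "END",
        "top_k": 40,
    },
}
cmd = build_mteb_command("my-embedding-model", "http://localhost:8000/v1", "/results", params)
print(" ".join(cmd))
```

Because null values drop out entirely, the default configuration produces a minimal command; setting any `extra` field to a non-null value adds the corresponding flag without touching the rest of the invocation.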
::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `miracl` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if 
config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: MIRACLRetrieval request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: miracl target: api_endpoint: {} ``` ::: :::: --- (mteb-miracl-lite)= ## miracl_lite A lite variant of MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) that runs the multilingual retrieval evaluation with `version_lite` enabled for a smaller, faster benchmark. ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `miracl_lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout
{{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: MIRACLRetrieval request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: true language: null supported_endpoint_types: - embedding type: miracl_lite target: api_endpoint: {} ``` ::: :::: --- (mteb-mldr)= ## mldr MLDR (MultiLongDocRetrieval), a multilingual long-document retrieval benchmark ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch`
**Task Type:** `mldr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml 
framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: MultiLongDocRetrieval request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: mldr target: api_endpoint: {} ``` ::: :::: --- (mteb-mlqa)= ## mlqa MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `mlqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if 
config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: MLQARetrieval request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: mlqa target: api_endpoint: {} ``` ::: :::: --- (mteb-nano-fiqa)= ## nano_fiqa NanoFiQA2018 is a smaller subset of the Financial Opinion Mining and Question Answering dataset. 
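The Defaults tab for each task shows the full parameter tree, but in practice a run configuration usually overrides only a few leaves (note, for example, that `nano_fiqa` defaults to the `train` split, unlike most tasks on this page, which use `test`). The sketch below uses a generic deep merge, not the launcher's actual override mechanism, to show how a small override layers over a Defaults tree without disturbing the untouched keys:

```python
# Illustrative sketch: layering a user override over a Defaults tree.
# deep_merge is a generic helper written for this example, not a launcher API.

def deep_merge(base, override):
    """Return base updated recursively with override; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A trimmed version of the nano_fiqa Defaults shown below.
defaults = {
    "max_retries": 10,
    "parallelism": 20,
    "task": "NanoFiQA2018Retrieval",
    "extra": {"top_k": 40, "eval_split": "train", "batch_size": 128},
}

# Hypothetical user overrides: lower concurrency and fewer retrieved documents.
overrides = {"parallelism": 4, "extra": {"top_k": 10}}

params = deep_merge(defaults, overrides)
print(params["parallelism"], params["extra"]["top_k"], params["extra"]["eval_split"])
```

Only the two overridden leaves change; sibling keys such as `eval_split` and `batch_size` keep their documented defaults.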
::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `nano_fiqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if 
config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: NanoFiQA2018Retrieval request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: train dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: nano_fiqa target: api_endpoint: {} ``` ::: :::: --- (mteb-nq)= ## nq Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems. ::::{tab-set} :::{tab-item} Container **Harness:** `mteb` **Container:** ``` nvcr.io/nvidia/eval-factory/mteb:26.01 ``` **Container Digest:** ``` sha256:fb0ea5360bec880d4ecbfc63015d775dc3d22601e5ab17d760a992402646cbbb ``` **Container Arch:** `multiarch` **Task Type:** `nq` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_TOKEN=${{target.api_endpoint.api_key_name}} &&{% endif %} {% if config.params.extra.dataset_path is not none %} export MTEB_INTERNAL_DATASET_PATH={{config.params.extra.dataset_path}} &&{% endif %} {% if config.params.extra.ranker.api_key is not none %}export RANKER_API_TOKEN=${{config.params.extra.ranker.api_key}} &&{% endif %} mteb --encoder_name {{target.api_endpoint.model_id}} --encoder_url {{target.api_endpoint.url}} --task "{{config.params.task}}" --workdir {{config.output_dir}} --batch_size {{config.params.extra.batch_size}} --async_limit {{config.params.parallelism}} --max_retries {{config.params.max_retries}} 
--request_timeout {{config.params.request_timeout}} {% if config.params.extra.cache_path is not none %} --cache_path {{config.params.extra.cache_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.language is not none %} --langs {{config.params.extra.language}} {% endif %} {% if config.params.extra.query_prompt_template is not none %} --query_prompt_template "{{config.params.extra.query_prompt_template}}"{% endif %} {% if config.params.extra.document_prompt_template is not none %} --document_prompt_template "{{config.params.extra.document_prompt_template}}"{% endif %} {% if config.params.extra.ranker.model_id is not none %} --ranker_name {{config.params.extra.ranker.model_id}} --ranker_url {{config.params.extra.ranker.url}} --ranker_endpoint_type {{config.params.extra.ranker.endpoint_type}}{% endif %} --truncate {{config.params.extra.truncate}} --top_k {{config.params.extra.top_k}} {% if config.params.extra.version_lite is not none%} --version_lite {{config.params.extra.version_lite}} {% endif %} {% if config.params.extra.eval_split is not none %} --eval_split {{config.params.extra.eval_split}} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: mteb pkg_name: mteb config: params: max_retries: 10 parallelism: 20 task: NQ request_timeout: 300 extra: query_prompt_template: null document_prompt_template: null ranker: model_id: null url: null api_key: null endpoint_type: nim top_k: 40 truncate: END batch_size: 128 eval_split: test dataset_path: null cache_path: null args: null version_lite: null language: null supported_endpoint_types: - embedding type: nq target: api_endpoint: {} ``` ::: :::: # nemo_skills This page contains all evaluation tasks for the **nemo_skills** harness. 
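The `nemo_skills` command templates below pass `++max_concurrent_requests` to `ns eval` using the Jinja expression `[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max`: when a benchmark is repeated, the configured parallelism is divided across the repeats, floored, and clamped to a minimum of 1. A minimal Python sketch of that calculation (the function name is illustrative, not part of the SDK):

```python
from typing import Optional


def max_concurrent_requests(parallelism: int, num_repeats: Optional[int]) -> int:
    """Mirror the Jinja expression used by the command templates:
    [(parallelism / num_repeats) | int, 1] | max
    """
    if num_repeats is None or num_repeats <= 1:
        # No repeats configured: use the parallelism value as-is.
        return int(parallelism)
    # Divide parallelism across repeats, floor it, and never drop below 1.
    return max(int(parallelism / num_repeats), 1)
```

For example, with the default `parallelism: 16` and `num_repeats: 4`, each repeat runs with a concurrency of 4; with `num_repeats: 32`, the value clamps to 1.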
```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [ns_aa_lcr](#nemo-skills-ns-aa-lcr) - AA-LCR * - [ns_aime2024](#nemo-skills-ns-aime2024) - AIME2024 * - [ns_aime2025](#nemo-skills-ns-aime2025) - AIME2025 * - [ns_bfcl_v3](#nemo-skills-ns-bfcl-v3) - BFCLv3 * - [ns_bfcl_v4](#nemo-skills-ns-bfcl-v4) - BFCLv4 * - [ns_gpqa](#nemo-skills-ns-gpqa) - GPQA Diamond * - [ns_hle](#nemo-skills-ns-hle) - Humanity's Last Exam * - [ns_hle_aa](#nemo-skills-ns-hle-aa) - Humanity's Last Exam aligned with AA * - [ns_hmmt_feb2025](#nemo-skills-ns-hmmt-feb2025) - HMMT February 2025 (MathArena/hmmt_feb_2025) * - [ns_ifbench](#nemo-skills-ns-ifbench) - IFBench - Instruction Following Benchmark * - [ns_ifeval](#nemo-skills-ns-ifeval) - IFEval - Instruction-Following Evaluation for Large Language Models * - [ns_livecodebench](#nemo-skills-ns-livecodebench) - LiveCodeBench v6 * - [ns_livecodebench_aa](#nemo-skills-ns-livecodebench-aa) - LiveCodeBench with AA custom prompt format (315 problems from July 2024 to Dec 2024, release_v5) * - [ns_livecodebench_v5](#nemo-skills-ns-livecodebench-v5) - LiveCodeBench v5 * - [ns_mmlu](#nemo-skills-ns-mmlu) - MMLU * - [ns_mmlu_pro](#nemo-skills-ns-mmlu-pro) - MMLU-Pro * - [ns_mmlu_prox](#nemo-skills-ns-mmlu-prox) - MMLU-ProX * - [ns_ruler](#nemo-skills-ns-ruler) - RULER - Long Context Understanding * - [ns_scicode](#nemo-skills-ns-scicode) - SciCode * - [ns_wmt24pp](#nemo-skills-ns-wmt24pp) - WMT24++ ``` (nemo-skills-ns-aa-lcr)= ## ns_aa_lcr AA-LCR ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_aa_lcr` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 
2>&1 & SANDBOX_PID=$! && sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: aalcr extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: true judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: 0.0 top_p: 1.0 max_new_tokens: 4096 args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_aa_lcr target: {} ``` ::: :::: --- (nemo-skills-ns-aime2024)= ## ns_aime2024 AIME2024 ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_aime2024` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: aime24 extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: true judge: url: null model_id: null api_key: null generation_type: math_judge random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_aime2024 target: {} ``` ::: :::: --- (nemo-skills-ns-aime2025)= ## ns_aime2025 AIME2025 ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_aime2025` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: aime25 extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: true judge: url: null model_id: null api_key: null generation_type: math_judge random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_aime2025 target: {} ``` ::: :::: --- (nemo-skills-ns-bfcl-v3)= ## ns_bfcl_v3 BFCLv3 ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_bfcl_v3` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: bfcl_v3 extra: use_sandbox: false num_repeats: null prompt_config: null args: ++use_client_parsing=False system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_bfcl_v3 target: {} ``` ::: :::: --- (nemo-skills-ns-bfcl-v4)= ## ns_bfcl_v4 BFCLv4 ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_bfcl_v4` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: bfcl_v4 extra: use_sandbox: false num_repeats: null prompt_config: null args: ++use_client_parsing=False system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_bfcl_v4 target: {} ``` ::: :::: --- (nemo-skills-ns-gpqa)= ## ns_gpqa GPQA Diamond ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_gpqa` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: gpqa extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_gpqa target: {} ``` ::: :::: --- (nemo-skills-ns-hle)= ## ns_hle Humanity's Last Exam ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_hle` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: hle extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_hle target: {} ``` ::: :::: --- (nemo-skills-ns-hle-aa)= ## ns_hle_aa Humanity's Last Exam aligned with AA ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_hle_aa` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: hle extra: use_sandbox: false num_repeats: 1 prompt_config: null args: null system_message: null dataset_split: null judge_support: true judge: url: https://inference-api.nvidia.com/v1 model_id: us/azure/openai/gpt-4.1 api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_hle_aa target: {} ``` ::: :::: --- (nemo-skills-ns-hmmt-feb2025)= ## ns_hmmt_feb2025 HMMT February 2025 (MathArena/hmmt_feb_2025) ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_hmmt_feb2025` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: hmmt_feb25 extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: true judge: url: null model_id: null api_key: null generation_type: math_judge random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_hmmt_feb2025 target: {} ``` ::: :::: --- (nemo-skills-ns-ifbench)= ## ns_ifbench IFBench - Instruction Following Benchmark ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_ifbench` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: ifbench extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_ifbench target: {} ``` ::: :::: --- (nemo-skills-ns-ifeval)= ## ns_ifeval IFEval - Instruction-Following Evaluation for Large Language Models ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_ifeval` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: ifeval extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_ifeval target: {} ``` ::: :::: --- (nemo-skills-ns-livecodebench)= ## ns_livecodebench LiveCodeBench v6 ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_livecodebench` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults

```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: livecodebench
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: test_v6_2408_2505
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_livecodebench
target: {}
```
:::
::::

---

(nemo-skills-ns-livecodebench-aa)=
## ns_livecodebench_aa

LiveCodeBench with AA custom prompt format (315 problems from July 2024 to Dec 2024, release_v5)

::::{tab-set}
:::{tab-item} Container

**Harness:** `nemo_skills`

**Container:**

```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```

**Container Digest:**

```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```

**Container Arch:** `multiarch`

**Task Type:** `ns_livecodebench_aa`
:::
:::{tab-item} Command

```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults

```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: livecodebench
    extra:
      use_sandbox: false
      num_repeats: 3
      prompt_config: /nemo_run/code/eval_factory_prompts/livecodebench-aa.yaml
      args: null
      system_message: null
      dataset_split: test_v5_2407_2412
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_livecodebench_aa
target: {}
```
:::
::::

---

(nemo-skills-ns-livecodebench-v5)=
## ns_livecodebench_v5

LiveCodeBench v5

::::{tab-set}
:::{tab-item} Container

**Harness:** `nemo_skills`

**Container:**

```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```

**Container Digest:**

```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```

**Container Arch:** `multiarch`

**Task Type:** `ns_livecodebench_v5`
:::
:::{tab-item} Command

```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults

```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: livecodebench
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: test_v5_2407_2412
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_livecodebench_v5
target: {}
```
:::
::::

---

(nemo-skills-ns-mmlu)=
## ns_mmlu

MMLU

::::{tab-set}
:::{tab-item} Container

**Harness:** `nemo_skills`

**Container:**

```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```

**Container Digest:**

```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```

**Container Arch:** `multiarch`

**Task Type:** `ns_mmlu`
:::
:::{tab-item} Command

```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults

```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: mmlu
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_mmlu
target: {}
```
:::
::::

---

(nemo-skills-ns-mmlu-pro)=
## ns_mmlu_pro

MMLU-Pro

::::{tab-set}
:::{tab-item} Container

**Harness:** `nemo_skills`

**Container:**

```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```

**Container Digest:**

```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```

**Container Arch:** `multiarch`

**Task Type:** `ns_mmlu_pro`
:::
:::{tab-item} Command

```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults

```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: mmlu-pro
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_mmlu_pro
target: {}
```
:::
::::

---

(nemo-skills-ns-mmlu-prox)=
## ns_mmlu_prox

MMLU-ProX

::::{tab-set}
:::{tab-item} Container

**Harness:** `nemo_skills`

**Container:**

```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```

**Container Digest:**

```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```

**Container Arch:** `multiarch`

**Task Type:** `ns_mmlu_prox`
:::
:::{tab-item} Command

```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %}
```
:::
:::{tab-item} Defaults

```yaml
framework_name: nemo_skills
pkg_name: nemo_skills
config:
  params:
    parallelism: 16
    task: mmlu-prox
    extra:
      use_sandbox: false
      num_repeats: null
      prompt_config: null
      args: null
      system_message: null
      dataset_split: null
      judge_support: false
      judge:
        url: null
        model_id: null
        api_key: null
        generation_type: null
        random_seed: 1234
        temperature: null
        top_p: null
        max_new_tokens: null
        args: null
        parallelism: null
      ruler:
        data_dir: null
        cluster: null
        setup: null
        max_seq_length: null
        tokenizer_path: null
        template_tokens: null
        num_samples: null
        tasks: null
  supported_endpoint_types:
    - chat
  type: ns_mmlu_prox
target: {}
```
:::
::::

---

(nemo-skills-ns-ruler)=
## ns_ruler

RULER - Long Context Understanding

::::{tab-set}
:::{tab-item} Container

**Harness:** `nemo_skills`

**Container:**

```
nvcr.io/nvidia/eval-factory/nemo-skills:26.01
```

**Container Digest:**

```
sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a
```

**Container Arch:** `multiarch`

**Task Type:** `ns_ruler`
:::
:::{tab-item} Command

```bash
cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$!
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: ruler.evaluation_128k extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: /workspace/ruler_data cluster: local setup: evaluation_128k max_seq_length: 131072 tokenizer_path: null template_tokens: 50 num_samples: null tasks: null supported_endpoint_types: - completions type: ns_ruler target: {} ``` ::: :::: --- (nemo-skills-ns-scicode)= ## ns_scicode SciCode ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_scicode` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: scicode extra: use_sandbox: true num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_scicode target: {} ``` ::: :::: --- (nemo-skills-ns-wmt24pp)= ## ns_wmt24pp WMT24++ ::::{tab-set} :::{tab-item} Container **Harness:** `nemo_skills` **Container:** ``` nvcr.io/nvidia/eval-factory/nemo-skills:26.01 ``` **Container Digest:** ``` sha256:43e2c4d6e197744f7fd0a874d06c5600a8b46b54e16d333c0ebf057b6d54635a ``` **Container Arch:** `multiarch` **Task Type:** `ns_wmt24pp` ::: :::{tab-item} Command ```bash cd /nemo_run/code && {% if config.params.extra.use_sandbox %}python -m nemo_skills.code_execution.local_sandbox.local_sandbox_server > {{config.output_dir}}/sandbox.log 2>&1 & SANDBOX_PID=$! 
&& sleep 3 && {% endif %}{% if not config.params.task.startswith('ruler') %} ns prepare_data {{config.params.task}} {% else %} mkdir -p {{config.params.extra.ruler.data_dir}} && ln -sf {{config.params.extra.ruler.data_dir}} /nemo_run/code/ruler_data && ns prepare_data ruler --data_dir={{config.params.extra.ruler.data_dir}} --cluster={{config.params.extra.ruler.cluster}} --setup={{config.params.extra.ruler.setup}} --max_seq_length={{config.params.extra.ruler.max_seq_length}} --tokenizer_path={{config.params.extra.ruler.tokenizer_path}} {% if config.params.extra.ruler.template_tokens is not none %}--template_tokens={{config.params.extra.ruler.template_tokens}}{% endif %} {% if config.params.extra.ruler.num_samples is not none %}--num_samples={{config.params.extra.ruler.num_samples}}{% elif config.params.limit_samples is not none %}--num_samples={{config.params.limit_samples}}{% endif %} {% if config.params.extra.ruler.tasks is not none %}--tasks {% for task in config.params.extra.ruler.tasks %}{{task}}{% if not loop.last %} {% endif %}{% endfor %}{% endif %} {% endif %} && ns eval --server_type=openai --model={{target.api_endpoint.model_id}} --server_address={{target.api_endpoint.url}} --benchmarks={{config.params.task}}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}:{{config.params.extra.num_repeats}}{% endif %} --output_dir={{config.output_dir}} {% if config.params.extra.dataset_split is not none %}--split={{config.params.extra.dataset_split}}{% endif %} {% if config.params.extra.ruler.data_dir is not none %}--data_dir={{config.params.extra.ruler.data_dir}}{% endif %} ++server.api_key_env_var={% if target.api_endpoint.api_key_name is not none %}{{target.api_endpoint.api_key_name}}{% else %}DUMMY_API_KEY{% endif %} {% if config.params.max_new_tokens is not none %}++inference.tokens_to_generate={{config.params.max_new_tokens}}{% endif %} {% if config.params.extra.system_message is not none %} 
++system_message='{{config.params.extra.system_message}}' {% endif %} {% if config.params.limit_samples is not none %}++max_samples={{config.params.limit_samples}}{% endif %} {% if config.params.parallelism is not none %}{% if config.params.extra.num_repeats is not none and config.params.extra.num_repeats > 1 %}++max_concurrent_requests={{[(config.params.parallelism / config.params.extra.num_repeats) | int, 1] | max}}{% else %}++max_concurrent_requests={{config.params.parallelism | int}}{% endif %}{% endif %} {% if config.params.temperature is not none %}++inference.temperature={{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}++inference.top_p={{config.params.top_p}}{% endif %} {% if config.params.extra.prompt_config is not none %}++prompt_config={{config.params.extra.prompt_config}}{% endif %} {% if config.params.extra.ruler.tokenizer_path is not none %}++tokenizer={{config.params.extra.ruler.tokenizer_path}}{% endif %} {% if config.params.extra.args is not none %} {{config.params.extra.args}} {% endif %} {% if config.params.extra.judge_support and config.params.extra.judge.url is not none %} --judge_model={{config.params.extra.judge.model_id}} --judge_server_address={{config.params.extra.judge.url}} --judge_server_type=openai {% if config.params.extra.judge.generation_type is not none %} --judge_generation_type={{config.params.extra.judge.generation_type}} {% endif %} --extra_judge_args="++server.api_key_env_var={% if config.params.extra.judge.api_key is not none %}{{config.params.extra.judge.api_key}}{% else %}DUMMY_API_KEY{% endif %} {%- if config.params.extra.judge.temperature is not none %} ++inference.temperature={{config.params.extra.judge.temperature}}{% endif %} {%- if config.params.extra.judge.top_p is not none %} ++inference.top_p={{config.params.extra.judge.top_p}}{% endif %} {%- if config.params.extra.judge.max_new_tokens is not none %} ++inference.tokens_to_generate={{config.params.extra.judge.max_new_tokens}}{% endif 
%} {%- if config.params.extra.judge.parallelism is not none %} ++max_concurrent_requests={{config.params.extra.judge.parallelism}}{% endif %} {%- if config.params.extra.judge.args is not none %} {{config.params.extra.judge.args}}{% endif %}" {% endif %} {% if config.params.extra.use_sandbox %} ; EXIT_CODE=$? ; kill $SANDBOX_PID 2>/dev/null || true ; exit $EXIT_CODE{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: nemo_skills pkg_name: nemo_skills config: params: parallelism: 16 task: wmt24pp extra: use_sandbox: false num_repeats: null prompt_config: null args: null system_message: null dataset_split: null judge_support: false judge: url: null model_id: null api_key: null generation_type: null random_seed: 1234 temperature: null top_p: null max_new_tokens: null args: null parallelism: null ruler: data_dir: null cluster: null setup: null max_seq_length: null tokenizer_path: null template_tokens: null num_samples: null tasks: null supported_endpoint_types: - chat type: ns_wmt24pp target: {} ``` ::: :::: # profbench This page contains all evaluation tasks for the **profbench** harness. 
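The profbench commands on this page chain stages by picking up the most recently written JSONL file in the output directory, e.g. `GENERATION_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1)`. A minimal sketch of that hand-off, using a throwaway directory and placeholder file names:

```shell
# Two result files written at different times; the newer one should win
# the `ls -t | head -1` pick, exactly as in the generated commands.
OUT_DIR=$(mktemp -d)
echo '{"report": "first"}'  > "$OUT_DIR/run_a.jsonl"
sleep 1
echo '{"report": "second"}' > "$OUT_DIR/run_b.jsonl"

# Same selection the command template uses to feed the next stage.
GENERATION_OUTPUT=$(ls -t "$OUT_DIR"/*.jsonl | head -1)
echo "$GENERATION_OUTPUT"
```

Because this hand-off relies on modification time, avoid writing unrelated `.jsonl` files into the output directory between stages.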
```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [llm_judge](#profbench-llm-judge) - Run LLM judge on provided ProfBench reports and score them * - [report_generation](#profbench-report-generation) - Generate professional reports and evaluate them (full pipeline) ``` (profbench-llm-judge)= ## llm_judge Run LLM judge on provided ProfBench reports and score them ::::{tab-set} :::{tab-item} Container **Harness:** `profbench` **Container:** ``` nvcr.io/nvidia/eval-factory/profbench:26.01 ``` **Container Digest:** ``` sha256:7b2766affe4c2070ec803a893f7bf1ff2fc735df562aa520ec910c9ef58d3598 ``` **Container Arch:** `multiarch` **Task Type:** `llm_judge` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.run_generation %} python -m profbench.run_report_generation \ --model {{target.api_endpoint.model_id}} \ --library {{config.params.extra.library}} \ --timeout {{config.params.request_timeout}} \ --parallel {{config.params.parallelism}} \ --retry-attempts {{config.params.max_retries}} \ --folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.version is not none %} --version {{config.params.extra.version}}{% endif %}{% if config.params.extra.web_search %} --web-search{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens 
{{config.params.max_new_tokens}}{% endif %} && GENERATION_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) && {% endif %} {% if config.params.extra.run_judge_generated %} python -m profbench.run_best_llm_judge_on_generated_reports \ --filename $GENERATION_OUTPUT \ --api-key $API_KEY \ --model {{target.api_endpoint.model_id}} \ --library {{config.params.extra.library}} \ --timeout {{config.params.request_timeout}} \ --parallel {{config.params.parallelism}} \ --retry-attempts {{config.params.max_retries}} \ --output-folder {{config.output_dir}}/judgements{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} && JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/judgements/*.jsonl | head -1) && python -m profbench.score_report_generation $JUDGE_OUTPUT {% endif %} {% if config.params.extra.run_judge_provided %} python -m profbench.run_llm_judge_on_provided_reports \ --model {{target.api_endpoint.model_id}} \ --library {{config.params.extra.library}} \ --timeout {{config.params.request_timeout}} \ --parallel {{config.params.parallelism}} \ --retry-attempts {{config.params.max_retries}} \ --folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.extra.debug %} --debug{% endif %}{% if config.params.limit_samples is not none %} --limit-samples 
{{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} && JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) && python -m profbench.score_llm_judge $JUDGE_OUTPUT {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: profbench pkg_name: profbench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 temperature: 0.0 request_timeout: 600 top_p: 1.0e-05 extra: run_generation: false run_judge_generated: false run_judge_provided: true library: openai version: lite web_search: false reasoning: false reasoning_effort: null debug: false supported_endpoint_types: - chat type: llm_judge target: api_endpoint: {} ``` ::: :::: --- (profbench-report-generation)= ## report_generation Generate professional reports and evaluate them (full pipeline) ::::{tab-set} :::{tab-item} Container **Harness:** `profbench` **Container:** ``` nvcr.io/nvidia/eval-factory/profbench:26.01 ``` **Container Digest:** ``` sha256:7b2766affe4c2070ec803a893f7bf1ff2fc735df562aa520ec910c9ef58d3598 ``` **Container Arch:** `multiarch` **Task Type:** `report_generation` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %} export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.run_generation %} python -m profbench.run_report_generation \ --model {{target.api_endpoint.model_id}} \ --library {{config.params.extra.library}} \ --timeout {{config.params.request_timeout}} \ --parallel {{config.params.parallelism}} \ --retry-attempts {{config.params.max_retries}} \ --folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.version is not none %} 
--version {{config.params.extra.version}}{% endif %}{% if config.params.extra.web_search %} --web-search{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} && GENERATION_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) && {% endif %} {% if config.params.extra.run_judge_generated %} python -m profbench.run_best_llm_judge_on_generated_reports \ --filename $GENERATION_OUTPUT \ --api-key $API_KEY \ --model {{target.api_endpoint.model_id}} \ --library {{config.params.extra.library}} \ --timeout {{config.params.request_timeout}} \ --parallel {{config.params.parallelism}} \ --retry-attempts {{config.params.max_retries}} \ --output-folder {{config.output_dir}}/judgements{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} && JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/judgements/*.jsonl | head -1) && python -m profbench.score_report_generation $JUDGE_OUTPUT {% endif %} {% if config.params.extra.run_judge_provided %} python -m profbench.run_llm_judge_on_provided_reports \ --model 
{{target.api_endpoint.model_id}} \ --library {{config.params.extra.library}} \ --timeout {{config.params.request_timeout}} \ --parallel {{config.params.parallelism}} \ --retry-attempts {{config.params.max_retries}} \ --folder {{config.output_dir}}{% if target.api_endpoint.url is not none %} --base-url {{target.api_endpoint.url}}{% endif %}{% if config.params.extra.reasoning %} --reasoning{% endif %}{% if config.params.extra.reasoning_effort is not none %} --reasoning-effort {{config.params.extra.reasoning_effort}}{% endif %}{% if config.params.extra.debug %} --debug{% endif %}{% if config.params.limit_samples is not none %} --limit-samples {{config.params.limit_samples}}{% endif %}{% if config.params.temperature is not none %} --temperature {{config.params.temperature}}{% endif %}{% if config.params.top_p is not none %} --top-p {{config.params.top_p}}{% endif %}{% if config.params.max_new_tokens is not none %} --max-tokens {{config.params.max_new_tokens}}{% endif %} && JUDGE_OUTPUT=$(ls -t {{config.output_dir}}/*.jsonl | head -1) && python -m profbench.score_llm_judge $JUDGE_OUTPUT {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: profbench pkg_name: profbench config: params: max_new_tokens: 4096 max_retries: 5 parallelism: 10 temperature: 0.0 request_timeout: 600 top_p: 1.0e-05 extra: run_generation: true run_judge_generated: true run_judge_provided: false library: openai version: lite web_search: false reasoning: false reasoning_effort: null debug: false supported_endpoint_types: - chat type: report_generation target: api_endpoint: {} ``` ::: :::: # ruler This page contains all evaluation tasks for the **ruler** harness. 
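The task variants listed below encode the context length in the task name, and the default configs map that suffix to a decimal `max_seq_length` (for example `128k` maps to `128000` and `1m` to `1000000`). A small sketch of that naming convention (the helper function is illustrative, not part of the harness):

```shell
# Map a ruler task name to the decimal max_seq_length used in the
# default configs on this page (e.g. ruler-128k-chat -> 128000).
ruler_max_seq_length() {
  # Strip the "ruler-" prefix and the "-chat"/"-completions" mode suffix.
  suffix=$(echo "$1" | sed -E 's/^ruler-//; s/-(chat|completions)$//')
  case "$suffix" in
    *k) echo $(( ${suffix%k} * 1000 )) ;;
    *m) echo $(( ${suffix%m} * 1000000 )) ;;
    *)  echo "no context length in task name: $1" >&2; return 1 ;;
  esac
}

ruler_max_seq_length ruler-128k-chat        # 128000
ruler_max_seq_length ruler-1m-completions   # 1000000
```

The bare `ruler-chat` and `ruler-completions` tasks carry no length suffix, which is why `max_seq_length` must be supplied explicitly for them.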
```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [ruler-128k-chat](#ruler-ruler-128k-chat) - RULER with context length of 128k (chat mode) * - [ruler-128k-completions](#ruler-ruler-128k-completions) - RULER with context length of 128k (completions mode) * - [ruler-16k-chat](#ruler-ruler-16k-chat) - RULER with context length of 16k (chat mode) * - [ruler-16k-completions](#ruler-ruler-16k-completions) - RULER with context length of 16k (completions mode) * - [ruler-1m-chat](#ruler-ruler-1m-chat) - RULER with context length of 1M (chat mode) * - [ruler-1m-completions](#ruler-ruler-1m-completions) - RULER with context length of 1M (completions mode) * - [ruler-256k-chat](#ruler-ruler-256k-chat) - RULER with context length of 256k (chat mode) * - [ruler-256k-completions](#ruler-ruler-256k-completions) - RULER with context length of 256k (completions mode) * - [ruler-32k-chat](#ruler-ruler-32k-chat) - RULER with context length of 32k (chat mode) * - [ruler-32k-completions](#ruler-ruler-32k-completions) - RULER with context length of 32k (completions mode) * - [ruler-4k-chat](#ruler-ruler-4k-chat) - RULER with context length of 4k (chat mode) * - [ruler-4k-completions](#ruler-ruler-4k-completions) - RULER with context length of 4k (completions mode) * - [ruler-512k-chat](#ruler-ruler-512k-chat) - RULER with context length of 512k (chat mode) * - [ruler-512k-completions](#ruler-ruler-512k-completions) - RULER with context length of 512k (completions mode) * - [ruler-64k-chat](#ruler-ruler-64k-chat) - RULER with context length of 64k (chat mode) * - [ruler-64k-completions](#ruler-ruler-64k-completions) - RULER with context length of 64k (completions mode) * - [ruler-8k-chat](#ruler-ruler-8k-chat) - RULER with context length of 8k (chat mode) * - [ruler-8k-completions](#ruler-ruler-8k-completions) - RULER with context length of 8k (completions mode) * - [ruler-chat](#ruler-ruler-chat) - RULER (chat mode) without specified context length. 
A user must explicitly specify the `max_seq_length` parameter. * - [ruler-completions](#ruler-ruler-completions) - RULER (completions mode) without specified context length. A user must explicitly specify the `max_seq_length` parameter. ``` (ruler-ruler-128k-chat)= ## ruler-128k-chat RULER with context length of 128k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-128k-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf
max_seq_length: 128000 subtasks: all supported_endpoint_types: - chat type: ruler-128k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-128k-completions)= ## ruler-128k-completions RULER with context length of 128k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-128k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 128000 subtasks: all supported_endpoint_types: - completions type: 
ruler-128k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-16k-chat)= ## ruler-16k-chat RULER with context length of 16k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-16k-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 16000 subtasks: all supported_endpoint_types: - chat type: ruler-16k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-16k-completions)= ## 
ruler-16k-completions RULER with context length of 16k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-16k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 16000 subtasks: all supported_endpoint_types: - completions type: ruler-16k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-1m-chat)= ## ruler-1m-chat RULER with context length of 1M (chat mode) ::::{tab-set} :::{tab-item} 
Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-1m-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 1000000 subtasks: all supported_endpoint_types: - chat type: ruler-1m-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-1m-completions)= ## ruler-1m-completions RULER with context length of 1M (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 
``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-1m-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 1000000 subtasks: all supported_endpoint_types: - completions type: ruler-1m-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-256k-chat)= ## ruler-256k-chat RULER with context length of 256k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` 
sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-256k-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 256000 subtasks: all supported_endpoint_types: - chat type: ruler-256k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-256k-completions)= ## ruler-256k-completions RULER with context length of 256k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container 
Arch:** `multiarch` **Task Type:** `ruler-256k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 256000 subtasks: all supported_endpoint_types: - completions type: ruler-256k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-32k-chat)= ## ruler-32k-chat RULER with context length of 32k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-32k-chat` ::: :::{tab-item} Command ```bash python -c 
"import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 32000 subtasks: all supported_endpoint_types: - chat type: ruler-32k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-32k-completions)= ## ruler-32k-completions RULER with context length of 32k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-32k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if 
target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 32000 subtasks: all supported_endpoint_types: - completions type: ruler-32k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-4k-chat)= ## ruler-4k-chat RULER with context length of 4k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-4k-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% 
endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 4000 subtasks: all supported_endpoint_types: - chat type: ruler-4k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-4k-completions)= ## ruler-4k-completions RULER with context length of 4k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-4k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks 
"{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 4000 subtasks: all supported_endpoint_types: - completions type: ruler-4k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-512k-chat)= ## ruler-512k-chat RULER with context length of 512k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-512k-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model 
{{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 512000 subtasks: all supported_endpoint_types: - chat type: ruler-512k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-512k-completions)= ## ruler-512k-completions RULER with context length of 512k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-512k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == 
"completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 512000 subtasks: all supported_endpoint_types: - completions type: ruler-512k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-64k-chat)= ## ruler-64k-chat RULER with context length of 64k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-64k-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path 
"{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 64000 subtasks: all supported_endpoint_types: - chat type: ruler-64k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-64k-completions)= ## ruler-64k-completions RULER with context length of 64k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-64k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type 
"{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 64000 subtasks: all supported_endpoint_types: - completions type: ruler-64k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-8k-chat)= ## ruler-8k-chat RULER with context length of 8k (chat mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-8k-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p 
{{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 8000 subtasks: all supported_endpoint_types: - chat type: ruler-8k-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-8k-completions)= ## ruler-8k-completions RULER with context length of 8k (completions mode) ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-8k-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples 
{{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: 8000 subtasks: all supported_endpoint_types: - completions type: ruler-8k-completions target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-chat)= ## ruler-chat RULER (chat mode) without a preset context length. You must explicitly set the `max_seq_length` parameter. ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-chat` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %}
{% if config.params.extra.max_seq_length is defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: null subtasks: all supported_endpoint_types: - chat type: ruler-chat target: api_endpoint: {} ``` ::: :::: --- (ruler-ruler-completions)= ## ruler-completions RULER (completions mode) without a preset context length. You must explicitly set the `max_seq_length` parameter. ::::{tab-set} :::{tab-item} Container **Harness:** `ruler` **Container:** ``` nvcr.io/nvidia/eval-factory/long-context-eval:26.01 ``` **Container Digest:** ``` sha256:461a74e48403c66058797cbfb6f42b1cc769b33f92dbd0503706586b2eb84689 ``` **Container Arch:** `multiarch` **Task Type:** `ruler-completions` ::: :::{tab-item} Command ```bash python -c "import nltk;nltk.download('punkt_tab');nltk.download('punkt')" && {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} &&{% endif %} long_context_eval --url {{target.api_endpoint.url}} --tasks "{{config.params.extra.subtasks}}" --result_dir {{config.output_dir}} --model {{target.api_endpoint.model_id}} --mode {% if target.api_endpoint.type == "completions" %}completion{% elif target.api_endpoint.type == "chat" %}chat{% endif %} --tokenizer_path "{{config.params.extra.tokenizer}}" --tokenizer_type "{{config.params.extra.tokenizer_backend}}" --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} {% if config.params.limit_samples is not none %}--num_samples {{config.params.limit_samples}}{% endif %} {% if config.params.extra.max_seq_length is
defined %}--max_seq_length {{config.params.extra.max_seq_length}}{% endif %} --timeout {{config.params.request_timeout}} --threads {{config.params.parallelism}} {% if config.params.max_new_tokens is not none %}--tokens_to_generate {{config.params.max_new_tokens}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: ruler pkg_name: long_context_eval config: params: parallelism: 1 temperature: 0.0 request_timeout: 300 top_p: 0.0001 extra: tokenizer: null tokenizer_backend: hf max_seq_length: null subtasks: all supported_endpoint_types: - completions type: ruler-completions target: api_endpoint: {} ``` ::: :::: # safety_eval This page contains all evaluation tasks for the **safety_eval** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [aegis_v2](#safety-eval-aegis-v2) - Aegis V2 without evaluation of reasoning traces. This version is used by the NeMo Safety Toolkit. * - [aegis_v2_reasoning](#safety-eval-aegis-v2-reasoning) - Aegis V2 with evaluation of reasoning traces. * - [wildguard](#safety-eval-wildguard) - Wildguard ``` (safety-eval-aegis-v2)= ## aegis_v2 Aegis V2 without evaluation of reasoning traces. This version is used by the NeMo Safety Toolkit.
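This task scores responses with an LLM judge, and its defaults ship with the judge endpoint unset (`url: null`, `model_id: null`), so a run needs those values supplied. A minimal sketch of such a configuration override, mirroring the structure shown in the Defaults tab; the URL, model name, and environment-variable name below are placeholders, not shipped values:

```yaml
# Hypothetical judge override for aegis_v2 -- all values are placeholders.
config:
  params:
    extra:
      judge:
        url: https://your-judge-host/v1/chat/completions  # placeholder judge endpoint
        model_id: your-judge-model                        # placeholder judge model
        # Per the Command tab, api_key names an environment variable whose
        # value is exported as JUDGE_API_KEY inside the container.
        api_key: MY_JUDGE_API_KEY
```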
::::{tab-set} :::{tab-item} Container **Harness:** `safety_eval` **Container:** ``` nvcr.io/nvidia/eval-factory/safety-harness:25.11 ``` **Container Digest:** ``` sha256:08eeb3f5c3282522ca30da7d3ddc2cab1a48909be05ba561a0dae9a299c637f0 ``` **Container Arch:** `multiarch` **Task Type:** `aegis_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset {{config.params.extra.dataset}}{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: safety_eval pkg_name: safety_eval config: params: max_new_tokens: 6144 max_retries: 5 parallelism: 8 task: aegis_v2 temperature: 0.6 request_timeout: 30 top_p: 0.95 extra: judge: url: null model_id: null api_key: null parallelism: 
32 request_timeout: 60 max_retries: 16 evaluate_reasoning_traces: false supported_endpoint_types: - chat - completions type: aegis_v2 target: api_endpoint: stream: false ``` ::: :::: --- (safety-eval-aegis-v2-reasoning)= ## aegis_v2_reasoning Aegis V2 with reasoning-trace evaluation enabled. ::::{tab-set} :::{tab-item} Container **Harness:** `safety_eval` **Container:** ``` nvcr.io/nvidia/eval-factory/safety-harness:25.11 ``` **Container Digest:** ``` sha256:08eeb3f5c3282522ca30da7d3ddc2cab1a48909be05ba561a0dae9a299c637f0 ``` **Container Arch:** `multiarch` **Task Type:** `aegis_v2_reasoning` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset {{config.params.extra.dataset}}{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {%
endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: safety_eval pkg_name: safety_eval config: params: max_new_tokens: 6144 max_retries: 5 parallelism: 8 task: aegis_v2 temperature: 0.6 request_timeout: 30 top_p: 0.95 extra: judge: url: null model_id: null api_key: null parallelism: 32 request_timeout: 60 max_retries: 16 evaluate_reasoning_traces: true supported_endpoint_types: - chat - completions type: aegis_v2_reasoning target: api_endpoint: stream: false ``` ::: :::: --- (safety-eval-wildguard)= ## wildguard Wildguard ::::{tab-set} :::{tab-item} Container **Harness:** `safety_eval` **Container:** ``` nvcr.io/nvidia/eval-factory/safety-harness:25.11 ``` **Container Digest:** ``` sha256:08eeb3f5c3282522ca30da7d3ddc2cab1a48909be05ba561a0dae9a299c637f0 ``` **Container Arch:** `multiarch` **Task Type:** `wildguard` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset {{config.params.extra.dataset}}{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if 
config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: safety_eval pkg_name: safety_eval config: params: max_new_tokens: 6144 max_retries: 5 parallelism: 8 task: wildguard temperature: 0.6 request_timeout: 30 top_p: 0.95 extra: judge: url: null model_id: null api_key: null parallelism: 32 request_timeout: 60 max_retries: 16 supported_endpoint_types: - chat - completions type: wildguard target: api_endpoint: stream: false ``` ::: :::: # scicode This page contains all evaluation tasks for the **scicode** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [scicode](#scicode-scicode) - - SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. - Includes default system prompt ("You are a helpful assistant."). * - [scicode_aa_v2](#scicode-scicode-aa-v2) - - SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. - This variant mimics the setup used by Artificial Analysis in their Intelligence Benchmark (v2). - It includes scientist-annotated background in the prompts and uses all available problems for evaluation (including "dev" set). - Does not include a default system prompt ("You are a helpful assistant."). * - [scicode_background](#scicode-scicode-background) - - SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. - This variant includes scientist-annotated background in the prompts. - Includes default system prompt ("You are a helpful assistant.").
``` (scicode-scicode)= ## scicode - SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. - Includes default system prompt ("You are a helpful assistant."). ::::{tab-set} :::{tab-item} Container **Harness:** `scicode` **Container:** ``` nvcr.io/nvidia/eval-factory/scicode:26.01 ``` **Container Digest:** ``` sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0 ``` **Container Arch:** `multiarch` **Task Type:** `scicode` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: scicode pkg_name: scicode 
config: params: max_new_tokens: 2048 max_retries: 2 parallelism: 1 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: with_background: false include_dev: false n_samples: 1 eval_threads: null include_system_prompt: true regex_path: null prompt_template_type: null supported_endpoint_types: - chat type: scicode target: api_endpoint: stream: false ``` ::: :::: --- (scicode-scicode-aa-v2)= ## scicode_aa_v2 - SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. - This variant mimics the setup used by Artificial Analysis in their Intelligence Benchmark (v2). - It includes scientist-annotated background in the prompts and uses all available problems for evaluation (including "dev" set). - Does not include a default system prompt ("You are a helpful assistant."). ::::{tab-set} :::{tab-item} Container **Harness:** `scicode` **Container:** ``` nvcr.io/nvidia/eval-factory/scicode:26.01 ``` **Container Digest:** ``` sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0 ``` **Container Arch:** `multiarch` **Task Type:** `scicode_aa_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if
config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: scicode pkg_name: scicode config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 1 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: with_background: true include_dev: true n_samples: 3 eval_threads: null include_system_prompt: false regex_path: aa_regex.txt prompt_template_type: background_comment_template.txt supported_endpoint_types: - chat type: scicode_aa_v2 target: api_endpoint: stream: false ``` ::: :::: --- (scicode-scicode-background)= ## scicode_background - SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. - This variant includes scientist-annotated background in the prompts. - Includes default system prompt ("You are a helpful assistant."). 
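In practice, only a few fields usually need to change from the defaults shown in the tabs that follow. As an illustrative sketch (the endpoint URL, model name, and output path are placeholders, not values from this reference), a minimal run configuration overriding this variant's defaults might look like:

```yaml
# Hypothetical override for the scicode_background task.
# Fields omitted here keep the values listed in the Defaults tab.
config:
  type: scicode_background
  output_dir: ./results/scicode_background   # placeholder path
  params:
    limit_samples: 10    # smoke-test on a small subset first
    parallelism: 4       # concurrent requests to the model endpoint
target:
  api_endpoint:
    url: https://your-endpoint.example.com/v1/chat/completions   # placeholder
    model_id: my-model                                           # placeholder
```

Any field not overridden falls back to the default, so the command template below receives a fully populated configuration either way.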
::::{tab-set} :::{tab-item} Container **Harness:** `scicode` **Container:** ``` nvcr.io/nvidia/eval-factory/scicode:26.01 ``` **Container Digest:** ``` sha256:f5c12499db7d8b415321c4242e5625ed69affdc1632056326790e5d55a4656e0 ``` **Container Arch:** `multiarch` **Task Type:** `scicode_background` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} scicode_eval --model {{target.api_endpoint.model_id}} --url {{target.api_endpoint.url}} --output-dir {{config.output_dir}}/scicode_results --log-dir {{config.output_dir}}/logs {% if config.params.temperature is not none %}--temperature={{config.params.temperature}}{% endif %} {% if config.params.limit_samples is not none %}--limit-samples={{config.params.limit_samples}}{% endif %} --n-samples={{config.params.extra.n_samples}} --extra-params top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},max_tokens={{config.params.max_new_tokens}},max_retries={{config.params.max_retries}},include_system_prompt={{config.params.extra.include_system_prompt}} {% if config.params.extra.with_background %}--with-background {% endif %} {% if config.params.extra.include_dev %}--include-dev{% endif %} {% if config.params.extra.eval_threads is not none %}--eval-threads={{config.params.extra.eval_threads}}{% endif %} {% if config.params.extra.regex_path is not none %}--regex-path={{config.params.extra.regex_path}}{% endif %} {% if config.params.extra.prompt_template_type is not none %}--prompt-template-type={{config.params.extra.prompt_template_type}}{% endif %} --concurrent-requests={{config.params.parallelism}} ``` ::: :::{tab-item} Defaults ```yaml framework_name: scicode pkg_name: scicode config: params: max_new_tokens: 2048 max_retries: 2 parallelism: 1 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: with_background: true include_dev: false n_samples: 1 eval_threads: null include_system_prompt: true regex_path: null 
prompt_template_type: null supported_endpoint_types: - chat type: scicode_background target: api_endpoint: stream: false ``` ::: :::: # simple_evals This page contains all evaluation tasks for the **simple_evals** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [AA_AIME_2024](#simple-evals-aa-aime-2024) - AIME 2024 questions, math, using Artificial Analysis's setup. * - [AA_math_test_500](#simple-evals-aa-math-test-500) - OpenAI math test 500, using Artificial Analysis's setup. * - [AIME_2024](#simple-evals-aime-2024) - AIME 2024 questions, math * - [AIME_2025](#simple-evals-aime-2025) - AIME 2025 questions, math * - [AIME_2025_aa_v2](#simple-evals-aime-2025-aa-v2) - AIME 2025 questions, math - params aligned with Artificial Analysis Index v2 * - [aime_2024_nemo](#simple-evals-aime-2024-nemo) - AIME 2024 questions, math, using NeMo's alignment template * - [aime_2025_nemo](#simple-evals-aime-2025-nemo) - AIME 2025 questions, math, using NeMo's alignment template * - [browsecomp](#simple-evals-browsecomp) - BrowseComp is a benchmark for measuring the ability of agents to browse the web.
* - [gpqa_diamond](#simple-evals-gpqa-diamond) - gpqa_diamond 0-shot CoT * - [gpqa_diamond_aa_v2](#simple-evals-gpqa-diamond-aa-v2) - gpqa_diamond questions with custom regex extraction patterns for AA v2 * - [gpqa_diamond_aa_v2_llama_4](#simple-evals-gpqa-diamond-aa-v2-llama-4) - gpqa_diamond questions with custom regex extraction patterns for Llama 4 * - [gpqa_diamond_aa_v3](#simple-evals-gpqa-diamond-aa-v3) - GPQA Diamond with AA v3 methodology - multi-stage regex extraction for robust answer parsing * - [gpqa_diamond_nemo](#simple-evals-gpqa-diamond-nemo) - gpqa_diamond questions, reasoning, using NeMo's alignment template * - [gpqa_extended](#simple-evals-gpqa-extended) - gpqa_extended 0-shot CoT * - [gpqa_main](#simple-evals-gpqa-main) - gpqa_main 0-shot CoT * - [healthbench](#simple-evals-healthbench) - HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. * - [healthbench_consensus](#simple-evals-healthbench-consensus) - HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by the consensus of multiple physicians. * - [healthbench_hard](#simple-evals-healthbench-hard) - HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1000 examples chosen because they are difficult for current frontier models. * - [humaneval](#simple-evals-humaneval) - HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
* - [humanevalplus](#simple-evals-humanevalplus) - HumanEvalPlus is a dataset of 164 programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. * - [math_test_500](#simple-evals-math-test-500) - OpenAI math test 500 * - [math_test_500_nemo](#simple-evals-math-test-500-nemo) - math_test_500 questions, math, using NeMo's alignment template * - [mgsm](#simple-evals-mgsm) - MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages. * - [mgsm_aa_v2](#simple-evals-mgsm-aa-v2) - MGSM is a benchmark of grade-school math problems - params aligned with Artificial Analysis Index v2 * - [mmlu](#simple-evals-mmlu) - MMLU 0-shot CoT * - [mmlu_am](#simple-evals-mmlu-am) - Global-MMLU 0-shot CoT in Amharic (am) * - [mmlu_ar](#simple-evals-mmlu-ar) - Global-MMLU 0-shot CoT in Arabic (ar) * - [mmlu_ar-lite](#simple-evals-mmlu-ar-lite) - Global-MMLU-Lite 0-shot CoT in Arabic (ar) * - [mmlu_bn](#simple-evals-mmlu-bn) - Global-MMLU 0-shot CoT in Bengali (bn) * - [mmlu_bn-lite](#simple-evals-mmlu-bn-lite) - Global-MMLU-Lite 0-shot CoT in Bengali (bn) * - [mmlu_cs](#simple-evals-mmlu-cs) - Global-MMLU 0-shot CoT in Czech (cs) * - [mmlu_de](#simple-evals-mmlu-de) - Global-MMLU 0-shot CoT in German (de) * - [mmlu_de-lite](#simple-evals-mmlu-de-lite) - Global-MMLU-Lite 0-shot CoT in German (de) * - [mmlu_el](#simple-evals-mmlu-el) - Global-MMLU 0-shot CoT in Greek (el) * - [mmlu_en](#simple-evals-mmlu-en) - Global-MMLU 0-shot CoT in English (en) * - [mmlu_en-lite](#simple-evals-mmlu-en-lite) - Global-MMLU-Lite 0-shot CoT in English (en) * - [mmlu_es](#simple-evals-mmlu-es) - Global-MMLU 0-shot CoT in Spanish (es) * - [mmlu_es-lite](#simple-evals-mmlu-es-lite) - Global-MMLU-Lite 0-shot CoT in Spanish (es) * - [mmlu_fa](#simple-evals-mmlu-fa) - Global-MMLU 0-shot CoT in Persian (fa) * -
[mmlu_fil](#simple-evals-mmlu-fil) - Global-MMLU 0-shot CoT in Filipino (fil) * - [mmlu_fr](#simple-evals-mmlu-fr) - Global-MMLU 0-shot CoT in French (fr) * - [mmlu_fr-lite](#simple-evals-mmlu-fr-lite) - Global-MMLU-Lite 0-shot CoT in French (fr) * - [mmlu_ha](#simple-evals-mmlu-ha) - Global-MMLU 0-shot CoT in Hausa (ha) * - [mmlu_he](#simple-evals-mmlu-he) - Global-MMLU 0-shot CoT in Hebrew (he) * - [mmlu_hi](#simple-evals-mmlu-hi) - Global-MMLU 0-shot CoT in Hindi (hi) * - [mmlu_hi-lite](#simple-evals-mmlu-hi-lite) - Global-MMLU-Lite 0-shot CoT in Hindi (hi) * - [mmlu_id](#simple-evals-mmlu-id) - Global-MMLU 0-shot CoT in Indonesian (id) * - [mmlu_id-lite](#simple-evals-mmlu-id-lite) - Global-MMLU-Lite 0-shot CoT in Indonesian (id) * - [mmlu_ig](#simple-evals-mmlu-ig) - Global-MMLU 0-shot CoT in Igbo (ig) * - [mmlu_it](#simple-evals-mmlu-it) - Global-MMLU 0-shot CoT in Italian (it) * - [mmlu_it-lite](#simple-evals-mmlu-it-lite) - Global-MMLU-Lite 0-shot CoT in Italian (it) * - [mmlu_ja](#simple-evals-mmlu-ja) - Global-MMLU 0-shot CoT in Japanese (ja) * - [mmlu_ja-lite](#simple-evals-mmlu-ja-lite) - Global-MMLU-Lite 0-shot CoT in Japanese (ja) * - [mmlu_ko](#simple-evals-mmlu-ko) - Global-MMLU 0-shot CoT in Korean (ko) * - [mmlu_ko-lite](#simple-evals-mmlu-ko-lite) - Global-MMLU-Lite 0-shot CoT in Korean (ko) * - [mmlu_ky](#simple-evals-mmlu-ky) - Global-MMLU 0-shot CoT in Kyrgyz (ky) * - [mmlu_llama_4](#simple-evals-mmlu-llama-4) - MMLU questions with custom regex extraction patterns for Llama 4 * - [mmlu_lt](#simple-evals-mmlu-lt) - Global-MMLU 0-shot CoT in Lithuanian (lt) * - [mmlu_mg](#simple-evals-mmlu-mg) - Global-MMLU 0-shot CoT in Malagasy (mg) * - [mmlu_ms](#simple-evals-mmlu-ms) - Global-MMLU 0-shot CoT in Malay (ms) * - [mmlu_my-lite](#simple-evals-mmlu-my-lite) - Global-MMLU-Lite 0-shot CoT in Malay (my) * - [mmlu_ne](#simple-evals-mmlu-ne) - Global-MMLU 0-shot CoT in Nepali (ne) * - [mmlu_nl](#simple-evals-mmlu-nl) - Global-MMLU 0-shot CoT in Dutch 
(nl) * - [mmlu_ny](#simple-evals-mmlu-ny) - Global-MMLU 0-shot CoT in Nyanja (ny) * - [mmlu_pl](#simple-evals-mmlu-pl) - Global-MMLU 0-shot CoT in Polish (pl) * - [mmlu_pro](#simple-evals-mmlu-pro) - MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines. * - [mmlu_pro_aa_v2](#simple-evals-mmlu-pro-aa-v2) - MMLU-Pro - params aligned with Artificial Analysis Index v2 * - [mmlu_pro_aa_v3](#simple-evals-mmlu-pro-aa-v3) - MMLU-Pro with AA v3 methodology - multi-stage regex extraction with A-J options * - [mmlu_pro_llama_4](#simple-evals-mmlu-pro-llama-4) - MMLU-Pro questions with custom regex extraction patterns for Llama 4 * - [mmlu_pt](#simple-evals-mmlu-pt) - Global-MMLU 0-shot CoT in Portuguese (pt) * - [mmlu_pt-lite](#simple-evals-mmlu-pt-lite) - Global-MMLU-Lite 0-shot CoT in Portuguese (pt) * - [mmlu_ro](#simple-evals-mmlu-ro) - Global-MMLU 0-shot CoT in Romanian (ro) * - [mmlu_ru](#simple-evals-mmlu-ru) - Global-MMLU 0-shot CoT in Russian (ru) * - [mmlu_si](#simple-evals-mmlu-si) - Global-MMLU 0-shot CoT in Sinhala (si) * - [mmlu_sn](#simple-evals-mmlu-sn) - Global-MMLU 0-shot CoT in Shona (sn) * - [mmlu_so](#simple-evals-mmlu-so) - Global-MMLU 0-shot CoT in Somali (so) * - [mmlu_sr](#simple-evals-mmlu-sr) - Global-MMLU 0-shot CoT in Serbian (sr) * - [mmlu_sv](#simple-evals-mmlu-sv) - Global-MMLU 0-shot CoT in Swedish (sv) * - [mmlu_sw](#simple-evals-mmlu-sw) - Global-MMLU 0-shot CoT in Swahili (sw) * - [mmlu_sw-lite](#simple-evals-mmlu-sw-lite) - Global-MMLU-Lite 0-shot CoT in Swahili (sw) * - [mmlu_te](#simple-evals-mmlu-te) - Global-MMLU 0-shot CoT in Telugu (te) * - [mmlu_tr](#simple-evals-mmlu-tr) - Global-MMLU 0-shot CoT in Turkish (tr) * - [mmlu_uk](#simple-evals-mmlu-uk) - Global-MMLU 0-shot CoT in Ukrainian (uk) * - [mmlu_vi](#simple-evals-mmlu-vi) - Global-MMLU 0-shot CoT 
in Vietnamese (vi) * - [mmlu_yo](#simple-evals-mmlu-yo) - Global-MMLU 0-shot CoT in Yoruba (yo) * - [mmlu_yo-lite](#simple-evals-mmlu-yo-lite) - Global-MMLU-Lite 0-shot CoT in Yoruba (yo) * - [mmlu_zh-lite](#simple-evals-mmlu-zh-lite) - Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh) * - [simpleqa](#simple-evals-simpleqa) - A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions. ``` (simple-evals-aa-aime-2024)= ## AA_AIME_2024 AIME 2024 questions, math, using Artificial Analysis's setup. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `AA_AIME_2024` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats
{{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: 
simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: AA_AIME_2024 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: AA_AIME_2024 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-aa-math-test-500)= ## AA_math_test_500 OpenAI math test 500, using Artificial Analysis's setup. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `AA_math_test_500` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}}
{% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: 
:::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: AA_math_test_500 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 3 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: AA_math_test_500 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-aime-2024)= ## AIME_2024 AIME 2024 questions, math ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `AIME_2024` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout 
{{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file 
{{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: AIME_2024 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: AIME_2024 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-aime-2025)= ## AIME_2025 AIME 2025 questions, math ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `AIME_2025` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries 
{{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} 
--custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: AIME_2025 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: AIME_2025 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-aime-2025-aa-v2)= ## AIME_2025_aa_v2 AIME 2025 questions, math - params aligned with Artificial Analysis Index v2 ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `AIME_2025_aa_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir 
{{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if 
config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: AIME_2025 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: AIME_2025_aa_v2 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-aime-2024-nemo)= ## aime_2024_nemo AIME 2024 questions, math, using NeMo's alignment template ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `aime_2024_nemo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens 
{{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} 
--judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: aime_2024_nemo temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: aime_2024_nemo target: api_endpoint: {} ``` ::: :::: --- (simple-evals-aime-2025-nemo)= ## aime_2025_nemo AIME 2025 questions, math, using NeMo's alignment template ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `aime_2025_nemo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature 
{{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if 
config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: aime_2025_nemo temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 10 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: aime_2025_nemo target: api_endpoint: {} ``` ::: :::: --- (simple-evals-browsecomp)= ## browsecomp BrowseComp is a benchmark for measuring the ability of agents to browse the web.
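Each Command tab in this catalog emits a `--judge_*` flag only when the corresponding judge setting is non-null (`{% if ... is not none %}` in the Jinja template). The following Python sketch mirrors that conditional assembly; it is illustrative only, not the launcher's actual rendering code, and uses the judge defaults shown in this task's Defaults tab:

```python
def build_judge_flags(judge: dict) -> list:
    """Mirror the template logic: emit a --judge_* flag only for non-null settings."""
    flag_names = {
        "url": "--judge_url",
        "model_id": "--judge_model_id",
        "api_key": "--judge_api_key_name",
        "backend": "--judge_backend",
        "request_timeout": "--judge_request_timeout",
        "max_retries": "--judge_max_retries",
        "temperature": "--judge_temperature",
        "top_p": "--judge_top_p",
        "max_tokens": "--judge_max_tokens",
        "max_concurrent_requests": "--judge_max_concurrent_requests",
    }
    flags = []
    for key, flag in flag_names.items():
        value = judge.get(key)
        if value is not None:  # corresponds to `is not none` in the Jinja template
            flags += [flag, str(value)]
    return flags

# Judge defaults from the Defaults tab below: null entries produce no flags
judge_defaults = {
    "url": None, "model_id": None, "api_key": "JUDGE_API_KEY",
    "backend": "openai", "request_timeout": 600, "max_retries": 16,
    "temperature": 0.0, "top_p": 0.0001, "max_tokens": 1024,
    "max_concurrent_requests": None,
}
print(build_judge_flags(judge_defaults))
```

With these defaults, `--judge_url`, `--judge_model_id`, and `--judge_max_concurrent_requests` are omitted; all other settings are passed through.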
::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `browsecomp` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if 
config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: browsecomp temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: browsecomp target: api_endpoint: {} ``` ::: :::: --- (simple-evals-gpqa-diamond)= ## 
gpqa_diamond gpqa_diamond 0-shot CoT ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_diamond` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url 
{{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: gpqa_diamond temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: gpqa_diamond target: api_endpoint: {} ``` ::: :::: 
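The `gpqa_diamond_aa_v2` variant below replaces the default answer extraction with the custom regex listed in its Defaults tab (`match_group: 1`). As a quick illustration of how that pattern behaves, here is a small Python sketch; the sample model outputs are hypothetical:

```python
import re

# Extraction pattern from the gpqa_diamond_aa_v2 custom_config (aa_v2_regex)
PATTERN = re.compile(
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"
)

def extract_answer(text):
    """Return the extracted answer letter (match group 1), or None if no match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(extract_answer("After eliminating the others... **Answer:** B"))  # B
print(extract_answer("answer: c"))        # matches case-insensitively
print(extract_answer("Answer: Carbon"))   # None: the letter must stand alone
```

The negative lookahead `(?![a-zA-Z0-9])` is what rejects the last case: a choice letter immediately followed by another alphanumeric character is not treated as an answer.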
--- (simple-evals-gpqa-diamond-aa-v2)= ## gpqa_diamond_aa_v2 gpqa_diamond questions with custom regex extraction patterns for AA v2 ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_diamond_aa_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} 
{% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: gpqa_diamond temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 5 downsampling_ratio: null add_system_prompt: false custom_config: extraction: - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9]) match_group: 1 name: aa_v2_regex judge: url: null model_id: null api_key: null backend: openai 
request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: gpqa_diamond_aa_v2 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-gpqa-diamond-aa-v2-llama-4)= ## gpqa_diamond_aa_v2_llama_4 gpqa_diamond questions with custom regex extraction patterns for Llama 4 ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_diamond_aa_v2_llama_4` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} 
--add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: gpqa_diamond temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 5 downsampling_ratio: 
null add_system_prompt: false custom_config: extraction: - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9]) match_group: 1 name: answer_colon_llama4 - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9]) match_group: 1 name: answer_is_llama4 judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: gpqa_diamond_aa_v2_llama_4 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-gpqa-diamond-aa-v3)= ## gpqa_diamond_aa_v3 GPQA Diamond with AA v3 methodology - multi-stage regex extraction for robust answer parsing ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_diamond_aa_v3` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads 
{{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and 
config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: gpqa_diamond temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 5 downsampling_ratio: null add_system_prompt: false custom_config: prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following format: ''Answer: A/B/C/D'' (e.g. ''Answer: A''). {Question} A) {A} B) {B} C) {C} D) {D} ' extraction: - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9]) match_group: 1 name: primary_answer_format - regex: \\boxed\{[^}]*([A-Z])[^}]*\} match_group: 1 name: latex_boxed - regex: answer is ([a-zA-Z]) match_group: 1 name: natural_language - regex: answer is \(([a-zA-Z])\) match_group: 1 name: with_parenthesis - regex: ([A-Z])\)\s*[^A-Z]* match_group: 1 name: choice_format - regex: ([A-Z])\s+is\s+the\s+correct\s+answer match_group: 1 name: explicit_statement - regex: ([A-Z])\s*$ match_group: 1 name: standalone_letter_end - regex: ([A-Z])\s*\. 
match_group: 1 name: letter_with_period - regex: ([A-Z])\s*[^\w] match_group: 1 name: letter_nonword judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: gpqa_diamond_aa_v3 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-gpqa-diamond-nemo)= ## gpqa_diamond_nemo gpqa_diamond questions with reasoning, using NeMo's alignment template ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_diamond_nemo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if
config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 
max_retries: 5 parallelism: 10 task: gpqa_diamond_nemo temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 5 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: gpqa_diamond_nemo target: api_endpoint: {} ``` ::: :::: --- (simple-evals-gpqa-extended)= ## gpqa_extended gpqa_extended 0-shot CoT ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_extended` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats 
{{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: 
simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: gpqa_extended temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: gpqa_extended target: api_endpoint: {} ``` ::: :::: --- (simple-evals-gpqa-main)= ## gpqa_main gpqa_main 0-shot CoT ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `gpqa_main` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats 
{{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: 
simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: gpqa_main temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: gpqa_main target: api_endpoint: {} ``` ::: :::: --- (simple-evals-healthbench)= ## healthbench HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `healthbench` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout 
{{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file 
{{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: healthbench temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: healthbench target: api_endpoint: {} ``` ::: :::: --- (simple-evals-healthbench-consensus)= ## healthbench_consensus HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The consensus subset measures 34 particularly important aspects of model behavior and has been validated by the consensus of multiple physicians. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `healthbench_consensus` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} 
--temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} 
{% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: healthbench_consensus temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: healthbench_consensus target: api_endpoint: {} ``` ::: :::: --- (simple-evals-healthbench-hard)= ## healthbench_hard HealthBench is an open-source benchmark measuring the performance and safety of large language models in healthcare. The hard subset consists of 1000 examples chosen because they are difficult for current frontier models. 
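Tasks such as `gpqa_diamond_aa_v3` above configure answer parsing as an ordered list of extraction regexes, each with a `name` and a `match_group`. Presumably the stages are tried in order, from the most specific pattern to the most permissive, and the first match wins. A minimal stand-alone sketch of that multi-stage extraction, using a subset of the documented patterns (an illustration only, not the harness's actual implementation):

```python
import re

# Ordered extraction stages, copied from the aa_v3 defaults above
# (a subset; the real task defines more fallback patterns).
EXTRACTION_STAGES = [
    ("primary_answer_format",
     r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"),
    ("latex_boxed", r"\\boxed\{[^}]*([A-Z])[^}]*\}"),
    ("natural_language", r"answer is ([a-zA-Z])"),
]

def extract_answer(response: str):
    """Try each stage in order; return (stage_name, letter) for the first hit."""
    for name, pattern in EXTRACTION_STAGES:
        match = re.search(pattern, response)
        if match:
            return name, match.group(1)  # corresponds to match_group: 1
    return None  # no stage matched
```

Ordering the stages from strict formats (an explicit `Answer:` line) down to loose phrasing ("answer is B") keeps extraction robust to models that ignore the requested output format.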
::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `healthbench_hard` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if 
config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: healthbench_hard temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: JUDGE_API_KEY backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: healthbench_hard target: api_endpoint: {} ``` ::: :::: --- (simple-evals-humaneval)= ## 
humaneval HumanEval evaluates performance on Python code generation tasks. It is used to measure functional correctness of programs synthesized from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `humaneval` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if
config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: humaneval temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false 
custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: humaneval target: api_endpoint: {} ``` ::: :::: --- (simple-evals-humanevalplus)= ## humanevalplus HumanEvalPlus is a dataset of 164 programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `humanevalplus` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if 
config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 
max_retries: 5 parallelism: 10 task: humanevalplus temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: humanevalplus target: api_endpoint: {} ``` ::: :::: --- (simple-evals-math-test-500)= ## math_test_500 OpenAI's MATH-500 test set of competition mathematics problems. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `math_test_500` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{%
endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: 
max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: math_test_500 temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: math_test_500 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-math-test-500-nemo)= ## math_test_500_nemo The math_test_500 questions, evaluated using NeMo's alignment template. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `math_test_500_nemo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if
config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: 
:::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: math_test_500_nemo temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 3 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: math_test_500_nemo target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mgsm)= ## mgsm MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mgsm` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache
--num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and 
config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mgsm temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mgsm target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mgsm-aa-v2)= ## mgsm_aa_v2 MGSM is a benchmark of grade-school math problems, with parameters aligned with the Artificial Analysis Index v2. ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mgsm_aa_v2` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir
{{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests 
{{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: mgsm temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mgsm_aa_v2 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu)= ## mmlu MMLU 0-shot CoT ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir 
{{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests 
{{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-am)= ## mmlu_am Global-MMLU 0-shot CoT in Amharic (am) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_am` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens 
{{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} 
--judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_am temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_am target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ar)= ## mmlu_ar Global-MMLU 0-shot CoT in Arabic (ar) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ar` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p 
{{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if 
config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ar temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ar target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ar-lite)= ## mmlu_ar-lite Global-MMLU-Lite 0-shot CoT in Arabic (ar) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ar-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url 
{{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens 
{{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ar-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ar-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-bn)= ## mmlu_bn Global-MMLU 0-shot CoT in Bengali (bn) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_bn` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} 
--eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} 
--judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_bn temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_bn target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-bn-lite)= ## mmlu_bn-lite Global-MMLU-Lite 0-shot CoT in Bengali (bn) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_bn-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model 
{{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if 
config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_bn-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_bn-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-cs)= ## mmlu_cs Global-MMLU 0-shot CoT in Czech (cs) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_cs` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% 
endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif 
%} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_cs temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_cs target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-de)= ## mmlu_de Global-MMLU 0-shot CoT in German (de) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_de` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% 
endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif 
%} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_de temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_de target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-de-lite)= ## mmlu_de-lite Global-MMLU-Lite 0-shot CoT in German (de) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_de-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), 
default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p 
{{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_de-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_de-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-el)= ## mmlu_el Global-MMLU 0-shot CoT in Greek (el) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_el` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, 
open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if 
config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_el temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_el target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-en)= ## mmlu_en Global-MMLU 0-shot CoT in English (en) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_en` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); 
yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_en
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-en-lite)=
## mmlu_en-lite

Global-MMLU-Lite 0-shot CoT in English (en)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_en-lite`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_en-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_en-lite
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-es)=
## mmlu_es

Global-MMLU 0-shot CoT in Spanish (es)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_es`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_es
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-es-lite)=
## mmlu_es-lite

Global-MMLU-Lite 0-shot CoT in Spanish (es)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_es-lite`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_es-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_es-lite
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-fa)=
## mmlu_fa

Global-MMLU 0-shot CoT in Persian (fa)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_fa`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fa
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_fa
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-fil)=
## mmlu_fil

Global-MMLU 0-shot CoT in Filipino (fil)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_fil`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fil
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_fil
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-fr)=
## mmlu_fr

Global-MMLU 0-shot CoT in French (fr)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_fr`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_fr
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-fr-lite)=
## mmlu_fr-lite

Global-MMLU-Lite 0-shot CoT in French (fr)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_fr-lite`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_fr-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_fr-lite
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-ha)=
## mmlu_ha

Global-MMLU 0-shot CoT in Hausa (ha)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_ha`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ha
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_ha
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-he)=
## mmlu_he

Global-MMLU 0-shot CoT in Hebrew (he)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_he`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_he
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
    - chat
  type: mmlu_he
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-hi)=
## mmlu_hi

Global-MMLU 0-shot CoT in Hindi (hi)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_hi`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}}
&& {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout 
{{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_hi temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_hi target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-hi-lite)= ## mmlu_hi-lite Global-MMLU-Lite 0-shot CoT in Hindi (hi) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_hi-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export 
API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if 
config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_hi-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_hi-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-id)= ## mmlu_id Global-MMLU 0-shot CoT in Indonesian (id) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_id` ::: :::{tab-item} Command ```bash {% if 
target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend 
{{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_id temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_id target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-id-lite)= ## mmlu_id-lite Global-MMLU-Lite 0-shot CoT in Indonesian (id) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task 
Type:** `mmlu_id-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not 
none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_id-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_id-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ig)= ## mmlu_ig Global-MMLU 0-shot CoT in Igbo (ig) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** 
`multiarch` **Task Type:** `mmlu_ig` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if 
config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ig temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ig target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-it)= ## mmlu_it Global-MMLU 0-shot CoT in Italian (it) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` 
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_it` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} 
--judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_it temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_it target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-it-lite)= ## mmlu_it-lite Global-MMLU-Lite 0-shot CoT in Italian (it) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` 
**Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_it-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if 
config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_it-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_it-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ja)= ## mmlu_ja Global-MMLU 0-shot CoT in Japanese (ja) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` 
nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ja` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id 
{{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ja temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ja target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ja-lite)= ## mmlu_ja-lite Global-MMLU-Lite 0-shot CoT in Japanese (ja) ::::{tab-set} :::{tab-item} 
Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ja-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is 
not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ja-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ja-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ko)= ## mmlu_ko Global-MMLU 0-shot CoT in Korean (ko) ::::{tab-set} 
:::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ko` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if 
config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ko temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ko target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ko-lite)= ## mmlu_ko-lite 
Global-MMLU-Lite 0-shot CoT in Korean (ko) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ko-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url 
{{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ko-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ko-lite target: api_endpoint: {} ``` ::: :::: 
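Every Command tab above follows the same Jinja pattern: a judge flag is emitted only when the corresponding config value is not `none`. A minimal Python sketch of that guard logic (the helper name and the uniform `--judge_<key>` spelling are illustrative simplifications; the real template maps `api_key` to `--judge_api_key_name`):

```python
def build_judge_flags(judge: dict) -> list[str]:
    """Emit a --judge_<key> flag only for values that are set,
    mirroring the template's `{% if ... is not none %}` guards."""
    flags: list[str] = []
    for key, value in judge.items():
        if value is not None:
            flags += [f"--judge_{key}", str(value)]
    return flags

# With the catalog defaults, only the non-null judge settings produce
# flags; url, model_id, and api_key stay omitted from the command line.
defaults = {"url": None, "model_id": None, "backend": "openai", "max_tokens": 1024}
print(build_judge_flags(defaults))
# → ['--judge_backend', 'openai', '--judge_max_tokens', '1024']
```

This is why the Defaults tabs list `null` for most judge fields: a `null` value means the flag is simply absent, letting the harness fall back to its own defaults.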
--- (simple-evals-mmlu-ky)= ## mmlu_ky Global-MMLU 0-shot CoT in Kyrgyz (ky) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ky` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} 
--judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ky temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_ky target: api_endpoint: {} ``` ::: :::: 
--- (simple-evals-mmlu-llama-4)= ## mmlu_llama_4 MMLU questions with custom regex extraction patterns for Llama 4 ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_llama_4` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if 
config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: extraction: - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9]) match_group: 1 name: answer_colon_llama4 - regex: (?i)(?:the )?best? 
answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9]) match_group: 1 name: answer_is_llama4 judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_llama_4 target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-lt)= ## mmlu_lt Global-MMLU 0-shot CoT in Lithuanian (lt) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_lt` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} 
{% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_lt temperature: 0.0 request_timeout: 60 top_p: 
1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_lt target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-mg)= ## mmlu_mg Global-MMLU 0-shot CoT in Malagasy (mg) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_mg` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% 
endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_mg temperature: 0.0 request_timeout: 60 
top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_mg target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-ms)= ## mmlu_ms Global-MMLU 0-shot CoT in Malay (ms) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_ms` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n 
{{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_ms 
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ms
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-my-lite)=
## mmlu_my-lite

Global-MMLU-Lite 0-shot CoT in Burmese (my)

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_my-lite`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::
:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_my-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_my-lite
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-ne)=
## mmlu_ne

Global-MMLU 0-shot CoT in Nepali (ne)

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_ne`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::
:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ne
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ne
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-nl)=
## mmlu_nl

Global-MMLU 0-shot CoT in Dutch (nl)

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_nl`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::
:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_nl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_nl
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-ny)=
## mmlu_ny

Global-MMLU 0-shot CoT in Nyanja (ny)

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_ny`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::
:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ny
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ny
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-pl)=
## mmlu_pl

Global-MMLU 0-shot CoT in Polish (pl)

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_pl`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::
:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pl
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pl
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-pro)=
## mmlu_pro

MMLU-Pro is a more robust and challenging massive multi-task understanding dataset, designed to benchmark large language models' capabilities more rigorously. It contains 12K complex questions across various disciplines.

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_pro`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::
:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-pro-aa-v2)=
## mmlu_pro_aa_v2

MMLU-Pro - params aligned with Artificial Analysis Index v2

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_pro_aa_v2`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache
--num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::
:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_aa_v2
target:
  api_endpoint: {}
```

:::
::::

---

(simple-evals-mmlu-pro-aa-v3)=
## mmlu_pro_aa_v3

MMLU-Pro with AA v3 methodology - multi-stage regex extraction with A-J options

::::{tab-set}
:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_pro_aa_v3`

:::
:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir
{{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests 
{{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 30
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        prompt_template: 'Answer the following multiple choice question. The last line of your response should be in the following format: ''Answer: A/B/C/D/E/F/G/H/I/J'' (e.g. ''Answer: A''). {Question} A) {A} B) {B} C) {C} D) {D} E) {E} F) {F} G) {G} H) {H} I) {I} J) {J} '
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: primary_answer_format
        - regex: \\boxed\{[^}]*([A-Z])[^}]*\}
          match_group: 1
          name: latex_boxed
        - regex: answer is ([a-zA-Z])
          match_group: 1
          name: natural_language
        - regex: answer is \(([a-zA-Z])\)
          match_group: 1
          name: with_parenthesis
        - regex: ([A-Z])\)\s*[^A-Z]*
          match_group: 1
          name: choice_format
        - regex: ([A-Z])\s+is\s+the\s+correct\s+answer
          match_group: 1
          name: explicit_statement
        - regex: ([A-Z])\s*$
          match_group: 1
          name: standalone_letter_end
        - regex: ([A-Z])\s*\.
          match_group: 1
          name: letter_with_period
        - regex: ([A-Z])\s*[^\w]
          match_group: 1
          name: letter_nonword
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_aa_v3
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-pro-llama-4)=
## mmlu_pro_llama_4

MMLU-Pro questions with custom regex extraction patterns for Llama 4

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_pro_llama_4`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if
config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config:
        extraction:
        - regex: (?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_colon_llama4
        - regex: (?i)(?:the )?best? answer is\s*[\*\_,{}\.]*([A-D])(?![a-zA-Z0-9])
          match_group: 1
          name: answer_is_llama4
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pro_llama_4
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-pt)=
## mmlu_pt

Global-MMLU 0-shot CoT in Portuguese (pt)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_pt`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir
{{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if 
config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pt
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-pt-lite)=
## mmlu_pt-lite

Global-MMLU-Lite 0-shot CoT in Portuguese (pt)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_pt-lite`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir
{{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests 
{{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_pt-lite
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_pt-lite
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-ro)=
## mmlu_ro

Global-MMLU 0-shot CoT in Romanian (ro)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_ro`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens
{{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} 
--judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ro
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ro
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-ru)=
## mmlu_ru

Global-MMLU 0-shot CoT in Russian (ru)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_ru`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p
{{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if 
config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_ru
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_ru
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-si)=
## mmlu_si

Global-MMLU 0-shot CoT in Sinhala (si)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_si`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}}
--temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} 
{% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_si
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_si
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-sn)=
## mmlu_sn

Global-MMLU 0-shot CoT in Shona (sn)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_sn`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}}
--temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} 
{% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_sn
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_sn
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-so)=
## mmlu_so

Global-MMLU 0-shot CoT in Somali (so)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_so`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}}
--temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} 
{% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %}
```

:::

:::{tab-item} Defaults

```yaml
framework_name: simple_evals
pkg_name: simple_evals
config:
  params:
    max_new_tokens: 16384
    max_retries: 5
    parallelism: 10
    task: mmlu_so
    temperature: 0.0
    request_timeout: 60
    top_p: 1.0e-05
    extra:
      n_samples: 1
      downsampling_ratio: null
      add_system_prompt: false
      custom_config: null
      judge:
        url: null
        model_id: null
        api_key: null
        backend: openai
        request_timeout: 600
        max_retries: 16
        temperature: 0.0
        top_p: 0.0001
        max_tokens: 1024
        max_concurrent_requests: null
  supported_endpoint_types:
  - chat
  type: mmlu_so
target:
  api_endpoint: {}
```

:::

::::

---

(simple-evals-mmlu-sr)=
## mmlu_sr

Global-MMLU 0-shot CoT in Serbian (sr)

::::{tab-set}

:::{tab-item} Container

**Harness:** `simple_evals`

**Container:**

```
nvcr.io/nvidia/eval-factory/simple-evals:26.01
```

**Container Digest:**

```
sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158
```

**Container Arch:** `multiarch`

**Task Type:** `mmlu_sr`

:::

:::{tab-item} Command

```bash
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}}
--temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} 
{% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_sr temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_sr target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-sv)= ## mmlu_sv Global-MMLU 0-shot CoT in Swedish (sv) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_sv` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} 
--temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} 
{% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_sv temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_sv target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-sw)= ## mmlu_sw Global-MMLU 0-shot CoT in Swahili (sw) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_sw` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} 
--temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} 
{% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_sw temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_sw target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-sw-lite)= ## mmlu_sw-lite Global-MMLU-Lite 0-shot CoT in Swahili (sw) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_sw-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url 
{{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens 
{{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_sw-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_sw-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-te)= ## mmlu_te Global-MMLU 0-shot CoT in Telugu (te) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_te` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} 
--eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} 
--judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_te temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_te target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-tr)= ## mmlu_tr Global-MMLU 0-shot CoT in Turkish (tr) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_tr` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model 
{{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if 
config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_tr temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_tr target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-uk)= ## mmlu_uk Global-MMLU 0-shot CoT in Ukrainian (uk) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_uk` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif 
%} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% 
if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_uk temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_uk target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-vi)= ## mmlu_vi Global-MMLU 0-shot CoT in Vietnamese (vi) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_vi` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% 
endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif 
%} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_vi temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_vi target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-yo)= ## mmlu_yo Global-MMLU 0-shot CoT in Yoruba (yo) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_yo` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% 
endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif 
%} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_yo temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_yo target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-yo-lite)= ## mmlu_yo-lite Global-MMLU-Lite 0-shot CoT in Yoruba (yo) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_yo-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), 
default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p 
{{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_yo-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_yo-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-mmlu-zh-lite)= ## mmlu_zh-lite Global-MMLU-Lite 0-shot CoT in Chinese (Simplified) (zh) ::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `mmlu_zh-lite` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, 
open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if 
config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: mmlu_zh-lite temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: mmlu_zh-lite target: api_endpoint: {} ``` ::: :::: --- (simple-evals-simpleqa)= ## simpleqa A factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.
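As with the other `simple_evals` tasks on this page, the command in the Command tab below is rendered from a Jinja2 template that emits each `--judge_*` flag only when the corresponding `extra.judge` value is non-null. The sketch below mimics that conditional flag assembly in plain Python; `build_judge_flags` and the sample dict are illustrative, not part of the SDK.

```python
# Illustrative sketch (not an SDK API): mirror how the Command template
# appends --judge_* flags only for non-null values under extra.judge.
def build_judge_flags(judge: dict) -> list[str]:
    """Map each non-null judge setting to its simple_evals CLI flag."""
    flag_names = {
        "url": "--judge_url",
        "model_id": "--judge_model_id",
        "api_key": "--judge_api_key_name",
        "backend": "--judge_backend",
        "request_timeout": "--judge_request_timeout",
        "max_retries": "--judge_max_retries",
        "temperature": "--judge_temperature",
        "top_p": "--judge_top_p",
        "max_tokens": "--judge_max_tokens",
        "max_concurrent_requests": "--judge_max_concurrent_requests",
    }
    flags: list[str] = []
    for key, flag in flag_names.items():
        value = judge.get(key)
        if value is not None:  # mirrors `{% if ... is not none %}` in the template
            flags += [flag, str(value)]
    return flags

# With the documented defaults, only the non-null settings produce flags;
# url/model_id/api_key/max_concurrent_requests are null and are skipped.
defaults = {"url": None, "model_id": None, "api_key": None, "backend": "openai",
            "request_timeout": 600, "max_retries": 16, "temperature": 0.0,
            "top_p": 0.0001, "max_tokens": 1024, "max_concurrent_requests": None}
print(build_judge_flags(defaults))
```

Overriding any null default (for example setting `extra.judge.url`) simply makes the matching flag appear in the rendered command.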
::::{tab-set} :::{tab-item} Container **Harness:** `simple_evals` **Container:** ``` nvcr.io/nvidia/eval-factory/simple-evals:26.01 ``` **Container Digest:** ``` sha256:5e59e6e34ac55bb7d2b7a17466c039a5a663a9961bce70751e5a4f3f09026158 ``` **Container Arch:** `multiarch` **Task Type:** `simpleqa` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} echo '{{config.params.extra.custom_config | tojson}}' > {{config.output_dir}}/temp_config.json && python3 -c 'import yaml, json; config_data = json.load(open("{{config.output_dir}}/temp_config.json")); yaml.dump(config_data, open("{{config.output_dir}}/custom_config.yml", "w"), default_flow_style=False)' && {% endif %} simple_evals --model {{target.api_endpoint.model_id}} --eval_name {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} --top_p {{config.params.top_p}} --max_tokens {{config.params.max_new_tokens}} --out_dir {{config.output_dir}}/{{config.type}} --cache_dir {{config.output_dir}}/{{config.type}}/cache --num_threads {{config.params.parallelism}} --max_retries {{config.params.max_retries}} --timeout {{config.params.request_timeout}} {% if config.params.extra.n_samples is defined %} --num_repeats {{config.params.extra.n_samples}}{% endif %} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.downsampling_ratio is not none %} --downsampling_ratio {{config.params.extra.downsampling_ratio}}{% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} {% if config.params.extra.judge.url is not none %} --judge_url {{config.params.extra.judge.url}}{% endif %} {% if 
config.params.extra.judge.model_id is not none %} --judge_model_id {{config.params.extra.judge.model_id}}{% endif %} {% if config.params.extra.judge.api_key is not none %} --judge_api_key_name {{config.params.extra.judge.api_key}}{% endif %} {% if config.params.extra.judge.backend is not none %} --judge_backend {{config.params.extra.judge.backend}}{% endif %} {% if config.params.extra.judge.request_timeout is not none %} --judge_request_timeout {{config.params.extra.judge.request_timeout}}{% endif %} {% if config.params.extra.judge.max_retries is not none %} --judge_max_retries {{config.params.extra.judge.max_retries}}{% endif %} {% if config.params.extra.judge.temperature is not none %} --judge_temperature {{config.params.extra.judge.temperature}}{% endif %} {% if config.params.extra.judge.top_p is not none %} --judge_top_p {{config.params.extra.judge.top_p}}{% endif %} {% if config.params.extra.judge.max_tokens is not none %} --judge_max_tokens {{config.params.extra.judge.max_tokens}}{% endif %} {% if config.params.extra.judge.max_concurrent_requests is not none %} --judge_max_concurrent_requests {{config.params.extra.judge.max_concurrent_requests}}{% endif %} {% if config.params.extra.custom_config is defined and config.params.extra.custom_config is not none %} --custom_eval_cfg_file {{config.output_dir}}/custom_config.yml{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: simple_evals pkg_name: simple_evals config: params: max_new_tokens: 16384 max_retries: 5 parallelism: 10 task: simpleqa temperature: 0.0 request_timeout: 60 top_p: 1.0e-05 extra: n_samples: 1 downsampling_ratio: null add_system_prompt: false custom_config: null judge: url: null model_id: null api_key: null backend: openai request_timeout: 600 max_retries: 16 temperature: 0.0 top_p: 0.0001 max_tokens: 1024 max_concurrent_requests: null supported_endpoint_types: - chat type: simpleqa target: api_endpoint: {} ``` ::: :::: # tau2_bench This page contains all evaluation tasks for the 
**tau2_bench** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [tau2_bench_airline](#tau2-bench-tau2-bench-airline) - tau2-bench - Airline Domain * - [tau2_bench_retail](#tau2-bench-tau2-bench-retail) - tau2-bench - Retail Domain * - [tau2_bench_telecom](#tau2-bench-tau2-bench-telecom) - tau2-bench - Telecom Domain (used by Artificial Analysis Index v2) ``` (tau2-bench-tau2-bench-airline)= ## tau2_bench_airline tau2-bench - Airline Domain ::::{tab-set} :::{tab-item} Container **Harness:** `tau2_bench` **Container:** ``` nvcr.io/nvidia/eval-factory/tau2-bench:26.01 ``` **Container Digest:** ``` sha256:24aae1ed0eb955810a597382b1cbbfd8da64f9f74e1e64a4afd6a271d1b98be3 ``` **Container Arch:** `multiarch` **Task Type:** `tau2_bench_airline` ::: :::{tab-item} Command ```bash {% if config.params.extra.cache.enabled %}export LLM_CACHE_ENABLED=true && export CACHE_TYPE=disk && export CACHE_DIR={{config.params.extra.cache.cache_dir}} && {% endif %} tau2 run --domain {{config.params.task}} --agent-llm openai/{{target.api_endpoint.model_id}} --user-llm openai/{{config.params.extra.user.model_id}} {% if config.params.extra.judge.enabled %}--judge-llm openai/{{config.params.extra.judge.model_id}}{% endif %} {% if target.api_endpoint.api_key_name is not none %}--agent-api-key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.extra.user.api_key is not none %}--user-api-key {{config.params.extra.user.api_key}}{% endif %} {% if config.params.extra.judge.enabled and config.params.extra.judge.api_key is not none %}--judge-api-key {{config.params.extra.judge.api_key}}{% endif %} --agent-llm-args '{"base_url": "{{target.api_endpoint.url}}", "temperature": {{config.params.temperature}}, "top_p": {{config.params.top_p}}, "max_completion_tokens": {{config.params.max_new_tokens}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.agent_args is defined and config.params.extra.agent_args is not none %}{% for key, value in 
config.params.extra.agent_args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' --user-llm-args '{"base_url": "{{config.params.extra.user.url}}", "temperature": {{config.params.extra.user.temperature}}, "top_p": {{config.params.extra.user.top_p}}, "max_completion_tokens": {{config.params.extra.user.max_new_tokens}}, "timeout": {{config.params.extra.user.request_timeout}}{% if config.params.extra.user.args is defined and config.params.extra.user.args is not none %}{% for key, value in config.params.extra.user.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' {% if config.params.extra.judge.enabled %}--judge-llm-args '{"base_url": "{{config.params.extra.judge.url}}", "temperature": {{config.params.extra.judge.temperature}}, "top_p": {{config.params.extra.judge.top_p}}, "max_completion_tokens": {{config.params.extra.judge.max_new_tokens}}, "timeout": {{config.params.extra.judge.request_timeout}}{% if config.params.extra.judge.args is defined and config.params.extra.judge.args is not none %}{% for key, value in config.params.extra.judge.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}'{% endif %} {% if config.params.extra.judge.enabled %}--judge-system-prompt "{{config.params.extra.judge.system_prompt}}"{% endif %} {% if config.params.extra.judge.enabled %}--judge-window-size {{config.params.extra.judge_window_size}}{% endif %} --max-concurrency {{config.params.parallelism}} --max-retries {{config.params.max_retries}} --max-steps {{config.params.extra.max_steps}} --results-dir {{config.output_dir}} --num-trials {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --num-tasks {{config.params.limit_samples}} {% endif %} {% if config.params.extra.skip_failed_samples %} --skip-failed-samples {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: tau2_bench pkg_name: nvidia_tau2 config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: airline temperature: 0.0 
request_timeout: 3600 top_p: 0.95 extra: n_samples: 3 max_steps: 100 judge_window_size: 30 skip_failed_samples: false cache: enabled: true cache_dir: .cache/llm_cache user: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: nvdev/qwen/qwen-235b api_key: USER_API_KEY temperature: 0.0 max_new_tokens: 4096 top_p: 0.95 request_timeout: 3600 judge: enabled: false url: https://integrate.api.nvidia.com/v1/chat/completions model_id: openai/gpt-oss-120b system_prompt: Reasoning:medium api_key: JUDGE_API_KEY temperature: 0.6 max_new_tokens: 16000 top_p: 0.95 request_timeout: 3600 supported_endpoint_types: - chat type: tau2_bench_airline target: api_endpoint: stream: false ``` ::: :::: --- (tau2-bench-tau2-bench-retail)= ## tau2_bench_retail tau2-bench - Retail Domain ::::{tab-set} :::{tab-item} Container **Harness:** `tau2_bench` **Container:** ``` nvcr.io/nvidia/eval-factory/tau2-bench:26.01 ``` **Container Digest:** ``` sha256:24aae1ed0eb955810a597382b1cbbfd8da64f9f74e1e64a4afd6a271d1b98be3 ``` **Container Arch:** `multiarch` **Task Type:** `tau2_bench_retail` ::: :::{tab-item} Command ```bash {% if config.params.extra.cache.enabled %}export LLM_CACHE_ENABLED=true && export CACHE_TYPE=disk && export CACHE_DIR={{config.params.extra.cache.cache_dir}} && {% endif %} tau2 run --domain {{config.params.task}} --agent-llm openai/{{target.api_endpoint.model_id}} --user-llm openai/{{config.params.extra.user.model_id}} {% if config.params.extra.judge.enabled %}--judge-llm openai/{{config.params.extra.judge.model_id}}{% endif %} {% if target.api_endpoint.api_key_name is not none %}--agent-api-key {{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.extra.user.api_key is not none %}--user-api-key {{config.params.extra.user.api_key}}{% endif %} {% if config.params.extra.judge.enabled and config.params.extra.judge.api_key is not none %}--judge-api-key {{config.params.extra.judge.api_key}}{% endif %} --agent-llm-args '{"base_url": 
"{{target.api_endpoint.url}}", "temperature": {{config.params.temperature}}, "top_p": {{config.params.top_p}}, "max_completion_tokens": {{config.params.max_new_tokens}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.agent_args is defined and config.params.extra.agent_args is not none %}{% for key, value in config.params.extra.agent_args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' --user-llm-args '{"base_url": "{{config.params.extra.user.url}}", "temperature": {{config.params.extra.user.temperature}}, "top_p": {{config.params.extra.user.top_p}}, "max_completion_tokens": {{config.params.extra.user.max_new_tokens}}, "timeout": {{config.params.extra.user.request_timeout}}{% if config.params.extra.user.args is defined and config.params.extra.user.args is not none %}{% for key, value in config.params.extra.user.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' {% if config.params.extra.judge.enabled %}--judge-llm-args '{"base_url": "{{config.params.extra.judge.url}}", "temperature": {{config.params.extra.judge.temperature}}, "top_p": {{config.params.extra.judge.top_p}}, "max_completion_tokens": {{config.params.extra.judge.max_new_tokens}}, "timeout": {{config.params.extra.judge.request_timeout}}{% if config.params.extra.judge.args is defined and config.params.extra.judge.args is not none %}{% for key, value in config.params.extra.judge.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}'{% endif %} {% if config.params.extra.judge.enabled %}--judge-system-prompt "{{config.params.extra.judge.system_prompt}}"{% endif %} {% if config.params.extra.judge.enabled %}--judge-window-size {{config.params.extra.judge_window_size}}{% endif %} --max-concurrency {{config.params.parallelism}} --max-retries {{config.params.max_retries}} --max-steps {{config.params.extra.max_steps}} --results-dir {{config.output_dir}} --num-trials {{config.params.extra.n_samples}} {% if config.params.limit_samples is not 
none %} --num-tasks {{config.params.limit_samples}} {% endif %} {% if config.params.extra.skip_failed_samples %} --skip-failed-samples {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: tau2_bench pkg_name: nvidia_tau2 config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: retail temperature: 0.0 request_timeout: 3600 top_p: 0.95 extra: n_samples: 3 max_steps: 100 judge_window_size: 30 skip_failed_samples: false cache: enabled: true cache_dir: .cache/llm_cache user: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: nvdev/qwen/qwen-235b api_key: USER_API_KEY temperature: 0.0 max_new_tokens: 4096 top_p: 0.95 request_timeout: 3600 judge: enabled: false url: https://integrate.api.nvidia.com/v1/chat/completions model_id: openai/gpt-oss-120b system_prompt: Reasoning:medium api_key: JUDGE_API_KEY temperature: 0.6 max_new_tokens: 16000 top_p: 0.95 request_timeout: 3600 supported_endpoint_types: - chat type: tau2_bench_retail target: api_endpoint: stream: false ``` ::: :::: --- (tau2-bench-tau2-bench-telecom)= ## tau2_bench_telecom tau2-bench - Telecom Domain (used by Artificial Analysis Index v2) ::::{tab-set} :::{tab-item} Container **Harness:** `tau2_bench` **Container:** ``` nvcr.io/nvidia/eval-factory/tau2-bench:26.01 ``` **Container Digest:** ``` sha256:24aae1ed0eb955810a597382b1cbbfd8da64f9f74e1e64a4afd6a271d1b98be3 ``` **Container Arch:** `multiarch` **Task Type:** `tau2_bench_telecom` ::: :::{tab-item} Command ```bash {% if config.params.extra.cache.enabled %}export LLM_CACHE_ENABLED=true && export CACHE_TYPE=disk && export CACHE_DIR={{config.params.extra.cache.cache_dir}} && {% endif %} tau2 run --domain {{config.params.task}} --agent-llm openai/{{target.api_endpoint.model_id}} --user-llm openai/{{config.params.extra.user.model_id}} {% if config.params.extra.judge.enabled %}--judge-llm openai/{{config.params.extra.judge.model_id}}{% endif %} {% if target.api_endpoint.api_key_name is not none %}--agent-api-key 
{{target.api_endpoint.api_key_name}}{% endif %} {% if config.params.extra.user.api_key is not none %}--user-api-key {{config.params.extra.user.api_key}}{% endif %} {% if config.params.extra.judge.enabled and config.params.extra.judge.api_key is not none %}--judge-api-key {{config.params.extra.judge.api_key}}{% endif %} --agent-llm-args '{"base_url": "{{target.api_endpoint.url}}", "temperature": {{config.params.temperature}}, "top_p": {{config.params.top_p}}, "max_completion_tokens": {{config.params.max_new_tokens}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.agent_args is defined and config.params.extra.agent_args is not none %}{% for key, value in config.params.extra.agent_args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' --user-llm-args '{"base_url": "{{config.params.extra.user.url}}", "temperature": {{config.params.extra.user.temperature}}, "top_p": {{config.params.extra.user.top_p}}, "max_completion_tokens": {{config.params.extra.user.max_new_tokens}}, "timeout": {{config.params.extra.user.request_timeout}}{% if config.params.extra.user.args is defined and config.params.extra.user.args is not none %}{% for key, value in config.params.extra.user.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}' {% if config.params.extra.judge.enabled %}--judge-llm-args '{"base_url": "{{config.params.extra.judge.url}}", "temperature": {{config.params.extra.judge.temperature}}, "top_p": {{config.params.extra.judge.top_p}}, "max_completion_tokens": {{config.params.extra.judge.max_new_tokens}}, "timeout": {{config.params.extra.judge.request_timeout}}{% if config.params.extra.judge.args is defined and config.params.extra.judge.args is not none %}{% for key, value in config.params.extra.judge.args.items() %}, "{{key}}": {{value|tojson}}{% endfor %}{% endif %}}'{% endif %} {% if config.params.extra.judge.enabled %}--judge-system-prompt "{{config.params.extra.judge.system_prompt}}"{% endif %} {% if 
config.params.extra.judge.enabled %}--judge-window-size {{config.params.extra.judge_window_size}}{% endif %} --max-concurrency {{config.params.parallelism}} --max-retries {{config.params.max_retries}} --max-steps {{config.params.extra.max_steps}} --results-dir {{config.output_dir}} --num-trials {{config.params.extra.n_samples}} {% if config.params.limit_samples is not none %} --num-tasks {{config.params.limit_samples}} {% endif %} {% if config.params.extra.skip_failed_samples %} --skip-failed-samples {% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: tau2_bench pkg_name: nvidia_tau2 config: params: max_new_tokens: 16384 max_retries: 30 parallelism: 10 task: telecom temperature: 0.0 request_timeout: 3600 top_p: 0.95 extra: n_samples: 3 max_steps: 100 judge_window_size: 30 skip_failed_samples: false cache: enabled: true cache_dir: .cache/llm_cache user: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: nvdev/qwen/qwen-235b api_key: USER_API_KEY temperature: 0.0 max_new_tokens: 4096 top_p: 0.95 request_timeout: 3600 judge: enabled: false url: https://integrate.api.nvidia.com/v1/chat/completions model_id: openai/gpt-oss-120b system_prompt: Reasoning:medium api_key: JUDGE_API_KEY temperature: 0.6 max_new_tokens: 16000 top_p: 0.95 request_timeout: 3600 supported_endpoint_types: - chat type: tau2_bench_telecom target: api_endpoint: stream: false ``` ::: :::: # tooltalk This page contains all evaluation tasks for the **tooltalk** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [tooltalk](#tooltalk-tooltalk) - ToolTalk task with default settings. ``` (tooltalk-tooltalk)= ## tooltalk ToolTalk task with default settings. 
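The command in the Command tab below selects the evaluation module from the target endpoint URL: Azure or OpenAI endpoints run `tooltalk.evaluation.evaluate_openai`, while any other URL falls back to `tooltalk.evaluation.evaluate_nim`. That selection expression, extracted for clarity (the helper name is illustrative, not a tooltalk API):

```python
# Illustrative sketch of the module-selection expression embedded in the
# tooltalk Command template: 'azure' or 'api.openai' in the URL picks the
# OpenAI evaluator; every other endpoint gets the NIM evaluator.
def select_evaluator(url: str) -> str:
    backend = "openai" if "azure" in url or "api.openai" in url else "nim"
    return f"tooltalk.evaluation.evaluate_{backend}"

print(select_evaluator("http://localhost:8000/v1"))  # evaluate_nim path
```

Note that the match is a plain substring test, so any self-hosted URL containing `azure` would also route to the OpenAI evaluator.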
::::{tab-set} :::{tab-item} Container **Harness:** `tooltalk` **Container:** ``` nvcr.io/nvidia/eval-factory/tooltalk:26.01 ``` **Container Digest:** ``` sha256:2c032e8274fd3a825b3c2774d33d0caddfa198fe24980dd99b8e3ae77c8aadee ``` **Container Arch:** `multiarch` **Task Type:** `tooltalk` ::: :::{tab-item} Command ```bash {% if target.api_endpoint.api_key_name is not none %}API_KEY=${{target.api_endpoint.api_key_name}}{% endif %} python -m tooltalk.evaluation.evaluate_{{'openai' if 'azure' in target.api_endpoint.url or 'api.openai' in target.api_endpoint.url else 'nim'}} --dataset data/easy --database data/databases --model {{target.api_endpoint.model_id}} {% if config.params.max_new_tokens is not none %}--max_new_tokens {{config.params.max_new_tokens}}{% endif %} {% if config.params.temperature is not none %}--temperature {{config.params.temperature}}{% endif %} {% if config.params.top_p is not none %}--top_p {{config.params.top_p}}{% endif %} --api_mode all --output_dir {{config.output_dir}} --url {{target.api_endpoint.url}} {% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: tooltalk pkg_name: tooltalk config: params: extra: {} supported_endpoint_types: - chat type: tooltalk target: api_endpoint: {} ``` ::: :::: # vlmevalkit This page contains all evaluation tasks for the **vlmevalkit** harness. ```{list-table} :header-rows: 1 :widths: 30 70 * - Task - Description * - [ai2d_judge](#vlmevalkit-ai2d-judge) - A benchmark for evaluating diagram understanding capabilities of large vision-language models. 
* - [chartqa](#vlmevalkit-chartqa) - A Benchmark for Question Answering about Charts with Visual and Logical Reasoning * - [mathvista-mini](#vlmevalkit-mathvista-mini) - Evaluating Math Reasoning in Visual Contexts * - [mmmu_judge](#vlmevalkit-mmmu-judge) - A benchmark for evaluating multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. * - [ocr_reasoning](#vlmevalkit-ocr-reasoning) - Comprehensive benchmark of 1,069 human-annotated examples designed to evaluate multimodal large language models on text-rich image reasoning tasks by assessing both final answers and the reasoning process across six core abilities and 18 practical tasks. * - [ocrbench](#vlmevalkit-ocrbench) - Comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models * - [slidevqa](#vlmevalkit-slidevqa) - Evaluates ability to answer questions about slide decks by selecting relevant slides from multiple images ``` (vlmevalkit-ai2d-judge)= ## ai2d_judge A benchmark for evaluating diagram understanding capabilities of large vision-language models. 
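The vlmevalkit commands below first materialize a `vlmeval_config.json` via a shell heredoc before invoking `python -m vlmeval.run`. The sketch below builds the same document in Python so the structure is easier to read; the endpoint values are placeholders, and the field names follow the template in the Command tab (filled in with the documented ai2d_judge defaults).

```python
import json

# Illustrative: assemble the vlmeval_config.json document that the Command
# tab writes with a heredoc. model_id and api_base are placeholders for
# target.api_endpoint.model_id / target.api_endpoint.url.
model_id = "my-org/my-vlm"
config = {
    "model": {
        # the template keys the model entry by the last path segment of model_id
        model_id.split("/")[-1]: {
            "class": "CustomOAIEndpoint",
            "model": model_id,
            "api_base": "http://localhost:8000/v1",
            "api_key_var_name": "MY_API_KEY",
            "max_tokens": 2048,   # documented defaults: max_new_tokens
            "temperature": 0.0,
            "retry": 5,           # max_retries
            "timeout": 60,        # request_timeout
        }
    },
    "data": {
        "AI2D_TEST": {
            "class": "ImageMCQDataset",
            "dataset": "AI2D_TEST",
            "model": model_id,
        }
    },
}
print(json.dumps(config, indent=2))
```

Optional fields such as `top_p`, `img_size`, or `system_prompt` are added to the model entry only when the corresponding parameter is set, exactly as the template's `{% if ... %}` guards do.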
::::{tab-set} :::{tab-item} Container **Harness:** `vlmevalkit` **Container:** ``` nvcr.io/nvidia/eval-factory/vlmevalkit:26.01 ``` **Container Digest:** ``` sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc ``` **Container Arch:** `amd` **Task Type:** `ai2d_judge` ::: :::{tab-item} Command ```bash cat > {{config.output_dir}}/vlmeval_config.json << 'EOF' { "model": { "{{target.api_endpoint.model_id.split('/')[-1]}}": { "class": "CustomOAIEndpoint", "model": "{{target.api_endpoint.model_id}}", "api_base": "{{target.api_endpoint.url}}", "api_key_var_name": "{{target.api_endpoint.api_key_name}}", "max_tokens": {{config.params.max_new_tokens}}, "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %} "top_p": {{config.params.top_p}},{% endif %} "retry": {{config.params.max_retries}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %}, "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %}, "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %}, "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %}, "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %}, "verbose": {{config.params.extra.verbose}}{% endif %} } }, "data": { "{{config.params.extra.dataset.name}}": { "class": "{{config.params.extra.dataset.class}}", "dataset": "{{config.params.extra.dataset.name}}", "model": "{{target.api_endpoint.model_id}}" } } } EOF python -m vlmeval.run \ --config {{config.output_dir}}/vlmeval_config.json \ --work-dir {{config.output_dir}} \ --api-nproc {{config.params.parallelism}} \ {%- if config.params.extra.judge is defined %} --judge {{config.params.extra.judge.model}} \ --judge-args '{{config.params.extra.judge.args}}' \ {%- endif %} {% if config.params.limit_samples is not none 
%}--first-n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: vlmevalkit pkg_name: vlmevalkit config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 4 temperature: 0.0 request_timeout: 60 extra: dataset: name: AI2D_TEST class: ImageMCQDataset judge: model: gpt-4o args: '{"use_azure": true}' supported_endpoint_types: - vlm type: ai2d_judge target: api_endpoint: {} ``` ::: :::: --- (vlmevalkit-chartqa)= ## chartqa A Benchmark for Question Answering about Charts with Visual and Logical Reasoning ::::{tab-set} :::{tab-item} Container **Harness:** `vlmevalkit` **Container:** ``` nvcr.io/nvidia/eval-factory/vlmevalkit:26.01 ``` **Container Digest:** ``` sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc ``` **Container Arch:** `amd` **Task Type:** `chartqa` ::: :::{tab-item} Command ```bash cat > {{config.output_dir}}/vlmeval_config.json << 'EOF' { "model": { "{{target.api_endpoint.model_id.split('/')[-1]}}": { "class": "CustomOAIEndpoint", "model": "{{target.api_endpoint.model_id}}", "api_base": "{{target.api_endpoint.url}}", "api_key_var_name": "{{target.api_endpoint.api_key_name}}", "max_tokens": {{config.params.max_new_tokens}}, "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %} "top_p": {{config.params.top_p}},{% endif %} "retry": {{config.params.max_retries}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %}, "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %}, "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %}, "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %}, "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %}, "verbose": {{config.params.extra.verbose}}{% endif %} } }, "data": { 
"{{config.params.extra.dataset.name}}": { "class": "{{config.params.extra.dataset.class}}", "dataset": "{{config.params.extra.dataset.name}}", "model": "{{target.api_endpoint.model_id}}" } } } EOF python -m vlmeval.run \ --config {{config.output_dir}}/vlmeval_config.json \ --work-dir {{config.output_dir}} \ --api-nproc {{config.params.parallelism}} \ {%- if config.params.extra.judge is defined %} --judge {{config.params.extra.judge.model}} \ --judge-args '{{config.params.extra.judge.args}}' \ {%- endif %} {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: vlmevalkit pkg_name: vlmevalkit config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 4 temperature: 0.0 request_timeout: 60 extra: dataset: name: ChartQA_TEST class: ImageVQADataset supported_endpoint_types: - vlm type: chartqa target: api_endpoint: {} ``` ::: :::: --- (vlmevalkit-mathvista-mini)= ## mathvista-mini Evaluating Math Reasoning in Visual Contexts ::::{tab-set} :::{tab-item} Container **Harness:** `vlmevalkit` **Container:** ``` nvcr.io/nvidia/eval-factory/vlmevalkit:26.01 ``` **Container Digest:** ``` sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc ``` **Container Arch:** `amd` **Task Type:** `mathvista-mini` ::: :::{tab-item} Command ```bash cat > {{config.output_dir}}/vlmeval_config.json << 'EOF' { "model": { "{{target.api_endpoint.model_id.split('/')[-1]}}": { "class": "CustomOAIEndpoint", "model": "{{target.api_endpoint.model_id}}", "api_base": "{{target.api_endpoint.url}}", "api_key_var_name": "{{target.api_endpoint.api_key_name}}", "max_tokens": {{config.params.max_new_tokens}}, "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %} "top_p": {{config.params.top_p}},{% endif %} "retry": {{config.params.max_retries}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %}, "wait": 
{{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %}, "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %}, "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %}, "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %}, "verbose": {{config.params.extra.verbose}}{% endif %} } }, "data": { "{{config.params.extra.dataset.name}}": { "class": "{{config.params.extra.dataset.class}}", "dataset": "{{config.params.extra.dataset.name}}", "model": "{{target.api_endpoint.model_id}}" } } } EOF python -m vlmeval.run \ --config {{config.output_dir}}/vlmeval_config.json \ --work-dir {{config.output_dir}} \ --api-nproc {{config.params.parallelism}} \ {%- if config.params.extra.judge is defined %} --judge {{config.params.extra.judge.model}} \ --judge-args '{{config.params.extra.judge.args}}' \ {%- endif %} {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: vlmevalkit pkg_name: vlmevalkit config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 4 temperature: 0.0 request_timeout: 60 extra: dataset: name: MathVista_MINI class: MathVista judge: model: gpt-4o args: '{"use_azure": true}' supported_endpoint_types: - vlm type: mathvista-mini target: api_endpoint: {} ``` ::: :::: --- (vlmevalkit-mmmu-judge)= ## mmmu_judge A benchmark for evaluating multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. 
::::{tab-set} :::{tab-item} Container **Harness:** `vlmevalkit` **Container:** ``` nvcr.io/nvidia/eval-factory/vlmevalkit:26.01 ``` **Container Digest:** ``` sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc ``` **Container Arch:** `amd` **Task Type:** `mmmu_judge` ::: :::{tab-item} Command ```bash cat > {{config.output_dir}}/vlmeval_config.json << 'EOF' { "model": { "{{target.api_endpoint.model_id.split('/')[-1]}}": { "class": "CustomOAIEndpoint", "model": "{{target.api_endpoint.model_id}}", "api_base": "{{target.api_endpoint.url}}", "api_key_var_name": "{{target.api_endpoint.api_key_name}}", "max_tokens": {{config.params.max_new_tokens}}, "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %} "top_p": {{config.params.top_p}},{% endif %} "retry": {{config.params.max_retries}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %}, "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %}, "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %}, "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %}, "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %}, "verbose": {{config.params.extra.verbose}}{% endif %} } }, "data": { "{{config.params.extra.dataset.name}}": { "class": "{{config.params.extra.dataset.class}}", "dataset": "{{config.params.extra.dataset.name}}", "model": "{{target.api_endpoint.model_id}}" } } } EOF python -m vlmeval.run \ --config {{config.output_dir}}/vlmeval_config.json \ --work-dir {{config.output_dir}} \ --api-nproc {{config.params.parallelism}} \ {%- if config.params.extra.judge is defined %} --judge {{config.params.extra.judge.model}} \ --judge-args '{{config.params.extra.judge.args}}' \ {%- endif %} {% if config.params.limit_samples is not none 
%}--first-n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: vlmevalkit pkg_name: vlmevalkit config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 4 temperature: 0.0 request_timeout: 60 extra: dataset: name: MMMU_DEV_VAL class: MMMUDataset judge: model: gpt-4o args: '{"use_azure": true}' supported_endpoint_types: - vlm type: mmmu_judge target: api_endpoint: {} ``` ::: :::: --- (vlmevalkit-ocr-reasoning)= ## ocr_reasoning Comprehensive benchmark of 1,069 human-annotated examples designed to evaluate multimodal large language models on text-rich image reasoning tasks by assessing both final answers and the reasoning process across six core abilities and 18 practical tasks. ::::{tab-set} :::{tab-item} Container **Harness:** `vlmevalkit` **Container:** ``` nvcr.io/nvidia/eval-factory/vlmevalkit:26.01 ``` **Container Digest:** ``` sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc ``` **Container Arch:** `amd` **Task Type:** `ocr_reasoning` ::: :::{tab-item} Command ```bash cat > {{config.output_dir}}/vlmeval_config.json << 'EOF' { "model": { "{{target.api_endpoint.model_id.split('/')[-1]}}": { "class": "CustomOAIEndpoint", "model": "{{target.api_endpoint.model_id}}", "api_base": "{{target.api_endpoint.url}}", "api_key_var_name": "{{target.api_endpoint.api_key_name}}", "max_tokens": {{config.params.max_new_tokens}}, "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %} "top_p": {{config.params.top_p}},{% endif %} "retry": {{config.params.max_retries}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %}, "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %}, "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %}, "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %}, 
"system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %}, "verbose": {{config.params.extra.verbose}}{% endif %} } }, "data": { "{{config.params.extra.dataset.name}}": { "class": "{{config.params.extra.dataset.class}}", "dataset": "{{config.params.extra.dataset.name}}", "model": "{{target.api_endpoint.model_id}}" } } } EOF python -m vlmeval.run \ --config {{config.output_dir}}/vlmeval_config.json \ --work-dir {{config.output_dir}} \ --api-nproc {{config.params.parallelism}} \ {%- if config.params.extra.judge is defined %} --judge {{config.params.extra.judge.model}} \ --judge-args '{{config.params.extra.judge.args}}' \ {%- endif %} {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: vlmevalkit pkg_name: vlmevalkit config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 4 temperature: 0.0 request_timeout: 60 extra: dataset: name: OCR_Reasoning class: OCR_Reasoning judge: model: gpt-4o args: '{"use_azure": true}' supported_endpoint_types: - vlm type: ocr_reasoning target: api_endpoint: {} ``` ::: :::: --- (vlmevalkit-ocrbench)= ## ocrbench Comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models ::::{tab-set} :::{tab-item} Container **Harness:** `vlmevalkit` **Container:** ``` nvcr.io/nvidia/eval-factory/vlmevalkit:26.01 ``` **Container Digest:** ``` sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc ``` **Container Arch:** `amd` **Task Type:** `ocrbench` ::: :::{tab-item} Command ```bash cat > {{config.output_dir}}/vlmeval_config.json << 'EOF' { "model": { "{{target.api_endpoint.model_id.split('/')[-1]}}": { "class": "CustomOAIEndpoint", "model": "{{target.api_endpoint.model_id}}", "api_base": "{{target.api_endpoint.url}}", "api_key_var_name": "{{target.api_endpoint.api_key_name}}", "max_tokens": {{config.params.max_new_tokens}}, 
"temperature": {{config.params.temperature}},{% if config.params.top_p is not none %} "top_p": {{config.params.top_p}},{% endif %} "retry": {{config.params.max_retries}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %}, "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %}, "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %}, "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %}, "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %}, "verbose": {{config.params.extra.verbose}}{% endif %} } }, "data": { "{{config.params.extra.dataset.name}}": { "class": "{{config.params.extra.dataset.class}}", "dataset": "{{config.params.extra.dataset.name}}", "model": "{{target.api_endpoint.model_id}}" } } } EOF python -m vlmeval.run \ --config {{config.output_dir}}/vlmeval_config.json \ --work-dir {{config.output_dir}} \ --api-nproc {{config.params.parallelism}} \ {%- if config.params.extra.judge is defined %} --judge {{config.params.extra.judge.model}} \ --judge-args '{{config.params.extra.judge.args}}' \ {%- endif %} {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: vlmevalkit pkg_name: vlmevalkit config: params: max_new_tokens: 2048 max_retries: 5 parallelism: 4 temperature: 0.0 request_timeout: 60 extra: dataset: name: OCRBench class: OCRBench supported_endpoint_types: - vlm type: ocrbench target: api_endpoint: {} ``` ::: :::: --- (vlmevalkit-slidevqa)= ## slidevqa Evaluates ability to answer questions about slide decks by selecting relevant slides from multiple images ::::{tab-set} :::{tab-item} Container **Harness:** `vlmevalkit` **Container:** ``` nvcr.io/nvidia/eval-factory/vlmevalkit:26.01 ``` **Container Digest:** 
``` sha256:24c650c547cfd666bcc5ec822c996eb90e89e4964a1d4ec29e4d01d8bd3a22dc ``` **Container Arch:** `amd` **Task Type:** `slidevqa` ::: :::{tab-item} Command ```bash cat > {{config.output_dir}}/vlmeval_config.json << 'EOF' { "model": { "{{target.api_endpoint.model_id.split('/')[-1]}}": { "class": "CustomOAIEndpoint", "model": "{{target.api_endpoint.model_id}}", "api_base": "{{target.api_endpoint.url}}", "api_key_var_name": "{{target.api_endpoint.api_key_name}}", "max_tokens": {{config.params.max_new_tokens}}, "temperature": {{config.params.temperature}},{% if config.params.top_p is not none %} "top_p": {{config.params.top_p}},{% endif %} "retry": {{config.params.max_retries}}, "timeout": {{config.params.request_timeout}}{% if config.params.extra.wait is defined %}, "wait": {{config.params.extra.wait}}{% endif %}{% if config.params.extra.img_size is defined %}, "img_size": {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail is defined %}, "img_detail": "{{config.params.extra.img_detail}}"{% endif %}{% if config.params.extra.system_prompt is defined %}, "system_prompt": "{{config.params.extra.system_prompt}}"{% endif %}{% if config.params.extra.verbose is defined %}, "verbose": {{config.params.extra.verbose}}{% endif %} } }, "data": { "{{config.params.extra.dataset.name}}": { "class": "{{config.params.extra.dataset.class}}", "dataset": "{{config.params.extra.dataset.name}}", "model": "{{target.api_endpoint.model_id}}" } } } EOF python -m vlmeval.run \ --config {{config.output_dir}}/vlmeval_config.json \ --work-dir {{config.output_dir}} \ --api-nproc {{config.params.parallelism}} \ {%- if config.params.extra.judge is defined %} --judge {{config.params.extra.judge.model}} \ --judge-args '{{config.params.extra.judge.args}}' \ {%- endif %} {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{% endif %} ``` ::: :::{tab-item} Defaults ```yaml framework_name: vlmevalkit pkg_name: vlmevalkit config: params: 
    max_new_tokens: 2048
    max_retries: 5
    parallelism: 4
    temperature: 0.0
    request_timeout: 60
    extra:
      dataset:
        name: SLIDEVQA
        class: SlideVQA
      judge:
        model: gpt-4o
        args: '{"use_azure": true}'
  supported_endpoint_types:
  - vlm
  type: slidevqa
target:
  api_endpoint: {}
```

:::

::::

(benchmarks-full-list)=

# Available Benchmarks

```{include} all/benchmarks-table.md
```

# Benchmark Catalog

Comprehensive catalog of hundreds of benchmarks across popular evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform.

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` About Selecting Benchmarks
:link: eval-benchmarks
:link-type: ref
Browse benchmark categories and choose the ones best suited for your model and use case.
:::

:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Available Benchmarks
:link: benchmarks-full-list
:link-type: ref
Detailed descriptions of all available tasks, grouped by evaluation harness.
:::

::::

:::{toctree}
:caption: Benchmark Catalog
:hidden:

About Selecting Benchmarks
Available Benchmarks
:::

(eval-custom-tasks)=

# Tasks Not Explicitly Defined in the Framework Definition File

## Introduction

NeMo Evaluator provides a unified interface and a curated set of pre-defined task configurations for launching evaluations. These task configurations are specified in the [Framework Definition File (FDF)](../about/concepts/framework-definition-file.md) to provide a simple and standardized way of running evaluations with minimal user-provided input.

However, you can also evaluate your model on a task that is not explicitly included in the FDF. To do so, you must specify your task as `"<harness_name>.<task_name>"`, where the task name originates from the underlying evaluation harness, and ensure that all task parameters (for example, sampling parameters and few-shot settings) are specified correctly.
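For example, a task named `mmlu` that is defined by LM Evaluation Harness is addressed by joining the harness name and the harness-side task name with a dot. A minimal sketch (the resulting string is what you pass as the evaluation `type`):

```python
# Fully qualified task identifier for a task not listed in the FDF.
# Format: "<harness_name>.<task_name>" -- the task name must match a task
# defined by the underlying evaluation harness.
harness_name = "lm-evaluation-harness"
task_name = "mmlu"

task_type = f"{harness_name}.{task_name}"
print(task_type)  # -> lm-evaluation-harness.mmlu

# Used as the evaluation type, e.g.:
# EvaluationConfig(type=task_type, ...)
```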
Additionally, you need to determine which [endpoint type](../deployment/bring-your-own-endpoint/testing-endpoint-oai-compatibility.md) is appropriate for the task.

## Run Evaluation

In this example, we use the [PolEmo 2.0](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/polemo2) task from LM Evaluation Harness. This task consists of consumer reviews in Polish and assesses sentiment analysis abilities. It requires a "completions" endpoint and has its sampling parameters defined as part of the [task configuration](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/polemo2/polemo2_in.yaml) in the underlying harness.

:::{note}
Review the task configuration in the underlying harness and verify that the sampling parameters are defined and match your preferred way of running the benchmark. You can configure the evaluation using the `params` field in the `EvaluationConfig`.
:::

### 1. Prepare the Environment

Start the `lm-evaluation-harness` Docker container:

```bash
docker run --rm -it nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }}
```

Alternatively, install the `nemo-evaluator` and `nvidia-lm-eval` Python packages in your environment of choice:

```bash
pip install nemo-evaluator nvidia-lm-eval
```

### 2. Run the Evaluation

```{literalinclude} _snippets/polemo2.py
:language: python
:start-after: "## Run the evaluation"
```

# Add Evaluation Packages to NeMo Framework

The NeMo Framework Docker image comes with [nvidia-lm-eval](https://pypi.org/project/nvidia-lm-eval/) pre-installed. However, you can add more evaluation methods by installing additional NeMo Evaluator packages.

For each package, follow these steps:

1. Install the required package.
2.
Deploy your model:

   ```{literalinclude} ../get-started/_snippets/deploy.sh
   :language: shell
   :start-after: "## Deploy"
   ```

   Wait for the server to start and become ready to accept requests:

   ```python
   from nemo_evaluator.api import check_endpoint

   check_endpoint(
       endpoint_url="http://0.0.0.0:8080/v1/completions/",
       endpoint_type="completions",
       model_name="megatron_model",
   )
   ```

   Make sure to open two separate terminals within the same container: one for the deployment and one for the evaluation.

3. (Optional) Export the required environment variables.
4. Run the evaluation of your choice.

The examples below show how to enable and launch evaluations for different packages.

:::{tip}
All examples below use only a subset of samples. To run the evaluation on the whole dataset, remove the `limit_samples` parameter.
:::

## Enable On-Demand Evaluation Packages

:::{note}
If multiple harnesses are installed in your environment and they define a task with the same name, you must use the `<harness_name>.<task_name>` format to avoid ambiguity. For example:

```python
eval_config = EvaluationConfig(type="lm-evaluation-harness.mmlu")
eval_config = EvaluationConfig(type="simple_evals.mmlu")
```
:::

::::{tab-set}

:::{tab-item} BFCL

1. Install the [nvidia-bfcl](https://pypi.org/project/nvidia-bfcl/) package:

   ```bash
   pip install nvidia-bfcl
   ```

2. Run the evaluation:

   ```{literalinclude} _snippets/bfcl.py
   :language: python
   :start-after: "## Run the evaluation"
   ```

:::

:::{tab-item} garak

1. Install the [nvidia-eval-factory-garak](https://pypi.org/project/nvidia-eval-factory-garak/) package:

   ```bash
   pip install nvidia-eval-factory-garak
   ```

2. Run the evaluation:

   ```{literalinclude} _snippets/garak.py
   :language: python
   :start-after: "## Run the evaluation"
   ```

:::

:::{tab-item} BigCode

1. Install the [nvidia-bigcode-eval](https://pypi.org/project/nvidia-bigcode-eval/) package:

   ```bash
   pip install nvidia-bigcode-eval
   ```

2.
Run the evaluation:

   ```{literalinclude} _snippets/bigcode.py
   :language: python
   :start-after: "## Run the evaluation"
   ```

:::

:::{tab-item} simple-evals

1. Install the [nvidia-simple-evals](https://pypi.org/project/nvidia-simple-evals/) package:

   ```bash
   pip install nvidia-simple-evals
   ```

   In the example below, we use the `AIME_2025` task, which follows the LLM-as-a-judge approach for checking output correctness. By default, the [Llama 3.3 70B](https://build.nvidia.com/meta/llama-3_3-70b-instruct) NVIDIA NIM is used as the judge.

2. To run the evaluation, set your [build.nvidia.com](https://build.nvidia.com/) API key as the `JUDGE_API_KEY` variable:

   ```bash
   export JUDGE_API_KEY=...
   ```

   To customize the judge settings, see the instructions for the [NVIDIA Eval Factory package](https://pypi.org/project/nvidia-simple-evals/).

3. Run the evaluation:

   ```{literalinclude} _snippets/simple_evals.py
   :language: python
   :start-after: "## Run the evaluation"
   ```

:::

:::{tab-item} safety-harness

1. Install the [nvidia-safety-harness](https://pypi.org/project/nvidia-safety-harness/) package:

   ```bash
   pip install nvidia-safety-harness
   ```

2. Deploy the judge model.

   In the example below, we use the `aegis_v2` task, which requires the [Llama 3.1 NemoGuard 8B ContentSafety](https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-contentsafety/latest/getting-started.html) model to assess your model's responses. The model is available through NVIDIA NIM. See the [instructions](https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-contentsafety/latest/getting-started.html) for deploying the judge model.

   If you set up a gated judge endpoint, you must export your API key as the `JUDGE_API_KEY` variable:

   ```bash
   export JUDGE_API_KEY=...
   ```

3. To access the evaluation dataset, you must authenticate with the [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/quick-start#authentication).
4.
Run the evaluation:

   ```{literalinclude} _snippets/safety.py
   :language: python
   :start-after: "## Run the evaluation"
   ```

   Make sure to modify the judge configuration in the provided snippet to match your Llama 3.1 NemoGuard 8B ContentSafety endpoint.

:::

::::

---
orphan: true
---

(eval-parameters)=

# Evaluation Configuration Parameters

Comprehensive reference for configuring evaluation tasks in {{ product_name_short }}, covering universal parameters, framework-specific settings, and optimization patterns.

:::{admonition} Quick Navigation
:class: info

**Looking for task-specific guides?**

- {ref}`text-gen` - Text generation evaluation
- {ref}`logprobs` - Log-probability evaluation
- {ref}`code-generation` - Code generation evaluation

**Looking for available benchmarks?**

- {ref}`eval-benchmarks` - Browse available benchmarks by category

**Need help getting started?**

- {ref}`evaluation-overview` - Overview of evaluation workflows
- {ref}`eval-run` - Step-by-step evaluation guides
:::

## Overview

All evaluation tasks in {{ product_name_short }} use the `ConfigParams` class for configuration. This provides a consistent interface across different evaluation harnesses while allowing framework-specific customization through the `extra` parameter.

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

# Basic configuration
params = ConfigParams(
    temperature=0,
    top_p=1.0,
    max_new_tokens=256,
    limit_samples=100
)

# Advanced configuration with framework-specific parameters
params = ConfigParams(
    temperature=0,
    parallelism=8,
    extra={
        "num_fewshot": 5,
        "tokenizer": "/path/to/tokenizer",
        "custom_prompt": "Answer the question:"
    }
)
```

## Universal Parameters

These parameters are available for all evaluation tasks regardless of the underlying harness or benchmark.
### Core Generation Parameters ```{list-table} :header-rows: 1 :widths: 15 10 30 25 20 * - Parameter - Type - Description - Example Values - Notes * - `temperature` - `float` - Sampling randomness - `0` (deterministic), `0.7` (creative) - Use `0` for reproducible results * - `top_p` - `float` - Nucleus sampling threshold - `1.0` (disabled), `0.9` (selective) - Controls diversity of generated text * - `max_new_tokens` - `int` - Maximum response length - `256`, `512`, `1024` - Limits generation length ``` ### Evaluation Control Parameters ```{list-table} :header-rows: 1 :widths: 15 10 30 25 20 * - Parameter - Type - Description - Example Values - Notes * - `limit_samples` - `int/float` - Evaluation subset size - `100` (count), `0.1` (10% of dataset) - Use for quick testing or resource limits * - `task` - `str` - Task-specific identifier - `"custom_task"` - Used by some harnesses for task routing ``` ### Performance Parameters ```{list-table} :header-rows: 1 :widths: 15 10 30 25 20 * - Parameter - Type - Description - Example Values - Notes * - `parallelism` - `int` - Concurrent request threads - `1`, `8`, `16` - Balance against server capacity * - `max_retries` - `int` - Retry attempts for failed requests - `3`, `5`, `10` - Increases robustness for network issues * - `request_timeout` - `int` - Request timeout (seconds) - `60`, `120`, `300` - Adjust for model response time ``` ## Framework-Specific Parameters Framework-specific parameters are passed through the `extra` dictionary within `ConfigParams`. 
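To make the split between the two layers concrete, the sketch below mimics `ConfigParams` with a plain dataclass (a stand-in so the snippet runs without `nemo-evaluator` installed): universal settings are typed top-level fields, while harness-specific options travel untouched inside `extra` until the selected harness consumes them.

```python
from dataclasses import dataclass, field

# Stand-in for nemo_evaluator.api.api_dataclasses.ConfigParams, used here
# only to illustrate how universal and framework-specific settings combine.
@dataclass
class ConfigParamsSketch:
    temperature: float = 0.0
    top_p: float = 1.0
    max_new_tokens: int = 256
    parallelism: int = 1
    extra: dict = field(default_factory=dict)

params = ConfigParamsSketch(
    temperature=0.0,      # universal: deterministic sampling
    max_new_tokens=256,   # universal: response length cap
    parallelism=8,        # universal: concurrent request threads
    extra={               # framework-specific (LM-Evaluation-Harness keys)
        "num_fewshot": 5,
        "fewshot_seed": 42,
    },
)

# The harness receives the `extra` keys verbatim.
print(params.extra["num_fewshot"])  # -> 5
```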
::::{dropdown} LM-Evaluation-Harness Parameters :icon: code-square ```{list-table} :header-rows: 1 :widths: 15 10 30 25 20 * - Parameter - Type - Description - Example Values - Use Cases * - `num_fewshot` - `int` - Few-shot examples count - `0`, `5`, `25` - Academic benchmarks * - `tokenizer` - `str` - Tokenizer path - `"/path/to/tokenizer"` - Log-probability tasks * - `tokenizer_backend` - `str` - Tokenizer implementation - `"huggingface"`, `"sentencepiece"` - Custom tokenizer setups * - `trust_remote_code` - `bool` - Allow remote code execution - `True`, `False` - For custom tokenizers * - `add_bos_token` - `bool` - Add beginning-of-sequence token - `True`, `False` - Model-specific formatting * - `add_eos_token` - `bool` - Add end-of-sequence token - `True`, `False` - Model-specific formatting * - `fewshot_delimiter` - `str` - Separator between examples - `"\\n\\n"`, `"\\n---\\n"` - Custom prompt formatting * - `fewshot_seed` - `int` - Reproducible example selection - `42`, `1337` - Ensures consistent few-shot examples * - `description` - `str` - Custom prompt prefix - `"Answer the question:"` - Task-specific instructions * - `bootstrap_iters` - `int` - Statistical bootstrap iterations - `1000`, `10000` - For confidence intervals ``` :::: ::::{dropdown} Simple-Evals Parameters :icon: code-square ```{list-table} :header-rows: 1 :widths: 15 10 30 25 20 * - Parameter - Type - Description - Example Values - Use Cases * - `pass_at_k` - `list[int]` - Code evaluation metrics - `[1, 5, 10]` - Code generation tasks * - `timeout` - `int` - Code execution timeout - `5`, `10`, `30` - Code generation tasks * - `max_workers` - `int` - Parallel execution workers - `4`, `8`, `16` - Code execution parallelism * - `languages` - `list[str]` - Target programming languages - `["python", "java", "cpp"]` - Multi-language evaluation ``` :::: ::::{dropdown} BigCode-Evaluation-Harness Parameters :icon: code-square ```{list-table} :header-rows: 1 :widths: 15 10 30 25 20 * - Parameter - 
Type - Description - Example Values - Use Cases * - `num_workers` - `int` - Parallel execution workers - `4`, `8`, `16` - Code execution parallelism * - `eval_metric` - `str` - Evaluation metric - `"pass_at_k"`, `"bleu"` - Different scoring methods * - `languages` - `list[str]` - Programming languages - `["python", "javascript"]` - Language-specific evaluation ``` :::: ::::{dropdown} Safety and Specialized Harnesses :icon: code-square ```{list-table} :header-rows: 1 :widths: 15 10 30 25 20 * - Parameter - Type - Description - Example Values - Use Cases * - `probes` - `str` - Garak security probes - `"ansiescape.AnsiEscaped"` - Security evaluation * - `detectors` - `str` - Garak security detectors - `"base.TriggerListDetector"` - Security evaluation * - `generations` - `int` - Number of generations per prompt - `1`, `5`, `10` - Safety evaluation ``` :::: ## Configuration Patterns ::::{dropdown} Academic Benchmarks (Deterministic) :icon: code-square ```python academic_params = ConfigParams( temperature=0.01, # Near-deterministic generation (0.0 not supported by all endpoints) top_p=1.0, # No nucleus sampling max_new_tokens=256, # Moderate response length limit_samples=None, # Full dataset evaluation parallelism=4, # Conservative parallelism extra={ "num_fewshot": 5, # Standard few-shot count "fewshot_seed": 42 # Reproducible examples } ) ``` :::: ::::{dropdown} Creative Tasks (Controlled Randomness) :icon: code-square ```python creative_params = ConfigParams( temperature=0.7, # Moderate creativity top_p=0.9, # Nucleus sampling max_new_tokens=512, # Longer responses extra={ "repetition_penalty": 1.1, # Reduce repetition "do_sample": True # Enable sampling } ) ``` :::: ::::{dropdown} Code Generation (Balanced) :icon: code-square ```python code_params = ConfigParams( temperature=0.2, # Slight randomness for diversity top_p=0.95, # Selective sampling max_new_tokens=1024, # Sufficient for code solutions extra={ "pass_at_k": [1, 5, 10], # Multiple success metrics 
"timeout": 10, # Code execution timeout "stop_sequences": ["```", "\\n\\n"] # Code block terminators } ) ``` :::: ::::{dropdown} Log-Probability Tasks :icon: code-square ```python logprob_params = ConfigParams( # No generation parameters needed for log-probability tasks limit_samples=100, # Quick testing extra={ "tokenizer_backend": "huggingface", "tokenizer": "/path/to/nemo_tokenizer", "trust_remote_code": True } ) ``` :::: ::::{dropdown} High-Throughput Evaluation :icon: code-square ```python performance_params = ConfigParams( temperature=0.01, # Near-deterministic for speed parallelism=16, # High concurrency max_retries=5, # Robust retry policy request_timeout=120, # Generous timeout limit_samples=0.1, # 10% sample for testing extra={ "batch_size": 8, # Batch requests if supported "cache_requests": True # Enable caching } ) ``` :::: ## Parameter Selection Guidelines ### By Evaluation Type **Text Generation Tasks**: - Use `temperature=0.01` for near-deterministic, reproducible results (most endpoints don't support exactly 0.0) - Set appropriate `max_new_tokens` based on expected response length - Configure `parallelism` based on server capacity **Log-Probability Tasks**: - Always specify `tokenizer` and `tokenizer_backend` in `extra` - Generation parameters (temperature, top_p) are not used - Focus on tokenizer configuration accuracy **Code Generation Tasks**: - Use moderate `temperature` (0.1-0.3) for diversity without randomness - Set higher `max_new_tokens` (1024+) for complete solutions - Configure `timeout` and `pass_at_k` in `extra` **Safety Evaluation**: - Use appropriate `probes` and `detectors` in `extra` - Consider multiple `generations` per prompt - Use chat endpoints for instruction-following safety tests ### By Resource Constraints **Limited Compute**: - Reduce `parallelism` to 1-4 - Use `limit_samples` for subset evaluation - Increase `request_timeout` for slower responses **High-Performance Clusters**: - Increase `parallelism` to 16-32 - Enable 
request batching in `extra` if supported - Use full dataset evaluation (`limit_samples=None`) **Development/Testing**: - Use `limit_samples=10-100` for quick validation - Set `temperature=0.01` for consistent results - Enable verbose logging in `extra` if available ## Common Configuration Errors ### Tokenizer Issues :::{admonition} Problem :class: error Missing tokenizer for log-probability tasks ```python # Incorrect - missing tokenizer params = ConfigParams(extra={}) ``` ::: :::{admonition} Solution :class: tip Always specify tokenizer for log-probability tasks ```python # Correct params = ConfigParams( extra={ "tokenizer_backend": "huggingface", "tokenizer": "/path/to/nemo_tokenizer" } ) ``` ::: ### Performance Issues :::{admonition} Problem :class: error Excessive parallelism overwhelming server ```python # Incorrect - too many concurrent requests params = ConfigParams(parallelism=100) ``` ::: :::{admonition} Solution :class: tip Start conservative and scale up ```python # Correct - reasonable concurrency params = ConfigParams(parallelism=8, max_retries=3) ``` ::: ### Parameter Conflicts :::{admonition} Problem :class: error Mixing generation and log-probability parameters ```python # Incorrect - generation params unused for log-probability params = ConfigParams( temperature=0.7, # Ignored for log-probability tasks extra={"tokenizer": "/path"} ) ``` ::: :::{admonition} Solution :class: tip Use appropriate parameters for task type ```python # Correct - only relevant parameters params = ConfigParams( limit_samples=100, # Relevant for all tasks extra={"tokenizer": "/path"} # Required for log-probability ) ``` ::: ## Best Practices ### Development Workflow 1. **Start Small**: Use `limit_samples=10` for initial validation 2. **Test Configuration**: Verify parameters work before full evaluation 3. **Monitor Resources**: Check memory and compute usage during evaluation 4. 
**Document Settings**: Record successful configurations for reproducibility ### Production Evaluation 1. **Deterministic Settings**: Use `temperature=0.01` for consistent results 2. **Full Datasets**: Remove `limit_samples` for complete evaluation 3. **Robust Configuration**: Set appropriate retries and timeouts 4. **Resource Planning**: Scale `parallelism` based on available infrastructure ### Parameter Tuning 1. **Task-Appropriate**: Match parameters to evaluation methodology 2. **Incremental Changes**: Adjust one parameter at a time 3. **Baseline Comparison**: Compare against known good configurations 4. **Performance Monitoring**: Track evaluation speed and resource usage ## Next Steps - **Basic Usage**: See {ref}`text-gen` for getting started - **Custom Tasks**: Learn {ref}`eval-custom-tasks` for specialized evaluations - **Troubleshooting**: Refer to {ref}`troubleshooting-index` for common issues - **Benchmarks**: Browse {ref}`eval-benchmarks` for task-specific recommendations --- orphan: true --- (code-generation)= # Code Generation Evaluation Evaluate programming capabilities through code generation, completion, and algorithmic problem solving using the BigCode evaluation harness. ## Overview Code generation evaluation assesses a model's ability to: - **Generate Code**: Write complete functions from natural language descriptions - **Code Completion**: Fill in missing code segments - **Algorithm Implementation**: Solve programming challenges and competitive programming problems ## Before You Start Ensure you have: - **Model Endpoint**: An OpenAI-compatible endpoint for your model - **API Access**: Valid API key for your model endpoint - **Sufficient Context**: Models with adequate context length for code problems ### Pre-Flight Check Verify your setup before running code evaluation: {ref}`deployment-testing-compatibility`. 
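If you prefer a self-contained check, the sketch below sends a single request to an OpenAI-compatible chat endpoint using only the Python standard library. The URL, model ID, and environment variable mirror the examples in this guide but are placeholders; `check_endpoint` is an illustrative helper, not part of NeMo Evaluator.

```python
import json
import os
import urllib.request


def check_endpoint(url: str, model_id: str, api_key: str) -> bool:
    """Send one chat completion request and report whether it succeeds."""
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": "Say OK."}],
        "max_tokens": 8,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = json.load(response)
            # A well-formed OpenAI-compatible reply carries a non-empty "choices" list.
            return bool(body.get("choices"))
    except Exception as err:  # connection errors, HTTP errors, malformed JSON, ...
        print(f"Endpoint check failed: {err}")
        return False


# Example (placeholder endpoint and key):
# check_endpoint(
#     "https://integrate.api.nvidia.com/v1/chat/completions",
#     "meta/llama-3.2-3b-instruct",
#     os.environ.get("NGC_API_KEY", ""),
# )
```

A `False` result (with the printed reason) usually points at a wrong URL, a missing API key, or an endpoint that is not yet serving.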
## Choose Your Approach ::::{tab-set} :::{tab-item} NeMo Evaluator Launcher :sync: launcher **Recommended** - The fastest way to run code generation evaluations with unified CLI: ```bash # List available code generation tasks nemo-evaluator-launcher ls tasks | grep -E "(mbpp|humaneval)" # Run MBPP evaluation nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o 'evaluation.tasks=["mbpp"]' \ -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o target.api_endpoint.api_key=${YOUR_API_KEY} # Run multiple code generation benchmarks nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o 'evaluation.tasks=["mbpp", "humaneval"]' ``` ::: :::{tab-item} Core API :sync: api For programmatic evaluation in custom workflows: ```python from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import ( EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType ) # Configure code generation evaluation eval_config = EvaluationConfig( type="mbpp", output_dir="./results", params=ConfigParams( limit_samples=10, # Remove for full dataset temperature=0.2, # Low temperature for consistent code max_new_tokens=1024, # Sufficient tokens for complete functions top_p=0.9 ) ) target_config = EvaluationTarget( api_endpoint=ApiEndpoint( url="https://integrate.api.nvidia.com/v1/chat/completions", model_id="meta/llama-3.2-3b-instruct", type=EndpointType.CHAT, api_key="your_api_key" ) ) result = evaluate(eval_cfg=eval_config, target_cfg=target_config) print(f"Evaluation completed: {result}") ``` ::: :::{tab-item} Containers Directly :sync: containers For specialized container workflows: ```bash # Pull and run BigCode evaluation container docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} bash # Inside container - set environment export MY_API_KEY=your_api_key_here # Run code 
generation evaluation nemo-evaluator run_eval \ --eval_type mbpp \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir /tmp/results \ --overrides 'config.params.limit_samples=10,config.params.temperature=0.2' ``` ::: :::: ## Container Access The BigCode evaluation harness is available through Docker containers. No separate package installation is required: ```bash docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} ``` ## Discovering Available Tasks Use the launcher CLI to discover all available code generation tasks: ```bash # List all available benchmarks nemo-evaluator-launcher ls tasks # Filter for code generation tasks nemo-evaluator-launcher ls tasks | grep -E "(mbpp|humaneval)" ``` ## Available Tasks The BigCode harness provides these programming benchmarks: ```{list-table} :header-rows: 1 :widths: 20 40 20 20 * - Task - Description - Language - Endpoint Type * - `mbpp` - Mostly Basic Programming Problems - Python - chat * - `mbppplus` - Extended MBPP with additional test cases - Python - chat * - `humaneval` - Hand-written programming problems - Python - completions ``` ## Basic Code Generation Evaluation The Mostly Basic Programming Problems (MBPP) benchmark tests fundamental programming skills. Use any of the three approaches above to run MBPP evaluations. ### Understanding Results Code generation evaluations typically report pass@k metrics that indicate what percentage of problems were solved correctly within k attempts.
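The reported pass@k numbers are conventionally computed with the unbiased estimator popularized by the HumanEval benchmark. A minimal sketch (`pass_at_k` is an illustrative helper, not a NeMo Evaluator API):

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions is correct,
    given n total samples per problem, of which c passed the tests."""
    if n - c < k:
        return 1.0  # too few failing samples to draw k all-incorrect completions
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# 10 samples per problem, 3 of which pass: chance a single draw solves it
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Averaging this estimate over all problems in the benchmark yields the pass@k score reported in the results.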
## Advanced Configuration ::::{dropdown} Custom Evaluation Parameters :icon: code-square ```python # Advanced configuration for code generation eval_params = ConfigParams( limit_samples=100, # Evaluate on subset for testing parallelism=4, # Concurrent evaluation requests temperature=0.2, # Low temperature for consistent code max_new_tokens=1024 # Sufficient tokens for complete functions ) eval_config = EvaluationConfig( type="mbpp", output_dir="/results/mbpp_advanced/", params=eval_params ) ``` :::: ::::{dropdown} Multiple Task Evaluation :icon: code-square Evaluate across different code generation benchmarks: ```python from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import ( EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType ) # Configure target endpoint (reused for all tasks) target_config = EvaluationTarget( api_endpoint=ApiEndpoint( url="https://integrate.api.nvidia.com/v1/chat/completions", model_id="meta/llama-3.2-3b-instruct", type=EndpointType.CHAT, api_key="your_api_key" ) ) code_tasks = ["mbpp", "mbppplus"] results = {} for task in code_tasks: eval_config = EvaluationConfig( type=task, output_dir=f"./results/{task}/", params=ConfigParams( limit_samples=50, temperature=0.1, parallelism=2 ) ) results[task] = evaluate( eval_cfg=eval_config, target_cfg=target_config ) ``` :::: ## Understanding Metrics ### Pass@k Interpretation Code generation evaluations typically report pass@k metrics: - **Pass@1**: Percentage of problems solved on the first attempt - **Pass@k**: Percentage of problems solved in k attempts (if multiple samples are generated) --- orphan: true --- (function-calling)= # Function Calling Evaluation Assess tool use capabilities, API calling accuracy, and structured output generation for agent-like behaviors using the Berkeley Function Calling Leaderboard (BFCL). 
## Overview Function calling evaluation measures a model's ability to: - **Tool Discovery**: Identify appropriate functions for given tasks - **Parameter Extraction**: Extract correct parameters from natural language - **API Integration**: Generate proper function calls and handle responses - **Multi-Step Reasoning**: Chain function calls for complex workflows - **Error Handling**: Manage invalid parameters and API failures ## Before You Start Ensure you have: - **Chat Model Endpoint**: Function calling requires chat-formatted OpenAI-compatible endpoints - **API Access**: Valid API key for your model endpoint - **Structured Output Support**: Model capable of generating JSON/function call formats --- ## Choose Your Approach ::::{tab-set} :::{tab-item} NeMo Evaluator Launcher :sync: launcher **Recommended** - The fastest way to run function calling evaluations with unified CLI: ```bash # List available function calling tasks nemo-evaluator-launcher ls tasks | grep -E "(bfcl|function)" # Run BFCL AST prompting evaluation nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o 'evaluation.tasks=["bfclv3_ast_prompting"]' \ -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o target.api_endpoint.api_key=${YOUR_API_KEY} ``` ::: :::{tab-item} Core API :sync: api For programmatic evaluation in custom workflows: ```python from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import ( EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType ) # Configure function calling evaluation eval_config = EvaluationConfig( type="bfclv3_ast_prompting", output_dir="./results", params=ConfigParams( limit_samples=10, # Remove for full dataset temperature=0.1, # Low temperature for precise function calls max_new_tokens=512, # Adequate for function call generation top_p=0.9 ) ) target_config = EvaluationTarget( api_endpoint=ApiEndpoint( 
url="https://integrate.api.nvidia.com/v1/chat/completions", model_id="meta/llama-3.2-3b-instruct", type=EndpointType.CHAT, api_key="your_api_key" ) ) result = evaluate(eval_cfg=eval_config, target_cfg=target_config) print(f"Evaluation completed: {result}") ``` ::: :::{tab-item} Containers Directly :sync: containers For specialized container workflows: ```bash # Pull and run BFCL evaluation container docker run --rm -it nvcr.io/nvidia/eval-factory/bfcl:{{ docker_compose_latest }} bash # Inside container - set environment export MY_API_KEY=your_api_key_here # Run function calling evaluation nemo-evaluator run_eval \ --eval_type bfclv3_ast_prompting \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir /tmp/results \ --overrides 'config.params.limit_samples=10,config.params.temperature=0.1' ``` ::: :::: ## Installation Install the BFCL evaluation package for local development: ```bash pip install nvidia-bfcl==25.7.1 ``` ## Discovering Available Tasks Use the launcher CLI to discover all available function calling tasks: ```bash # List all available benchmarks nemo-evaluator-launcher ls tasks # Filter for function calling tasks nemo-evaluator-launcher ls tasks | grep -E "(bfcl|function)" ``` ## Available Function Calling Tasks BFCL provides comprehensive function calling benchmarks: | Task | Description | Complexity | Format | |------|-------------|------------|---------| | `bfclv3_ast_prompting` | AST-based function calling with structured output | Intermediate | Structured | | `bfclv2_ast_prompting` | BFCL v2 AST-based function calling (legacy) | Intermediate | Structured | ## Basic Function Calling Evaluation The most comprehensive BFCL task is AST-based function calling that evaluates structured function calling. Use any of the three approaches above to run BFCL evaluations. 
### Understanding Function Calling Format BFCL evaluates models on their ability to generate proper function calls: **Input Example**: ```text What's the weather like in San Francisco and New York? Available functions: - get_weather(city: str, units: str = "celsius") -> dict ``` **Expected Output**: ```json [ {"name": "get_weather", "arguments": {"city": "San Francisco"}}, {"name": "get_weather", "arguments": {"city": "New York"}} ] ``` ## Advanced Configuration ### Custom Evaluation Parameters ```python # Optimized settings for function calling eval_params = ConfigParams( limit_samples=100, parallelism=2, # Conservative for complex reasoning temperature=0.1, # Low temperature for precise function calls max_new_tokens=512, # Adequate for function call generation top_p=0.9 # Focused sampling for accuracy ) eval_config = EvaluationConfig( type="bfclv3_ast_prompting", output_dir="/results/bfcl_optimized/", params=eval_params ) ``` ### Multi-Task Function Calling Evaluation Evaluate multiple BFCL versions: ```python function_calling_tasks = [ "bfclv2_ast_prompting", # BFCL v2 "bfclv3_ast_prompting" # BFCL v3 (latest) ] results = {} for task in function_calling_tasks: eval_config = EvaluationConfig( type=task, output_dir=f"/results/{task}/", params=ConfigParams( limit_samples=50, temperature=0.0, # Deterministic for consistency parallelism=1 # Sequential for complex reasoning ) ) result = evaluate( target_cfg=target_config, eval_cfg=eval_config ) results[task] = result # Access metrics from EvaluationResult object print(f"Completed {task} evaluation") print(f"Results: {result}") ``` ## Understanding Metrics ### Results Structure The `evaluate()` function returns an `EvaluationResult` object containing task-level and metric-level results: ```python from nemo_evaluator.core.evaluate import evaluate result = evaluate(eval_cfg=eval_config, target_cfg=target_config) # Access task results if result.tasks: for task_name, task_result in result.tasks.items(): print(f"Task: 
{task_name}") for metric_name, metric_result in task_result.metrics.items(): for score_name, score in metric_result.scores.items(): print(f" {metric_name}.{score_name}: {score.value}") ``` ### Interpreting BFCL Scores BFCL evaluations measure function calling accuracy across various dimensions. The specific metrics depend on the BFCL version and configuration. Check the `results.yml` output file for detailed metric breakdowns. --- *For more function calling tasks and advanced configurations, see the [BFCL package documentation](https://pypi.org/project/nvidia-bfcl/).* (eval-run)= # Evaluation Techniques Follow step-by-step guides for different evaluation scenarios and methodologies in NeMo Evaluator. ## Before You Start Ensure you have: 1. Completed the initial getting started guides for {ref}`gs-install` and {ref}`gs-quickstart`. 2. Prepared your endpoint and API key, or the checkpoint you wish to deploy. 3. Prepared your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) for accessing gated datasets. ## Evaluations Select an evaluation type tailored to your model capabilities. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Text Generation :link: text-gen :link-type: ref Measure model performance through natural language generation for academic benchmarks, reasoning tasks, and general knowledge assessment. ::: :::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Log-Probability :link: logprobs :link-type: ref Assess model confidence and uncertainty using log-probabilities for multiple-choice scenarios without text generation.
::: :::{grid-item-card} {octicon}`comment;1.5em;sd-mr-1` Reasoning :link: run-eval-reasoning :link-type: ref Control the thinking budget and post-process the responses to extract the reasoning content and the final answer ::: :::: :::{toctree} :hidden: Text Generation Log Probability Reasoning ::: (text-gen)= # Text Generation Evaluation Text generation evaluation is the primary method for assessing LLM capabilities where models produce natural language responses to prompts. This approach evaluates the quality, accuracy, and appropriateness of generated text across various tasks and domains. :::{tip} In the example below we use the `gpqa_diamond` benchmark, but the instructions provided apply to all text generation tasks, such as: - `mmlu` - `mmlu_pro` - `ifeval` - `gsm8k` - `mgsm` - `mbpp` ::: ## Before You Start Ensure you have: - **Model Endpoint**: An OpenAI-compatible API endpoint for your model (completions or chat). See {ref}`deployment-testing-compatibility` for snippets you can use to test your endpoint. - **API Access**: Valid API key if your endpoint requires authentication - **Installed Packages**: NeMo Evaluator or access to evaluation containers ## Evaluation Approach In text generation evaluation: 1. **Prompt Construction**: Models receive carefully crafted prompts (questions, instructions, or text to continue) 2. **Response Generation**: Models generate natural language responses using their trained parameters 3. **Response Assessment**: Generated text is evaluated for correctness, quality, or adherence to specific criteria 4. **Metric Calculation**: Numerical scores are computed based on evaluation criteria This differs from **log-probability evaluation** where models assign confidence scores to predefined choices. For log-probability methods, see the {ref}`logprobs`. 
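Schematically, these four steps reduce to a single scoring loop. The sketch below uses hypothetical `generate` and `grade` callables standing in for the model endpoint and the benchmark's grader:

```python
from collections.abc import Callable


def evaluate_generation(
    samples: list[dict],                 # [{"prompt": ..., "reference": ...}, ...]
    generate: Callable[[str], str],      # steps 1-2: prompt -> model response
    grade: Callable[[str, str], bool],   # step 3: response assessed against reference
) -> float:
    """Step 4: aggregate per-sample judgments into a single accuracy score."""
    correct = sum(grade(generate(s["prompt"]), s["reference"]) for s in samples)
    return correct / len(samples)


# Toy run with stub callables:
data = [{"prompt": "2+2=", "reference": "4"}]
print(evaluate_generation(data, generate=lambda p: "4", grade=lambda r, ref: r == ref))  # 1.0
```

In practice NeMo Evaluator drives this loop for you; the sketch only illustrates how the four stages fit together.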
## Use NeMo Evaluator Launcher Use an example config for evaluating the [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model: ```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_basic.yaml :language: yaml :start-after: "[docs-start-snippet]" ``` To launch the evaluation, run: ```bash export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token-here # GPQA is a gated dataset export NGC_API_KEY=nvapi-your-token-here # API Key with access to build.nvidia.com nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml ``` ## Use NeMo Evaluator Start `simple-evals` docker container: ```bash docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} ``` or install `nemo-evaluator` and `nvidia-simple-evals` Python package in your environment of choice: ```bash pip install nemo-evaluator nvidia-simple-evals ``` ### Run with CLI ```bash export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token-here # GPQA is a gated dataset export NGC_API_KEY=nvapi-your-token-here # API Key with access to build.nvidia.com # Run evaluation nemo-evaluator run_eval \ --eval_type gpqa_diamond \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name NGC_API_KEY \ --output_dir ./llama_3_1_8b_instruct_results ``` ### Run with Python API ```python # set env variables before entering Python: # export HF_TOKEN_FOR_GPQA_DIAMOND=hf_your-token-here # GPQA is a gated dataset # export NGC_API_KEY=nvapi-your-token-here # API Key with access to build.nvidia.com from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import ( ApiEndpoint, EvaluationConfig, EvaluationTarget, ConfigParams, EndpointType ) # Configure target endpoint api_endpoint = ApiEndpoint( url="https://integrate.api.nvidia.com/v1/chat/completions", type=EndpointType.CHAT, model_id="meta/llama-3.2-3b-instruct", 
api_key="NGC_API_KEY" # variable name storing the key ) target = EvaluationTarget(api_endpoint=api_endpoint) # Configure evaluation task config = EvaluationConfig( type="gpqa_diamond", output_dir="./llama_3_1_8b_instruct_results" ) # Execute evaluation results = evaluate(target_cfg=target, eval_cfg=config) ``` (logprobs)= # Evaluate LLMs Using Log-Probabilities ## Introduction While the most typical approach to LLM evaluation involves assessing the quality of a model's generated response to a question, an alternative method uses **log-probabilities**. In this approach, we quantify a model's "surprise" or uncertainty when processing a text sequence. This is done by calculating the sum of log-probabilities that the model assigns to each token. A higher sum indicates the model is more confident about the sequence. In this evaluation approach: * The LLM is given a single combined text containing both the question and a potential answer. * Next, the sum of log-probabilities is calculated only for the tokens that belong to the answer. * This allows an assessment of how likely it is that the model would provide that answer for the given question. For multiple-choice scenarios, the answer with the highest sum is treated as the one selected by the model. The sum of log-probabilities can be used to calculate different metrics, such as **perplexity**. Additionally, log-probabilities can be analyzed to assess whether a response would be generated by the model using greedy sampling—a method commonly employed to evaluate **accuracy**. Using log-probabilities is especially useful for evaluating base (pre-trained) models, as it eliminates the need for complex instruction-following and does not require the model to adhere to a specific output format. 
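To make the selection rule concrete, here is a minimal sketch. The per-token log-probabilities are invented for illustration; in a real run they come from the model endpoint, and the helper names are not part of any API:

```python
import math


def score_choice(answer_logprobs: list[float]) -> float:
    """Sum of log-probabilities over the answer tokens: higher = less 'surprised'."""
    return sum(answer_logprobs)


def perplexity(answer_logprobs: list[float]) -> float:
    """Per-token perplexity over the answer, derived from the same log-probabilities."""
    return math.exp(-sum(answer_logprobs) / len(answer_logprobs))


# One question, two candidate answers; the model 'selects' the highest-scoring one.
choices = {
    "A": [-0.2, -0.5, -0.1],  # sum ~ -0.8
    "B": [-1.1, -2.3, -0.9],  # sum ~ -4.3
}
selected = max(choices, key=lambda c: score_choice(choices[c]))
print(selected)  # A
```

For accuracy-style metrics, the selected choice is then compared with the gold answer for each question.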
:::{tip} In the example below we use the `piqa` benchmark, but the instructions provided apply to all `lm-evaluation-harness` tasks utilizing log-probabilities, such as: - arc_challenge - arc_multilingual - bbh - commonsense_qa - hellaswag - hellaswag_multilingual - musr - openbookqa - social_iqa - truthfulqa - winogrande ::: ## Before You Start Ensure you have: - **Completions Endpoint**: Log-probability tasks require completions endpoints (not chat) that support the `logprobs` and `echo` parameters (see {ref}`compatibility-log-probs`) - **Model Tokenizer**: Access to tokenizer files for client-side tokenization (supported types: `huggingface` or `tiktoken`) - **API Access**: Valid API key for your model endpoint if it is gated - **Authentication**: Hugging Face token for gated datasets and tokenizers ## Use NeMo Evaluator Launcher Use an example config for deploying and evaluating the [Meta Llama 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B) model: ```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_vllm_logprobs.yaml :language: yaml :start-after: "[docs-start-snippet]" ``` To launch the evaluation, run: ```bash nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_vllm_logprobs.yaml ``` :::{tip} Set `deployment: none` and provide `target` specification if you want to evaluate an existing endpoint instead: ```yaml defaults: - execution: local - deployment: none - _self_ execution: output_dir: llama_local env_vars: HF_TOKEN: ${oc.env:HF_TOKEN} # needed to access meta-llama/Llama-3.1-8B gated model target: api_endpoint: model_id: meta-llama/Llama-3.1-8B url: https://your-endpoint.com/v1/completions api_key_name: API_KEY # API Key with access to provided url # specify the benchmarks to evaluate evaluation: nemo_evaluator_config: # global config settings that apply to all tasks config: params: extra: # for log-probability tasks like piqa, you need to specify the tokenizer tokenizer:
meta-llama/Llama-3.1-8B # or use a path to locally stored checkpoint tokenizer_backend: huggingface # or "tiktoken" tasks: - name: piqa ``` ::: ## Use NeMo Evaluator Start `lm-evaluation-harness` docker container: ```bash docker run --rm -it nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }} ``` or install `nemo-evaluator` and `nvidia-lm-eval` Python package in your environment of choice: ```bash pip install nemo-evaluator nvidia-lm-eval ``` To launch the evaluation, run the following Python code: ```{literalinclude} ../_snippets/piqa_hf.py :language: python :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` Make sure to provide the source for the tokenizer and a backend for loading it. For models trained with NeMo Framework, the tokenizer is stored inside the checkpoint directory. For the NeMo format it is available inside the `context/nemo_tokenizer` subdirectory: ```python extra={ "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer", "tokenizer_backend": "huggingface", }, ``` For Megatron Bridge checkpoints, the tokenizer is stored under the `tokenizer` subdirectory: ```python extra={ "tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer", "tokenizer_backend": "huggingface", }, ``` ## How it works When the server receives the `logprobs` parameter in a request, it returns the log-probabilities of tokens. When combined with `echo=true`, the model includes the input in its response, along with the corresponding log-probabilities. The received response is then processed on the client (benchmark) side to isolate the log-probabilities corresponding specifically to the answer portion of the input. For this purpose the input is tokenized, which makes it possible to trace which log-probabilities originated from the question and which from the answer. (run-eval-reasoning)= # Evaluation of Reasoning Models Reasoning models require a distinct approach compared to standard language models.
Their outputs are typically longer, may contain dedicated reasoning tokens, and are more susceptible to generating loops or repetitive sequences. Evaluating these models effectively requires custom parameter settings and careful handling of generation constraints. ## Before You Start Ensure you have: - **Model Endpoint**: An OpenAI-compatible API endpoint for your model (completions or chat). See {ref}`deployment-testing-compatibility` for snippets you can use to test your endpoint. - **API Access**: Valid API key if your endpoint requires authentication - **Installed Packages**: NeMo Evaluator or access to evaluation containers ## Recommended Settings ### Generation Settings Below are recommended generation settings for some popular reasoning-optimized models. These configurations are typically documented in each model's **model card**: | Model | Temperature | Top-p | Top-k | |---------------------|-------------|--------|--------| | **NVIDIA Nemotron** | 0.6 | 0.95 | — | | **DeepSeek R1** | 0.6 | 0.95 | — | | **Qwen 230B** | 0.6 | 0.95 | 20 | | **Phi-4 Reasoning** | 0.8 | 0.95 | 50 | ### Token Configuration - `max_new_tokens` must be **significantly increased** for reasoning tasks, as it must cover the length of both the reasoning trace and the final answer. - Check the model card to see settings recommended by the model creators. - It is important to observe whether the specified `max_new_tokens` is enough for the model to finish reasoning.
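With the Core API, the recommended settings above map onto `ConfigParams` roughly as follows. This is a sketch using the NVIDIA Nemotron / DeepSeek R1 row of the table and a generous token budget; treat the exact values as starting points, not fixed requirements:

```python
from nemo_evaluator.api.api_dataclasses import ConfigParams

reasoning_params = ConfigParams(
    temperature=0.6,       # recommended for NVIDIA Nemotron / DeepSeek R1
    top_p=0.95,
    max_new_tokens=32768,  # must cover the reasoning trace plus the final answer
)
```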
:::{tip} You can verify successful reasoning completion in the logs via the {ref}`interceptor-reasoning` Interceptor, for example: ``` [I 2025-12-02T16:14:28.257] Reasoning tracking information reasoning_words=1905 original_content_words=85 updated_content_words=85 reasoning_finished=True reasoning_started=True reasoning_tokens=unknown updated_content_tokens=unknown logger=ResponseReasoningInterceptor request_id=ccff76b2-2b85-4eed-a9d0-2363b533ae58 ``` ::: ## Reasoning Output Formats Reasoning models produce outputs that contain both the **reasoning trace** (the model's step-by-step thinking process) and the **final answer**. The reasoning trace typically includes intermediate thoughts, calculations, and logical steps before arriving at the conclusion. There are two main ways to structure reasoning output: ### 1. Wrapped with reasoning tokens The reasoning trace is enclosed in special tokens within the main content, e.g.: ``` <think> ... </think> ``` Most of the benchmarks expect only the final answer to be present in the model's response. If your model endpoint replies with the reasoning trace present in the main content, it needs to be removed from the assistant messages. You can do this using the {ref}`interceptor-reasoning` Interceptor. The interceptor will remove the reasoning trace from the content and (optionally) track statistics for reasoning traces. :::{note} The `ResponseReasoningInterceptor` is by default configured for the `<think> ... </think>` format. If your model uses these special tokens, you do not need to modify anything in your configuration. ::: ### 2. Returned as `reasoning_content` field in messages output If your model is deployed with, for example, vLLM, SGLang, or NIM, the reasoning part of the model's output is likely returned in the separate `reasoning_content` field in messages output (see [vLLM documentation](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html) and [sglang documentation](https://sgl-project.github.io/advanced_features/separate_reasoning.html)).
In the messages returned by the endpoint, there are: - `reasoning_content`: The reasoning part of the output. - `content`: The content of the final answer. Unlike the first method, this setup does not require any extra response parsing. However, in some benchmarks, errors may appear if the reasoning has not finished and the benchmark does not support empty answers in `content`. #### Enabling reasoning parser in vLLM To enable the `reasoning_content` field in vLLM, you need to pass the `--reasoning-parser` argument to the vLLM server. In NeMo Evaluator Launcher, you can do this via `deployment.extra_args`: ```yaml deployment: hf_model_handle: Qwen/Qwen3-Next-80B-A3B-Thinking extra_args: "--reasoning-parser deepseek_r1" ``` Available reasoning parsers depend on your vLLM version. Common options include `deepseek_r1` for models using the `<think> ... </think>` format. See the [vLLM reasoning outputs documentation](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html) for details. --- ## Control the Reasoning Effort Some models allow turning reasoning on/off or setting its level of effort. There are usually two ways of doing it: - **Special instruction in the system prompt** - **Extra parameters passed to the chat_template** :::{tip} Check the model card and documentation of the deployment of your choice to see how you can control the reasoning effort for your model. If there are several options available, it is recommended to use the dedicated chat template parameters over the system prompt. ::: ### Control reasoning with the system prompt In this example we will use the [NVIDIA-Nemotron-Nano-9B-v2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard) model. This model allows you to control the reasoning effort by including `/think` or `/no_think` in the system prompt, e.g.: ```json { "model": "nvidia/nvidia-nemotron-nano-9b-v2", "messages": [ {"role": "system", "content": "You are a helpful assistant.
/think"}, {"role": "user", "content": "What is 2+2?"} ], "temperature": 0.6, "top_p": 0.95, "max_tokens": 32768 } ``` When launching the evaluation, we can use the {ref}`interceptor-system-messages` Interceptor to add `/think` or `/no_think` to the system prompt. ```yaml config: params: temperature: 0.6 top_p: 0.95 max_new_tokens: 32768 # for reasoning + final answer target: api_endpoint: adapter_config: process_reasoning_traces: true # strips reasoning tokens and collects reasoning stats use_system_prompt: true # turn reasoning on with special system prompt custom_system_prompt: >- "/think" ``` ### Control reasoning with additional parameters In this example we will use the [Granite-3.3-8B-Instruct](https://build.nvidia.com/ibm/granite-3_3-8b-instruct/modelcard) model. Unlike NVIDIA-Nemotron-Nano-9B-v2, this model allows you to turn reasoning on with an additional `thinking` parameter passed to the chat template: ```json { "model": "ibm/granite-3.3-8b-instruct", "messages": [ { "role": "user", "content": "What is 2+2?" } ], "temperature": 0.2, "top_p": 0.7, "max_tokens": 8192, "seed": 42, "stream": true, "chat_template_kwargs": { "thinking": true } } ``` When running the evaluation, use the {ref}`interceptor-payload-modification` Interceptor to add this parameter to benchmarks' requests: ```yaml config: params: temperature: 0.6 top_p: 0.95 max_new_tokens: 32768 # for reasoning + final answer target: api_endpoint: adapter_config: process_reasoning_traces: true params_to_add: chat_template_kwargs: thinking: true ``` ## Benchmarks for Reasoning Reasoning models excel at tasks that require multi-step thinking, logical deduction, and complex problem-solving.
The following benchmark categories are particularly well-suited for evaluating reasoning capabilities: - **CoT tasks**: e.g., AIME, Math, GPQA-diamond - **Coding**: e.g., scicodebench, livecodebench :::{tip} When evaluating your model on a task that does not require step-by-step thinking, consider turning the reasoning off or lowering the thinking budget. ::: ## Full Working Example ### Run the evaluation An example config is available in `packages/nemo-evaluator-launcher/examples/local_reasoning.yaml`: ```{literalinclude} ../../../packages/nemo-evaluator-launcher/examples/local_reasoning.yaml :language: yaml :start-after: "[docs-start-snippet]" ``` To launch the evaluation, run: ```bash export NGC_API_KEY=nvapi-... nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_reasoning.yaml ``` ### Analyze the artifacts NeMo Evaluator produces several artifacts for analysis after evaluation completion. The primary output file is `results.yaml`, which stores the metrics produced by the benchmark (see {ref}`evaluation-output` for more details). The `eval_factory_metrics.json` file provides valuable insights into your model's behavior.
When the reasoning interceptor is enabled, this file contains a `reasoning` key that stores statistics about reasoning traces in your model's responses: ```json "reasoning": { "description": "Reasoning statistics saved during processing", "total_responses": 3672, "responses_with_reasoning": 2860, "reasoning_finished_count": 2860, "reasoning_finished_ratio": 1.0, "reasoning_started_count": 2860, "reasoning_unfinished_count": 0, "avg_reasoning_words": 153.21, "avg_original_content_words": 192.17, "avg_updated_content_words": 38.52, "max_reasoning_words": 806, "max_original_content_words": 863, "max_updated_content_words": 863, "max_reasoning_tokens": null, "avg_reasoning_tokens": null, "max_updated_content_tokens": null, "avg_updated_content_tokens": null, "total_reasoning_words": 561696, "total_original_content_words": 705555, "total_updated_content_words": 140999, "total_reasoning_tokens": 0, "total_updated_content_tokens": 0 }, ``` In the example above, the model used reasoning for 2860 out of 3672 responses (approximately 78%). The matching values for `reasoning_started_count` and `reasoning_finished_count` (and `reasoning_unfinished_count` being 0) indicate that the `max_new_tokens` parameter was set sufficiently high, allowing the model to complete all reasoning traces without truncation. These statistics also enable cost analysis for reasoning operations. While the endpoint in this example does not return reasoning token usage statistics (the `*_tokens` fields are null or zero), you can still analyze computational cost using the word count metrics from the responses. For more information on available artifacts, see {ref}`evaluation-output`. (deployment-overview)= # Serve and Deploy Models Deploy and serve models with NeMo Evaluator's flexible deployment options. Select a deployment strategy that matches your workflow, infrastructure, and requirements. 
## Overview NeMo Evaluator keeps model serving separate from evaluation execution, giving you flexible architectures and scalable workflows. Choose who manages deployment based on your needs. ### Key Concepts - **Model-Evaluation Separation**: Models serve via OpenAI-compatible APIs, evaluations run in containers - **Deployment Responsibility**: Choose who manages the model serving infrastructure - **Multi-Backend Support**: Deploy locally, on HPC clusters, or in the cloud ## Deployment Strategy Guide ### **Launcher-Orchestrated Deployment** (Recommended) Let NeMo Evaluator Launcher handle both model deployment and evaluation orchestration: ```bash # Launcher deploys model AND runs evaluation HOSTNAME=cluster-login-node.com ACCOUNT=my_account OUT_DIR=/absolute/path/on/login/node nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/slurm_vllm_checkpoint_path.yaml \ -o execution.hostname=$HOSTNAME \ -o execution.output_dir=$OUT_DIR \ -o execution.account=$ACCOUNT \ -o deployment.checkpoint_path=/shared/models/llama-3.1-8b-it ``` **When to use:** - You want automated deployment lifecycle management - You prefer integrated monitoring and cleanup - You want the simplest path from model to results **Supported deployment types:** vLLM, NIM, SGLang, TRT-LLM, or no deployment (existing endpoints) :::{seealso} For detailed YAML configuration reference for each deployment type, see the {ref}`configuration-overview` in the NeMo Evaluator Launcher library. 
::: ### **Bring-Your-Own-Endpoint** You handle model deployment, NeMo Evaluator handles evaluation: **Launcher users with existing endpoints:** ```bash # Point launcher to your deployed model URL=http://localhost:8000/v1/chat/completions MODEL=your-model-name nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o target.api_endpoint.url=$URL \ -o target.api_endpoint.model_id=$MODEL ``` **Core library users:** ```python from nemo_evaluator import evaluate, ApiEndpoint, EvaluationTarget, EvaluationConfig api_endpoint = ApiEndpoint(url="http://localhost:8080/v1/completions") target = EvaluationTarget(api_endpoint=api_endpoint) config = EvaluationConfig(type="gsm8k", output_dir="./results") evaluate(target_cfg=target, eval_cfg=config) ``` **When to use:** - You have existing model serving infrastructure - You need custom deployment configurations - You want to deploy once and run many evaluations - You have specific security or compliance requirements ## Available Deployment Types The launcher supports multiple deployment types through Hydra configuration: **vLLM Deployment** ```yaml deployment: type: vllm image: vllm/vllm-openai:latest hf_model_handle: hf-model/handle # HuggingFace ID checkpoint_path: null # or provide a path to the stored checkpoint served_model_name: your-model-name port: 8000 ``` **NIM Deployment** ```yaml deployment: type: nim image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6 served_model_name: meta/llama-3.1-8b-instruct port: 8000 ``` **SGLang Deployment** ```yaml deployment: type: sglang image: lmsysorg/sglang:latest hf_model_handle: hf-model/handle # HuggingFace ID checkpoint_path: null # or provide a path to the stored checkpoint served_model_name: your-model-name port: 8000 ``` **No Deployment** ```yaml deployment: type: none # Use existing endpoint ``` ## Bring-Your-Own-Endpoint Options Choose from these approaches when managing your own deployment: ### Hosted Services - **NVIDIA Build**: Ready-to-use
hosted models with OpenAI-compatible APIs - **OpenAI API**: Direct integration with OpenAI's models - **Other providers**: Any service providing OpenAI-compatible endpoints ### Enterprise Integration - **Kubernetes deployments**: Container orchestration in production environments - **Existing MLOps pipelines**: Integration with current model serving infrastructure - **Custom infrastructure**: Specialized deployment requirements (adapters-client-mode)= # Client Mode The NeMo Evaluator adapter system supports **Client Mode**, where adapters run in-process through a custom httpx transport, providing a simpler alternative to the proxy server architecture. ## Overview | Feature | Server Mode | Client Mode | |---------|------------|-------------| | **Architecture** | Separate proxy server process | In-process via httpx transport | | **Setup** | Automatic server startup/shutdown | Simple client instantiation | | **Use Case** | Framework-driven evaluations | Direct API usage, notebooks | | **Overhead** | Network proxy | Direct in-process execution | | **Debugging** | Separate process | Same process, easier debugging | ## Quick Start ```python from nemo_evaluator.client import NeMoEvaluatorClient from nemo_evaluator.api.api_dataclasses import EndpointModelConfig from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig # Configure model and adapters config = EndpointModelConfig( model_id="my-model", url="https://api.example.com/v1/chat/completions", api_key_name="API_KEY", # Environment variable name adapter_config=AdapterConfig( mode="client", # Use client mode (no server) interceptors=[ InterceptorConfig(name="caching", enabled=True), InterceptorConfig(name="endpoint", enabled=True), ] ), is_base_url=False, # True if URL is base, False for complete endpoint ) # Create client async with NeMoEvaluatorClient(config, output_dir="./output") as client: response = await client.chat_completion( messages=[{"role": "user", "content": "Hello!"}] ) 
print(response) ``` ## Mode Configuration ### Adapter Mode Field The `mode` field in `AdapterConfig` controls whether a server process is spawned: - **`mode="server"`** (default): Spawns adapter server process in `evaluate()` calls - **`mode="client"`**: Skips server spawning, for use with `NeMoEvaluatorClient` When using `NeMoEvaluatorClient` directly, set `mode="client"` to prevent unnecessary server creation if the config is also used in `evaluate()` calls. ## URL Modes Client mode supports two URL configurations via the `is_base_url` flag: ### Base URL Mode (`is_base_url=True`) Use when the URL is a base URL and the client should append paths: ```python config = EndpointModelConfig( url="https://api.example.com/v1", # Base URL is_base_url=True, ... ) # Requests go to: https://api.example.com/v1/chat/completions ``` ### Passthrough Mode (`is_base_url=False`) Use when the URL is the complete endpoint: ```python config = EndpointModelConfig( url="https://api.example.com/v1/chat/completions", # Complete endpoint is_base_url=False, # Default ... 
) # Requests go to: https://api.example.com/v1/chat/completions (as-is) ``` ## API Reference ### Initialization ```python from nemo_evaluator.client import NeMoEvaluatorClient from nemo_evaluator.api.api_dataclasses import EndpointModelConfig client = NeMoEvaluatorClient( endpoint_model_config=EndpointModelConfig( model_id="model-name", url="https://api.example.com/v1/chat/completions", api_key_name="API_KEY", adapter_config=adapter_config, is_base_url=False, temperature=0.7, top_p=0.9, max_new_tokens=100, request_timeout=60, max_retries=3, parallelism=5, ), output_dir="./eval_output" ) ``` ### Methods #### Chat Completion ```python # Single request (async) response = await client.chat_completion( messages=[{"role": "user", "content": "Hello"}], seed=42 # Optional ) # Batch requests (sync wrapper) responses = client.chat_completions( messages_list=[ [{"role": "user", "content": "Hello"}], [{"role": "user", "content": "Hi"}], ], seeds=[42, 43], # Optional show_progress=True ) # Batch requests (async) responses = await client.batch_chat_completions( messages_list=[...], seeds=[...], show_progress=True ) ``` #### Text Completion ```python # Single completion response = await client.completion( prompt="Once upon a time", seed=42 ) # Batch completions responses = client.completions( prompts=["Prompt 1", "Prompt 2"], seeds=[42, 43], show_progress=True ) ``` #### Embeddings ```python # Single embedding embedding = await client.embedding(text="Hello world") # Batch embeddings embeddings = client.embeddings( texts=["Text 1", "Text 2"], show_progress=True ) ``` ### Context Manager ```python # Recommended: ensures post-eval hooks run async with NeMoEvaluatorClient(config, output_dir="./output") as client: response = await client.chat_completion(messages=[...]) # Hooks run automatically on exit ``` ### Manual Cleanup ```python client = NeMoEvaluatorClient(config, output_dir="./output") try: response = await client.chat_completion(messages=[...]) finally: await client.aclose() 
# Runs post-eval hooks ``` ## Adapter Configuration Client mode uses the same `AdapterConfig` as server mode, but with `mode="client"` to prevent server spawning: ```python from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig adapter_config = AdapterConfig( mode="client", # Prevents adapter server from spawning interceptors=[ InterceptorConfig( name="system_message", config={"system_message": "You are helpful."} ), InterceptorConfig(name="request_logging"), InterceptorConfig( name="caching", config={"cache_dir": "./cache"} ), InterceptorConfig(name="reasoning"), InterceptorConfig(name="endpoint"), # Required ], post_eval_hooks=[ {"name": "post_eval_report", "config": {"report_types": ["html"]}} ] ) ``` **Note:** When using `NeMoEvaluatorClient`, the `mode` is automatically set to `"client"` if not specified. ## Implementation Details ### Architecture ``` ┌─────────────────────────┐ │ Your Script/Notebook │ └───────────┬─────────────┘ │ ↓ ┌─────────────────────────┐ │ NeMoEvaluatorClient │ │ (AsyncOpenAI wrapper) │ └───────────┬─────────────┘ │ ↓ ┌─────────────────────────┐ │ AsyncAdapterTransport │ │ (httpx.AsyncBaseTransport) │ │ │ │ ┌───────────────────┐ │ │ │ Adapter Pipeline │ │ │ │ - Interceptors │ │ │ │ - Post-eval hooks │ │ │ └───────────────────┘ │ └───────────┬─────────────┘ │ HTTP ↓ ┌─────────────────────────┐ │ Model Endpoint │ └─────────────────────────┘ ``` ### Request Flow 1. User calls `client.chat_completion(...)` 2. AsyncOpenAI client constructs httpx.Request 3. AsyncAdapterTransport intercepts the request 4. Request wrapped for adapter compatibility (HttpxRequestWrapper) 5. Request passes through interceptor chain (in thread pool for sync interceptors) 6. Endpoint interceptor makes HTTP call 7. Response passes back through response interceptors 8. Response converted back to httpx.Response 9. 
AsyncOpenAI client parses and returns completion ### Sync/Async Bridging Client mode handles the async/sync boundary automatically: - AsyncAdapterTransport is async (implements `httpx.AsyncBaseTransport`) - Adapter pipeline and interceptors are synchronous - `asyncio.to_thread()` runs sync pipeline in thread pool - Seamless integration with async OpenAI client ## When to Use Client Mode ### Use Client Mode When: - Writing custom evaluation scripts - Working in Jupyter notebooks - Need direct API control - Want simpler setup - Debugging in same process - Single-process evaluations ### Use Server Mode When: - Running framework-driven evaluations with `evaluate()` - Need shared adapter state across processes - Working with harnesses that don't support custom clients - Running distributed evaluations ## Examples ### Basic Usage ```python from nemo_evaluator.client import NeMoEvaluatorClient from nemo_evaluator.api.api_dataclasses import EndpointModelConfig from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig config = EndpointModelConfig( model_id="llama-3-70b", url="https://integrate.api.nvidia.com/v1/chat/completions", api_key_name="NVIDIA_API_KEY", is_base_url=False, adapter_config=AdapterConfig( interceptors=[ InterceptorConfig(name="caching"), InterceptorConfig(name="endpoint"), ] ), ) async with NeMoEvaluatorClient(config, "./output") as client: response = await client.chat_completion( messages=[{"role": "user", "content": "What is AI?"}] ) print(response) ``` ### Batch Processing ```python # Process multiple prompts with progress bar prompts = [ [{"role": "user", "content": f"Question {i}"}] for i in range(100) ] responses = client.chat_completions( messages_list=prompts, show_progress=True ) ``` ### With All Interceptors ```python adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="system_message", config={"system_message": "Be concise."} ), InterceptorConfig(name="request_logging"), 
InterceptorConfig(name="response_logging"), InterceptorConfig( name="caching", config={ "cache_dir": "./cache", "reuse_cached_responses": True, "save_requests": True, "save_responses": True, } ), InterceptorConfig( name="reasoning", config={"start_reasoning_token": "<think>"} ), InterceptorConfig(name="response_stats"), InterceptorConfig(name="endpoint"), ], post_eval_hooks=[ {"name": "post_eval_report", "config": {"report_types": ["html", "json"]}} ] ) ``` ## See Also - {ref}`adapters-concepts` - Conceptual overview of the adapter system - {ref}`adapters-configuration` - Available interceptors and configuration options - {ref}`deployment-adapters-recipes` - Common adapter patterns and recipes (adapters-configuration)= # Configuration Configure the adapter system using the `AdapterConfig` class from `nemo_evaluator.adapters.adapter_config`. This class uses a registry-based interceptor architecture where you configure a list of interceptors, each with their own parameters. ## Core Configuration Structure `AdapterConfig` accepts the following structure: ```python from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="interceptor_name", enabled=True, # Optional, defaults to True config={ # Interceptor-specific parameters } ) ], endpoint_type="chat" # Optional, defaults to "chat" ) ``` ## Available Interceptors ### System Message Interceptor **Name:** `system_message` Adds a system message to requests as a system-role message. ```{list-table} :header-rows: 1 :widths: 20 15 15 50 * - Parameter - Type - Default - Description * - `system_message` - `str` - Required - System message to add to requests * - `strategy` - `str` - `"prepend"` - Strategy for handling existing system messages.
Options: `"replace"` (replaces existing), `"append"` (appends to existing), `"prepend"` (prepends to existing) ``` **Example:** ```python InterceptorConfig( name="system_message", config={ "system_message": "You are a helpful assistant." } ) # With explicit strategy InterceptorConfig( name="system_message", config={ "system_message": "You are a helpful assistant.", "strategy": "replace" # Replace existing system messages } ) ``` ### Reasoning Interceptor **Name:** `reasoning` Processes reasoning content in responses by detecting and removing reasoning tokens, tracking reasoning statistics, and optionally extracting reasoning to separate fields. ```{list-table} :header-rows: 1 :widths: 25 15 20 40 * - Parameter - Type - Default - Description * - `start_reasoning_token` - `str \| None` - `"<think>"` - Token marking start of reasoning section * - `end_reasoning_token` - `str` - `"</think>"` - Token marking end of reasoning section * - `add_reasoning` - `bool` - `True` - Whether to add reasoning information * - `migrate_reasoning_content` - `bool` - `False` - Migrate reasoning_content to content field with tokens * - `enable_reasoning_tracking` - `bool` - `True` - Enable reasoning tracking and logging * - `include_if_not_finished` - `bool` - `True` - Include reasoning if end token not found * - `enable_caching` - `bool` - `True` - Cache individual request reasoning statistics * - `cache_dir` - `str` - `"/tmp/reasoning_interceptor"` - Cache directory for reasoning stats * - `stats_file_saving_interval` - `int \| None` - `None` - Save stats to file every N responses (None = only save via post_eval_hook) * - `logging_aggregated_stats_interval` - `int` - `100` - Log aggregated stats every N responses ``` **Example:** ```python InterceptorConfig( name="reasoning", config={ "start_reasoning_token": "<think>", "end_reasoning_token": "</think>", "enable_reasoning_tracking": True } ) ``` ### Request Logging Interceptor **Name:** `request_logging` Logs incoming requests with configurable limits and detail
levels. ```{list-table} :header-rows: 1 :widths: 20 15 15 50 * - Parameter - Type - Default - Description * - `log_request_body` - `bool` - `True` - Whether to log request body * - `log_request_headers` - `bool` - `True` - Whether to log request headers * - `max_requests` - `int \| None` - `2` - Maximum requests to log (None for unlimited) ``` **Example:** ```python InterceptorConfig( name="request_logging", config={ "max_requests": 50, "log_request_body": True } ) ``` ### Response Logging Interceptor **Name:** `response_logging` Logs outgoing responses with configurable limits and detail levels. ```{list-table} :header-rows: 1 :widths: 20 15 15 50 * - Parameter - Type - Default - Description * - `log_response_body` - `bool` - `True` - Whether to log response body * - `log_response_headers` - `bool` - `True` - Whether to log response headers * - `max_responses` - `int \| None` - `None` - Maximum responses to log (None for unlimited) ``` **Example:** ```python InterceptorConfig( name="response_logging", config={ "max_responses": 50, "log_response_body": True } ) ``` ### Caching Interceptor **Name:** `caching` Caches requests and responses to disk with options for reusing cached responses. 
```{list-table} :header-rows: 1 :widths: 25 15 15 45 * - Parameter - Type - Default - Description * - `cache_dir` - `str` - `"/tmp"` - Directory to store cache files * - `reuse_cached_responses` - `bool` - `False` - Whether to reuse cached responses * - `save_requests` - `bool` - `False` - Whether to save requests to cache * - `save_responses` - `bool` - `True` - Whether to save responses to cache * - `max_saved_requests` - `int \| None` - `None` - Maximum requests to save (None for unlimited) * - `max_saved_responses` - `int \| None` - `None` - Maximum responses to save (None for unlimited) ``` **Notes:** - If `reuse_cached_responses` is `True`, `save_responses` is automatically set to `True` and `max_saved_responses` to `None` - The system generates cache keys automatically using SHA256 hash of request data **Example:** ```python InterceptorConfig( name="caching", config={ "cache_dir": "./evaluation_cache", "reuse_cached_responses": True } ) ``` ### Progress Tracking Interceptor **Name:** `progress_tracking` Tracks evaluation progress by counting processed samples and optionally sending updates to a webhook. ```{list-table} :header-rows: 1 :widths: 25 15 20 40 * - Parameter - Type - Default - Description * - `progress_tracking_url` - `str \| None` - `"http://localhost:8000"` - URL to post progress updates. Supports shell variable expansion. * - `progress_tracking_interval` - `int` - `1` - Update every N samples * - `request_method` - `str` - `"PATCH"` - HTTP method for progress updates * - `output_dir` - `str \| None` - `None` - Directory to save progress file (creates a `progress` file in this directory) ``` **Example:** ```python InterceptorConfig( name="progress_tracking", config={ "progress_tracking_url": "http://monitor:8000/progress", "progress_tracking_interval": 10 } ) ``` ### Endpoint Interceptor **Name:** `endpoint` Makes the actual HTTP request to the upstream API. 
This interceptor has no configurable parameters and is typically added automatically as the final interceptor in the chain. **Example:** ```python InterceptorConfig(name="endpoint") ``` ## Configuration Examples ### Basic Configuration ```python from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="request_logging", config={"max_requests": 10} ), InterceptorConfig( name="caching", config={"cache_dir": "./cache"} ) ] ) ``` ### Advanced Configuration ```python from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig adapter_config = AdapterConfig( interceptors=[ # System prompting InterceptorConfig( name="system_message", config={ "system_message": "You are an expert AI assistant." } ), # Reasoning processing InterceptorConfig( name="reasoning", config={ "start_reasoning_token": "<think>", "end_reasoning_token": "</think>", "enable_reasoning_tracking": True } ), # Request logging InterceptorConfig( name="request_logging", config={ "max_requests": 1000, "log_request_body": True } ), # Response logging InterceptorConfig( name="response_logging", config={ "max_responses": 1000, "log_response_body": True } ), # Caching InterceptorConfig( name="caching", config={ "cache_dir": "./production_cache", "reuse_cached_responses": True } ), # Progress tracking InterceptorConfig( name="progress_tracking", config={ "progress_tracking_url": "http://monitoring:3828/progress", "progress_tracking_interval": 10 } ) ], endpoint_type="chat" ) ``` ### YAML Configuration You can also configure adapters through YAML files in your evaluation configuration: ```yaml target: api_endpoint: url: http://localhost:8080/v1/chat/completions type: chat model_id: megatron_model adapter_config: interceptors: - name: system_message config: system_message: "You are a helpful assistant."
strategy: "prepend" # Optional: replace, append, or prepend (default) - name: reasoning config: start_reasoning_token: "<think>" end_reasoning_token: "</think>" - name: request_logging config: max_requests: 50 - name: response_logging config: max_responses: 50 - name: caching config: cache_dir: ./cache reuse_cached_responses: true ``` ## Interceptor Order Interceptors are executed in the order they appear in the `interceptors` list: 1. **Request interceptors** process the request in list order 2. The **endpoint interceptor** makes the actual API call (automatically added if not present) 3. **Response interceptors** process the response in reverse list order For example, with interceptors `[system_message, request_logging, caching, response_logging, reasoning]`: - Request flow: `system_message` → `request_logging` → `caching` (check cache) → API call (if cache miss) - Response flow: API call → `caching` (save to cache) → `response_logging` → `reasoning` ## Shorthand Syntax You can use string names as shorthand for interceptors with default configuration: ```python adapter_config = AdapterConfig( interceptors=["request_logging", "caching", "response_logging"] ) ``` This is equivalent to: ```python adapter_config = AdapterConfig( interceptors=[ InterceptorConfig(name="request_logging"), InterceptorConfig(name="caching"), InterceptorConfig(name="response_logging") ] ) ``` --- orphan: true --- (adapters)= # Evaluation Adapters Evaluation adapters provide a flexible mechanism for intercepting and modifying requests/responses between the evaluation harness and the model endpoint. This allows for custom processing, logging, and transformation of data during the evaluation process. ## Concepts For a conceptual overview and architecture diagram of adapters and interceptor chains, refer to {ref}`adapters-concepts`. ## Topics Explore the following pages to use and configure adapters.
::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} Usage :link: adapters-usage :link-type: ref Learn how to enable adapters and pass `AdapterConfig` to `evaluate`. ::: :::{grid-item-card} Recipes :link: deployment-adapters-recipes :link-type: ref Reasoning cleanup, system prompt override, response shaping, logging caps. ::: :::{grid-item-card} Configuration :link: adapters-configuration :link-type: ref View available `AdapterConfig` options and defaults. ::: :::: ```{toctree} :maxdepth: 1 :hidden: Usage Recipes Configuration ``` (adapters-usage)= # Usage Configure the adapter system using the `AdapterConfig` class with interceptors. Pass the configuration through the `ApiEndpoint.adapter_config` parameter: ```python from nemo_evaluator import ( ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate ) from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig # Configure adapter with multiple interceptors adapter_config = AdapterConfig( interceptors=[ # Reasoning interceptor InterceptorConfig( name="reasoning", config={ "start_reasoning_token": "<think>", "end_reasoning_token": "</think>" } ), # System message interceptor InterceptorConfig( name="system_message", config={ "system_message": "You are a helpful assistant that thinks step by step."
} ), # Logging interceptors InterceptorConfig( name="request_logging", config={"max_requests": 50} ), InterceptorConfig( name="response_logging", config={"max_responses": 50} ), # Caching interceptor InterceptorConfig( name="caching", config={ "cache_dir": "./evaluation_cache" } ), # Progress tracking InterceptorConfig( name="progress_tracking" ) ] ) # Configure evaluation target api_endpoint = ApiEndpoint( url="http://localhost:8080/v1/completions/", type=EndpointType.COMPLETIONS, model_id="megatron_model", adapter_config=adapter_config ) target_config = EvaluationTarget(api_endpoint=api_endpoint) # Configure evaluation eval_config = EvaluationConfig( type="mmlu_pro", params={"limit_samples": 10}, output_dir="./results/mmlu", ) # Run evaluation with adapter system results = evaluate( eval_cfg=eval_config, target_cfg=target_config ) ``` ## YAML Configuration You can also configure adapters through YAML configuration files: ```yaml target: api_endpoint: url: http://localhost:8080/v1/completions/ type: completions model_id: megatron_model adapter_config: interceptors: - name: reasoning config: start_reasoning_token: "<think>" end_reasoning_token: "</think>" - name: system_message config: system_message: "You are a helpful assistant that thinks step by step." - name: request_logging config: max_requests: 50 - name: response_logging config: max_responses: 50 - name: caching config: cache_dir: ./cache - name: progress_tracking config: type: mmlu_pro output_dir: ./results params: limit_samples: 10 ``` (adapters-recipe-system-prompt)= # Custom System Prompt (Chat) Apply a standard system message to chat endpoints for consistent behavior.
```python from nemo_evaluator import ( ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate ) from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig # Configure chat endpoint chat_url = "http://0.0.0.0:8080/v1/chat/completions/" api_endpoint = ApiEndpoint(url=chat_url, type=EndpointType.CHAT, model_id="megatron_model") # Configure adapter with custom system prompt using interceptor api_endpoint.adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="system_message", config={ "system_message": "You are a precise, concise assistant. Answer questions directly and accurately.", "strategy": "prepend" # Optional: "replace", "append", or "prepend" (default) } ) ] ) target = EvaluationTarget(api_endpoint=api_endpoint) config = EvaluationConfig(type="mmlu_pro", output_dir="results") results = evaluate(target_cfg=target, eval_cfg=config) ``` ## How It Works The `system_message` interceptor modifies chat-format requests based on the configured strategy: - **`prepend` (default)**: Prepends the configured system message before any existing system message - **`replace`**: Removes any existing system messages and replaces with the configured message - **`append`**: Appends the configured system message after any existing system message All strategies: 1. Insert or modify the system message as the first message with `role: "system"` 2. 
Preserve all other request parameters ## Strategy Examples ```python # Replace existing system messages (ignore any existing ones) InterceptorConfig( name="system_message", config={ "system_message": "You are a helpful assistant.", "strategy": "replace" } ) # Prepend to existing system messages (default behavior) InterceptorConfig( name="system_message", config={ "system_message": "Important: ", "strategy": "prepend" } ) # Append to existing system messages InterceptorConfig( name="system_message", config={ "system_message": "\nRemember to be concise.", "strategy": "append" } ) ``` Refer to {ref}`adapters-configuration` for more configuration options. (deployment-adapters-recipes)= # Recipes Practical, focused examples for common adapter scenarios. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} Reasoning Cleanup :link: adapters-recipe-reasoning :link-type: ref Strip intermediate reasoning tokens before scoring. ::: :::{grid-item-card} Custom System Prompt (Chat) :link: adapters-recipe-system-prompt :link-type: ref Enforce a standard system prompt for chat endpoints. ::: :::{grid-item-card} Request Parameter Modification :link: adapters-recipe-response-shaping :link-type: ref Standardize request parameters across endpoint providers. ::: :::{grid-item-card} Logging Caps :link: adapters-recipe-logging :link-type: ref Control logging volume for requests and responses. ::: :::: ```{toctree} :maxdepth: 1 :hidden: Reasoning Cleanup Custom System Prompt (Chat) Request Parameter Modification Logging Caps ``` (adapters-recipe-reasoning)= # Reasoning Cleanup Use the reasoning adapter to remove intermediate thoughts from model outputs before scoring. 
```python from nemo_evaluator import ( ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate ) from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig # Configure completions endpoint completions_url = "http://0.0.0.0:8080/v1/completions/" api_endpoint = ApiEndpoint(url=completions_url, type=EndpointType.COMPLETIONS, model_id="megatron_model") # Configure adapter with reasoning extraction api_endpoint.adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="reasoning", enabled=True, config={ "start_reasoning_token": "<think>", "end_reasoning_token": "</think>" } ) ] ) target = EvaluationTarget(api_endpoint=api_endpoint) config = EvaluationConfig(type="gsm8k", output_dir="results") results = evaluate(target_cfg=target, eval_cfg=config) ``` ## Configuration Parameters Set both `start_reasoning_token` and `end_reasoning_token` to match your model's delimiters. The reasoning interceptor removes content between these tokens from the final response before scoring. Optional parameters: - `include_if_not_finished` (default: `True`): Include reasoning content if reasoning is not finished (end token not found) - `enable_reasoning_tracking` (default: `True`): Enable reasoning tracking and logging - `add_reasoning` (default: `True`): Whether to add reasoning information to the response - `migrate_reasoning_content` (default: `False`): Migrate `reasoning_content` field to `content` field with tokens Reasoning statistics (word counts, token counts, completion status) are automatically tracked and logged when enabled. Refer to {ref}`adapters-configuration` for all interceptor options and defaults. (adapters-recipe-logging)= # Logging Caps Limit logging volume during evaluations to control overhead.
```python
from nemo_evaluator import (
    ApiEndpoint, EndpointType, EvaluationConfig,
    EvaluationTarget, evaluate
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure completions endpoint
completions_url = "http://0.0.0.0:8080/v1/completions/"
api_endpoint = ApiEndpoint(url=completions_url, type=EndpointType.COMPLETIONS, model_id="megatron_model")

# Configure adapter with logging limits
api_endpoint.adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="request_logging",
            enabled=True,
            config={"max_requests": 5}  # Limit request logging
        ),
        InterceptorConfig(
            name="response_logging",
            enabled=True,
            config={"max_responses": 5}  # Limit response logging
        )
    ]
)

target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="hellaswag", output_dir="results")
results = evaluate(target_cfg=target, eval_cfg=config)
```

Use the following tips to control logging caps:

- Include the `request_logging` and `response_logging` interceptors to enable logging
- Set `max_requests` and `max_responses` in the interceptor config to limit volume
- Omit or disable interceptors to turn off logging for that direction
- Use low limits for quick debugging, and increase them when needed

Refer to {ref}`adapters-configuration` for all `AdapterConfig` options and defaults.

(adapters-recipe-response-shaping)=
# Request Parameter Modification

Standardize request parameters across different endpoint providers.
```python from nemo_evaluator import ( ApiEndpoint, EndpointType, EvaluationConfig, EvaluationTarget, evaluate ) from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig # Configure completions endpoint completions_url = "http://0.0.0.0:8080/v1/completions/" api_endpoint = ApiEndpoint(url=completions_url, type=EndpointType.COMPLETIONS, model_id="megatron_model") # Configure adapter with payload modification for response shaping api_endpoint.adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="payload_modifier", enabled=True, config={ "params_to_add": {"temperature": 0.0, "max_new_tokens": 100}, "params_to_remove": ["max_tokens"] # Remove conflicting parameters } ), InterceptorConfig( name="request_logging", enabled=True, config={"max_requests": 10} ), InterceptorConfig( name="response_logging", enabled=True, config={"max_responses": 10} ) ] ) target = EvaluationTarget(api_endpoint=api_endpoint) config = EvaluationConfig(type="lambada", output_dir="results") results = evaluate(target_cfg=target, eval_cfg=config) ``` Guidance: - Use the `payload_modifier` interceptor to standardize request parameters across different endpoints - Configure `params_to_add` in the interceptor config to add or override parameters - Configure `params_to_remove` in the interceptor config to eliminate conflicting or unsupported parameters - Configure `params_to_rename` in the interceptor config to map parameter names between different API formats - Use `request_logging` and `response_logging` interceptors to monitor transformations - Keep transformations minimal to avoid masking upstream issues - The payload modifier interceptor works with both chat and completions endpoints (bring-your-own-endpoint-hosted)= # Hosted Services Use existing hosted model APIs from cloud providers without managing your own infrastructure. This approach offers the fastest path to evaluation with minimal setup requirements. 
## Overview

Hosted services provide:

- Pre-deployed models accessible via API
- No infrastructure management required
- Pay-per-use pricing models
- Instant availability and global access
- Professional SLA and support

## NVIDIA Build

NVIDIA's catalog of ready-to-use AI models with OpenAI-compatible APIs.

### NVIDIA Build Setup and Authentication

```bash
# Get your NGC API key from https://build.nvidia.com
export NGC_API_KEY="nvapi-your-ngc-api-key"

# Test authentication
curl -H "Authorization: Bearer $NGC_API_KEY" \
  "https://integrate.api.nvidia.com/v1/models"
```

Refer to the [NVIDIA Build catalog](https://build.nvidia.com) for available models.

### NVIDIA Build Configuration

#### Basic NVIDIA Build Evaluation

```yaml
# config/nvidia_build_basic.yaml
defaults:
  - execution: local
  - deployment: none  # No deployment needed
  - _self_

target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.2-3b-instruct
    api_key_name: NGC_API_KEY  # Name of environment variable

execution:
  output_dir: ./results

evaluation:
  overrides:
    config.params.limit_samples: 100
  tasks:
    - name: ifeval
```

#### Multi-Model Comparison

For multi-model comparison, run separate evaluations for each model and compare results:

```bash
# Evaluate first model
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
  -o execution.output_dir=./results/llama-3.2-3b

# Evaluate second model
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o target.api_endpoint.model_id=meta/llama-3.1-70b-instruct \
  -o execution.output_dir=./results/llama-3.1-70b

# Gather the results
nemo-evaluator-launcher export --dest local --format json
```

## OpenAI API

Direct integration with OpenAI's GPT models for comparison and benchmarking.
### OpenAI Setup and Authentication ```bash # Get API key from https://platform.openai.com/api-keys export OPENAI_API_KEY="your-openai-api-key" # Test authentication curl -H "Authorization: Bearer $OPENAI_API_KEY" \ "https://api.openai.com/v1/models" ``` Refer to the [OpenAI model documentation](https://platform.openai.com/docs/models) for available models. ### OpenAI Configuration #### Basic OpenAI Evaluation ```yaml # config/openai_basic.yaml defaults: - execution: local - deployment: none - _self_ target: api_endpoint: url: https://api.openai.com/v1/chat/completions model_id: gpt-4 api_key_name: OPENAI_API_KEY # Name of environment variable execution: output_dir: ./results evaluation: overrides: config.params.limit_samples: 100 tasks: - name: ifeval ``` #### Cost-Optimized Configuration ```yaml # config/openai_cost_optimized.yaml defaults: - execution: local - deployment: none - _self_ target: api_endpoint: url: https://api.openai.com/v1/chat/completions model_id: gpt-3.5-turbo # Less expensive model api_key_name: OPENAI_API_KEY execution: output_dir: ./results evaluation: overrides: config.params.limit_samples: 50 # Smaller sample size config.params.parallelism: 2 # Lower parallelism to respect rate limits tasks: - name: mmlu_pro ``` ## Troubleshooting ### Authentication Errors Verify that your API key has the correct value: ```bash # Verify NVIDIA Build API key curl -H "Authorization: Bearer $NGC_API_KEY" \ "https://integrate.api.nvidia.com/v1/models" # Verify OpenAI API key curl -H "Authorization: Bearer $OPENAI_API_KEY" \ "https://api.openai.com/v1/models" ``` ### Rate Limiting If you encounter rate limit errors (HTTP 429), reduce the `parallelism` parameter in your configuration: ```yaml evaluation: overrides: config.params.parallelism: 2 # Lower parallelism to respect rate limits ``` (bring-your-own-endpoint)= # Bring-Your-Own-Endpoint Deploy and manage model serving yourself, then point NeMo Evaluator to your endpoint. 
This approach gives you full control over deployment infrastructure while still leveraging NeMo Evaluator's evaluation capabilities. ## Overview With bring-your-own-endpoint, you: - Handle model deployment and serving independently - Provide an OpenAI-compatible API endpoint - Use either the launcher or core library for evaluations - Maintain full control over infrastructure and scaling ## When to Use This Approach **Choose bring-your-own-endpoint when you:** - Have existing model serving infrastructure - Need custom deployment configurations - Want to deploy once and run many evaluations - Have specific security or compliance requirements - Use enterprise Kubernetes or MLOps pipelines ## Quick Examples ### Using Launcher with Existing Endpoint ```bash # Point launcher to your deployed model URL=http://your-endpoint:8000/v1/completions MODEL=your-model-name nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o target.api_endpoint.url=$URL \ -o target.api_endpoint.model_id=$MODEL -o deployment.type=none # No launcher deployment ``` ### Using Core Library ```python from nemo_evaluator import ( ApiEndpoint, EvaluationConfig, EvaluationTarget, evaluate ) # Configure your endpoint api_endpoint = ApiEndpoint( url="http://your-endpoint:8000/v1/completions", model_id="your-model-name" ) target = EvaluationTarget(api_endpoint=api_endpoint) # Run evaluation config = EvaluationConfig(type="gsm8k", output_dir="results") results = evaluate(eval_cfg=config, target_cfg=target) ``` ## Endpoint Requirements Your endpoint must provide OpenAI-compatible APIs: ### Required Endpoints - **Completions**: `/v1/completions` (POST) - For text completion tasks - **Chat Completions**: `/v1/chat/completions` (POST) - For conversational tasks - **Health Check**: `/v1/health` (GET) - For monitoring (recommended) ### Request/Response Format Must follow OpenAI API specifications for compatibility with evaluation frameworks. 
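The required response shape can also be checked programmatically before launching a long evaluation run. The sketch below is illustrative, not part of the NeMo Evaluator API: the field checks and sample payload are assumptions based on the OpenAI chat-completions schema.

```python
# Minimal OpenAI-compatibility check for a chat-completions response body.
# The required fields below follow the OpenAI API schema; this helper is a
# hypothetical sketch, not a NeMo Evaluator utility.

def check_chat_response(response: dict) -> list:
    """Return a list of problems; an empty list means the response looks compatible."""
    problems = []
    choices = response.get("choices")
    if not isinstance(choices, list) or not choices:
        problems.append("missing or empty 'choices' array")
        return problems
    message = choices[0].get("message", {})
    if "content" not in message:
        problems.append("choices[0].message has no 'content'")
    if "usage" not in response:
        problems.append("missing 'usage' (token accounting)")
    return problems

# Example response in the OpenAI chat-completions shape
sample = {
    "choices": [{"message": {"role": "assistant", "content": "6"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 1, "total_tokens": 13},
}
print(check_chat_response(sample))  # []
```

Running such a probe against one response from your endpoint can surface schema gaps early, before a benchmark spends hours issuing requests.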
See the {ref}`deployment-testing-compatibility` guide to verify your endpoint's OpenAI compatibility.

## Configuration Management

### Basic Configuration

```yaml
# config/bring_your_own.yaml
deployment:
  type: none  # No launcher deployment

target:
  api_endpoint:
    url: http://your-endpoint:8000/v1/completions
    model_id: your-model-name
    api_key_name: API_KEY  # Optional, needed for gated endpoints

evaluation:
  tasks:
    - name: mmlu
    - name: gsm8k
```

## Key Benefits

### Infrastructure Control

- **Custom configurations**: Tailor deployment to your specific needs
- **Resource optimization**: Optimize for your hardware and workloads
- **Security compliance**: Meet your organization's security requirements
- **Cost management**: Control costs through efficient resource usage

### Operational Flexibility

- **Deploy once, evaluate many**: Reuse deployments across multiple evaluations
- **Integration ready**: Works with existing infrastructure and workflows
- **Technology choice**: Use any serving framework or cloud provider
- **Scaling control**: Scale according to your requirements

## Getting Started

1. **Choose your approach**: Select from manual deployment, hosted services, or enterprise integration
2. **Deploy your model**: Set up your OpenAI-compatible endpoint
3. **Configure NeMo Evaluator**: Point to your endpoint with the proper configuration
4. **Run evaluations**: Use the launcher or core library to run benchmarks
5. **Monitor and optimize**: Track performance and optimize as needed

```{toctree}
:maxdepth: 1
:hidden:

Hosted Services
Testing Endpoint Compatibility
```

(deployment-testing-compatibility)=
# Testing Endpoint Compatibility

This guide helps you verify that your hosted endpoint exposes an OpenAI-compatible API, using `curl` requests for different task types. Models deployed using `nemo-evaluator-launcher` should be compatible with these tests.

To test your endpoint, run the provided command and check the model's response.
Make sure to populate `FULL_ENDPOINT_URL`, `API_KEY`, and `MODEL_NAME` with your own values.

:::{tip}
If your model is not gated, omit the authorization header line:

```bash
-H "Authorization: Bearer ${API_KEY}"
```

from the commands below.
:::

## General Requirements

Your endpoint should support the following parameters:

- `top_p`
- `temperature`
- `max_tokens`

## Chat endpoint testing

```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"

curl -X POST ${FULL_ENDPOINT_URL} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Write Python code that can add a list of numbers together."
      }
    ],
    "model": "'"$MODEL_NAME"'",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
```

## Completions endpoint testing

```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"

curl -X POST ${FULL_ENDPOINT_URL} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "prompt": "Write Python code that can add a list of numbers together.",
    "model": "'"$MODEL_NAME"'",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
```

## VLM chat endpoint testing

NeMo Evaluator supports the **OpenAI Images API** ([docs](https://platform.openai.com/docs/guides/images-vision#giving-a-model-images-as-input)) and **vLLM** ([docs](https://docs.vllm.ai/en/stable/features/multimodal_inputs.html)) with the image provided as a **base64-encoded image**, and the following content types:

- `image_url`
- `text`

```bash
export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions"
export API_KEY="your-api-key-here"
export MODEL_NAME="your-model-name-here"

curl -X POST ${FULL_ENDPOINT_URL} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
-H "Accept: application/json" \ -d '{ "messages": [ { "role": "user", "content": [ { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAQABADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigAooooA//9k=" } }, { "type": "text", "text": "Describe the image:" } ] } ], "model": "'"$MODEL_NAME"'", "stream": false, "max_tokens": 16, "temperature": 0.0, "top_p": 1.0 }' ``` ## Function calling testing We support OpenAI-compatible function calling ([docs](https://platform.openai.com/docs/guides/function-calling?api-mode=responses)): Function calling request: ```bash export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions" export API_KEY="your-api-key-here" export MODEL_NAME="your-model-name-here" curl -X POST ${FULL_ENDPOINT_URL} \ -H "Content-Type: application/json" \ -H "Authorization: Bearer ${API_KEY}" \ -H "Accept: application/json" \ -d '{ "model": "'"$MODEL_NAME"'", "stream": false, "max_tokens": 16, "temperature": 0.0, "top_p": 1.0, "messages": [ { "role": "user", "content": "What is the slope of the line which is perpendicular to the line with the equation y = 3x + 2?" 
} ], "tools": [ { "type": "function", "function": { "name": "find_critical_points", "description": "Finds the critical points of the function. Note that the provided function is in Python 3 syntax.", "parameters": { "type": "object", "properties": { "function": { "type": "string", "description": "The function to find the critical points for." }, "variable": { "type": "string", "description": "The variable in the function." }, "range": { "type": "array", "items": { "type": "number" }, "description": "The range to consider for finding critical points. Optional. Default is [0.0, 3.4]." } }, "required": ["function", "variable"] } } } ] }' ``` ## Audio endpoint testing We support audio input with the following content types: - `audio_url` Example: ``` bash export FULL_ENDPOINT_URL="https://your-server.com/v1/chat/completions" export API_KEY="your-api-key-here" export MODEL_NAME="your-model-name-here" curl -X POST ${FULL_ENDPOINT_URL} \ -H "Content-Type: application/json" \ -H "Authorization: Bearer ${API_KEY}" \ -H "Accept: application/json" \ -d '{ "max_tokens": 256, "model": "'"$MODEL_NAME"'", "messages": [ { "content": [ { "audio_url": { "url": 
"data:audio/wav;base64,UklGRqQlAABXQVZFZm10IBAAAAABAAEAgLsAAAB3AQACABAAZGF0YYAlAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA////////////////////////////////////////////////////////AAD/////////////////////AAD//wAAAAAAAAAA//////////////////8AAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAA////////AAAAAAAAAAAAAAAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAD/////////////AAAAAAAA////////////////////////////////////////////////////////AAAAAAAAAAD/////////////AAD//wAAAAAAAAAA//////////////////8AAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAA/////wAA////////AAAAAAAAAAAAAP//////////////////AAAAAAAAAAD///////////////////////8AAAAAAAD/////////////////////AAAAAP//////////////////////////AAAAAAAAAAAAAAAA/////wAAAAAAAAAAAAAAAAAA//////////8AAAAAAAAAAAAAAAAAAAAAAAD//wAA////////AAAAAAAAAAAAAAAAAAAAAP////8AAAAA////////AAAAAAAAAAAAAP//////////////////AAAAAAAAAAD//wAAAAD///////8AAAAAAAAAAAAA//////////////////////////8AAAAA/////////////////////wAAAAAAAP///////////////////////wAAAAAAAAAA////////////////AAAAAAAAAAAAAP//////////AAAAAP////8AAAAA////////AAAAAAAA/////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAD///////8AAAAAAAAAAAAAAAD///////8AAP////8AAAAAAAAAAAAAAAAAAAAA/////wAAAAD/////AAAAAAAAAAAAAAAAAAAAAAAAAAAAAP////8AAAAA////////AAAAAAAAAAAAAAAAAAAAAAAAAAD/////////////////////AAAAAP//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA//8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD///////8AAAAA/////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD/////////////////////AAAAAAAAAAAAAAAA//////////////////////////////////8AAAAAAAAAAAAAAAAAAP//AAAAAAAAAAAAAAAAAAAAAP///////wAAAAAAAAAAAAAAAAAAAAD///////8AAAAAAAD//////////////////////////////////////////////////////////wAA//8AAAAAAAAAAP///////wAAAAAAAAAAA
AAAAAAAAAD///////////////////////////////////////8AAAAAAAAAAAAAAAAAAP///////wAAAAAAAAAA//8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP//AAAAAAAAAAAAAAAA//////////8AAAAAAAAAAP//////////////////////////AAAAAAAAAAAAAAAAAAAAAP////////////////////8AAAAA//////////////////////////////////////////8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAD//wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP///////////////////////////////wAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAD///////8AAAAAAAD/////////////////////////////////////////////////////////////////////AAAAAP////////////////////8AAAAAAAD/////AAAAAAAAAAAAAAAAAAD/////AAAAAP///////////////////////////////wAAAAD///////////////////////////////////////8AAP///////wAAAAD/////////////////////////////////////////////////////AAAAAP//////////////////////////AAAAAAAAAAAAAAAAAAAAAP//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAP/////+////////////AAAAAAAA//8AAAAAAAAAAP////8AAP//////////////////AAAAAAAAAAAAAAAAAAAAAP//AAAAAP///////wAAAAD/////AAAAAAAA/////wAA//////////8AAAAA//8AAAAA/////////////////////wAAAAAAAAAAAAAAAP//////////AAAAAAAAAAAAAAAAAAD//////////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA//8AAAAAAAAAAP///////wAAAAAAAAAAAAAAAAAAAAD//////////wAAAAAAAP//AAAAAP/////////////////////////////////////////////////////////////////////+//7//v//////////////////////////////////////AAD/////////////////////////////////////////////////////AAAAAAAA//8AAAAA/////////v/+//7//v//////AAAAAAAAAAAAAAAAAAD//////v/+//7//v/+//3//f/+///////+//7//f/+//7//f/8//3//f/+//7///8AAAAAAAAAAAAAAQAAAP7//f/+//7//f/8//z//f/9//z/+//8//z//P/7//v/+//7//v/+v/6//r/+v/7//z//f/9//z//P/8//z/+//5//n/+P/4//j/+P/5//r/+v/6//r/+//8//z/+v/5//n/+f/4//f/9v/1//X/9v/4//n/+f/4//n/+v/7//v/+f/4//f/9v/2//f/+f/6//n/+P/4//n/+f/5//j/+P/5//n/+v/6//n/+P/4//n/+//7//r/+P/4//n/+f/4//b/9f/2//j/+P/4//f/9//3//b/9P/z//P/9P/0//T/9f/1//b/9v/2//f/+P/3//b/9v/2//b/9v/1//X/9v/4//n/+f/5//n/+P/3//f/9//2//X/9f/2//f/9//5//z///8AAAAAAAD///7//P/8//7///8AAAEAAgAFAAUAAwABAAIAAwACAP///f/9////AQADAAQABgAHAAcABgAFAAQAAgACAAMABAAEAAQABQAJAAsADAALAAsACgAJAAYAB
AADAAIAAQAAAP///v/9//7/AQAEAAUAAgD/////AAABAAAAAAAAAAEAAQAAAAAAAgAEAAUAAwABAAAAAQABAAEAAAD//wAAAQABAP7//P/6//r/+v/6//n/9//2//f/+v/9//3/+//6//7/AgACAP///v8CAAYABgACAAAAAwAFAAEA/P/6//7/AAD+//z//f8BAAMAAgAAAAIABAADAAAA/v8AAAMAAwABAP///v8AAAIAAQD+//3//f/9//3/+//7//z//f/6//j/+P/6//z/+f/1//b/+f/7//f/9P/0//j/+//6//j/+P/5//z//P/8//v/+f/4//j/+v/8//z/+//5//j/+f/8//3//P/6//f/9v/1//P/8f/v/+//8f/0//X/9v/2//f//P8BAAMAAAD8//z/AQAEAAIA+//4//v///////3//f8AAAMAAQD9//v//v8DAAUABQAFAAcACQAMAAwACwAJAAgABwAIAAgABwAFAAUABgAHAAcABwAHAAgACwANAA8AEAAPAA8AEAASABIADQAIAAcADAAPAA4ACgAJAA0ADgAJAAQABAALABAADgAJAAkADwAVABYAEwAPAA4ADwAOAAwACAAEAAEA//////7//v/+/wEABAAHAAcABgAEAAQABQAIAAkABwAFAAMAAwAEAAUABwAIAAYAAwADAAYACQAHAAEA/f8BAAUABQD///z//v8BAAAA+//5//r//P/7//n/+v/7//v/+f/3//f/+v/9//z/+f/4//v/AAACAP//+//8/wAAAwABAAAAAgAHAAYAAQD9/wAAAwD///j/9//9/wMAAQD9/wAACQAOAAsABgAIAA8AEQAMAAcACAALAAwACAADAP///f/9//7//v/8//n/+P/7//7/AAD+//3//f///wAA///9//3/AAAEAAYAAwD///7/AQAEAAMAAgACAAQAAwAAAP7/AgAIAAsACAADAP///f/6//n/+f/5//j/9//4//z/AAADAAYADAARABIAEAAOAA8AEAAPAA0ADAANABAAFAAYABsAHAAdAB4AHwAbABQADwAQABQAFQASABEAFwAgACYAJgAkACUAJgAjAB4AHAAeACMAJwArADAAMwAyAC8ALAAqACcAIwAgAB8AHgAcABkAGAAZABsAHgAhACIAIQAcABgAGAAZABcAEwARABQAGAAaABgAGAAYABgAFgAUABQAEwAOAAcABAAHAAsACgAHAAgADgATABEACAACAAMACQANAAoABAAFAA4AFgAWAA8ACAAJAA4AEgAUABQAEwASABIAFAAWABUAEgASABYAFgAMAPj/4f/S/83/z//T/9P/zv/H/8L/wP+6/6z/nP+Q/4r/hf99/3j/d/93/3X/dv9//4j/hv98/3X/ef98/3T/Zv9j/27/d/9x/2b/Y/9o/2r/Zf9h/2X/a/9r/2j/Zf9i/1n/UP9P/1X/Vv9P/0b/RP9H/0n/Sv9O/1P/U/9S/1j/YP9h/1f/VP9l/37/iv+H/4r/nf+v/6//ov+g/6v/sP+j/5P/lf+j/6n/nv+S/4//kP+K/37/d/93/3z/gP+G/5D/mv+h/6b/qv+r/6f/pP+n/7D/u//I/9f/5f/u//X/AAAPABkAGwAbAB4AJQAnACUAJQAtADYAOQA2ADIALgAmABgACQD9//H/4v/U/83/z//V/9r/3v/j/+r/8/8AABIAJAAxADoAQABEAEkATwBVAFYAUABKAEsAUgBWAFIATgBQAFQAUQBJAEQARgBIAEMAPAA5ADcALwAiABgAFgAXABQADQAMABEAFQATAA0ABQD7/+z/2f/J/7//vP+8/73/vP+5/7b/tf+y/6r/mv+G/2//Wv9J/z7/NP8k/xL/Bv8F/wP/+v7s/uf+7f7x/uf+2P7U/uD+8P76/gH/DP8X/xz/Gv8U/w//Cv8J/wz/Df8E//H+4/7n/vL+9P7p/t7+3f7i/uL+3P7Y/tj+3P7g/uf+7
/7z/vL+8f73/v7+Av8B/wH/Cv8a/yz/OP88/zv/Pv9J/1b/Wf9U/1b/Zv93/3b/aP9k/3X/iP+H/3T/Zv9o/3T/fP+B/4n/k/+X/5f/lv+V/5P/jf+L/5H/nf+p/7T/wf/N/9r/6P/6/wwAGQAkADMARQBPAE4ATABUAGIAZwBiAF4AYQBmAGQAYABlAHEAfAB9AHwAfgCGAJEAowC8ANUA5gDuAPcAAwENARABDAEKAQsBCgEEAf0A/wASATIBUwFnAWsBZgFhAV0BVAFDATIBJwEgARkBDwEFAf4A+gD6APoA8ADdAMsAygDaAO4A/wAVATABRAFBATMBNQFMAVgBQwEfAREBGAEWAf4A6gDzAAMB9wDSAL0AxQDIAKYAdQBkAHcAggBjADYAJwA7AEcALgAGAPb/CwApADEAHgAGAAEACQAIAO//0v/L/9j/1/+x/3//cf+M/6r/q/+e/6X/vv/J/7z/tv/O/+z/6f/K/7r/zf/i/9D/ov+E/4v/mf+P/3T/bP+K/7j/1P/U/8z/z//i//j/BwARABsAKQA2AD4ARQBPAFsAYwBmAGsAdAB4AG4AXABUAF8AdgCLAJoAqQC8AMoAzwDNAMwA0ADWAN8A6QDxAPQA9gD9AAcBDwETARkBJAEuAS0BJQEhASMBHwEOAfoA8gD1APIA4ADMAMgA0ADTAMcAswCfAIsAbgBKACwAHAATAAQA7v/a/8//z//S/9P/zf++/6r/mP+O/4b/e/9t/2b/aP9t/2r/Yf9a/1f/T/86/xb/5f6r/mv+Mf4K/vT94/3L/a39l/2Q/ZT9mP2U/Yb9cP1W/Tn9Fv3v/M78vvy+/MX8xPy8/Lf8tvyw/KD8jPx//Hn8cfxf/Eb8LPwX/An8BfwI/An8BfwE/A38GfwT/Pr74vvg++z76fvL+6T7kPuU+5z7mPuM+4j7kvui+677svu6+8376fsD/A/8D/wK/AP88vvW+7z7s/u4+7n7rfui+6n7vPvJ+8r7zPvY++H71/u9+6r7qvu4+8X7z/vZ++D73fvP+8H7uPuz+7L7uPvH+9b73/vp+/z7Gvwz/EH8Svxa/Gr8cPxq/Gf8dPyO/Kb8uvzO/On8CP0o/Ur9c/2e/cL92v3s/QT+JP5F/mP+gf6n/tb+Cf86/2f/k//B//D/HwBQAIQAugDtAB8BVQGTAdcBGwJbApgC0gIHAzcDYgOMA7UD4AMMBDYEWARxBIkEqQTRBPkEHQVCBWsFkQWuBcUF4wUMBjQGVAZvBpMGwgbxBhsHQAdnB40HrQfIB94H6gfpB+YH7Qf+BwoIBgj+BwAICwgMCP8H8QfyB/wH/QfsB9QHwAevB5cHewdiB0sHMAcOB+kGxQaiBoEGZAZQBj0GIgb+BdsFwQWwBaAFjAVwBUwFIgX+BOIExwShBHQETAQrBAMEzAOVA3IDZANUAzADAwPfAsUCpQJ4AkUCGQL0AdEBqgGCAVkBLwEGAeEAvQCUAGQANQAKAN7/rf97/07/Kf8A/83+kv5V/hv+4/2s/Xf9Pf3//MX8lfxt/EH8Dvzc+7P7jftg+yz7+/rT+q/6ifpi+kH6JPoF+uT5wvmd+W/5O/kK+eb4xfib+Gb4NPgP+PL30/ew95D3c/dU9y33Avfe9sX2tPaf9oX2aPZS9kj2R/ZM9lb2afaB9pP2m/aj9rr24vYM9yr3RPdp95r3xvfl9wD4J/ha+If4pPi6+N34E/lQ+YX5r/nU+QH6Nvpr+pn6xfr8+kT7kfvU+w38Tfyg/Pj8QP1x/Z791/0b/lr+kP7G/gb/Tf+U/9v/IgBoAKkA5QAfAVcBhwGvAdcBBAIzAlkCdwKWArwC5wIIAxoDHwMeAx4DIAMjAyIDIAMlAzgDUANfA2ADXwNlA2sDYQNCAyADCwP9AuMCvQKbAowChwJ5AlwCOAIXAvYB0wGwAZIBdgFWATUBGQEBAeQAwwCqAJ0AkgB5AFgAQgBAAEMAPQAwACwAMgA4ADkAO
wBDAE4AWABpAIcArQDRAPEAFQFAAWgBhAGdAcAB6wEOAiUCOgJaAoQCsALcAg0DQQNyA58DzAP/AzIEZASbBNoEGAVKBXgFrgXuBSgGUQZ0BqAG0Qb4BhMHLwdVB3sHlgesB8oH7gcNCCIINwhSCGgIawhiCGAIawh6CH0IdAhnCF0IVAhJCD0ILQgaCAcI+QfoB80HqAeGB28HXAc+BxYH8QbZBscGrwaMBmkGSwYzBhkG/QXjBcsFsgWXBXsFZAVPBTcFGQX2BNcEwASwBJ4EhgRtBFsEUQRKBEIEPAQ8BEMESQRMBE0EUgRfBHUEkQSrBL0ExgTOBNwE7gT8BAQFBwUNBRcFJQU0BUEFSgVWBWkFfwWQBZcFnwWzBdIF7AX8BQsGIQY6BksGVQZiBncGiAaNBosGjQaUBpsGpAa6BtsG+gYJBw0HDwcPBwkH/gb4BvUG6QbQBrcGrAamBpUGdgZbBk8GRAYqBgUG6gXdBc8FtQWTBXYFXQU+BRoF+QTgBMkErASLBG0ETgQqBAME3QO4A5ADaANHAywDCwPdAqwChAJjAjsCBQLNAaUBigFqAToBBAHWALUAmQB3AFEAKwAHAOf/yP+r/47/cv9W/zn/Gf/6/t3+w/6o/oj+Zf5F/if+Cf7p/dH9yf3G/bT9hv1H/RD96vzE/Iv8RfwQ/Pj76/vO+577cftb+0/7NPsC+8n6mvp0+k/6KPoF+uj5z/m4+aX5k/l7+VX5Kfn/+Nv4sviB+E/4Ifj298f3k/de9yz3/vbS9qj2f/ZS9iD27/XF9aH1fvVe9UT1M/Uj9RL1AvX69Pn09/Tw9Ov07fT09Pn0+/T+9AX1DfUS9RT1G/Um9TP1O/VA9Uf1VfVp9X31ivWU9aD1svXG9db14/X29RL2MvZO9mb2iPa59vH2JPdM93P3offT9wD4I/hD+Gn4k/i/+On4E/k/+Wr5kPmz+dn5A/op+kf6YPqB+qf6xvrZ+ub6/Pob+zn7Tvti+377nvu1+8D7xfvM+9P71fvS+837yfvE+737ufu6+7j7rvue+5D7h/t++3L7Zvte+1r7VftQ+0r7Rfs/+zj7L/sg+wv7+Prq+t/6zfqz+qH6oPqm+p/6jfqD+o76o/qu+q36r/q7+sj6yvrJ+tb68voO+yD7Kvs3+0r7YPt5+5j7tvvM+9v77vsJ/Cn8Rvxi/IP8qPzJ/OH89vwN/Sf9P/1W/W39gf2Q/Z39sf3P/er9+v0H/hv+OP5O/lX+Vf5g/nj+k/6m/rP+vf7H/tH+3/7w/v3+/f70/u7+8P7t/tr+vf6q/qr+r/6i/n/+VP4z/h7+B/7k/bj9kv19/XX9bP1Y/Tz9Kf0m/Sj9Hv0I/fP86/zr/OL8zfy3/Kz8rvyz/Lb8vPzG/ND82fzm/Pr8D/0i/TP9Sv1n/YP9l/2p/b/92v30/Q3+KP5F/mL+gf6j/sr+8/4Z/zr/Wf94/5v/yP///zgAaQCSAL8A+AA1AWgBkwHDAf4BNwJjAogCsgLkAhMDOwNiA48DvQPmAw4EOwRqBJEEsgTaBA0FPgVdBW8FhAWiBbwFxwXMBdcF6gX5BQAGBAYIBgoGDAYRBhwGJAYjBhwGGwYlBi4GLgYnBiQGJgYpBicGHQYKBu0F0QW9Ba8FmAVyBU8FQgVGBUAFJQUIBQIFDgUSBQIF7QToBO4E6wTYBMIEuwTABMcEywTQBNcE2QTUBM8EzwTSBM0EvwSyBKsEqASeBI0EgAR6BHkEcwRmBFYERwQ7BDIELQQqBCcEIgQbBBgEGAQaBBoEGAQVBBIEDwQOBBgELQRHBFwEaQRwBHQEdQRzBHUEgQSUBKYEsgS4BLsEuwS7BL8ExQTCBK0EjwR6BHgEfAR0BF0ERgQ/BEAEOgQmBBAEBQQBBPoD6wPXA8cDvgO5A7QDrQOlA6ADoQOkA58DjQN2A2UDXANSA0MDMgMlAxkDBgPuAtsC1ALRAscCtwKpAp8CkgJ+AmsCYwJfAlICN
gIWAvsB4wHFAaUBjAF5AWUBTwE/ATwBPAEvARUB/gD1APEA5gDYANUA3wDnAOIA1gDOAMwAxgC3AKoAqACrAKUAjgB1AGgAbgB+AIoAjwCTAJoApACpAKYAngCdAKIApQCcAIsAfgB8AH4AfQB7AH4AhgCKAIcAfwB7AHwAfgB8AHgAdQBzAHIAcgB1AHYAcwBsAGcAaABsAG8AbQBpAGYAZQBmAGgAawBvAHIAcgBwAHAAeQCMAKQAtgC9AL4AvwDDAMcAzQDWAN0A2QDJALsAugDBAMAAtACjAJkAlACKAH0AeAB+AIcAigCGAIQAhgCKAIwAjwCSAJIAjwCSAJwApAChAJkAlwCfAKYAowCfAKQAswC8ALsAtgCzAK4AnwCPAIgAjQCOAIQAdgBwAHIAcQBoAF8AXABbAFUATABBADQAHAD6/9b/u/+p/5n/iP91/17/RP8n/w3/+v7r/tz+yv64/qX+kP54/mL+Tf44/iH+DP78/e/93f3H/bP9pP2Z/Yv9ff11/XD9aP1W/UP9OP00/TH9Lv0s/S79L/0o/R39F/0b/R/9IP0i/S79QP1N/U39SP1F/Ur9Tv1P/Uv9SP1E/T79Nv0v/S39L/0y/TP9Mv0w/TD9MP0t/Sf9H/0b/Rb9EP0L/Q79Hf0t/TX9Mv0x/Tr9TP1a/V79Wv1W/Vb9WP1a/Vn9U/1I/Tn9KP0Z/Q39Av36/PP86fzZ/MX8sfyl/J/8nfya/Jb8kvyT/JX8lvyU/JT8lfyY/Jr8nfyg/J78lfyJ/Ib8kPyf/Kb8pPyk/Kz8uPzA/MH8wfzF/Mv8z/zR/ND80PzT/N387Pz3/Pr8/PwJ/Rv9KP0q/Sv9Nv1G/U39S/1N/Vv9cf2B/Yr9lf2m/bX9vv3F/dP95/34/QD+Af79/ff98v30/fz9Av4B/vz9+f34/fP97f3u/fn9A/4E/v79/P0C/gf+Bv4F/g/+I/40/jr+O/48/j7+PP44/jn+QP5H/k3+U/5e/mj+b/50/nv+hv6P/pX+n/6u/rz+wf6//sT+1/7w/gP/Dv8c/zb/U/9p/3n/i/+n/8X/2v/o//b/DgAvAFIAcQCMAKYAwwDlAAkBKgFFAVwBcwGKAZ0BqwG6Ac8B5AH0Af4BCAIXAikCOgJIAlMCWwJdAmACawKAApUCogKoArACvwLTAugC+AIEAwwDFQMhAy8DOwNDA0sDVgNmA3UDgAOHA5ADmwOlA64DuAPEA9QD5APxA/gD/gMHBBUEIAQlBCkEMwREBFQEWwRcBGEEagR0BH0EhwSSBJgElgSVBJwEqQSzBLcEvQTHBM4EywTBBLoEtQSsBJwEkASMBIsEhAR4BG4EaAReBFAEQgQ6BDQEKwQgBBoEGAQSBAYE/AP4A/kD9APpA9wD0QPIA7wDsAOlA5kDiwN8A2sDWANBAyoDFwMIA/gC4wLOAr4CswKpAp4ClQKMAoECcwJjAlcCTAJBAjICIQIUAgkC/QHwAeAB1AHOAcwBygHIAckBzgHWAd0B4QHlAewB8QHwAekB4gHgAeEB5AHoAe4B8QHuAegB6AHxAfsBAQIEAgkCDgINAgUC/wH/AQICAAL+AQECCwIQAgwCCgITAicCNQI1Ai4CLAIxAjQCMgIuAjECNwI3AjECLQIvAjMCMwIzAjUCOgI6AjYCNQI5AjoCMgImAiICJwIpAiECFAIMAggC/wHxAegB6wH0AfkB+gH7AfsB9AHpAeUB7gH3AfYB7QHpAekB5gHgAdwB4AHgAdcBygHHAc0BzgHIAcUBygHKAbsBqAGkAa0BrwGhAZMBlAGcAZsBkAGJAY0BjgGAAWsBXwFbAVMBQgE1ATIBMQEmARMBAwH4AOgA0QC9ALMAsAClAJAAegBuAGwAawBmAFkASQA5ACoAGQAGAPL/4P/P/77/rP+a/4v/ff9u/17/UP8//yz/GP8J/wD/9v7j/s7+vv62/q3+oP6T/or+hv59/nH+Zf5e/ln+UP5E/jv+N
v4x/in+HP4N/vv95v3Q/bz9rP2d/ZD9g/11/Wf9Wf1J/Tj9Jv0X/Q79Cv0D/ff85fzT/Mb8vfy3/LL8rvyv/LD8r/yo/J/8m/yc/J78m/yT/Ib8dvxj/E/8Pfwu/CL8FPwH/Pf74/vL+7j7s/uz+6z7l/uC+3r7fPt4+2z7Zftp+2/7aPtY+037T/tR+0r7QPs++0H7QPs5+zb7P/tL+1H7T/tP+1P7Wftc+1/7Y/tn+2j7aPtn+2j7aPtn+2b7Zvtn+2j7bPtx+3j7fvuG+477kvuT+5P7lPuP+4T7eft7+4b7kPuU+5n7p/u0+7P7pfuf+6j7tfuz+6X7nfuj+7H7vfvK+9779PsF/BD8H/wz/En8Wfxk/G78evyG/I/8l/yc/J/8ofyn/LL8uvy7/Lj8ufy//Mb8y/zQ/Nj84/zu/Pj8A/0Q/R/9L/0//Uz9Vv1e/Wj9dP2A/Yj9kP2b/a/9x/3c/en99P0F/hr+Lf46/kX+Uf5e/mX+av51/oj+nf6r/rn+0P7v/gr/Fv8d/y3/Qv9L/0P/Ov89/0b/SP9F/0n/Wf9r/3b/fP+F/47/kP+N/5D/mv+g/5z/l/+e/7L/w//H/8T/x//R/9z/4//t//7/FAAmAC4AMQA4AEYAVgBjAG4AegCLAJsAqQC3AMYA1gDiAOgA8AD+AAsBEgEWASIBNwFHAUsBSQFNAVcBXwFfAWEBawF2AXkBdQF0AXsBhQGPAZ8BswHEAcwB0QHeAfEB/wEEAgYCCgIPAhACEwIcAigCMQI1AjsCQwJCAjsCOQJGAlgCXQJWAlECWwJoAmoCYgJfAmQCaQJiAlYCTwJNAkkCQQI8Aj8CRwJNAlICXAJlAmQCWQJQAlICWgJaAlICTgI=" }, "type": "audio_url" }, { "text": "Please recognize the speech and only output the recognized content:", "type": "text" } ], "role": "user" } ], "temperature": 0.0, "top_p": 1.0 }' ``` (compatibility-log-probs)= ## Log-probabilities testing For evaluation with log-probabilities your `completions` endpoint must support `logprobs` and `echo` parameters: ```bash export FULL_ENDPOINT_URL="https://your-server.com/v1/completions" export API_KEY="your-api-key-here" export MODEL_NAME="your-model-name-here" curl -X POST ${FULL_ENDPOINT_URL} \ -H "Content-Type: application/json" \ -H "Authorization: Bearer ${API_KEY}" \ -d '{ "prompt": "3 + 3 = 6", "model": "'$MODEL_NAME'", "max_tokens": 1, "logprobs": 1, "echo": true }' ``` ## Next Steps - **Run your first evaluation**: Choose your path with {ref}`gs-quickstart` - **Select benchmarks**: Explore available evaluation tasks --- orphan: true --- (bring-your-own-endpoint-manual)= # Manual Deployment Deploy models yourself using popular serving frameworks, then point NeMo Evaluator to your endpoints. 
This approach gives you full control over deployment infrastructure and serving configuration. ## Overview Manual deployment involves: - Setting up model serving using frameworks like vLLM, TensorRT-LLM, or custom solutions - Configuring OpenAI-compatible endpoints - Managing infrastructure, scaling, and monitoring yourself - Using either the launcher or core library to run evaluations against your endpoints :::{note} This guide focuses on NeMo Evaluator configuration. For specific serving framework installation and deployment instructions, refer to their official documentation: - [vLLM Documentation](https://docs.vllm.ai/) - [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/) - [Hugging Face TGI Documentation](https://huggingface.co/docs/text-generation-inference/) ::: ## Using Manual Deployments with NeMo Evaluator Before connecting to your manual deployment, verify it's properly configured using our {ref}`deployment-testing-compatibility` guide. ### With Launcher Once your manual deployment is running, use the launcher to evaluate: ```bash # Basic evaluation against manual deployment nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o target.api_endpoint.url=http://localhost:8080/v1/completions \ -o target.api_endpoint.model_id=your-model-name ``` #### Configuration File Approach ```yaml # config/manual_deployment.yaml defaults: - execution: local - deployment: none # No deployment by launcher - _self_ target: api_endpoint: url: http://localhost:8080/v1/completions model_id: llama-3.1-8b # Optional authentication (name of environment variable holding API key) api_key_name: API_KEY execution: output_dir: ./results evaluation: tasks: - name: mmlu_pro overrides: config.params.limit_samples: 100 - name: gsm8k overrides: config.params.limit_samples: 50 ``` ### With Core Library Direct API usage for manual deployments: ```python from nemo_evaluator import ( ApiEndpoint, ConfigParams, EndpointType, 
EvaluationConfig, EvaluationTarget, evaluate ) # Configure your manual deployment endpoint api_endpoint = ApiEndpoint( url="http://localhost:8080/v1/completions", type=EndpointType.COMPLETIONS, model_id="llama-3.1-8b", api_key="API_KEY" # Name of environment variable holding API key ) target = EvaluationTarget(api_endpoint=api_endpoint) # Configure evaluation config = EvaluationConfig( type="mmlu_pro", output_dir="./results", params=ConfigParams( limit_samples=100, parallelism=4 ) ) # Run evaluation results = evaluate(eval_cfg=config, target_cfg=target) print(f"Results: {results}") ``` #### With Adapter Configuration ```python from nemo_evaluator import ( ApiEndpoint, ConfigParams, EndpointType, EvaluationConfig, EvaluationTarget, evaluate ) from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig # Configure adapter with interceptors adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="caching", config={ "cache_dir": "./cache", "reuse_cached_responses": True } ), InterceptorConfig( name="request_logging", config={"max_requests": 10} ), InterceptorConfig( name="response_logging", config={"max_responses": 10} ) ] ) # Configure endpoint with adapter api_endpoint = ApiEndpoint( url="http://localhost:8080/v1/completions", type=EndpointType.COMPLETIONS, model_id="llama-3.1-8b", api_key="API_KEY", adapter_config=adapter_config ) target = EvaluationTarget(api_endpoint=api_endpoint) # Configure evaluation config = EvaluationConfig( type="mmlu_pro", output_dir="./results", params=ConfigParams( limit_samples=100, parallelism=4 ) ) # Run evaluation results = evaluate(eval_cfg=config, target_cfg=target) print(f"Results: {results}") ``` ## Prerequisites Before using a manually deployed endpoint with NeMo Evaluator, ensure: - Your model endpoint is running and accessible - The endpoint supports OpenAI-compatible API format - You have any required API keys or authentication credentials - Your endpoint supports the required generation 
parameters (see below) ### Endpoint Requirements Your endpoint must support the following generation parameters for compatibility with NeMo Evaluator: - `temperature`: Controls randomness in generation (0.0 to 1.0) - `top_p`: Nucleus sampling threshold (0.0 to 1.0) - `max_tokens`: Maximum tokens to generate ## Testing Your Endpoint Before running evaluations, verify your endpoint is working as expected. ::::{dropdown} Test Completions Endpoint :icon: code-square ```bash # Basic test (no authentication) curl -X POST http://localhost:8080/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "your-model-name", "prompt": "What is machine learning?", "temperature": 0.6, "top_p": 0.95, "max_tokens": 256, "stream": false }' # With authentication curl -X POST http://localhost:8080/v1/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_API_KEY" \ -d '{ "model": "your-model-name", "prompt": "What is machine learning?", "temperature": 0.6, "top_p": 0.95, "max_tokens": 256, "stream": false }' ``` :::: ::::{dropdown} Test Chat Completions Endpoint :icon: code-square ```bash # Basic test (no authentication) curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "your-model-name", "messages": [ { "role": "user", "content": "What is machine learning?" } ], "temperature": 0.6, "top_p": 0.95, "max_tokens": 256, "stream": false }' # With authentication curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_API_KEY" \ -d '{ "model": "your-model-name", "messages": [ { "role": "user", "content": "What is machine learning?" } ], "temperature": 0.6, "top_p": 0.95, "max_tokens": 256, "stream": false }' ``` :::: :::{note} Each evaluation task requires a specific endpoint type. Verify your endpoint supports the correct type for your chosen tasks. 
Use `nemo-evaluator-launcher ls tasks` to see which endpoint type each task requires. ::: ## OpenAI API Compatibility Your endpoint must implement the OpenAI API format: ::::{dropdown} Completions Endpoint Format :icon: code-square **Request**: `POST /v1/completions` ```json { "model": "model-name", "prompt": "string", "max_tokens": 100, "temperature": 0.7, "top_p": 0.9 } ``` **Response**: ```json { "id": "cmpl-xxx", "object": "text_completion", "created": 1234567890, "model": "model-name", "choices": [{ "text": "generated text", "index": 0, "finish_reason": "stop" }], "usage": { "prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30 } } ``` :::: ::::{dropdown} Chat Completions Endpoint Format :icon: code-square **Request**: `POST /v1/chat/completions` ```json { "model": "model-name", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"} ], "max_tokens": 100, "temperature": 0.7 } ``` **Response**: ```json { "id": "chatcmpl-xxx", "object": "chat.completion", "created": 1234567890, "model": "model-name", "choices": [{ "message": { "role": "assistant", "content": "Hello! How can I help you?" }, "index": 0, "finish_reason": "stop" }], "usage": { "prompt_tokens": 15, "completion_tokens": 10, "total_tokens": 25 } } ``` :::: ## Troubleshooting ### Connection Issues If you encounter connection errors: 1. Verify the endpoint is running and accessible. Check the health endpoint (path varies by framework): ```bash # For vLLM, SGLang, NIM curl http://localhost:8080/health # For NeMo/Triton deployments curl http://localhost:8080/v1/triton_health ``` 2. Check that the URL in your configuration matches your deployment: - Include the full path (e.g., `/v1/completions` or `/v1/chat/completions`) - Verify the port number matches your server configuration - Ensure no firewall rules are blocking connections 3. 
Test with a simple curl command before running full evaluations ### Authentication Errors If you see authentication failures: 1. Verify the environment variable has a value: ```bash echo $API_KEY ``` 2. Ensure the `api_key_name` in your YAML configuration matches the environment variable name 3. Check that your endpoint requires the same authentication method ### Timeout Errors If requests are timing out: 1. Increase the timeout in your configuration: ```yaml evaluation: overrides: config.params.request_timeout: 300 # 5 minutes ``` 2. Reduce parallelism to avoid overwhelming your endpoint: ```yaml evaluation: overrides: config.params.parallelism: 1 ``` 3. Check your endpoint's logs for performance issues ## Next Steps - **Hosted services**: Compare with [hosted services](hosted-services.md) for managed solutions - **Launcher-orchestrated deployment**: [Deploy](../launcher-orchestrated/index.md) models for evaluation with `nemo-evaluator-launcher` --- orphan: true --- (launcher-orchestrated-deployment)= # Launcher-Orchestrated Deployment Let NeMo Evaluator Launcher handle both model deployment and evaluation orchestration automatically. This is the recommended approach for most users, providing automated lifecycle management, multi-backend support, and integrated monitoring. ## Overview Launcher-orchestrated deployment means the launcher: - Deploys your model using the specified deployment type - Manages the model serving lifecycle - Runs evaluations against the deployed model - Handles cleanup and resource management The launcher supports multiple deployment backends and execution environments. 
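Throughout these guides, `-o` flags override dotted configuration keys. As an illustrative sketch of how a single dotted override addresses a nested config (the real launcher resolves overrides with Hydra, which additionally handles types, defaults lists, and interpolation):

```python
# Illustration only: the real launcher uses Hydra to resolve "-o key.path=value"
# overrides. This sketch shows how a dotted key addresses a nested config dict.

def apply_override(config: dict, override: str) -> dict:
    """Apply a single 'dotted.key=value' override in place."""
    dotted_key, value = override.split("=", 1)
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

config = {
    "execution": {"output_dir": "./results"},
    "target": {"api_endpoint": {"url": "http://localhost:8080/v1/completions"}},
}
apply_override(config, "execution.account=my_account")
apply_override(config, "target.api_endpoint.model_id=llama-3.1-8b")
```

After both overrides, `execution.account` and `target.api_endpoint.model_id` are set while untouched keys such as `execution.output_dir` keep their original values.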
## Quick Start ```bash # Deploy model and run evaluation in one command (Slurm example) HOSTNAME=cluster-login-node.com ACCOUNT=my_account OUT_DIR=/absolute/path/on/login/node nemo-evaluator-launcher run \ -o execution.hostname=$HOSTNAME \ -o execution.account=$ACCOUNT \ -o execution.output_dir=$OUT_DIR \ --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml ``` ## Execution Backends Choose the execution backend that matches your infrastructure: ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`desktop-download;1.5em;sd-mr-1` Local Execution :link: local :link-type: doc Run evaluations on your local machine against existing endpoints. **Note**: Local executor does **not** deploy models. Use Slurm or Lepton for deployment. ::: :::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Slurm Deployment :link: slurm :link-type: doc Deploy on HPC clusters with Slurm workload manager. Ideal for large-scale evaluations with multi-node parallelism. ::: :::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton Deployment :link: lepton :link-type: doc Deploy on Lepton AI cloud platform. Best for cloud-native deployments with managed infrastructure and auto-scaling. 
::: :::: ## Deployment Types The launcher supports multiple deployment types: ### vLLM Deployment - **Fast inference** with optimized attention mechanisms - **Continuous batching** for high throughput - **Tensor parallelism** support for large models - **Memory optimization** with configurable GPU utilization ### NIM Deployment - **Production-grade reliability** with enterprise features - **NVIDIA optimized containers** for maximum performance - **Built-in monitoring** and logging capabilities - **Enterprise security** features ### SGLang Deployment - **Structured generation** support for complex tasks - **Function calling** capabilities - **JSON mode** for structured outputs - **Efficient batching** for high throughput ### No Deployment - **Use existing endpoints** without launcher deployment - **Bring-your-own-endpoint** integration - **Flexible configuration** for any OpenAI-compatible API ## Configuration Overview Basic configuration structure for launcher-orchestrated deployment: ```yaml # Use Hydra defaults to compose config defaults: - execution: slurm/default # or lepton/default; local does not deploy - deployment: vllm # or nim, sglang, none - _self_ # Deployment configuration deployment: checkpoint_path: /path/to/model # Or HuggingFace model ID served_model_name: my-model # ... deployment-specific options # Execution backend configuration execution: account: my-account output_dir: /path/to/results # ... 
backend-specific options # Evaluation tasks evaluation: tasks: - name: mmlu_pro - name: gsm8k ``` ## Key Benefits ### Automated Lifecycle Management - **Deployment automation**: No manual setup required - **Resource management**: Automatic allocation and cleanup - **Error handling**: Built-in retry and recovery mechanisms - **Monitoring integration**: Real-time status and logging ### Multi-Backend Support - **Consistent interface**: Same commands work across all backends - **Environment flexibility**: Local development to production clusters - **Resource optimization**: Backend-specific optimizations - **Scalability**: From single GPU to multi-node deployments ### Integrated Workflows - **End-to-end automation**: From model to results in one command - **Configuration management**: Version-controlled, reproducible configs - **Result integration**: Built-in export and analysis tools - **Monitoring and debugging**: Comprehensive logging and status tracking ## Getting Started 1. **Choose your backend**: Start with {ref}`launcher-orchestrated-local` for development 2. **Configure your model**: Set deployment type and model path 3. **Run evaluation**: Use the launcher to deploy and evaluate 4. **Monitor progress**: Check status and logs during execution 5. **Analyze results**: Export and analyze evaluation outcomes ## Next Steps - **Local Development**: Start with {ref}`launcher-orchestrated-local` for testing - **Scale Up**: Move to {ref}`launcher-orchestrated-slurm` for production workloads - **Cloud Native**: Try {ref}`launcher-orchestrated-lepton` for managed infrastructure - **Configure Adapters**: Set up {ref}`adapters` for custom processing ```{toctree} :maxdepth: 1 :hidden: Local Deployment Slurm Deployment Lepton Deployment ``` (launcher-orchestrated-local)= # Local Execution Run evaluations on your local machine using Docker containers. The local executor connects to existing model endpoints and orchestrates evaluation tasks locally. 
:::{important} The local executor does **not** deploy models. You must have an existing model endpoint running before starting evaluation. For launcher-orchestrated model deployment, use {ref}`launcher-orchestrated-slurm` or {ref}`launcher-orchestrated-lepton`. ::: ## Overview Local execution: - Runs evaluation containers locally using Docker - Connects to existing model endpoints (local or remote) - Suitable for development, testing, and small-scale evaluations - Supports parallel or sequential task execution ## Quick Start ```bash # Run evaluation against existing endpoint nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml ``` ## Configuration ### Basic Configuration ```yaml # examples/local_basic.yaml defaults: - execution: local - deployment: none - _self_ execution: output_dir: llama_3_1_8b_instruct_results # mode: sequential # Optional: run tasks sequentially instead of parallel target: api_endpoint: model_id: meta/llama-3.2-3b-instruct url: https://integrate.api.nvidia.com/v1/chat/completions api_key_name: NGC_API_KEY evaluation: tasks: - name: ifeval - name: gpqa_diamond ``` **Required fields:** - `execution.output_dir`: Directory for results - `target.api_endpoint.url`: Model endpoint URL - `evaluation.tasks`: List of evaluation tasks ### Execution Modes ```yaml execution: output_dir: ./results mode: parallel # Default: run tasks in parallel # mode: sequential # Run tasks one at a time ``` ### Multi-Task Evaluation ```yaml evaluation: tasks: - name: mmlu_pro overrides: config.params.limit_samples: 200 - name: gsm8k overrides: config.params.limit_samples: 100 - name: humaneval overrides: config.params.limit_samples: 50 ``` ### Task-Specific Configuration ```yaml evaluation: tasks: - name: gpqa_diamond overrides: config.params.temperature: 0.6 config.params.top_p: 0.95 config.params.max_new_tokens: 8192 config.params.parallelism: 4 env_vars: HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND ``` ### With Adapter Configuration 
Configure adapters using evaluation overrides: ```yaml target: api_endpoint: url: http://localhost:8080/v1/chat/completions model_id: my-model evaluation: overrides: target.api_endpoint.adapter_config.use_reasoning: true target.api_endpoint.adapter_config.use_system_prompt: true target.api_endpoint.adapter_config.custom_system_prompt: "Think step by step." ``` For detailed adapter configuration options, refer to {ref}`adapters`. ### Tasks Requiring Dataset Mounting Some tasks require access to local datasets. For these tasks, specify the dataset location: ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /path/to/your/techqa/dataset ``` The system will automatically: - Mount the dataset directory into the evaluation container at `/datasets` (or a custom path if specified) - Set the `NEMO_EVALUATOR_DATASET_DIR` environment variable - Validate that all required environment variables are configured **Custom mount path example:** ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /mnt/data/techqa dataset_mount_path: /custom/path # Optional: customize container mount point ``` ### Advanced Settings If you deploy the model locally with Docker, you can use a dedicated Docker network. This provides a secure connection between the deployment and evaluation containers.
```shell docker network create my-custom-network docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \ --model microsoft/Phi-4-mini-instruct ``` Then use the same network in the evaluator config: ```yaml defaults: - execution: local - deployment: none - _self_ execution: output_dir: my_phi_test extra_docker_args: "--network my-custom-network" target: api_endpoint: model_id: microsoft/Phi-4-mini-instruct url: http://my-phi-container:8000/v1/chat/completions api_key_name: null evaluation: tasks: - name: simple_evals.mmlu_pro overrides: config.params.limit_samples: 10 # TEST ONLY: Limits to 10 samples for quick testing config.params.parallelism: 1 ``` Alternatively, you can expose ports and use the host network: ```shell docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \ --model microsoft/Phi-4-mini-instruct ``` ```yaml execution: extra_docker_args: "--network host" ``` ## Command-Line Usage ### Basic Commands ```bash # Run evaluation nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml # Dry run to preview configuration nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ --dry-run # Override endpoint URL nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o target.api_endpoint.url=http://localhost:8080/v1/chat/completions ``` ### Job Management ```bash # Check job status nemo-evaluator-launcher status <job_id> # Check entire invocation nemo-evaluator-launcher status <invocation_id> # Kill running job nemo-evaluator-launcher kill <invocation_id> # List available tasks nemo-evaluator-launcher ls tasks # List recent runs nemo-evaluator-launcher ls runs ``` ## Requirements ### System Requirements - **Docker**: Docker Engine installed and running - **Storage**: Adequate space for evaluation containers and results - **Network**: Internet access to pull Docker images ### Model Endpoint You must have a model endpoint
running and accessible before starting evaluation. Options include: - {ref}`bring-your-own-endpoint-manual` using vLLM, TensorRT-LLM, or other frameworks - {ref}`bring-your-own-endpoint-hosted` like NVIDIA API Catalog or OpenAI - Custom deployment solutions ## Troubleshooting ### Docker Issues **Docker not running:** ```bash # Check Docker status docker ps # Start Docker daemon (varies by platform) sudo systemctl start docker # Linux # Or open Docker Desktop on macOS/Windows ``` **Permission denied:** ```bash # Add user to docker group (Linux) sudo usermod -aG docker $USER # Log out and back in for changes to take effect ``` ### Endpoint Connectivity **Cannot connect to endpoint:** ```bash # Test endpoint availability curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model": "test", "messages": [{"role": "user", "content": "Hi"}]}' ``` **API authentication errors:** - Verify `api_key_name` matches your environment variable - Check that the environment variable has a value: `echo $NGC_API_KEY` - Check that the API key has proper permissions ### Evaluation Issues **Job hangs or shows no progress:** Check logs in the output directory: ```bash # Track logs in real-time tail -f <output_dir>/<invocation_id>/logs/stdout.log # Kill and restart if needed nemo-evaluator-launcher kill <invocation_id> ``` **Tasks fail with errors:** - Check logs in `<output_dir>/<invocation_id>/logs/stdout.log` - Verify model endpoint supports required request format - Ensure adequate disk space for results ### Configuration Validation ```bash # Validate configuration before running nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ --dry-run ``` ## Next Steps - **Deploy your own model**: See {ref}`bring-your-own-endpoint-manual` for local model serving - **Scale to HPC**: Use {ref}`launcher-orchestrated-slurm` for cluster deployments - **Cloud execution**: Try {ref}`launcher-orchestrated-lepton` for cloud-based evaluation - **Configure adapters**: Add interceptors with
{ref}`adapters` (launcher-orchestrated-lepton)= # Lepton AI Deployment via Launcher Deploy and evaluate models on Lepton AI cloud platform using NeMo Evaluator Launcher orchestration. This approach provides scalable cloud inference with managed infrastructure. ## Overview Lepton launcher-orchestrated deployment: - Deploys models on Lepton AI cloud platform - Provides managed infrastructure and scaling - Supports various resource shapes and configurations - Handles deployment lifecycle in the cloud ## Quick Start ```bash # Deploy and evaluate on Lepton AI nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/lepton_vllm.yaml \ -o deployment.checkpoint_path=meta-llama/Llama-3.1-8B-Instruct \ -o deployment.lepton_config.resource_shape=gpu.1xh200 ``` This command: 1. Deploys a vLLM endpoint on Lepton AI 2. Runs the configured evaluation tasks 3. Returns an invocation ID for monitoring The launcher handles endpoint creation, evaluation execution, and provides cleanup commands. ## Prerequisites ### Lepton AI Setup ```bash # Install Lepton AI CLI pip install leptonai # Authenticate with Lepton AI lep login ``` Refer to the [Lepton AI documentation](https://docs.nvidia.com/dgx-cloud/lepton/get-started) for authentication and workspace configuration. ## Deployment Types ### vLLM Lepton Deployment High-performance inference with cloud scaling: Refer to the complete working configuration in `packages/nemo-evaluator-launcher/examples/lepton_vllm.yaml`. 
Key configuration sections: ```yaml deployment: type: vllm checkpoint_path: meta-llama/Llama-3.1-8B-Instruct served_model_name: llama-3.1-8b-instruct tensor_parallel_size: 1 lepton_config: resource_shape: gpu.1xh200 min_replicas: 1 max_replicas: 3 auto_scaler: scale_down: no_traffic_timeout: 3600 execution: type: lepton evaluation_tasks: timeout: 3600 evaluation: tasks: - name: ifeval ``` The launcher automatically retrieves the endpoint URL after deployment, eliminating the need for manual URL configuration. ### NIM Lepton Deployment Enterprise-grade serving in the cloud. Refer to the complete working configuration in `packages/nemo-evaluator-launcher/examples/lepton_nim.yaml`: ```yaml deployment: type: nim image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6 served_model_name: meta/llama-3.1-8b-instruct lepton_config: resource_shape: gpu.1xh200 min_replicas: 1 max_replicas: 3 auto_scaler: scale_down: no_traffic_timeout: 3600 execution: type: lepton evaluation: tasks: - name: ifeval ``` ### SGLang Deployment SGLang is also supported as a deployment type. Use `deployment.type: sglang` with similar configuration to vLLM. ## Resource Shapes Resource shapes are Lepton platform-specific identifiers that determine the compute resources allocated to your deployment. Available shapes depend on your Lepton workspace configuration and quota. Configure in your deployment: ```yaml deployment: lepton_config: resource_shape: gpu.1xh200 # Example: Check your Lepton workspace for available shapes ``` Refer to the [Lepton AI documentation](https://docs.nvidia.com/dgx-cloud/lepton/get-started) or check your workspace settings for available resource shapes in your environment. 
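Since an invalid `lepton_config` often only surfaces at deployment time, a quick local sanity check can save a round trip. A hedged sketch in plain Python (not a launcher API; the field names mirror the YAML examples on this page):

```python
# Illustrative pre-flight check for a lepton_config section; this is not a
# launcher API. Field names mirror the YAML examples on this page.

def check_lepton_config(lepton_config: dict) -> list:
    """Return a list of human-readable problems (an empty list means OK)."""
    problems = []
    if not lepton_config.get("resource_shape"):
        problems.append("resource_shape is required (e.g. gpu.1xh200)")
    lo = lepton_config.get("min_replicas", 1)
    hi = lepton_config.get("max_replicas", lo)
    if lo > hi:
        problems.append(f"min_replicas ({lo}) exceeds max_replicas ({hi})")
    timeout = (
        lepton_config.get("auto_scaler", {})
        .get("scale_down", {})
        .get("no_traffic_timeout", 0)
    )
    if timeout < 0:
        problems.append("no_traffic_timeout must be non-negative")
    return problems

cfg = {
    "resource_shape": "gpu.1xh200",
    "min_replicas": 1,
    "max_replicas": 3,
    "auto_scaler": {"scale_down": {"no_traffic_timeout": 3600}},
}
print(check_lepton_config(cfg))  # an empty list means the config looks sane
```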
## Configuration Examples ### Auto-Scaling Configuration Configure auto-scaling behavior through the `lepton_config.auto_scaler` section: ```yaml deployment: lepton_config: min_replicas: 1 max_replicas: 3 auto_scaler: scale_down: no_traffic_timeout: 3600 # Seconds before scaling down scale_from_zero: false ``` ### Using Existing Endpoints To evaluate against an already-deployed Lepton endpoint without creating a new deployment, use `deployment.type: none` and provide the endpoint URL in the `target.api_endpoint` section. Refer to `packages/nemo-evaluator-launcher/examples/lepton_basic.yaml` for a complete example. ### Tasks Requiring Dataset Mounting Some tasks require access to local datasets that must be mounted into the evaluation container: ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /path/to/shared/storage/techqa ``` The system will automatically: - Mount the dataset directory into the evaluation container - Set the `NEMO_EVALUATOR_DATASET_DIR` environment variable - Validate that all required environment variables are configured **Custom mount path example:** ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /lepton/shared/datasets/techqa dataset_mount_path: /data/techqa # Optional: customize container mount point ``` :::{note} Ensure the dataset directory is accessible from the Lepton platform's shared storage configured in your workspace. 
::: ## Advanced Configuration ### Environment Variables Pass environment variables to deployment containers through `lepton_config.envs`: ```yaml deployment: lepton_config: envs: HF_TOKEN: value_from: secret_name_ref: "HUGGING_FACE_HUB_TOKEN" CUSTOM_VAR: "direct_value" ``` ### Storage Mounts Configure persistent storage for model caching: ```yaml deployment: lepton_config: mounts: enabled: true cache_path: "/path/to/storage" mount_path: "/opt/nim/.cache" ``` ## Monitoring and Management ### Check Evaluation Status Use NeMo Evaluator Launcher commands to monitor your evaluations: ```bash # Check status using invocation ID nemo-evaluator-launcher status <invocation_id> # Kill running evaluations and clean up endpoints nemo-evaluator-launcher kill <invocation_id> ``` ### Monitor Lepton Resources Use Lepton AI CLI commands to monitor platform resources: ```bash # List all deployments in your workspace lepton deployment list # Get details about a specific deployment lepton deployment get <deployment_name> # View deployment logs lepton deployment logs <deployment_name> # Check resource availability lepton resource list --available ``` Refer to the [Lepton AI CLI documentation](https://docs.nvidia.com/dgx-cloud/lepton/reference/cli/get-started/) for the complete command reference. ## Exporting Results After evaluation completes, export results using the export command: ```bash # Export results to MLflow nemo-evaluator-launcher export <invocation_id> --dest mlflow ``` Refer to the {ref}`exporters-overview` for additional export options and configurations.
## Troubleshooting ### Common Issues **Deployment Timeout:** If endpoints take too long to become ready, check deployment logs: ```bash # Check deployment logs via Lepton CLI lepton deployment logs <deployment_name> # Increase readiness timeout in configuration # (in execution.lepton_platform.deployment.endpoint_readiness_timeout) ``` **Resource Unavailable:** If your requested resource shape is unavailable: ```bash # Check available resources in your workspace lepton resource list --available # Try a different resource shape in your config ``` **Authentication Issues:** ```bash # Re-authenticate with Lepton lep login ``` **Endpoint Not Found:** If evaluation jobs cannot connect to the endpoint: 1. Verify the endpoint is in the "Ready" state using `lepton deployment get <deployment_name>` 2. Confirm the endpoint URL is accessible 3. Verify API tokens are properly set in Lepton secrets ## Next Steps - Compare with {ref}`launcher-orchestrated-slurm` for HPC cluster deployments - Explore {ref}`launcher-orchestrated-local` for local development and testing - Review complete configuration examples in the `examples/` directory (launcher-orchestrated-slurm)= # Slurm Deployment via Launcher Deploy and evaluate models on HPC clusters using the Slurm workload manager through NeMo Evaluator Launcher orchestration.
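Slurm users often think in terms of `sbatch` directives, and the launcher's `execution` settings correspond closely to them. A rough, hypothetical illustration of that mapping (the launcher generates its own submission scripts internally; this is not its actual template):

```python
# Illustration only: the launcher generates its own Slurm submission scripts.
# This sketch shows how execution settings correspond to sbatch directives.

def sbatch_header(execution: dict) -> str:
    """Render a subset of execution settings as #SBATCH directives."""
    directive_map = {
        "account": "--account",
        "partition": "--partition",
        "num_nodes": "--nodes",
        "ntasks_per_node": "--ntasks-per-node",
        "gres": "--gres",
        "walltime": "--time",
    }
    lines = ["#!/bin/bash"]
    for key, flag in directive_map.items():
        if key in execution:
            lines.append(f"#SBATCH {flag}={execution[key]}")
    return "\n".join(lines)

header = sbatch_header({
    "account": "my-account",
    "partition": "gpu",
    "num_nodes": 1,
    "gres": "gpu:8",
    "walltime": "02:00:00",
})
print(header)
```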
## Overview Slurm launcher-orchestrated deployment: - Submits jobs to Slurm-managed HPC clusters - Supports multi-node evaluation runs - Handles resource allocation and job scheduling - Manages model deployment lifecycle within Slurm jobs ## Quick Start ```bash # Deploy and evaluate on Slurm cluster nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/slurm_vllm_checkpoint_path.yaml \ -o deployment.checkpoint_path=/shared/models/llama-3.1-8b-instruct \ -o execution.partition=gpu ``` ## vLLM Deployment ```yaml # Slurm with vLLM deployment defaults: - execution: slurm/default - deployment: vllm - _self_ deployment: type: vllm checkpoint_path: /shared/models/llama-3.1-8b-instruct served_model_name: meta-llama/Llama-3.1-8B-Instruct tensor_parallel_size: 1 data_parallel_size: 8 port: 8000 execution: account: my-account output_dir: /shared/results partition: gpu num_nodes: 1 ntasks_per_node: 1 gres: gpu:8 walltime: "02:00:00" target: api_endpoint: url: http://localhost:8000/v1/chat/completions model_id: meta-llama/Llama-3.1-8B-Instruct evaluation: tasks: - name: ifeval - name: gpqa_diamond - name: mbpp ``` ## Slurm Configuration ### Supported Parameters The following execution parameters are supported for Slurm deployments. See `configs/execution/slurm/default.yaml` in the launcher package for the base configuration: ```yaml execution: # Required parameters hostname: ??? # Slurm cluster hostname username: ${oc.env:USER} # SSH username (defaults to USER environment variable) account: ??? # Slurm account for billing output_dir: ??? 
# Results directory # Resource allocation partition: batch # Slurm partition/queue num_nodes: 1 # Number of nodes ntasks_per_node: 1 # Tasks per node gres: gpu:8 # GPU resources walltime: "01:00:00" # Wall time limit (HH:MM:SS) # Environment variables and mounts env_vars: deployment: {} # Environment variables for deployment container evaluation: {} # Environment variables for evaluation container mounts: deployment: {} # Mount points for deployment container (source:target format) evaluation: {} # Mount points for evaluation container (source:target format) mount_home: true # Whether to mount home directory ``` :::{note} The `gpus_per_node` parameter can be used as an alternative to `gres` for specifying GPU resources. However, `gres` is the default in the base configuration. ::: ## Configuration Examples ### Benchmark Suite Evaluation ```yaml # Run multiple benchmarks on a single model defaults: - execution: slurm/default - deployment: vllm - _self_ deployment: type: vllm checkpoint_path: /shared/models/llama-3.1-8b-instruct served_model_name: meta-llama/Llama-3.1-8B-Instruct tensor_parallel_size: 1 data_parallel_size: 8 port: 8000 execution: account: my-account output_dir: /shared/results hostname: slurm.example.com partition: gpu num_nodes: 1 ntasks_per_node: 1 gres: gpu:8 walltime: "06:00:00" target: api_endpoint: url: http://localhost:8000/v1/chat/completions model_id: meta-llama/Llama-3.1-8B-Instruct evaluation: tasks: - name: ifeval - name: gpqa_diamond - name: mbpp - name: hellaswag ``` ### Tasks Requiring Dataset Mounting Some tasks require access to local datasets stored on the cluster's shared filesystem: ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /shared/datasets/techqa # Path on shared filesystem ``` The system will automatically: - Mount the dataset directory into the evaluation container - Set the `NEMO_EVALUATOR_DATASET_DIR` environment variable - Validate that all required environment variables are configured **Custom mount path 
example:** ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /shared/datasets/techqa dataset_mount_path: /data/techqa # Optional: customize container mount point ``` :::{note} Ensure the dataset directory is accessible from all cluster nodes via shared storage (e.g., NFS, Lustre). ::: ## Job Management ### Submitting Jobs ```bash # Submit job with configuration nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml # Submit with configuration overrides nemo-evaluator-launcher run \ --config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml \ -o execution.walltime="04:00:00" \ -o execution.partition=gpu-long ``` ### Monitoring Jobs ```bash # Check job status nemo-evaluator-launcher status <invocation_id> # List all runs (optionally filter by executor) nemo-evaluator-launcher ls runs --executor slurm ``` ### Managing Jobs ```bash # Cancel job nemo-evaluator-launcher kill <invocation_id> ``` ### Native Slurm Commands You can also use native Slurm commands to manage jobs directly: ```bash # View job details squeue -j <slurm_job_id> -o "%.18i %.9P %.50j %.8u %.2t %.10M %.6D %R" # Check job efficiency seff <slurm_job_id> # Cancel Slurm job directly scancel <slurm_job_id> # Hold/release job scontrol hold <slurm_job_id> scontrol release <slurm_job_id> # View detailed job information scontrol show job <slurm_job_id> ``` ## Shared Storage Slurm evaluations require shared storage accessible from all cluster nodes: ### Model Storage Store models in a shared filesystem accessible to all compute nodes: ```bash # Example shared model directory /shared/models/ ├── llama-3.1-8b-instruct/ ├── llama-3.1-70b-instruct/ └── custom-model.nemo ``` Specify the model path in your configuration: ```yaml deployment: checkpoint_path: /shared/models/llama-3.1-8b-instruct ``` ### Results Storage Evaluation results are written to the configured output directory: ```yaml execution: output_dir: /shared/results ``` Results are organized by timestamp and invocation ID in subdirectories.
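Because results are grouped into per-invocation subdirectories under `output_dir`, a small helper can locate the most recent run programmatically. A sketch that assumes one subdirectory per invocation (the exact layout may vary between launcher versions):

```python
# Sketch: find the most recently modified invocation directory under
# output_dir. Assumes one subdirectory per invocation; the exact layout
# may differ between launcher versions.
import os
import tempfile
import time
from pathlib import Path

def latest_invocation_dir(output_dir):
    """Return the most recently modified subdirectory, or None if empty."""
    subdirs = [p for p in Path(output_dir).iterdir() if p.is_dir()]
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)

# Example against a throwaway directory structure (names are hypothetical):
root = Path(tempfile.mkdtemp())
old = root / "2025-01-01-aaaa1111"
old.mkdir()
new = root / "2025-01-02-bbbb2222"
new.mkdir()
os.utime(old, (time.time() - 3600, time.time() - 3600))  # backdate the older run
print(latest_invocation_dir(root).name)  # -> 2025-01-02-bbbb2222
```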
## Troubleshooting

### Common Issues

**Job Pending:**

```bash
# Check node availability
sinfo -p gpu

# Try different partition
-o execution.partition="gpu-shared"
```

**Job Failed:**

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# View Slurm job details
scontrol show job <job_id>

# Check job output logs (location shown in status output)
```

**Job Timeout:**

```bash
# Increase walltime
-o execution.walltime="08:00:00"

# Check current walltime limit for partition
sinfo -p <partition> -o "%P %l"
```

**Resource Allocation:**

```bash
# Adjust GPU allocation via gres
-o execution.gres=gpu:4
-o deployment.tensor_parallel_size=4
```

### Debugging with Slurm Commands

```bash
# View job details
scontrol show job <job_id>

# Monitor resource usage
sstat -j <job_id> --format=AveCPU,AveRSS,MaxRSS,AveVMSize

# Job accounting information
sacct -j <job_id> --format=JobID,JobName,State,ExitCode,DerivedExitCode

# Check job efficiency after completion
seff <job_id>
```

# Evaluate Automodel Checkpoints Trained by NeMo Framework

This guide provides step-by-step instructions for evaluating checkpoints trained using the NeMo Framework with the Automodel backend.

This section specifically covers evaluation with [nvidia-lm-eval](https://pypi.org/project/nvidia-lm-eval/), a wrapper around the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) tool. Here, we focus on benchmarks within the `lm-evaluation-harness` that depend on text generation. Evaluation on log-probability-based benchmarks is available in [Evaluate Automodel Checkpoints on Log-probability benchmarks](#evaluate-automodel-checkpoints-on-log-probability-benchmarks).

## Deploy Automodel Checkpoints

This section outlines the steps to deploy Automodel checkpoints using Python commands. Automodel checkpoint deployment uses Ray Serve as the serving backend. It also offers an OpenAI API (OAI)-compatible endpoint, similar to deployments of checkpoints trained with the Megatron Core backend. An example deployment command is shown below.
```{literalinclude} _snippets/deploy_hf.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```

The `--model_path` can refer to either a local checkpoint path or a Hugging Face model ID, as shown in the example above.

In the example above, checkpoint deployment uses the `vLLM` backend. To enable accelerated inference, install `vLLM` in your environment. To install `vLLM` inside the NeMo Framework container, follow the steps below as shared in [Export-Deploy's README](https://github.com/NVIDIA-NeMo/Export-Deploy?tab=readme-ov-file#install-tensorrt-llm-vllm-or-trt-onnx-backend:~:text=cd%20/opt/export%2ddeploy%0auv%20sync%20%2d%2dinexact%20%2d%2dlink%2dmode%20symlink%20%2d%2dlocked%20%2d%2dextra%20vllm%20%24(cat%20/opt/uv_args.txt)):

```shell
cd /opt/Export-Deploy
uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
```

To install `vLLM` outside of the NeMo Framework container, follow the steps mentioned [here](https://github.com/NVIDIA-NeMo/Export-Deploy?tab=readme-ov-file#install-tensorrt-llm-vllm-or-trt-onnx-backend:~:text=install%20transformerengine%20%2b%20vllm).

:::{note}
The 25.11 release of the NeMo Framework container comes with `vLLM` pre-installed, so it is not necessary to install it explicitly. For all previous releases, refer to the instructions above to install `vLLM` inside the NeMo Framework container.
:::

If you prefer to evaluate the Automodel checkpoint without using the `vLLM` backend, remove the `--use_vllm_backend` flag from the command above.

:::{note}
To speed up evaluation using multiple instances, increase the `num_replicas` parameter. For additional guidance, refer to {ref}`nemo-fw-ray`.
:::

## Evaluate Automodel Checkpoints

This section outlines the steps to evaluate Automodel checkpoints using Python commands. This method is quick and easy, making it ideal for interactive evaluations.

Once deployment is successful, you can run evaluations using the {ref}`lib-core` API.
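Before calling the evaluation API, it helps to wait until the deployed server is actually accepting requests. The `nemo_evaluator` package provides a `check_endpoint` utility for exactly this; purely as an illustration of the idea, a hand-rolled readiness poll might look like the sketch below (the URL in the usage comment is a placeholder):

```python
import time
import urllib.error
import urllib.request


def wait_for_endpoint(url: str, timeout: float = 600.0, interval: float = 10.0) -> bool:
    """Poll `url` until the server responds, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status < 500:
                    return True  # server is up and answering
        except (urllib.error.URLError, OSError):
            pass  # server not ready yet; keep polling
        time.sleep(interval)
    return False


# Example (placeholder URL for a local deployment):
# ready = wait_for_endpoint("http://0.0.0.0:8080/v1/health")
```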
Before starting the evaluation, it’s recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests.

```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```

## Evaluate Automodel Checkpoints on Log-probability Benchmarks

To evaluate Automodel checkpoints on benchmarks that require log-probabilities, use the same deployment command provided in [Deploy Automodel Checkpoints](#deploy-automodel-checkpoints). These benchmarks are supported both by the `vLLM` backend (enabled via the `--use_vllm_backend` flag) and by directly deploying the Automodel checkpoint.

For evaluation, you must specify the path to the `tokenizer` and set the `tokenizer_backend` parameter as shown below. The `tokenizer` files are located within the checkpoint directory.

```{literalinclude} _snippets/arc_challenge_hf.py
:language: python
:start-after: "## Run the evaluation"
```

## Evaluate Automodel Checkpoints on Chat Benchmarks

To evaluate Automodel checkpoints on chat benchmarks, you need the chat endpoint (`/v1/chat/completions/`). The deployment command provided in [Deploy Automodel Checkpoints](#deploy-automodel-checkpoints) also exposes the chat endpoint, and the same command can be used for evaluating on chat benchmarks.

For evaluation, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/` as shown below. Additionally, set the `type` field to `"chat"` to indicate a chat benchmark.

```{literalinclude} _snippets/ifeval.py
:language: python
:start-after: "## Run the evaluation"
```

(deployment-nemo-fw)=
# Deploy and Evaluate Checkpoints Trained by NeMo Framework

The NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training platform for large language models (LLMs), multimodal models, and speech models.
It enables seamless scaling of both pretraining and post-training workloads, from a single GPU to clusters with thousands of nodes, supporting Hugging Face/PyTorch and Megatron models. NeMo includes a suite of libraries and curated training recipes to help users build models from start to finish.

NeMo Evaluator is integrated within the NeMo Framework, offering streamlined deployment and advanced evaluation capabilities for models trained using NeMo, leveraging state-of-the-art evaluation harnesses.

## Features

- **Multi-Backend Deployment**: Supports PyTriton and multi-instance evaluations using the Ray Serve deployment backend
- **Production-Ready**: Supports high-performance inference with CUDA graphs and flash decoding for Megatron models, the vLLM backend for Hugging Face models, and the TRTLLM engine for TRTLLM models
- **Multi-GPU and Multi-Node Support**: Enables distributed inference across multiple GPUs and compute nodes
- **OpenAI-Compatible API**: Provides RESTful endpoints aligned with OpenAI API specifications

## Architecture

### 1. Deployment Layer

- **PyTriton Backend**: Provides high-performance inference through the NVIDIA Triton Inference Server, with OpenAI API compatibility via a FastAPI interface. Supports model parallelism across single-node and multi-node configurations. Note: Multi-instance evaluation is not supported.
- **Ray Backend**: Enables multi-instance evaluation with model parallelism on a single node using Ray Serve, while maintaining OpenAI API compatibility. Multi-node support is coming soon.

For more information on deployment, please see [NeMo Export-Deploy](https://github.com/NVIDIA-NeMo/Export-Deploy).

### 2. Evaluation Layer

- **NeMo Evaluator**: Provides standardized benchmark evaluations using packages from NVIDIA Eval Factory, bundled in the NeMo Framework container. The `lm-evaluation-harness` is pre-installed by default, and additional evaluation packages can be added as needed.
For more information, see {ref}`core-wheels` and {ref}`lib-core`.

```{toctree}
:maxdepth: 1
:hidden:

Introduction
PyTriton Serving Backend
Ray Serving Backend
Evaluate Megatron Bridge Checkpoints
Evaluate Automodel Checkpoints
Evaluate TRTLLM Checkpoints
```

# Use PyTriton Server for Evaluations

This guide explains how to deploy and evaluate NeMo Framework models, trained with the Megatron-Core backend, using PyTriton to serve the model.

## Introduction

Deployment with the PyTriton serving backend provides high-performance inference through the NVIDIA Triton Inference Server, with OpenAI API compatibility via a FastAPI interface. It supports model parallelism across single-node and multi-node configurations, facilitating deployment of large models that cannot fit into a single device.

## Key Benefits of PyTriton Deployment

- **Multi-Node Support**: Deploy large models on multiple nodes using pipeline, tensor, context, or expert parallelism.
- **Automatic Request Batching**: PyTriton automatically groups your requests into batches for efficient inference.

## Deploy Models Using PyTriton

The deployment scripts are available inside the [`/opt/Export-Deploy/scripts/deploy/nlp/`](https://github.com/NVIDIA-NeMo/Export-Deploy/tree/main/scripts/deploy/nlp) directory. The example command below uses a Hugging Face LLaMA 3 8B checkpoint that has been converted to the Megatron Bridge format.

To evaluate a checkpoint saved during [pre-training or fine-tuning](https://docs.nvidia.com/nemo/megatron-bridge/latest/recipe-usage.html), provide the path to the saved checkpoint using the `--megatron_checkpoint` flag in the command below.

```{literalinclude} _snippets/deploy_pytriton.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```

When working with a larger model, you can use model parallelism to distribute the model across available devices.
The example below deploys the [Llama-3_3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) model (converted to the Megatron Bridge format) with 8 devices and tensor parallelism:

```bash
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \
    --megatron_checkpoint "/workspace/Llama-3_3-Nemotron-Super-49B-v1/iter_0000000" \
    --triton_model_name "megatron_model" \
    --server_port 8080 \
    --num_gpus 8 \
    --tensor_model_parallel_size 8 \
    --max_batch_size 4 \
    --inference_max_seq_length 4096
```

Make sure to adjust the parameters to match your available resources and model architecture.

## Run Evaluations on PyTriton-Deployed Models

The entry point for evaluation is the [`evaluate`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/evaluate.py) function. To run evaluations on the deployed model, use the following command. Make sure to open a new terminal within the same container to execute it.

For longer evaluations, it is advisable to run both the deploy and evaluate commands in tmux sessions to prevent the processes from being terminated unexpectedly and aborting the runs.

It is recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests before starting the evaluation.

```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```

To evaluate the chat endpoint, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/`. Additionally, set the `type` field to `"chat"` to indicate a chat benchmark.

To evaluate log-probability benchmarks (e.g., `arc_challenge`), run the following code snippet after deployment. Make sure to open a new terminal within the same container to execute it.
```{literalinclude} _snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```

Note that in the example above, you must provide a path to the tokenizer:

```python
extra={
    "tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer",
    "tokenizer_backend": "huggingface",
},
```

Please refer to the [`deploy_inframework_triton.py`](https://github.com/NVIDIA-NeMo/Export-Deploy/blob/main/scripts/deploy/nlp/deploy_inframework_triton.py) script and the [`evaluate`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/evaluate.py) function to review all available argument options, as the provided commands are only examples and do not include all arguments or their default values. For more detailed information on the arguments used in the `ApiEndpoint` and `ConfigParams` classes for evaluation, see the [`api_dataclasses`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/api/api_dataclasses.py) submodule.

:::{tip}
If you encounter a `TimeoutError` on the eval client side, increase the `request_timeout` parameter in the `ConfigParams` class to a larger value such as `1000` or `1200` seconds (the default is 300).
:::

(nemo-fw-ray)=
# Use Ray Serve for Multi-Instance Evaluations

This guide explains how to deploy and evaluate NeMo Framework models, trained with the Megatron-Core backend, using Ray Serve to enable multi-instance evaluation across available GPUs.

## Introduction

Deployment with Ray Serve provides support for multiple replicas of your model across available GPUs, enabling higher throughput and better resource utilization during evaluation. This approach is particularly beneficial for evaluation scenarios where you need to process large datasets efficiently and would like to accelerate evaluation.

## Key Benefits of Ray Deployment

- **Multiple Model Replicas**: Deploy multiple instances of your model to handle concurrent requests.
- **Automatic Load Balancing**: Ray automatically distributes requests across available replicas.
- **Scalable Architecture**: Easily scale up or down based on your hardware resources.
- **Resource Optimization**: Better utilization of available GPUs.

## Deploy Models Using Ray Serve

To deploy your model using Ray, use the `deploy_ray_inframework.py` script from [NeMo Export-Deploy](https://github.com/NVIDIA-NeMo/Export-Deploy):

```shell
# megatron_checkpoint: Llama3 8B HF checkpoint converted to MBridge
# port:                Ray server port
# num_gpus:            total GPUs available
# num_replicas:        number of model replicas
# *_parallel_size:     model parallelism per replica
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint "/workspace/mbridge_llama3_8b/iter_0000000/" \
    --model_id "megatron_model" \
    --port 8080 \
    --num_gpus 4 \
    --num_replicas 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1
```

:::{note}
Adjust `num_replicas` based on the number of instances/replicas needed. Ensure that the total `num_gpus` is equal to `num_replicas` times the model parallelism configuration (i.e., `tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size`).
:::

## Run Evaluations on Ray-Deployed Models

Once your model is deployed with Ray, you can run evaluations using the same evaluation API as with PyTriton deployment. It is recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests before starting the evaluation.

To evaluate on generation benchmarks, use the code snippet below:

```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```

To evaluate the chat endpoint, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/`.
Additionally, set the `type` field to `"chat"` in both `ApiEndpoint` and `EvaluationConfig` to indicate a chat benchmark.

To evaluate log-probability benchmarks (e.g., `arc_challenge`), run the following code snippet after deployment. Make sure to open a new terminal within the same container to execute it.

```{literalinclude} _snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```

Note that in the example above, you must provide a path to the tokenizer:

```python
extra={
    "tokenizer": "/workspace/mbridge_llama3_8b/iter_0000000/tokenizer",
    "tokenizer_backend": "huggingface",
},
```

:::{tip}
To get a performance boost from multiple replicas in Ray, increase the `parallelism` value in your `EvaluationConfig`. You won't see any speed improvement if `parallelism=1`; try setting it to a higher value, such as 4 or 8.
:::

# Evaluate Megatron Bridge Checkpoints Trained by NeMo Framework

This guide provides step-by-step instructions for evaluating [Megatron Bridge](https://docs.nvidia.com/nemo/megatron-bridge/latest/index.html) checkpoints trained using the NeMo Framework with the Megatron Core backend.

This section specifically covers evaluation with [nvidia-lm-eval](https://pypi.org/project/nvidia-lm-eval/), a wrapper around the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) tool. First, we focus on benchmarks within the `lm-evaluation-harness` that depend on text generation. Evaluation on log-probability-based benchmarks is covered in the subsequent section, [Evaluate Megatron Bridge Checkpoints on Log-probability benchmarks](#evaluate-megatron-bridge-checkpoints-on-log-probability-benchmarks).
## Deploy Megatron Bridge Checkpoints

To evaluate a checkpoint saved during pretraining or fine-tuning with [Megatron-Bridge](https://docs.nvidia.com/nemo/megatron-bridge/latest/recipe-usage.html), provide the path to the saved checkpoint using the `--megatron_checkpoint` flag in the deployment command below. Otherwise, Hugging Face checkpoints can be converted to Megatron Bridge format using the following shell commands:

```bash
huggingface-cli login --token <HF_TOKEN>
python -c "from megatron.bridge import AutoBridge; AutoBridge.import_ckpt('meta-llama/Meta-Llama-3-8B','/workspace/mbridge_llama3_8b/')"
```

The deployment scripts are available inside the [`/opt/Export-Deploy/scripts/deploy/nlp/`](https://github.com/NVIDIA-NeMo/Export-Deploy/tree/main/scripts/deploy/nlp) directory. Below is an example command for deployment. It uses a Hugging Face LLaMA 3 8B checkpoint that has been converted to Megatron Bridge format using the command shared above.

```{literalinclude} _snippets/deploy_mbridge.sh
:language: bash
:start-after: "# [snippet-start]"
:end-before: "# [snippet-end]"
```

:::{note}
Megatron Bridge creates checkpoints in directories named `iter_N`, where *N* is the iteration number. Each `iter_N` directory contains model weights and related artifacts. When using a checkpoint, make sure to provide the path to the appropriate `iter_N` directory. Hugging Face checkpoints converted for Megatron Bridge are typically stored in a directory named `iter_0000000`, as shown in the command above.
:::

:::{note}
Megatron Bridge deployment for evaluation is supported only with Ray Serve, not PyTriton.
:::

## Evaluate Megatron Bridge Checkpoints

Once deployment is successful, you can run evaluations using the NeMo Evaluator API. See {ref}`lib-core` for more details.
Before starting the evaluation, it’s recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests.

```{literalinclude} _snippets/mmlu.py
:language: python
:start-after: "## Run the evaluation"
```

## Evaluate Megatron Bridge Checkpoints on Log-probability Benchmarks

To evaluate Megatron Bridge checkpoints on benchmarks that require log-probabilities, use the same deployment command provided in [Deploy Megatron Bridge Checkpoints](#deploy-megatron-bridge-checkpoints). For evaluation, you must specify the path to the `tokenizer` and set the `tokenizer_backend` parameter as shown below. The `tokenizer` files are located within the `tokenizer` directory of the checkpoint.

```{literalinclude} _snippets/arc_challenge_mbridge.py
:language: python
:start-after: "## Run the evaluation"
```

## Evaluate Megatron Bridge Checkpoints on Chat Benchmarks

To evaluate Megatron Bridge checkpoints on chat benchmarks, you need the chat endpoint (`/v1/chat/completions/`). The deployment command provided in [Deploy Megatron Bridge Checkpoints](#deploy-megatron-bridge-checkpoints) also exposes the chat endpoint, and the same command can be used for evaluating on chat benchmarks.

For evaluation, update the URL by replacing `/v1/completions/` with `/v1/chat/completions/` as shown below. Additionally, set the `type` field to `"chat"` to indicate a chat benchmark.

```{literalinclude} _snippets/ifeval.py
:language: python
:start-after: "## Run the evaluation"
```

# Evaluate TensorRT-LLM Checkpoints with NeMo Framework

This guide provides step-by-step instructions for evaluating TensorRT-LLM (TRTLLM) checkpoints or models inside NeMo Framework. This guide focuses on benchmarks within the `lm-evaluation-harness` that depend on text generation.
For a detailed comparison between generation-based and log-probability-based benchmarks, refer to {ref}`eval-run`. :::{note} Evaluation on log-probability-based benchmarks for TRTLLM models is currently planned for a future release. ::: ## Deploy TRTLLM Checkpoints This section outlines the steps to deploy TRTLLM checkpoints using Python commands. TRTLLM checkpoint deployment uses Ray Serve as the serving backend. It also offers an OpenAI API (OAI)-compatible endpoint, similar to deployments of checkpoints trained with the Megatron Core backend. An example deployment command is shown below. ```{literalinclude} _snippets/deploy_trtllm.sh :language: bash :start-after: "# [snippet-start]" :end-before: "# [snippet-end]" ``` ## Evaluate TRTLLM Checkpoints This section outlines the steps to evaluate TRTLLM checkpoints using Python commands. This method is quick and easy, making it ideal for interactive evaluations. Once deployment is successful, you can run evaluations using the same evaluation API described in other sections. Before starting the evaluation, it’s recommended to use the [`check_endpoint`](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator/src/nemo_evaluator/core/utils.py) function to verify that the endpoint is responsive and ready to accept requests. ```{literalinclude} _snippets/mmlu.py :language: python :start-after: "## Run the evaluation" ``` (lib)= # NeMo Evaluator Libraries Select a library for your evaluation workflow: ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` NeMo Evaluator Launcher :link: nemo-evaluator-launcher/index :link-type: doc **Start here** - Unified CLI and Python API for running evaluations across local, cluster, and hosted environments. ::: :::{grid-item-card} {octicon}`beaker;1.5em;sd-mr-1` NeMo Evaluator :link: nemo-evaluator/index :link-type: doc **Advanced usage** - Direct access to core evaluation logic for custom integrations. 
:::
::::

The Launcher orchestrates the NeMo Evaluator containers using identical underlying code to ensure consistent results.

(deployment-generic)=
# Generic Deployment

Generic deployment provides flexible configuration for deploying any custom server that isn't covered by built-in deployment configurations.

## Configuration

See `configs/deployment/generic.yaml` for all available parameters.

### Basic Settings

Key arguments:

- **`image`**: Docker image to use for deployment (required)
- **`command`**: Command to run the server with template variables (required)
- **`served_model_name`**: Name of the served model (required)
- **`endpoints`**: API endpoint paths (chat, completions, health)
- **`checkpoint_path`**: Path to model checkpoint for mounting (default: null)
- **`extra_args`**: Additional command line arguments
- **`env_vars`**: Environment variables as a {name: value} dict

## Best Practices

- Ensure the server responds to the health check endpoint (and that the health endpoint is correctly parameterized)
- Test your configuration with `--dry_run`

## Contributing Permanent Configurations

If you've successfully applied the generic deployment to serve a specific model or framework, contributions are welcome! We'll turn your working configuration into a permanent config file for the community.

# Deployment Configuration

Deployment configurations define how to provision and host model endpoints for evaluation.

## Deployment Types

Choose the deployment type for your evaluation:

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`globe;1.5em;sd-mr-1` None (External)
:link: none
:link-type: doc

Use existing API endpoints. No model deployment needed.
:::

:::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` vLLM
:link: vllm
:link-type: doc

Deploy models using the vLLM serving framework.
:::

:::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` SGLang
:link: sglang
:link-type: doc

Deploy models using the SGLang serving framework.
::: :::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` NIM :link: nim :link-type: doc Deploy models using NVIDIA Inference Microservices. ::: :::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` TRT-LLM :link: trtllm :link-type: doc Deploy models using NVIDIA TensorRT LLM. ::: :::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Generic :link: generic :link-type: doc Deploy models using a fully custom setup. ::: :::: ## Quick Reference ```yaml deployment: type: vllm # or sglang, nim, none # ... deployment-specific settings ``` ```{toctree} :caption: Deployment Types :hidden: vLLM SGLang NIM TensorRT-LLM Generic None (External) ``` (deployment-vllm)= # vLLM Deployment Configure vLLM as the deployment backend for serving models during evaluation. ## Configuration Parameters ### Basic Settings ```yaml deployment: type: vllm image: vllm/vllm-openai:latest hf_model_handle: hf-model/handle # HuggingFace ID checkpoint_path: null # or provide a path to the stored checkpoint served_model_name: your-model-name port: 8000 ``` **Required Fields:** - `checkpoint_path` or `hf_model_handle`: Model path or HuggingFace model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct`) - `served_model_name`: Name for the served model ### Performance Settings ```yaml deployment: tensor_parallel_size: 8 pipeline_parallel_size: 1 data_parallel_size: 1 gpu_memory_utilization: 0.95 ``` - **tensor_parallel_size**: Number of GPUs to split the model across (default: 8) - **pipeline_parallel_size**: Number of pipeline stages (default: 1) - **data_parallel_size**: Number of model replicas (default: 1) - **gpu_memory_utilization**: Fraction of GPU memory to use for the model (default: 0.95) ### Extra Arguments and Endpoints ```yaml deployment: extra_args: "--max-model-len 4096" endpoints: chat: /v1/chat/completions completions: /v1/completions health: /health ``` The `extra_args` field passes extra arguments to the `vllm serve` command. 
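Conceptually, these deployment settings are rendered into a `vllm serve` command line, with `extra_args` appended verbatim. The sketch below only illustrates that mapping under stated assumptions (the launcher's real command template lives in `configs/deployment/vllm.yaml`; flag names such as `--served-model-name`, `--tensor-parallel-size`, and `--max-model-len` are standard `vllm serve` options):

```python
import shlex

# Illustrative config values mirroring the YAML above (not the launcher's
# actual template rendering).
config = {
    "hf_model_handle": "meta-llama/Llama-3.1-8B-Instruct",
    "served_model_name": "your-model-name",
    "port": 8000,
    "tensor_parallel_size": 8,
    "extra_args": "--max-model-len 4096",
}

cmd = (
    f"vllm serve {config['hf_model_handle']}"
    f" --served-model-name {config['served_model_name']}"
    f" --port {config['port']}"
    f" --tensor-parallel-size {config['tensor_parallel_size']}"
    f" {config['extra_args']}"
)

print(shlex.split(cmd))  # token list, ready for subprocess.run
```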
## Complete Example

```yaml
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

deployment:
  checkpoint_path: Qwen/Qwen3-4B-Instruct-2507
  served_model_name: qwen3-4b-instruct-2507
  tensor_parallel_size: 1
  data_parallel_size: 8
  extra_args: "--max-model-len 4096"

execution:
  hostname: your-cluster-headnode
  account: your-account
  output_dir: /path/to/output
  walltime: "02:00:00"

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND  # Request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```

## Reference

The following example configuration files are available in the `examples/` directory:

- `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment on Lepton platform
- `slurm_llama_3_1_8b_instruct.yaml` - vLLM deployment on SLURM cluster
- `slurm_llama_3_1_8b_instruct_hf.yaml` - vLLM deployment using HuggingFace model ID

Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.

(deployment-nim)=
# NIM Deployment

NIM (NVIDIA Inference Microservices) provides optimized inference microservices with OpenAI-compatible application programming interfaces. NIM deployments automatically handle model optimization, scaling, and resource management on supported platforms.
## Configuration ### Basic Settings ```yaml deployment: image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6 served_model_name: meta/llama-3.1-8b-instruct port: 8000 ``` - **`image`**: NIM container image from [NVIDIA NIM Containers](https://catalog.ngc.nvidia.com/containers?filters=nvidia_nim) (required) - **`served_model_name`**: Name used for serving the model (required) - **`port`**: Port for the NIM server (default: 8000) ### Endpoints ```yaml endpoints: chat: /v1/chat/completions completions: /v1/completions health: /health ``` ## Integration with Lepton NIM deployment with Lepton executor: ```yaml defaults: - execution: lepton/default - deployment: nim - _self_ deployment: image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6 served_model_name: meta/llama-3.1-8b-instruct # Platform-specific settings lepton_config: endpoint_name: nim-llama-3-1-8b-eval resource_shape: gpu.1xh200 # ... additional platform settings ``` ### Environment Variables Configure environment variables for NIM container operation: ```yaml deployment: lepton_config: envs: HF_TOKEN: value_from: secret_name_ref: "HUGGING_FACE_HUB_TOKEN" ``` **Auto-populated Variables:** The launcher automatically sets these environment variables from your deployment configuration: - `SERVED_MODEL_NAME`: Set from `deployment.served_model_name` - `NIM_MODEL_NAME`: Set from `deployment.served_model_name` - `MODEL_PORT`: Set from `deployment.port` (default: 8000) ### Resource Management #### Auto-scaling Configuration ```yaml deployment: lepton_config: min_replicas: 1 max_replicas: 3 auto_scaler: scale_down: no_traffic_timeout: 3600 scale_from_zero: false target_gpu_utilization_percentage: 0 target_throughput: qpm: 2.5 ``` #### Storage Mounts Enable model caching for faster startup: ```yaml deployment: lepton_config: mounts: enabled: true cache_path: "/path/to/model/cache" mount_path: "/opt/nim/.cache" ``` ### Security Configuration #### API Tokens ```yaml deployment: lepton_config: api_tokens: - value: 
"UNIQUE_ENDPOINT_TOKEN" ``` #### Image Pull Secrets ```yaml execution: lepton_platform: tasks: image_pull_secrets: - "lepton-nvidia-registry-secret" ``` ### Complete Example ```yaml defaults: - execution: lepton/default - deployment: nim - _self_ execution: output_dir: lepton_nim_llama_3_1_8b_results deployment: image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6 served_model_name: meta/llama-3.1-8b-instruct lepton_config: endpoint_name: llama-3-1-8b resource_shape: gpu.1xh200 min_replicas: 1 max_replicas: 3 api_tokens: - value_from: token_name_ref: "ENDPOINT_API_KEY" envs: HF_TOKEN: value_from: secret_name_ref: "HUGGING_FACE_HUB_TOKEN" mounts: enabled: true cache_path: "/path/to/model/cache" mount_path: "/opt/nim/.cache" evaluation: tasks: - name: ifeval ``` ## Examples Refer to `packages/nemo-evaluator-launcher/examples/lepton_nim.yaml` for a complete NIM deployment example. ## Reference - [NIM Documentation](https://docs.nvidia.com/nim/) - [NIM Deployment Guide](https://docs.nvidia.com/nim/large-language-models/latest/deployment-guide.html) (deployment-none)= # None Deployment The "none" deployment option means **no model deployment is performed**. Instead, you provide an existing OpenAI-compatible endpoint. The launcher handles running evaluation tasks while connecting to your existing endpoint. 
## When to Use None Deployment - **Existing Endpoints**: You have a running model endpoint to evaluate - **Third-Party Services**: Testing models from NVIDIA API Catalog, OpenAI, or other providers - **Custom Infrastructure**: Using your own deployment solution outside the launcher - **Cost Optimization**: Reusing existing deployments across multiple evaluation runs - **Separation of Concerns**: Keeping model deployment and evaluation as separate processes ## Key Benefits - **No Resource Management**: No need to provision or manage model deployment resources - **Platform Flexibility**: Works with Local, Lepton, and SLURM execution platforms - **Quick Setup**: Minimal configuration required - just point to your endpoint - **Cost Effective**: Leverage existing deployments without additional infrastructure ## Universal Configuration These configuration patterns apply to all execution platforms when using "none" deployment. ### Target Endpoint Setup ```yaml target: api_endpoint: model_id: meta/llama-3.1-8b-instruct # Model identifier (required) url: https://your-endpoint.com/v1/chat/completions # Endpoint URL (required) api_key_name: API_KEY # Environment variable name (recommended) ``` ## Platform Examples Choose your execution platform and see the specific configuration needed: ::::{tab-set} :::{tab-item} Local **Best for**: Development, testing, small-scale evaluations ```yaml defaults: - execution: local - deployment: none - _self_ execution: output_dir: results target: api_endpoint: model_id: meta/llama-3.2-3b-instruct url: https://integrate.api.nvidia.com/v1/chat/completions api_key_name: NGC_API_KEY evaluation: tasks: - name: gpqa_diamond env_vars: HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa ``` **Key Points:** - Minimal configuration required - Set environment variables in your shell - Limited by local machine resources ::: :::{tab-item} Lepton **Best for**: Production evaluations, 
team environments, scalable workloads ```yaml defaults: - execution: lepton/default - deployment: none - _self_ execution: output_dir: results lepton_platform: tasks: api_tokens: - value_from: token_name_ref: "ENDPOINT_API_KEY" env_vars: HF_TOKEN: value_from: secret_name_ref: "HUGGING_FACE_HUB_TOKEN" API_KEY: value_from: secret_name_ref: "ENDPOINT_API_KEY" node_group: "your-node-group" mounts: - from: "node-nfs:shared-fs" path: "/workspace/path" mount_path: "/workspace" target: api_endpoint: model_id: meta/llama-3.1-8b-instruct url: https://your-endpoint.lepton.run/v1/chat/completions api_key_name: API_KEY evaluation: tasks: - name: gpqa_diamond ``` **Key Points:** - Requires Lepton credentials (`lep login`) - Use `secret_name_ref` for secure credential storage - Configure node groups and storage mounts - Handles larger evaluation workloads ::: :::{tab-item} SLURM **Best for**: HPC environments, large-scale evaluations, batch processing ```yaml defaults: - execution: slurm/default - deployment: none - _self_ execution: account: your-slurm-account output_dir: /shared/filesystem/results walltime: "02:00:00" partition: cpu_short gpus_per_node: null # No GPUs needed target: api_endpoint: model_id: meta/llama-3.2-3b-instruct url: https://integrate.api.nvidia.com/v1/chat/completions api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com evaluation: tasks: - name: gpqa_diamond env_vars: HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa ``` **Key Points:** - Requires SLURM account and accessible output directory - Creates one job per benchmark evaluation - Uses CPU partitions (no GPUs needed for none deployment) ::: :::: (deployment-sglang)= # SGLang Deployment SGLang is a serving framework for large language models. This deployment type launches SGLang servers using the `lmsysorg/sglang` Docker image. 
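Once the server container is up, its OpenAI-compatible API can be polled for readiness before an evaluation is launched. The following is a minimal sketch, assuming the default port 8000 and the standard `/health` route; the `wait_for_health` helper is illustrative and not part of the launcher (which performs its own readiness handling):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(base_url: str, timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll <base_url>/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval_s)
    return False

# e.g. wait_for_health("http://localhost:8000") before pointing a benchmark at it
```

A snippet like this is mainly useful when you manage the serving container yourself rather than letting the launcher deploy it.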
## Configuration ### Required Settings See the complete configuration structure in `configs/deployment/sglang.yaml`. ```yaml deployment: type: sglang image: lmsysorg/sglang:latest hf_model_handle: hf-model/handle # HuggingFace ID checkpoint_path: null # or provide a path to the stored checkpoint served_model_name: your-model-name port: 8000 ``` **Required Fields:** - `checkpoint_path` or `hf_model_handle`: Model path or HuggingFace model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct`) - `served_model_name`: Name for the served model ### Optional Settings ```yaml deployment: tensor_parallel_size: 8 # Default: 8 data_parallel_size: 1 # Default: 1 extra_args: "" # Extra SGLang server arguments env_vars: {} # Environment variables (key: value dict) ``` **Configuration Fields:** - `tensor_parallel_size`: Number of GPUs for tensor parallelism (default: 8) - `data_parallel_size`: Number of data parallel replicas (default: 1) - `extra_args`: Extra command-line arguments to pass to SGLang server - `env_vars`: Environment variables for the container ### API Endpoints The SGLang deployment exposes OpenAI-compatible endpoints: ```yaml endpoints: chat: /v1/chat/completions completions: /v1/completions health: /health ``` ## Example Configuration ```yaml defaults: - execution: slurm/default - deployment: sglang - _self_ deployment: checkpoint_path: Qwen/Qwen3-4B-Instruct-2507 served_model_name: qwen3-4b-instruct-2507 tensor_parallel_size: 8 data_parallel_size: 8 extra_args: "" execution: hostname: your-cluster-headnode account: your-account output_dir: /path/to/output walltime: 02:00:00 evaluation: tasks: - name: gpqa_diamond - name: ifeval env_vars: HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # or use HF_HOME ``` ## Command Template The launcher uses the following command template to start the SGLang server (from `configs/deployment/sglang.yaml`): ```bash python3 -m sglang.launch_server \ --model-path ${oc.select:deployment.hf_model_handle,/checkpoint} \ --host 0.0.0.0 \ --port 
${deployment.port} \ --served-model-name ${deployment.served_model_name} \ --tp ${deployment.tensor_parallel_size} \ --dp ${deployment.data_parallel_size} \ ${deployment.extra_args} ``` :::{note} The `${oc.select:deployment.hf_model_handle,/checkpoint}` syntax uses OmegaConf's select resolver. In practice, set `checkpoint_path` with your model path or HuggingFace model ID. ::: ## Reference **Configuration File:** - Source: `packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/sglang.yaml` **Related Documentation:** - [Deployment Configuration Overview](index.md) - [Execution Platform Configuration](../executors/index.md) - [SGLang Documentation](https://docs.sglang.ai/) (deployment-trtllm)= # TensorRT LLM (TRT-LLM) Deployment Configure TRT-LLM as the deployment backend for serving models during evaluation. ## Configuration Parameters ### Basic Settings ```yaml deployment: type: trtllm image: nvcr.io/nvidia/tensorrt-llm/release:1.0.0 checkpoint_path: /path/to/model served_model_name: your-model-name port: 8000 ``` ### Parallelism Configuration ```yaml deployment: tensor_parallel_size: 4 pipeline_parallel_size: 1 ``` - **tensor_parallel_size**: Number of GPUs to split the model across (default: 4) - **pipeline_parallel_size**: Number of pipeline stages (default: 1) ### Extra Arguments and Endpoints ```yaml deployment: extra_args: "--ep_size 2" endpoints: chat: /v1/chat/completions completions: /v1/completions health: /health ``` The `extra_args` field passes additional arguments to the `trtllm-serve serve` command.
## Complete Example ```yaml defaults: - execution: slurm/default - deployment: trtllm - _self_ deployment: checkpoint_path: /path/to/checkpoint served_model_name: llama-3.1-8b-instruct tensor_parallel_size: 1 extra_args: "" execution: account: your-account output_dir: /path/to/output walltime: 02:00:00 evaluation: tasks: - name: ifeval - name: gpqa_diamond env_vars: HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa ``` Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running. (evaluation-configuration)= # Evaluation Configuration Evaluation configuration defines which benchmarks to run and how they are configured. It is shared by all executors and can be reused between them to launch the exact same tasks. **Important**: Each task has its own default values that you can override. For comprehensive override options, see {ref}`parameter-overrides`. ## Configuration Structure ```yaml evaluation: nemo_evaluator_config: # Global overrides for all tasks config: params: request_timeout: 3600 tasks: - name: task_name # Use default benchmark configuration - name: another_task nemo_evaluator_config: # Task-specific overrides config: params: # Task-specific overrides temperature: 0.6 top_p: 0.95 env_vars: # Task-specific environment variables HF_TOKEN: MY_HF_TOKEN ``` ## Key Components ### Global Overrides - **`nemo_evaluator_config`**: Parameter overrides that apply to all tasks - **`env_vars`**: Environment variables that apply to all tasks ### Task Configuration - **`tasks`**: List of evaluation tasks to run - **`name`**: Name of the benchmark task - **`nemo_evaluator_config`**: Task-specific parameter overrides - **`env_vars`**: Task-specific environment variables For a comprehensive list of available tasks, their descriptions, and task-specific parameters, see {ref}`nemo-evaluator-containers`.
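Task-specific overrides take precedence over global ones, so a task's effective parameters behave like a shallow merge of the two levels. The following is a purely illustrative Python sketch of that precedence; `merged_params` is a hypothetical helper, not the launcher's actual merge code:

```python
def merged_params(global_cfg: dict, task_cfg: dict) -> dict:
    """Hypothetical helper: start from the global params, then let
    task-level params win on any conflicting key."""
    merged = dict(global_cfg.get("params", {}))
    merged.update(task_cfg.get("params", {}))
    return merged

global_overrides = {"params": {"request_timeout": 3600, "temperature": 0.7}}
task_overrides = {"params": {"temperature": 0.6, "top_p": 0.95}}

# temperature comes from the task (0.6); request_timeout (3600) is inherited
effective = merged_params(global_overrides, task_overrides)
```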
## Advanced Task Configuration ### Parameter Overrides The overrides system is crucial for leveraging the full flexibility of the common endpoint interceptors and task configuration layer. This is where nemo-evaluator intersects with nemo-evaluator-launcher, providing a unified configuration interface. #### Global Overrides Settings applied to all tasks listed in the config. ```yaml evaluation: nemo_evaluator_config: config: params: request_timeout: 3600 temperature: 0.7 ``` #### Task-Specific Overrides Parameters passed to a job for a single task. They take precedence over global evaluation settings. ```yaml evaluation: tasks: - name: gpqa_diamond nemo_evaluator_config: config: params: temperature: 0.6 top_p: 0.95 max_new_tokens: 8192 parallelism: 32 - name: mbpp nemo_evaluator_config: config: params: temperature: 0.2 top_p: 0.95 max_new_tokens: 2048 extra: n_samples: 5 ``` ### Environment Variables Task-specific environment variables. These parameters are set for a single job and don't affect other tasks: ```yaml evaluation: tasks: - name: task_name1 # HF_TOKEN and CUSTOM_VAR are available for task_name1 env_vars: HF_TOKEN: MY_HF_TOKEN CUSTOM_VAR: CUSTOM_VALUE - name: task_name2 # HF_TOKEN and CUSTOM_VAR are not set for task_name2 ``` ### Dataset Directory Mounting Some evaluation tasks require access to local datasets that must be mounted into the evaluation container. Tasks that require dataset mounting will have `NEMO_EVALUATOR_DATASET_DIR` in their `required_env_vars`. When using such tasks, you must specify: - **`dataset_dir`**: Path to the dataset on the host machine - **`dataset_mount_path`** (optional): Path where the dataset should be mounted inside the container (defaults to `/datasets`) ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /path/to/your/techqa/dataset # dataset_mount_path: /datasets # Optional, defaults to /datasets ``` The system will: 1. Mount the host path (`dataset_dir`) to the container path (`dataset_mount_path`) 2. 
Automatically set the `NEMO_EVALUATOR_DATASET_DIR` environment variable to point to the mounted path inside the container 3. Validate that the required environment variable is properly configured **Example with custom mount path:** ```yaml evaluation: tasks: - name: mteb.techqa dataset_dir: /mnt/data/techqa dataset_mount_path: /data/techqa # Custom container path ``` ## When to Use Use evaluation configuration when you want to: - **Change Default Sampling Parameters**: Adjust temperature, top_p, max_new_tokens for different tasks - **Change Default Task Values**: Override benchmark-specific default configurations - **Configure Task-Specific Parameters**: Set custom parameters for individual benchmarks (e.g., n_samples for code generation tasks) - **Debug and Test**: Launch with limited samples for validation - **Adjust Endpoint Capabilities**: Configure request timeouts, max retries, and parallel request limits :::{tip} For overriding long strings, use YAML multiline syntax with `>-`: ```yaml config.params.extra.custom_field: >- This is a long string that spans multiple lines and will be passed as a single value with spaces replacing the newlines. ``` The folded (`>-`) style joins the lines with single spaces, so long values stay readable in the configuration file while still being passed as a single string. ::: ## Reference - **Parameter Overrides**: {ref}`parameter-overrides` - Complete guide to available parameters and override syntax - **Adapter Configuration**: For advanced request/response modification (system prompts, payload modification, reasoning handling), see {ref}`nemo-evaluator-interceptors` - **Task Configuration**: {ref}`lib-core` - Complete nemo-evaluator documentation - **Available Tasks**: {ref}`nemo-evaluator-containers` - Browse all available evaluation tasks and benchmarks (executors-overview)= # Executors Executors run evaluations by orchestrating containerized benchmarks in different environments.
They handle resource management, IO paths, and job scheduling across various execution backends, from local development to large-scale cluster deployments. **Core concepts**: - Your model is separate from the evaluation container; communication is via an OpenAI‑compatible API - Each benchmark runs in a Docker container pulled from the NVIDIA NGC catalog - Execution backends can optionally manage model deployment ## Choosing an Executor Select the executor that best matches your environment and requirements: ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`desktop-download;1.5em;sd-mr-1` Local Executor :link: local :link-type: doc Run evaluations on your local machine using Docker for rapid iteration and development workflows. ::: :::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Slurm Executor :link: slurm :link-type: doc Execute large-scale evaluations on Slurm-managed high-performance computing clusters with optional model deployment. ::: :::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton Executor :link: lepton :link-type: doc Run evaluations on Lepton AI's hosted infrastructure with automatic model deployment and scaling. ::: :::: :::{toctree} :caption: Executors :hidden: Local Executor Slurm Executor Lepton Executor ::: (executor-local)= # Local Executor The Local executor runs evaluations on your machine using Docker. It provides a fast way to iterate on evaluations of existing endpoints; the only prerequisite is Docker. See common concepts and commands in {ref}`executors-overview`.
## Prerequisites - Docker - Python environment with the NeMo Evaluator Launcher CLI available (install the launcher by following {ref}`gs-install`) ## Quick Start For detailed step-by-step instructions on evaluating existing endpoints, refer to the {ref}`gs-quickstart-launcher` guide, which covers: - Choosing models and tasks - Setting up API keys (for NVIDIA APIs, see [Setting up API Keys](https://docs.omniverse.nvidia.com/guide-sdg/latest/setup.html#preview-and-set-up-an-api-key)) - Creating configuration files - Running evaluations Here's a quick overview for the Local executor: ### Run an evaluation against an existing endpoint ```bash # Run evaluation nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o target.api_endpoint.api_key_name=NGC_API_KEY ``` ## Environment Variables The Local executor supports passing environment variables from your local machine to evaluation containers: ### How It Works The executor passes environment variables to Docker containers using `docker run -e KEY=VALUE` flags. Each entry under `env_vars` names a variable in your local environment; the executor prefixes the name with `$` (for example, `OPENAI_API_KEY` becomes `$OPENAI_API_KEY`) so the value is resolved from your shell when the container starts. ### Configuration ```yaml evaluation: env_vars: API_KEY: YOUR_API_KEY_ENV_VAR_NAME CUSTOM_VAR: YOUR_CUSTOM_ENV_VAR_NAME tasks: - name: my_task env_vars: TASK_SPECIFIC_VAR: TASK_ENV_VAR_NAME ``` ## Secrets and API Keys The executor handles API keys the same way as environment variables: store them as environment variables on your machine and reference them in the `env_vars` configuration.
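For example, before launching a run you would export the secrets in your shell under the exact names the configuration references (the values below are placeholders):

```bash
# Placeholder values; env_vars entries in the config refer to these variable
# names, and the Local executor forwards their values into the container.
export NGC_API_KEY="nvapi-your-key-here"
export MY_HF_TOKEN="hf-your-token-here"
```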
## Mounting and Storage The Local executor uses Docker volume mounts for data persistence: ### Docker Volumes - **Results Mount**: Each task's artifacts directory mounts as `/results` in evaluation containers - **Custom Mounts**: Use the `extra_docker_args` field to define custom volume mounts (see [Advanced configuration](#advanced-configuration)) ## Advanced configuration You can customize your local executor by specifying `extra_docker_args`. This parameter allows you to pass any flag to the `docker run` command that is executed by the NeMo Evaluator Launcher. You can use it to mount additional volumes, set environment variables, or customize your network settings. For example, if you would like your job to use a specific Docker network, you can specify: ```yaml execution: extra_docker_args: "--network my-custom-network" ``` Replace `my-custom-network` with `host` to access the host network. To mount additional custom volumes: ```yaml execution: extra_docker_args: "--volume /my/local/path:/my/container/path" ``` ## Rerunning Evaluations The Local executor generates reusable scripts for rerunning evaluations: ### Script Generation The Local executor automatically generates scripts: - **`run_all.sequential.sh`**: Script to run all evaluation tasks sequentially (in output directory) - **`run.sh`**: Individual scripts for each task (in each task subdirectory) - **Reproducible**: Scripts contain all necessary commands and configurations ### Manual Rerun ```bash # Rerun all tasks cd /path/to/output_dir/2024-01-15-10-30-45-abc12345/ bash run_all.sequential.sh # Rerun specific task cd /path/to/output_dir/2024-01-15-10-30-45-abc12345/task1/ bash run.sh ``` ## Key Features - **Docker-based execution**: Isolated, reproducible runs - **Script generation**: Reusable scripts for rerunning evaluations - **Real-time logs**: Status tracking via log files ## Monitoring and Job Management For monitoring jobs, checking status, and managing evaluations, see
{ref}`executors-overview`. (executor-lepton)= # Lepton Executor The Lepton executor deploys endpoints and runs evaluations on Lepton AI. It's designed for fast, isolated, parallel evaluations using hosted or deployed endpoints. See common concepts and commands in the executors overview. ## Prerequisites - Lepton AI account and workspace access - Lepton AI credentials configured - Appropriate container images and permissions (for deployment flows) ## Install Lepton AI SDK Install the Lepton AI SDK: ```bash pip install leptonai ``` ## Authenticate with Your Lepton Workspace Log in to your Lepton AI workspace: ```bash lep login ``` Follow the prompts to authenticate with your Lepton AI credentials. ## Quick Start Run a Lepton evaluation using the provided examples: ```bash # Deploy NIM model and run evaluation nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_nim.yaml # Deploy vLLM model and run evaluation nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_vllm.yaml # Use an existing endpoint (no deployment) nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_basic.yaml ``` ## Parallel Deployment Strategy - Dedicated endpoints: Each task gets its own endpoint of the same model - Parallel deployment: All endpoints are created simultaneously (~3x faster) - Resource isolation: Independent tasks avoid mutual interference - Storage isolation: Per-invocation subdirectories are created in your configured mount paths - Simple cleanup: Single command tears down endpoints and storage ```{mermaid} graph TD A["nemo-evaluator-launcher run"] --> B["Load Tasks"] B --> D["Endpoints Deployment"] D --> E1["Deployment 1: Create Endpoint 1"] D --> E2["Deployment 2: Create Endpoint 2"] D --> E3["Deployment 3: Create Endpoint 3"] E1 --> F["Wait for All Ready"] E2 --> F E3 --> F F --> G["Mount Storage per Task"] G --> H["Parallel Tasks Creation as Jobs in Lepton"] H --> J1["Task 1: Job 1 
Evaluation"] H --> J2["Task 2: Job 2 Evaluation"] H --> J3["Task 3: Job 3 Evaluation"] J1 --> K["Execute in Parallel"] J2 --> K J3 --> K K --> L["Finish"] ``` ## Configuration Lepton executor configurations require: - **Execution backend**: `execution: lepton/default` - **Lepton platform settings**: Node groups, resource shapes, secrets, and storage mounts Refer to the complete working examples in the `examples/` directory: - `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment - `lepton_nim_llama_3_1_8b_instruct.yaml` - NIM container deployment - `lepton_none_llama_3_1_8b_instruct.yaml` - Use existing endpoint These example files include: - Lepton-specific resource configuration (`lepton_config.resource_shape`, node groups) - Environment variable references to secrets (HuggingFace tokens, API keys) - Storage mount configurations for model caching - Auto-scaling settings for deployments ## Monitoring and Troubleshooting Check the status of your evaluation runs: ```bash # Check status of a specific invocation nemo-evaluator-launcher status <invocation_id> # Kill running jobs and cleanup endpoints nemo-evaluator-launcher kill <invocation_id> ``` Common issues: - Ensure Lepton credentials are valid (`lep login`) - Verify container images are accessible from your Lepton workspace - Check that endpoints reach Ready state before jobs start - Confirm secrets are configured in Lepton UI (Settings → Secrets) (executor-slurm)= # Slurm Executor The Slurm executor runs evaluations on high‑performance computing (HPC) clusters managed by Slurm, an open‑source workload manager widely used in research and enterprise environments. It schedules and executes jobs across cluster nodes, enabling parallel, large‑scale evaluation runs while preserving reproducibility via containerized benchmarks. See common concepts and commands in {ref}`executors-overview`.
Slurm can optionally host your model for the scope of an evaluation by deploying a serving container on the cluster and pointing the benchmark to that temporary endpoint. In this mode, two containers are used: one for the evaluation harness and one for the model server. The evaluation configuration includes a deployment section when this is enabled. See the examples in the `examples/` directory for ready‑to‑use configurations. If you do not require deployment on Slurm, simply omit the deployment section from your configuration and set the model's endpoint URL directly (any OpenAI‑compatible endpoint that you host elsewhere). ## Prerequisites - Access to a Slurm cluster (with appropriate partitions/queues) - [Pyxis SPANK plugin](https://github.com/NVIDIA/pyxis) installed on the cluster ## Configuration Overview ### Connecting to Your Slurm Cluster To run evaluations on Slurm, specify how to connect to your cluster: ```yaml execution: hostname: your-cluster-headnode # Slurm headnode (login node) username: your_username # Cluster username (defaults to $USER env var) account: your_allocation # Slurm account or project name output_dir: /shared/scratch/your_username/eval_results # Absolute, shared path ``` :::{note} When specifying the parameters, make sure to provide: - `hostname`: Slurm headnode (login node) where you normally SSH to submit jobs. - `output_dir`: must be an **absolute path** on a shared filesystem (e.g., /shared/scratch/ or /home/) accessible to both the headnode and compute nodes. ::: ### Model Deployment Options When deploying models on Slurm, you have two options for specifying your model source: #### Option 1: HuggingFace Models (Recommended - Automatic Download) - Use valid Hugging Face model IDs for `hf_model_handle` (for example, `meta-llama/Llama-3.1-8B-Instruct`). - Browse model IDs: [Hugging Face Models](https://huggingface.co/models).
```yaml deployment: checkpoint_path: null # Set to null when using hf_model_handle hf_model_handle: meta-llama/Llama-3.1-8B-Instruct # HuggingFace model ID ``` **Benefits:** - Model is automatically downloaded during deployment - No need to pre-download or manage model files - Works with any HuggingFace model (public or private with valid access tokens) **Requirements:** - Set `HF_TOKEN` environment variable if accessing gated models - Internet access from compute nodes (or model cached locally) #### Option 2: Local Model Files (Manual Setup Required) If you work with a checkpoint stored locally on the cluster, use `checkpoint_path`: ```yaml deployment: checkpoint_path: /shared/models/llama-3.1-8b-instruct # model directory accessible to compute nodes # Do NOT set hf_model_handle when using checkpoint_path ``` **Note:** - The directory must exist, be accessible from compute nodes, and contain model files - Slurm does not automatically download models when using `checkpoint_path` ### Environment Variables The Slurm executor supports environment variables through `execution.env_vars`: ```yaml execution: env_vars: deployment: CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7" USER: ${oc.env:USER} # References host environment variable evaluation: CUSTOM_VAR: "YOUR_CUSTOM_ENV_VAR_VALUE" # Set the value directly evaluation: env_vars: CUSTOM_VAR: CUSTOM_ENV_VAR_NAME # Please note, this is an env var name! tasks: - name: my_task env_vars: TASK_SPECIFIC_VAR: TASK_ENV_VAR_NAME # Please note, this is an env var name!
``` **How to use environment variables:** - **Deployment Variables**: Use `execution.env_vars.deployment` for model serving containers - **Evaluation Variables**: Use `execution.env_vars.evaluation` for evaluation containers - **Direct Values**: Use quoted strings for direct values - **Hydra Environment Variables**: Use `${oc.env:VARIABLE_NAME}` to reference host environment variables :::{note} The `${oc.env:VARIABLE_NAME}` syntax reads variables defined in your local environment (the one you use to execute the `nemo-evaluator-launcher run` command), not the environment on the SLURM cluster. ::: ### Secrets and API Keys API keys are handled the same way as environment variables - store them as environment variables on your machine and reference them in the `execution.env_vars` configuration. **Security Considerations:** - **No Hardcoding**: Never put API keys directly in configuration files, use `${oc.env:ENV_VAR_NAME}` instead. - **SSH Security**: Ensure secure SSH configuration for key transmission to the cluster. - **File Permissions**: Ensure configuration files have appropriate permissions (not world-readable). - **Public Clusters**: Secrets in `execution.env_vars` are stored in plain text in the batch script and saved under `output_dir` on the login node. Use caution when handling sensitive data on public clusters. ### Mounting and Storage The Slurm executor provides sophisticated mounting capabilities: ```yaml execution: mounts: deployment: /path/to/checkpoints: /checkpoint /path/to/cache: /cache evaluation: /path/to/data: /data /path/to/results: /results mount_home: true # Mount user home directory ``` **Mount Types:** - **Deployment Mounts**: For model checkpoints, cache directories, and model data.
- **Evaluation Mounts**: For input data, additional artifacts, and evaluation-specific files - **Home Mount**: Optional mounting of user home directory (enabled by default) ## Complete Configuration Example Here's a complete Slurm executor configuration using HuggingFace models: ```yaml # examples/slurm_llama_3_1_8b_instruct.yaml defaults: - execution: slurm/default - deployment: vllm - _self_ execution: hostname: your-cluster-headnode account: your_account output_dir: /shared/scratch/your_username/eval_results partition: gpu walltime: "04:00:00" gpus_per_node: 8 env_vars: deployment: HF_TOKEN: ${oc.env:HF_TOKEN} # Needed to access meta-llama/Llama-3.1-8B-Instruct gated model deployment: hf_model_handle: meta-llama/Llama-3.1-8B-Instruct checkpoint_path: null served_model_name: meta-llama/Llama-3.1-8B-Instruct tensor_parallel_size: 1 data_parallel_size: 8 evaluation: tasks: - name: hellaswag - name: arc_challenge - name: winogrande ``` This configuration: - Uses the Slurm execution backend - Deploys a vLLM model server on the cluster - Requests GPU resources (8 GPUs per node, 4-hour time limit) - Runs three benchmark tasks in parallel - Saves benchmark artifacts to `output_dir` ## Resuming The Slurm executor includes advanced auto-resume capabilities: ### Automatic Resumption - **Timeout Handling**: Jobs automatically resume after timeout - **Preemption Recovery**: Automatic resumption after job preemption - **Node Failure Recovery**: Jobs resume after node failures - **Dependency Management**: Uses Slurm job dependencies for resumption ### How It Works 1. **Initial Submission**: Job is submitted with auto-resume handler 2. **Failure Detection**: Script detects timeout/preemption/failure 3. **Automatic Resubmission**: New job is submitted with dependency on previous job 4. 
**Progress Preservation**: Evaluation continues from where it left off ### Maximum Total Walltime To prevent jobs from resuming indefinitely, you can configure a maximum total wall-clock time across all resumes using the `max_walltime` parameter: ```yaml execution: walltime: "04:00:00" # Time limit per job submission max_walltime: "24:00:00" # Maximum total time across all resumes (optional) ``` **How it works:** - The actual runtime of each job is tracked using SLURM's `sacct` command - When a job resumes, the previous job's actual elapsed time is added to the accumulated total - Before starting each resumed job, the accumulated runtime is checked against `max_walltime` - If the accumulated runtime exceeds `max_walltime`, the job chain stops with an error - This prevents runaway jobs that might otherwise resume indefinitely **Configuration:** - `max_walltime`: Maximum total runtime in `HH:MM:SS` format (e.g., `"24:00:00"` for 24 hours) - Defaults to `"120:00:00"` (120 hours). Set to `null` for unlimited resuming :::{note} The `max_walltime` tracks **actual job execution time only**, excluding time spent waiting in the queue. This ensures accurate runtime accounting even when jobs are repeatedly preempted or must wait for resources. ::: ## Monitoring and Job Management For monitoring jobs, checking status, and managing evaluations, see the [Executors Overview](overview.md#job-management) section. (configuration-overview)= # Configuration The nemo-evaluator-launcher uses [Hydra](https://hydra.cc/docs/intro/) for configuration management, enabling flexible composition and command-line overrides. ## How it Works 1. **Choose your deployment**: Start with `deployment: none` to use existing endpoints 2. **Set your execution platform**: Use `execution: local` for development 3. **Configure your target**: Point to your API endpoint 4. **Select benchmarks**: Add evaluation tasks 5. 
**Test first**: Always use `--dry-run` to verify ```bash # Verify configuration nemo-evaluator-launcher run --config your_config.yaml --dry-run # Run evaluation nemo-evaluator-launcher run --config your_config.yaml ``` ### Basic Structure Every configuration has four main sections: ```yaml defaults: - execution: local # Where to run: local, lepton, slurm - deployment: none # How to deploy: none, vllm, sglang, nim, trtllm, generic - _self_ execution: output_dir: results # Required: where to save results target: # Required for deployment: none api_endpoint: model_id: meta/llama-3.2-3b-instruct url: https://integrate.api.nvidia.com/v1/chat/completions api_key_name: NGC_API_KEY evaluation: # Required: what benchmarks to run tasks: - name: gpqa_diamond - name: ifeval ``` ## Deployment Options Choose how to serve your model for evaluation: ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`globe;1.5em;sd-mr-1` None (External) :link: deployment/none :link-type: doc Use existing API endpoints like NVIDIA API Catalog, OpenAI, or custom deployments. No model deployment needed. ::: :::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` vLLM :link: deployment/vllm :link-type: doc High-performance LLM serving with advanced parallelism strategies. Best for production workloads and large models. ::: :::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` SGLang :link: deployment/sglang :link-type: doc Fast serving framework optimized for structured generation and high-throughput inference with efficient memory usage. ::: :::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` NIM :link: deployment/nim :link-type: doc NVIDIA-optimized inference microservices with automatic scaling, optimization, and enterprise-grade features. ::: :::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` TRT-LLM :link: deployment/trtllm :link-type: doc High-performance inference with NVIDIA TensorRT-LLM.
::: :::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Generic :link: deployment/generic :link-type: doc Deploy models using a fully custom setup. ::: :::: ## Execution Platforms Choose where to run your evaluations: ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`desktop-download;1.5em;sd-mr-1` Local :link: executors/local :link-type: doc Docker-based evaluation on your local machine. Perfect for development, testing, and small-scale evaluations. ::: :::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton :link: executors/lepton :link-type: doc Cloud execution with on-demand GPU provisioning. Ideal for production evaluations and scalable workloads. ::: :::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` SLURM :link: executors/slurm :link-type: doc HPC cluster execution with resource management. Best for large-scale evaluations and batch processing. ::: :::: ## Evaluation Configuration ::::{grid} 1 1 1 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`checklist;1.5em;sd-mr-1` Tasks & Benchmarks :link: evaluation/index :link-type: doc Configure evaluation tasks, parameter overrides, and environment variables for your benchmarks. ::: :::: ## Command Line Overrides Override any configuration value using the `-o` flag: ```bash # Basic override nemo-evaluator-launcher run --config your_config.yaml \ -o execution.output_dir=my_results # Multiple overrides nemo-evaluator-launcher run --config your_config.yaml \ -o execution.output_dir=my_results \ -o target.api_endpoint.url="https://new-endpoint.com/v1/chat/completions" ``` ```{toctree} :caption: Configuration :hidden: Deployment Executors Evaluation ``` (exporter-gsheets)= # Google Sheets Exporter (`gsheets`) Exports accuracy metrics to a Google Sheet. Dynamically creates/extends header columns based on observed metrics and appends one row per job. 
- **Purpose**: Centralized spreadsheet for tracking results across runs - **Requirements**: `gspread` installed and a Google service account with access ## Usage Export evaluation results to a Google Sheets spreadsheet for easy sharing and analysis. ::::{tab-set} :::{tab-item} CLI Export results from a specific evaluation run to Google Sheets: ```bash # Export results using default spreadsheet name nemo-evaluator-launcher export 8abcd123 --dest gsheets # Export with custom spreadsheet name and ID nemo-evaluator-launcher export 8abcd123 --dest gsheets \ -o export.gsheets.spreadsheet_name="My Results" \ -o export.gsheets.spreadsheet_id=1ABC...XYZ ``` ::: :::{tab-item} Python Export results programmatically with custom configuration: ```python from nemo_evaluator_launcher.api.functional import export_results # Basic export to Google Sheets export_results( invocation_ids=["8abcd123"], dest="gsheets", config={ "spreadsheet_name": "NeMo Evaluator Launcher Results" } ) # Export with service account and filtered metrics export_results( invocation_ids=["8abcd123", "9def4567"], dest="gsheets", config={ "spreadsheet_name": "Model Comparison Results", "service_account_file": "/path/to/service-account.json", "log_metrics": ["accuracy", "f1_score"] } ) ``` ::: :::{tab-item} YAML Config Configure Google Sheets export in your evaluation YAML file for automatic export on completion: ```yaml execution: auto_export: destinations: ["gsheets"] export: gsheets: spreadsheet_name: "LLM Evaluation Results" spreadsheet_id: "1ABC...XYZ" # optional: use existing sheet service_account_file: "/path/to/service-account.json" log_metrics: ["accuracy", "pass@1"] ``` :::: ## Key Configuration ```{list-table} :header-rows: 1 :widths: 25 25 25 25 * - Parameter - Type - Description - Default/Notes * - `service_account_file` - str, optional - Path to service account JSON - Uses default credentials if omitted * - `spreadsheet_name` - str, optional - Target spreadsheet name. 
Used to open existing sheets or name new ones.
  - Default: "NeMo Evaluator Launcher Results"
* - `spreadsheet_id`
  - str, optional
  - Target spreadsheet ID. Find it in the spreadsheet URL: `https://docs.google.com/spreadsheets/d/<spreadsheet_id>/edit`
  - Required if your service account can't create sheets due to quota limits.
* - `log_metrics`
  - list[str], optional
  - Filter metrics to log
  - All metrics if omitted
```

(exporters-overview)=
# Exporters

Exporters move evaluation results and artifacts from completed runs to external destinations for analysis, sharing, and reporting. They provide flexible options for integrating evaluation results into your existing workflows and tools.

## How to Set an Exporter

::::{tab-set}
:::{tab-item} CLI

```bash
nemo-evaluator-launcher export <invocation_id> [<invocation_id> ...] \
  --dest <destination> \
  [options]
```
:::
:::{tab-item} Python

```python
from nemo_evaluator_launcher.api.functional import export_results

export_results(
    invocation_ids=["8abcd123"],
    dest="local",
    config={
        "format": "json",
        "output_dir": "./out"
    }
)
```
:::
::::

## Choosing an Exporter

Select exporters based on your analysis and reporting needs:

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`file-directory;1.5em;sd-mr-1` Local Files
:link: local
:link-type: doc
Export results and artifacts to local or network file systems for custom analysis and archival.
:::

:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Weights & Biases
:link: wandb
:link-type: doc
Track metrics, artifacts, and run metadata in W&B for comprehensive experiment management.
:::

:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` MLflow
:link: mlflow
:link-type: doc
Export metrics and artifacts to MLflow Tracking Server for centralized ML lifecycle management.
:::

:::{grid-item-card} {octicon}`table;1.5em;sd-mr-1` Google Sheets
:link: gsheets
:link-type: doc
Export metrics to Google Sheets for easy sharing, reporting, and collaborative analysis.
::: :::: You can configure multiple exporters simultaneously to support different stakeholder needs and workflow integration points. :::{toctree} :caption: Exporters :hidden: Local Files Weights & Biases MLflow Google Sheets ::: (exporter-local)= # Local Exporter (`local`) Exports artifacts and optional summaries to the local filesystem. When used with remote executors, stages artifacts from remote locations. Can produce consolidated JSON or CSV summaries of metrics. ## Usage Export evaluation results and artifacts to your local filesystem with optional summary reports. ::::{tab-set} :::{tab-item} CLI Export artifacts and generate summary reports locally: ```bash # Basic export to current directory nemo-evaluator-launcher export 8abcd123 --dest local # Export with JSON summary to custom directory nemo-evaluator-launcher export 8abcd123 --dest local --format json --output-dir ./evaluation-results/ # Export multiple runs with CSV summary and logs included nemo-evaluator-launcher export 8abcd123 9def4567 --dest local --format csv --copy-logs --output-dir ./results # Export only specific metrics to a custom filename nemo-evaluator-launcher export 8abcd123 --dest local --format json --log-metrics accuracy --log-metrics bleu --output-filename model_metrics.json ``` ::: :::{tab-item} Python Export results programmatically with flexible configuration: ```python from nemo_evaluator_launcher.api.functional import export_results # Basic local export with JSON summary export_results( invocation_ids=["8abcd123"], dest="local", config={ "format": "json", "output_dir": "./results" } ) # Export multiple runs with comprehensive configuration export_results( invocation_ids=["8abcd123", "9def4567"], dest="local", config={ "output_dir": "./evaluation-outputs", "format": "csv", "copy_logs": True, "only_required": False, # Include all artifacts "log_metrics": ["accuracy", "f1_score", "perplexity"], "output_filename": "comprehensive_results.csv" } ) # Export artifacts only (no summary) 
export_results(
    invocation_ids=["8abcd123"],
    dest="local",
    config={
        "output_dir": "./artifacts-only",
        "format": None,  # No summary file
        "copy_logs": True
    }
)
```
:::
:::{tab-item} YAML Config

Configure local export in your evaluation YAML file for automatic export on completion:

```yaml
execution:
  auto_export:
    destinations: ["local"]
export:
  local:
    format: "json"
    output_dir: "./results"
```
:::
::::

## Key Configuration

```{list-table}
:header-rows: 1
:widths: 25 15 45 15

* - Parameter
  - Type
  - Description
  - Default
* - `output_dir`
  - str
  - Destination directory for exported results
  - `.` (CLI), `./nemo-evaluator-launcher-results` (Python API)
* - `copy_logs`
  - bool
  - Include logs alongside artifacts
  - `false`
* - `only_required`
  - bool
  - Copy only required and optional artifacts; excludes other files
  - `true`
* - `format`
  - str | null
  - Summary format: `json`, `csv`, or `null` for artifacts only
  - `null`
* - `log_metrics`
  - list[str]
  - Filter metrics by name (exact or substring match)
  - All metrics
* - `output_filename`
  - str
  - Override default summary filename (`processed_results.json` or `processed_results.csv`)
  - `processed_results.<format>`
```

(exporter-mlflow)=
# MLflow Exporter (`mlflow`)

Exports accuracy metrics and artifacts to an MLflow Tracking Server.

- **Purpose**: Centralize metrics, parameters, and artifacts in MLflow for experiment tracking
- **Requirements**: `mlflow` package installed and a reachable MLflow tracking server

:::{dropdown} **Prerequisites: MLflow Server Setup**
:open:

Before exporting results, ensure that an **MLflow Tracking Server** is running and reachable. If no server is active, export attempts will fail with connection errors.
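A quick reachability probe can catch this before an export is attempted. The snippet below is a minimal sketch using only the Python standard library; it assumes your server exposes MLflow's `/health` endpoint and that the tracking URI matches your deployment:

```python
import urllib.request
import urllib.error

def mlflow_is_reachable(tracking_uri: str, timeout: float = 5.0) -> bool:
    """Return True if the MLflow tracking server answers its /health endpoint."""
    url = tracking_uri.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    # Adjust to your tracking URI before exporting
    print(mlflow_is_reachable("http://127.0.0.1:5000"))
```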
### Quick Start: Local Tracking Server

For local development or testing:

```bash
# Install the launcher with MLflow support
pip install nemo-evaluator-launcher[mlflow]

# Start a local tracking server (runs on http://127.0.0.1:5000)
mlflow server --host 127.0.0.1 --port 5000
```

This starts MLflow with a file-based backend store and artifact store under the current directory.

### Production Deployments

For production or multi-user setups:

* **Remote MLflow Server**: Deploy MLflow on a dedicated VM or container.
* **Docker**:

  ```bash
  docker run -p 5000:5000 ghcr.io/mlflow/mlflow:latest \
    mlflow server --host 0.0.0.0
  ```

* **Cloud-Managed Services**: Use hosted options such as **Databricks MLflow** or **AWS SageMaker MLflow**.

For detailed deployment and configuration options, see the [official MLflow Tracking Server documentation](https://mlflow.org/docs/latest/tracking/server.html).
:::

## Usage

Export evaluation results to an MLflow Tracking Server for centralized experiment management.

::::{tab-set}
:::{tab-item} Auto-Export (Recommended)

Configure MLflow export to run automatically after evaluation completes.
Add MLflow configuration to your run config YAML file:

```yaml
execution:
  auto_export:
    destinations: ["mlflow"]
export:
  mlflow:
    tracking_uri: "http://mlflow.example.com:5000"
target:
  api_endpoint:
    model_id: meta/llama-3.2-3b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
evaluation:
  tasks:
    - name: simple_evals.mmlu
```

Alternatively, you can use the `MLFLOW_TRACKING_URI` environment variable:

```yaml
execution:
  auto_export:
    destinations: ["mlflow"]
  # Export-related env vars (placeholders expanded at runtime)
  env_vars:
    export:
      # you can skip export.mlflow.tracking_uri if you set this var
      MLFLOW_TRACKING_URI: MLFLOW_TRACKING_URI
```

Set optional fields to customize your export:

```yaml
execution:
  auto_export:
    destinations: ["mlflow"]
export:
  mlflow:
    tracking_uri: "http://mlflow.example.com:5000"
    experiment_name: "llm-evaluation"
    description: "Llama 3.1 8B evaluation"
    log_metrics: ["mmlu_score_macro", "mmlu_score_micro"]
    tags:
      model_family: "llama"
      version: "3.1"
    extra_metadata:
      hardware: "A100"
      batch_size: 32
    log_artifacts: true
```

Run the evaluation with auto-export enabled:

```bash
nemo-evaluator-launcher run --config ./my_config.yaml
```
:::
:::{tab-item} Manual Export (Python API)

Export results programmatically after evaluation completes:

```python
from nemo_evaluator_launcher.api.functional import export_results

# Basic MLflow export
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "model-evaluation"
    }
)

# Export with metadata and tags
export_results(
    invocation_ids=["8abcd123"],
    dest="mlflow",
    config={
        "tracking_uri": "http://mlflow:5000",
        "experiment_name": "llm-benchmarks",
        "run_name": "llama-3.1-8b-mmlu",
        "description": "Evaluation of Llama 3.1 8B on MMLU",
        "tags": {
            "model_family": "llama",
            "model_version": "3.1",
            "benchmark": "mmlu"
        },
        "log_metrics": ["accuracy"],
        "extra_metadata": {
            "hardware": "A100-80GB",
            "batch_size": 32
        }
    }
)

# Export with artifacts disabled
export_results( invocation_ids=["8abcd123"], dest="mlflow", config={ "tracking_uri": "http://mlflow:5000", "experiment_name": "model-comparison", "log_artifacts": False } ) # Skip if run already exists export_results( invocation_ids=["8abcd123"], dest="mlflow", config={ "tracking_uri": "http://mlflow:5000", "experiment_name": "nightly-evals", "skip_existing": True } ) ``` ::: :::{tab-item} Manual Export (CLI) Export results after evaluation completes: ```shell # Default export nemo-evaluator-launcher export 8abcd123 --dest mlflow # With overrides nemo-evaluator-launcher export 8abcd123 --dest mlflow \ -o export.mlflow.tracking_uri=http://mlflow:5000 \ -o export.mlflow.experiment_name=my-exp # With metric filtering nemo-evaluator-launcher export 8abcd123 --dest mlflow --log-metrics accuracy pass@1 ``` ::: :::: ## Configuration Parameters ```{list-table} :header-rows: 1 :widths: 25 15 45 15 * - Parameter - Type - Description - Default * - `tracking_uri` - str - MLflow tracking server URI - Required if env var `MLFLOW_TRACKING_URI` is not set * - `experiment_name` - str - MLflow experiment name - `"nemo-evaluator-launcher"` * - `run_name` - str - Run display name - Auto-generated * - `description` - str - Run description - None * - `tags` - dict[str, str] - Custom tags for the run - None * - `extra_metadata` - dict - Additional parameters logged to MLflow - None * - `skip_existing` - bool - Skip export if run exists for invocation. Useful to avoid creating duplicate runs when re-exporting. - `false` * - `log_metrics` - list[str] - Filter metrics by substring match - All metrics * - `log_artifacts` - bool - Upload evaluation artifacts - `true` * - `log_logs` - bool - Upload execution logs - `false` * - `only_required` - bool - Copy only required artifacts - `true` ``` (exporter-wandb)= # Weights & Biases Exporter (`wandb`) Exports accuracy metrics and artifacts to W&B. 
Supports either per-task runs or a single multi-task run per invocation, with artifact logging and run metadata. - **Purpose**: Track runs, metrics, and artifacts in W&B - **Requirements**: `wandb` installed and credentials configured ## Usage Export evaluation results to Weights & Biases for experiment tracking, visualization, and collaboration. ::::{tab-set} :::{tab-item} CLI Basic export to W&B using credentials and project settings from your evaluation configuration: ```bash # Export to W&B (uses config from evaluation run) nemo-evaluator-launcher export 8abcd123 --dest wandb # Filter metrics to export specific measurements nemo-evaluator-launcher export 8abcd123 --dest wandb --log-metrics accuracy f1_score ``` ```{note} Specify W&B configuration (entity, project, tags, etc.) in your evaluation YAML configuration file under `execution.auto_export.configs.wandb`. The CLI export command reads these settings from the stored job configuration. ``` ::: :::{tab-item} Python Export results programmatically with W&B configuration: ```python from nemo_evaluator_launcher.api.functional import export_results # Basic W&B export export_results( invocation_ids=["8abcd123"], dest="wandb", config={ "entity": "myorg", "project": "model-evaluations" } ) # Export with metadata and organization export_results( invocation_ids=["8abcd123"], dest="wandb", config={ "entity": "myorg", "project": "llm-benchmarks", "name": "llama-3.1-8b-eval", "group": "llama-family-comparison", "description": "Evaluation of Llama 3.1 8B on benchmarks", "tags": ["llama-3.1", "8b"], "log_mode": "per_task", "log_metrics": ["accuracy"], "log_artifacts": True, "extra_metadata": { "hardware": "A100-80GB" } } ) # Multi-task mode: single run for all tasks export_results( invocation_ids=["8abcd123"], dest="wandb", config={ "entity": "myorg", "project": "model-comparison", "log_mode": "multi_task", "log_artifacts": False } ) ``` ::: :::{tab-item} YAML Config Configure W&B export in your evaluation YAML file for 
automatic export on completion: ```yaml execution: auto_export: destinations: ["wandb"] # Export-related env vars (placeholders expanded at runtime) env_vars: export: WANDB_API_KEY: WANDB_API_KEY export: wandb: entity: "myorg" project: "llm-benchmarks" name: "llama-3.1-8b-instruct-v1" group: "baseline-evals" tags: ["llama-3.1", "baseline"] description: "Baseline evaluation" log_mode: "multi_task" log_metrics: ["accuracy"] log_artifacts: true log_logs: true only_required: false extra_metadata: hardware: "H100" checkpoint: "path/to/checkpoint" ``` ::: :::: ## Configuration Parameters ```{list-table} :header-rows: 1 :widths: 20 15 50 15 * - Parameter - Type - Description - Default * - `entity` - str - W&B entity (organization or username) - Required * - `project` - str - W&B project name - Required * - `log_mode` - str - Logging mode: `per_task` creates separate runs for each evaluation task; `multi_task` creates a single run for all tasks - `per_task` * - `name` - str - Run display name. If not specified, auto-generated as `eval-{invocation_id}-{benchmark}` (per_task) or `eval-{invocation_id}` (multi_task) - Auto-generated * - `group` - str - Run group for organizing related runs - Invocation ID * - `tags` - list[str] - Tags for categorizing the run - None * - `description` - str - Run description (stored as W&B notes) - None * - `log_metrics` - list[str] - Metric name patterns to filter (e.g., `["accuracy", "f1"]`). 
Logs only metrics containing these substrings - All metrics * - `log_artifacts` - bool - Whether to upload evaluation artifacts (results files, configs) to W&B - `true` * - `log_logs` - bool - Upload execution logs - `false` * - `only_required` - bool - Copy only required artifacts - `true` * - `extra_metadata` - dict - Additional metadata stored in run config (e.g., hardware, hyperparameters) - `{}` * - `job_type` - str - W&B job type classification - `evaluation` ``` (lib-launcher)= # NeMo Evaluator Launcher The *Orchestration Layer* empowers you to run AI model evaluations at scale. Use the unified CLI and programmatic interfaces to discover benchmarks, configure runs, submit jobs, monitor progress, and export results. :::{tip} **New to evaluation?** Start with {ref}`gs-quickstart-launcher` for a step-by-step walkthrough. ::: ## Get Started ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Quickstart :link: ../../get-started/quickstart/launcher :link-type: doc Step-by-step guide to install, configure, and run your first evaluation in minutes. ::: :::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration :link: configuration/index :link-type: doc Complete configuration schema, examples, and advanced patterns for all use cases. ::: :::: ## Execution ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Executors :link: configuration/executors/index :link-type: doc Execute evaluations on your local machine, HPC cluster (Slurm), or cloud platform (Lepton AI). ::: :::{grid-item-card} {octicon}`device-desktop;1.5em;sd-mr-1` Local Executor :link: configuration/executors/local :link-type: doc Docker-based evaluation on your workstation. Perfect for development and testing. ::: :::{grid-item-card} {octicon}`organization;1.5em;sd-mr-1` Slurm Executor :link: configuration/executors/slurm :link-type: doc HPC cluster execution with automatic resource management and job scheduling. 
::: :::{grid-item-card} {octicon}`cloud;1.5em;sd-mr-1` Lepton Executor :link: configuration/executors/lepton :link-type: doc Cloud execution with on-demand GPU provisioning and automatic scaling. ::: :::: ## Export ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`upload;1.5em;sd-mr-1` Exporters :link: exporters/index :link-type: doc Export results to MLflow, Weights & Biases, Google Sheets, or local files with one command. ::: :::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` MLflow Export :link: exporters/mlflow :link-type: doc Export evaluation results and metrics to MLflow for experiment tracking. ::: :::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` W&B Export :link: exporters/wandb :link-type: doc Integrate with Weights & Biases for advanced visualization and collaboration. ::: :::{grid-item-card} {octicon}`table;1.5em;sd-mr-1` Sheets Export :link: exporters/gsheets :link-type: doc Export to Google Sheets for easy sharing and analysis with stakeholders. ::: :::: ## Typical Workflow 1. **Choose execution backend** (local, Slurm, Lepton AI) 2. **Select example configuration** from the examples directory 3. **Point to your model endpoint** (OpenAI-compatible API) 4. **Launch evaluation** via CLI or Python API 5. 
**Monitor progress** and export results to your preferred platform ## When to Use the Launcher Use the launcher whenever you want: - **Unified interface** for running evaluations across different backends - **Multi-benchmark coordination** with concurrent execution - **Turnkey reproducibility** with saved configurations - **Easy result export** to MLOps platforms and dashboards - **Production-ready orchestration** with monitoring and lifecycle management :::{toctree} :caption: NeMo Evaluator Launcher :hidden: About NeMo Evaluator Launcher Configuration Exporters ::: # API Reference ## Available Data Classes The API provides several dataclasses for configuration: ```python from nemo_evaluator.api.api_dataclasses import ( EvaluationConfig, # Main evaluation configuration EvaluationTarget, # Target model configuration ConfigParams, # Evaluation parameters ApiEndpoint, # API endpoint configuration EvaluationResult, # Evaluation results TaskResult, # Individual task results MetricResult, # Metric scores Score, # Score representation ScoreStats, # Score statistics GroupResult, # Grouped results EndpointType, # Endpoint type enum Evaluation # Complete evaluation object ) ``` ## `run_eval` The main entry point for running evaluations. This is a CLI entry point that parses command line arguments. ```python from nemo_evaluator.api.run import run_eval def run_eval() -> None: """ CLI entry point for running evaluations. This function parses command line arguments and executes evaluations. It does not take parameters directly - all configuration is passed through CLI arguments. 
    CLI Arguments:
        --eval_type: Type of evaluation to run (such as "mmlu_pro", "gpqa_diamond")
        --model_id: Model identifier (such as "meta/llama-3.2-3b-instruct")
        --model_url: API endpoint URL (such as "https://integrate.api.nvidia.com/v1/chat/completions" for the chat endpoint type)
        --model_type: Endpoint type ("chat", "completions", "vlm", "embedding")
        --api_key_name: Name of the environment variable that stores the API key (optional)
        --output_dir: Output directory for results
        --run_config: Path to YAML Run Configuration file (optional)
        --overrides: Comma-separated dot-style parameter overrides (optional)
        --dry_run: Show rendered config without running (optional)
        --debug: Enable debug logging (optional, deprecated; use the NV_LOG_LEVEL=DEBUG env var)

    Usage:
        run_eval()  # Parses sys.argv automatically
    """
```

:::{note}
The `run_eval()` function is designed as a CLI entry point. For programmatic usage, use the underlying configuration objects and the `evaluate()` function directly.
:::

## `evaluate`

The core evaluation function for programmatic usage.

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationResult,
    EvaluationTarget,
)

def evaluate(
    eval_cfg: EvaluationConfig,
    target_cfg: EvaluationTarget
) -> EvaluationResult:
    """
    Run an evaluation using configuration objects.
    Args:
        eval_cfg: Evaluation configuration object
        target_cfg: Target configuration object

    Returns:
        EvaluationResult: Evaluation results and metadata
    """
```

**Example Programmatic Usage:**

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ConfigParams,
    ApiEndpoint
)

# Create evaluation configuration
eval_config = EvaluationConfig(
    type="simple_evals.mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        temperature=0.1
    )
)

# Create target configuration
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.2-3b-instruct",
        type="chat",
        api_key="MY_API_KEY"  # Name of the environment variable that stores the API key
    )
)

# Run evaluation
result = evaluate(eval_config, target_config)
```

## Data Structures

### `EvaluationConfig`

Configuration for evaluation runs, defined in `api_dataclasses.py`.

```python
from nemo_evaluator.api.api_dataclasses import EvaluationConfig

class EvaluationConfig:
    """Configuration for evaluation runs."""
    type: str              # Type of evaluation - benchmark to be run
    output_dir: str        # Output directory
    params: ConfigParams   # Evaluation parameters
```

### `EvaluationTarget`

Target configuration for API endpoints, defined in `api_dataclasses.py`.
```python from nemo_evaluator.api.api_dataclasses import EvaluationTarget, EndpointType class EvaluationTarget: """Target configuration for API endpoints.""" api_endpoint: ApiEndpoint # API endpoint to be used for evaluation class ApiEndpoint: url: str # API endpoint URL model_id: str # Model name or identifier type: str # Endpoint type (chat, completions, vlm, or embedding) api_key: str # Name of the env variable that stores API key adapter_config: AdapterConfig # Adapter configuration ``` In the ApiEndpoint dataclass, `type` should be one of: `EndpointType.CHAT`, `EndpointType.COMPLETIONS`, `EndpointType.VLM`, `EndpointType.EMBEDDING`: - `CHAT` endpoint accepts structured input as a sequence of messages (such as system, user, assistant roles). It returns a model-generated message, enabling controlled multi-turn interactions. - `COMPLETIONS` endpoint takes a single prompt string and returns a text continuation, typically used for one-shot or single-turn tasks without conversational structure. - `VLM` endpoint hosts a model that has vision capabilities. - `EMBEDDING` endpoint hosts an embedding model. ## Adapter System ### `AdapterConfig` Configuration for the adapter system, defined in `adapter_config.py`. ```python from nemo_evaluator.adapters.adapter_config import AdapterConfig class AdapterConfig: """Configuration for the adapter system.""" discovery: DiscoveryConfig # Module discovery configuration interceptors: list[InterceptorConfig] # List of interceptors post_eval_hooks: list[PostEvalHookConfig] # Post-evaluation hooks endpoint_type: str # Default endpoint type caching_dir: str | None # Legacy caching directory ``` ### `InterceptorConfig` Configuration for individual interceptors. 
```python from nemo_evaluator.adapters.adapter_config import InterceptorConfig class InterceptorConfig: """Configuration for a single interceptor.""" name: str # Interceptor name enabled: bool # Whether enabled config: dict[str, Any] # Interceptor-specific configuration ``` ### `DiscoveryConfig` Configuration for discovering third-party modules and directories. ```python from nemo_evaluator.adapters.adapter_config import DiscoveryConfig class DiscoveryConfig: """Configuration for discovering 3rd party modules and directories.""" modules: list[str] # List of module paths to discover dirs: list[str] # List of directory paths to discover ``` ## Available Interceptors ### 1. Request Logging Interceptor ```python from nemo_evaluator.adapters.interceptors.logging_interceptor import LoggingInterceptor # Configuration interceptor_config = { "name": "request_logging", "enabled": True, "config": { "output_dir": "/tmp/logs", "max_requests": 1000, "log_failed_requests": True } } ``` **Features:** - Logs all API requests and responses - Configurable output directory - Request/response count limits - Failed request logging ### 2. Caching Interceptor ```python from nemo_evaluator.adapters.interceptors.caching_interceptor import CachingInterceptor # Configuration interceptor_config = { "name": "caching", "enabled": True, "config": { "cache_dir": "/tmp/cache", "reuse_cached_responses": True, "save_requests": True, "save_responses": True, "max_saved_requests": 1000, "max_saved_responses": 1000 } } ``` **Features:** - Response caching for performance - Persistent storage - responses are saved to disk, allowing resumption after process termination - Configurable cache directory - Request/response persistence - Cache size limits ### 3. 
Reasoning Interceptor

```python
from nemo_evaluator.adapters.interceptors.reasoning_interceptor import ReasoningInterceptor

# Configuration
interceptor_config = {
    "name": "reasoning",
    "enabled": True,
    "config": {
        "start_reasoning_token": "<think>",
        "end_reasoning_token": "</think>",
        "add_reasoning": True,
        "enable_reasoning_tracking": True
    }
}
```

**Features:**
- Reasoning chain support
- Custom reasoning tokens
- Reasoning tracking and analysis
- Chain-of-thought prompting

### 4. System Message Interceptor

```python
from nemo_evaluator.adapters.interceptors.system_message_interceptor import SystemMessageInterceptor

# Configuration
interceptor_config = {
    "name": "system_message",
    "enabled": True,
    "config": {
        "system_message": "You are a helpful AI assistant.",
        "strategy": "prepend"  # Optional: "replace", "append", or "prepend" (default)
    }
}
```

**Features:**
- Custom system prompt injection
- Multiple strategies for handling existing system messages (replace, prepend, append)
- Consistent system behavior
- Flexible system message composition

**Use Cases:**
- Modify system prompts for different evaluation scenarios
- Test different prompt variations without code changes
- Replace existing system messages for consistent behavior
- Prepend or append instructions to existing system messages
- A/B testing of different prompt strategies

### 5. Endpoint Interceptor

```python
from nemo_evaluator.adapters.interceptors.endpoint_interceptor import EndpointInterceptor

# Configuration
interceptor_config = {
    "name": "endpoint",
    "enabled": True,
    "config": {
        "endpoint_url": "https://api.example.com/v1/chat/completions",
        "timeout": 30
    }
}
```

**Features:**
- Endpoint URL management
- Request timeout configuration
- Endpoint validation

### 6.
Payload Modifier Interceptor ```python from nemo_evaluator.adapters.interceptors.payload_modifier_interceptor import PayloadModifierInterceptor # Configuration interceptor_config = { "name": "payload_modifier", "enabled": True, "config": { "params_to_add": { "extra_body": { "chat_template_kwargs": { "enable_thinking": False } } }, "params_to_remove": ["field_in_msgs_to_remove"], "params_to_rename": {"max_tokens": "max_completion_tokens"} } } ``` **Explanation:** This interceptor is particularly useful when custom behavior is needed. In this example, the `enable_thinking` parameter is a custom key that controls the reasoning mode of the model. When set to `False`, it disables the model's internal reasoning/thinking process, which can be useful for scenarios where you want more direct responses without the model's step-by-step reasoning output. The `field_in_msgs_to_remove` field would be removed recursively from all messages in the payload. **Features:** - Custom parameter injection - Remove fields recursively at all levels of the payload - Rename top-level payload keys ### 7. Client Error Interceptor ```python from nemo_evaluator.adapters.interceptors.raise_client_error_interceptor import RaiseClientErrorInterceptor # Configuration interceptor_config = { "name": "raise_client_error", "enabled": True, "config": { "raise_on_error": True, "error_threshold": 400 } } ``` **Features:** - Error handling and propagation - Configurable error thresholds - Client error management # Code Generation Containers Containers specialized for evaluating code generation models and programming language capabilities. --- ## BigCode Evaluation Harness Container **NGC Catalog**: [bigcode-evaluation-harness](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) Container specialized for evaluating code generation models and programming language models. 
**Use Cases:** - Code generation quality assessment - Programming problem solving - Code completion evaluation - Software engineering task assessment **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `512` | | `temperature` | `1e-07` | | `top_p` | `0.9999999` | | `parallelism` | `10` | | `max_retries` | `5` | | `request_timeout` | `30` | | `do_sample` | `True` | | `n_samples` | `1` | --- ## Compute Eval Container **NGC Catalog**: [compute-eval](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/compute-eval) Container specialized for evaluating CUDA code generation. **Use Cases:** - CUDA code generation - CCCL programming problems **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/compute-eval:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `2048` | | `temperature` | `0` | | `top_p` | `0.00001` | | `parallelism` | `1` | | `max_retries` | `2` | | `request_timeout` | `3600` | --- ## LiveCodeBench Container **NGC Catalog**: [livecodebench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) LiveCodeBench provides holistic and contamination-free evaluation of coding capabilities of LLMs. It continuously collects new problems from contests across three competition platforms -- LeetCode, AtCoder, and CodeForces. 
**Use Cases:** - Holistic coding capability evaluation - Contamination-free assessment - Contest-style problem solving - Code generation and execution - Test output prediction - Self-repair capabilities **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/livecodebench:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `4096` | | `temperature` | `0.0` | | `top_p` | `1e-05` | | `parallelism` | `10` | | `max_retries` | `5` | | `request_timeout` | `60` | | `n_samples` | `10` | | `num_process_evaluate` | `5` | | `cache_batch_size` | `10` | | `support_system_role` | `False` | | `cot_code_execution` | `False` | **Supported Versions:** v1-v6, 0724_0125, 0824_0225 --- ## SciCode Container **NGC Catalog**: [scicode](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) SciCode is a challenging benchmark designed to evaluate the capabilities of language models in generating code for solving realistic scientific research problems with diverse coverage across 16 subdomains from five domains. **Use Cases:** - Scientific research code generation - Multi-domain scientific programming - Research workflow automation - Scientific computation evaluation - Domain-specific coding tasks **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/scicode:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `temperature` | `0` | | `max_new_tokens` | `2048` | | `top_p` | `1e-05` | | `request_timeout` | `60` | | `max_retries` | `2` | | `with_background` | `False` | | `include_dev` | `False` | | `n_samples` | `1` | | `eval_threads` | `None` | **Supported Domains:** Physics, Math, Material Science, Biology, Chemistry (16 subdomains from five domains) # Model Efficiency Containers specialized in evaluating Large Language Model efficiency.
--- ## GenAIPerf Container **NGC Catalog**: [genai-perf](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf) Container for assessing the speed of processing requests by the server. **Use Cases:** - Analysis of time to first token (TTFT) and inter-token latency (ITL) - Assessment of server efficiency under load - Summarization scenario: long input, short output - Generation scenario: short input, long output **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/genai-perf:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `parallelism` | `1` | Benchmark-specific parameters (passed via `extra` field): | Parameter | Description | |-----------|-------| | `tokenizer` | HuggingFace tokenizer to use for calculating the number of tokens. **Required parameter** (default: `None`)| | `warmup` | Whether to run warmup (default: `True`) | | `isl` | Input sequence length (default: task-specific, see below) | | `osl` | Output sequence length (default: task-specific, see below) | **Supported Benchmarks:** - `genai_perf_summarization` - Speed analysis with `isl: 5000` and `osl: 500`. - `genai_perf_generation` - Speed analysis with `isl: 500` and `osl: 5000`. (nemo-evaluator-containers)= # NeMo Evaluator Containers NeMo Evaluator provides a collection of specialized containers for different evaluation frameworks and tasks. Each container is optimized and tested to work seamlessly with the NVIDIA hardware and software stack, providing consistent, reproducible environments for AI model evaluation. ## Container Categories ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` Language Models :link: language-models :link-type: doc Containers for evaluating large language models across academic benchmarks and custom tasks.
::: :::{grid-item-card} {octicon}`file-code;1.5em;sd-mr-1` Code Generation :link: code-generation :link-type: doc Specialized containers for evaluating code generation and programming capabilities. ::: :::{grid-item-card} {octicon}`eye;1.5em;sd-mr-1` Vision-Language :link: vision-language :link-type: doc Multimodal evaluation containers for vision-language understanding and reasoning. ::: :::{grid-item-card} {octicon}`shield-check;1.5em;sd-mr-1` Safety & Security :link: safety-security :link-type: doc Containers focused on safety evaluation, bias detection, and security testing. ::: :::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Specialized Tools :link: specialized-tools :link-type: doc Containers focused on agentic AI capabilities and advanced reasoning. ::: :::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Efficiency :link: efficiency :link-type: doc Containers for evaluating speed of input processing and output generation. ::: :::: --- ## Quick Start ### Basic Container Usage ```bash # Pull a container docker pull nvcr.io/nvidia/eval-factory/<container-name>:<tag> # Example: Pull simple-evals container docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} # Run with GPU support docker run -it --gpus all nvcr.io/nvidia/eval-factory/<container-name>:<tag> ``` ### Prerequisites - Docker and NVIDIA Container Toolkit (for GPU support) - NVIDIA GPU (for GPU-accelerated evaluation) - Sufficient disk space for models and datasets For detailed usage instructions, refer to the {ref}`cli-workflows` guide. :::{toctree} :caption: Container Reference :hidden: Language Models Code Generation Vision-Language Safety & Security Specialized Tools Efficiency ::: # Language Model Containers Containers specialized for evaluating large language models across academic benchmarks, custom tasks, and conversation scenarios.
--- ## Simple-Evals Container **NGC Catalog**: [simple-evals](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) Container for lightweight evaluation tasks and simple model assessments. **Use Cases:** - Simple question-answering evaluation - Math and reasoning capabilities - Basic Python coding **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `4096` | | `temperature` | `0` | | `top_p` | `1e-05` | | `parallelism` | `10` | | `max_retries` | `5` | | `request_timeout` | `60` | | `downsampling_ratio` | `None` | | `add_system_prompt` | `False` | | `custom_config` | `None` | | `judge` | `{'url': None, 'model_id': None, 'api_key': None, 'backend': 'openai', 'request_timeout': 600, 'max_retries': 16, 'temperature': 0.0, 'top_p': 0.0001, 'max_tokens': 1024, 'max_concurrent_requests': None}` | --- ## LM-Evaluation-Harness Container **NGC Catalog**: [lm-evaluation-harness](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) Container based on the Language Model Evaluation Harness framework for comprehensive language model evaluation. 
**Use Cases:** - Standard NLP benchmarks - Language model performance evaluation - Multi-task assessment - Academic benchmark evaluation **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `None` | | `temperature` | `1e-07` | | `top_p` | `0.9999999` | | `parallelism` | `10` | | `max_retries` | `5` | | `request_timeout` | `30` | | `tokenizer` | `None` | | `tokenizer_backend` | `None` | | `downsampling_ratio` | `None` | | `tokenized_requests` | `False` | --- ## MT-Bench Container **NGC Catalog**: [mtbench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) Container for MT-Bench evaluation framework, designed for multi-turn conversation evaluation. **Use Cases:** - Multi-turn dialogue evaluation - Conversation quality assessment - Context maintenance evaluation - Interactive AI system testing **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/mtbench:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `max_new_tokens` | `1024` | | `parallelism` | `10` | | `max_retries` | `5` | | `request_timeout` | `30` | | `judge` | `{'url': None, 'model_id': 'gpt-4', 'api_key': None, 'request_timeout': 60, 'max_retries': 16, 'temperature': 0.0, 'top_p': 0.0001, 'max_tokens': 2048}` | --- ## HELM Container **NGC Catalog**: [helm](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) Container for the Holistic Evaluation of Language Models (HELM) framework, with a focus on MedHELM - an extensible evaluation framework for assessing LLM performance for medical tasks. 
**Use Cases:** - Medical AI model evaluation - Clinical task assessment - Healthcare-specific benchmarking - Diagnostic decision-making evaluation - Patient communication assessment - Medical knowledge evaluation **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/helm:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `parallelism` | `1` | | `data_path` | `None` | | `num_output_tokens` | `None` | | `subject` | `None` | | `condition` | `None` | | `max_length` | `None` | | `num_train_trials` | `None` | | `subset` | `None` | | `gpt_judge_api_key` | `GPT_JUDGE_API_KEY` | | `llama_judge_api_key` | `LLAMA_JUDGE_API_KEY` | | `claude_judge_api_key` | `CLAUDE_JUDGE_API_KEY` | --- ## RAG Retriever Evaluation Container **NGC Catalog**: [rag_retriever_eval](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) Container for evaluating Retrieval-Augmented Generation (RAG) systems and their retrieval capabilities. **Use Cases:** - Document retrieval accuracy - Context relevance assessment - RAG pipeline evaluation - Information retrieval performance **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/rag_retriever_eval:{{ docker_compose_latest }} ``` --- ## HLE Container **NGC Catalog**: [hle](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) Container for Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark with broad subject coverage. 
**Use Cases:** - Academic knowledge and problem solving evaluation - Multi-modal benchmark testing - Frontier knowledge assessment - Subject-matter expertise evaluation **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/hle:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `4096` | | `temperature` | `0.0` | | `top_p` | `1.0` | | `parallelism` | `100` | | `max_retries` | `30` | | `request_timeout` | `600.0` | --- ## IFBench Container **NGC Catalog**: [ifbench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) Container for a challenging benchmark for precise instruction following evaluation. **Use Cases:** - Precise instruction following evaluation - Out-of-distribution constraint verification - Multiturn constraint isolation testing - Instruction following robustness assessment - Verifiable instruction compliance testing **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/ifbench:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `4096` | | `temperature` | `0.01` | | `top_p` | `0.95` | | `parallelism` | `8` | | `max_retries` | `5` | --- ## MMATH Container **NGC Catalog**: [mmath](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) Container for multilingual mathematical reasoning evaluation across multiple languages. 
**Use Cases:** - Multilingual mathematical reasoning evaluation - Cross-lingual mathematical problem solving assessment - Mathematical reasoning robustness across languages - Complex mathematical reasoning capability testing - Translation quality validation for mathematical content **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/mmath:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `32768` | | `temperature` | `0.6` | | `top_p` | `0.95` | | `parallelism` | `8` | | `max_retries` | `5` | | `language` | `en` | **Supported Languages:** EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI --- ## ProfBench Container **NGC Catalog**: [profbench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) Container for assessing performance across professional domains in business and scientific research. **Use Cases:** - Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA - Report generation capabilities - Quality assessment of LLM judges **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/profbench:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `4096` | | `temperature` | `0.0` | | `top_p` | `0.00001` | | `parallelism` | `10` | | `max_retries` | `5` | | `request_timeout` | `600` | --- ## NeMo Skills Container **NGC Catalog**: [nemo-skills](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo-skills) Container for assessing LLM capabilities in science, math, and agentic workflows.
**Use Cases:** - Evaluation of reasoning capabilities - Advanced math and coding skills - Agentic workflow **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/nemo-skills:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `65536` | | `temperature` | `None` | | `top_p` | `None` | | `parallelism` | `16` | # Safety and Security Containers Containers specialized for evaluating AI model safety, security, and robustness against various threats and biases. --- ## Garak Container **NGC Catalog**: [garak](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) Container for security and robustness evaluation of AI models. **Use Cases:** - Security testing - Adversarial attack evaluation - Robustness assessment - Safety evaluation **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/garak:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `max_new_tokens` | `150` | | `temperature` | `0.1` | | `top_p` | `0.7` | | `parallelism` | `32` | | `probes` | `None` | **Key Features:** - Automated security testing - Vulnerability detection - Prompt injection testing - Adversarial robustness evaluation - Comprehensive security reporting **Security Test Categories:** - Prompt Injection Attacks - Data Extraction Attempts - Jailbreak Techniques - Adversarial Prompts - Social Engineering Tests --- ## Safety Harness Container **NGC Catalog**: [safety-harness](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) Container for comprehensive safety evaluation of AI models. 
**Use Cases:** - Safety alignment evaluation - Harmful content detection - Bias and fairness assessment - Ethical AI evaluation **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/safety-harness:{{ docker_compose_latest }} ``` **Required Environment Variables:** - `HF_TOKEN`: Required for aegis_v2 safety evaluation tasks **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `6144` | | `temperature` | `0.6` | | `top_p` | `0.95` | | `parallelism` | `8` | | `max_retries` | `5` | | `request_timeout` | `30` | | `judge` | `{'url': None, 'model_id': None, 'api_key': None, 'parallelism': 32, 'request_timeout': 60, 'max_retries': 16}` | **Key Features:** - Comprehensive safety benchmarks - Bias detection and measurement - Harmful content classification - Ethical alignment assessment - Detailed safety reporting **Safety Evaluation Areas:** - Bias and Fairness - Harmful Content Generation - Toxicity Detection - Hate Speech Identification - Ethical Decision Making - Social Impact Assessment # Specialized Tools Containers Containers for specialized evaluation tasks including agentic AI capabilities and advanced reasoning assessments. --- ## Agentic Evaluation Container **NGC Catalog**: [agentic_eval](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) Container for evaluating agentic AI models on tool usage and planning tasks. 
**Use Cases:** - Tool usage evaluation - Planning tasks assessment **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/agentic_eval:{{ docker_compose_latest }} ``` **Supported Benchmarks:** - `agentic_eval_answer_accuracy` - `agentic_eval_goal_accuracy_with_reference` - `agentic_eval_goal_accuracy_without_reference` - `agentic_eval_topic_adherence` - `agentic_eval_tool_call_accuracy` --- ## BFCL Container **NGC Catalog**: [bfcl](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) Container for Berkeley Function-Calling Leaderboard evaluation framework. **Use Cases:** - Tool usage evaluation - Multi-turn interactions - Native support for function/tool calling - Function calling evaluation **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/bfcl:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `parallelism` | `10` | | `native_calling` | `False` | | `custom_dataset` | `{'path': None, 'format': None, 'data_template_path': None}` | --- ## ToolTalk Container **NGC Catalog**: [tooltalk](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) Container for evaluating AI models' ability to use tools and APIs effectively. **Use Cases:** - Tool usage evaluation - API interaction assessment - Function calling evaluation - External tool integration testing **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/tooltalk:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | # Vision-Language Containers Containers specialized for evaluating multimodal models that process both visual and textual information. --- ## VLMEvalKit Container **NGC Catalog**: [vlmevalkit](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) Container for Vision-Language Model evaluation toolkit. 
**Use Cases:** - Multimodal model evaluation - Image-text understanding assessment - Visual reasoning evaluation - Cross-modal performance testing **Pull Command:** ```bash docker pull nvcr.io/nvidia/eval-factory/vlmevalkit:{{ docker_compose_latest }} ``` **Default Parameters:** | Parameter | Value | |-----------|-------| | `limit_samples` | `None` | | `max_new_tokens` | `2048` | | `temperature` | `0` | | `top_p` | `None` | | `parallelism` | `4` | | `max_retries` | `5` | | `request_timeout` | `60` | **Supported Benchmarks:** - `ocrbench` - Optical character recognition and text understanding - `slidevqa` - Slide-based visual question answering (requires `OPENAI_CLIENT_ID`, `OPENAI_CLIENT_SECRET`) - `chartqa` - Chart and graph interpretation - `ai2d_judge` - AI2 Diagram understanding (requires `OPENAI_CLIENT_ID`, `OPENAI_CLIENT_SECRET`) (advanced-features)= # Advanced Features This section covers advanced FDF features including conditional parameter handling, parameter inheritance, and dynamic configuration. 
## Conditional Parameter Handling Use Jinja2 conditionals to handle optional parameters: ```yaml command: >- example_eval --model {{target.api_endpoint.model_id}} {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` ### Common Conditional Patterns **Check for null/none values**: ```jinja {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} ``` **Check for boolean flags**: ```jinja {% if config.params.extra.add_system_prompt %} --add_system_prompt {% endif %} ``` **Check if variable is defined**: ```jinja {% if config.params.extra.args is defined %} {{ config.params.extra.args }} {% endif %} ``` **Check for specific values**: ```jinja {% if target.api_endpoint.type == "chat" %} --use_chat_format {% endif %} ``` ## Parameter Inheritance Parameters follow a hierarchical override system: 1. **Framework defaults** (4th priority) - Lowest priority 2. **Evaluation defaults** (3rd priority) 3. **User configuration** (2nd priority) 4. **CLI overrides** (1st priority) - Highest priority ### Inheritance Example **Framework defaults (framework.yml)**: ```yaml defaults: config: params: temperature: 0.0 max_new_tokens: 4096 ``` **Evaluation defaults (framework.yml)**: ```yaml evaluations: - name: humaneval defaults: config: params: max_new_tokens: 1024 # Overrides framework default ``` **User configuration (config.yaml)**: ```yaml config: params: max_new_tokens: 512 # Overrides evaluation default temperature: 0.7 # Overrides framework default ``` **CLI overrides**: ```bash nemo-evaluator run_eval --overrides config.params.temperature=1.0 # Overrides all previous values ``` For more information on how to use these overrides, refer to the {ref}`parameter-overrides` documentation. 
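The inheritance example above can be sketched as a recursive dictionary merge applied from lowest to highest priority. This is an illustrative sketch in plain Python, not the SDK's internal merge logic:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicting keys."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# The four layers from the example above, lowest priority first.
framework_defaults = {"config": {"params": {"temperature": 0.0, "max_new_tokens": 4096}}}
evaluation_defaults = {"config": {"params": {"max_new_tokens": 1024}}}
user_config = {"config": {"params": {"max_new_tokens": 512, "temperature": 0.7}}}
cli_overrides = {"config": {"params": {"temperature": 1.0}}}

effective = framework_defaults
for layer in (evaluation_defaults, user_config, cli_overrides):
    effective = deep_merge(effective, layer)

print(effective["config"]["params"])  # {'temperature': 1.0, 'max_new_tokens': 512}
```

Each later layer only overrides the keys it sets, so `max_new_tokens` survives from the user configuration while `temperature` is taken from the CLI override.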
## Dynamic Configuration Use template variables to reference other configuration sections. For example, re-use `config.output_dir` for `--cache` input argument: ```yaml command: >- example_eval --output {{config.output_dir}} --cache {{config.output_dir}}/cache ``` ### Dynamic Configuration Patterns **Reference output directory**: ```yaml --results {{config.output_dir}}/results.json --logs {{config.output_dir}}/logs ``` **Compose complex paths**: ```yaml --data_dir {{config.output_dir}}/data/{{config.params.task}} ``` **Use task type in paths**: ```yaml --cache {{config.output_dir}}/cache/{{config.type}} ``` **Reference model information**: ```yaml --model_name {{target.api_endpoint.model_id}} --endpoint {{target.api_endpoint.url}} ``` ## Environment Variable Handling **Export API keys conditionally**: ```jinja {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} ``` **Set multiple environment variables**: ```jinja {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} {% if config.params.extra.custom_env is defined %}export CUSTOM_VAR={{config.params.extra.custom_env}} && {% endif %} ``` ## Complex Command Templates **Multi-line commands with conditionals**: ```yaml command: >- {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} example_eval --model {{target.api_endpoint.model_id}} --task {{config.params.task}} --url {{target.api_endpoint.url}} {% if config.params.limit_samples is not none %}--first_n {{config.params.limit_samples}}{% endif %} {% if config.params.extra.add_system_prompt %}--add_system_prompt{% endif %} {% if target.api_endpoint.type == "chat" %}--use_chat_format{% endif %} --output {{config.output_dir}} {% if config.params.extra.args is defined %}{{ config.params.extra.args }}{% endif %} ``` ## Best Practices - Always check if optional parameters are defined before using them - 
Use `is not none` for nullable parameters with default values - Use `is defined` for truly optional parameters that may not exist - Keep conditional logic simple and readable - Document custom parameters in the framework's README - Test all conditional branches with different configurations - Use parameter inheritance to avoid duplication - Leverage dynamic paths to organize output files (defaults-section)= # Defaults Section The `defaults` section defines the default configuration and execution command that will be used across all evaluations unless overridden. Overriding is supported either through the `--overrides` flag (refer to {ref}`parameter-overrides`) or through {ref}`run-configuration`. ## Command Template The `command` field uses Jinja2 templating to dynamically generate execution commands based on configuration parameters. ```yaml defaults: command: >- {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} example_eval --model {{target.api_endpoint.model_id}} --task {{config.params.task}} --url {{target.api_endpoint.url}} --temperature {{config.params.temperature}} # ... additional parameters ``` **Important Note**: `example_eval` is a placeholder representing your actual CLI command. When onboarding your harness, replace this with your real command (e.g., `lm-eval`, `bigcode-eval`, `gorilla-eval`, etc.).
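To preview how such a template expands before wiring it into your framework.yml, you can render it locally with Jinja2. The snippet below is a sketch only: it mirrors the command template above, and the context values (`example_eval`, the model id, URL, and key name) are placeholders rather than real endpoints:

```python
from jinja2 import Template  # third-party: pip install jinja2

# Hypothetical FDF command template; `example_eval` is a placeholder CLI.
command_template = Template(
    "{% if target.api_endpoint.api_key is not none %}"
    "export API_KEY=${{target.api_endpoint.api_key}} && {% endif %}"
    "example_eval --model {{target.api_endpoint.model_id}} "
    "--task {{config.params.task}} --url {{target.api_endpoint.url}}"
    "{% if config.params.limit_samples is not none %}"
    " --first_n {{config.params.limit_samples}}{% endif %}"
)

# Placeholder context standing in for the resolved run configuration.
context = {
    "target": {"api_endpoint": {"api_key": "MY_API_KEY",
                                "model_id": "my-model",
                                "url": "http://localhost:8080/v1"}},
    "config": {"params": {"task": "humaneval", "limit_samples": None}},
}

rendered = command_template.render(**context)
print(rendered)
# export API_KEY=$MY_API_KEY && example_eval --model my-model --task humaneval --url http://localhost:8080/v1
```

Because `limit_samples` is `None`, the `is not none` guard drops the `--first_n` flag entirely, which is how optional parameters stay out of the final command.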
## Template Variables ### Target API Endpoint Variables - **`{{target.api_endpoint.api_key}}`**: Name of the environment variable storing API key - **`{{target.api_endpoint.model_id}}`**: Target model identifier - **`{{target.api_endpoint.stream}}`**: Whether responses should be streamed - **`{{target.api_endpoint.type}}`**: The type of the target endpoint - **`{{target.api_endpoint.url}}`**: URL of the model - **`{{target.api_endpoint.adapter_config}}`**: Adapter configuration ### Evaluation Configuration Variables - **`{{config.output_dir}}`**: Output directory for results - **`{{config.type}}`**: Type of the task - **`{{config.supported_endpoint_types}}`**: Supported endpoint types (chat/completions) ### Configuration Parameters - **`{{config.params.task}}`**: Evaluation task type - **`{{config.params.temperature}}`**: Model temperature setting - **`{{config.params.limit_samples}}`**: Sample limit for evaluation - **`{{config.params.max_new_tokens}}`**: Maximum tokens to generate - **`{{config.params.max_retries}}`**: Number of REST request retries - **`{{config.params.parallelism}}`**: Parallelism to be used - **`{{config.params.request_timeout}}`**: REST response timeout - **`{{config.params.top_p}}`**: Top-p sampling parameter - **`{{config.params.extra}}`**: Framework-specific parameters ## Configuration Defaults The following example shows common parameter defaults. Each framework defines its own default values in the framework.yml file. 
```yaml defaults: config: params: limit_samples: null # No limit on samples by default max_new_tokens: 4096 # Maximum tokens to generate temperature: 0.0 # Deterministic generation top_p: 0.00001 # Nucleus sampling parameter parallelism: 10 # Number of parallel requests max_retries: 5 # Maximum API retry attempts request_timeout: 60 # Request timeout in seconds extra: # Framework-specific parameters n_samples: null # Number of sampled responses per input downsampling_ratio: null # Data downsampling ratio add_system_prompt: false # Include system prompt ``` ## Parameter Categories ### Core Parameters Basic evaluation settings that control model behavior: - `temperature`: Controls randomness in generation (0.0 = deterministic) - `max_new_tokens`: Maximum length of generated output - `top_p`: Nucleus sampling parameter for diversity ### Performance Parameters Settings that affect execution speed and reliability: - `parallelism`: Number of parallel API requests - `request_timeout`: Maximum wait time for API responses - `max_retries`: Number of retry attempts for failed requests ### Framework Parameters Task-specific configuration options: - `task`: Specific evaluation task to run - `limit_samples`: Limit number of samples for testing ### Extra Parameters Custom parameters specific to your framework. Use it for: - specifying number of sampled responses per input query - judge configuration - configuring few-shot settings ## Target Configuration ```yaml defaults: target: api_endpoint: type: chat # Default endpoint type supported_endpoint_types: # All supported types - chat - completions - vlm - embedding ``` ### Endpoint Types **chat**: Multi-turn conversation format following the OpenAI chat completions API (`/v1/chat/completions`). Use this for models that support conversational interactions with role-based messages (system, user, assistant). **completions**: Single-turn text completion format following the OpenAI completions API (`/v1/completions`). 
Use this for models that generate text based on a single prompt without conversation context. Often used for log-probability evaluations. **vlm**: Vision-language model endpoints that support image inputs alongside text (`/v1/chat/completions`). Use this for multimodal evaluations that include visual content. **embedding**: Embedding generation endpoints for retrieval and similarity evaluations (`/v1/embeddings`). Use this for tasks that require vector representations of text. (evaluations-section)= # Evaluations Section The `evaluations` section defines the specific evaluation types available in your framework, each with its own configuration defaults. ## Structure ```yaml evaluations: - name: example_task_1 # Evaluation name description: Basic functionality demo # Human-readable description defaults: config: type: "example_task_1" # Evaluation identifier supported_endpoint_types: # Supported endpoints for this task - chat - completions params: task: "example_task_1" # Task identifier used by the harness temperature: 0.0 # Task-specific temperature max_new_tokens: 1024 # Task-specific token limit extra: custom_key: "custom_value" # Task-specific custom param ``` ## Fields ### name **Type**: String **Required**: Yes Name for the evaluation type. **Example**: ```yaml name: HumanEval ``` ### description **Type**: String **Required**: Yes Clear description of what the evaluation measures. This helps users understand the purpose and scope of the evaluation. **Example**: ```yaml description: Evaluates code generation capabilities using the HumanEval benchmark dataset ``` ### type **Type**: String **Required**: Yes Unique configuration identifier used by the framework. This is used to reference the evaluation in CLI commands and configurations. This typically matches the `name` field but may differ based on your framework's conventions. 
**Example**: ```yaml type: "humaneval" ``` ### supported_endpoint_types **Type**: List of strings **Required**: Yes API endpoint types compatible with this evaluation. Specify which endpoint types work with this evaluation task: - `chat` - Conversational format with role-based messages - `completions` - Single-turn text completion - `vlm` - Vision-language model with image support - `embedding` - Embedding generation for retrieval tasks **Example**: ```yaml supported_endpoint_types: - chat - completions ``` ### params **Type**: Object **Required**: No Task-specific parameter overrides that differ from the framework-level defaults. Use this to customize settings for individual evaluation types. **Example**: ```yaml params: task: "humaneval" temperature: 0.0 max_new_tokens: 1024 extra: custom_key: "custom_value" ``` ## Multiple Evaluations You can define multiple evaluation types in a single FDF: ```yaml evaluations: - name: humaneval description: Code generation evaluation defaults: config: type: "humaneval" supported_endpoint_types: - chat - completions params: task: "humaneval" max_new_tokens: 1024 - name: mbpp description: Python programming evaluation defaults: config: type: "mbpp" supported_endpoint_types: - chat params: task: "mbpp" max_new_tokens: 512 ``` ## Best Practices - Use descriptive names that indicate the evaluation purpose - Provide comprehensive descriptions for each evaluation type - List endpoint types that are actually supported and tested - Override parameters when they differ from framework defaults - Use the `extra` object for framework-specific custom parameters - Group related evaluations together in the same FDF - Test each evaluation type with all specified endpoint types (fdf-troubleshooting)= # Troubleshooting This section covers common issues encountered when creating and using Framework Definition Files. ## Common Issues ::::{dropdown} Template Errors :icon: code-square **Symptom**: Template rendering fails with syntax errors. 
**Causes**: - Missing closing braces in Jinja2 templates - Invalid variable references - Incorrect conditional syntax **Solutions**: Check that all template variables use correct syntax: ```yaml # Correct {{target.api_endpoint.model_id}} # Incorrect {{target.api_endpoint.model_id} {target.api_endpoint.model_id}} ``` Verify conditional statements are properly formatted: ```jinja # Correct {% if config.params.limit_samples is not none %} --first_n {{config.params.limit_samples}}{% endif %} # Incorrect {% if config.params.limit_samples != none %} --first_n {{config.params.limit_samples}}{% end %} ``` :::: ::::{dropdown} Parameter Conflicts :icon: code-square **Symptom**: Parameters are not overriding as expected. **Causes**: - Incorrect parameter paths in overrides - Type mismatches between default and override values - Missing parameter definitions in defaults section - Incorrect indentation in the YAML config **Solutions**: Ensure parameter paths are correct: ```bash # Correct --overrides config.params.temperature=0.7 # Incorrect --overrides params.temperature=0.7 --overrides config.temperature=0.7 ``` Verify parameter types match: ```yaml # Correct temperature: 0.7 # Float # Incorrect temperature: "0.7" # String ``` Make sure to use the correct indentation: ```yaml # Correct defaults: config: params: limit_samples: null max_new_tokens: 4096 # max_new_tokens belongs to params # Incorrect defaults: config: params: limit_samples: null max_new_tokens: 4096 # max_new_tokens is outside of params ``` :::: ::::{dropdown} Type Mismatches :icon: code-square **Symptom**: Validation errors about incorrect parameter types. 
**Causes**: - String values used for numeric parameters - Missing quotes for string values - Boolean values as strings **Solutions**: Use correct types for each parameter: ```yaml # Correct temperature: 0.7 # Float max_new_tokens: 1024 # Integer add_system_prompt: false # Boolean task: "humaneval" # String # Incorrect temperature: "0.7" # String instead of float max_new_tokens: "1024" # String instead of integer add_system_prompt: "false" # String instead of boolean ``` :::: ::::{dropdown} Missing Fields :icon: code-square **Symptom**: Validation fails with "required field missing" errors. **Causes**: - Incomplete framework section - Missing required parameters - Omitted evaluation configurations **Solutions**: Ensure all required framework fields are present: ```yaml framework: name: your-framework # Required pkg_name: your_framework # Required full_name: Your Framework # Required description: Description... # Required url: https://github.com/... # Required ``` Include all required evaluation fields: ```yaml evaluations: - name: task_name # Required description: Task description # Required defaults: config: type: "task_type" # Required supported_endpoint_types: # Required - chat ``` :::: ## Debug Mode Enable debug logging to see how your FDF is processed. Use the `--debug` flag or set the logging level: ```bash # Using debug flag nemo-evaluator run_eval --eval_type your_evaluation --debug # Or set log level environment variable export LOG_LEVEL=DEBUG nemo-evaluator run_eval --eval_type your_evaluation ``` ### Debug Output Debug mode provides detailed information about: - FDF discovery and loading - Template variable resolution - Parameter inheritance and overrides - Command generation - Validation errors with stack traces ### Interpreting Debug Logs Debug logs show the FDF loading and processing workflow. 
Key information includes: **FDF Loading**: Shows which framework.yml files are discovered and loaded **Template Rendering**: Displays template variable substitution and final rendered commands **Parameter Overrides**: Shows how configuration values cascade through the inheritance hierarchy **Validation Errors**: Provides detailed error messages when FDF structure or templates are invalid ## Validation Tips **Test incrementally**: Start with a minimal FDF and add sections progressively. **Validate templates separately**: Test Jinja2 templates in isolation before adding to FDF. **Check references**: Ensure all template variables reference existing configuration paths. **Use examples**: Base your FDF on existing, working examples from the NeMo Evaluator repository. **Verify syntax**: Use a YAML validator to catch formatting errors. ## Getting Help If you encounter issues not covered here: 1. Check the FDF examples in the NeMo Evaluator repository 2. Review debug logs for specific error messages 3. Verify your framework's CLI works independently 4. Consult the {ref}`extending-evaluator` documentation 5. Search for similar issues in the project's issue tracker (framework-section)= # Framework Section The `framework` section contains basic identification and metadata for your evaluation framework. ## Structure ```yaml framework: name: example-evaluation-framework # Internal framework identifier pkg_name: example_evaluation_framework # Python package name full_name: Example Evaluation Framework # Human-readable display name description: A comprehensive example... # Detailed description url: https://github.com/example/... # Original repository URL ``` ## Fields ### name **Type**: String **Required**: Yes Unique identifier used internally by the system. This should be a lowercase, hyphenated string that identifies your framework. **Example**: ```yaml name: bigcode-evaluation-harness ``` ### pkg_name **Type**: String **Required**: Yes Python package name for your framework. 
This typically matches the `name` field but uses underscores instead of hyphens to follow Python naming conventions. **Example**: ```yaml pkg_name: bigcode_evaluation_harness ``` ### full_name **Type**: String **Required**: Recommended Human-readable name displayed in the UI and documentation. Use proper capitalization and spacing. **Example**: ```yaml full_name: BigCode Evaluation Harness ``` ### description **Type**: String **Required**: Recommended Comprehensive description of the framework's purpose, capabilities, and use cases. This helps users understand when to use your framework. **Example**: ```yaml description: A comprehensive evaluation harness for code generation models, supporting multiple programming languages and diverse coding tasks. ``` ### url **Type**: String (URL) **Required**: Recommended Link to the original benchmark or framework repository. This provides users with access to more documentation and source code. **Example**: ```yaml url: https://github.com/bigcode-project/bigcode-evaluation-harness ``` ## Best Practices - Use consistent naming across `name`, `pkg_name`, and `full_name` - Keep the `name` field URL-friendly (lowercase, hyphens) - Write clear, concise descriptions that highlight unique features - Link to the canonical upstream repository when available - Verify that the URL is accessible and up-to-date ## Minimal Requirements At minimum, an FDF requires the `name` and `pkg_name` fields. However, including `full_name`, `description`, and `url` is strongly recommended for better documentation and user experience. (framework-definition-file)= # Framework Definition File (FDF) Framework Definition Files are YAML configuration files that integrate evaluation frameworks into NeMo Evaluator. They define framework metadata, execution commands, and evaluation tasks. **New to FDFs?** Learn about {ref}`the concepts and architecture ` before creating one. 
## Prerequisites Before creating an FDF, you should: - Understand YAML syntax and structure - Be familiar with your evaluation framework's CLI interface - Have basic knowledge of Jinja2 templating - Know the API endpoint types your framework supports ## Getting Started **Creating your first FDF?** Follow this sequence: 1. {ref}`framework-section` - Define framework metadata 2. {ref}`defaults-section` - Configure command templates and parameters 3. {ref}`evaluations-section` - Define evaluation tasks 4. {ref}`integration` - Integrate with Eval Factory **Need help?** Refer to {ref}`fdf-troubleshooting` for debugging common issues. ## Complete Example The FDF follows a hierarchical structure with three main sections. Here's a minimal but complete example: ```yaml # 1. Framework Identification framework: name: my-custom-eval pkg_name: my_custom_eval full_name: My Custom Evaluation Framework description: Evaluates domain-specific capabilities url: https://github.com/example/my-eval # 2. Default Command and Configuration defaults: command: >- {% if target.api_endpoint.api_key is not none %}export API_KEY=${{target.api_endpoint.api_key}} && {% endif %} my-eval-cli --model {{target.api_endpoint.model_id}} --task {{config.params.task}} --output {{config.output_dir}} config: params: temperature: 0.0 max_new_tokens: 1024 target: api_endpoint: type: chat supported_endpoint_types: - chat - completions # 3. Evaluation Types evaluations: - name: my_task_1 description: First evaluation task defaults: config: type: my_task_1 supported_endpoint_types: - chat params: task: my_task_1 ``` ## Reference Documentation ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Framework Section :link: framework-section :link-type: ref Define framework metadata including name, package information, and repository URL. 
::: :::{grid-item-card} {octicon}`list-unordered;1.5em;sd-mr-1` Defaults Section :link: defaults-section :link-type: ref Configure default parameters, command templates, and target endpoint settings. ::: :::{grid-item-card} {octicon}`checklist;1.5em;sd-mr-1` Evaluations Section :link: evaluations-section :link-type: ref Define specific evaluation types with task-specific configurations and parameters. ::: :::{grid-item-card} {octicon}`telescope;1.5em;sd-mr-1` Advanced Features :link: advanced-features :link-type: ref Use conditionals, parameter inheritance, and dynamic configuration in your FDF. ::: :::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` Integration :link: integration :link-type: ref Learn how to integrate your FDF with the Eval Factory system. ::: :::{grid-item-card} {octicon}`question;1.5em;sd-mr-1` Troubleshooting :link: fdf-troubleshooting :link-type: ref Debug common issues with template errors, parameters, and validation. ::: :::: ## Related Documentation - {ref}`eval-custom-tasks` - Learn how to create custom evaluation tasks - {ref}`extending-evaluator` - Overview of extending the NeMo Evaluator - {ref}`parameter-overrides` - Using parameter overrides in evaluations :::{toctree} :maxdepth: 1 :hidden: Framework Section Defaults Section Evaluations Section Advanced Features Integration Troubleshooting ::: (integration)= # Integration with Eval Factory This section describes how to integrate your Framework Definition File with the Eval Factory system. ## File Location Place your FDF in the `core_evals//` directory of your framework package: ``` your-framework/ core_evals/ your_framework/ framework.yml # This is your FDF output.py # Output parser (custom) __init__.py # Empty init file setup.py # Package configuration README.md # Framework documentation ``` ### Directory Structure Explanation **core_evals/**: Root directory for evaluation framework definitions. This directory name is required by the Eval Factory system. 
**your_framework/**: Subdirectory named after your framework (must match `framework.name` from your FDF). **framework.yml**: Your Framework Definition File. This exact filename is required. **output.py**: Custom output parser for processing evaluation results. This file should implement the parsing logic specific to your framework's output format. **__init__.py**: Empty initialization file to make the directory a Python package. ## Validation The FDF is validated by the NeMo Evaluator system when loaded. Validation occurs through Pydantic models that ensure: - Required fields are present (`name`, `pkg_name`, `command`) - Parameter types are correct (strings, integers, floats, lists) - Template syntax is valid (Jinja2 parsing) - Configuration consistency (endpoint types, parameter references) ### Validation Checks **Schema Validation**: Pydantic models ensure required fields exist and have correct types when the FDF is parsed. **Template Validation**: Jinja2 templates are rendered with `StrictUndefined`, which raises errors for undefined variables. **Reference Validation**: Template variables must reference valid fields in the `Evaluation` model (`config`, `target`, `framework_name`, `pkg_name`). **Consistency Validation**: Endpoint types and parameters should be consistent across framework defaults and evaluation-specific configurations. ## Registration Once your FDF is properly located and validated, the Eval Factory system automatically: 1. Discovers your framework during initialization 2. Parses the FDF and validates its structure 3. Registers available evaluation types 4. Makes your framework available via CLI commands ## Using Your Framework After successful integration, you can use your framework with the Eval Factory CLI: ```bash # List available frameworks and tasks nemo-evaluator ls # Run an evaluation nemo-evaluator run_eval --eval_type your_evaluation --model_id my-model ... 
``` ## Package Configuration Ensure your `setup.py` or `pyproject.toml` includes the FDF in package data: ```python from setuptools import setup, find_packages setup( name="your-framework", packages=find_packages(), package_data={ "core_evals": ["**/*.yml"], }, include_package_data=True, ) ``` Or, equivalently, in `pyproject.toml`: ```toml [tool.setuptools.package-data] core_evals = ["**/*.yml"] ``` ## Best Practices - Follow the exact directory structure and naming conventions - Test your FDF validation locally before deployment - Document your framework's output format in README.md - Include example configurations in your documentation - Provide sample commands for common use cases - Version your FDF changes alongside framework updates - Keep the FDF synchronized with your framework's capabilities (extending-evaluator)= # Extending NeMo Evaluator Extend NeMo Evaluator with custom benchmarks, evaluation frameworks, and integrations. Learn how to define new evaluation frameworks and integrate them into the NeMo Evaluator ecosystem using standardized configuration patterns. ::::{grid} 1 1 1 1 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Framework Definition File :link: framework-definition-file :link-type: ref Learn how to create Framework Definition Files (FDF) to integrate custom evaluation frameworks and benchmarks into the NeMo Evaluator ecosystem.
::: :::: ## Extension Patterns NeMo Evaluator supports several patterns for extending functionality: ### Framework Definition Files (FDF) The primary extension mechanism uses YAML configuration files to define: - Framework metadata and dependencies - Default configurations and parameters - Evaluation types and task definitions - Container integration specifications ### Integration Benefits - **Standardization**: Follow established patterns for configuration and execution - **Reproducibility**: Leverage the same deterministic configuration system - **Compatibility**: Work seamlessly with existing launchers and exporters - **Community**: Share frameworks through the standard FDF format ## Start with Extensions **Building a production framework?** Follow these steps: 1. **Review Existing Frameworks**: Study existing FDF files to understand the structure 2. **Define Your Framework**: Create an FDF that describes your evaluation framework 3. **Test Integration**: Validate that your framework works with NeMo Evaluator workflows 4. **Container Packaging**: Package your framework as a container for distribution For detailed reference documentation, refer to {ref}`framework-definition-file`. :::{toctree} :caption: Extending NeMo Evaluator :hidden: Framework Definition File ::: (lib-core)= # NeMo Evaluator The *Core Evaluation Engine* delivers standardized, reproducible AI model evaluation through containerized benchmarks and a flexible adapter architecture. :::{tip} **Need orchestration?** For CLI and multi-backend execution, use the [NeMo Evaluator Launcher](../nemo-evaluator-launcher/index.md). ::: ## Get Started ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Workflows :link: workflows/index :link-type: doc Run evaluations using pre-built containers directly or integrate them through the Python API. 
::: :::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Containers :link: containers/index :link-type: doc Ready-to-use evaluation containers with curated benchmarks and frameworks. ::: :::: ## Reference and Customization ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` Interceptors :link: interceptors/index :link-type: doc Set up interceptors to handle requests, responses, logging, caching, and custom processing. ::: :::{grid-item-card} {octicon}`log;1.5em;sd-mr-1` Logging :link: logging :link-type: doc Comprehensive logging setup for evaluation runs, debugging, and audit trails. ::: :::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Extending :link: extending/index :link-type: doc Add custom benchmarks and frameworks by defining configuration and interfaces. ::: :::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Python API Reference :link: ../../references/api/nemo-evaluator/api/index :link-type: doc Python API documentation for programmatic evaluation control and integration. ::: :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` CLI Reference :link: ../../references/api/nemo-evaluator/cli :link-type: doc Command-line interface for direct container and evaluation execution. ::: :::: :::{toctree} :caption: NeMo Evaluator Core :hidden: About NeMo Evaluator Workflows Benchmark Containers Interceptors Logging Extending ::: (workflows-overview)= # Workflows Learn how to use NeMo Evaluator through different workflow patterns. Whether you prefer programmatic control through Python APIs or CLI, these guides provide practical examples for integrating evaluations into your ML pipelines. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`command-palette;1.5em;sd-mr-1` CLI :link: cli :link-type: doc Run evaluations using the pre-built NGC containers and command line interface. 
::: :::{grid-item-card} {octicon}`file-code;1.5em;sd-mr-1` Python API :link: python-api :link-type: doc Use the NeMo Evaluator Python API to integrate evaluations directly into your existing ML pipelines and applications. ::: :::: ## Choose Your Workflow - **Python API**: Integrate evaluations directly into your existing Python applications when you need dynamic configuration management or programmatic control - **CLI**: Use CLI when you work with CI/CD systems, container orchestration platforms, or other non-interactive workflows. Both approaches use the same underlying evaluation package and produce identical, reproducible results. Choose based on your integration requirements and preferred level of abstraction. :::{toctree} :caption: Workflows :hidden: CLI Python API ::: (cli-workflows)= # CLI Workflows This document explains how to use evaluation containers within NeMo Evaluator workflows, focusing on command execution and configuration. ## Overview Evaluation containers provide consistent, reproducible environments for running AI model evaluations. For a comprehensive list of all available containers, refer to {ref}`nemo-evaluator-containers`. 
## Basic CLI ### Using YAML Configuration Define your config: ```yaml config: type: mmlu_pro output_dir: /workspace/results params: limit_samples: 10 target: api_endpoint: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: meta/llama-3.2-3b-instruct type: chat api_key: NGC_API_KEY ``` Run evaluation: ```bash export HF_TOKEN=hf_xxx export NGC_API_KEY=nvapi-xxx nemo-evaluator run_eval \ --run_config /workspace/my_config.yml ``` ### Using CLI Overrides Provide all arguments through the CLI: ```bash export HF_TOKEN=hf_xxx export NGC_API_KEY=nvapi-xxx nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name NGC_API_KEY \ --output_dir /workspace/results \ --overrides 'config.params.limit_samples=10' ``` ## Interceptor Configuration The adapter system uses interceptors to modify requests and responses. Configure interceptors using the `--overrides` parameter. For detailed interceptor configuration, refer to {ref}`nemo-evaluator-interceptors`. :::{note} Always include the `endpoint` interceptor at the end of your custom interceptor chain.
::: ### Enable Request Logging ```yaml config: type: mmlu_pro output_dir: /workspace/results params: limit_samples: 10 target: api_endpoint: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: meta/llama-3.2-3b-instruct type: chat api_key: NGC_API_KEY adapter_config: interceptors: - name: "request_logging" enabled: true config: max_requests: 1000 - name: "endpoint" enabled: true config: {} ``` ```bash export HF_TOKEN=hf_xxx export NGC_API_KEY=nvapi-xxx nemo-evaluator run_eval \ --run_config /workspace/my_config.yml ``` ### Enable Caching ```yaml config: type: mmlu_pro output_dir: /workspace/results params: limit_samples: 10 target: api_endpoint: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: meta/llama-3.2-3b-instruct type: chat api_key: NGC_API_KEY adapter_config: interceptors: - name: "caching" enabled: true config: cache_dir: "./evaluation_cache" reuse_cached_responses: true save_requests: true save_responses: true max_saved_requests: 1000 max_saved_responses: 1000 - name: "endpoint" enabled: true config: {} ``` ```bash export HF_TOKEN=hf_xxx export NGC_API_KEY=nvapi-xxx nemo-evaluator run_eval \ --run_config /workspace/my_config.yml ``` ### Multiple Interceptors ```yaml config: type: mmlu_pro output_dir: /workspace/results params: limit_samples: 10 target: api_endpoint: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: meta/llama-3.2-3b-instruct type: chat api_key: NGC_API_KEY adapter_config: interceptors: - name: "caching" enabled: true config: cache_dir: "./evaluation_cache" reuse_cached_responses: true save_requests: true save_responses: true max_saved_requests: 1000 max_saved_responses: 1000 - name: "request_logging" enabled: true config: max_requests: 1000 - name: "reasoning" config: start_reasoning_token: "" end_reasoning_token: "" add_reasoning: true enable_reasoning_tracking: true - name: "endpoint" enabled: true config: {} ``` ```bash export HF_TOKEN=hf_xxx export NGC_API_KEY=nvapi-xxx 
nemo-evaluator run_eval \ --run_config /workspace/my_config.yml ``` ### Legacy Configuration Support Provide interceptor configuration with the `--overrides` flag: ```bash nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.2-3b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_type chat \ --api_key_name NGC_API_KEY \ --output_dir ./results \ --overrides 'target.api_endpoint.adapter_config.use_request_logging=True,target.api_endpoint.adapter_config.max_saved_requests=1000,target.api_endpoint.adapter_config.use_caching=True,target.api_endpoint.adapter_config.caching_dir=./cache,target.api_endpoint.adapter_config.reuse_cached_responses=True' ``` :::{note} Legacy parameters will be automatically converted to the modern interceptor-based configuration. For new projects, use the YAML interceptor configuration shown above. ::: ## Troubleshooting ### Port Conflicts If you manually specify the adapter server port, you may encounter port conflicts. Try selecting a different port: ```bash export ADAPTER_PORT=3828 export ADAPTER_HOST=localhost ``` :::{note} You can also rely on NeMo Evaluator's dynamic port binding feature. ::: ### API Key Issues Verify your API key environment variable: ```bash echo $MY_API_KEY ``` ## Environment Variables ### Adapter Server Configuration ```bash export ADAPTER_PORT=3828 # Default: 3825 export ADAPTER_HOST=localhost ``` ### API Key Management ```bash export MY_API_KEY=your_api_key_here export HF_TOKEN=your_hf_token_here ``` (python-api-workflows)= # Python API The NeMo Evaluator Python API provides programmatic access to evaluation capabilities through the `nemo-evaluator` package, allowing you to integrate evaluations into existing ML pipelines, automate workflows, and build custom evaluation applications.
## Overview The Python API is built on top of NeMo Evaluator and provides: - **Programmatic Evaluation**: Run evaluations from Python code using `evaluate` - **Configuration Management**: Dynamic configuration and parameter management - **Adapter Integration**: Access to the full adapter system capabilities - **Result Processing**: Programmatic access to evaluation results - **Pipeline Integration**: Seamless integration with existing ML workflows ## Basic Usage ### Basic Evaluation Run a simple evaluation with minimal configuration: ```python from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import ( EvaluationConfig, EvaluationTarget, ApiEndpoint, EndpointType, ConfigParams ) # Configure evaluation eval_config = EvaluationConfig( type="mmlu_pro", output_dir="./results", params=ConfigParams( limit_samples=3, temperature=0.0, max_new_tokens=1024, parallelism=1 ) ) # Configure target endpoint target_config = EvaluationTarget( api_endpoint=ApiEndpoint( model_id="meta/llama-3.2-3b-instruct", url="https://integrate.api.nvidia.com/v1/chat/completions", type=EndpointType.CHAT, api_key="nvapi-your-key-here" ) ) # Run evaluation result = evaluate(eval_cfg=eval_config, target_cfg=target_config) print(f"Evaluation completed: {result}") ``` ### Evaluation With Adapter Interceptors Use interceptors for advanced features such as caching, logging, and reasoning: ```python from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import ( EvaluationConfig, EvaluationTarget, ApiEndpoint, EndpointType, ConfigParams ) from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig # Configure evaluation eval_config = EvaluationConfig( type="mmlu_pro", output_dir="./results", params=ConfigParams( limit_samples=10, temperature=0.0, max_new_tokens=1024, parallelism=1 ) ) # Configure adapter with interceptors adapter_config = AdapterConfig( interceptors=[ # Add custom system message 
InterceptorConfig( name="system_message", config={ "system_message": "You are a helpful AI assistant. Please provide accurate and detailed answers." } ), # Enable request logging InterceptorConfig( name="request_logging", config={"max_requests": 50} ), # Enable caching InterceptorConfig( name="caching", config={ "cache_dir": "./evaluation_cache", "reuse_cached_responses": True } ), # Enable response logging InterceptorConfig( name="response_logging", config={"max_responses": 50} ), # Enable reasoning extraction InterceptorConfig( name="reasoning", config={ "start_reasoning_token": "", "end_reasoning_token": "" } ), # Enable progress tracking InterceptorConfig( name="progress_tracking" ), InterceptorConfig( name="endpoint" ), ] ) # Configure target with adapter target_config = EvaluationTarget( api_endpoint=ApiEndpoint( model_id="meta/llama-3.2-3b-instruct", url="https://integrate.api.nvidia.com/v1/chat/completions", type=EndpointType.CHAT, api_key="nvapi-your-key-here", adapter_config=adapter_config ) ) # Run evaluation result = evaluate(eval_cfg=eval_config, target_cfg=target_config) print(f"Evaluation completed: {result}") ``` ## Related Documentation - **API Reference**: For complete API documentation, refer to the [API Reference](../api.md) page - **Adapter Configuration**: For detailed interceptor configuration options, refer to the {ref}`adapters-usage` page - **Interceptor Documentation**: For information about available interceptors, refer to the [Interceptors](../interceptors/index.md) page (nemo-evaluator-interceptors)= # Interceptors Interceptors provide fine-grained control over request and response processing during model evaluation through a configurable pipeline architecture. ## Overview The adapter system processes model API calls through a configurable pipeline of interceptors. Each interceptor can inspect, modify, or augment requests and responses as they flow through the evaluation process. 
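The flow described above can be illustrated with a minimal sketch. Note that the class and function names here (`Interceptor`, `SystemMessageInterceptorSketch`, `run_pipeline`, `fake_endpoint`) are hypothetical and exist only to show the pipeline pattern; they are not the actual `nemo-evaluator` classes.

```python
from typing import Callable

# Illustrative sketch only -- NOT the actual nemo-evaluator classes.
# Each interceptor may transform the request on the way in and the
# response on the way out.
class Interceptor:
    def intercept_request(self, request: dict) -> dict:
        return request

    def intercept_response(self, response: dict) -> dict:
        return response


class SystemMessageInterceptorSketch(Interceptor):
    """Prepends a system message to a chat-format request."""

    def __init__(self, message: str) -> None:
        self.message = message

    def intercept_request(self, request: dict) -> dict:
        request = dict(request)  # avoid mutating the caller's payload
        request["messages"] = (
            [{"role": "system", "content": self.message}]
            + request.get("messages", [])
        )
        return request


def run_pipeline(
    request: dict,
    interceptors: list[Interceptor],
    endpoint: Callable[[dict], dict],
) -> dict:
    # Request hooks run in configured order, then the endpoint is
    # called, then response hooks run over the result.
    for ic in interceptors:
        request = ic.intercept_request(request)
    response = endpoint(request)
    for ic in interceptors:
        response = ic.intercept_response(response)
    return response


# Stub endpoint that reports how many messages it received.
def fake_endpoint(request: dict) -> dict:
    return {"n_messages": len(request["messages"])}


result = run_pipeline(
    {"messages": [{"role": "user", "content": "What is 2+2?"}]},
    [SystemMessageInterceptorSketch("You are a helpful AI assistant.")],
    fake_endpoint,
)
print(result)  # {'n_messages': 2}
```

The key design point this sketch captures is that interceptors compose: each one sees the (possibly already modified) request or response produced by the previous stage, which is why ordering in the configured chain matters.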
```{mermaid} graph LR A[Evaluation Request] --> B[Adapter System] B --> C[Interceptor Pipeline] C --> D[Model API] D --> E[Response Pipeline] E --> F[Processed Response] subgraph "Request Processing" C --> G[System Message] G --> H[Payload Modifier] H --> I[Request Logging] I --> J[Caching Check] J --> K[Endpoint Call] end subgraph "Response Processing" E --> L[Response Logging] L --> M[Reasoning Extraction] M --> N[Progress Tracking] N --> O[Cache Storage] end style B fill:#f3e5f5 style C fill:#e1f5fe style E fill:#e8f5e8 ``` ## Request Interceptors ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`comment;1.5em;sd-mr-1` System Messages :link: system-messages :link-type: doc Modify system messages in requests. ::: :::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Payload Modification :link: payload-modification :link-type: doc Add, remove, or modify request parameters. ::: :::{grid-item-card} {octicon}`sign-in;1.5em;sd-mr-1` Request Logging :link: request-logging :link-type: doc Logs requests for debugging, analysis, and audit purposes. ::: :::: ## Request-Response Interceptors ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`cache;1.5em;sd-mr-1` Caching :link: caching :link-type: doc Cache requests and responses to improve performance and reduce API calls. ::: :::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` Endpoint :link: endpoint :link-type: doc Communicates with the model endpoint. ::: :::: ## Response Interceptors ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`sign-out;1.5em;sd-mr-1` Response Logging :link: response-logging :link-type: doc Logs responses for debugging, analysis, and audit purposes. ::: :::{grid-item-card} {octicon}`pulse;1.5em;sd-mr-1` Progress Tracking :link: progress-tracking :link-type: doc Track evaluation progress and status updates.
::: :::{grid-item-card} {octicon}`alert;1.5em;sd-mr-1` Raising on Client Errors :link: raise-client-error :link-type: doc Fail fast on non-retryable client errors. ::: :::{grid-item-card} {octicon}`comment-discussion;1.5em;sd-mr-1` Reasoning :link: reasoning :link-type: doc Handle reasoning tokens and track reasoning metrics. ::: :::{grid-item-card} {octicon}`meter;1.5em;sd-mr-1` Response Statistics :link: response-stats :link-type: doc Collects statistics from API responses for metrics collection and analysis. ::: :::: ## Process Post-Evaluation Results ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`report;1.5em;sd-mr-1` Post-Evaluation Hooks :link: post-evaluation-hooks :link-type: doc Run additional processing, reporting, or cleanup after evaluations complete. ::: :::: :::{toctree} :caption: Interceptors :hidden: System Messages Payload Modification Request Logging Caching Endpoint Response Logging Progress Tracking Raising on Client Errors Reasoning Response Statistics Post-Evaluation Hooks ::: (interceptor-system-messages)= # System Messages ## Overview The `SystemMessageInterceptor` modifies incoming requests to include custom system messages. This interceptor works with chat-format requests, combining the configured message with any existing system message according to the selected strategy (replace, prepend, or append). :::{tip} Add {ref}`interceptor-request-logging` to your interceptor chain to verify that your requests are modified correctly. ::: ## Configuration ### CLI Configuration ```bash --overrides 'target.api_endpoint.adapter_config.use_system_prompt=True,target.api_endpoint.adapter_config.custom_system_prompt="You are a helpful assistant."' ``` ### YAML Configuration ```yaml target: api_endpoint: adapter_config: interceptors: - name: system_message config: system_message: "You are a helpful AI assistant."
strategy: "prepend" # Optional: "replace", "append", or "prepend" (default) - name: "endpoint" enabled: true config: {} ``` **Example with different strategies:** ```yaml # Replace existing system message - name: system_message config: system_message: "You are a precise assistant." strategy: "replace" # Prepend to existing system message (default) - name: system_message config: system_message: "Important: " strategy: "prepend" # Append to existing system message - name: system_message config: system_message: "Remember to be concise." strategy: "append" ``` ## Configuration Options For detailed configuration options, refer to the {ref}`interceptor_reference` Python API reference. ## Behavior With the `replace` strategy, the interceptor modifies chat-format requests by: 1. Removing any existing system messages from the messages array 2. Inserting the configured system message as the first message 3. Preserving all other request parameters ### Example Request Transformation ```python # Original request { "messages": [ {"role": "user", "content": "What is 2+2?"} ] } # After system message interceptor { "messages": [ {"role": "system", "content": "You are a helpful AI assistant."}, {"role": "user", "content": "What is 2+2?"} ] } ``` If an existing system message is present, the `replace` strategy substitutes the configured message for it: ```python # Original request with existing system message { "messages": [ {"role": "system", "content": "Old system message"}, {"role": "user", "content": "What is 2+2?"} ] } # After system message interceptor { "messages": [ {"role": "system", "content": "You are a helpful AI assistant."}, {"role": "user", "content": "What is 2+2?"} ] } ``` (interceptor-caching)= # Caching ## Overview The `CachingInterceptor` implements a caching system that can store responses based on request content, enabling faster re-runs of evaluations and reducing costs when using paid APIs.
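At its core, the interceptor keys each response by a deterministic hash of the request and serves repeats from the cache instead of calling the API again. The sketch below illustrates that hit/miss logic with an in-memory dictionary; the `ResponseCache` class and `endpoint` function are hypothetical stand-ins, not the actual `CachingInterceptor`, which persists to disk-backed stores.

```python
import hashlib
import json
from typing import Callable

# Illustrative sketch only -- the real CachingInterceptor persists
# requests, responses, and headers to disk-backed stores. Here an
# in-memory dict stands in for the response cache.
class ResponseCache:
    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, request: dict) -> str:
        # Deterministic key: SHA256 of the sorted-key JSON serialization,
        # so equivalent requests always map to the same entry.
        return hashlib.sha256(
            json.dumps(request, sort_keys=True).encode("utf-8")
        ).hexdigest()

    def fetch(self, request: dict, call_endpoint: Callable[[dict], dict]) -> dict:
        key = self._key(request)
        if key in self._store:  # cache hit: skip the API call entirely
            self.hits += 1
            return self._store[key]
        self.misses += 1  # cache miss: call the endpoint and store the result
        response = call_endpoint(request)
        self._store[key] = response
        return response


calls = 0

def endpoint(request: dict) -> dict:
    # Stand-in for the real model API call; counts invocations.
    global calls
    calls += 1
    return {"text": "4"}


cache = ResponseCache()
req = {"messages": [{"role": "user", "content": "What is 2+2?"}], "temperature": 0.0}
cache.fetch(req, endpoint)  # miss: the endpoint is called once
cache.fetch(req, endpoint)  # hit: served from the cache, no second call
print(calls, cache.hits, cache.misses)  # 1 1 1
```

This is why re-running an evaluation with `reuse_cached_responses: true` is both faster and cheaper: every repeated request resolves locally without reaching the paid API.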
## Configuration ### CLI Configuration ```bash --overrides 'target.api_endpoint.adapter_config.use_caching=True,target.api_endpoint.adapter_config.caching_dir=./cache,target.api_endpoint.adapter_config.reuse_cached_responses=True' ``` ### YAML Configuration ```yaml target: api_endpoint: adapter_config: interceptors: - name: "caching" enabled: true config: cache_dir: "./evaluation_cache" reuse_cached_responses: true save_requests: true save_responses: true max_saved_requests: 1000 max_saved_responses: 1000 - name: "endpoint" enabled: true config: {} ``` ## Configuration Options For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference. ## Cache Key Generation The interceptor generates the cache key by creating a SHA256 hash of the JSON-serialized request data using `json.dumps()` with `sort_keys=True` for consistent ordering. ```python import hashlib import json # Request data request_data = { "messages": [{"role": "user", "content": "What is 2+2?"}], "temperature": 0.0, "max_new_tokens": 512 } # Generate cache key data_str = json.dumps(request_data, sort_keys=True) cache_key = hashlib.sha256(data_str.encode("utf-8")).hexdigest() # Result: "abc123def456..." (64-character hex string) ``` ## Cache Storage Format The caching interceptor stores data in three separate disk-backed key-value stores within the configured cache directory: - **Response Cache** (`{cache_dir}/responses/`): Stores raw response content (bytes) keyed by cache key (when `save_responses=True` or `reuse_cached_responses=True`) - **Headers Cache** (`{cache_dir}/headers/`): Stores response headers (dictionary) keyed by cache key (when `save_requests=True`) - **Request Cache** (`{cache_dir}/requests/`): Stores request data (dictionary) keyed by cache key (when `save_requests=True`) Each cache uses a SHA256 hash of the request data as the lookup key. 
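Because the key derivation above serializes with sorted keys, two requests with identical content always map to the same cache key, regardless of the order in which fields were set. A quick sanity check of that property:

```python
import hashlib
import json

def cache_key(request_data):
    # Same derivation as described above: SHA256 over the
    # sort_keys=True JSON serialization of the request.
    data_str = json.dumps(request_data, sort_keys=True)
    return hashlib.sha256(data_str.encode("utf-8")).hexdigest()

# Identical content, different field order
a = {"temperature": 0.0, "messages": [{"role": "user", "content": "Hi"}]}
b = {"messages": [{"role": "user", "content": "Hi"}], "temperature": 0.0}

assert cache_key(a) == cache_key(b)  # field order does not change the key
assert len(cache_key(a)) == 64       # 64-character hex digest
```

Note that sorting applies to dictionary keys only; list order (for example, the order of `messages`) is significant, so reordering messages produces a different key.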
When a cache hit occurs, the interceptor retrieves both the response content and headers using the same cache key. ## Cache Behavior ### Cache Hit Process 1. **Request arrives** at the caching interceptor 2. **Generate cache key** from request parameters 3. **Check cache** for existing response 4. **Return cached response** if found (sets `cache_hit=True`) 5. **Skip API call** and continue to next interceptor ### Cache Miss Process 1. **Request continues** to endpoint interceptor 2. **Response received** from model API 3. **Store response** in cache with generated key 4. **Continue processing** with response interceptors (interceptor-endpoint)= # Endpoint Interceptor ## Overview **Required interceptor** that handles the actual API communication. This interceptor must be present in every configuration as it performs the final request to the target API endpoint. **Important**: This interceptor should always be placed after the last request interceptor and before the first response interceptor. ## Configuration ### CLI Configuration ```bash # The endpoint interceptor is automatically enabled and requires no additional CLI configuration ``` ### YAML Configuration ```yaml target: api_endpoint: adapter_config: interceptors: - name: "endpoint" enabled: true config: {} ``` ## Configuration Options The Endpoint Interceptor is configured automatically. No configuration is required. (interceptor-payload-modification)= # Payload Modification ## Overview `PayloadParamsModifierInterceptor` adds, removes, or modifies request parameters before sending them to the model endpoint. :::{tip} Add {ref}`interceptor-request-logging` to your interceptor chain to verify if your requests are modified correctly. 
:::

## Configuration

### CLI Configuration

```bash
--overrides 'target.api_endpoint.adapter_config.params_to_add={"temperature":0.7},target.api_endpoint.adapter_config.params_to_remove=["max_tokens"]'
```

### YAML Configuration

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: "payload_modifier"
          enabled: true
          config:
            params_to_add:
              temperature: 0.7
              top_p: 0.9
            params_to_remove:
              - "top_k"             # top-level field in the payload to remove
              - "reasoning_content" # field in the message to remove
            params_to_rename:
              old_param: "new_param"
        - name: "endpoint"
          enabled: true
          config: {}
```

:::{note}
In the example above, the `reasoning_content` field will be removed recursively from all messages in the payload.
:::

## Configuration Options

For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.

:::{note}
The interceptor applies operations in the following order: remove → add → rename. This means you can remove a parameter and then add a different value for the same parameter name.
:::

## Use Cases

### Parameter Standardization

Ensure consistent parameters across evaluations by adding or removing parameters:

```yaml
config:
  params_to_add:
    temperature: 0.7
    top_p: 0.9
  params_to_remove:
    - "frequency_penalty"
    - "presence_penalty"
```

### Model-Specific Configuration

Add parameters required by specific model endpoints, such as chat template configuration:

```yaml
config:
  params_to_add:
    extra_body:
      chat_template_kwargs:
        enable_thinking: false
```

### API Compatibility

Rename parameters for compatibility with different API versions or endpoint specifications:

```yaml
config:
  params_to_rename:
    max_new_tokens: "max_tokens"
    num_return_sequences: "n"
```

# Post-Evaluation Hooks

Post-evaluation hooks run additional processing or reporting tasks after the main evaluation finishes. The built-in `PostEvalReportHook` generates HTML and JSON reports from cached request-response pairs.
## Report Generation Generate HTML and JSON reports with evaluation request-response examples. ### YAML Configuration ```yaml target: api_endpoint: adapter_config: post_eval_hooks: - name: "post_eval_report" enabled: true config: report_types: ["html", "json"] html_report_size: 10 ``` ### CLI Configuration ```bash --overrides 'target.api_endpoint.adapter_config.generate_html_report=True' ``` ## Configuration Options For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference. ## Report Output The hook generates reports in the evaluation output directory: - **HTML Report**: `{output_dir}/report.html` - Interactive report with request-response pairs and curl commands - **JSON Report**: `{output_dir}/report.json` - Machine-readable report with structured data # Progress Tracking ## Overview `ProgressTrackingInterceptor` tracks evaluation progress by counting processed samples and optionally sending updates to a webhook endpoint. ## Configuration ### CLI Configuration ```bash --overrides 'target.api_endpoint.adapter_config.use_progress_tracking=True,target.api_endpoint.adapter_config.progress_tracking_url=http://monitoring:3828/progress' ``` ### YAML Configuration ```yaml target: api_endpoint: adapter_config: interceptors: - name: "endpoint" enabled: true config: {} - name: "progress_tracking" enabled: true config: progress_tracking_url: "http://monitoring:3828/progress" progress_tracking_interval: 10 request_method: "PATCH" output_dir: "/tmp/output" ``` ## Configuration Options For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference. ## Behavior The interceptor tracks the number of responses processed and: 1. **Sends webhook updates**: Posts progress updates to the configured URL at the specified interval 2. **Saves progress to disk**: If `output_dir` is configured, writes progress count to a `progress` file in that directory 3. 
**Resumes from checkpoint**: If a progress file exists on initialization, resumes counting from that value

(interceptor-raise-client-error)=
# Raise Client Error Interceptor

## Overview

The `RaiseClientErrorInterceptor` handles non-retryable client errors by raising exceptions instead of continuing the benchmark evaluation. By default, it raises exceptions on 4xx HTTP status codes (excluding 408 Request Timeout and 429 Too Many Requests, which are typically retryable).

This interceptor is useful when you want to fail fast on client errors that indicate configuration issues, authentication problems, or other non-recoverable errors rather than continuing the evaluation with failed requests.

## Configuration

### CLI Configuration

```bash
--overrides 'target.api_endpoint.adapter_config.use_raise_client_errors=True'
```

### YAML Configuration

::::{tab-set}

:::{tab-item} Default Configuration
Raises on 4xx status codes except 408 (Request Timeout) and 429 (Too Many Requests).

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: "endpoint"
          enabled: true
          config: {}
        - name: "raise_client_errors"
          enabled: true
          config:
            # Default configuration - raises on 4xx except 408, 429
            exclude_status_codes: [408, 429]
            status_code_range_start: 400
            status_code_range_end: 499
```
:::

:::{tab-item} Specific Status Codes
Raises only on specific status codes rather than a range.

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: "raise_client_errors"
          enabled: true
          config:
            # Custom configuration - only specific status codes
            status_codes: [400, 401, 403, 404]
        - name: "endpoint"
          enabled: true
          config: {}
```
:::

:::{tab-item} Custom Exclusions
Uses a status code range with custom exclusions, including 404 Not Found.
```yaml target: api_endpoint: adapter_config: interceptors: - name: "raise_client_errors" enabled: true config: # Custom range with different exclusions status_code_range_start: 400 status_code_range_end: 499 exclude_status_codes: [408, 429, 404] # Also exclude 404 not found - name: "endpoint" enabled: true config: {} ``` ::: :::: ## Configuration Options For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference. ## Behavior ### Default Behavior - Raises exceptions on HTTP status codes 400-499 - Excludes 408 (Request Timeout) and 429 (Too Many Requests) as these are typically retryable - Logs critical errors before raising the exception ### Configuration Logic 1. If `status_codes` is specified, only those exact status codes will trigger exceptions 2. If `status_codes` is not specified, the range defined by `status_code_range_start` and `status_code_range_end` is used 3. `exclude_status_codes` are always excluded from raising exceptions 4. Cannot have the same status code in both `status_codes` and `exclude_status_codes` ### Error Handling - Raises `FatalErrorException` when a matching status code is encountered - Logs critical error messages with status code and URL information - Stops the evaluation process immediately ## Examples ::::{tab-set} :::{tab-item} Auth Failures Only Raises exceptions only on authentication and authorization failures. ```yaml config: status_codes: [401, 403] ``` ::: :::{tab-item} All Client Errors Except Rate Limiting Raises on all 4xx errors except timeout and rate limit errors. ```yaml config: status_code_range_start: 400 status_code_range_end: 499 exclude_status_codes: [408, 429] ``` ::: :::{tab-item} Strict Mode - All Client Errors Raises exceptions on any 4xx status code without exclusions. 
```yaml
config:
  status_code_range_start: 400
  status_code_range_end: 499
  exclude_status_codes: []
```
:::

::::

## Common Use Cases

- **API Configuration Validation**: Fail immediately on authentication errors (401, 403)
- **Input Validation**: Stop evaluation on bad request errors (400)
- **Resource Existence**: Fail on not found errors (404) for critical resources
- **Development/Testing**: Use strict mode to catch all client-side issues
- **Production**: Use default settings to allow retryable errors while catching configuration issues

(interceptor-reasoning)=
# Reasoning

## Overview

The `ResponseReasoningInterceptor` handles models that generate explicit reasoning steps, typically enclosed in special tokens. It removes reasoning content from the final response and tracks reasoning metrics for analysis.

## Configuration

### CLI Configuration

```bash
--overrides 'target.api_endpoint.adapter_config.process_reasoning_traces=True,target.api_endpoint.adapter_config.end_reasoning_token="</think>",target.api_endpoint.adapter_config.start_reasoning_token="<think>"'
```

### YAML Configuration

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: "endpoint"
          enabled: true
          config: {}
        - name: reasoning
          config:
            start_reasoning_token: "<think>"
            end_reasoning_token: "</think>"
            add_reasoning: true
            enable_reasoning_tracking: true
```

## Configuration Options

For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.

## Processing Examples

### Basic Reasoning Stripping

```python
# Original response from model
original_content = "<think>Let me solve this step by step. 2+2 is basic addition. 2 plus 2 equals 4.</think>The answer is 4."

# After reasoning interceptor processing
# The content field has reasoning removed
processed_content = "The answer is 4."
```

### Multi-Step Reasoning

```python
# Original response with multi-line reasoning
original_content = """<think>
This is a word problem. Let me break it down:
1. John has 5 apples
2. He gives away 2 apples
3. So he has 5 - 2 = 3 apples left
</think>
John has 3 apples remaining."""

# After processing: reasoning tokens and content are removed
processed_content = "John has 3 apples remaining."
```

## Tracked Metrics

The interceptor automatically tracks the following statistics:

| Metric | Description |
|--------|-------------|
| `total_responses` | Total number of responses processed |
| `responses_with_reasoning` | Number of responses containing reasoning content |
| `reasoning_finished_count` | Number of responses where reasoning completed (end token found) |
| `reasoning_finished_ratio` | Ratio (0-1) of responses with completed reasoning to all responses with reasoning |
| `reasoning_started_count` | Number of responses where reasoning started |
| `reasoning_unfinished_count` | Number of responses where reasoning started but did not complete (end token not found) |
| `avg_reasoning_words` | Average word count in reasoning content |
| `avg_reasoning_tokens` | Average token count in reasoning content |
| `avg_original_content_words` | Average word count in original content (before processing) |
| `avg_updated_content_words` | Average word count in updated content (after processing) |
| `avg_updated_content_tokens` | Average token count in updated content |
| `max_reasoning_words` | Maximum word count in reasoning content |
| `max_reasoning_tokens` | Maximum token count in reasoning content |
| `max_original_content_words` | Maximum word count in original content (before processing) |
| `max_updated_content_words` | Maximum word count in updated content (after processing) |
| `max_updated_content_tokens` | Maximum token count in updated content |
| `total_reasoning_words` | Total word count across all reasoning content |
| `total_reasoning_tokens` | Total token count across all reasoning content |
| `total_original_content_words` | Total word count in original content (before processing) |
| `total_updated_content_words` | Total word count in updated content (after processing) |
| `total_updated_content_tokens` | Total token count in updated content |

These statistics are saved to `eval_factory_metrics.json` under the `reasoning` key after evaluation completes.

## Example: Custom Reasoning Tokens

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: reasoning
          config:
            start_reasoning_token: "[REASONING]"
            end_reasoning_token: "[/REASONING]"
            add_reasoning: true
            enable_reasoning_tracking: true
        - name: "endpoint"
          enabled: true
          config: {}
```

(interceptor-request-logging)=
# Request Logging Interceptor

## Overview

The `RequestLoggingInterceptor` captures and logs incoming API requests for debugging, analysis, and audit purposes. This interceptor is essential for troubleshooting evaluation issues and understanding request patterns.

## Configuration

### CLI Configuration

```bash
--overrides 'target.api_endpoint.adapter_config.use_request_logging=True,target.api_endpoint.adapter_config.max_saved_requests=1000'
```

### YAML Configuration

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: "request_logging"
          enabled: true
          config:
            max_requests: 1000
        - name: "endpoint"
          enabled: true
          config: {}
```

## Configuration Options

For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.

(interceptor-response-logging)=
# Response Logging Interceptor

## Overview

The `ResponseLoggingInterceptor` captures and logs API responses for analysis and debugging. Use this interceptor to examine model outputs and identify response patterns.

## Configuration

### CLI Configuration

```bash
--overrides 'target.api_endpoint.adapter_config.use_response_logging=True,target.api_endpoint.adapter_config.max_saved_responses=1000'
```

### YAML Configuration

```yaml
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: "endpoint"
          enabled: true
          config: {}
        - name: "response_logging"
          enabled: true
          config:
            max_responses: 1000
```

## Configuration Options

For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference.
(interceptor-response-stats)= # Response Stats Interceptor ## Overview The `ResponseStatsInterceptor` collects comprehensive aggregated statistics from API responses for metrics collection and analysis. It tracks detailed metrics about token usage, response patterns, performance characteristics, and API behavior throughout the evaluation process. This interceptor is essential for understanding API performance, cost analysis, and monitoring evaluation runs. It provides both real-time aggregated statistics and detailed per-request tracking capabilities. **Key Statistics Tracked:** - Token usage (prompt, completion, total) with averages and maximums - Response status codes and counts - Finish reasons and stop reasons - Tool calls and function calls counts - Response latency (average and maximum) - Total response count and successful responses - Inference run times and timing analysis ## Configuration ### CLI Configuration ```bash --overrides 'target.api_endpoint.adapter_config.tracking_requests_stats=True,target.api_endpoint.adapter_config.response_stats_cache=/tmp/response_stats_interceptor,target.api_endpoint.adapter_config.logging_aggregated_stats_interval=100' ``` ### YAML Configuration ```yaml target: api_endpoint: adapter_config: interceptors: - name: "response_stats" enabled: true config: # Default configuration - collect all statistics collect_token_stats: true collect_finish_reasons: true collect_tool_calls: true save_individuals: true cache_dir: "/tmp/response_stats_interceptor" logging_aggregated_stats_interval: 100 - name: "endpoint" enabled: true config: {} ``` ```yaml target: api_endpoint: adapter_config: interceptors: - name: "response_stats" enabled: true config: # Minimal configuration - only basic stats collect_token_stats: false collect_finish_reasons: false collect_tool_calls: false save_individuals: false logging_aggregated_stats_interval: 50 - name: "endpoint" enabled: true config: {} ``` ```yaml target: api_endpoint: adapter_config: 
interceptors: - name: "endpoint" enabled: true config: {} - name: "response_stats" enabled: true config: # Custom configuration with periodic saving collect_token_stats: true collect_finish_reasons: true collect_tool_calls: true stats_file_saving_interval: 100 save_individuals: true cache_dir: "/custom/stats/cache" logging_aggregated_stats_interval: 25 ``` ## Configuration Options For detailed configuration options, please refer to the {ref}`interceptor_reference` Python API reference. ## Behavior ### Statistics Collection The interceptor automatically collects statistics from successful API responses (HTTP 200) and tracks basic information for all responses regardless of status code. **For Successful Responses (200):** - Parses JSON response body - Extracts token usage from `usage` field - Collects finish reasons from `choices[].finish_reason` - Counts tool calls and function calls - Calculates running averages and maximums **For All Responses:** - Tracks status code distribution - Measures response latency - Records response timestamps - Maintains response counts ### Data Storage - **Aggregated Stats**: Continuously updated running statistics stored in cache - **Individual Stats**: Per-request details stored with request IDs (if enabled) - **Metrics File**: Final statistics saved to `eval_factory_metrics.json` - **Thread Safety**: All operations are thread-safe using locks ### Timing Analysis - Tracks inference run times across multiple evaluation runs - Calculates time from first to last request per run - Estimates time to first request from adapter initialization - Provides detailed timing breakdowns for performance analysis ## Statistics Output ### Aggregated Statistics ```json { "response_stats": { "description": "Response statistics saved during processing", "avg_prompt_tokens": 150.5, "avg_total_tokens": 200.3, "avg_completion_tokens": 49.8, "avg_latency_ms": 1250.2, "max_prompt_tokens": 300, "max_total_tokens": 450, "max_completion_tokens": 150, 
"max_latency_ms": 3000, "count": 1000, "successful_count": 995, "tool_calls_count": 50, "function_calls_count": 25, "finish_reason": { "stop": 800, "length": 150, "tool_calls": 45 }, "status_codes": { "200": 995, "429": 3, "500": 2 }, "inference_time": 45.6, "run_id": 0 } } ``` ### Individual Request Statistics (if enabled) ```json { "request_id": "req_123", "timestamp": 1698765432.123, "status_code": 200, "prompt_tokens": 150, "total_tokens": 200, "completion_tokens": 50, "finish_reason": "stop", "tool_calls_count": 0, "function_calls_count": 0, "run_id": 0 } ``` ## Common Use Cases - **Cost Analysis**: Track token usage patterns to estimate API costs - **Performance Monitoring**: Monitor response times and throughput - **Quality Assessment**: Analyze finish reasons and response patterns - **Tool Usage Analysis**: Track function and tool call frequencies - **Debugging**: Individual request tracking for troubleshooting - **Capacity Planning**: Understand API usage patterns and limits - **A/B Testing**: Compare statistics across different configurations - **Production Monitoring**: Real-time visibility into API behavior ## Integration Notes - **Post-Evaluation Hook**: Automatically saves final statistics after evaluation completes - **Cache Persistence**: Statistics survive across runs and can be aggregated - **Thread Safety**: Safe for concurrent request processing - **Memory Efficient**: Uses running averages to avoid storing all individual values - **Caching Strategy**: Handles cache hits by skipping statistics collection to avoid double-counting (nemo-evaluator-logging)= # Logging Configuration This document describes how to configure and use logging in the NVIDIA NeMo Evaluator framework. 
## Log Levels Set these environment variables for logging configuration: ```bash # Set log level (INFO, DEBUG, WARNING, ERROR, CRITICAL) export LOG_LEVEL=DEBUG # or (legacy, still supported) export NEMO_EVALUATOR_LOG_LEVEL=DEBUG ``` ```{list-table} :header-rows: 1 :widths: 15 35 50 * - Level - Description - Use Case * - `INFO` - General information - Normal operation logs * - `DEBUG` - Detailed debugging - Development and troubleshooting * - `WARNING` - Warning messages - Potential issues * - `ERROR` - Error messages - Problems that need attention * - `CRITICAL` - Critical errors - Severe problems requiring immediate action ``` ## Log Output ### Console Output Logs appear in the console (stderr) with color coding: - **Green**: INFO messages - **Yellow**: WARNING messages - **Red**: ERROR messages - **Red background**: CRITICAL messages - **Gray**: DEBUG messages ### Custom Log Directory Specify a custom log directory using the `NEMO_EVALUATOR_LOG_DIR` environment variable: ```bash # Set custom log directory export NEMO_EVALUATOR_LOG_DIR=/path/to/logs/ # Run evaluation (logs will be written to the specified directory) nemo-evaluator run_eval ... ``` If `NEMO_EVALUATOR_LOG_DIR` is not set, logs appear in the console (stderr) without file output. ## Using Logging Interceptors NeMo Evaluator supports dedicated interceptors for request and response logging. Add logging to your adapter configuration: ```yaml target: api_endpoint: adapter_config: interceptors: - name: "request_logging" config: log_request_body: true log_request_headers: true - name: "response_logging" config: log_response_body: true log_response_headers: true ``` ## Request Tracking Each request automatically gets a unique UUID that appears in all related log messages. This helps trace requests through the system. 
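Since every log message for a request carries the same UUID, you can reconstruct one request's full trail across a log directory with plain `grep`. The snippet below is a sketch under assumptions: the `request_id=` field format and the `adapter.log` filename are hypothetical, and a fake log line is written first so the example is self-contained.

```shell
# Illustrative only: the request_id= field format and adapter.log
# filename are hypothetical; a fake log line is written so the
# example runs without a real evaluation.
LOG_DIR="${NEMO_EVALUATOR_LOG_DIR:-./demo_logs}"
REQUEST_ID="3f2b9c1e-8d4a-4f6b-9a1c-0e5d7b2a6c33"   # example UUID

mkdir -p "$LOG_DIR"
echo "INFO request_id=$REQUEST_ID POST /v1/chat/completions" >> "$LOG_DIR/adapter.log"

# All log messages for one request share its UUID, so grep collects
# that request's full trail across all log files.
grep -r "$REQUEST_ID" "$LOG_DIR"
```

In a real run, replace the example UUID with one copied from a log line of interest.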
## Troubleshooting

### No logs appearing

- Enable logging interceptors in your configuration
- Verify log level with `LOG_LEVEL=INFO` or `NEMO_EVALUATOR_LOG_LEVEL=INFO`

### Missing DEBUG logs

- Set `LOG_LEVEL=DEBUG` or `NEMO_EVALUATOR_LOG_LEVEL=DEBUG`

### Logs not going to files

- Check directory permissions
- Verify log directory path with `NEMO_EVALUATOR_LOG_DIR`

### Debug mode

```bash
export LOG_LEVEL=DEBUG
```

## Examples

### Basic logging

```bash
# Enable DEBUG logging
export LOG_LEVEL=DEBUG

# Run evaluation with logging
nemo-evaluator run_eval --eval_type mmlu_pro --model_id gpt-4 ...
```

### Custom log directory

```bash
# Specify custom log location using environment variable
export NEMO_EVALUATOR_LOG_DIR=./my_logs/

# Run evaluation with logging to custom directory
nemo-evaluator run_eval --eval_type mmlu_pro ...
```

### Environment verification

```bash
echo "LOG_LEVEL: $LOG_LEVEL"
echo "NEMO_EVALUATOR_LOG_DIR: $NEMO_EVALUATOR_LOG_DIR"
```

(references-overview)=
# References

Comprehensive reference documentation for NeMo Evaluator APIs, functions, and configuration options.

## CLI vs. Programmatic Usage

The NeMo Evaluator SDK supports two usage patterns:

1. **CLI Usage** (Recommended): Use the `nemo-evaluator` and/or `nemo-evaluator-launcher` binaries, which parse command-line arguments
2. **Programmatic Usage**: Use the Python API with configuration objects

**When to Use Which:**

- **CLI**: For command-line tools, scripts, and simple automation
- **Programmatic**: For building custom applications, workflows, and integration with other systems

## API References

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`command-palette;1.5em;sd-mr-1` NeMo Evaluator Launcher CLI
:link: api/nemo-evaluator-launcher/cli
:link-type: doc
Comprehensive command-line interface reference with all commands, options, and examples.
::: :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` NeMo Evaluator Launcher API :link: api/nemo-evaluator-launcher/api :link-type: doc Complete Python API reference for programmatic evaluation workflows and job management. ::: :::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration Schema :link: ../libraries/nemo-evaluator-launcher/configuration/index :link-type: doc Configuration reference for NeMo Evaluator Launcher with examples for all executors and deployment types. ::: :::{grid-item-card} {octicon}`command-palette;1.5em;sd-mr-1` NeMo Evaluator CLI :link: api/nemo-evaluator/cli :link-type: doc Comprehensive command-line interface reference with all commands, options, and examples. ::: :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` NeMo Evaluator Python API :link: api/nemo-evaluator/api/index :link-type: doc Complete Python API reference for programmatic evaluation workflows and job management. ::: :::: # Python API The NeMo Evaluator Launcher provides a Python API for programmatic access to evaluation functionality. This allows you to integrate evaluations into your Python workflows, Jupyter notebooks, and automated pipelines. 
## Installation ```bash pip install nemo-evaluator-launcher # With optional exporters pip install nemo-evaluator-launcher[mlflow,wandb,gsheets] ``` ## Core Functions ### Running Evaluations ```python from nemo_evaluator_launcher.api import RunConfig, run_eval # Run evaluation with configuration config = RunConfig.from_hydra( config="examples/local_basic.yaml", hydra_overrides=[ "execution.output_dir=my_results" ] ) invocation_id = run_eval(config) # Returns invocation ID for tracking print(f"Started evaluation: {invocation_id}") ``` ### Listing Available Tasks ```python from nemo_evaluator_launcher.api import get_tasks_list # Get all available evaluation tasks tasks = get_tasks_list() # Each task contains: [task_name, endpoint_type, harness, container] for task in tasks[:5]: task_name, endpoint_type, harness, container = task print(f"Task: {task_name}, Type: {endpoint_type}") ``` ### Checking Job Status ```python from nemo_evaluator_launcher.api import get_status # Check status of a specific invocation or job status = get_status(["abc12345"]) # Returns list of status dictionaries with keys: invocation, job_id, status, progress, data for job_status in status: print(f"Job {job_status['job_id']}: {job_status['status']}") ``` ## Configuration Management ### Creating Configuration with Hydra ```python from nemo_evaluator_launcher.api import RunConfig from omegaconf import OmegaConf # Load default configuration config = RunConfig.from_hydra() print(OmegaConf.to_yaml(config)) ``` ### Loading Existing Configuration ```python from nemo_evaluator_launcher.api import RunConfig # Load a specific configuration file config = RunConfig.from_hydra( config="examples/local_basic.yaml" ) ``` ### Configuration with Overrides ```python import tempfile from nemo_evaluator_launcher.api import RunConfig, run_eval # Create configuration with both Hydra overrides and dictionary overrides config = RunConfig.from_hydra( hydra_overrides=[ "execution.output_dir=" + tempfile.mkdtemp() ], 
dict_overrides={ "target": { "api_endpoint": { "url": "https://integrate.api.nvidia.com/v1/chat/completions", "model_id": "meta/llama-3.2-3b-instruct", "api_key_name": "NGC_API_KEY" } }, "evaluation": [ { "name": "ifeval", "overrides": { "config.params.limit_samples": 10 } } ] } ) # Run evaluation invocation_id = run_eval(config) ``` ### Exploring Deployment Options ```python from nemo_evaluator_launcher.api import RunConfig from omegaconf import OmegaConf # Load configuration with different deployment backend config = RunConfig.from_hydra( hydra_overrides=["deployment=vllm"] ) print(OmegaConf.to_yaml(config)) ``` ## Jupyter Notebook Integration ```python # Cell 1: Setup import tempfile from omegaconf import OmegaConf from nemo_evaluator_launcher.api import RunConfig, get_status, get_tasks_list, run_eval # Cell 2: List available tasks tasks = get_tasks_list() print("Available tasks:") for task in tasks[:10]: # Show first 10 print(f" - {task[0]} ({task[1]})") # Cell 3: Create and run evaluation config = RunConfig.from_hydra( hydra_overrides=[ "execution.output_dir=" + tempfile.mkdtemp() ], dict_overrides={ "target": { "api_endpoint": { "url": "https://integrate.api.nvidia.com/v1/chat/completions", "model_id": "meta/llama-3.2-3b-instruct", "api_key_name": "NGC_API_KEY" } }, "evaluation": [ { "name": "ifeval", "overrides": { "config.params.limit_samples": 10 } } ] } ) invocation_id = run_eval(config) print(f"Started evaluation: {invocation_id}") # Cell 4: Check status status_list = get_status([invocation_id]) status = status_list[0] print(f"Status: {status['status']}") print(f"Output directory: {status['data']['output_dir']}") ``` ## See Also - [CLI Reference](index.md) - Command-line interface documentation - [Configuration](configuration/index.md) - Configuration system overview - [Exporters](exporters/index.md) - Result export options # NeMo Evaluator Launcher CLI Reference (nemo-evaluator-launcher) The NeMo Evaluator Launcher provides a command-line interface for 
running evaluations, managing jobs, and exporting results. The CLI is available through the `nemo-evaluator-launcher` command.

## Global Options

```bash
nemo-evaluator-launcher --help     # Show help
nemo-evaluator-launcher --version  # Show version information
```

## Commands Overview

```{list-table}
:header-rows: 1
:widths: 20 80

* - Command
  - Description
* - `run`
  - Run evaluations with specified configuration
* - `status`
  - Check status of jobs or invocations
* - `info`
  - Show detailed job(s) information
* - `kill`
  - Kill a job or invocation
* - `ls`
  - List tasks or runs
* - `export`
  - Export evaluation results to various destinations
* - `version`
  - Show version information
```

## run - Run Evaluations

Execute evaluations using Hydra configuration management.

### Basic Usage

```bash
# Using example configurations
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml

# With output directory override
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o execution.output_dir=/path/to/results
```

### Configuration Options

```bash
# Using custom config directory
nemo-evaluator-launcher run --config my_configs/my_evaluation.yaml

# Multiple overrides (Hydra syntax)
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \
  -o execution.output_dir=results \
  -o target.api_endpoint.model_id=my-model \
  -o +config.params.limit_samples=10
```

### Config Loading Modes

The `--config-mode` parameter controls how configuration files are loaded:

- **`hydra`** (default): Uses the Hydra configuration system. Hydra handles configuration composition, overrides, and validation.
- **`raw`**: Loads the config file directly without Hydra processing. Useful for loading pre-generated complete configuration files.
```bash # Default: Hydra mode (config file is processed by Hydra) nemo-evaluator-launcher run --config my_config.yaml # Explicit Hydra mode nemo-evaluator-launcher run --config my_config.yaml --config-mode=hydra # Raw mode: load config file directly (bypasses Hydra) nemo-evaluator-launcher run --config complete_config.yaml --config-mode=raw ``` **Note:** When using `--config-mode=raw`, the `--config` parameter is required, and `-o/--override` cannot be used. (launcher-cli-dry-run)= ### Dry Run Preview the full resolved configuration without executing: ```bash nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run ``` ### Test Runs Run with limited samples for testing: ```bash nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o +config.params.limit_samples=10 ``` ### Task Filtering Run only specific tasks from your configuration using the `-t` flag: ```bash # Run a single task (local_basic.yaml has ifeval, gpqa_diamond, mbpp) nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml -t ifeval # Run multiple specific tasks nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml -t ifeval -t mbpp # Combine with other options nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml -t ifeval -t mbpp --dry-run ``` **Notes:** - Tasks must be defined in your configuration file under `evaluation.tasks` - If any requested task is not found in the configuration, the command will fail with an error listing available tasks - Task filtering preserves all task-specific overrides and `nemo_evaluator_config` settings ### Examples by Executor **Local Execution:** ```bash nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o execution.output_dir=./local_results ``` **Slurm Execution:** ```bash nemo-evaluator-launcher run 
--config packages/nemo-evaluator-launcher/examples/slurm_vllm_basic.yaml \ -o execution.output_dir=/shared/results ``` **Lepton AI Execution:** ```bash # With model deployment nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_nim.yaml # Using existing endpoint nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_basic.yaml ``` ## status - Check Job Status Check the status of running or completed evaluations. ### Status Basic Usage ```bash # Check status of specific invocation (returns all jobs in that invocation) nemo-evaluator-launcher status abc12345 # Check status of specific job nemo-evaluator-launcher status abc12345.0 # Output as JSON nemo-evaluator-launcher status abc12345 --json ``` ### Output Formats **Table Format (default):** ```text Job ID | Status | Executor Info | Location abc12345.0 | running | container123 | /task1/... abc12345.1 | success | container124 | /task2/... ``` **JSON Format (with --json flag):** ```json [ { "invocation": "abc12345", "job_id": "abc12345.0", "status": "running", "data": { "container": "eval-container", "output_dir": "/path/to/results" } }, { "invocation": "abc12345", "job_id": "abc12345.1", "status": "success", "data": { "container": "eval-container", "output_dir": "/path/to/results" } } ] ``` ## info - Job information and navigation Display detailed job information, including metadata, configuration, and paths to logs/artifacts with descriptions of key result files. Supports copying results locally from both local and remote jobs. 
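Job identifiers used by `status`, `info`, and `kill` follow an `<invocation_id>.<job_index>` convention: `abc12345` addresses the whole invocation, `abc12345.0` its first job. The decomposition is simple enough to sketch (an illustrative helper of ours, not a launcher API):

```python
def split_job_id(identifier: str):
    """Split 'abc12345.0' into ('abc12345', 0); a bare invocation ID
    yields (id, None). Illustrative only, not part of the launcher."""
    invocation, sep, index = identifier.partition(".")
    return (invocation, int(index)) if sep else (invocation, None)

print(split_job_id("abc12345.0"))  # → ('abc12345', 0)
print(split_job_id("abc12345"))    # → ('abc12345', None)
```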
### Basic usage ```bash # Show job info for one or more IDs (job or invocation) nemo-evaluator-launcher info <invocation_id> nemo-evaluator-launcher info <job_id> ``` ### Show configuration ```bash nemo-evaluator-launcher info <job_id> --config ``` ### Show paths ```bash # Show artifact locations nemo-evaluator-launcher info <job_id> --artifacts # Show log locations nemo-evaluator-launcher info <job_id> --logs ``` ### Copy files locally ```bash # Copy logs nemo-evaluator-launcher info <job_id> --copy-logs [DIR] # Copy artifacts nemo-evaluator-launcher info <job_id> --copy-artifacts [DIR] ``` ### Example (Slurm) ```text nemo-evaluator-launcher info <job_id> Job <invocation_id>.0 ├── Executor: slurm ├── Created: <timestamp> ├── Task: <task_name> ├── Artifacts: user@host:/shared/.../<invocation_id>/task_name/artifacts (remote) │ └── Key files: │ ├── results.yml - Benchmark scores, task results and resolved run configuration. │ ├── eval_factory_metrics.json - Response + runtime stats (latency, token counts, memory) │ ├── metrics.json - Harness/benchmark metrics and configuration │ ├── report.html - Request-response pair samples in HTML format (if enabled) │ ├── report.json - Report data in JSON format (if enabled) ├── Logs: user@host:/shared/.../<invocation_id>/task_name/logs (remote) │ └── Key files: │ ├── client-{SLURM_JOB_ID}.out - Evaluation container/process output │ ├── slurm-{SLURM_JOB_ID}.out - SLURM scheduler stdout/stderr (batch submission, export steps) │ ├── server-{SLURM_JOB_ID}.out - Model server logs when a deployment is used ├── Slurm Job ID: <SLURM_JOB_ID> ``` ## kill - Kill Jobs Stop running evaluations. ### Kill Basic Usage ```bash # Kill entire invocation nemo-evaluator-launcher kill abc12345 # Kill specific job nemo-evaluator-launcher kill abc12345.0 ``` The command outputs JSON with the results of the kill operation. ## ls - List Resources List available tasks or runs.
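The relative `--since` periods accepted by `ls runs` (such as `2d` or `6h`) amount to a small duration grammar. Below is a stdlib sketch of how such values could be interpreted; this is our illustration, not the launcher's parser, which also accepts absolute dates and timestamps:

```python
import re
from datetime import timedelta

def parse_since(value: str) -> timedelta:
    """Interpret relative periods like '2d' (days) or '6h' (hours).

    Illustrative only; the real `ls runs --since` also takes dates.
    """
    match = re.fullmatch(r"(\d+)([dh])", value)
    if match is None:
        raise ValueError(f"unsupported period: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(days=amount) if unit == "d" else timedelta(hours=amount)

print(parse_since("2d"))  # → 2 days, 0:00:00
print(parse_since("6h"))  # → 6:00:00
```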
### List Tasks ```bash # List all available evaluation tasks nemo-evaluator-launcher ls tasks # List tasks with JSON output nemo-evaluator-launcher ls tasks --json ``` **Output Format:** Tasks are displayed grouped by harness and container, showing the task name and required endpoint type: ```text =================================================== harness: lm_eval container: nvcr.io/nvidia/nemo:24.01 task endpoint_type --------------------------------------------------- arc_challenge chat hellaswag completions winogrande completions --------------------------------------------------- 3 tasks available =================================================== ``` ### List Runs ```bash # List recent evaluation runs nemo-evaluator-launcher ls runs # Limit number of results nemo-evaluator-launcher ls runs --limit 10 # Filter by executor nemo-evaluator-launcher ls runs --executor local # Filter by date nemo-evaluator-launcher ls runs --since "2024-01-01" nemo-evaluator-launcher ls runs --since "2024-01-01T12:00:00" # Filter by retrospective period # - days nemo-evaluator-launcher ls runs --since 2d # - hours nemo-evaluator-launcher ls runs --since 6h ``` **Output Format:** ```text invocation_id earliest_job_ts num_jobs executor benchmarks abc12345 2024-01-01T10:00:00 3 local ifeval,gpqa_diamond,mbpp def67890 2024-01-02T14:30:00 2 slurm hellaswag,winogrande ``` ## export - Export Results Export evaluation results to various destinations.
### Export Basic Usage ```bash # Export to local files (JSON format) nemo-evaluator-launcher export abc12345 --dest local --format json # Export to specific directory nemo-evaluator-launcher export abc12345 --dest local --format json --output-dir ./results # Specify custom filename nemo-evaluator-launcher export abc12345 --dest local --format json --output-filename results.json ``` ### Export Options ```bash # Available destinations nemo-evaluator-launcher export abc12345 --dest local # Local file system nemo-evaluator-launcher export abc12345 --dest mlflow # MLflow tracking nemo-evaluator-launcher export abc12345 --dest wandb # Weights & Biases nemo-evaluator-launcher export abc12345 --dest gsheets # Google Sheets # Format options (for local destination only) nemo-evaluator-launcher export abc12345 --dest local --format json nemo-evaluator-launcher export abc12345 --dest local --format csv # Include logs when exporting nemo-evaluator-launcher export abc12345 --dest local --format json --copy-logs # Filter metrics by name nemo-evaluator-launcher export abc12345 --dest local --format json --log-metrics score --log-metrics accuracy # Copy all artifacts (not just required ones) nemo-evaluator-launcher export abc12345 --dest local --only-required False ``` ### Exporting Multiple Invocations ```bash # Export several runs together nemo-evaluator-launcher export abc12345 def67890 ghi11111 --dest local --format json # Export several runs with custom output nemo-evaluator-launcher export abc12345 def67890 --dest local --format csv \ --output-dir ./all-results --output-filename combined.csv ``` ### Cloud Exporters For cloud destinations like MLflow, W&B, and Google Sheets, configure credentials through environment variables or their respective configuration files before using the export command. Refer to each exporter's documentation for setup instructions. ## version - Version Information Display version and build information. 
```bash # Show version nemo-evaluator-launcher version # Alternative nemo-evaluator-launcher --version ``` ## Environment Variables The CLI respects environment variables for logging and task-specific authentication: ```{list-table} :header-rows: 1 :widths: 30 50 20 * - Variable - Description - Default * - `LOG_LEVEL` - Logging level for the launcher (DEBUG, INFO, WARNING, ERROR, CRITICAL) - `WARNING` * - `LOG_DISABLE_REDACTION` - Disable credential redaction in logs (set to 1, true, or yes) - Not set ``` ### Task-Specific Environment Variables Some evaluation tasks require API keys or tokens. These are configured in your evaluation YAML file under `env_vars` and must be set before running: ```bash # Set task-specific environment variables export HF_TOKEN="hf_..." # For Hugging Face datasets export NGC_API_KEY="nvapi-..." # For NVIDIA API endpoints # Run evaluation nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml ``` The specific environment variables required depend on the tasks and endpoints you're using. Refer to the example configuration files for details on which variables are needed. ## Configuration File Examples The NeMo Evaluator Launcher includes several example configuration files that demonstrate different use cases. These files are located in the `examples/` directory of the package: To use these examples: ```bash # Copy an example to your local directory cp examples/local_basic.yaml my_config.yaml # Edit the configuration as needed # Then run with your config nemo-evaluator-launcher run --config ./my_config.yaml ``` Refer to the {ref}`configuration documentation ` for detailed information on all available configuration options. 
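Because a run can fail late when a required environment variable is missing, a quick pre-flight check is useful. The helper below is our own illustration, not a launcher feature:

```python
import os

def missing_env_vars(required, environ=None):
    """Return names of required environment variables that are unset or
    empty. Illustrative pre-flight check, not part of the launcher."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]

# Variables assumed by the examples above; adjust to your tasks.
missing = missing_env_vars(["HF_TOKEN", "NGC_API_KEY"])
if missing:
    print("Set these before running:", ", ".join(missing))
```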
## Troubleshooting ### Configuration Issues **Configuration Errors:** ```bash # Validate configuration without running nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/my_config.yaml --dry-run ``` **Permission Errors:** ```bash # Check file permissions ls -la examples/my_config.yaml # Use absolute paths nemo-evaluator-launcher run --config /absolute/path/to/configs/my_config.yaml ``` **Network Issues:** ```bash # Test endpoint connectivity curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model": "test", "messages": [{"role": "user", "content": "Hello"}]}' ``` ### Debug Mode ```bash # Set log level to DEBUG for detailed output export LOG_LEVEL=DEBUG nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml # Or use single-letter shorthand export LOG_LEVEL=D nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml # Logs are written to ~/.nemo-evaluator/logs/ ``` ### Getting Help ```bash # Command-specific help nemo-evaluator-launcher run --help nemo-evaluator-launcher info --help nemo-evaluator-launcher ls --help nemo-evaluator-launcher export --help # General help nemo-evaluator-launcher --help ``` ## See Also - [Python API](api.md) - Programmatic interface - {ref}`gs-quickstart-launcher` - Getting started guide - {ref}`executors-overview` - Execution backends - {ref}`exporters-overview` - Export destinations ``nemo_evaluator.adapters.adapter_config`` ========================================== .. currentmodule:: nemo_evaluator.adapters.adapter_config .. automodule:: nemo_evaluator.adapters.adapter_config :members: :undoc-members: ``nemo_evaluator.adapters`` =========================== Interceptors and PostEvalHooks are an important part of the NeMo Evaluator SDK. They expand the functionality of each harness, providing a standardized way of enabling features in your evaluation runs.
Behind each interceptor and post-eval-hook stands a specific class that implements its logic. However, these classes are documented only to expose their configuration options, reflected in the ``Params`` of each class, and to indicate which classes one should inherit from. From the usage perspective, one should always use configuration classes (see :ref:`configuration`) to add them to evaluations. No interceptor should be directly instantiated. Interceptors are defined in a chain. They go under ``target.api_endpoint.adapter_config`` and can be defined as follows:: adapter_config = AdapterConfig( interceptors=[ InterceptorConfig( name="system_message", enabled=True, config={"system_message": "You are a helpful assistant."} ), InterceptorConfig(name="request_logging", enabled=True), InterceptorConfig( name="caching", enabled=True, config={"cache_dir": "./cache", "reuse_cached_responses": True} ), InterceptorConfig(name="reasoning", enabled=True), InterceptorConfig(name="endpoint") ] ) .. _configuration: Configuration -------------- .. currentmodule:: nemo_evaluator.adapters.adapter_config .. autosummary:: :nosignatures: :recursive: DiscoveryConfig InterceptorConfig PostEvalHookConfig AdapterConfig .. .. automodule:: nemo_evaluator.adapters.adapter_config .. :members: .. :undoc-members: Interceptors ------------- .. currentmodule:: nemo_evaluator.adapters.interceptors .. autosummary:: :nosignatures: :recursive: CachingInterceptor EndpointInterceptor PayloadParamsModifierInterceptor ProgressTrackingInterceptor RaiseClientErrorInterceptor RequestLoggingInterceptor ResponseLoggingInterceptor ResponseReasoningInterceptor ResponseStatsInterceptor SystemMessageInterceptor PostEvalHooks ------------- .. currentmodule:: nemo_evaluator.adapters.reports .. autosummary:: :nosignatures: :recursive: PostEvalReportHook Interfaces -------------- .. currentmodule:: nemo_evaluator.adapters.types ..
autosummary:: :nosignatures: :recursive: RequestInterceptor ResponseInterceptor RequestToResponseInterceptor PostEvalHook .. .. automodule:: nemo_evaluator.adapters .. :members: .. :undoc-members: .. toctree:: :hidden: adapter-config interceptors types .. _interceptor_reference: ``nemo_evaluator.adapters.interceptors`` ======================================== .. currentmodule:: nemo_evaluator.adapters.interceptors .. automodule:: nemo_evaluator.adapters.interceptors :members: :undoc-members: ``nemo_evaluator.adapters.types`` ================================= Interceptor Interfaces ---------------------- .. currentmodule:: nemo_evaluator.adapters.types .. automodule:: nemo_evaluator.adapters.types :members: :undoc-members: .. _modelling-inout: ``nemo_evaluator.api.api_dataclasses`` ====================================== NeMo Evaluator Core operates on strictly defined input and output, which are modeled through pydantic dataclasses. Whether you use the Python API or the CLI, the reference below serves as a map of configuration options and output format. .. currentmodule:: nemo_evaluator.api.api_dataclasses Modeling Target --------------- .. autosummary:: :nosignatures: :recursive: ApiEndpoint EndpointType EvaluationTarget Modeling Evaluation ------------------- .. autosummary:: :nosignatures: :recursive: EvaluationConfig ConfigParams Modeling Result --------------- .. autosummary:: :nosignatures: :recursive: EvaluationResult GroupResult MetricResult Score ScoreStats TaskResult .. automodule:: nemo_evaluator.api.api_dataclasses :members: :undoc-members: ``nemo_evaluator.api`` ====================================== The central point of evaluation is the ``evaluate()`` function, which takes standardized input and returns standardized output. See :ref:`modelling-inout` to learn how to instantiate standardized input and consume standardized output.
Below is an example of how one might configure and run an evaluation via the Python API:: from nemo_evaluator.core.evaluate import evaluate from nemo_evaluator.api.api_dataclasses import ( EvaluationConfig, EvaluationTarget, ConfigParams, ApiEndpoint ) # Create evaluation configuration eval_config = EvaluationConfig( type="simple_evals.mmlu_pro", output_dir="./results", params=ConfigParams( limit_samples=100, temperature=0.1 ) ) # Create target configuration target_config = EvaluationTarget( api_endpoint=ApiEndpoint( url="https://integrate.api.nvidia.com/v1/chat/completions", model_id="meta/llama-3.2-3b-instruct", type="chat", api_key="MY_API_KEY" # Name of the environment variable that stores the API key ) ) # Run evaluation result = evaluate(eval_config, target_config) .. automodule:: nemo_evaluator.api :members: :undoc-members: :member-order: bysource .. toctree:: :hidden: api-dataclasses nemo-evaluator.adapters <../adapters/adapters> nemo-evaluator.sandbox <../sandbox/index> ``nemo_evaluator.sandbox`` ====================================== Sandbox implementations used by evaluation harnesses that need a tmux-like interactive session. This module is designed to keep dependencies **optional**: - The ECS Fargate implementation only imports AWS SDKs (``boto3``/``botocore``) when actually used. - Using the ECS sandbox also requires the AWS CLI (``aws``) and ``session-manager-plugin`` on the host.
Usage (ECS Fargate) ------------------- Typical usage is: - configure :class:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateConfig` - :meth:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateSandbox.spin_up` a sandbox context - create an interactive :class:`~nemo_evaluator.sandbox.base.NemoSandboxSession` Example:: from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox cfg = EcsFargateConfig( region="us-west-2", cluster="my-ecs-cluster", task_definition="my-task-def:1", container_name="eval", subnets=["subnet-abc"], security_groups=["sg-xyz"], s3_bucket="my-staging-bucket", ) with EcsFargateSandbox.spin_up( cfg=cfg, task_id="task-001", trial_name="trial-0001", run_id="run-2026-01-12", ) as sandbox: session = sandbox.create_session("main") session.send_keys(["echo hello", "Enter"], block=True) print(session.capture_pane()) Prerequisites / Notes --------------------- - The harness host must have **AWS CLI** and **session-manager-plugin** installed. - If you use S3-based fallbacks (large uploads / long commands), configure ``s3_bucket``. .. automodule:: nemo_evaluator.sandbox :members: :undoc-members: :member-order: bysource (nemo-evaluator-cli)= # NeMo Evaluator CLI Reference (nemo-evaluator) This document provides a comprehensive reference for the `nemo-evaluator` command-line interface, which is the primary way to interact with NeMo Evaluator from the terminal. ## Prerequisites - **Container way**: Use evaluation containers mentioned in {ref}`nemo-evaluator-containers` - **Package way**: ```bash pip install nemo-evaluator ``` To run evaluations, you also need to install an evaluation framework package (for example, `nvidia-simple-evals`): ```bash pip install nvidia-simple-evals ``` ## Overview The CLI provides a unified interface for managing evaluations and frameworks. It's built on top of the Python API and provides full feature parity with it. 
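That parity with the Python API also makes the CLI easy to drive from scripts: an argument vector can be assembled from a plain dict and handed to `subprocess.run`. The helper below is illustrative glue code of ours, not something shipped with the package:

```python
import subprocess

def build_run_eval_argv(options: dict) -> list:
    """Turn an options dict into a `nemo-evaluator run_eval` argument
    vector. Illustrative only; flag names mirror the options below."""
    argv = ["nemo-evaluator", "run_eval"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv

argv = build_run_eval_argv({
    "eval_type": "mmlu_pro",
    "model_id": "meta/llama-3.2-3b-instruct",
    "model_type": "chat",
    "output_dir": "./results",
})
# subprocess.run(argv, check=True)  # uncomment to actually launch
```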
## Command Structure ```bash nemo-evaluator [command] [options] ``` ## Available Commands ### `ls` - List Available Evaluations List all available evaluation types and frameworks. ```bash nemo-evaluator ls ``` **Output Example:** ``` nvidia-simple-evals: * mmlu_pro ... human_eval: * human_eval ``` ### `run_eval` - Run Evaluation Execute an evaluation with the specified configuration. ```bash nemo-evaluator run_eval [options] ``` To see the list of options, run: ```bash nemo-evaluator run_eval --help ``` **Required Options:** - `--eval_type`: Type of evaluation to run - `--model_id`: Model identifier - `--model_url`: API endpoint URL - `--model_type`: Endpoint type (chat, completions, vlm, embedding) - `--output_dir`: Output directory for results **Optional Options:** - `--api_key_name`: Environment variable name for API key - `--run_config`: Path to YAML configuration file - `--overrides`: Comma-separated parameter overrides - `--dry_run`: Show configuration without running - `--debug`: Enable debug mode (deprecated, use NEMO_EVALUATOR_LOG_LEVEL) **Example Usage:** ```bash # Basic evaluation nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.2-3b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir ./results # With parameter overrides nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.2-3b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir ./results \ --overrides "config.params.limit_samples=100,config.params.temperature=0.1" # Dry run to see configuration nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.2-3b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir ./results \ --dry_run ``` For execution with run configuration: ```bash 
# Using YAML configuration file nemo-evaluator run_eval \ --eval_type mmlu_pro \ --output_dir ./results \ --run_config ./config/eval_config.yml ``` To check the structure of the run configuration, refer to the [Run Configuration](#run-configuration) section below. (run-configuration)= ## Run Configuration Run configurations are stored in YAML files with the following structure: ```yaml config: type: mmlu_pro params: limit_samples: 10 target: api_endpoint: url: https://integrate.api.nvidia.com/v1/chat/completions model_id: meta/llama-3.2-3b-instruct type: chat api_key: MY_API_KEY adapter_config: interceptors: - name: "request_logging" - name: "caching" enabled: true config: cache_dir: "./cache" - name: "endpoint" - name: "response_logging" enabled: true config: output_dir: "./cache/responses" ``` Run configurations can be specified in YAML files and executed with the following syntax: ```bash nemo-evaluator run_eval \ --run_config config.yml \ --output_dir `mktemp -d` ``` (parameter-overrides)= ## Parameter Overrides Parameter overrides use a dot-notation format to specify configuration paths: ```bash # Basic parameter overrides --overrides "config.params.limit_samples=100,config.params.temperature=0.1" # Adapter configuration overrides --overrides "target.api_endpoint.adapter_config.interceptors.0.config.output_dir=./logs" # Multiple complex overrides --overrides "config.params.limit_samples=100,config.params.max_tokens=512,target.api_endpoint.adapter_config.use_caching=true" ``` ### Override Format ``` section.subsection.parameter=value ``` **Examples:** - `config.params.limit_samples=100` - `target.api_endpoint.adapter_config.use_caching=true` ## Error Handling ### Debug Mode Enable debug mode for detailed error information: ```bash # Set environment variable (recommended) export NEMO_EVALUATOR_LOG_LEVEL=DEBUG # Or use deprecated debug flag nemo-evaluator run_eval --debug [options] ``` ## Examples ### Complete Evaluation Workflow ```bash # 1.
List available evaluations nemo-evaluator ls # 2. Run evaluation nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.2-3b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir ./results \ --overrides "config.params.limit_samples=100" # 3. Show results ls -la ./results/ ``` ### Batch Evaluation Script ```bash #!/bin/bash # Batch evaluation script models=("meta/llama-3.1-8b-instruct" "meta/llama-3.1-70b-instruct") eval_types=("mmlu_pro" "gsm8k") for model in "${models[@]}"; do for eval_type in "${eval_types[@]}"; do echo "Running $eval_type on $model..." nemo-evaluator run_eval \ --eval_type "$eval_type" \ --model_id "$model" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir "./results/${model//\//_}_${eval_type}" \ --overrides "config.params.limit_samples=50" echo "Completed $eval_type on $model" done done echo "All evaluations completed!" ``` ### Framework Development ```bash # Setup new framework nemo-evaluator-example my_custom_eval . # This creates the basic structure: # core_evals/my_custom_eval/ # ├── framework.yml # ├── output.py # └── __init__.py # Edit framework.yml to configure your evaluation # Edit output.py to implement result parsing # Test your framework nemo-evaluator run_eval \ --eval_type my_custom_eval \ --model_id "test-model" \ --model_url "https://api.example.com/v1/chat/completions" \ --model_type chat \ --api_key_name MY_API_KEY \ --output_dir ./results ``` ## Framework Setup Command ### `nemo-evaluator-example` - Setup Framework Set up NVIDIA framework files in a destination folder. 
```bash nemo-evaluator-example [package_name] [destination] ``` **Arguments:** - `package_name`: Python package-like name for the framework - `destination`: Destination folder where to create framework files **Example Usage:** ```bash # Setup framework in specific directory nemo-evaluator-example my_package /path/to/destination # Setup framework in current directory nemo-evaluator-example my_package . ``` **What it creates:** - `core_evals/my_package/framework.yml` - Framework configuration - `core_evals/my_package/output.py` - Output parsing logic - `core_evals/my_package/__init__.py` - Package initialization ## Environment Variables ### Logging Configuration ```bash # Set log level (recommended over --debug flag) export NEMO_EVALUATOR_LOG_LEVEL=DEBUG ``` --- orphan: true --- (evaluation-utils-reference)= # Evaluation Utilities Reference Complete reference for evaluation discovery and utility functions in NeMo Evaluator. ## nemo_evaluator.show_available_tasks() Discovers and displays all available evaluation tasks across installed evaluation frameworks. ### Function Signature ```python def show_available_tasks() -> None ``` ### Returns | Type | Description | |------|-------------| | `None` | Prints available tasks to stdout | ### Description This function scans all installed `core_evals` packages and prints a hierarchical list of available evaluation tasks organized by framework. Use this function to discover which benchmarks and tasks are available in your environment. 
The function automatically detects: - **Installed frameworks**: lm-evaluation-harness, simple-evals, bigcode, BFCL - **Available tasks**: All tasks defined in each framework's configuration - **Installation status**: Displays message if no evaluation packages are installed ### Usage Examples #### Basic Task Discovery ```python from nemo_evaluator import show_available_tasks # Display all available evaluations show_available_tasks() # Example output: # lm-evaluation-harness: # * mmlu # * gsm8k # * arc_challenge # * hellaswag # simple-evals: # * AIME_2025 # * humaneval # * drop # bigcode: # * mbpp # * humaneval # * apps ``` #### Programmatic Task Discovery For programmatic access to task information, use the launcher API: ```python from nemo_evaluator_launcher.api.functional import get_tasks_list # Get structured task information tasks = get_tasks_list() for task in tasks: task_name, endpoint_type, harness, container = task print(f"Task: {task_name}, Type: {endpoint_type}, Framework: {harness}") ``` To filter tasks using the CLI: ```bash # List all tasks nemo-evaluator-launcher ls tasks # Filter for specific tasks nemo-evaluator-launcher ls tasks | grep mmlu ``` #### Check Installation Status ```python from nemo_evaluator import show_available_tasks # Check if evaluation packages are installed print("Available evaluation frameworks:") show_available_tasks() # If no packages installed, you'll see: # NO evaluation packages are installed. 
``` ### Installation Requirements To use this function, install evaluation framework packages: ```bash # Install all frameworks pip install nvidia-lm-eval nvidia-simple-evals nvidia-bigcode-eval nvidia-bfcl # Or install selectively pip install nvidia-lm-eval # LM Evaluation Harness pip install nvidia-simple-evals # Simple Evals pip install nvidia-bigcode-eval # BigCode benchmarks pip install nvidia-bfcl # Berkeley Function Calling Leaderboard ``` ### Error Handling The function handles missing packages: ```python from nemo_evaluator import show_available_tasks # Safely check for available tasks try: show_available_tasks() except ImportError as e: print(f"Error: {e}") print("Install evaluation frameworks: pip install nvidia-lm-eval") ``` --- ## Integration with Evaluation Workflows ### Pre-Flight Task Verification Verify task availability before running evaluations: ```python from nemo_evaluator_launcher.api.functional import get_tasks_list def verify_task_available(task_name: str) -> bool: """Check if a specific task is available.""" tasks = get_tasks_list() return any(task[0] == task_name for task in tasks) # Usage if verify_task_available("mmlu"): print("✓ MMLU is available") else: print("✗ MMLU not found. 
Install evaluation framework packages") ``` ### Filter Tasks by Endpoint Type Use task discovery to filter by endpoint type: ```python from nemo_evaluator_launcher.api.functional import get_tasks_list # Get all chat endpoint tasks tasks = get_tasks_list() chat_tasks = [task[0] for task in tasks if task[1] == "chat"] completions_tasks = [task[0] for task in tasks if task[1] == "completions"] print(f"Chat tasks: {chat_tasks[:5]}") # Show first five print(f"Completions tasks: {completions_tasks[:5]}") ``` ### Framework Selection When a task is provided by more than one framework, use explicit framework specification in your configuration: ```python from nemo_evaluator.api.api_dataclasses import EvaluationConfig, ConfigParams # Explicit framework specification config = EvaluationConfig( type="lm-evaluation-harness.mmlu", # Instead of just "mmlu" params=ConfigParams(task="mmlu") ) ``` --- ## Troubleshooting ### Problem: "NO evaluation packages are installed" **Solution**: ```bash # Install evaluation frameworks pip install nvidia-lm-eval nvidia-simple-evals nvidia-bigcode-eval nvidia-bfcl # Verify installation python -c "from nemo_evaluator import show_available_tasks; show_available_tasks()" ``` ### Problem: Task not appearing in list **Solution**: ```bash # Install the required framework package pip install nvidia-lm-eval # Verify installation python -c "from nemo_evaluator import show_available_tasks; show_available_tasks()" ``` ### Problem: Task conflicts between frameworks When a task name is provided by more than one framework (for example, both `lm-evaluation-harness` and `simple-evals` provide `mmlu`), use explicit framework specification: **Solution**: ```bash # Use explicit framework.task format in your configuration overrides nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o 'evaluation.tasks=["lm-evaluation-harness.mmlu"]' ``` --- ## Related Functions ### NeMo Evaluator Launcher API For programmatic access 
with structured results: ```python from nemo_evaluator_launcher.api.functional import get_tasks_list # Returns list of tuples: (task_name, endpoint_type, framework, container) tasks = get_tasks_list() ``` ### CLI Commands ```bash # List all tasks nemo-evaluator-launcher ls tasks # List recent evaluation runs nemo-evaluator-launcher ls runs # Get detailed help nemo-evaluator-launcher --help ``` --- **Source**: `packages/nemo-evaluator/src/nemo_evaluator/core/entrypoint.py:105-123` **API Export**: `nemo_evaluator/__init__.py` exports `show_available_tasks` for public use **Related**: See {ref}`gs-quickstart` for evaluation setup and {ref}`eval-benchmarks` for task descriptions # Frequently Asked Questions ## **What benchmarks and harnesses are supported?** The docs list hundreds of benchmarks across multiple harnesses, available via curated NGC evaluation containers and the unified Launcher. Reference: {ref}`eval-benchmarks` :::{tip} Discover available tasks with ```bash nemo-evaluator-launcher ls tasks ``` ::: --- ## **How do I set log dir and verbose logging?** Set these environment variables for logging configuration: ```bash # Set log level (INFO, DEBUG, WARNING, ERROR, CRITICAL) export LOG_LEVEL=DEBUG # or (legacy, still supported) export NEMO_EVALUATOR_LOG_LEVEL=DEBUG ``` Reference: {ref}`nemo-evaluator-logging`. --- ## **Can I run distributed or on a scheduler?** Yes. Launcher supports multiple executors. For optimal performance, the SLURM executor is recommended. It schedules and executes jobs across cluster nodes, enabling parallel, large‑scale evaluation runs while preserving reproducibility via containerized benchmarks. See {ref}`executor-slurm` for details. --- ## **Can I point Evaluator at my own endpoint?** Yes. Provide your OpenAI‑compatible endpoint. The "none" deployment option means no model deployment is performed as part of the evaluation job. Instead, you provide an existing OpenAI-compatible endpoint. 
The launcher handles running evaluation tasks while connecting to your existing endpoint.

```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct                # Model identifier (required)
    url: https://your-endpoint.com/v1/chat/completions  # Endpoint URL (required)
    api_key_name: API_KEY                               # Environment variable name (recommended)
```

Reference: {ref}`deployment-none`.

---

## **Can I test my endpoint for OpenAI compatibility?**

Yes. Preview the full resolved configuration without executing using `--dry-run`:

```bash
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run
```

Reference: {ref}`launcher-cli-dry-run`.

---

## **Can I store and retrieve per-sample results, not just the summary?**

Yes. Capture full request/response artifacts and retrieve them from the run's artifacts folder. Enable detailed logging with `nemo_evaluator_config`:

```yaml
evaluation:
  # Request + response logging (example at 1k each)
  nemo_evaluator_config:
    target:
      api_endpoint:
        adapter_config:
          use_request_logging: True
          max_saved_requests: 1000
          use_response_logging: True
          max_saved_responses: 1000
```

These enable the **RequestLoggingInterceptor** and **ResponseLoggingInterceptor** so each prompt/response pair is saved alongside the evaluation job.

Retrieve artifacts after the run:

```bash
nemo-evaluator-launcher export --dest local --output-dir ./artifacts --copy-logs
```

Look under `./artifacts/` for `results.yml`, reports, logs, and saved request/response files.

Reference: {ref}`interceptor-request-logging`.

---

## **Where do I find evaluation results?**

After a run completes, copy artifacts locally:

```bash
nemo-evaluator-launcher info <invocation_id> --copy-artifacts ./artifacts
```

Inside `./artifacts/` you'll see the run config, `results.yml` (main output file), HTML/JSON reports, logs, and cached request/response files, if caching was used.
The output directory is structured as follows:

```bash
<output_dir>/
├── eval_factory_metrics.json
├── report.html
├── report.json
├── results.yml
├── run_config.yml
└── <task_name>/
```

Reference: {ref}`evaluation-output`.

---

## **Can I export a consolidated JSON of scores?**

Yes. JSON is included in the standard output exporter, along with automatic exporters for MLflow, Weights & Biases, and Google Sheets.

```bash
nemo-evaluator-launcher export --dest local --format json
```

This creates `processed_results.json` (you can also pass multiple invocation IDs to merge).

**Exporter docs:** Local files, W&B, MLflow, GSheets are listed under **Launcher → Exporters** in the docs.

Reference: {ref}`exporters-overview`.

---

## **What's the difference between Launcher and Core?**

* **Launcher (`nemo-evaluator-launcher`)**: Unified CLI with config/exec backends (local/Slurm/Lepton), container orchestration, and exporters. Best for most users. See {ref}`lib-launcher`.
* **Core (`nemo-evaluator`)**: Direct access to the evaluation engine and adapters—useful for custom programmatic pipelines and advanced interceptor use. See {ref}`lib-core`.

---

## **Can I add a new benchmark?**

Yes. Use a **Framework Definition File (FDF)**—a YAML that declares framework metadata, default commands/params, and one or more evaluation tasks. Minimal flow:

1. Create an FDF with `framework`, `defaults`, and `evaluations` sections.
2. Point the launcher/Core at your FDF and run.
3. (Recommended) Package as a container for reproducibility and shareability. See {ref}`extending-evaluator`.

**Skeleton FDF (excerpt):**

```yaml
framework:
  name: my-custom-eval
  pkg_name: my_custom_eval
defaults:
  command: >-
    my-eval-cli --model {{target.api_endpoint.model_id}}
    --task {{config.params.task}} --output {{config.output_dir}}
evaluations:
  - name: my_task_1
    defaults:
      config:
        params:
          task: my_task_1
```

See the "Framework Definition File (FDF)" page for the full example and field reference.

Reference: {ref}`framework-definition-file`.
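The `{{...}}` placeholders in the FDF `command` template are filled from the resolved run configuration. As a rough illustration of that substitution (not the launcher's actual templating code; `render_command` and the sample context below are hypothetical), a dotted-path resolver could look like:

```python
import re

def render_command(template: str, context: dict) -> str:
    """Replace {{dotted.path}} placeholders with values from a nested dict.

    Illustrative sketch only; the real launcher uses its own templating.
    """
    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).strip().split("."):
            value = value[key]  # descend one level per dotted segment
        return str(value)
    return re.sub(r"\{\{\s*([^}\s]+)\s*\}\}", lookup, template)

command_tpl = ("my-eval-cli --model {{target.api_endpoint.model_id}} "
               "--task {{config.params.task}} --output {{config.output_dir}}")
context = {
    "target": {"api_endpoint": {"model_id": "meta/llama-3.1-8b-instruct"}},
    "config": {"params": {"task": "my_task_1"}, "output_dir": "./results"},
}
print(render_command(command_tpl, context))
# my-eval-cli --model meta/llama-3.1-8b-instruct --task my_task_1 --output ./results
```

The same dotted paths appear in CLI overrides (`-o target.api_endpoint.model_id=...`), so the template keys mirror the configuration tree.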
--- ## **Why aren't exporters included in the main wheel?** Exporters target **external systems** (e.g., W&B, MLflow, Google Sheets). Each of those adds heavy/optional dependencies and auth integrations. To keep the base install lightweight and avoid forcing unused deps on every user, exporters ship as **optional extras**: ```bash # Only what you need pip install "nemo-evaluator-launcher[wandb]" pip install "nemo-evaluator-launcher[mlflow]" pip install "nemo-evaluator-launcher[gsheets]" # Or everything pip install "nemo-evaluator-launcher[all]" ``` **Exporter docs:** Local files, W&B, MLflow, GSheets are listed under {ref}`exporters-overview`. --- ## **How is input configuration managed?** NeMo Evaluator uses **Hydra** for configuration management, allowing flexible composition, inheritance, and command-line overrides. Each evaluation is defined by a YAML configuration file that includes four primary sections: ```yaml defaults: - execution: local - deployment: none - _self_ execution: output_dir: results target: api_endpoint: model_id: meta/llama-3.2-3b-instruct url: https://integrate.api.nvidia.com/v1/chat/completions api_key_name: NGC_API_KEY evaluation: - name: gpqa_diamond - name: ifeval ``` This structure defines **where to run**, **how to serve the model**, **which model or endpoint to evaluate**, and **what benchmarks to execute**. You can start from a provided example config or compose your own using Hydra's `defaults` list to combine deployment, execution, and benchmark modules. Reference: {ref}`configuration-overview`. --- ## **Can I customize or override configuration values?** Yes. 
You can override any field in the YAML file directly from the command line using the `-o` flag: ```bash # Override output directory nemo-evaluator-launcher run --config your_config.yaml \ -o execution.output_dir=my_results # Override multiple fields nemo-evaluator-launcher run --config your_config.yaml \ -o target.api_endpoint.url="https://new-endpoint.com/v1/chat/completions" \ -o target.api_endpoint.model_id=openai/gpt-4o ``` Overrides are merged dynamically at runtime—ideal for testing new endpoints, swapping models, or changing output destinations without editing your base config. :::{tip} Always start with a dry run to validate your configuration before launching a full evaluation: ```bash nemo-evaluator-launcher run --config your_config.yaml --dry-run ``` ::: Reference: {ref}`configuration-overview`. --- ## **How do I choose the right deployment and execution configuration?** NeMo Evaluator separates **deployment** (how your model is served) from **execution** (where your evaluations are run). These are configured in the `defaults` section of your YAML file: ```yaml defaults: - execution: local # Where to run: local, lepton, or slurm - deployment: none # How to serve the model: none, vllm, sglang, nim, trtllm, generic ``` **Deployment Options — How your model is served** | Option | Description | Best for | | ----- | ----- | ----- | | `none` | Uses an existing API endpoint (e.g., NVIDIA API Catalog, OpenAI, Anthropic). No deployment needed. | External APIs or already-deployed services | | `vllm` | High-performance inference server for LLMs with tensor parallelism and caching. | Fast local/cluster inference, production workloads | | `sglang` | Lightweight structured generation server optimized for throughput. | Evaluating structured or long-form text generation | | `nim` | NVIDIA Inference Microservice (NIM) – optimized for enterprise-grade serving with autoscaling and telemetry. 
| Enterprise, production, and reproducible benchmarks | | `trtllm` | TensorRT-LLM backend using GPU-optimized kernels. | Lowest latency and highest GPU efficiency | | `generic` | Use a custom serving stack of your choice. | Custom frameworks or experimental endpoints | **Execution Platforms — Where evaluations run** | Platform | Description | Use case | | ----- | ----- | ----- | | `local` | Runs Docker-based evaluation locally. | Development, testing, or small-scale benchmarking | | `lepton` | Runs on NVIDIA Lepton for on-demand GPU execution. | Scalable, production-grade evaluations | | `slurm` | Uses your HPC cluster's job scheduler. | Research clusters or large batch evaluations | **Example:** ```yaml defaults: - execution: lepton - deployment: vllm ``` This configuration launches the model with **vLLM serving** and runs benchmarks remotely on **Lepton GPUs**. When in doubt: * Use `deployment: none` + `execution: local` for your **first run** (quickest setup). * Use `vllm` or `nim` once you need **scalability and speed**. Always test first: ```bash nemo-evaluator-launcher run --config your_config.yaml --dry-run ``` Reference: {ref}`configuration-overview`. --- ## **Can I use Evaluator without internet access?** Yes. NeMo Evaluator uses datasets and model checkpoints from [Hugging Face Hub](https://huggingface.co/docs/hub/en/index). If a requested dataset or model is not available locally, it is downloaded from the Hub at runtime. When working in an environment without internet access, configure a cache directory and pre-populate it with all required data before launching the evaluation. 
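As an illustrative sketch of that offline setup (the cache path is a placeholder), the standard Hugging Face environment variables can pin the cache location and forbid runtime downloads so missing data fails fast instead of reaching for the network:

```python
import os

# Placeholder path; point HF_HOME at the pre-populated cache directory
# that is mounted into the evaluation job.
os.environ["HF_HOME"] = "/shared/hf_cache"

# With HF_HUB_OFFLINE=1, the Hugging Face hub client raises an error
# instead of attempting a download when something is not cached.
os.environ["HF_HUB_OFFLINE"] = "1"

print(os.environ["HF_HOME"], os.environ["HF_HUB_OFFLINE"])
```

In practice you would export these in the job environment (or via the config's mounts/env settings) rather than in Python.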
See the [example configuration](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator-launcher/examples/slurm_vllm_advanced_hf_caching.yaml) with HF caching: ```{literalinclude} ../../packages/nemo-evaluator-launcher/examples/slurm_vllm_advanced_hf_caching.yaml :language: yaml :start-after: "# [docs-start-snippet]" :end-before: "# [docs-end-snippet]" ``` Modify the example with actual paths for the mounts and run: ```bash nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_vllm_advanced_hf_caching.yaml ``` --- orphan: true --- (troubleshooting-index)= # Troubleshooting Comprehensive troubleshooting guide for {{ product_name_short }} evaluations, organized by problem type and complexity level. This section provides systematic approaches to diagnose and resolve evaluation issues. Start with the quick diagnostics below to verify your basic setup, then navigate to the appropriate troubleshooting category based on where your issue occurs in the evaluation workflow. 
## Quick Start Before diving into specific problem areas, run these basic checks to verify your evaluation environment: ::::{tab-set} :::{tab-item} Launcher Quick Check ```bash # Verify launcher installation and basic functionality nemo-evaluator-launcher --version # List available tasks nemo-evaluator-launcher ls tasks # Validate configuration without running nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run # Check recent runs nemo-evaluator-launcher ls runs ``` ::: :::{tab-item} Model Endpoint Check ```python import requests # Check health endpoint (adjust based on your deployment) # vLLM/SGLang/NIM: use /health # NeMo/Triton: use /v1/triton_health health_response = requests.get("http://0.0.0.0:8080/health", timeout=5) print(f"Health Status: {health_response.status_code}") # Test completions endpoint test_payload = { "prompt": "Hello", "model": "megatron_model", "max_tokens": 5 } response = requests.post("http://0.0.0.0:8080/v1/completions/", json=test_payload) print(f"Completions Status: {response.status_code}") ``` ::: :::{tab-item} Core API Check ```python from nemo_evaluator import show_available_tasks try: print("Available frameworks and tasks:") show_available_tasks() except ImportError as e: print(f"Missing dependency: {e}") ``` ::: :::: ## Troubleshooting Categories Choose the category that best matches your issue for targeted solutions and debugging steps. ::::{grid} 1 1 1 1 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Setup & Installation :link: setup-issues/index :link-type: doc Installation problems, authentication setup, and model deployment issues to get {{ product_name_short }} running. ::: :::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Runtime & Execution :link: runtime-issues/index :link-type: doc Configuration validation and launcher management during evaluation execution. ::: :::: ## Getting Help ### Log Collection When reporting issues, include: 1. 
System Information:

```bash
python --version
pip list | grep nvidia
nvidia-smi
```

2. Configuration Details:

```python
print(f"Task: {eval_cfg.type}")
print(f"Endpoint: {target_cfg.api_endpoint.url}")
print(f"Model: {target_cfg.api_endpoint.model_id}")
```

3. Error Messages: Full stack traces and error logs

### Community Resources

- **GitHub Issues**: [{{ product_name_short }} Issues](https://github.com/NVIDIA-NeMo/Eval/issues)
- **Discussions**: [GitHub Discussions](https://github.com/NVIDIA-NeMo/Eval/discussions)
- **Documentation**: {ref}`template-home`

### Professional Support

For enterprise support, contact: [nemo-toolkit@nvidia.com](mailto:nemo-toolkit@nvidia.com)

(configuration-issues)=
# Configuration Issues

Solutions for configuration parameters, tokenizer setup, and endpoint configuration problems.

## Log-Probability Evaluation Issues

### Problem: Log-probability evaluation fails

**Required Configuration**:

```python
from nemo_evaluator import EvaluationConfig, ConfigParams

config = EvaluationConfig(
    type="arc_challenge",
    params=ConfigParams(
        extra={
            "tokenizer": "/path/to/checkpoint/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface"
        }
    )
)
```

**Common Issues**:
- Missing tokenizer path
- Incorrect tokenizer backend
- Tokenizer version mismatch

### Tokenizer Configuration

**Verify Tokenizer Path**:

```python
import os

tokenizer_path = "/path/to/checkpoint/context/nemo_tokenizer"
if os.path.exists(tokenizer_path):
    print("Tokenizer path exists")
else:
    print("Tokenizer path not found")  # Check alternative locations
```

## Chat vs. Completions Configuration

Before troubleshooting endpoint issues, verify your endpoint supports the required OpenAI API format using our {ref}`deployment-testing-compatibility` guide.
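As a quick, stdlib-only sketch (the base URL and model name are placeholders; adapt them to your deployment), you can probe both routes and compare status codes to see which format your endpoint accepts before configuring the evaluation:

```python
import json
import urllib.error
import urllib.request

BASE = "http://0.0.0.0:8080"  # placeholder; point at your own endpoint

def probe(path: str, payload: dict) -> int:
    """Return the HTTP status for a minimal POST, or -1 if unreachable."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # reachable, but this route rejected the request
    except (urllib.error.URLError, OSError):
        return -1  # endpoint unreachable

completions = probe(
    "/v1/completions/",
    {"model": "megatron_model", "prompt": "Hi", "max_tokens": 1},
)
chat = probe(
    "/v1/chat/completions/",
    {"model": "megatron_model", "max_tokens": 1,
     "messages": [{"role": "user", "content": "Hi"}]},
)
print(f"completions: {completions}, chat: {chat}")
```

A 200 on one route and a 4xx on the other is a strong hint about which `EndpointType` to configure; two failures usually mean the server is not running or the URL is wrong.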
### Problem: Chat evaluation fails with base model :::{admonition} Issue :class: error Base models don't have chat templates ::: :::{admonition} Solution :class: tip Use completions endpoint instead: ```python from nemo_evaluator import ApiEndpoint, EvaluationConfig, EndpointType # Change from chat to completions api_endpoint = ApiEndpoint( url="http://0.0.0.0:8080/v1/completions/", type=EndpointType.COMPLETIONS ) # Use completion-based tasks config = EvaluationConfig(type="mmlu") ``` ::: ### Endpoint Configuration Examples **For Completions (Base Models)**: ```python from nemo_evaluator import EvaluationTarget, ApiEndpoint, EndpointType target_cfg = EvaluationTarget( api_endpoint=ApiEndpoint( url="http://0.0.0.0:8080/v1/completions/", type=EndpointType.COMPLETIONS, model_id="megatron_model" ) ) ``` **For Chat (Instruct Models)**: ```python from nemo_evaluator import EvaluationTarget, ApiEndpoint, EndpointType target_cfg = EvaluationTarget( api_endpoint=ApiEndpoint( url="http://0.0.0.0:8080/v1/chat/completions/", type=EndpointType.CHAT, model_id="megatron_model" ) ) ``` ## Timeout and Parallelism Issues ### Problem: Evaluation hangs, times out or crashes with "Too many requests" error **Diagnosis**: - Check `parallelism` setting (start with 1) - Monitor resource usage - Verify network connectivity **Solutions**: ```python from nemo_evaluator import ConfigParams # Reduce concurrency params = ConfigParams( parallelism=1, # Start with single-threaded limit_samples=10, # Test with small sample request_timeout=600 # Increase timeout for large models (seconds) ) ``` ## Configuration Validation ### Pre-Evaluation Checks ```python from nemo_evaluator import show_available_tasks # Verify task exists print("Available tasks:") show_available_tasks() # Test endpoint connectivity with curl before running evaluation: # curl -X POST http://0.0.0.0:8080/v1/completions/ \ # -H "Content-Type: application/json" \ # -d '{"prompt": "test", "model": "megatron_model", "max_tokens": 1}' 
``` ### Common Configuration Issues - Wrong endpoint type (using `EndpointType.CHAT` for base models or `EndpointType.COMPLETIONS` for instruct models) - Missing tokenizer (log-probability tasks require explicit tokenizer configuration in `params.extra`) - High parallelism (starting with `parallelism > 1` can mask underlying issues; use `parallelism=1` for initial debugging) - Incorrect model ID (model ID must match what the deployment expects) - Missing output directory (ensure output path exists and is writable) ### Task-Specific Configuration **MMLU (Choice-Based)**: ```python from nemo_evaluator import EvaluationConfig, ConfigParams config = EvaluationConfig( type="mmlu", params=ConfigParams( extra={ "tokenizer": "/path/to/tokenizer", "tokenizer_backend": "huggingface" } ) ) ``` **Generation Tasks**: ```python from nemo_evaluator import EvaluationConfig, ConfigParams config = EvaluationConfig( type="hellaswag", params=ConfigParams( max_new_tokens=100, limit_samples=50 ) ) ``` --- orphan: true --- # Runtime and Execution Issues Solutions for problems that occur during evaluation execution, including configuration validation and launcher management. 
## Common Runtime Problems When evaluations fail during execution, start with these diagnostic steps: ::::{tab-set} :::{tab-item} Configuration Check ```bash # Validate configuration before running nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run # Test minimal configuration python -c " from nemo_evaluator import EvaluationConfig, ConfigParams config = EvaluationConfig(type='mmlu', params=ConfigParams(limit_samples=1)) print('Configuration valid') " ``` ::: :::{tab-item} Endpoint Test ```python import requests # Test model endpoint connectivity response = requests.post( "http://0.0.0.0:8080/v1/completions/", json={"prompt": "test", "model": "megatron_model", "max_tokens": 1} ) print(f"Endpoint status: {response.status_code}") ``` ::: :::{tab-item} Resource Monitor ```bash # Monitor system resources during evaluation nvidia-smi -l 1 # GPU usage htop # CPU/Memory usage ``` ::: :::: ## Runtime Categories Choose the category that matches your runtime issue: ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 :::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration Issues :link: configuration :link-type: doc Config parameter validation, tokenizer setup, and endpoint configuration problems. ::: :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Launcher Issues :link: launcher :link-type: doc NeMo Evaluator Launcher-specific problems including job management and multi-backend execution. ::: :::: :::{toctree} :caption: Runtime Issues :hidden: Configuration Launcher ::: # Launcher Issues Troubleshooting guide for NeMo Evaluator Launcher-specific problems including configuration validation, job management, and multi-backend execution issues. 
## Configuration Issues ### Configuration Validation Errors **Problem**: Configuration fails validation before execution **Solution**: Use dry-run to validate configuration: ```bash # Validate configuration without running nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run ``` **Common Issues**: ::::{dropdown} Missing Required Fields :icon: code-square ``` Error: Missing required field 'execution.output_dir' ``` **Fix**: Add output directory to config or override: ```bash nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml \ -o execution.output_dir=./results ``` :::: ::::{dropdown} Invalid Task Names :icon: code-square ``` Error: Unknown task 'invalid_task'. Available tasks: hellaswag, arc_challenge, ... ``` **Fix**: List available tasks and use correct names: ```bash nemo-evaluator-launcher ls tasks ``` :::: ::::{dropdown} Configuration Conflicts :icon: code-square ``` Error: Cannot specify both 'api_key' and 'api_key_name' in target.api_endpoint ``` **Fix**: Use only one authentication method in configuration. :::: ### Hydra Configuration Errors **Problem**: Hydra fails to resolve configuration composition **Common Errors**: ``` MissingConfigException: Cannot find primary config 'missing_config' ``` **Solutions**: 1. **Verify Config Directory**: ```bash # List available configs ls examples/ # Ensure config file exists ls examples/local_basic.yaml ``` 2. **Check Config Composition**: ```yaml # Verify defaults section in config file defaults: - execution: local - deployment: none - _self_ ``` 3. 
**Use Absolute Paths**:

```bash
nemo-evaluator-launcher run --config /absolute/path/to/configs/my_config.yaml
```

## Job Management Issues

### Job Status Problems

**Problem**: Cannot check job status or jobs appear stuck

**Diagnosis**:

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# Check specific job
nemo-evaluator-launcher status <job_id>
```

**Common Issues**:

1. **Invalid Invocation ID**:

   ```
   Error: Invocation 'abc123' not found
   ```

   **Fix**: Use the correct invocation ID from the run output or list recent runs:

   ```bash
   nemo-evaluator-launcher ls runs
   ```

2. **Stale Job Database**:

   **Fix**: Check execution database location and permissions:

   ```bash
   # Database location
   ls -la ~/.nemo-evaluator/exec-db/exec.v1.jsonl
   ```

### Job Termination Issues

**Problem**: Cannot kill running jobs

**Solutions**:

```bash
# Kill entire invocation
nemo-evaluator-launcher kill <invocation_id>

# Kill specific job
nemo-evaluator-launcher kill <job_id>
```

**Executor-Specific Issues**:

- **Local**: Jobs run in Docker containers; ensure the Docker daemon is running
- **Slurm**: Check Slurm queue status with `squeue`
- **Lepton**: Verify Lepton workspace connectivity

## Multi-Backend Execution Issues

::::{dropdown} Local Executor Problems
:icon: code-square

**Problem**: Docker-related execution failures

**Common Issues**:

1. **Docker Not Running**:

   ```
   Error: Cannot connect to Docker daemon
   ```

   **Fix**: Start the Docker daemon:

   ```bash
   # macOS/Windows: Start Docker Desktop
   # Linux:
   sudo systemctl start docker
   ```

2. **Container Pull Failures**:

   ```
   Error: Failed to pull container image
   ```

   **Fix**: Check network connectivity and container registry access.
::::

::::{dropdown} Slurm Executor Problems
:icon: code-square

**Problem**: Jobs fail to submit to Slurm cluster

**Diagnosis**:

```bash
# Check Slurm cluster status
sinfo
squeue -u $USER

# Check partition availability
sinfo -p <partition_name>
```

**Common Issues**:

1.
**Invalid Partition**:

   ```
   Error: Invalid partition name 'gpu'
   ```

   **Fix**: Use the correct partition name:

   ```bash
   # List available partitions
   sinfo -s
   ```

2. **Resource Unavailable**:

   ```
   Error: Insufficient resources for job
   ```

   **Fix**: Adjust resource requirements:

   ```yaml
   execution:
     num_nodes: 1
     gpus_per_node: 2
     walltime: "2:00:00"
   ```
::::

::::{dropdown} Lepton Executor Problems
:icon: code-square

**Problem**: Lepton deployment or execution failures

**Diagnosis**:

```bash
# Check Lepton authentication
lep workspace list

# Test connection
lep deployment list
```

**Common Issues**:

1. **Authentication Failure**:

   ```
   Error: Invalid Lepton credentials
   ```

   **Fix**: Re-authenticate with Lepton:

   ```bash
   lep login -c <workspace_id>:<auth_token>
   ```

2. **Deployment Timeout**:

   ```
   Error: Deployment failed to reach Ready state
   ```

   **Fix**: Check Lepton workspace capacity and deployment status.
::::

## Export Issues

### Export Failures

**Problem**: Results export fails to destination

**Diagnosis**:

```bash
# List completed runs
nemo-evaluator-launcher ls runs

# Try export
nemo-evaluator-launcher export --dest local --format json
```

**Common Issues**:

1. **Missing Dependencies**:

   ```
   Error: MLflow not installed
   ```

   **Fix**: Install the required exporter dependencies:

   ```bash
   pip install "nemo-evaluator-launcher[mlflow]"
   ```

2. **Authentication Issues**:

   ```
   Error: Invalid W&B credentials
   ```

   **Fix**: Configure authentication for the export destination:

   ```bash
   # W&B
   wandb login
   ```

## Advanced Debugging Techniques

### Injecting Custom Command Into Evaluation Container

:::{note}
Do not use this functionality for running at scale, because it (a) reduces the reproducibility of evaluations and (b) introduces security risks (remote command execution).
:::

For various debugging or testing purposes, you can supply a `pre_cmd` field at the following positions in the configuration:

```yaml
...
evaluation:
  pre_cmd: |
    any script that will be executed inside of the container
    before running the evaluation; it can be multiline
  tasks:
    - name: <task_name>
      pre_cmd: one can override this command
```

For security reasons (configs may come from untrusted sources), if `pre_cmd` is non-empty, `nemo-evaluator-launcher` will fail unless the `NEMO_EVALUATOR_TRUST_PRE_CMD=1` environment variable is set.

## Getting Help

### Debug Information Collection

When reporting launcher issues, include:

1. **Configuration Details**:

   ```bash
   # Show resolved configuration
   nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/<config_name>.yaml --dry-run
   ```

2. **System Information**:

   ```bash
   # Launcher version
   nemo-evaluator-launcher --version

   # System info
   python --version
   docker --version      # For local executor
   sinfo                 # For Slurm executor
   lep workspace list    # For Lepton executor
   ```

3. **Job Information**:

   ```bash
   # Job status
   nemo-evaluator-launcher status <invocation_id>

   # Recent runs
   nemo-evaluator-launcher ls runs
   ```

4. **Log Files**:
   - Local executor: Check `<output_dir>/<job_id>/logs/stdout.log`
   - Slurm executor: Check job output files in the output directory
   - Lepton executor: Check Lepton job logs via the Lepton CLI

For complex issues, see the [Python API documentation](../../libraries/nemo-evaluator-launcher/api).

(authentication)=
# Authentication

Solutions for HuggingFace token issues and dataset access permissions.

## Common Authentication Issues

### Problem: `401 Unauthorized` for Gated Datasets

**Solution**:

```bash
# Set HuggingFace token
export HF_TOKEN=your_huggingface_token

# Or authenticate using CLI
huggingface-cli login

# Verify authentication
huggingface-cli whoami
```

**In Python**:

```python
import os
os.environ["HF_TOKEN"] = "your_token_here"
```

### Problem: `403 Forbidden` for Specific Datasets

**Solution**:

1. Request access to the gated dataset on HuggingFace
2. Wait for approval from dataset maintainers
3.
Ensure your token has the required permissions

## Datasets Requiring Authentication

The following datasets require `HF_TOKEN` and access approval:

- **GPQA Diamond** (and variants): [Request access](https://huggingface.co/datasets/Idavidrein/gpqa)
- **Aegis v2**: Required for safety evaluation tasks
- **HLE**: Humanity's Last Exam tasks

:::{note}
Most standard benchmarks (MMLU, HellaSwag, ARC, etc.) do not require authentication.
:::

---
orphan: true
---

# Setup and Installation Issues

Solutions for getting {{ product_name_short }} up and running, including installation problems, authentication setup, and model deployment issues.

## Common Setup Problems

Before diving into specific issues, verify your basic setup with these quick checks:

::::{tab-set}
:::{tab-item} Installation Check

```bash
# Verify core packages are installed
pip list | grep nvidia

# Check for missing evaluation frameworks
python -c "from nemo_evaluator import show_available_tasks; show_available_tasks()"
```
:::

:::{tab-item} Authentication Check

```bash
# Verify HuggingFace token
huggingface-cli whoami

# Test token access
python -c "import os; print('HF_TOKEN set:', bool(os.environ.get('HF_TOKEN')))"
```
:::

:::{tab-item} Deployment Check

```bash
# Check if deployment server is running
# Use /health for vLLM, SGLang, NIM deployments
# Use /v1/triton_health for NeMo/Triton deployments
curl -I http://0.0.0.0:8080/health

# Verify GPU availability
nvidia-smi
```
:::
::::

## Setup Categories

Choose the category that matches your setup issue:

::::{grid} 1 2 2 2
:gutter: 1 1 1 2

:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Installation Issues
:link: installation
:link-type: doc

Module import errors, missing dependencies, and framework installation problems.
:::

:::{grid-item-card} {octicon}`key;1.5em;sd-mr-1` Authentication Setup
:link: authentication
:link-type: doc

HuggingFace tokens, dataset access permissions, and gated model authentication.
:::
::::

:::{toctree}
:caption: Setup Issues
:hidden:

Installation
Authentication
:::

(installation-issues)=
# Installation Issues

Solutions for import errors, missing dependencies, and framework installation problems.

## Common Import and Installation Problems

### Problem: `ModuleNotFoundError: No module named 'core_evals'`

**Solution**:

```bash
# Install missing core evaluation framework
pip install nvidia-lm-eval

# For additional frameworks
pip install nvidia-simple-evals nvidia-bigcode-eval nvidia-bfcl
```

### Problem: `Framework for task X not found`

**Diagnosis**:

```python
from nemo_evaluator import show_available_tasks

# Display all available tasks
print("Available tasks:")
show_available_tasks()
```

Or use the CLI:

```bash
nemo-evaluator-launcher ls tasks
```

**Solution**:

```bash
# Install the framework containing the missing task
pip install nvidia-<framework>

# Restart Python session to reload frameworks
```

### Problem: `Multiple frameworks found for task X`

**Solution**:

```python
from nemo_evaluator import EvaluationConfig

# Use explicit framework specification
config = EvaluationConfig(
    type="lm-evaluation-harness.mmlu",  # Instead of just "mmlu"
    # ... other config
)
```
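To find every task name that collides across installed frameworks, you can group the launcher's task list by name. The sketch below runs on sample rows shaped like the documented `get_tasks_list()` output — `(task_name, endpoint_type, framework, container)` — with illustrative values; in practice you would call `get_tasks_list()` instead:

```python
from collections import defaultdict

# Sample rows shaped like get_tasks_list() output:
# (task_name, endpoint_type, framework, container). Values are illustrative.
tasks = [
    ("mmlu", "completions", "lm-evaluation-harness", "nvcr.io/..."),
    ("mmlu", "chat", "simple-evals", "nvcr.io/..."),
    ("ifeval", "chat", "lm-evaluation-harness", "nvcr.io/..."),
]

# Group frameworks by task name to spot duplicates.
by_name = defaultdict(set)
for name, _endpoint, framework, _container in tasks:
    by_name[name].add(framework)

conflicts = {name: sorted(fws) for name, fws in by_name.items() if len(fws) > 1}
for name, frameworks in conflicts.items():
    qualified = [f"{fw}.{name}" for fw in frameworks]
    print(f"{name}: disambiguate with one of {qualified}")
```

Any task that appears under more than one framework should be referenced with the qualified `framework.task` form shown above.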