About Building Multiple-Choice Question Benchmarks

This section describes how to build a custom multiple-choice question (MCQ) benchmark as Apache Parquet files with the nemotron steps run byob/mcq command. You supply domain text files under input_dir, and the pipeline samples few-shot exemplars from a Hugging Face benchmark named in your configuration, such as cais/mmlu. The configuration also specifies which subjects to sample from, for example high_school_mathematics.
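
Before a run, it can help to confirm that input_dir actually contains usable text files. The following is a minimal sketch and not part of the nemotron CLI: the helper name list_corpus_files is hypothetical, and the only assumption about the corpus is that it consists of .txt files somewhere under input_dir.

```python
from pathlib import Path


def list_corpus_files(input_dir):
    """Return the non-empty .txt files under input_dir, sorted by path.

    Hypothetical helper for illustration; the pipeline does its own loading.
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"input_dir does not exist: {root}")
    # rglob makes no assumption about how subfolders are organized.
    files = [p for p in root.rglob("*.txt") if p.is_file() and p.stat().st_size > 0]
    return sorted(files)
```

An empty result suggests that the path is wrong or that the corpus uses a different file extension.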

The benchmark step prepares seed rows, generates and judges questions, runs optional deduplication and distractor stages, and writes benchmark.parquet. An optional translation stage reads an existing benchmark and writes another benchmark.parquet with the same column layout.

Tip

New to this flow? Follow Getting Started with Building MCQ Benchmarks once, then use the grids and tables below to jump to how-to guides, concepts, or reference pages.

When to Use

Use the nemotron steps run byob/mcq command when you want the following outcomes.

  • Questions grounded in your own documents, paired with few-shot items from a public benchmark subject you declare in configuration.

  • A repeatable Parquet artifact written to one experiment folder under your configured output_dir, plus intermediate caches that you can reuse when you iterate.

  • Optional translation with forward passes, backtranslation, and metric thresholds before you export another Parquet benchmark.
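
The idea of gating translated items on backtranslation quality can be sketched as follows. This is an illustration only: the similarity metric (the character-level ratio from Python's difflib), the 0.8 threshold, and the function names are assumptions, not the metrics or defaults that the translation stage uses.

```python
from difflib import SequenceMatcher


def backtranslation_score(original, backtranslated):
    """Similarity in [0, 1] between the source text and its backtranslation."""
    return SequenceMatcher(None, original.lower(), backtranslated.lower()).ratio()


def keep_translation(original, backtranslated, threshold=0.8):
    """Keep a translated item only if its backtranslation stays close to the source.

    The threshold value is an assumed placeholder, not a pipeline default.
    """
    return backtranslation_score(original, backtranslated) >= threshold
```

A stricter threshold drops more items but lowers the chance that a translation has drifted from the original question.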

Pipeline Summary

At a high level, the benchmark step performs the following work.

  1. Prepare: sample few-shot examples and align them with chunks from your corpus into a seed dataset.

  2. Generate: run the staged MCQ pipeline from generation through filtering into benchmark_raw.parquet and benchmark.parquet.

  3. Translate (optional): translate questions and options, score backtranslation quality, and export a new benchmark.parquet.
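
Step 1 can be pictured as pairing each corpus chunk with a small sample of few-shot exemplars. The sketch below is a simplified illustration under stated assumptions: the fixed-size character chunking, the field names chunk and few_shots, and both helper names are hypothetical, and the real prepare step may chunk and sample differently.

```python
import random


def chunk_text(text, chunk_size=200):
    """Split text into consecutive character chunks of at most chunk_size."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_seed_rows(corpus_text, exemplars, shots_per_row=2, chunk_size=200, seed=0):
    """Pair every chunk with a random sample of few-shot exemplars.

    Field names and sampling strategy are assumptions for illustration.
    """
    rng = random.Random(seed)  # a fixed seed keeps the seed dataset repeatable
    k = min(shots_per_row, len(exemplars))
    return [
        {"chunk": chunk, "few_shots": rng.sample(exemplars, k)}
        for chunk in chunk_text(corpus_text, chunk_size)
    ]
```

Each resulting row carries what a generation prompt would need: a grounding passage from your corpus and a few examples of the target question style.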

Documentation Series

  • Tutorial: Install the byob extra, run the sample tiny configuration with local paths, and inspect Parquet outputs. The tiny fixture pairs cais/mmlu high school mathematics few-shots with a one-line input file related to algebra. See Getting Started with Building MCQ Benchmarks.

  • How-To Guides: Prepare data, tune models in YAML, customize prompts, and resume with skip_until.

  • Concepts: How prepare, generate, and translate stages fit together and what each configuration block does.

  • Reference: Supported Hugging Face datasets, Parquet outputs, and YAML fields.

All Documentation

| Guide | What you will do |
|---|---|
| Getting Started with Building MCQ Benchmarks | Run nemotron steps run byob/mcq with the tiny configuration and inspect the outputs |

| Guide | What you will do |
|---|---|
| Prepare Your Own Domain Data | Lay out input_dir and target_source_mapping |
| Using Your Own Domain Data | Lay out per-target .txt corpora under input_dir |
| Configure Model Endpoints for BYOB | Point generation, judgement, and filter models at your endpoints |
| Prompt Tuning for Benchmarks | Override prompts with a YAML file |
| Skip Stages When Iterating | Resume from intermediate Parquet caches |

| Guide | What you will learn |
|---|---|
| Pipeline Overview | Stage order for prepare, generate, and translate |
| Data Preparation for Multiple-Choice Question Benchmarks | Seeds, chunking, and the prepare step |
| Getting the Right Questions From the Source Benchmark | source_subjects and target_source_mapping |
| Question Generation | Data Designer batched generation |
| Quality Validation | Judgement, deduplication, distractors, coverage, outliers |
| Easiness and Hallucination Filtering | Easiness and hallucination filters |
| Translation | Curator translation and backtranslation metrics |

| Guide | What you will find |
|---|---|
| Output Files | Paths under output_dir / expt_name |
| Troubleshooting | Symptom-to-fix index for BYOB runs |
| Supported Hugging Face Benchmarks | Allowed hf_dataset values and default subsets |
| Generation Configuration Reference | Generation YAML keys |
| Translation Configuration Reference | Translation YAML keys |

What You Need

  • A Nemotron clone with dependencies installed, including the byob extra from uv sync --extra byob.

  • Model credentials and endpoints that match the generation_model_config, judge_model_config, and related blocks in your YAML, as described in Configure Model Endpoints for BYOB.

  • Network access to download the configured Hugging Face benchmark split unless it is already cached on disk.

Quick Start

  1. Follow Getting Started with Building MCQ Benchmarks if you have not run the step yet.

  2. Read Prepare Your Own Domain Data when you are ready to point the pipeline at your own corpus and mapping.

  3. Open Generation Configuration Reference or Translation Configuration Reference when you need field-level YAML detail.

Limitations and Considerations

  • Cost: generation, judgement, expansion, validity checks, and filters call remote models whenever you configure them to do so.

  • Time: full runs depend on corpus size, model latency, and which optional stages stay enabled.

  • Rate limits: hosted APIs may throttle requests if the parallelism you set under inference_parameters exceeds what the provider allows.

  • Curator mount: checked-in configurations mount NeMo Curator from Git for translation and deduplication-related paths, so any remote profile must expose that source tree in the way your environment expects.