About Building Multiple-Choice Question Benchmarks#

This section describes how to build a custom multiple-choice question (MCQ) benchmark as Apache Parquet files with the `nemotron steps run byob/mcq` command. You supply domain text files under `input_dir`, and the pipeline samples few-shot exemplars from a Hugging Face benchmark named in your configuration, such as `cais/mmlu`. The configuration also specifies subject filters, such as `high_school_mathematics`.

The benchmark step prepares seed rows, generates and judges questions, runs optional deduplication and distractor stages, and writes `benchmark.parquet`. An optional translation stage reads an existing benchmark and writes another `benchmark.parquet` with the same column layout.

Tip

New to this flow? Follow Getting Started with Building MCQ Benchmarks once, then use the grids and tables below to jump to how-to guides, concepts, or reference pages.

When to Use#

Use the `nemotron steps run byob/mcq` command when you want the following outcomes.

Questions grounded in your own documents, paired with few-shot items from a public benchmark subject you declare in configuration.
A repeatable Parquet artifact in one experiment folder under your configured `output_dir`, plus intermediate caches when you iterate.
Optional translation with forward passes, backtranslation, and metric thresholds before you export another Parquet benchmark.
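
The output location mentioned in the list above can be sketched with `pathlib`. This is a minimal illustration, assuming the experiment folder sits at `output_dir/expt_name` and that both Parquet files are written directly inside it. The Output Files reference is the authority on the actual layout, and the helper name `expected_outputs` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(output_dir: str, expt_name: str) -> Dict[str, Path]:
    """Return the paths where a run is assumed to leave its Parquet artifacts.

    Assumption: artifacts sit directly under output_dir/expt_name. Check the
    Output Files reference for the layout your version actually uses.
    """
    expt_dir = Path(output_dir) / expt_name
    return {
        "experiment_dir": expt_dir,
        "raw": expt_dir / "benchmark_raw.parquet",
        "final": expt_dir / "benchmark.parquet",
    }


if __name__ == "__main__":
    # "outputs" and "tiny_run" are placeholder values for illustration.
    for label, path in expected_outputs("outputs", "tiny_run").items():
        status = "found" if path.exists() else "missing"
        print(f"{label}: {path} ({status})")
```

Running the script after a benchmark step shows at a glance which of the assumed artifacts exist on disk.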

Pipeline Summary#

At a high level, the benchmark step performs the following work.

Prepare: sample few-shot examples and align them with chunks from your corpus into a seed dataset.
Generate: run the staged MCQ pipeline from generation through filtering into `benchmark_raw.parquet` and `benchmark.parquet`.
Translate (optional): translate questions and options, score backtranslation quality, and export a new `benchmark.parquet`.
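
Resuming a staged pipeline with an option such as `skip_until` can be pictured as slicing an ordered stage list. The sketch below is illustrative only: the stage names are placeholders drawn from the prose on this page, not the identifiers the pipeline accepts, and `stages_to_run` is a hypothetical helper rather than part of the Nemotron CLI.

```python
from typing import List, Optional

# Placeholder stage names taken from the prose on this page; the real
# pipeline may use different identifiers and a different order.
STAGES = [
    "generation",
    "judgement",
    "deduplication",
    "distractors",
    "filtering",
]


def stages_to_run(skip_until: Optional[str] = None) -> List[str]:
    """Return the stages that still need to run.

    With skip_until=None every stage runs. Otherwise earlier stages are
    skipped on the assumption that their cached output already exists.
    """
    if skip_until is None:
        return list(STAGES)
    if skip_until not in STAGES:
        raise ValueError(f"unknown stage: {skip_until!r}")
    return STAGES[STAGES.index(skip_until):]


if __name__ == "__main__":
    print(stages_to_run())                 # full run
    print(stages_to_run("deduplication"))  # resume from a cached stage
```

Rejecting unknown stage names early is a deliberate choice in this sketch, so that a typo does not silently trigger a full and costly rerun.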

Documentation Series#

Tutorial

Install the `byob` extra, run the sample `tiny` configuration with local paths, and inspect the Parquet outputs. The `tiny` fixture pairs `cais/mmlu` high school mathematics few-shot examples with a one-line input file about algebra.

Getting Started with Building MCQ Benchmarks

How-To Guides

Prepare data, tune models in YAML, customize prompts, and resume with `skip_until`.

Concepts

How prepare, generate, and translate stages fit together and what each configuration block does.

Reference

Supported Hugging Face datasets, Parquet outputs, and YAML fields.

All Documentation#

Tutorial

Guide	What you will do
Getting Started with Building MCQ Benchmarks	Run `nemotron steps run byob/mcq` with `tiny` and inspect outputs

How-To Guides

Guide	What you will do
Prepare Your Own Domain Data	Lay out `input_dir` and `target_source_mapping`
Using Your Own Domain Data	Lay out per-target `.txt` corpora under `input_dir`
Configure Model Endpoints for BYOB	Point generation, judgement, and filter models at your endpoints
Prompt Tuning for Benchmarks	Override prompts with a YAML file
Skip Stages When Iterating	Resume after intermediate Parquet caches

Concepts

Guide	What you will learn
Pipeline Overview	Stage order for prepare, generate, and translate
Data Preparation for Multiple-Choice Question Benchmarks	Seeds, chunking, and the prepare step
Getting the Right Questions From the Source Benchmark	`source_subjects` and `target_source_mapping`
Question Generation	Data Designer batched generation
Quality Validation	Judgement, deduplication, distractors, coverage, outliers
Easiness and Hallucination Filtering	Easiness and hallucination filters
Translation	Curator translation and backtranslation metrics

Reference

Guide	What you will find
Output Files	Paths under `output_dir` / `expt_name`
Troubleshooting	Symptom-to-fix index for BYOB runs
Supported Hugging Face Benchmarks	Allowed `hf_dataset` values and default subsets
Generation Configuration Reference	Generation YAML keys
Translation Configuration Reference	Translation YAML keys

What You Need#

A clone of the Nemotron repository with dependencies installed, including the `byob` extra (run `uv sync --extra byob`).
Model credentials and endpoints that match the `generation_model_config`, `judge_model_config`, and related blocks in your YAML, as described in Configure Model Endpoints for BYOB.
Network access to download the configured Hugging Face benchmark split unless it is already cached on disk.
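
Before a first run, a short preflight script can confirm that the corpus and tooling are in place. This is a sketch under assumptions: it treats `.txt` files anywhere under `input_dir` as the corpus and looks for executables named `nemotron` and `uv` on `PATH`. The function name `preflight` is hypothetical, and the script does not validate credentials, endpoints, or YAML.

```python
import shutil
from pathlib import Path
from typing import List, Sequence


def preflight(input_dir: str, tools: Sequence[str] = ("nemotron", "uv")) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    root = Path(input_dir)
    if not root.is_dir():
        problems.append(f"input_dir does not exist: {root}")
    elif not any(root.rglob("*.txt")):
        # Assumption: the corpus is plain-text files with a .txt suffix.
        problems.append(f"no .txt corpus files found under {root}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"executable not found on PATH: {tool}")
    return problems


if __name__ == "__main__":
    # "data/input" is a placeholder path for illustration.
    for problem in preflight("data/input"):
        print("WARNING:", problem)
```

Returning a list instead of raising on the first failure lets you see every missing prerequisite in one pass.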

Quick Start#

Follow Getting Started with Building MCQ Benchmarks if you have not run the step yet.
Read Prepare Your Own Domain Data when you are ready to point the pipeline at your own corpus and mapping.
Open Generation Configuration Reference or Translation Configuration Reference when you need field-level YAML detail.

Limitations and Considerations#

Cost: generation, judgement, expansion, validity checks, and filters call remote models whenever you configure them to do so.
Time: full runs depend on corpus size, model latency, and which optional stages stay enabled.
Rate limits: hosted APIs may throttle parallel requests that you set under `inference_parameters`.
Curator mount: checked-in configurations mount NeMo Curator from Git for translation and deduplication-related paths, so remote profiles must expose that tree the same way your environment expects.