Datasets#
BYOB benchmarks load evaluation data from local JSONL files, CSV/TSV files, or HuggingFace datasets. This page covers supported formats, remote dataset URIs, field mapping, and path resolution.
JSONL Format#
The default and recommended format is JSONL (JSON Lines). Each line is a self-contained JSON object representing one evaluation sample.
{"question": "Is the sky blue?", "answer": "yes"}
{"question": "Is water dry?", "answer": "no"}
Every object must be a JSON dictionary. Arrays or scalar values on a line will raise a ValueError.
Tip
Files with .json extensions are also treated as JSONL (line-delimited JSON), not as a single JSON array.
CSV and TSV Files#
BYOB also supports CSV and TSV files. The first row is treated as column headers using Python’s csv.DictReader.
@benchmark(
name="csv-bench",
dataset="data.csv",
prompt="Q: {question}\nA:",
target_field="answer",
)
Format detection is based on file extension: .csv for comma-separated, .tsv for tab-separated.
HuggingFace Datasets#
Use hf:// URIs to load datasets directly from the HuggingFace Hub. The dataset is downloaded at compile time and cached locally as JSONL.
URI Format#
hf://org/dataset?split=validation
hf://org/dataset/config?split=test
BYOB also accepts selected HuggingFace load_dataset options as query
parameters:
hf://org/dataset?split=test&trust_remote_code=true
hf://org/dataset?split=validation&filter_field=language&filter_value=hi
hf://org/dataset?split=test&data_files=file.json&field=examples
Examples#
@benchmark(
name="medmcqa",
dataset="hf://openlifescienceai/medmcqa?split=validation",
prompt="Q: {question}\nA: {a}\nB: {b}\nC: {c}\nD: {d}\nAnswer:",
target_field="cop",
field_mapping={"opa": "a", "opb": "b", "opc": "c", "opd": "d"},
)
@benchmark(
name="boolq",
dataset="hf://google/boolq?split=validation",
prompt="Passage: {passage}\nQuestion: {question}\nAnswer (true/false):",
target_field="answer",
)
Load a gated or custom-code dataset by allowing remote code execution:
@benchmark(
name="indommlu",
dataset="hf://indolem/IndoMMLU?split=test&trust_remote_code=true",
prompt="{question}\n\n{options}\n\nAnswer:",
target_field="answer",
)
Filter a multilingual dataset to one language:
@benchmark(
name="boolq-hi",
dataset="hf://sarvamai/boolq-indic?split=validation&filter_field=language&filter_value=hi",
prompt="Passage: {passage}\nQuestion: {question}\nAnswer:",
target_field="answer",
)
Load a nested JSON field from a specific dataset file:
@benchmark(
name="flores-hi",
dataset="hf://google/IndicGenBench_flores_in?split=test&data_files=flores_en_hi_test.json&field=examples",
prompt="Translate to Hindi: {source}",
target_field="target",
)
Note
HuggingFace dataset fetching requires the datasets pip package. Install it with pip install datasets.
Caching#
Downloaded datasets are cached at ~/.cache/nemo_evaluator/hf_datasets/ by default. Repeated fetches of the same URI reuse the cached JSONL file.
Split Selection#
When no split is specified, the HuggingFace datasets library defaults are used. If the result is a DatasetDict (multiple splits), the first available split is selected automatically.
Query Parameters#
Parameter |
Description |
|---|---|
|
Split passed to |
|
Passes |
|
Filters rows after loading, keeping rows where |
|
Additional row filters applied in order. |
|
Passes |
|
Passes |
Warning
If you put hf:// URIs with & query parameters in shell command
templates, quote the dataset argument:
--dataset "{{config.params.extra.dataset.path}}"
Otherwise the shell treats & as a background-command separator.
extra.dataset.* namespace#
BYOB exposes two dataset-related keys under config.params.extra.dataset.*
that can be overridden at run time without rebuilding the benchmark:
Key |
CLI flag |
Description |
|---|---|---|
|
|
Dataset file path or |
|
|
Few-shot example count (lm-eval-harness parity). Pass |
All other dataset-related options (field_mapping, choices, choices_field,
fewshot_dataset, fewshot_prefix, fewshot_split, etc.) are baked into the
benchmark at compile time from the @benchmark(...) decorator and are not
runtime-overridable — change them in your benchmark module and recompile with
nemo-evaluator-byob compile.
Overriding the dataset at run time#
To swap path or num_fewshot for a single run, set the corresponding key
under config.params.extra.dataset.* via the launcher’s run_config or CLI.
The launcher deep-merges via OmegaConf, so sibling keys (and unrelated keys
under extra such as benchmark_module, requirements, …) are preserved.
nemo-evaluator-launcher run --config my_config.yaml \
-o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.path=hf://other/foo?split=test' \
-o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.num_fewshot=0'
Or in a run_config YAML:
evaluation:
tasks:
- name: <task_name>
nemo_evaluator_config:
config:
params:
extra:
dataset:
path: hf://other/foo?split=test
num_fewshot: 5
Few-shot Examples#
BYOB resolves the few-shot example pool with this precedence:
fewshot_dataset— explicit URI/path. Use this when the few-shot source needs filters,data_files, configs, or any other URI options that cannot be expressed by a split name (e.g.hf://my-org/foo?data_files=train.json&filter_field=lang&filter_value=hi).fewshot_split— split name reused with the primaryhf://dataset. Used only whenfewshot_datasetis unset or fails to load.Tail of the primary dataset — last-resort fallback. Logs a loud warning because the few-shot pool overlaps with rows being evaluated, risking gold-answer leakage into the prompt.
Examples#
Few-shot from a different split of the same HuggingFace dataset:
@benchmark(
name="mmlu-mini",
dataset="hf://my-org/mmlu?split=test",
prompt="Question: {question}\nAnswer:",
target_field="answer",
num_fewshot=5,
fewshot_split="dev",
)
Few-shot from a completely different dataset URI (filters, data_files, etc.):
@benchmark(
name="boolq-hi",
dataset="hf://sarvamai/boolq-indic?split=validation&filter_field=language&filter_value=hi",
prompt="Passage: {passage}\nQuestion: {question}\nAnswer:",
target_field="answer",
num_fewshot=4,
fewshot_dataset="hf://sarvamai/boolq-indic?split=train&filter_field=language&filter_value=hi",
)
Add a static introduction before the few-shot examples:
@benchmark(
name="indommlu",
dataset="hf://indolem/IndoMMLU?split=test&trust_remote_code=true",
prompt="{question}\n\n{options}\n\nAnswer:",
target_field="answer",
num_fewshot=5,
fewshot_split="train",
fewshot_prefix="The following are multiple-choice questions. Choose the best answer.\n\n",
)
The final prompt sent to the model is:
<fewshot_prefix><example_1><fewshot_separator>...<example_N><fewshot_separator><test_prompt>
Tip
At run time you can force a true 0-shot evaluation against a benchmark
that declares a non-zero num_fewshot by passing --num-fewshot 0 on
the nemo-evaluator run_eval CLI. The flag is None by default; an
explicit 0 overrides the benchmark default.
Field Mapping#
Use field_mapping to rename dataset columns so they match the {placeholder} names in your prompt template. The mapping is applied after loading the dataset and before prompt rendering.
Syntax#
field_mapping={"source_column": "target_name"}
Keys present in a record are renamed. Keys absent from a record are silently ignored. Keys not listed in the mapping are passed through unchanged.
Example#
The MedMCQA dataset uses column names like opa, opb, opc, opd for answer options. Map them to shorter names for cleaner prompt templates:
@benchmark(
name="medmcqa",
dataset="hf://openlifescienceai/medmcqa?split=validation",
prompt="Q: {question}\nA: {a}\nB: {b}\nC: {c}\nD: {d}\nAnswer:",
target_field="cop",
field_mapping={"opa": "a", "opb": "b", "opc": "c", "opd": "d"},
)
Path Resolution#
Dataset paths are resolved relative to the benchmark file’s directory, following the same pattern the @benchmark decorator uses internally.
Path type |
Resolution |
|---|---|
Absolute ( |
Used as-is |
Relative ( |
Resolved from the directory containing the benchmark |
HuggingFace URI ( |
Downloaded and cached locally |
The internal resolution is equivalent to:
import os
resolved = os.path.join(os.path.dirname(__file__), "data.jsonl")
Warning
When running nemo-evaluator-byob from a different directory than the benchmark file, relative paths still resolve from the benchmark file’s location, not the current working directory.
Prompt Placeholder Matching#
The {field} placeholders in your prompt string must match keys in the dataset after any field_mapping has been applied. Mismatched placeholders cause a KeyError at evaluation time.
Use --dry-run to validate that your prompt, dataset, and field mapping are consistent before compiling:
nemo-evaluator-byob my_benchmark.py --dry-run
Tip
The dry-run output shows the dataset path, sample count, and any declared requirements – a quick way to catch configuration errors early.
See Also#
Bring Your Own Benchmark (BYOB) – BYOB overview and quickstart
Benchmark Decorator – Benchmark decorator parameters
CLI Reference – CLI reference for compilation and validation