Using Your Own Domain Data#
Use this tutorial to learn how to build your own benchmark with your own domain text.
You will create a parent folder, your input_dir, with one subdirectory per key in target_source_mapping, each holding UTF-8 .txt files the prepare stage can sample.
When you finish the steps below, the prepare stage can build stems from your corpora together with your Hugging Face subject wiring.
Each directory name directly under input_dir must match a key in target_source_mapping, for example banking, police, or maths.
Split long material across several .txt files when you want different documents to drive different queries.
Set input_dir in YAML to the parent directory. uv run nemotron steps run byob/mcq resolves relative paths from the shell working directory, so use an absolute path when you need a fixed location.
Step 1: Create Target Directories#
Create one folder per entry you plan under target_source_mapping.
# Example layout next to your workspace; adjust paths to match your machine.
mkdir -p ./data/byob/banking
mkdir -p ./data/byob/police
mkdir -p ./data/byob/maths
Path tips:
Relative paths, including input_dir, resolve from the shell working directory where you invoke uv run nemotron steps run byob/mcq.
Prefer an absolute input_dir in YAML when several people reuse the same file from different working directories, or whenever you need a fixed location regardless of where the command runs.
Do not use Hugging Face subject names as directory names; use the same strings you use as keys under target_source_mapping.
Few-shot subjects such as high_school_mathematics still belong under source_subjects and under each target’s subjects field.
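Before running the pipeline, it can help to confirm that the directories on disk line up with your mapping keys. The following is a minimal sketch, not part of the pipeline: check_layout is a hypothetical helper, and the key list is assumed to be copied by hand from target_source_mapping in your YAML.

```python
from pathlib import Path


def check_layout(input_dir, target_keys):
    """Compare subdirectories of input_dir with the expected target keys.

    Returns (missing, unexpected): keys that have no matching directory,
    and directories that match no key.
    """
    # Path.resolve() anchors a relative path at the current working directory.
    root = Path(input_dir).resolve()
    found = {p.name for p in root.iterdir() if p.is_dir()} if root.is_dir() else set()
    expected = set(target_keys)
    return sorted(expected - found), sorted(found - expected)


if __name__ == "__main__":
    # Keys copied by hand from target_source_mapping (an assumption of this sketch).
    missing, unexpected = check_layout("./data/byob", ["banking", "police", "maths"])
    print("missing directories:", missing)
    print("directories without a mapping key:", unexpected)
```

A directory named after a Hugging Face subject, such as high_school_mathematics, would show up in the second list.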
Step 2: Add Text Files#
Place UTF-8 text under each target directory. Use several smaller files instead of one huge blob when you want different documents to drive different queries.
cat > ./data/byob/banking/intro.txt << 'EOF'
Modern banking in India originated in the mid-18th century...
EOF
Read the sample src/nemotron/steps/byob/data/tiny_input/maths/tiny.txt file for tone and length.
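Because the prepare stage expects UTF-8 .txt files, a quick scan of a target directory can catch encoding problems early. This is a sketch under stated assumptions: scan_text_files is a hypothetical helper, not a pipeline function, and it only looks at .txt files directly inside the directory you pass.

```python
from pathlib import Path


def scan_text_files(target_dir):
    """Return ({file name: character count}, [file names that are not valid UTF-8])."""
    sizes, invalid = {}, []
    for path in sorted(Path(target_dir).glob("*.txt")):
        try:
            sizes[path.name] = len(path.read_text(encoding="utf-8"))
        except UnicodeDecodeError:
            # The file contains bytes that do not decode as UTF-8.
            invalid.append(path.name)
    return sizes, invalid


if __name__ == "__main__":
    sizes, invalid = scan_text_files("./data/byob/banking")
    for name, chars in sizes.items():
        print(f"{name}: {chars} characters")
    print("not valid UTF-8:", invalid)
```

The character counts also give a rough sense of whether a file is closer to a short note or a long manual.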
Step 3: Point YAML at the Parent Directory#
Set input_dir to the parent of those directories and wire few-shot subjects your Hugging Face split actually contains.
input_dir: ./data/byob
hf_dataset: cais/mmlu
subset: all
split: test
source_subjects:
  - high_school_mathematics
target_source_mapping:
  banking:
    subjects:
      - business_ethics
      - econometrics
  police:
    subjects:
      - jurisprudence
      - international_law
  maths:
    subjects:
      - high_school_mathematics
Allowed hf_dataset values and default subset / split behavior are listed in Supported Hugging Face Benchmarks.
Best Practices#
Content Quality#
Aim for a few thousand characters per document when you want diverse stems without exhausting context.
Prefer complete explanations over fragments so judges and filters see coherent context.
Strip or replace personally identifiable information before you run in shared environments.
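As a starting point for the PII guidance above, a first pass might replace obvious email addresses and phone-like digit runs before the files reach a shared environment. This is an illustrative sketch only: the two patterns and the redact helper are assumptions made for this example, they miss names, addresses, and ID numbers, and the phone pattern can also match harmless digit sequences, so review the output by hand.

```python
import re

# Illustrative patterns only; they are not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(redact("Contact a.b@example.com or +91 98765 43210."))
```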
Organization#
Keep one domain taxonomy per directory; do not mix unrelated subjects under the same target key.
Split long manuals into several .txt files so queries_per_target_subject_document can visit different offsets.
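One way to split a long manual is to group its paragraphs into parts of a few thousand characters and write each part to its own file. The sketch below assumes paragraphs are separated by blank lines; split_by_paragraph, write_parts, the 4000-character default, and the file naming scheme are all hypothetical choices for illustration, not pipeline behavior.

```python
from pathlib import Path


def split_by_paragraph(text, max_chars=4000):
    """Group blank-line-separated paragraphs into parts of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own part.
    """
    parts, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current part.
            parts.append(current)
            current = para
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


def write_parts(text, target_dir, stem, max_chars=4000):
    """Write each part to target_dir/<stem>_NN.txt as UTF-8 and return the paths."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, part in enumerate(split_by_paragraph(text, max_chars), start=1):
        path = out_dir / f"{stem}_{i:02d}.txt"
        path.write_text(part + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, write_parts(manual_text, "./data/byob/banking", "manual") would produce manual_01.txt, manual_02.txt, and so on under the banking target.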
Performance and Chunking#
Pilot with a handful of documents per target before scaling out.
Use chunking_config.window_size when you need sliding windows over very long text; null keeps each file as one unit. Field detail is in Generation Configuration Reference.
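To build intuition for the difference between a window size and null, the sketch below slices a string into character windows. It is a generic illustration, not the pipeline's chunker: sliding_windows and its stride parameter are hypothetical, and the real unit (characters or tokens) and overlap behavior are defined in the configuration reference, not here.

```python
def sliding_windows(text, window_size=None, stride=None):
    """Return text as one unit when window_size is None, else as fixed-size windows.

    stride is a hypothetical parameter: how far the window start moves each step.
    It defaults to window_size, which gives non-overlapping windows.
    """
    if window_size is None:
        return [text]
    step = window_size if stride is None else stride
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and stride must be positive")
    out, start = [], 0
    while True:
        out.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return out


if __name__ == "__main__":
    print(sliding_windows("abcdefghij"))                      # one unit
    print(sliding_windows("abcdefghij", window_size=4))        # no overlap
    print(sliding_windows("abcdefghij", window_size=4, stride=2))  # overlapping
```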
Next Steps#
Run prepare alone or the full pipeline after you align YAML: Prepare Your Own Domain Data.
Inspect seed.parquet and downstream paths in Output Files.