
Copyright © 2026, NVIDIA Corporation.


Deep Research Trajectories with NeMo Data Designer and MCP Tool Use

Eric Tramel, Researcher at NVIDIA

Data Designer v0.5.0’s MCP tool-use support lets you generate multi-turn research trajectories, the kind of data needed to train deep research agents that iteratively search, read, and synthesize evidence before answering a question.


OpenResearcher benchmark results across deep research tasks. Source: Li, Jiang, Ma et al., 2026.

Deep research agents like OpenResearcher (Li, Jiang, Ma et al., 2026) and Universal Deep Research (Belcak & Molchanov, 2025) generate long reasoning chains interleaved with tool calls: formulating queries, retrieving documents, reading passages, refining hypotheses, and eventually synthesizing an answer. Training these agents requires trajectory data capturing the full multi-turn interaction between a model and its tools: every search, every document opened, every dead end explored.

OpenResearcher demonstrated something worth paying attention to: synthetic trajectories generated against a local retriever (BM25 over a static corpus, no web APIs) are sufficient to train Nemotron Nano 3 to outperform GPT-4.1 on deep research benchmarks. The data format (complete tool-use traces showing how a model moves through an information space) matters more than model scale. Nemotron Nano 3, with only 3B active parameters, beats models orders of magnitude larger on multi-hop research tasks.

This post shows how to generate that same kind of training data using Data Designer’s MCP tool-use capabilities. We build a retriever as an MCP server, construct a corpus with known-good evidence, run a teacher model through the full research process, and use an LLM judge for rejection sampling. The result is a pipeline that produces high-quality research trajectories you can use for supervised fine-tuning or as a starting point for RL.

Here’s what one of those trajectories looks like, a 4-hop question answered correctly by Claude Opus 4.5 using the pipeline described below. Each line is a tool call; parallel calls within the same turn are grouped.

Example trajectory: 4-hop question, 31 turns, 49 tool calls
Q: Who was the mother of the person under whom the colonizer of the winner of a border war with Libya in the 1st century BC reached its greatest extent?
Reference: Marcia
⭐ = golden passage (contains evidence for the answer)
T1
🔍 search: border war Libya 1st century BC
T2
🔍 search: ancient Egypt Libya war conflict 1st century BC
🔍 search: Cyrene Libya ancient war border
T3
🔍 search: Ptolemaic Egypt Cyrenaica war conflict
🔍 search: Egypt border war ancient 1st century BC winner
T4
📄 open: md_001400
T5
🔍 search: Egypt Marmarica border war ancient Roman
🔍 search: Marmaric War Egypt Libya ancient
T6
🔍 search: Greek colony Libya border war Cyrene colonization
🔍 search: Thera Cyrene colony Greek colonization ancient
T7
📄 open: md_000167
T8
🔍 search: 1st century BC war border Libya Egypt conflict
🔍 search: Chad Libya border war Aouzou Strip conflict
T9
📄 open: md_001676 ⭐
T10
🔍 search: Roman Empire war Libya Numidia Africa ancient
🔍 search: Jugurthine War Numidia Rome Libya
T11
📄 open: fw_021746
T12
🔍 search: Egypt ancient war tribes Libya Marmaridae border conflict
🔍 search: Cornelius Gallus Egypt prefect war border
T13
📄 open: md_001054 ⭐
T14
🔍 search: Roman Empire greatest extent Trajan mother
🔍 search: Trajan Roman Empire greatest territorial extent
T15
📄 open: md_002075 ⭐
📄 open: md_002510 ⭐
T16
🔍 search: Egypt Roman province colony Augustus annexed
🔍 search: Trajan mother Marcia family
T17
📄 open: md_002117
🔍 search: Trajan mother Marcia parents father
T18
🔍 search: Ulpia Marciana Trajan sister mother family
🔍 search: Egypt war Libya Marmaricae ancient Roman
T19
🔍 search: Marcia Trajan Roman empress mother emperor
🔍 search: Egypt Marmaric war ancient Rome Cyrene 1st century
T20
📄 open: md_000722
T21
🔍 search: Ptolemaic Egypt border war Cyrene Cyrenaica territory
🔍 search: Egypt vs Libya ancient war Ptolemy Rome colony
T22
🔍 search: Toyota War Chad Libya border winner conflict
🔍 search: Chad Aouzou Strip war Libya France colonized
T23
🔍 search: 1st century BC Egypt border war defeat victory Rome colonize
🔍 search: ancient Egypt war won Libya 1st century BC Rome province
T24
📄 open: md_002076 ⭐
T25
🔍 search: French Empire greatest extent Napoleon mother
🔍 search: France Vichy border war Libya Chad colonial
T26
🔍 search: Egypt war 25 BC 30 BC Marmaridae tribe Roman conquest
🔍 search: Cornelius Balbus Garamantes Libya war Roman conquest
T27
🔍 search: ancient Crete war Libya border Greek colonize
🔍 search: Cyrene founded colonized by whom mother city Thera
T28
🔍 search: Egypt Roman province colonized 1st century BC Trajan mother Marcia
T29
🔍 search: Marcia Trajan mother Roman emperor name Ulpius
T30
🔍 search: Egypt war border Libya Cyrenaica defeated 1st century BC Roman province
T31
✓ answer
Based on my extensive research, I have gathered the following evidence:

1. The Roman Empire reached its greatest territorial extent under Trajan (117 CE) — confirmed by multiple documents
2. Trajan's mother was from the Gens Marcia — the Trajan document states he "on his mother's side from the Gens Marcia, of an Italic family of Sabine origin"
3. Egypt was colonized/annexed by Rome after Cleopatra's death (30 BC), making it a Roman province
4. Egypt had border conflicts with Libya (ancient Marmaricae/Cyrenaica region) in the 1st century BC

The answer is Marcia.

Step 1: Building the Retrieval MCP Server

OpenResearcher’s key design choice is a three-tool browser interface rather than a single retrieval call. The paper argues (and their ablations confirm) that separating search, document opening, and in-document search forces the model to develop genuine research strategies: skimming results, diving into promising documents, hunting for specific evidence within them. A single monolithic “retrieve” tool collapses this entire workflow into one step, which produces shorter and less useful training trajectories.

We implement the same three tools as an MCP server that Data Designer can invoke during generation. Our retriever uses BM25S for fast lexical search over the corpus:

from mcp.server.fastmcp import FastMCP

mcp_server = FastMCP("corpus-retriever")

@mcp_server.tool()
def search(query: str, top_k: int = 10) -> dict:
    """Search for candidate documents to explore."""
    # BM25S search over the corpus, returns ranked results with snippets
    ...

@mcp_server.tool(name="open")
def open_document(doc_id: str) -> dict:
    """Open a document for detailed inspection with cursor-numbered chunks."""
    # Returns content formatted as [1] paragraph... [2] paragraph...
    ...

@mcp_server.tool()
def find(doc_id: str, query: str) -> dict:
    """Find matching passages inside a document by keyword."""
    # Returns matching chunks with cursor positions
    ...

if __name__ == "__main__":
    mcp_server.run()

search returns a ranked list of document IDs with short snippets, enough for the model to decide which documents look promising. open returns the full document content, split into cursor-numbered chunks so the model can reference specific passages. find does targeted keyword search within a single document, letting the model locate specific evidence without reading the entire thing. The cursor-based chunking across open and find gives the model a way to scan long documents incrementally, the way a human researcher would scan a paper for the relevant section rather than reading it cover to cover.
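To make the cursor-numbered chunking concrete, here is a minimal sketch of how `open` and `find` could format and search a document. The helper names (`chunk_document`, `render_with_cursors`, `find_in_chunks`) and the blank-line paragraph split are assumptions for illustration, not the implementation in retriever_mcp.py.

```python
import re


def chunk_document(content: str) -> list[str]:
    """Split a document into paragraph chunks (assumed: blank-line separated)."""
    parts = re.split(r"\n\s*\n", content.strip())
    return [p.strip() for p in parts if p.strip()]


def render_with_cursors(chunks: list[str]) -> str:
    """Format chunks as '[1] ...', '[2] ...' so each can be cited by number."""
    return "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def find_in_chunks(chunks: list[str], query: str) -> list[dict]:
    """Return the chunks that contain every query keyword, with cursor positions."""
    keywords = query.lower().split()
    return [
        {"cursor": i, "text": chunk}
        for i, chunk in enumerate(chunks, start=1)
        if all(k in chunk.lower() for k in keywords)
    ]


doc = (
    "Trajan was a Roman emperor.\n\n"
    "His mother was from the Gens Marcia.\n\n"
    "He died in 117."
)
chunks = chunk_document(doc)
print(render_with_cursors(chunks))
print(find_in_chunks(chunks, "mother marcia"))
```

Because `open` and `find` share the same cursor numbering, a hit returned by `find` points at the same chunk the model would see when opening the full document.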

The server runs as a local stdio process, which means Data Designer launches and manages it automatically. No external services, no API keys for retrieval, no rate limits.
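For readers unfamiliar with the lexical ranking behind the `search` tool, the sketch below scores documents with a common form of the Okapi BM25 formula in pure Python. It is a stand-in for illustration only: it does not use the BM25S library API, the whitespace tokenizer is a simplification, and `k1=1.5`, `b=0.75` are commonly used defaults assumed here rather than values taken from the post.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (a simplification; no stemming)."""
    return text.lower().split()


def bm25_rank(
    query: str,
    docs: dict[str, str],
    k1: float = 1.5,
    b: float = 0.75,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Rank documents by BM25 score; documents with no query term are dropped."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / max(n_docs, 1)

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores: dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy corpus with made-up document IDs in the md_/fw_ style used in this post.
docs = {
    "md_000001": "Trajan ruled the Roman Empire at its greatest extent",
    "md_000002": "The Toyota War was fought between Chad and Libya",
    "fw_000001": "A recipe for sourdough bread",
}
ranked = bm25_rank("Chad Libya war", docs)
print(ranked)
```

Purely lexical scoring like this explains why the trajectories contain so many query reformulations: a document is only reachable if the query shares terms with it.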


Step 2: Building the Corpus

The corpus design follows directly from OpenResearcher’s most striking ablation result. They tested what happens when you vary the retrieval corpus while keeping the reasoning model fixed (GPT-OSS-120B). The results, from the OpenResearcher Appendix:

| Corpus | BrowseComp-Plus Accuracy |
|---|---|
| Golden passages only (BrowseComp-Plus corpus) | 56.0% |
| 15M FineWeb + golden passages | 31.2% |
| 15M FineWeb only | 0.71% |

Without golden passages (documents known to contain evidence for the question), accuracy drops to nearly zero. The model can’t learn research strategies from trajectories where every search is a dead end.

The original OpenResearcher corpus uses 15M documents from FineWeb as distractors alongside 10K golden passages. For this demonstration, we use a lighter-weight approach and construct the corpus from two multi-hop QA datasets: HotpotQA (2-hop questions requiring two pieces of linked evidence) and MuSiQue (2-4 hop questions composed from single-hop sub-questions). Each question comes with annotated supporting passages: the specific paragraphs that contain the evidence needed to answer it. Golden passages go into the corpus alongside non-supporting passages from the same datasets as distractors, at roughly a 1:9 ratio. The model has to search through noise to find the signal, which is exactly the skill we want the training data to teach.

The key constraint is that golden passages must be findable but not obvious. If the corpus is too small or the golden passages are too easy to identify, the trajectories won’t transfer to real-world research where evidence is sparse. The distractor ratio controls this difficulty, and the paper’s ablations give us a good starting point for tuning it.
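The distractor ratio described above can be treated as a single tunable knob. The following is a minimal sketch, assuming documents are plain dicts and using a hypothetical `build_corpus` helper; it is not the logic in prepare_corpus.py.

```python
import random


def build_corpus(
    golden: list[dict],
    distractor_pool: list[dict],
    distractors_per_golden: int,
    seed: int = 0,
) -> list[dict]:
    """Mix golden passages with sampled distractors at a 1:N ratio."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    wanted = min(len(golden) * distractors_per_golden, len(distractor_pool))
    sampled = rng.sample(distractor_pool, wanted)
    corpus = [dict(doc, is_golden=True) for doc in golden]
    corpus += [dict(doc, is_golden=False) for doc in sampled]
    rng.shuffle(corpus)  # avoid golden passages clustering at the front
    return corpus


golden = [{"title": f"gold-{i}", "content": "evidence"} for i in range(2)]
pool = [{"title": f"noise-{i}", "content": "unrelated"} for i in range(100)]
corpus = build_corpus(golden, pool, distractors_per_golden=9)
print(len(corpus), sum(doc["is_golden"] for doc in corpus))
```

Raising `distractors_per_golden` makes evidence sparser and trajectories longer; lowering it makes the task easier but less representative of real research.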


Step 3: The Data Designer Pipeline

With the retriever server and corpus ready, the Data Designer pipeline ties everything together. We configure a teacher model, point it at the MCP retriever, and let it research each question from scratch. For this demo we hosted our own inference server, but anyone can try this pipeline using Nemotron Nano 3 on build.nvidia.com with a free API key using the model configuration shown below.

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Search rollout model for trajectory generation
config = dd.DataDesignerConfigBuilder()
config.add_model_config(
    dd.ModelConfig(
        alias="search_rollout_model",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=0.95,
            max_tokens=16384,
        ),
    )
)

The temperature and top_p settings matter here. We want diverse research strategies across seeds (different query formulations, different document exploration orders) so that rejection sampling has a rich pool to select from. Setting temperature to 1.0 with top_p at 0.95 gives enough variation that the same question can produce meaningfully different trajectories across seeds.
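To see why these two parameters control diversity, the sketch below applies temperature scaling followed by nucleus (top-p) truncation to a toy next-token distribution. The logits and token names are invented for illustration, and real inference servers may apply these steps in a different order or with additional filters.

```python
import math


def sampling_distribution(
    logits: dict[str, float], temperature: float, top_p: float
) -> dict[str, float]:
    """Return the renormalized distribution a sampler would draw from."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Nucleus: keep the smallest set of top tokens whose mass reaches top_p.
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Hypothetical logits for the model's next action.
logits = {"search": 2.0, "open": 1.0, "find": 0.0, "answer": -3.0}
print(sampling_distribution(logits, temperature=1.0, top_p=0.95))
print(sampling_distribution(logits, temperature=0.2, top_p=0.95))
```

At temperature 1.0 several actions stay in play, while at a low temperature the distribution collapses onto the single most likely action, which is the behavior that would make repeated seeds produce near-identical trajectories.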

The MCP tool configuration tells Data Designer which server to use and how many tool-call turns to allow:

# MCP retriever tool configuration
tool_config = dd.ToolConfig(
    tool_alias="knowledge-base",
    providers=["corpus-retriever"],
    max_tool_call_turns=150,
)
config.add_tool_config(tool_config)

We set max_tool_call_turns high (150) because deep research trajectories can be long. Our longest observed trajectory used 25 tool calls across 53 messages. Capping too low would truncate the most interesting research chains.

The seed dataset contains the research questions alongside reference answers (which we’ll use for rejection sampling in Step 4):

config.with_seed_dataset(
    dd.LocalFileSeedSource(path="questions.jsonl"),
)

config.add_column(
    dd.ExpressionColumnConfig(
        name="research_question",
        expr="{{ question }}",
    )
)

The core of the pipeline is the research column, where the teacher model receives a question and a system prompt instructing it to use the retriever tools:

SYSTEM_PROMPT = """You are a thorough research assistant. You have access to three tools \
for navigating a knowledge base:
- search(query, top_k): Find candidate documents relevant to your query
- open(doc_id): Open a document to read its full content in numbered chunks
- find(doc_id, query): Locate specific passages within a document by keyword

Your task is to research the given question by searching for relevant documents, \
reading their content, and synthesizing an answer from the evidence you find. \
Be systematic: formulate search queries, explore promising results, and gather \
evidence before answering. Cite specific passages when possible."""

config.add_column(
    dd.LLMTextColumnConfig(
        name="research_answer",
        prompt="Research and answer thoroughly:\n\n{{ research_question }}",
        model_alias="search_rollout_model",
        system_prompt=SYSTEM_PROMPT,
        tool_alias="knowledge-base",
        with_trace=dd.TraceType.ALL_MESSAGES,
        extract_reasoning_content=True,
    )
)

Two settings are doing the important work here. with_trace=dd.TraceType.ALL_MESSAGES captures the entire interaction (every tool call, every tool response, every intermediate reasoning step) into a separate trace column in ChatML format. This is the training data: the full trajectory of how the model moved through the information space. extract_reasoning_content=True pulls out the model’s internal chain-of-thought separately, so you can include or exclude it depending on your training setup.
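Once traces are captured, a small summary function is handy for inspecting them. The sketch below assumes a trace is a list of message dicts with `role` and `tool_calls` fields, as in the JSON examples later in this post; the per-call shape (`{"name": ..., "arguments": ...}`) and the `summarize_trace` helper are assumptions, and the actual trace column may nest these fields differently.

```python
from collections import Counter


def summarize_trace(messages: list[dict]) -> dict:
    """Count messages, assistant turns, and tool calls in one trajectory."""
    calls_by_tool: Counter = Counter()
    assistant_turns = 0
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        assistant_turns += 1
        # tool_calls may be missing or null on the final answer message.
        for call in msg.get("tool_calls") or []:
            calls_by_tool[call.get("name", "unknown")] += 1
    return {
        "num_messages": len(messages),
        "assistant_turns": assistant_turns,
        "num_tool_calls": sum(calls_by_tool.values()),
        "calls_by_tool": dict(calls_by_tool),
    }


# Hypothetical miniature trace for illustration.
trace = [
    {"role": "system", "content": "You are a thorough research assistant."},
    {"role": "user", "content": "Who was Trajan's mother?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"name": "search", "arguments": {"query": "Trajan mother"}},
        {"name": "search", "arguments": {"query": "Trajan family"}},
    ]},
    {"role": "tool", "content": "...results..."},
    {"role": "assistant", "content": "", "tool_calls": [
        {"name": "open", "arguments": {"doc_id": "md_002075"}},
    ]},
    {"role": "tool", "content": "...document..."},
    {"role": "assistant", "content": "The answer is Marcia.", "tool_calls": None},
]
print(summarize_trace(trace))
```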


Step 4: Rejection Sampling with an LLM Judge

Not every trajectory leads to a correct answer. OpenResearcher’s approach is straightforward: generate multiple trajectories per question, score them for correctness, and keep only the ones that got the right answer. We implement this with Data Designer’s LLMJudgeColumnConfig, using a separate (smaller) model as the judge:

# Judge model for rejection sampling
config.add_model_config(
    dd.ModelConfig(
        alias="judge",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
    )
)

config.add_column(
    dd.LLMJudgeColumnConfig(
        name="correctness",
        model_alias="judge",
        prompt=(
            "Question: {{ research_question }}\n"
            "Reference answer: {{ answer }}\n"
            "Generated answer: {{ research_answer }}\n"
            "Does the generated answer correctly address the question?"
        ),
        scores=[
            dd.Score(
                name="correct",
                description="Is the answer factually correct?",
                options={
                    1: "Correct",
                    0: "Incorrect",
                },
            ),
        ],
    )
)

The judge compares the generated answer against the reference answer from the seed dataset. Using a smaller model as judge is deliberate. We don’t need the judge to reason about the question, just to compare two answers for factual agreement. This keeps costs down when scoring thousands of trajectories.

In practice, you’d generate multiple trajectories per question (varying the random seed) and filter to correctness.correct == 1. The incorrect trajectories aren’t wasted; they can serve as negative examples for preference-based training methods like DPO.
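The filtering and pairing step can be sketched in a few lines. This example assumes each generated record is a dict in which the judge output is nested as `record["correctness"]["correct"]`; that layout and the helper names are assumptions for illustration, so adapt the accessors to however your exported dataset stores the judge column.

```python
from collections import defaultdict
from itertools import product


def split_by_correctness(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate accepted (correct == 1) trajectories from rejected ones."""
    accepted = [r for r in records if r["correctness"]["correct"] == 1]
    rejected = [r for r in records if r["correctness"]["correct"] != 1]
    return accepted, rejected


def build_preference_pairs(records: list[dict]) -> list[dict]:
    """Pair each correct answer with each incorrect answer to the same question."""
    groups: dict[str, dict[str, list[str]]] = defaultdict(
        lambda: {"chosen": [], "rejected": []}
    )
    for r in records:
        key = "chosen" if r["correctness"]["correct"] == 1 else "rejected"
        groups[r["research_question"]][key].append(r["research_answer"])
    pairs = []
    for question, group in groups.items():
        for chosen, rejected in product(group["chosen"], group["rejected"]):
            pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical records: three seeds for q1, one for q2.
records = [
    {"research_question": "q1", "research_answer": "Marcia", "correctness": {"correct": 1}},
    {"research_question": "q1", "research_answer": "Plotina", "correctness": {"correct": 0}},
    {"research_question": "q1", "research_answer": "Unknown", "correctness": {"correct": 0}},
    {"research_question": "q2", "research_answer": "Chad", "correctness": {"correct": 1}},
]
accepted, rejected = split_by_correctness(records)
pairs = build_preference_pairs(records)
print(len(accepted), len(rejected), len(pairs))
```

Questions with only correct or only incorrect trajectories produce no pairs, so they contribute to supervised fine-tuning data (if correct) but not to preference data.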


Multi-Turn Tool Calling: Rough Edges in the Open Model Ecosystem

The pipeline described above is straightforward in principle. In practice, getting multi-turn tool calling to work reliably with open-weight models served through vLLM turned out to be the hardest part of this project.

We tested two open-weight models on a self-hosted vLLM (v0.15.1) instance: GPT-OSS-120B and Kimi K2.5. Both failed to produce usable research trajectories, for related but distinct reasons.

GPT-OSS-120B uses a “Harmony” output format that routes text through named channels (reasoning, final answer, tool calls). When tools are involved, vLLM’s parser consistently routes the model’s output to the wrong channel: everything lands in reasoning_content while the content field stays empty. This happens at all reasoning_effort levels. The model does the research (calls tools, reads documents, formulates queries) but the final synthesized answer never appears where the serving layer expects it. This is a known issue in vLLM’s Harmony format handling. Here’s the final message from a typical trajectory. The model has been researching for 5 tool calls but produces no answer:

{
  "role": "assistant",
  "content": [{"type": "text", "text": ""}],
  "reasoning_content": "It seems that the knowledge base may have a page about Colin Bateman that includes his biography. Possibly the 'md_001100' entry is about a footballer, not the author. The author Colin Bateman likely ...",
  "tool_calls": null
}

The model’s reasoning shows it has the answer (it identified Colin Bateman as the author), but the content field is empty and no tool call is emitted. The trajectory ends here with nothing to show for it.

Kimi K2.5 exhibits a different failure mode. With its thinking mode enabled, it has the same channel-routing problem as GPT-OSS. With thinking mode disabled, the model produces content text, but after the first tool result, it narrates what it plans to do next rather than emitting another tool call. The serving layer sees text content without tool calls and treats it as the final answer, terminating the research loop after a single search:

{
  "role": "assistant",
  "content": "I found that 'Cycle of Violence' was written by Colin Bateman, described as a 'Northern Irish author'. Now let me search for more details about his birthplace to confirm his birth country.",
  "reasoning_content": "The search results clearly show that 'Cycle of Violence' was written by Colin Bateman, a Northern Irish author...",
  "tool_calls": null
}

The model intends to keep researching (“let me search for more details”) but describes the action instead of calling the tool. The framework sees content, no tool calls, and stops. We tried multiple tokenizer modes, prompt variations, and vLLM configurations; open issues on the model’s HuggingFace page confirm this is a broader compatibility gap.

The original OpenResearcher codebase handles this by bypassing vLLM’s tool call parser entirely. They hit the raw /completions endpoint (openai_generator.py), parse <tool_call> XML tags from the output with regex, and continue looping until the model emits an explicit answer marker like <answer> or final answer: (deploy_agent.py).
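The parse-and-loop pattern described above can be sketched as follows. This is an illustrative reconstruction, not the OpenResearcher code: it assumes each `<tool_call>` block wraps a JSON object with `name` and `arguments` keys, the `<tool_response>` wrapper is a hypothetical convention, and `generate` and `call_tool` are caller-supplied functions.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def parse_model_output(text: str) -> dict:
    """Classify raw completion text as an answer, tool calls, or neither."""
    answer = ANSWER_RE.search(text)
    if answer:
        return {"type": "answer", "answer": answer.group(1)}
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of aborting the rollout
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    if calls:
        return {"type": "tool_calls", "calls": calls}
    # Plain narration: neither an answer nor a tool call, so keep looping.
    return {"type": "continue"}


def research_loop(generate, call_tool, prompt: str, max_turns: int = 150):
    """Loop until the model emits an explicit answer marker or turns run out."""
    transcript = prompt
    for _ in range(max_turns):
        output = generate(transcript)
        transcript += output
        parsed = parse_model_output(output)
        if parsed["type"] == "answer":
            return parsed["answer"]
        for call in parsed.get("calls", []):
            result = call_tool(call["name"], call.get("arguments", {}))
            transcript += f"\n<tool_response>{json.dumps(result)}</tool_response>\n"
    return None
```

The key difference from the failure mode shown earlier is the `continue` branch: text without a tool call is not treated as a final answer, so a model that narrates its next step gets another turn instead of ending the trajectory.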

The open-source tool-calling stack is growing and maturing quickly, but multi-turn tool use with reasoning models is still a rough edge. For now, the practical path is to use models with battle-tested tool-calling support through their native APIs, which is what we do in the results below.


Results

We ran 64 questions uniformly sampled across 2, 3, and 4-hop difficulty levels from MuSiQue, with 50K FineWeb web documents as distractors (a 1:100 golden-to-distractor ratio). We tested two models, Claude Opus 4.5 (via API) and Nemotron Nano 3 (30B total / 3B active params, self-hosted via vLLM with reasoning disabled).

| Metric | Claude Opus 4.5 | Nemotron Nano 3 |
|---|---|---|
| Samples | 64 (55 completed) | 64 (61 completed) |
| Overall accuracy | 41/55 (75%) | 32/61 (52%) |
| 2-hop accuracy | 18/23 (78%) | 13/23 (57%) |
| 3-hop accuracy | 15/18 (83%) | 11/22 (50%) |
| 4-hop accuracy | 8/14 (57%) | 8/16 (50%) |
| Avg tool calls | 16.8 | 11.8 |
| Max tool calls | 57 | 63 |
| Avg messages per trajectory | 40.4 | 26.5 |
| Max messages per trajectory | 117 | 129 |

Opus is 22 points more accurate, but Nano runs roughly 5x faster on self-hosted hardware. Tool usage generally grows with hop count, though Nano’s 4-hop trajectories are shorter than its 3-hop ones. Nano makes fewer tool calls on average (11.8 vs 16.8) and achieves lower accuracy, with the largest gap on 3-hop questions (83% vs 50%). Splitting by correctness reveals the same pattern in both models: incorrect trajectories are longer.

Claude Opus 4.5:

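| Outcome | Hops | Count | Avg Tool Calls | Avg Messages | Avg Answer Length |
|---|---|---|---|---|---|
| Correct | 2 | 18 | 7.3 | 18.9 | 1,072 chars |
| Correct | 3 | 15 | 14.9 | 35.7 | 1,372 chars |
| Correct | 4 | 8 | 21.0 | 50.6 | 1,705 chars |
| Correct | All | 41 | 12.8 | 31.2 | 1,305 chars |
| Incorrect | 2 | 5 | 21.0 | 48.6 | 1,534 chars |
| Incorrect | 3 | 3 | 25.7 | 63.0 | 1,795 chars |
| Incorrect | 4 | 6 | 36.0 | 85.2 | 1,903 chars |
| Incorrect | All | 14 | 28.4 | 67.4 | 1,748 chars |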

Nemotron Nano 3:

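| Outcome | Hops | Count | Avg Tool Calls | Avg Messages | Avg Answer Length |
|---|---|---|---|---|---|
| Correct | 2 | 13 | 6.5 | 16.1 | 773 chars |
| Correct | 3 | 11 | 12.7 | 28.5 | 708 chars |
| Correct | 4 | 8 | 8.0 | 19.0 | 1,600 chars |
| Correct | All | 32 | 9.0 | 21.1 | 957 chars |
| Incorrect | 2 | 10 | 10.1 | 23.2 | 799 chars |
| Incorrect | 3 | 11 | 18.0 | 39.0 | 1,163 chars |
| Incorrect | 4 | 8 | 16.2 | 35.5 | 848 chars |
| Incorrect | All | 29 | 14.8 | 32.6 | 951 chars |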

Correct trajectories are shorter at every hop level for both models. Incorrect trajectories use roughly 1.6x (Nano) to 2.2x (Opus) as many tool calls, because the model keeps searching when it can’t find evidence. Opus also writes longer answers when it is wrong (1,748 vs 1,305 chars on average), while Nano’s answer length is about the same either way (951 vs 957 chars). This anti-correlation between trajectory length and correctness holds for both models, which suggests trajectory length alone could serve as a lightweight filter during rejection sampling.
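A length-based filter of this kind could be evaluated on a small judged sample before being applied at scale. The sketch below is one way to do that; the record fields (`num_tool_calls`, `correct`), the helper names, and the toy numbers are all hypothetical and are not drawn from the results above.

```python
def length_filter(trajectories: list[dict], max_tool_calls: int) -> list[dict]:
    """Keep trajectories that used at most max_tool_calls tool calls."""
    return [t for t in trajectories if t["num_tool_calls"] <= max_tool_calls]


def filter_quality(trajectories: list[dict], max_tool_calls: int) -> dict:
    """Measure how well the length cutoff approximates the judge's verdict."""
    kept = length_filter(trajectories, max_tool_calls)
    total_correct = sum(t["correct"] for t in trajectories)
    kept_correct = sum(t["correct"] for t in kept)
    return {
        "kept": len(kept),
        # Share of kept trajectories that the judge marked correct.
        "precision": kept_correct / len(kept) if kept else 0.0,
        # Share of all correct trajectories that survive the cutoff.
        "recall": kept_correct / total_correct if total_correct else 0.0,
    }


# Toy judged sample (illustrative values only).
sample = [
    {"num_tool_calls": 7, "correct": True},
    {"num_tool_calls": 15, "correct": True},
    {"num_tool_calls": 22, "correct": True},
    {"num_tool_calls": 21, "correct": False},
    {"num_tool_calls": 36, "correct": False},
]
print(filter_quality(sample, max_tool_calls=20))
```

The recall figure matters as much as precision here: a tight cutoff discards the long but correct trajectories, which are often the hardest and most instructive examples.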


Closing Remarks

Thanks to the OpenResearcher team for their work showing that synthetic research trajectories over local retrieval can train small models to compete with much larger ones. Their results suggest we’re only beginning to understand how LLMs interact with search tools and how the structure of those interactions shapes what models learn. We’re excited to see where the community takes synthetic data research using NeMo Data Designer as both the models and the tooling continue to improve.


Try For Yourself

Full source: openresearcher_demo.py
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Models
config = dd.DataDesignerConfigBuilder()
config.add_model_config(
    dd.ModelConfig(
        alias="search_rollout_model",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=0.95,
            max_tokens=16384,
        ),
    )
)
config.add_model_config(
    dd.ModelConfig(
        alias="judge",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
    )
)

# MCP retriever
tool_config = dd.ToolConfig(
    tool_alias="knowledge-base",
    providers=["corpus-retriever"],
    max_tool_call_turns=150,
)
config.add_tool_config(tool_config)

# Seed questions with reference answers
config.with_seed_dataset(
    dd.LocalFileSeedSource(path="questions.jsonl"),
)

config.add_column(
    dd.ExpressionColumnConfig(
        name="research_question",
        expr="{{ question }}",
    )
)

# System prompt (same text as in Step 3; it must be defined before the column uses it)
SYSTEM_PROMPT = """You are a thorough research assistant. You have access to three tools \
for navigating a knowledge base:
- search(query, top_k): Find candidate documents relevant to your query
- open(doc_id): Open a document to read its full content in numbered chunks
- find(doc_id, query): Locate specific passages within a document by keyword

Your task is to research the given question by searching for relevant documents, \
reading their content, and synthesizing an answer from the evidence you find. \
Be systematic: formulate search queries, explore promising results, and gather \
evidence before answering. Cite specific passages when possible."""

# Research trajectory generation
config.add_column(
    dd.LLMTextColumnConfig(
        name="research_answer",
        prompt="Research and answer thoroughly:\n\n{{ research_question }}",
        model_alias="search_rollout_model",
        system_prompt=SYSTEM_PROMPT,
        tool_alias="knowledge-base",
        with_trace=dd.TraceType.ALL_MESSAGES,
        extract_reasoning_content=True,
    )
)

# Rejection sampling judge
config.add_column(
    dd.LLMJudgeColumnConfig(
        name="correctness",
        model_alias="judge",
        prompt=(
            "Question: {{ research_question }}\n"
            "Reference answer: {{ answer }}\n"
            "Generated answer: {{ research_answer }}\n"
            "Does the generated answer correctly address the question?"
        ),
        scores=[
            dd.Score(
                name="correct",
                description="Is the answer factually correct?",
                options={
                    1: "Correct",
                    0: "Incorrect",
                },
            ),
        ],
    )
)

# Run
mcp_provider = dd.LocalStdioMCPProvider(
    name="corpus-retriever",
    command="uv",
    args=["run", "retriever_mcp.py", "serve"],
    env={"CORPUS_PATH": "corpus.jsonl"},
)
data_designer = DataDesigner(mcp_providers=[mcp_provider])
results = data_designer.create(
    config_builder=config,
    num_records=1000,
    dataset_name="research-trajectories",
)
Full source: prepare_corpus.py
1# /// script
2# requires-python = ">=3.10"
3# dependencies = ["datasets", "huggingface_hub", "pyarrow"]
4# ///
5
6"""Prepare a retrieval corpus and question set for the OpenResearcher demo.
7
8Builds corpus.jsonl and questions.jsonl from two sources:
9
10 1. MuSiQue — multi-hop QA dataset (2/3/4-hop) with golden passages
11 2. FineWeb — web documents as distractors (matches the OpenResearcher paper)
12
13Golden passages (documents containing evidence for the answer) are mixed with
14FineWeb distractors at roughly 1:100 ratio, so the model must search through
15noise to find the signal.
16
17Usage:
18 uv run prepare_corpus.py
19"""
20
21from __future__ import annotations
22
23import json
24import random
25from pathlib import Path
26from urllib.parse import urlparse
27
28# ---------------------------------------------------------------------------
29# Configuration
30# ---------------------------------------------------------------------------
31
32NUM_QUESTIONS = 192 # 64 per hop level (2, 3, 4)
33NUM_FINEWEB_DISTRACTORS = 50_000
34FINEWEB_SHARD = 0
35OUTPUT_DIR = "data"
36
37# ---------------------------------------------------------------------------
38# MuSiQue extraction
39# ---------------------------------------------------------------------------
40
41def prepare_musique(num_questions: int) -> tuple[list[dict], list[dict]]:
42 """Load MuSiQue and extract multi-hop questions with golden passages.
43
44 Samples uniformly across hop counts (2, 3, 4) so the dataset has balanced
45 difficulty. Golden passages (is_supporting=True) go into the corpus;
46 non-golden passages from the same examples serve as additional distractors.
47
48 Returns:
49 (questions, corpus_docs) where corpus_docs have is_golden=True/False.
50 """
51 from datasets import load_dataset
52
53 print("Loading MuSiQue (train split)...")
54 dataset = load_dataset("bdsaglam/musique", split="train")
55
56 # Bucket answerable examples by hop count
57 hop_buckets: dict[int, list[dict]] = {}
58 for example in dataset:
59 if not example.get("answerable", False):
60 continue
61 num_hops = len(example.get("question_decomposition", []))
62 if num_hops < 2:
63 continue
64 hop_buckets.setdefault(num_hops, []).append(example)
65
66 # Sample uniformly: equal questions per hop level
67 available_hops = sorted(hop_buckets.keys())
68 per_hop = num_questions // len(available_hops)
69 selected_examples = []
70 for h in available_hops:
71 bucket = hop_buckets[h]
72 n = min(per_hop, len(bucket))
73 selected_examples.extend(random.sample(bucket, n))
74
75 print(f" Selected {len(selected_examples)} questions across hops {available_hops}")
76
77 # Build questions and corpus docs
78 questions: list[dict] = []
79 golden_titles: dict[str, str] = {}
80 nongolden_titles: dict[str, str] = {}
81
82 for example in selected_examples:
83 num_hops = len(example["question_decomposition"])
84 questions.append({
85 "id": f"mq_{len(questions):06d}",
86 "question": example["question"],
87 "answer": example["answer"],
88 "source": "musique",
89 "num_hops": num_hops,
90 "seed_id": 0,
91 })
92
93 for para in example.get("paragraphs", []):
94 title = para.get("title", "").strip()
95 content = para.get("paragraph_text", "").strip()
96 if not title or not content:
97 continue
98 if para.get("is_supporting", False):
99 if len(content) > len(golden_titles.get(title, "")):
100 golden_titles[title] = content
101 else:
102 if len(content) > len(nongolden_titles.get(title, "")):
103 nongolden_titles[title] = content
104
105 # Golden passages
106 corpus_docs = [
107 {"title": t, "content": c, "source": "musique", "is_golden": True}
108 for t, c in sorted(golden_titles.items())
109 ]
110 # Non-golden passages (skip titles already in golden set)
111 corpus_docs.extend(
112 {"title": t, "content": c, "source": "musique", "is_golden": False}
113 for t, c in sorted(nongolden_titles.items())
114 if t not in golden_titles
115 )
116
117 print(f" Golden passages: {len(golden_titles)}")
118 print(f" Non-golden passages: {len(corpus_docs) - len(golden_titles)}")
119 return questions, corpus_docs
120
121
122# ---------------------------------------------------------------------------
123# FineWeb distractor caching
124# ---------------------------------------------------------------------------
125
126def cache_fineweb(shard_index: int, max_docs: int) -> list[dict]:
127 """Download a FineWeb parquet shard and extract English documents.
128
129 Uses huggingface_hub for direct shard download (faster than load_dataset)
130 and pyarrow for memory-efficient row-group-at-a-time reading.
131
132 Returns:
133 List of distractor documents with title (domain) and content (text).
134 """
135 from huggingface_hub import hf_hub_download
136 import pyarrow.parquet as pq
137
138 filename = f"sample/10BT/{shard_index:03d}_00000.parquet"
139 print(f"Downloading FineWeb shard: {filename}")
140 parquet_path = hf_hub_download(
141 repo_id="HuggingFaceFW/fineweb",
142 repo_type="dataset",
        filename=filename,
    )

    pf = pq.ParquetFile(parquet_path)
    print(f" {pf.metadata.num_rows:,} rows in shard")

    docs: list[dict] = []
    for rg_idx in range(pf.metadata.num_row_groups):
        table = pf.read_row_group(rg_idx, columns=["text", "url", "language", "token_count"])
        batch = table.to_pydict()

        for text, url, lang, tok_count in zip(
            batch["text"], batch["url"], batch["language"], batch["token_count"]
        ):
            if lang != "en" or tok_count < 50:
                continue
            text = text.strip()
            if not text:
                continue

            # Use domain as title
            try:
                domain = urlparse(url).netloc.removeprefix("www.")
            except Exception:
                domain = "unknown"

            docs.append({
                "title": domain,
                "content": text,
                "source": "fineweb",
                "is_golden": False,
            })
            if len(docs) >= max_docs:
                break

        if len(docs) >= max_docs:
            break

    print(f" Extracted {len(docs):,} English documents (min 50 tokens)")
    return docs


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

def main() -> None:
    output_dir = Path(OUTPUT_DIR)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Extract MuSiQue questions and golden passages
    questions, corpus_docs = prepare_musique(NUM_QUESTIONS)

    # Download FineWeb distractors
    fineweb_docs = cache_fineweb(FINEWEB_SHARD, NUM_FINEWEB_DISTRACTORS)
    corpus_docs.extend(fineweb_docs)

    # Deduplicate by title (keep longest content)
    title_to_best: dict[str, dict] = {}
    for doc in corpus_docs:
        title = doc["title"]
        if title not in title_to_best or len(doc["content"]) > len(title_to_best[title]["content"]):
            title_to_best[title] = doc

    corpus = list(title_to_best.values())
    random.shuffle(corpus)

    # Assign stable IDs
    prefix_map = {"musique": "md", "fineweb": "fw"}
    source_counters: dict[str, int] = {}
    for doc in corpus:
        prefix = prefix_map.get(doc["source"], "xx")
        idx = source_counters.get(doc["source"], 0)
        doc["id"] = f"{prefix}_{idx:06d}"
        source_counters[doc["source"]] = idx + 1

    # Write corpus.jsonl
    corpus_path = output_dir / "corpus.jsonl"
    with open(corpus_path, "w") as f:
        for doc in corpus:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

    # Write questions.jsonl
    random.shuffle(questions)
    questions_path = output_dir / "questions.jsonl"
    with open(questions_path, "w") as f:
        for q in questions:
            f.write(json.dumps(q, ensure_ascii=False) + "\n")

    # Summary
    golden = sum(1 for d in corpus if d["is_golden"])
    nongolden = len(corpus) - golden
    print(f"\nCorpus: {len(corpus):,} docs ({golden} golden, {nongolden} distractors)")
    print(f"Questions: {len(questions)}")
    print(f"Output: {corpus_path.resolve()}")
    print(f" {questions_path.resolve()}")


if __name__ == "__main__":
    main()
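
The deduplication and ID-assignment steps in main() can be tried in isolation on a toy corpus. The sketch below is an illustration, not part of the script: the helper names dedupe_by_title and assign_ids and the three toy documents are made up here, and the random.shuffle call is left out so the result is deterministic. It shows one consequence of using the domain as the FineWeb title: two pages from the same domain collapse into a single entry, and the longer one is kept.

```python
# Toy illustration of the dedup and ID-assignment logic from main().
# Helper names and documents are hypothetical; the shuffle step is omitted.

def dedupe_by_title(docs: list[dict]) -> list[dict]:
    """Keep one document per title, preferring the longest content."""
    best: dict[str, dict] = {}
    for doc in docs:
        title = doc["title"]
        if title not in best or len(doc["content"]) > len(best[title]["content"]):
            best[title] = doc
    return list(best.values())


def assign_ids(docs: list[dict], prefix_map: dict[str, str]) -> list[dict]:
    """Assign per-source sequential IDs such as fw_000000 or md_000000."""
    counters: dict[str, int] = {}
    for doc in docs:
        prefix = prefix_map.get(doc["source"], "xx")
        idx = counters.get(doc["source"], 0)
        doc["id"] = f"{prefix}_{idx:06d}"
        counters[doc["source"]] = idx + 1
    return docs


toy_docs = [
    {"title": "example.com", "content": "short page", "source": "fineweb", "is_golden": False},
    {"title": "example.com", "content": "a much longer page from the same domain",
     "source": "fineweb", "is_golden": False},
    {"title": "Christopher Nolan", "content": "Christopher Edward Nolan is a...",
     "source": "musique", "is_golden": True},
]

corpus = assign_ids(dedupe_by_title(toy_docs), {"musique": "md", "fineweb": "fw"})
for doc in corpus:
    print(doc["id"], doc["title"])
```

Because the counter is kept per source, each prefix starts at zero independently of the other.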
Full source: retriever_mcp.py
# /// script
# requires-python = ">=3.10"
# dependencies = ["mcp", "bm25s", "PyStemmer"]
# ///

"""MCP Server: BM25S Corpus Retriever for OpenResearcher-style Deep Research

A single-file MCP server that indexes a JSONL corpus and exposes BM25S
lexical search via three browser tools:

    - search(query, top_k): ranked document discovery
    - open(doc_id): full document inspection with cursor-numbered chunks
    - find(doc_id, query): in-document evidence lookup

Corpus format (JSONL, one document per line):
    {"id": "wiki_123", "title": "Christopher Nolan", "content": "Christopher Edward Nolan is a..."}

Server mode (used by Data Designer):
    CORPUS_PATH=corpus.jsonl uv run retriever_mcp.py serve
"""

from __future__ import annotations

import argparse
import json
import os
import re
import sys

import bm25s
from mcp.server.fastmcp import FastMCP

MCP_SERVER_NAME = "corpus-retriever"

# Global state — populated at server startup
_bm25_retriever: bm25s.BM25 | None = None
_corpus: list[dict[str, str]] = []
_id_to_index: dict[str, int] = {}

mcp_server = FastMCP(MCP_SERVER_NAME)


def load_corpus(corpus_path: str) -> list[dict[str, str]]:
    """Load a JSONL corpus file into a list of document dicts."""
    docs: list[dict[str, str]] = []
    with open(corpus_path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                doc = json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Warning: skipping malformed JSON at line {line_num}: {e}", file=sys.stderr)
                continue
            if "id" not in doc or "content" not in doc:
                print(f"Warning: skipping line {line_num}, missing 'id' or 'content'", file=sys.stderr)
                continue
            docs.append({
                "id": str(doc["id"]),
                "title": str(doc.get("title", "")),
                "content": str(doc["content"]),
            })
    return docs


def build_index(docs: list[dict[str, str]]) -> bm25s.BM25:
    """Build a BM25S index over title + content for each document."""
    corpus_texts = [f"{d['title']} {d['content']}" for d in docs]
    corpus_tokens = bm25s.tokenize(corpus_texts, stopwords="en")
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    return retriever


def initialize(corpus_path: str) -> None:
    """Load corpus and build index into global state."""
    global _bm25_retriever, _corpus, _id_to_index
    print(f"Loading corpus from {corpus_path}...", file=sys.stderr)
    _corpus = load_corpus(corpus_path)
    if not _corpus:
        print("Warning: corpus is empty", file=sys.stderr)
        return
    _id_to_index = {doc["id"]: idx for idx, doc in enumerate(_corpus)}
    print(f"Building BM25S index over {len(_corpus)} documents...", file=sys.stderr)
    _bm25_retriever = build_index(_corpus)
    print(f"Index ready. {len(_corpus)} documents indexed.", file=sys.stderr)


def _chunk_content(content: str) -> list[str]:
    """Split document content into cursor-addressable chunks."""
    paragraph_chunks = [c.strip() for c in re.split(r"\n\s*\n+", content) if c.strip()]
    if len(paragraph_chunks) > 1:
        return paragraph_chunks
    line_chunks = [line.strip() for line in content.splitlines() if line.strip()]
    if line_chunks:
        return line_chunks
    stripped = content.strip()
    return [stripped] if stripped else []


@mcp_server.tool()
def search(query: str, top_k: int = 10) -> dict:
    """Search for candidate documents to explore.

    Args:
        query: Search query string.
        top_k: Maximum number of ranked results (default: 10).
    """
    global _bm25_retriever, _corpus
    if _bm25_retriever is None or not _corpus:
        return {"error": "Search index not initialized", "results": []}
    query_tokens = bm25s.tokenize([query], stopwords="en")
    k = max(1, min(top_k, len(_corpus)))
    results, scores = _bm25_retriever.retrieve(query_tokens, k=k)
    search_results: list[dict] = []
    for i in range(results.shape[1]):
        doc_idx = results[0, i]
        score = float(scores[0, i])
        if score <= 0:
            continue
        doc = _corpus[doc_idx]
        snippet = doc["content"][:500]
        if len(doc["content"]) > 500:
            snippet += "..."
        search_results.append({
            "id": doc["id"],
            "title": doc["title"],
            "snippet": snippet,
            "score": round(score, 4),
        })
    return {"results": search_results, "query": query, "total": len(search_results)}


@mcp_server.tool(name="open")
def open_document(doc_id: str) -> dict:
    """Open a document for detailed inspection with cursor-numbered chunks.

    Args:
        doc_id: The document ID (from search results).
    """
    global _corpus, _id_to_index
    if not _corpus:
        return {"error": "Corpus not loaded"}
    idx = _id_to_index.get(doc_id)
    if idx is None:
        return {"error": f"Document not found: {doc_id}"}
    doc = _corpus[idx]
    chunks = _chunk_content(doc["content"])
    numbered_chunks = [{"cursor": i + 1, "text": chunk} for i, chunk in enumerate(chunks)]
    formatted = "\n".join(f"[{e['cursor']}] {e['text']}" for e in numbered_chunks)
    return {
        "id": doc["id"],
        "title": doc["title"],
        "content": formatted,
        "chunks": numbered_chunks,
        "total_chunks": len(numbered_chunks),
    }


@mcp_server.tool()
def find(doc_id: str, query: str) -> dict:
    """Find matching passages inside a document by keyword.

    Args:
        doc_id: Document ID to search within.
        query: Text to find (case-insensitive substring and keyword matching).
    """
    global _corpus, _id_to_index
    if not _corpus:
        return {"error": "Corpus not loaded", "matches": []}
    idx = _id_to_index.get(doc_id)
    if idx is None:
        return {"error": f"Document not found: {doc_id}", "matches": []}
    query_text = query.strip().lower()
    if not query_text:
        return {"error": "Query must be non-empty", "matches": []}
    doc = _corpus[idx]
    chunks = _chunk_content(doc["content"])
    query_terms = [term for term in re.findall(r"\w+", query_text) if term]
    matches: list[dict] = []
    for i, chunk in enumerate(chunks, start=1):
        haystack = chunk.lower()
        if query_text in haystack or (query_terms and all(t in haystack for t in query_terms)):
            matches.append({"cursor": i, "text": chunk})
    return {
        "doc_id": doc["id"],
        "title": doc["title"],
        "query": query,
        "matches": matches,
        "total_matches": len(matches),
    }


def serve() -> None:
    """Run as MCP server subprocess (called by Data Designer)."""
    corpus_path = os.environ.get("CORPUS_PATH", "corpus.jsonl")
    initialize(corpus_path)
    mcp_server.run()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="BM25S corpus retriever MCP server")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.add_parser("serve", help="Run the MCP server (reads CORPUS_PATH from env)")
    stats_parser = subparsers.add_parser("stats", help="Print corpus statistics")
    stats_parser.add_argument("--corpus-path", default="corpus.jsonl")
    args = parser.parse_args()
    if args.command == "serve":
        serve()
    elif args.command == "stats":
        docs = load_corpus(args.corpus_path)
        total_chars = sum(len(d["content"]) for d in docs)
        print(f"Corpus: {args.corpus_path}")
        print(f"Documents: {len(docs)}")
        print(f"Total content: {total_chars:,} chars (~{total_chars // 4:,} tokens)")
    else:
        parser.print_help()
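
The chunking and in-document matching that back the open and find tools can be exercised without starting the MCP server. The following is a minimal standard-library sketch that mirrors the logic of _chunk_content and find from the listing; the function names chunk_content and find_in_chunks and the sample document are introduced here for illustration and are not part of retriever_mcp.py.

```python
import re

# Simplified re-statement of the chunking and matching logic, for illustration.

def chunk_content(content: str) -> list[str]:
    """Use paragraph chunks when there are several, otherwise non-empty lines."""
    paragraphs = [c.strip() for c in re.split(r"\n\s*\n+", content) if c.strip()]
    if len(paragraphs) > 1:
        return paragraphs
    return [line.strip() for line in content.splitlines() if line.strip()]


def find_in_chunks(chunks: list[str], query: str) -> list[dict]:
    """Return chunks containing the whole query, or every query term, as substrings."""
    query_text = query.strip().lower()
    if not query_text:
        return []  # the server returns an error payload here instead
    terms = re.findall(r"\w+", query_text)
    matches = []
    for cursor, chunk in enumerate(chunks, start=1):  # cursors are 1-based
        haystack = chunk.lower()
        if query_text in haystack or (terms and all(t in haystack for t in terms)):
            matches.append({"cursor": cursor, "text": chunk})
    return matches


# Made-up sample document with three paragraphs.
sample = (
    "Christopher Nolan was born in London.\n\n"
    "He directed Inception in 2010.\n\n"
    "Inception stars Leonardo DiCaprio."
)

chunks = chunk_content(sample)
hits = find_in_chunks(chunks, "inception 2010")
for hit in hits:
    print(f"[{hit['cursor']}] {hit['text']}")
```

Term matching here is substring-based rather than token-based, so a short term such as "in" also matches inside "Inception". That keeps the lookup simple, at the cost of some false positives on very short query terms.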

Key Resources:

  1. NeMo Data Designer on GitHub
  2. OpenResearcher on GitHub
  3. OpenResearcher blog post
  4. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
  5. MuSiQue: Multi-hop Questions via Single-hop Question Composition
  6. BM25S: Fast lexical search in Python