Data Designer v0.5.0’s MCP tool-use support lets you generate multi-turn research trajectories, the kind of data needed to train deep research agents that iteratively search, read, and synthesize evidence before answering a question.

Deep research agents like OpenResearcher (Li, Jiang, Ma et al., 2026) and Universal Deep Research (Belcak & Molchanov, 2025) generate long reasoning chains interleaved with tool calls: formulating queries, retrieving documents, reading passages, refining hypotheses, and eventually synthesizing an answer. Training these agents requires trajectory data capturing the full multi-turn interaction between a model and its tools: every search, every document opened, every dead end explored.
OpenResearcher demonstrated something worth paying attention to: synthetic trajectories generated against a local retriever (BM25 over a static corpus, no web APIs) are sufficient to train Nemotron Nano 3 to outperform GPT-4.1 on deep research benchmarks. The data format (complete tool-use traces showing how a model moves through an information space) matters more than model scale. Nemotron Nano 3, with only 3B active parameters, beats models orders of magnitude larger on multi-hop research tasks.
This post shows how to generate that same kind of training data using Data Designer’s MCP tool-use capabilities. We build a retriever as an MCP server, construct a corpus with known-good evidence, run a teacher model through the full research process, and use an LLM judge for rejection sampling. The result is a pipeline that produces high-quality research trajectories you can use for supervised fine-tuning or as a starting point for RL.
Here’s what one of those trajectories looks like, a 4-hop question answered correctly by Claude Opus 4.5 using the pipeline described below. Each line is a tool call; parallel calls within the same turn are grouped.
OpenResearcher’s key design choice is a three-tool browser interface rather than a single retrieval call. The paper argues (and their ablations confirm) that separating search, document opening, and in-document search forces the model to develop genuine research strategies: skimming results, diving into promising documents, hunting for specific evidence within them. A single monolithic “retrieve” tool collapses this entire workflow into one step, which produces shorter and less useful training trajectories.
We implement the same three tools as an MCP server that Data Designer can invoke during generation. Our retriever uses BM25S for fast lexical search over the corpus:
search returns a ranked list of document IDs with short snippets, enough for the model to decide which documents look promising. open returns the full document content, split into cursor-numbered chunks so the model can reference specific passages. find does targeted keyword search within a single document, letting the model locate specific evidence without reading the entire thing. The cursor-based chunking across open and find gives the model a way to scan long documents incrementally, the way a human researcher would scan a paper for the relevant section rather than reading it cover to cover.
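To make the division of labor between the three tools concrete, here is a minimal in-memory sketch. It is not the retriever used in this post: it swaps BM25S for plain term-overlap scoring so it runs with the standard library alone, and the class name `ToyRetriever`, the `CHUNK_SIZE` value, and the return shapes are all assumptions made for illustration.

```python
import re
from collections import Counter

CHUNK_SIZE = 40  # words per cursor-numbered chunk (illustrative value, not from the post)


class ToyRetriever:
    """In-memory stand-in for the search / open / find tool trio."""

    def __init__(self, docs):
        # docs: mapping of document ID -> full document text
        self.docs = dict(docs)

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def _chunks(self, doc_id):
        words = self.docs[doc_id].split()
        return [" ".join(words[i:i + CHUNK_SIZE]) for i in range(0, len(words), CHUNK_SIZE)]

    def search(self, query, top_k=3):
        """Rank documents by query-term overlap (a simple substitute for BM25)."""
        query_terms = set(self._tokens(query))
        scored = []
        for doc_id, text in self.docs.items():
            counts = Counter(self._tokens(text))
            score = sum(counts[t] for t in query_terms)
            if score > 0:
                scored.append((score, doc_id))
        # Highest score first; ties broken by document ID for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [
            {"doc_id": doc_id, "score": score, "snippet": self.docs[doc_id][:80]}
            for score, doc_id in scored[:top_k]
        ]

    def open(self, doc_id):
        """Return the full document as cursor-numbered chunks."""
        return [{"cursor": i, "text": chunk} for i, chunk in enumerate(self._chunks(doc_id))]

    def find(self, doc_id, keyword):
        """Return only the chunks of one document that contain the keyword."""
        needle = keyword.lower()
        return [c for c in self.open(doc_id) if needle in c["text"].lower()]
```

The point of the sketch is the shape of the interface: `search` returns just enough to choose a document, `open` exposes cursors, and `find` reuses those cursors so a model can jump straight to the relevant chunk.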
The server runs as a local stdio process, which means Data Designer launches and manages it automatically. No external services, no API keys for retrieval, no rate limits.
The corpus design follows directly from OpenResearcher’s most striking ablation result. They tested what happens when you vary the retrieval corpus while keeping the reasoning model fixed (GPT-OSS-120B). The results, from the OpenResearcher Appendix:
Without golden passages (documents known to contain evidence for the question), accuracy drops to nearly zero. The model can’t learn research strategies from trajectories where every search is a dead end.
The original OpenResearcher corpus uses 15M documents from FineWeb as distractors alongside 10K golden passages. For this demonstration, we use a lighter-weight approach and construct the corpus from two multi-hop QA datasets: HotpotQA (2-hop questions requiring two pieces of linked evidence) and MuSiQue (2-4 hop questions composed from single-hop sub-questions). Each question comes with annotated supporting passages, the specific paragraphs that contain the evidence needed to answer it. Golden passages go into the corpus alongside non-supporting passages from the same datasets as distractors, at roughly a 1:9 ratio. The model has to search through noise to find the signal, which is exactly the skill we want the training data to teach.
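The mixing step can be sketched as follows. This is a simplified illustration rather than the corpus preparation script itself: the function name `build_corpus`, the passage dictionaries with an `id` key, and the `is_golden` flag are assumptions, and only the roughly 1:9 golden-to-distractor ratio comes from the text above.

```python
import random


def build_corpus(golden, distractor_pool, distractors_per_golden=9, seed=0):
    """Mix golden passages with sampled distractors at a fixed ratio.

    golden, distractor_pool: lists of dicts with at least an "id" key
    (a hypothetical schema chosen for this sketch).
    """
    rng = random.Random(seed)
    golden_ids = {p["id"] for p in golden}
    # A passage that supports one question must never be counted as a distractor.
    candidates = [p for p in distractor_pool if p["id"] not in golden_ids]
    n_distractors = min(len(candidates), distractors_per_golden * len(golden))
    distractors = rng.sample(candidates, n_distractors)

    corpus = [dict(p, is_golden=True) for p in golden]
    corpus += [dict(p, is_golden=False) for p in distractors]
    # Shuffle so golden passages are not clustered at the start of the index.
    rng.shuffle(corpus)
    return corpus
```

Keeping the ratio as a single parameter makes it easy to sweep difficulty: a larger `distractors_per_golden` buries the evidence deeper and should produce longer, harder trajectories.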
The key constraint is that golden passages must be findable but not obvious. If the corpus is too small or the golden passages are too easy to identify, the trajectories won’t transfer to real-world research where evidence is sparse. The distractor ratio controls this difficulty, and the paper’s ablations give us a good starting point for tuning it.
With the retriever server and corpus ready, the Data Designer pipeline ties everything together. We configure a teacher model, point it at the MCP retriever, and let it research each question from scratch. For this demo we hosted our own inference server, but anyone can try this pipeline with Nemotron Nano 3 on build.nvidia.com, using a free API key and the model configuration shown below.
The temperature and top_p settings matter here. We want diverse research strategies across seeds (different query formulations, different document exploration orders) so that rejection sampling has a rich pool to select from. Setting temperature to 1.0 with top_p at 0.95 gives enough variation that the same question can produce meaningfully different trajectories across seeds.
The MCP tool configuration tells Data Designer which server to use and how many tool-call turns to allow:
We set max_tool_call_turns high (150) because deep research trajectories can be long. Our longest observed trajectory used 25 tool calls across 53 messages. Capping too low would truncate the most interesting research chains.
The seed dataset contains the research questions alongside reference answers (which we’ll use for rejection sampling in Step 4):
The core of the pipeline is the research column, where the teacher model receives a question and a system prompt instructing it to use the retriever tools:
Two settings are doing the important work here. with_trace=dd.TraceType.ALL_MESSAGES captures the entire interaction (every tool call, every tool response, every intermediate reasoning step) into a separate trace column in ChatML format. This is the training data: the full trajectory of how the model moved through the information space. extract_reasoning_content=True pulls out the model’s internal chain-of-thought separately, so you can include or exclude it depending on your training setup.
Not every trajectory leads to a correct answer. OpenResearcher’s approach is straightforward: generate multiple trajectories per question, score them for correctness, and keep only the ones that got the right answer. We implement this with Data Designer’s LLMJudgeColumnConfig, using a separate (smaller) model as the judge:
The judge compares the generated answer against the reference answer from the seed dataset. Using a smaller model as judge is deliberate. We don’t need the judge to reason about the question, just to compare two answers for factual agreement. This keeps costs down when scoring thousands of trajectories.
In practice, you’d generate multiple trajectories per question (varying the random seed) and filter to correctness.correct == 1. The incorrect trajectories aren’t wasted; they can serve as negative examples for preference-based training methods like DPO.
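The filtering step can be sketched in plain Python once the judged records are out of the pipeline. The record schema here (`question_id`, `correct`, `trace`) is a hypothetical flattening of the pipeline output, and pairing every rejected trace with the first accepted one is just one simple pairing policy among many.

```python
from collections import defaultdict


def split_by_correctness(records):
    """Group judged trajectories per question into accepted and rejected sets.

    records: dicts with "question_id", "correct" (1 or 0), and "trace" keys
    (a hypothetical schema for this sketch).
    """
    accepted, rejected = defaultdict(list), defaultdict(list)
    for rec in records:
        bucket = accepted if rec["correct"] == 1 else rejected
        bucket[rec["question_id"]].append(rec)
    return accepted, rejected


def build_preference_pairs(accepted, rejected):
    """Pair each rejected trace with an accepted trace for the same question."""
    pairs = []
    for question_id, good in accepted.items():
        for bad in rejected.get(question_id, []):
            pairs.append({
                "question_id": question_id,
                "chosen": good[0]["trace"],
                "rejected": bad["trace"],
            })
    return pairs
```

Questions with no accepted trajectory produce no pairs, which is the desired behavior: without a known-good trace there is nothing to prefer.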
The pipeline described above is straightforward in principle. In practice, getting multi-turn tool calling to work reliably with open-weight models served through vLLM turned out to be the hardest part of this project.
We tested two open-weight models on a self-hosted vLLM (v0.15.1) instance: GPT-OSS-120B and Kimi K2.5. Both failed to produce usable research trajectories, for related but distinct reasons.
GPT-OSS-120B uses a “Harmony” output format that routes text through named channels (reasoning, final answer, tool calls). When tools are involved, vLLM’s parser consistently routes the model’s output to the wrong channel: everything lands in reasoning_content while the content field stays empty. This happens at all reasoning_effort levels. The model does the research (calls tools, reads documents, formulates queries) but the final synthesized answer never appears where the serving layer expects it. This is a known issue in vLLM’s Harmony format handling. Here’s the final message from a typical trajectory. The model has been researching for 5 tool calls but produces no answer:
The model’s reasoning shows it has the answer (it identified Colin Bateman as the author), but the content field is empty and no tool call is emitted. The trajectory ends here with nothing to show for it.
Kimi K2.5 exhibits a different failure mode. With its thinking mode enabled, it has the same channel-routing problem as GPT-OSS. With thinking mode disabled, the model produces content text, but after the first tool result, it narrates what it plans to do next rather than emitting another tool call. The serving layer sees text content without tool calls and treats it as the final answer, terminating the research loop after a single search:
The model intends to keep researching (“let me search for more details”) but describes the action instead of calling the tool. The framework sees content, no tool calls, and stops. We tried multiple tokenizer modes, prompt variations, and vLLM configurations; open issues on the model’s HuggingFace page confirm this is a broader compatibility gap.
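When debugging runs like these, it helps to label how each trajectory ended. The sketch below classifies the final assistant message into the two failure modes described above. It assumes OpenAI-style message dictionaries with `content`, `reasoning_content`, and `tool_calls` fields, and the phrase list used to spot narrated tool use is a rough heuristic of our own, not something either model or vLLM defines.

```python
# Phrases suggesting the model described a tool call instead of emitting one.
# Purely heuristic and illustrative; extend for your own prompts.
NARRATION_HINTS = ("let me search", "i will search", "i'll search", "let me open", "let me look")


def diagnose_final_message(message):
    """Label the last assistant message of a trajectory.

    message: dict with optional "content", "reasoning_content", "tool_calls"
    keys (an assumed OpenAI-style shape).
    """
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()

    if message.get("tool_calls"):
        return "in_progress"  # the loop should continue, not stop
    if not content and reasoning:
        return "answer_stuck_in_reasoning"  # the channel-routing failure
    if not content:
        return "empty"
    if any(hint in content.lower() for hint in NARRATION_HINTS):
        return "narrated_tool_use"  # the model described a call instead of making it
    return "answered"
```

Tallying these labels over a batch gives a quick read on whether a serving configuration is usable before spending judge tokens on it.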
The original OpenResearcher codebase handles this by bypassing vLLM’s tool call parser entirely. They hit the raw /completions endpoint (openai_generator.py), parse <tool_call> XML tags from the output with regex, and continue looping until the model emits an explicit answer marker like <answer> or final answer: (deploy_agent.py).
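The general shape of that workaround looks roughly like the sketch below. This is not the OpenResearcher code: the JSON payload inside `<tool_call>`, the `<tool_response>` wrapper, and the `generate` and `execute_tool` callables are assumptions made for illustration, and only the `<tool_call>` tag and the `<answer>` / `final answer:` markers come from the description above.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>|final answer:\s*(.+)", re.DOTALL | re.IGNORECASE)


def parse_turn(text):
    """Extract tool calls and an explicit answer (if any) from raw model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))  # assumed payload: {"name": ..., "arguments": {...}}
        except json.JSONDecodeError:
            continue  # skip malformed calls rather than aborting the trajectory
    match = ANSWER_RE.search(text)
    answer = None
    if match:
        answer = next((g for g in match.groups() if g is not None), "").strip()
    return calls, answer


def run_loop(generate, execute_tool, prompt, max_turns=10):
    """Keep generating until an explicit answer marker appears or turns run out.

    generate(transcript) -> str and execute_tool(call) -> str are
    caller-supplied stand-ins for the completions endpoint and the retriever.
    """
    transcript = prompt
    for _ in range(max_turns):
        output = generate(transcript)
        transcript += output
        calls, answer = parse_turn(output)
        if answer is not None:
            return answer, transcript
        for call in calls:
            transcript += f"\n<tool_response>{execute_tool(call)}</tool_response>\n"
        # No answer and no tool call (the model narrated): keep looping instead of stopping.
    return None, transcript
```

The key difference from a parser-driven loop is the stopping rule: plain text without a tool call is not treated as a final answer, so a model that narrates its next step gets another turn.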
The open-source tool-calling stack is growing and maturing quickly, but multi-turn tool use with reasoning models is still a rough edge. For now, the practical path is to use models with battle-tested tool-calling support through their native APIs, which is what we do in the results below.
We ran 64 questions uniformly sampled across 2-, 3-, and 4-hop difficulty levels from MuSiQue, with 50K FineWeb web documents as distractors (a 1:100 golden-to-distractor ratio). We tested two models: Claude Opus 4.5 (via API) and Nemotron Nano 3 (30B total / 3B active params, self-hosted via vLLM with reasoning disabled).
Opus is 22 points more accurate, but Nano runs roughly 5x faster on self-hosted hardware. For both models, the number of tool calls grows with hop count. Nano makes fewer tool calls but achieves lower accuracy, with the largest gap on 2-hop questions (78% vs 57%). Splitting by correctness reveals the same pattern in both models: incorrect trajectories are longer.
Claude Opus 4.5:
Nemotron Nano 3:
Correct trajectories are shorter at every hop level for both models. Incorrect trajectories are roughly twice as long because the model keeps searching when it can’t find evidence, then writes a longer answer to compensate. This anti-correlation between trajectory length and correctness is consistent across model scales, which means trajectory length alone could serve as a lightweight filter during rejection sampling.
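A length-based pre-filter along these lines could be sketched as follows. It assumes each trace is a list of OpenAI-style message dictionaries in which assistant messages carry a `tool_calls` list, and the cutoff is a hypothetical knob you would tune per hop level against your own judged data, not a value we measured.

```python
def count_tool_calls(trace):
    """Count tool calls across all assistant messages in one trajectory."""
    return sum(
        len(message.get("tool_calls") or [])
        for message in trace
        if message.get("role") == "assistant"
    )


def length_filter(traces, max_tool_calls):
    """Keep trajectories at or under a tool-call budget.

    max_tool_calls is a hypothetical threshold chosen by the caller.
    """
    return [trace for trace in traces if count_tool_calls(trace) <= max_tool_calls]
```

Used ahead of the LLM judge, a filter like this trades some recall for cost: long trajectories that happen to be correct are discarded, but fewer judge calls are spent on runs that are likely wrong.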
Thanks to the OpenResearcher team for their work showing that synthetic research trajectories over local retrieval can train small models to compete with much larger ones. Their results suggest we’re only beginning to understand how LLMs interact with search tools and how the structure of those interactions shapes what models learn. We’re excited to see where the community takes synthetic data research using NeMo Data Designer as both the models and the tooling continue to improve.
openresearcher_demo.py
prepare_corpus.py
retriever_mcp.py

Key Resources: