Search Agent SFT Data: Teaching LLMs to Browse the Web

Dhruv Nathawani, Researcher at NVIDIA

Training search agents requires trajectory data --- the full multi-turn interaction showing how a model searches, reads, reasons, and answers. We built a four-stage pipeline that generates synthetic search trajectories from Wikidata knowledge graph paths, converts them into BrowseComp-style riddles using NeMo Data Designer, generates multi-step search rollouts with live web search via Tavily, and post-processes the results into SFT-ready training data.

Why This Matters: The Agentic Shift

The industry is moving from models that answer questions to agents that take actions. Real-world AI applications orchestrate multiple steps --- searching the web, querying databases, reading documents, calling APIs --- with the LLM as the reasoning engine deciding what to do next.

Consider this question from OpenAI’s BrowseComp benchmark:

Between 1990 and 1994 inclusive, what teams played in a soccer match with a Brazilian referee that had four yellow cards, two for each team where three of the total four were not issued during the first half, and four substitutions, one of which was for an injury in the first 25 minutes of the match.

Answer: Ireland v Romania

You can’t answer this from memory. You need to search, read results, refine your query, search again, and piece it together --- exactly what we want AI agents to do. Training a model for this requires trajectory data: the full record of every search query, every result evaluation, and every reasoning step, not just the final answer.

Creating this data by hand takes 15-30 minutes per example. At the thousands of trajectories needed for SFT, that’s months of annotation work. We needed a way to generate it synthetically.

End-to-End Pipeline Architecture

                                    SEARCH AGENT SFT PIPELINE
                                    =========================
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                       STAGE 1: SEED DATA (Wikidata KG Walks)                        │
             │                                                                                     │
             │   Random walks on the Wikidata knowledge graph                                      │
             │   ├─ Anti-meta filters (no category/template/list-y nodes)                          │
             │   ├─ Hop range: 4 minimum, 8 maximum                                                │
             │   └─ SPARQL queries to fetch neighbors                                              │
             │                                                                                     │
             │   Output: seed JSONL with hops[], seed_entity, final_answer_entity, path_length     │
             │   50,000 seeds generated                                                            │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                   STAGE 2: SEARCH RIDDLE GENERATION (Data Designer)                 │
             │                                                                                     │
             │   user_query_draft ──────────► user_query_obfuscated                                │
             │   (chain clues from path,       (BrowseComp-style rewrite:                          │
             │    hide intermediate nodes,      concise, natural, no breadcrumbs,                  │
             │    don't name the answer)        1-2 sentences max)                                 │
             │                                                                                     │
             │   + Heuristic filters: answer leakage, intermediate node leakage, INVALID_PATH      │
             │   50,000 → 37,000 valid seeds → 24,000 valid questions                              │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                      STAGE 3: SEARCH TRAJECTORY ROLLOUTS                            │
             │                                                                                     │
             │   Thought-Action-Observation loop with live web search (Tavily API)                 │
             │   ├─ Rollout model: MiniMax-M2 (strong BrowseComp + tool-calling scores)            │
             │   ├─ Average ~12 tool calls per sample                                              │
             │   ├─ Multiple rollouts per question for rejection sampling                          │
             │   └─ 6,974 completed (stop) / 177 truncated (length)                                │
             │                                                                                     │
             │   24,000 questions → 7,000 valid trajectories                                       │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                      STAGE 4: POST-PROCESSING → SFT DATASET                         │
             │                                                                                     │
             │   ├─ Normalize tool outputs to consistent JSON "tool response" shape                │
             │   ├─ Drop broken/truncated interactions                                             │
             │   ├─ Select best rollout per question (min tool calls among correct)                │
             │   ├─ Write OpenAI-messages style: messages[], tools[], metadata{}                   │
             │   └─ Manual review + LLM spot-checking (Gemini)                                     │
             │                                                                                     │
             │   ~7,000 SFT records for Nemotron Super                                             │
             └─────────────────────────────────────────────────────────────────────────────────────┘

Step 1: Seed Data from Wikidata Knowledge Graph Walks

The core idea: start at a random entity in the Wikidata knowledge graph and perform a random walk through its relations, producing a chain of hops that becomes a multi-hop search problem. Each chain provides a seed_entity (start), a final_answer_entity (destination), and a readable_path describing the edges traversed.

We used Wikidata SPARQL queries to fetch neighbors at each hop. The number of hops roughly tracks the number of tool calls the model needs to solve questions derived from that path, so more hops generally means harder riddles.

START ENTITY: NVIDIA (Q182477)
  ⬇ [chief executive officer (P169)]
  NODE: Jensen Huang (Q332838)
  ⬇ [educated at (P69)]
  NODE: Oregon State University (Q861888)
  ⬇ [located in the administrative territorial entity (P131)]
  NODE: Benton County (Q115372)
  ⬇ [named after (P138)]
  NODE: Thomas Hart Benton (Q178712)

START ENTITY: toothache (Q143925)
  ⬇ [risk factor (P564)]
  NODE: smoking (Q662860)
  ⬇ [has effect (P1542)]
  NODE: Crohn's disease (Q1472)
  ⬇ [drug or therapy used for treatment (P2176)]
  NODE: TNF inhibitor (Q1536078)
  ⬇ [(is possible treatment of) (P2175)] reverse relation
  NODE: Behçet's disease (Q911427)
  ⬇ [symptoms and signs (P780)]
  NODE: inflammation (Q101991)
  ⬇ [drug or therapy used for treatment (P2176)]
  NODE: (±)-flurbiprofen (Q419890)
  ⬇ [significant drug interaction (P769)]
  NODE: parecoxib (Q347941)
  ⬇ [significant drug interaction (P769)]
  NODE: ibuprofen (Q186969)
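
A minimal sketch of the walk itself, under stated assumptions: the production SPARQL queries are not shown in this post, so neighbors_query is only one plausible shape for a neighbor lookup, and TOY_GRAPH is a hypothetical in-memory stand-in for the live endpoint. The function and field layout of each hop are illustrative.

```python
import random


def neighbors_query(qid: str, limit: int = 50) -> str:
    """Build a SPARQL query for the outgoing entity-valued statements of one item.

    Illustrative only: the queries used in production are not given in the post.
    """
    return (
        "SELECT ?prop ?target WHERE { "
        f"wd:{qid} ?prop ?target . "
        'FILTER(STRSTARTS(STR(?prop), "http://www.wikidata.org/prop/direct/")) '
        "FILTER(isIRI(?target)) "
        f"}} LIMIT {limit}"
    )


# Toy stand-in for the knowledge graph: entity -> [(relation, neighbor), ...]
TOY_GRAPH = {
    "NVIDIA": [("chief executive officer", "Jensen Huang")],
    "Jensen Huang": [("educated at", "Oregon State University")],
    "Oregon State University": [
        ("located in the administrative territorial entity", "Benton County")
    ],
    "Benton County": [("named after", "Thomas Hart Benton")],
}


def random_walk(graph, start, max_hops, rng):
    """Walk up to max_hops edges, never revisiting an entity."""
    hops, current, visited = [], start, {start}
    for _ in range(max_hops):
        options = [(rel, nxt) for rel, nxt in graph.get(current, []) if nxt not in visited]
        if not options:
            break  # dead end: stop early and let the hop-range filter decide
        relation, nxt = rng.choice(options)
        hops.append({"from": current, "relation": relation, "to": nxt})
        visited.add(nxt)
        current = nxt
    return {
        "seed_entity": start,
        "final_answer_entity": current,
        "hops": hops,
        "path_length": len(hops),
    }


seed = random_walk(TOY_GRAPH, "NVIDIA", max_hops=8, rng=random.Random(0))
print(seed["final_answer_entity"], seed["path_length"])
```

On the toy graph every node has a single neighbor, so the walk reproduces the NVIDIA path above. Against real Wikidata, each step would instead sample from the rows returned by the neighbor query.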

Heuristics to Keep Walks Coherent

Unrestricted random walks go off the rails quickly --- you’d get paths like CEO → Human Being → Civilization → Indus Valley. We applied several filters:

Anti-meta filters: Avoid category nodes, template pages, list-y entities, and other degenerate hops that exist for Wikidata bookkeeping rather than representing real-world relationships.
Hop range: 4 minimum, 8 maximum. Below 4 hops, the questions aren’t difficult enough to require multi-step search. Above 8, the path wanders off-topic and produces unsolvable riddles.
Generic entity filtering: Remove seeds where the final_answer_entity is too abstract (“technology”, “people”, “field”, “concept”). These produce questions where any answer could be correct.

The resulting seed dataset: 50,000 JSONL records, each containing hops[], seed_entity, final_answer_entity, readable_path, and path_length.
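
The three filters can be sketched as a single predicate over a seed record. This is a sketch under assumptions, not the production filter: the META_PREFIXES list and the "to" key inside each hop are hypothetical, and only the four generic answers quoted above are included.

```python
GENERIC_ANSWERS = {"technology", "people", "field", "concept"}

# Assumed label prefixes for bookkeeping entities; the real list is not given in the post.
META_PREFIXES = ("Category:", "Template:", "Portal:", "List of ", "Wikimedia ")


def keep_seed(seed, min_hops=4, max_hops=8):
    """Return True if a seed passes the hop-range, generic-answer, and anti-meta checks."""
    if not min_hops <= seed["path_length"] <= max_hops:
        return False
    if seed["final_answer_entity"].strip().lower() in GENERIC_ANSWERS:
        return False
    labels = [seed["seed_entity"]] + [hop["to"] for hop in seed["hops"]]
    return not any(label.startswith(META_PREFIXES) for label in labels)


def make_seed(start, *targets):
    """Helper that builds a seed record for the demo below."""
    hops = [{"to": target} for target in targets]
    return {
        "seed_entity": start,
        "final_answer_entity": targets[-1],
        "hops": hops,
        "path_length": len(hops),
    }


good = make_seed("NVIDIA", "Jensen Huang", "Oregon State University",
                 "Benton County", "Thomas Hart Benton")
too_short = make_seed("NVIDIA", "Jensen Huang")
generic = make_seed("A", "B", "C", "D", "technology")
meta = make_seed("A", "B", "Category:Physics", "D", "E")

print([keep_seed(s) for s in (good, too_short, generic, meta)])
# -> [True, False, False, False]
```

Running the cheap checks before any LLM call is the point of the design: a rejected seed costs nothing downstream.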

A Note on Ground Truth Staleness

An important caveat when using Wikidata as a seed source: the knowledge graph reflects a snapshot in time. Models with current parametric knowledge or live search results may find answers that are factually correct today but disagree with the KG-derived ground truth. For example, a question about “which country contains the headquarters of the owner of U.S. Steel?” has ground truth “United States” from Wikidata, but Nippon Steel (Japan) agreed to acquire U.S. Steel in Dec 2023 and closed the deal in June 2025, making “Japan” the correct answer now. This staleness affects both question quality (paths through outdated facts) and evaluation (correct model answers flagged as wrong). We revisit this challenge in the Correctness Challenge section below.

Step 2: Creating Search Riddles with Data Designer

Each seed path needs to be converted into two things: a search question (obfuscated so it doesn’t leak the answer) and a ground truth target entity (the final node in the path). We use two chained LLM columns in Data Designer for this.

Stage 1 --- Draft question: Chain clues from the knowledge path into a multi-hop riddle. Critical rules: don’t name intermediate nodes, don’t name the final answer, skip weak or illogical hops, and output INVALID_PATH if the path is unsalvageable.

Stage 2 --- Obfuscated question: Rewrite the draft in BrowseComp style --- concise, natural, 1-2 sentences max. The solver must figure out what to search rather than following explicit breadcrumbs. No relational breadcrumbing like “X is member of Y; Y is based in Z…”.

import data_designer.config as dd
from data_designer.interface import DataDesigner

config = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(
        alias="riddle-gen",
        model="qwen/qwen3-235b-a22b",
        provider="nvidia",
    ),
])

config.with_seed_dataset(
    dd.LocalFileSeedSource(path="search_agent_seeds.parquet"),
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)

# Stage 1: Draft question from knowledge path
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_draft",
    model_alias="riddle-gen",
    prompt=(
        "You are an expert Search Evaluator designing Grandmaster-Level search tests.\n"
        "Create a complex, multi-step search riddle based on this knowledge path:\n\n"
        "{{ readable_path }}\n\n"
        "Start Entity: {{ seed_entity }}\n"
        "Final Answer Entity: {{ final_answer_entity }}\n\n"
        "RULES:\n"
        "1. DO NOT name the intermediate nodes. Hide them behind descriptions.\n"
        "2. DO NOT name the Final Answer.\n"
        "3. Chain the clues logically.\n"
        "4. If a step is weak or nonsensical, IGNORE IT.\n"
        "5. Output INVALID_PATH if the path is unsalvageable.\n\n"
        "Return ONLY the question string (or INVALID_PATH)."
    ),
))

# Stage 2: BrowseComp-style obfuscation
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_obfuscated",
    model_alias="riddle-gen",
    prompt=(
        "Rewrite this search riddle to be MORE obfuscated and natural.\n\n"
        "Original: {{ user_query_draft }}\n"
        "Secret path: {{ readable_path }}\n\n"
        "REQUIREMENTS:\n"
        "1. DO NOT reveal the step-by-step plan. No breadcrumb chains.\n"
        "2. DO NOT name intermediate or final entities.\n"
        "3. 1-2 sentences max. Sound like a real user question.\n"
        "4. If original == INVALID_PATH, output INVALID_PATH.\n\n"
        "Return ONLY the rewritten question."
    ),
))

Example transformation (NVIDIA path):

Draft:      "Starting from NVIDIA, identify the current CEO, then find
             where they received their bachelor's degree, determine which
             county houses that university's main campus, and finally
             identify the nickname of the 19th-century U.S. Senator
             the county is named after."
Obfuscated: "Identify the nickname ('Old ____') of the 19th-century U.S.
             Senator who is the namesake of the specific county that houses
             the main campus of the university where the current CEO of
             NVIDIA received his bachelor's degree."
Answer:     "Old Bullion"

The obfuscated version requires the solver to:

Identify Jensen Huang as NVIDIA’s CEO
Find where he got his bachelor’s degree (Oregon State, not Stanford)
Identify the county (Benton County, OR)
Find who the county is named after (Thomas Hart Benton)
Find his nickname --- forcing one final hop to verify it’s the Senator, not the painter
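
The heuristic filters applied after generation (answer leakage, intermediate node leakage, INVALID_PATH) can be sketched as a plain string check. This is a minimal sketch assuming whole-phrase matching on normalized text; the production heuristics are not spelled out here, and the function names are hypothetical.

```python
import re


def _norm(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def leaks(question: str, entity: str) -> bool:
    """True if the entity label appears as a whole phrase inside the question."""
    needle = _norm(entity)
    return bool(needle) and f" {needle} " in f" {_norm(question)} "


def keep_question(question: str, final_answer: str, intermediate_nodes: list[str]) -> bool:
    """Reject INVALID_PATH outputs and questions that name a hidden entity."""
    if question.strip() == "INVALID_PATH":
        return False
    if leaks(question, final_answer):
        return False
    return not any(leaks(question, node) for node in intermediate_nodes)


obfuscated = (
    "Identify the nickname ('Old ____') of the 19th-century U.S. Senator who is "
    "the namesake of the specific county that houses the main campus of the "
    "university where the current CEO of NVIDIA received his bachelor's degree."
)
hidden = ["Jensen Huang", "Oregon State University", "Benton County"]

print(keep_question(obfuscated, "Thomas Hart Benton", hidden))
print(keep_question("Which county is Oregon State University in?", "Thomas Hart Benton", hidden))
print(keep_question("INVALID_PATH", "Thomas Hart Benton", hidden))
```

The first call keeps the question, and the other two reject it. Exact phrase matching misses paraphrased leaks ("OSU"), so a check like this is a first pass rather than a guarantee.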

Step 3: Search Trajectory Rollouts

For each riddle, we generate multi-step tool-using conversation rollouts where the model makes several web-search tool calls, reads tool outputs, reasons about results, and ends with a final answer. The agent operates through a cyclical thought-action-observation loop:

1. Thought: Analyze the current state, identify the next missing link, formulate a search query
2. Action (Tool Call): Execute web_search(query="..."), retrieving real data from the web
3. Observation & Synthesis: Read the results, extract the relevant fact, loop back to step 1

The rollout model was MiniMax-M2, chosen for its strong performance on BrowseComp and tool-calling benchmarks.
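
The loop can be sketched with the model and the search tool stubbed out. Everything here is an assumption for illustration: run_rollout, scripted_model, and fake_search are hypothetical names, the stub returns placeholder text rather than real search results, and "budget_exhausted" is a made-up label for hitting the turn cap (the real rollouts report stop or length).

```python
import json


def run_rollout(question, model_step, search, max_turns=25):
    """Drive a thought-action-observation loop until the model stops calling tools."""
    messages = [{"role": "user", "content": f"Problem: {question}"}]
    finish_reasons = []
    for _ in range(max_turns):
        reply = model_step(messages)           # Thought (+ optional Action)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:                          # no tool call means a final answer
            finish_reasons.append("stop")
            return messages, finish_reasons
        finish_reasons.append("tool_calls")
        for call in calls:                     # Action: execute each requested search
            args = json.loads(call["function"]["arguments"])
            result = search(args["query"])
            messages.append({                  # Observation: feed results back in
                "role": "tool",
                "name": call["function"]["name"],
                "content": json.dumps(result),
            })
    finish_reasons.append("budget_exhausted")  # hypothetical label for the turn cap
    return messages, finish_reasons


def scripted_model(messages):
    """Stub model: search once, then answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "role": "assistant",
            "content": "<think>Find the CEO first.</think>",
            "tool_calls": [{"function": {
                "name": "web-search",
                "arguments": json.dumps({"query": "NVIDIA CEO"}),
            }}],
        }
    return {"role": "assistant", "content": "Final Answer: Jensen Huang"}


def fake_search(query):
    """Stub tool: returns placeholder text, not real search results."""
    return {"results": [{"title": "stub", "content": f"stub result for: {query}"}]}


messages, reasons = run_rollout("Who leads NVIDIA?", scripted_model, fake_search)
print(reasons)
print([m["role"] for m in messages])
```

Swapping scripted_model for a real chat-completions call and fake_search for a Tavily request leaves the control flow unchanged.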

Rollout Statistics

$ jq -r '.finish_reason[-1] // "MISSING"' shard_*/output-rs0.jsonl-async \
>   | sort | uniq -c | sort -nr
   6974 stop
    177 length

6,974 completed (stop) --- the model reached a final answer
177 truncated (length) --- hit the context limit before answering
Average ~12 tool calls per sample
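
For pipelines that stay in Python, the same tally can be sketched without jq. The sample lines below are made-up records used only to exercise the function; they are not rows from the real shards.

```python
import json
from collections import Counter


def tally_finish_reasons(lines):
    """Count the last finish_reason of each JSONL record, like the jq one-liner."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        reasons = json.loads(line).get("finish_reason") or []
        counts[reasons[-1] if reasons else "MISSING"] += 1
    return counts


# Made-up sample records, for illustration only.
sample_lines = [
    '{"finish_reason": ["tool_calls", "tool_calls", "stop"]}',
    '{"finish_reason": ["tool_calls", "length"]}',
    '{"finish_reason": ["stop"]}',
    '{"other_field": 1}',
    "",
]

counts = tally_finish_reasons(sample_lines)
print(counts.most_common())
```

To run it over real output, pass an open file handle for each shard instead of sample_lines.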

What a Trajectory Looks Like

Each trajectory captures the full search process. Here’s a condensed example (simplified for readability --- production records include OpenAI-spec fields like tool_call_id, id, and type on tool call messages):

{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert search agent. Your goal is to answer the user's question efficiently using verifiable online sources.\n\nExecution Rules:\n1. Search Results...\n2. Planning...\n3. Tool Use...\n4. MAXIMUM 25 Calls...\n5. Verification...\n6. Output Format - The last line must be EXACTLY: \"Final Answer: <The Entity>\"...\n"
    },
    {
      "role": "user",
      "content": "Problem: Which equatorial coordinate system, using an angular measurement instead of right ascension, can be reached by starting with lawrencium, then the physicist it honors, the university they attended...?"
    },
    {
      "role": "assistant",
      "content": "<think>\nWe need to parse the clues step by step:\n1. Start with lawrencium... named after Ernest Lawrence...\n2. The university they attended... need to verify...\n</think>",
      "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"Ernest Lawrence attended university\"}"}}]
    },
    {
      "role": "tool",
      "name": "web-search",
      "content": "{\"results\": [{\"title\": \"Physics - Ernest Lawrence\", \"content\": \"He attended Manchester High School prior to entering Manchester University...\"}, {\"title\": \"2025 Hall of Fame...\", \"content\": \"...Ernest Lawrence graduated from USD in 1922...\"}]}"
    },
    {
      "role": "assistant",
      "content": "<think>\nErnest Lawrence attended University of South Dakota (BS) and Yale (PhD)... The clue mentions 'six-month academic schedule' — likely 'semester'...\n</think>",
      "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"semester six month academic schedule\"}"}}]
    },
    {
      "role": "tool",
      "name": "web-search",
      "content": "{\"results\": [{\"title\": \"Academic Calendar\", \"content\": \"Many colleges follow a semester plan...\"}]}"
    },
    {
      "role": "assistant",
      "content": "Final Answer: first equatorial coordinate system"
    }
  ],
  "metadata": {
    "ground_truth": "first equatorial coordinate system",
    "num_hops_in_graph": 8,
    "seed_entity": "lawrencium",
    "final_answer_entity": "first equatorial coordinate system",
    "num_tool_calls": 2,
    "finish_reason": ["tool_calls", "tool_calls", "stop"]
  },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "web-search",
        "description": "Search the web for a query.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}
      }
    }
  ]
}

Step 4: Post-Processing to SFT Dataset

Raw rollouts need cleanup before they become trainable SFT records:

Normalize tool outputs into a consistent JSON “tool response” shape
Drop broken/truncated interactions (the 177 length records)
Select the best rollout per question (minimum tool calls among correct ones)
Write OpenAI-messages style JSONL with messages[], tools[], and metadata{}
Manual review + LLM spot-checking --- we reviewed as much SFT data as we could manually and used Gemini to spot-check chunks
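
The drop and select steps can be sketched as one pass over the rollouts. This is a sketch, not the production code: the question_id and correct fields are hypothetical and assume correctness has already been judged upstream.

```python
def select_best_rollouts(rollouts):
    """Keep, per question, the correct completed rollout with the fewest tool calls."""
    best = {}
    for rollout in rollouts:
        if rollout["finish_reason"][-1] != "stop":
            continue  # drop truncated interactions
        if not rollout["correct"]:
            continue  # rejection sampling: only correct answers survive
        qid = rollout["question_id"]
        if qid not in best or rollout["num_tool_calls"] < best[qid]["num_tool_calls"]:
            best[qid] = rollout
    return list(best.values())


# Made-up rollouts for illustration.
rollouts = [
    {"question_id": "q1", "num_tool_calls": 9, "correct": True, "finish_reason": ["tool_calls", "stop"]},
    {"question_id": "q1", "num_tool_calls": 5, "correct": True, "finish_reason": ["tool_calls", "stop"]},
    {"question_id": "q1", "num_tool_calls": 3, "correct": False, "finish_reason": ["tool_calls", "stop"]},
    {"question_id": "q2", "num_tool_calls": 4, "correct": True, "finish_reason": ["tool_calls", "length"]},
]

selected = select_best_rollouts(rollouts)
print([(r["question_id"], r["num_tool_calls"]) for r in selected])
# -> [('q1', 5)]
```

Preferring the shortest correct rollout biases the SFT data toward efficient search. The trade-off is that a short trajectory may have reached the right answer with thinner evidence, which is one reason the manual review step still matters.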

Production Yield Analysis

                                        PIPELINE YIELD
                                        ==============
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │   50,000 ───74%──► 37,000 ───65%──► 24,000 ───29%──► 7,000                          │
             │   Seeds           Valid Seeds       Valid Questions    Valid Trajectories           │
             │                                                                                     │
             │   Total Yield: 14%                                                                  │
             └─────────────────────────────────────────────────────────────────────────────────────┘

Stage	Input	Output	Yield
Seed Creation (Wikidata walks)	50,000	37,000	74%
Riddle / Question Generation (DD)	37,000	24,000	65%
Search Trajectory Rollouts (Tavily)	24,000	7,000	29%
End-to-End	50,000	~7,000	~14%

The 14% yield might seem low, but each surviving record is a verified, multi-turn search trajectory showing a model successfully navigating web search. The alternative --- human annotation at 15-30 minutes per trajectory --- would take months for the same volume.
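
The yield figures and the annotation estimate follow directly from the counts in the table. A short sketch that recomputes them (the helper name is illustrative):

```python
STAGES = [
    ("Seeds", 50_000),
    ("Valid seeds", 37_000),
    ("Valid questions", 24_000),
    ("Valid trajectories", 7_000),
]


def stage_yields(stages):
    """Return (stage name, fraction of the previous stage that survived)."""
    return [(name, count / prev) for (_, prev), (name, count) in zip(stages, stages[1:])]


for name, frac in stage_yields(STAGES):
    print(f"{name}: {frac:.0%}")

overall = STAGES[-1][1] / STAGES[0][1]
print(f"End-to-end: {overall:.0%}")

# Human annotation cost for the same volume at 15-30 minutes per trajectory.
low_hours = STAGES[-1][1] * 15 / 60
high_hours = STAGES[-1][1] * 30 / 60
print(f"Manual annotation: {low_hours:.0f} to {high_hours:.0f} hours")
```

At 7,000 trajectories, the manual route works out to 1,750 to 3,500 annotator-hours.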

The Correctness Challenge

Measuring correctness in post-processing was one of the hardest parts of this project, for reasons that go beyond typical evaluation:

1. Questions can have multiple valid answers. A question about “which country contains X” might have a valid answer at multiple levels of granularity, or the entity might have multiple correct associations.

2. Wikidata has stale ground truth. The knowledge graph reflects a snapshot in time. The model’s parametric knowledge or live search results may be more current. For example:

Question: “…country that contains the headquarters of the owner of U.S. Steel?”

Ground truth (from Wikidata): United States

Model answer: Japan

Reality: Nippon Steel (Japan) agreed to acquire U.S. Steel in Dec 2023 and closed the deal in June 2025. The model’s answer is factually correct today but wrong according to the outdated KG path.

Accuracy Results

We evaluated 657 trajectories against ground truth using fuzzy matching:

------------------------------------------------------------------------------------------------------------
#    | GROUND TRUTH              | #TC | MODEL ANSWER                       | STATUS
------------------------------------------------------------------------------------------------------------
1    | Ramsar Convention         | 3   | Ramsar Convention (the Convention. | ✅ MATCH
2    | United States             | 4   | United States                      | ✅ MATCH
3    | South Korea               | 3   | Uzbekistan                         | ❌ MISS
4    | France                    | 3   | Germany                            | ❌ MISS
5    | Joseph Poelaert           | 4   | Joseph Poelaert                    | ✅ MATCH
...
653  | Bangladesh                | 11  | Bangladesh                         | ✅ MATCH
654  | Portal:Arithmetic         | 10  | Portal:Arithmetic                  | ✅ MATCH
655  | Monumento 6 Gran Vía..    | 11  | Monumento V (the Monumento a los.. | ❌ MISS
656  | Tehran                    | 11  | Constantinople                     | ❌ MISS
657  | United Kingdom            | 11  | Germany                            | ❌ MISS
------------------------------------------------------------------------------------------------------------
📊 RESULTS: 181/657 (27.5%) Correct

The 27.5% accuracy on this sample is for raw, unfiltered trajectories. After the full pipeline (rejection sampling, best-rollout selection, manual review), the final SFT dataset has much higher quality. The low raw accuracy underscores why multi-stage filtering is essential.
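
The exact fuzzy-matching rule is not given above, so the following is only one plausible sketch: normalize both strings, accept when the ground truth appears as a whole phrase in the answer, and otherwise fall back to a difflib similarity ratio. The 0.85 threshold is an arbitrary assumption.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def fuzzy_match(truth: str, answer: str, threshold: float = 0.85) -> bool:
    """Match on whole-phrase containment first, then on character similarity."""
    t, a = normalize(truth), normalize(answer)
    if not t or not a:
        return False
    if f" {t} " in f" {a} ":
        return True
    return SequenceMatcher(None, t, a).ratio() >= threshold


# Rows taken from the table above.
rows = [
    ("Ramsar Convention", "Ramsar Convention (the Convention."),
    ("United States", "United States"),
    ("South Korea", "Uzbekistan"),
    ("France", "Germany"),
    ("Tehran", "Constantinople"),
]
for truth, answer in rows:
    print(f"{truth!r} vs {answer!r}: {fuzzy_match(truth, answer)}")
```

On these rows the sketch agrees with the MATCH and MISS labels in the table. Containment is lenient by design: an answer such as "not the United States" would also pass, and neither rule can recover stale-ground-truth cases like the U.S. Steel example, which need a judge with current knowledge.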

Implementing with Data Designer’s MCP Integration

The same pipeline is reproducible with Data Designer’s MCP integration. Three components make this work:

LocalStdioMCPProvider launches a Tavily MCP server as a subprocess:

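import os
import sys

from data_designer.config.mcp import LocalStdioMCPProvider, ToolConfig

# tavily_server_path is assumed to be defined elsewhere: a pathlib.Path pointing
# at the local Tavily MCP server script that gets launched as a subprocess.
tavily_provider = LocalStdioMCPProvider(
    name="tavily",
    command=sys.executable,
    args=[str(tavily_server_path)],
    env={"TAVILY_API_KEY": os.environ["TAVILY_API_KEY"]},
)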

ToolConfig controls safety and limits:

tool_config = ToolConfig(
    tool_alias="tavily",
    providers=["tavily"],
    allow_tools=["tavily_search"],
    max_tool_call_turns=15,
    timeout_sec=300.0,
)

tool_alias + with_trace on the LLM column enables tool calling and captures the full conversation:

config.add_column(dd.LLMTextColumnConfig(
    name="agent_solution_raw",
    system_prompt="You are an expert search agent...",
    prompt="Problem: {{ user_query_obfuscated }}",
    model_alias="search-agent",
    tool_alias="tavily",
    with_trace=dd.TraceType.ALL_MESSAGES,
))

The resulting agent_solution_raw__trace column contains the complete ChatML conversation --- every user message, every tool call with arguments, every tool response with search results, and the final assistant response. This trace is the SFT training data.

Safety controls matter here. allow_tools prevents the model from calling unexpected tools. max_tool_call_turns=15 prevents infinite search loops. timeout_sec=300 prevents hung connections. Without these, a fraction of records would consume unbounded resources.

BrowseComp Benchmark Results

This dataset was shipped as part of Nemotron Super v3 post-training (SFT + RL). On the BrowseComp benchmark (1,266 web browsing problems), Nemotron Super went from 0% to 31.28% accuracy --- approaching GPT-OSS-120B at 33.89%.

Model	BrowseComp Accuracy (%)
Nemotron Super (before synthetic search agent data)	0
Nemotron Super (after synthetic search agent data, SFT + RL)	31.28
GPT-OSS-120B	33.89

Before this work, Nemotron Super had zero web browsing capability --- it had never been trained on tool-use trajectories with search. Including our synthetic search agent dataset in the SFT blend, combined with other RL datasets in later training stages, enabled the model to go from no capability to near-competitive with GPT-OSS-120B on one of the hardest agentic benchmarks. This dev note focuses on the SFT data generation pipeline.

Key Takeaways

Wikidata provides effectively unlimited seed diversity. Random walks on a knowledge graph with 100M+ entities produce far more multi-hop problems than a single SFT run can use. The hop count controls difficulty: 4 hops for warm-up, 8 for expert-level riddles.
Two-stage obfuscation prevents leakage. Draft questions tend to follow the path structure too closely (breadcrumbing). The obfuscation rewrite produces concise, natural questions that force the solver to figure out what to search.
Low yield is expected and acceptable. 14% end-to-end yield from 50k seeds still produces ~7,000 high-quality trajectories --- enough for meaningful SFT impact. Multi-hop search is genuinely hard, and most generated paths or questions are legitimately unsolvable.
Stale knowledge graphs are a real problem. Wikidata doesn’t update in real-time. Models with current parametric knowledge or live search results will disagree with ground truth on entities that have changed (mergers, leadership changes, geopolitical shifts). Correctness evaluation needs to account for this.
Iterate on seeds, not just prompts. Seed filtering (removing generic answers, constraining hop counts, anti-meta filters) has as much impact on quality as prompt engineering. Filter early, save compute.
Traces are the training data. The full thought-action-observation loop --- every search query formulation, every result evaluation, every reasoning step --- is what teaches tool-use capability. Final answers alone are worthless without the process.

Next Steps

Scale question generation. Generate roughly 25,000 filtered questions using Data Designer, up from the current 7k trajectories.
Push difficulty higher. Target questions where num_tool_calls consistently exceeds 15, requiring deeper reasoning chains.
Explore fresher knowledge bases. Wikidata staleness is a real limitation. Investigate more recently updated, freely available knowledge bases for seed generation.
Search RL environment. Use the filtered questions as an RL environment where the model gets reward for correct final answers, complementing the SFT data.
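
A reward for such an environment could be sketched around the "Final Answer: <The Entity>" last-line convention from the system prompt. This is a minimal sketch with an exact case-insensitive comparison; the function names are hypothetical, and a real environment would likely need a more tolerant matcher given the multiple-answer and staleness issues described earlier.

```python
PREFIX = "Final Answer:"


def extract_final_answer(text: str):
    """Return the entity on the last non-empty line, or None if the format is violated."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[-1].startswith(PREFIX):
        return None
    return lines[-1][len(PREFIX):].strip()


def reward(final_message: str, ground_truth: str) -> float:
    """1.0 for a correctly formatted, matching answer; 0.0 otherwise."""
    answer = extract_final_answer(final_message)
    if answer is None:
        return 0.0
    return 1.0 if answer.casefold() == ground_truth.casefold() else 0.0


print(reward("Reasoning...\nFinal Answer: Old Bullion", "Old Bullion"))
print(reward("Final Answer: old bullion\n", "Old Bullion"))
print(reward("I think it is Old Bullion.", "Old Bullion"))
```

The first two calls score 1.0 and the third scores 0.0, because the answer is present but the required last-line format is not.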

Try For Yourself

The snippet below shows the core pattern: seed data, two-stage riddle generation, and an MCP-enabled agent trajectory with full trace capture.

Minimal example: search agent trajectory pipeline

import data_designer.config as dd
from data_designer.interface import DataDesigner

MODEL_ALIAS = "nvidia-text"

# Tavily MCP provider (hosted endpoint, no local server needed)
mcp_provider = dd.MCPProvider(
    name="tavily",
    endpoint="https://mcp.tavily.com/mcp/?tavilyApiKey=YOUR_KEY",
    provider_type="streamable_http",
)

tool_config = dd.ToolConfig(
    tool_alias="tavily-search",
    providers=["tavily"],
    allow_tools=["tavily_search"],
    max_tool_call_turns=25,
    timeout_sec=300.0,
)

config = dd.DataDesignerConfigBuilder(tool_configs=[tool_config])
config.with_seed_dataset(
    dd.LocalFileSeedSource(path="seeds.jsonl"),
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)

# Stage 2a: Draft question from knowledge path
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_draft", model_alias=MODEL_ALIAS,
    prompt=(
        "Create a multi-step search riddle from this knowledge path:\n"
        "{{ readable_path }}\n"
        "Start: {{ seed_entity }}. Answer: {{ final_answer_entity }}\n"
        "Do NOT name intermediate nodes or the answer. Return ONLY the question."
    ),
))

# Stage 2b: BrowseComp-style obfuscation
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_obfuscated", model_alias=MODEL_ALIAS,
    prompt=(
        "Rewrite this riddle to be concise and natural (1-2 sentences).\n"
        "Original: {{ user_query_draft }}\n"
        "No breadcrumb chains. No entity names. If INVALID_PATH, output INVALID_PATH."
    ),
))

# Stage 3: Agent trajectory with MCP tool calling
config.add_column(dd.LLMTextColumnConfig(
    name="agent_solution_raw", model_alias=MODEL_ALIAS,
    system_prompt="You are an expert search agent. Use tavily_search to find the answer.",
    prompt="Problem: {{ user_query_obfuscated }}",
    tool_alias="tavily-search",
    with_trace=dd.TraceType.ALL_MESSAGES,
))

# Run
data_designer = DataDesigner(mcp_providers=[mcp_provider])
preview = data_designer.preview(config, num_records=5)
preview.display_sample_record()

Full recipe: search_agent.py (self-contained, runnable)

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "data-designer",
# ]
# ///
"""Nemotron Super Search Agent Recipe: Trajectories with Tavily Web Search

Generate multi-turn search agent trajectories where an LLM iteratively
searches the web, reads results, reasons about evidence, and synthesizes
answers -- the kind of data needed to train BrowseComp-style search agents.

This recipe implements the pipeline used to produce ~7,000 high-quality
tool-use trajectories for Nemotron Super post-training, starting from
50,000 Wikidata knowledge graph seeds.

Pipeline architecture:

    ┌─────────────────────────────────────────────────────────────────────────┐
    │                   STAGE 1: SEED DATA (Wikidata KG Walks)                │
    │                                                                         │
    │  Random walks on the Wikidata knowledge graph produce multi-hop paths.  │
    │  Each seed has: seed_entity, final_answer_entity, readable_path,        │
    │  num_hops_in_graph, ground_truth.                                       │
    │  Built-in demo seeds included; bring your own for production.           │
    ├─────────────────────────────────────────────────────────────────────────┤
    │                   STAGE 2: SEARCH RIDDLE GENERATION (LLM)               │
    │                                                                         │
    │  user_query_draft ────────► user_query_obfuscated                       │
    │  (chain clues from path,     (BrowseComp-style rewrite:                 │
    │   hide intermediate nodes,    concise, natural, no breadcrumbs,         │
    │   don't name the answer)      1-2 sentences max)                        │
    ├─────────────────────────────────────────────────────────────────────────┤
    │                   STAGE 3: SEARCH TRAJECTORY ROLLOUTS (LLM + MCP)       │
    │                                                                         │
    │  Thought-Action-Observation loop with live Tavily web search.           │
    │  ├─ tavily_search tool via hosted MCP endpoint                          │
    │  ├─ Maximum 25 tool call turns; 300s timeout                            │
    │  ├─ Full trace captured via with_trace=ALL_MESSAGES                     │
    │  └─ Structured JSON output: final_answer, supporting_urls,              │
    │     short_justification                                                 │
    ├─────────────────────────────────────────────────────────────────────────┤
    │                   STAGE 4: STRUCTURED FORMATTING (LLM)                  │
    │                                                                         │
    │  Normalize raw agent output into clean JSON via LLMStructuredColumn.    │
    │  Handles markdown fences, trailing text, single-quoted dicts.           │
    │                                                                         │
    │  The agent_solution_raw__trace column IS the SFT training data:         │
    │  complete ChatML conversation with every tool call and response.        │
    └─────────────────────────────────────────────────────────────────────────┘

Prerequisites:
    - TAVILY_API_KEY environment variable (get a free key at https://tavily.com)
    - OPENAI_API_KEY environment variable for OpenAI provider model aliases.
    - NVIDIA_API_KEY environment variable for NVIDIA provider model aliases (default model alias is "nvidia-text").

Run:
    # Basic usage with built-in demo seeds (generates 2 trajectories)
    uv run search_agent.py

    # Use a custom seed parquet
    uv run search_agent.py --seed-path /path/to/seeds.parquet --num-records 10

    # For help message and available options
    uv run search_agent.py --help
"""

from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path

from pydantic import BaseModel, Field

import data_designer.config as dd
from data_designer.interface import DataDesigner

# =============================================================================
# Structured Output Schema
# =============================================================================


class AgentSolution(BaseModel):
    """Structured output for the search agent's final answer."""

    final_answer: str = Field(..., min_length=1, description="The final answer entity.")
    supporting_urls: list[str] = Field(
        default_factory=list, description="Authoritative URLs used to verify the answer."
93     )
94     short_justification: str = Field(..., min_length=1, description="Brief explanation of reasoning (1-2 sentences).")
95 
96 
97 # =============================================================================
98 # Prompt Templates
99 # =============================================================================
100 
101 QUERY_DRAFT_PROMPT = """\
102 You are an expert Search Evaluator designing Grandmaster-Level search tests.
103 Create a complex, multi-step search riddle based on this knowledge path:
104 
105 {{ readable_path }}
106 
107 Start Entity: {{ seed_entity }}
108 Final Answer Entity: {{ final_answer_entity }}
109 
110 CRITICAL RULES:
111 1. DO NOT name the intermediate nodes. Hide them behind descriptions.
112 2. DO NOT name the Final Answer.
113 3. Chain the clues logically -- describe each step relative to the previous one.
114 4. Audit the logic: if a step is weak or nonsensical, IGNORE IT.
115 5. Salvage and simplify: use only the strongest, most logical hops.
116 6. No hallucinations: do not invent relationships not in the path.
117 7. Aim for 4-8 meaningful hops.
118 
119 VALIDATION - Output "INVALID_PATH" if:
120 - Final answer is generic/abstract (e.g. "technology", "people", "field")
121 - Path has weak/illogical relationships
122 - No coherent question can be formed
123 
124 Return ONLY the question string (or "INVALID_PATH").\
125 """
126 
127 OBFUSCATE_PROMPT = """\
128 Rewrite this search riddle to better match BrowseComp-style tasks.
129 
130 Original Riddle: {{ user_query_draft }}
131 
132 Secret Path (do not leak entities): {{ readable_path }}
133 Start Entity: {{ seed_entity }}
134 Final Answer (do not leak): {{ final_answer_entity }}
135 
136 HARD REQUIREMENTS:
137 1. NEVER reveal the step-by-step plan. No breadcrumb chains.
138    Avoid: "X is member of Y; Y is based in Z; Z is the capital of..."
139    Avoid meta language: "then search...", "next find...", "follow the chain..."
140 2. NEVER mention the final answer or any intermediate entity by name.
141 3. Keep it concise and natural: 1-2 sentences max (3 for very complex paths).
142 4. Use descriptive clues that require reasoning.
143 5. Include at least one disambiguating filter (date, count, or specific attribute).
144 6. If original == "INVALID_PATH", output exactly "INVALID_PATH".
145 
146 Return ONLY the rewritten question string (or "INVALID_PATH").\
147 """
148 
149 AGENT_SYSTEM_PROMPT = """\
150 You are an expert search agent that uses web search to answer questions accurately.
151 
152 You MUST output ONLY valid JSON matching this exact schema:
153 
154 {
155   "final_answer": "string - the specific answer entity",
156   "supporting_urls": ["url1", "url2"],
157   "short_justification": "string - brief 1-2 sentence explanation"
158 }
159 
160 AVAILABLE TOOLS:
161 You have access to ONE tool called "tavily_search" with parameter: query (string, required).
162 
163 TOOL USAGE RULES:
164 1. Exact Tool Name: Always use "tavily_search" (no suffixes or prefixes).
165 2. Exact Args: Only send {"query": "..."} for the tool call.
166 3. Maximum 25 tool calls. Budget your searches wisely.
167 4. Search Strategy:
168    - Start with broad queries to understand the domain
169    - Refine to specific entities/relationships
170    - Cross-verify facts across multiple sources
171    - Use different query formulations for the same information
172 5. No Reasoning Tags: Do NOT use <think> tags or XML formatting.
173 6. No Intermediate Text: Do NOT output explanatory text between tool calls.
174 7. Final Output: After completing your searches, output ONLY the JSON object.
175 
176 EXECUTION FLOW:
177 1. Read the user's question
178 2. Make tool calls using "tavily_search" to gather information
179 3. Verify information across multiple sources
180 4. Once confident, output the JSON result (no additional text)\
181 """
182 
183 FORMATTER_PROMPT = """\
184 You are a JSON normalizer.
185 
186 You will be given a messy model output that should contain a JSON object with:
187 - final_answer (string)
188 - supporting_urls (list of strings)
189 - short_justification (string)
190 
191 Rules:
192 - Return ONLY a JSON object. No markdown. No extra text.
193 - If the input contains code fences, tool chatter, or extra prose, ignore it.
194 - If the input contains invalid JSON, repair it.
195 - supporting_urls must be a list of valid http(s) URLs (dedupe, keep best 1-5).
196 
197 Input:
198 {{ agent_solution_raw }}\
199 """
200 
201 
202 # =============================================================================
203 # Data Designer Configuration
204 # =============================================================================
205 
206 
207 def build_config(model_alias: str) -> tuple[dd.DataDesignerConfigBuilder, dd.MCPProvider]:
208     """Build the Data Designer configuration for search agent trajectory generation.
209 
210     Returns:
211         A tuple of (config_builder, mcp_provider).
212     """
213     tavily_api_key = os.environ.get("TAVILY_API_KEY", "")
214     mcp_provider = dd.MCPProvider(
215         name="tavily",
216         endpoint=f"https://mcp.tavily.com/mcp/?tavilyApiKey={tavily_api_key}",
217         provider_type="streamable_http",
218     )
219 
220     tool_config = dd.ToolConfig(
221         tool_alias="tavily-search",
222         providers=["tavily"],
223         allow_tools=["tavily_search"],
224         max_tool_call_turns=25,
225         timeout_sec=300.0,
226     )
227 
228     config_builder = dd.DataDesignerConfigBuilder(tool_configs=[tool_config])
229 
230     # Stage 2: Draft question from knowledge path
231     config_builder.add_column(
232         dd.LLMTextColumnConfig(
233             name="user_query_draft",
234             model_alias=model_alias,
235             prompt=QUERY_DRAFT_PROMPT,
236         )
237     )
238 
239     # Stage 2: BrowseComp-style obfuscation
240     config_builder.add_column(
241         dd.LLMTextColumnConfig(
242             name="user_query_obfuscated",
243             model_alias=model_alias,
244             prompt=OBFUSCATE_PROMPT,
245         )
246     )
247 
248     # Stage 3: Agent trajectory with MCP tool calling
249     config_builder.add_column(
250         dd.LLMTextColumnConfig(
251             name="agent_solution_raw",
252             model_alias=model_alias,
253             system_prompt=AGENT_SYSTEM_PROMPT,
254             prompt="Problem: {{ user_query_obfuscated }}",
255             tool_alias="tavily-search",
256             with_trace=dd.TraceType.ALL_MESSAGES,
257         )
258     )
259 
260     # Stage 4: Structured JSON formatting
261     config_builder.add_column(
262         dd.LLMStructuredColumnConfig(
263             name="agent_solution",
264             model_alias=model_alias,
265             prompt=FORMATTER_PROMPT,
266             output_format=AgentSolution,
267         )
268     )
269 
270     return config_builder, mcp_provider
271 
272 
273 # =============================================================================
274 # Demo Seed Data
275 # =============================================================================
276 
277 DEMO_SEEDS = [
278     {
279         "seed_entity": "NVIDIA",
280         "final_answer_entity": "Thomas Hart Benton",
281         "readable_path": (
282             "START ENTITY: NVIDIA (Q182477)\n"
283             "  \u2b07 [chief executive officer (P169)]\n"
284             "  NODE: Jensen Huang (Q332838)\n"
285             "  \u2b07 [educated at (P69)]\n"
286             "  NODE: Oregon State University (Q861888)\n"
287             "  \u2b07 [located in the administrative territorial entity (P131)]\n"
288             "  NODE: Benton County (Q115372)\n"
289             "  \u2b07 [named after (P138)]\n"
290             "  NODE: Thomas Hart Benton (Q178712)"
291         ),
292         "num_hops_in_graph": 4,
293         "ground_truth": "Thomas Hart Benton",
294     },
295     {
296         "seed_entity": "Python",
297         "final_answer_entity": "Centrum Wiskunde & Informatica",
298         "readable_path": (
299             "START ENTITY: Python (Q28865)\n"
300             "  \u2b07 [developer (P178)]\n"
301             "  NODE: Guido van Rossum (Q19845)\n"
302             "  \u2b07 [employer (P108)]\n"
303             "  NODE: Centrum Wiskunde & Informatica (Q1060645)"
304         ),
305         "num_hops_in_graph": 2,
306         "ground_truth": "Centrum Wiskunde & Informatica",
307     },
308     {
309         "seed_entity": "toothache",
310         "final_answer_entity": "ibuprofen",
311         "readable_path": (
312             "START ENTITY: toothache (Q143925)\n"
313             "  \u2b07 [risk factor (P564)]\n"
314             "  NODE: smoking (Q662860)\n"
315             "  \u2b07 [has effect (P1542)]\n"
316             "  NODE: Crohn's disease (Q1472)\n"
317             "  \u2b07 [drug or therapy used for treatment (P2176)]\n"
318             "  NODE: TNF inhibitor (Q1536078)\n"
319             "  \u2b07 [is possible treatment of (P2175)]\n"
320             "  NODE: Beh\u00e7et's disease (Q911427)\n"
321             "  \u2b07 [symptoms and signs (P780)]\n"
322             "  NODE: inflammation (Q101991)\n"
323             "  \u2b07 [drug or therapy used for treatment (P2176)]\n"
324             "  NODE: flurbiprofen (Q419890)\n"
325             "  \u2b07 [significant drug interaction (P769)]\n"
326             "  NODE: parecoxib (Q347941)\n"
327             "  \u2b07 [significant drug interaction (P769)]\n"
328             "  NODE: ibuprofen (Q186969)"
329         ),
330         "num_hops_in_graph": 8,
331         "ground_truth": "ibuprofen",
332     },
333 ]
334 
335 
336 def write_demo_seeds(output_dir: Path) -> Path:
337     """Write demo seed data to a JSONL file."""
338     output_dir.mkdir(parents=True, exist_ok=True)
339     seed_path = output_dir / "demo_seeds.jsonl"
340     with open(seed_path, "w", encoding="utf-8") as f:
341         for seed in DEMO_SEEDS:
342             f.write(json.dumps(seed, ensure_ascii=False) + "\n")
343     return seed_path
344 
345 
346 # =============================================================================
347 # Main Entry Point
348 # =============================================================================
349 
350 
351 def parse_args():
352     """Parse command line arguments."""
353     from argparse import ArgumentParser
354 
355     parser = ArgumentParser(description="Generate search agent trajectories using Tavily web search via MCP.")
356     parser.add_argument("--model-alias", type=str, default="nvidia-text", help="Model alias to use for generation")
357     parser.add_argument("--num-records", type=int, default=2, help="Number of trajectories to generate")
358     parser.add_argument("--seed-path", type=str, default=None, help="Path to seed parquet or JSONL file")
359     parser.add_argument("--artifact-path", type=str, default=None, help="Path to save artifacts")
360     return parser.parse_args()
361 
362 
363 def main() -> None:
364     """Main entry point for the demo."""
365     args = parse_args()
366 
367     if os.environ.get("TAVILY_API_KEY") is None:
368         raise RuntimeError("TAVILY_API_KEY must be set. Get a free key at https://tavily.com")
369 
370     if os.environ.get("NVIDIA_API_KEY") is None and args.model_alias.startswith("nvidia"):
371         raise RuntimeError("NVIDIA_API_KEY must be set when using NVIDIA model aliases.")
372 
373     if args.seed_path:
374         seed_path = args.seed_path
375     else:
376         demo_dir = Path(tempfile.mkdtemp(prefix="search_agent_demo_"))
377         seed_path = str(write_demo_seeds(demo_dir))
378         print(f"Using demo seeds in: {demo_dir}")
379 
380     config_builder, mcp_provider = build_config(model_alias=args.model_alias)
381     config_builder.with_seed_dataset(
382         dd.LocalFileSeedSource(path=seed_path),
383         sampling_strategy=dd.SamplingStrategy.SHUFFLE,
384     )
385 
386     data_designer = DataDesigner(artifact_path=args.artifact_path, mcp_providers=[mcp_provider])
387     preview_results = data_designer.preview(config_builder, num_records=args.num_records)
388 
389     print("\n" + "=" * 60)
390     print("GENERATED SEARCH AGENT TRAJECTORIES")
391     print("=" * 60)
392     preview_results.display_sample_record()
393 
394 
395 if __name__ == "__main__":
396     main()
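
The script above stops at a preview and does not show the post-processing step. As a rough illustration only, here is one way the generated rows could be filtered into SFT examples. It assumes each row is available as a plain dict keyed by the column names used in the script (`user_query_obfuscated`, `ground_truth`, `agent_solution`, `agent_solution_raw__trace`). The helper names `normalize` and `to_sft_examples`, the exact-match rule, and the `{"messages": [...]}` output shape are hypothetical choices, not part of Data Designer or of the pipeline described here.

```python
import json
import re
import unicodedata

# Trace column name as given in the script's docstring.
TRACE_COLUMN = "agent_solution_raw__trace"


def normalize(text: str) -> str:
    """Lowercase and strip accents/punctuation so near-identical answers compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def to_sft_examples(records: list[dict]) -> list[dict]:
    """Keep only rollouts whose final answer matches the seed's ground truth."""
    examples = []
    for record in records:
        # Stage 2 emits the literal string "INVALID_PATH" for unusable seeds.
        if "INVALID_PATH" in str(record.get("user_query_obfuscated", "")):
            continue

        solution = record.get("agent_solution")
        if isinstance(solution, str):  # tolerate a JSON-encoded column (assumption)
            try:
                solution = json.loads(solution)
            except json.JSONDecodeError:
                continue
        if not isinstance(solution, dict):
            continue

        answer = normalize(str(solution.get("final_answer", "")))
        truth = normalize(str(record.get("ground_truth", "")))
        trace = record.get(TRACE_COLUMN)
        # Strict heuristic: require an exact match after normalization.
        if not trace or not truth or answer != truth:
            continue

        # Hypothetical output shape: one {"messages": [...]} object per example.
        examples.append({"messages": list(trace)})
    return examples
```

Exact matching after normalization is deliberately strict: it will drop correct answers that are phrased differently from the seed's `ground_truth`, so a looser comparison (or an LLM judge) may be preferable when yield matters more than precision.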

Key Resources: