The data-designer-retrieval-sdg plugin turns source documents into grounded retriever training and BEIR evaluation data: bundle and chunk a corpus, generate multi-hop QA pairs, deduplicate and judge them, then export AutoModel-compatible artifacts.
If you are building a RAG system, you have probably hit this wall: the generator is good, the vector database is fast, the prompt is carefully tuned, and the answer is still wrong because the right passage never made it into context.
That is a retrieval problem. More specifically, it is often a data problem. General-purpose embedding models understand broad semantic similarity, but they do not know the fine-grained distinctions in your product docs, tickets, policies, codebase, manuals, or internal taxonomy. To improve that, you need domain-specific retriever training and evaluation data: realistic queries, positive passages, held-out evals, and enough metadata to know whether the retriever actually found the right evidence.
The hard part is not asking an LLM to write questions about a document. The hard part is keeping every generated question tied to the exact chunk, document, or multi-hop evidence set that a retriever should recover. Many RAG tutorials stop at chunk, embed, retrieve, and prompt. Fine-tuning recipes often begin once labeled query-passage pairs already exist. The gap in between is where developers lose the most time.
The plugin fills that gap. It packages a retrieval SDG toolkit that starts with a directory of documents, generates synthetic query-positive examples with NeMo Data Designer, filters them, and exports them for retriever fine-tuning and BEIR-style evaluation.
This is not just a demo package. The same plugin produced the Retrieval-Synthetic-NVDocs-v1 dataset from NVIDIA public documentation, and it powers the bootstrap SDG stage for both the NeMo embedding fine-tune recipe and the reranking fine-tune recipe. It is now available as a standalone Data Designer plugin for generating high-quality, complex, multi-document, multi-hop retrieval data compatible with AutoModel.
This post walks through what the plugin does, why the generated labels matter, and how to make your first small run useful before you scale it up.
The plugin packages a four-stage Data Designer pipeline:
The package contributes two Data Designer extensions:
For local runs, the current package exposes a Python API and a CLI:
This is still Data Designer: users declare the corpus and generation settings; the engine handles dependency ordering, model calls, async scheduling, previews, and dataset output.
For retriever training, chunking is not just preprocessing. The chunk IDs become labels. If a generated query uses chunks 3, 7, and 8, those IDs have to survive generation, filtering, splitting, and export.
The document-chunker seed reader handles that boundary:
Each row includes the original file name, full text, sentence chunks, structured section text, and bundle metadata. The important part is that chunks carry chunk_id values. Those IDs are what later become positive documents in training and qrels.
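To make the idea of chunk IDs as labels concrete, here is a minimal sketch of sentence chunking that assigns a sequential chunk_id to every chunk. It is not the plugin's implementation: the function names, the regex-based sentence splitter, the `sentences_per_chunk` parameter, and every field name other than chunk_id are assumptions made for illustration.

```python
import re


def chunk_sentences(text, sentences_per_chunk=3):
    """Group sentences into chunks and give each chunk a sequential chunk_id."""
    # Naive splitter (assumption): break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for start in range(0, len(sentences), sentences_per_chunk):
        chunks.append(
            {
                "chunk_id": len(chunks) + 1,  # 1-based ID that later acts as a label
                "text": " ".join(sentences[start:start + sentences_per_chunk]),
            }
        )
    return chunks


def build_row(file_name, text, sentences_per_chunk=3):
    """Build one hypothetical seed row: file name, full text, and ID-carrying chunks."""
    return {
        "file_name": file_name,
        "text": text,
        "chunks": chunk_sentences(text, sentences_per_chunk),
    }


row = build_row("guide.md", "One. Two. Three. Four.", sentences_per_chunk=2)
```

Because the IDs are assigned once, at read time, every later stage can refer to a chunk by ID instead of re-deriving it from text.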
For questions that span multiple documents, such as “How does the migration guide change the deployment recommendation from the architecture overview?”, enable multi-document bundling:
That gives the model opportunities to generate cross-document questions while still tracking which document each segment came from.
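One way to picture that provenance tracking is the sketch below. It groups documents in list order purely for illustration; how the plugin actually selects documents for a bundle is not described here, and `bundle_documents`, `source_documents`, `bundle_size`, and the field names are all hypothetical.

```python
def bundle_documents(docs, bundle_size=2):
    """Group documents into bundles; number segments per bundle and keep their origin."""
    bundles = []
    for start in range(0, len(docs), bundle_size):
        segments = []
        for doc in docs[start:start + bundle_size]:
            for text in doc["chunks"]:
                segments.append(
                    {
                        "segment_id": len(segments) + 1,  # unique within the bundle
                        "file_name": doc["file_name"],    # provenance survives bundling
                        "text": text,
                    }
                )
        bundles.append({"bundle_id": len(bundles), "segments": segments})
    return bundles


def source_documents(bundle, segment_ids):
    """Return the sorted set of files that the given segment IDs come from."""
    wanted = set(segment_ids)
    return sorted(
        {s["file_name"] for s in bundle["segments"] if s["segment_id"] in wanted}
    )


docs = [
    {"file_name": "architecture.md", "chunks": ["Overview text.", "Deployment text."]},
    {"file_name": "migration.md", "chunks": ["Migration text."]},
]
bundle = bundle_documents(docs, bundle_size=2)[0]
spanned = source_documents(bundle, [2, 3])
```

A question whose segment IDs resolve to more than one file, as `spanned` does here, is a cross-document question by construction.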
The pipeline first extracts document artifacts: concepts, relationships, themes, entities, processes, insights, technical terms, and contextual factors. Then it asks the model to generate standalone questions grounded in the chunked context.
As a library, the path is compact:
A useful generated example looks like this:
Notice what is different from a generic QA generator:
segment_ids preserve the retrieval labels.
That combination is what makes the data useful for retriever training and not just QA evaluation.
Synthetic generators are enthusiastic. Ask for seven questions per document across a large corpus and you will get repeats: the same policy phrased three ways, the same setup requirement asked with slightly different wording, the same “how does X relate to Y” pattern over and over.
This stage has two gates: first remove near-repeated questions, then judge whether the remaining examples are grounded enough to train or evaluate a retriever.
The embedding-dedup column removes near duplicates inside each generated list:
The implementation embeds the question text, computes cosine similarity, and greedily drops items above the threshold. It also implements native agenerate(), so it participates directly in Data Designer’s async scheduler and uses model.agenerate_text_embeddings(...) instead of becoming a separate side job.
This is a small detail that has a large downstream effect: fewer duplicate queries mean cleaner training data and more informative held-out evals.
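The greedy step described above can be sketched in a few lines of plain Python. This is a synchronous illustration, not the plugin's code: it assumes the embeddings have already been computed, and the `greedy_dedup` name and the 0.9 default threshold are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_dedup(questions, embeddings, threshold=0.9):
    """Keep a question only if it is not too similar to any question kept so far."""
    kept = []  # indices of retained questions, in original order
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]


questions = [
    "What is the retention policy?",
    "How long is data retained?",
    "How do I install the agent?",
]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # toy vectors for illustration
unique = greedy_dedup(questions, embeddings)
```

Greedy selection is order-dependent: the first question in a near-duplicate group is the one that survives, so it can help to put the preferred phrasing first.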
Retriever data quality is easy to overestimate. A generated question might sound fluent but be unsupported. An answer might be correct but require a chunk that was not marked positive. A multi-hop question might only need one hop in practice.
The plugin adds an LLM judge column after deduplication. Each retained QA pair is scored for:
The converter defaults to --quality-threshold 7.0, keeping only pairs whose overall score passes the threshold. It also drops records where the number of judged pairs does not match the number of deduplicated pairs, because silent misalignment is worse than losing a row.
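A minimal sketch of that filtering logic follows. The record layout (`deduped_pairs`, `judgments`, `overall`) and the function name are hypothetical, and treating "passes" as greater than or equal to the threshold is an assumption; only the 7.0 default and the drop-on-mismatch behavior come from the description above.

```python
def filter_by_quality(records, quality_threshold=7.0):
    """Split judged QA pairs into kept and rejected; set aside misaligned records."""
    kept, rejected, misaligned = [], [], []
    for record in records:
        pairs = record["deduped_pairs"]
        judgments = record["judgments"]
        # A count mismatch means scores cannot be trusted to line up with pairs.
        if len(pairs) != len(judgments):
            misaligned.append(record)
            continue
        for pair, judgment in zip(pairs, judgments):
            # Assumption: a score equal to the threshold passes.
            if judgment["overall"] >= quality_threshold:
                kept.append(pair)
            else:
                rejected.append(pair)
    return kept, rejected, misaligned


records = [
    {
        "deduped_pairs": [{"question": "q1"}, {"question": "q2"}],
        "judgments": [{"overall": 8.5}, {"overall": 5.0}],
    },
    {
        "deduped_pairs": [{"question": "q3"}],
        "judgments": [],  # judge output missing, so the whole record is dropped
    },
]
kept, rejected, misaligned = filter_by_quality(records)
```

Returning the rejected pairs and misaligned records, rather than discarding them silently, keeps them available for manual inspection.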
Your first inspection pass should focus on the rejected and borderline examples. If many low-scoring examples share the same failure mode, tune chunk size, document cleanup, model choice, or question complexity before scaling up.
The final conversion step rebuilds a deduplicated corpus from the generated chunks, maps segment_ids to positive document IDs, filters by quality, and writes both training and evaluation formats.
For training:
For evaluation:
This is one of the main reasons the plugin exists. It is easy to generate questions. It is harder to keep training examples, corpus records, and qrels aligned enough that the numbers mean something.
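The alignment requirement can be illustrated with a small sketch: deduplicate chunk text into a corpus, translate each record's local segment_ids into corpus document IDs, and derive training examples and qrels from the same mapping. The field names, the ID formats, and the `convert` and `check_alignment` helpers are hypothetical, and the sketch omits the train/eval split and the on-disk formats.

```python
def convert(records):
    """Build a deduplicated corpus plus training examples and qrels that share IDs."""
    corpus = {}      # doc_id -> chunk text
    text_to_id = {}  # chunk text -> doc_id, so identical chunks share one ID
    train, qrels = [], {}
    for record in records:
        local_to_doc = {}  # this record's chunk_id -> corpus doc_id
        for chunk in record["chunks"]:
            text = chunk["text"]
            if text not in text_to_id:
                doc_id = f"d{len(corpus)}"
                text_to_id[text] = doc_id
                corpus[doc_id] = text
            local_to_doc[chunk["chunk_id"]] = text_to_id[text]
        for pair in record["qa_pairs"]:
            # An unknown segment ID raises KeyError instead of yielding a bad label.
            positives = [local_to_doc[s] for s in pair["segment_ids"]]
            query_id = f"q{len(train)}"
            train.append(
                {"query_id": query_id, "query": pair["question"], "pos_doc_ids": positives}
            )
            qrels[query_id] = {doc_id: 1 for doc_id in positives}
    return corpus, train, qrels


def check_alignment(corpus, qrels):
    """True when every document referenced in qrels exists in the corpus."""
    return all(doc_id in corpus for rels in qrels.values() for doc_id in rels)


records = [
    {
        "chunks": [{"chunk_id": 1, "text": "A"}, {"chunk_id": 2, "text": "B"}],
        "qa_pairs": [{"question": "How does A relate to B?", "segment_ids": [1, 2]}],
    },
    {
        "chunks": [{"chunk_id": 1, "text": "B"}],
        "qa_pairs": [{"question": "What does B say?", "segment_ids": [1]}],
    },
]
corpus, train, qrels = convert(records)
```

Deriving both outputs from one mapping is the design point: a positive in the training data and a relevance judgment in the qrels cannot drift apart if they are produced from the same lookup.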
Before scaling, look at a small sample and ask:
segment_ids?
Then iterate:
The goal of the first run is not volume. The goal is to learn how your corpus behaves.
Retrieval SDG needs document-specific seed reading, question deduplication, quality judging, and conversion logic. Packaging those pieces as a plugin gives teams a repeatable path from their own corpus to retriever data while preserving declarative Data Designer configs.
The retrieval SDG plugin includes:
Users still write declarative configs:
No registry mutation. No engine internals. No custom chunking pre-process that has to stay manually aligned with supporting evidence.
That is the bigger plugin story: Data Designer provides the orchestration framework, and plugins package domain-specific pieces for custom use cases without bloating the core library.
Do not start by generating a million examples. Pick 20-100 representative documents, run a preview, inspect the labels, and only then scale up.
Install the plugin:
Run a preview:
If the preview looks reasonable, run the full job:
Convert the generated data:
That produces the training and evaluation artifacts you need to keep moving:
Start here:
If your RAG system is failing because the retriever does not understand your domain, this is the action step: create the data that lets you measure and improve it. Bring a folder of documents, run the plugin, inspect the labels, and use the output to train and evaluate the retriever you actually need.