NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, NeMo Curator’s built-in Inference Server (Ray Serve + vLLM), or other inference providers.
Synthetic data generation in NeMo Curator operates in two primary modes:
Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.
Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:
Define data generation pipelines declaratively using NeMo Data Designer (NDD). Instead of writing imperative LLM call logic, you configure structured column generation (samplers, expressions, LLM text columns) through a builder API or YAML file. NDD handles execution, batching, and token metric collection. This mode supports both standalone generation and NDD-backed versions of Nemotron-CC stages.
The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:
Before using synthetic data generation, ensure you have:
NVIDIA API Key (for cloud endpoints)
export NVIDIA_API_KEY="your-key"NeMo Curator with text extras
Local inference (optional) — to serve models alongside your pipeline:
Refer to the Inference Server guide for setup details.
Nemotron-CC pipelines use the transformers library for tokenization, which is included in NeMo Curator core dependencies.
Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints configuration performance
Serve LLMs locally via Ray Serve and vLLM alongside curation pipelines ray-serve local-inference
Generate synthetic Q&A pairs across multiple languages quickstart tutorial
Declarative data generation with structured columns and NDD-backed Nemotron-CC stages ndd declarative
Advanced text transformation and knowledge extraction workflows advanced paraphrasing