Architecture & Performance
Data Designer is an orchestration framework that coordinates synthetic data generation workflows. It is a client of LLM inference servers; it does not host models itself.
This guide explains the architecture, execution model, and how to tune performance for your specific use case.
The default execution path is the async engine, which dispatches work at the cell level and overlaps independent columns (see Async Engine below for its semantics). The legacy sync engine is still available for one transitional release via DATA_DESIGNER_ASYNC_ENGINE=0 and is what this section describes. The configuration knobs documented below (buffer_size, max_parallel_requests, AIMD throttle config, error handling) apply to both engines; the differences are flagged inline.
The sync engine processes datasets in batches, with parallel operations within each batch.
Step 1: Split into batches
Your dataset is divided into batches of buffer_size records. Each batch is processed completely before moving to the next.
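A minimal sketch of the batching arithmetic, assuming the final batch simply holds whatever records remain. The helper split_into_batches is illustrative and is not part of Data Designer's API.

```python
def split_into_batches(num_records: int, buffer_size: int) -> list[int]:
    """Return the size of each batch when num_records is cut into buffer_size chunks."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    full_batches, remainder = divmod(num_records, buffer_size)
    sizes = [buffer_size] * full_batches
    if remainder:
        # Assumption: leftover records form one smaller, final batch.
        sizes.append(remainder)
    return sizes
```

Under that assumption, 250 records with buffer_size=100 yield three batches of 100, 100, and 50 records.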
Step 2: Process columns sequentially
Within a batch, columns are generated one at a time following the dependency graph. The order depends on column dependencies: expression columns may come before LLM columns if the LLM columns depend on them. (The async engine relaxes this: columns whose per-cell dependencies are satisfied can run concurrently with columns earlier in the order.)
Example workflow:
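As a hypothetical illustration, the ordering can be sketched as a topological sort over the dependency graph. The column names and dependencies below are invented, and generation_order is not a Data Designer function.

```python
from graphlib import TopologicalSorter


def generation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return a column order in which every column follows the columns it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: a sampler feeds an expression, which feeds two LLM columns.
columns = {
    "category": set(),              # sampler column, no dependencies
    "prompt_prefix": {"category"},  # expression column built from the sampler
    "question": {"prompt_prefix"},  # LLM column that reads the expression
    "answer": {"question"},         # LLM column that reads the question
}
order = generation_order(columns)
```

Here the expression column prompt_prefix is generated before both LLM columns because they depend on it, directly or indirectly.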
Step 3: Generate cells in parallel
Within each column, cells are processed in parallel up to the configured limit:
At any moment, the number of concurrent LLM requests is:
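One way to read this bound, assuming it is the smallest of three quantities: the batch size, the current throttle limit, and the cells still pending in the column. The function name and the remaining_cells parameter are illustrative, not Data Designer API.

```python
def concurrent_requests(buffer_size: int, current_throttle_limit: int, remaining_cells: int) -> int:
    """Upper bound on in-flight LLM requests for one column of one batch (assumed model)."""
    # A column cannot have more requests in flight than it has cells left,
    # than the batch holds, or than the throttle currently allows.
    return max(0, min(buffer_size, current_throttle_limit, remaining_cells))
```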
max_parallel_requests sets the ceiling. The actual limit (current_throttle_limit) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:
If rampup_seconds is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward max_parallel_requests over that duration.

This means Data Designer automatically finds the right concurrency level for your server without manual tuning.
AIMD adaptive concurrency is fully active on the default async engine. The legacy sync engine is available for one transitional release via DATA_DESIGNER_ASYNC_ENGINE=0; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See Async engine below.
Example: With buffer_size=100 and max_parallel_requests=32, Data Designer can send up to 32 requests in parallel. If rampup_seconds=30, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up.
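The startup ramp in this example can be approximated with a linear interpolation. This is a sketch under stated assumptions: rampup_limit is a hypothetical helper, and rounding down to a whole number of requests is a guess rather than documented behavior.

```python
def rampup_limit(elapsed_seconds: float, rampup_seconds: float, max_parallel_requests: int) -> int:
    """Concurrency allowed during startup, growing linearly from 1 to the ceiling."""
    if rampup_seconds <= 0 or elapsed_seconds >= rampup_seconds:
        # No ramp configured, or the ramp has finished: use the full ceiling.
        return max_parallel_requests
    fraction = max(0.0, elapsed_seconds) / rampup_seconds
    # Assumption: partial steps round down to a whole number of requests.
    return 1 + int(fraction * (max_parallel_requests - 1))
```

With rampup_seconds=30 and max_parallel_requests=32, this sketch allows 1 request at startup, 16 at the halfway point, and 32 once the ramp completes. It does not model the ramp stopping early when the server returns 429s.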
buffer_size (RunConfig)
Controls how many records are processed per batch.
When to increase: High-capacity inference server, single-model workflows, memory not constrained
When to decrease: Memory-constrained environments, development/debugging, complex multi-model pipelines
Long generation jobs can be resumed from checkpoints by passing resume to DataDesigner.create() or data-designer create --resume.
Resume modes:
ResumeMode.NEVER (default): always start a fresh generation run. If the dataset directory already exists, Data Designer writes to a timestamped directory.
ResumeMode.ALWAYS: resume the existing dataset directory. Raises if the checkpoint is incompatible or cannot be resumed safely.
ResumeMode.IF_POSSIBLE: resume when the stored config fingerprint matches the current config; otherwise start a fresh timestamped run.

Data Designer stores the run configuration in metadata.json (buffer_size, target_num_records, config fingerprint) and builder_config.json. Both engines recover progress the same way: they scan completed batch_*.parquet row groups and read parquet metadata for the row count actually persisted. That keeps resume crash-safe even if a run was interrupted between writing a batch parquet and updating metadata, because the filesystem reflects the durable state even when metadata lags by a step.
Resume has a few important invariants:
buffer_size must match the original run.
num_records must be at least the original target; you may extend a run by requesting more records.
allow_resize=True columns are not resumable because row boundaries can change.
Once process_after_generation() has run, the dataset is considered terminal for resume. Re-running with the same target returns the existing dataset; extending requires a fresh run.
If the run was interrupted before process_after_generation() could start, resume runs after-generation on the existing on-disk dataset (the parquet files are still clean) and marks it terminal afterwards. A crash during process_after_generation() still raises: the parquet files may have been partially rewritten, and starting fresh is the only safe option.

The DatasetCreationResults returned by a resume invocation reflects the full dataset on disk for anything that reads the artifact directory (load_dataset, count_records, load_analysis, export, push_to_hub). Per-run observability (task_traces, model-usage logs, and telemetry events emitted during the call) is scoped to the resume invocation only; the original run's in-memory traces are not persisted across process boundaries.
Only resume datasets from trusted artifact directories. Resume reads local metadata.json, builder_config.json, and parquet files to determine checkpoint state.
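The first three resume invariants above can be expressed as a pre-flight check. This is a sketch only: StoredRun and resume_violations are hypothetical names, and Data Designer's own validation may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredRun:
    """Values a previous run is described as recording in metadata.json."""
    buffer_size: int
    target_num_records: int


def resume_violations(
    stored: StoredRun,
    *,
    buffer_size: int,
    num_records: int,
    has_allow_resize_columns: bool,
) -> list[str]:
    """Return a human-readable list of invariants the requested resume would break."""
    problems = []
    if buffer_size != stored.buffer_size:
        problems.append("buffer_size must match the original run")
    if num_records < stored.target_num_records:
        problems.append("num_records must be at least the original target")
    if has_allow_resize_columns:
        problems.append("allow_resize=True columns are not resumable")
    return problems
```

An empty list means the request satisfies these three invariants; extending a run (a larger num_records with the same buffer_size) passes the check.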
max_parallel_requests (InferenceParams)
Sets the maximum concurrent LLM API calls per model. This is the ceiling that the AIMD throttle controller can ramp up to; the actual concurrency at runtime may be lower if the server signals rate limits.
Default: 4
When to increase: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs. With AIMD, setting an aggressively high value is safer than before: the system will self-correct downward if the server can't keep up. The salvage queue on the async engine (default) reclaims failed rows; on the sync engine the initial burst of 429s before AIMD stabilizes can drop rows, so start with a more conservative ceiling if you've opted into sync.
When to decrease: You want to cap resource usage to a known safe level, or you want more predictable/debuggable execution.
Finding the optimal value
The right value depends on your inference stack and model. Self-hosted vLLM servers can often handle values as high as 256, 512, or even 1024 depending on your hardware.
With AIMD, a practical approach is to set max_parallel_requests to the upper bound you're comfortable with and let the throttle controller find the sustainable level automatically. If you see frequent 429 → recovery cycles in the logs, your ceiling is above the server's true capacity but the system is handling it. If you never see any throttle activity, you may have room to increase the ceiling further.
Benchmark approach: Run a small dataset (e.g., 100 records) with increasing max_parallel_requests values (4 → 8 → 16 → 32 → …) and measure generation time. Stop increasing when the runtime stops decreasing; that's when your inference server is saturated.
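The sweep can be automated along these lines. This is a sketch: run_benchmark is a callable you would supply (it runs the small dataset at the given max_parallel_requests and returns elapsed seconds), and the 5% improvement threshold is an arbitrary assumption.

```python
from typing import Callable


def find_saturation_point(
    run_benchmark: Callable[[int], float],
    start: int = 4,
    ceiling: int = 1024,
    min_speedup: float = 0.05,
) -> int:
    """Double concurrency until runtime stops improving; return the last value that helped."""
    best_value = start
    best_time = run_benchmark(start)
    candidate = start * 2
    while candidate <= ceiling:
        elapsed = run_benchmark(candidate)
        # Stop once the run is not at least min_speedup faster than the best so far.
        if elapsed > best_time * (1 - min_speedup):
            break
        best_value, best_time = candidate, elapsed
        candidate *= 2
    return best_value
```

Repeating each measurement a few times and averaging would make the stopping decision less sensitive to noise.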
non_inference_max_parallel_workers (RunConfig)
Controls thread pool size for non-LLM operations (samplers, expressions, validators).
Default: 4
When to increase: Many CPU-bound columns (complex expressions, heavy sampling)
Data Designer uses an AIMD (Additive Increase / Multiplicative Decrease) controller to automatically adjust concurrency per model based on rate-limit feedback from the inference server. The defaults work well for most workloads. Override them via ThrottleConfig only when you understand the trade-offs.
Adaptive throttling is fully active on the default async engine, where 429 responses propagate directly to the AIMD controller. On the legacy sync engine (DATA_DESIGNER_ASYNC_ENGINE=0), 429s are first retried at the HTTP transport layer; ThrottleConfig settings only take effect as a fallback if transport retries are exhausted.
How it works in practice
When a model endpoint returns HTTP 429, the controller reduces the concurrency limit for that model and pauses briefly. After enough consecutive successes, it begins ramping back up. If the server rate-limits again, the controller records that level as a ceiling and stabilizes just below it, with a small overshoot band to detect when the server can handle more load.
You can observe this in the logs: look for messages like concurrency reduced from X → Y and concurrency increased from X → Y.
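The core AIMD loop can be sketched as follows. AIMDLimiter is a simplified, hypothetical model rather than Data Designer's controller: the 0.75 decrease factor, the 10-success window, and the step of 1 are assumptions, and the cooldown pause, learned ceiling, and overshoot band are left out.

```python
class AIMDLimiter:
    """Toy additive-increase / multiplicative-decrease concurrency limit."""

    def __init__(
        self,
        max_parallel_requests: int,
        decrease_factor: float = 0.75,     # assumed value, not a documented default
        successes_per_increase: int = 10,  # assumed value, not a documented default
        increase_step: int = 1,
    ) -> None:
        self.ceiling = max_parallel_requests
        self.limit = max_parallel_requests
        self.decrease_factor = decrease_factor
        self.successes_per_increase = successes_per_increase
        self.increase_step = increase_step
        self._streak = 0

    def on_rate_limited(self) -> None:
        """HTTP 429: shrink the limit multiplicatively and reset the success streak."""
        self.limit = max(1, int(self.limit * self.decrease_factor))
        self._streak = 0

    def on_success(self) -> None:
        """Grow the limit additively after enough consecutive successes."""
        self._streak += 1
        if self._streak >= self.successes_per_increase:
            self.limit = min(self.ceiling, self.limit + self.increase_step)
            self._streak = 0
```

Starting from a ceiling of 32, two consecutive 429s in this toy model lower the limit to 24 and then 18; ten successes afterwards raise it to 19.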
Control retry behavior and early shutdown for failed generations.
When to adjust:
Raise max_conversation_restarts to 7 and add max_conversation_correction_steps=2.
Set disable_early_shutdown=True to see all errors.
Lower max_conversation_restarts to 3.

The async engine is the default execution path. It dispatches work at the cell level rather than the column level, so independent columns overlap in time and per-(provider, model) AIMD pools tune themselves independently. See the Async All the Way Down dev note for the full architecture.
The inference_parameters.timeout field on a ModelConfig sets the per-request HTTP timeout. The same value also drives the sync-to-async bridge that custom columns use when they call model.generate(). There is no separate queue-wait deadline; waits scale with provider speed and AIMD's adaptive concurrency. Slow self-hosted endpoints (e.g. large models on a single GPU) only need this one knob raised.
A run can finish with fewer records than requested when non-retryable errors drop rows. To detect this, compare len(result.load_dataset()) against the number of records you requested.
If the rate of non-retryable errors crosses RunConfig.shutdown_error_rate, generation stops early and raises DataDesignerEarlyShutdownError (a subclass of DataDesignerGenerationError). Catch it separately when a typed retry path is appropriate:
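A sketch of the handler ordering this implies. The two exception classes below are local stand-ins that mirror the names and subclass relationship described above; in real code you would import them from Data Designer instead. run_with_typed_handling and its create argument are hypothetical.

```python
class DataDesignerGenerationError(Exception):
    """Local stand-in for the library's base generation error."""


class DataDesignerEarlyShutdownError(DataDesignerGenerationError):
    """Local stand-in for the error raised when generation stops early."""


def run_with_typed_handling(create):
    """Call create() and report which kind of failure occurred, if any."""
    try:
        return "ok", create()
    except DataDesignerEarlyShutdownError as exc:
        # The subclass must be caught first, or the base-class handler would swallow it.
        return "early_shutdown", exc
    except DataDesignerGenerationError as exc:
        return "generation_error", exc
```

The early-shutdown branch is where a typed retry path would go, for example rerunning after fixing the cause of the errors.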
DATA_DESIGNER_ASYNC_ENGINE=0 selects the legacy sync engine. This is a deprecated escape hatch for the transitional release and will be removed in a future version. The opt-out also emits a DeprecationWarning at run time so it shows up in your logs.
Low throughput → increase max_parallel_requests (AIMD will self-correct if you overshoot). Memory issues → decrease buffer_size. Long tails → tune retry settings.