Troubleshoot Data Designer#
Learn how to troubleshoot common issues with the NeMo Data Designer microservice.
Common Configuration Errors#
Invalid Column Dependencies#
Problem: Jobs fail with template variable errors when column dependencies are not properly ordered.
Symptoms:

- Error: `Template variable '{{column_name}}' not found`
- Jobs fail during configuration validation
Solutions:

- Order Dependencies: Ensure referenced columns are defined before dependent columns.
- Validate Template Variables: Check that all `{{variable}}` references match existing column names.
- Use Preview Mode: Test complex configurations with preview before running the full job.
```python
# ❌ Incorrect: product_name references category before it's defined
columns = [
    {
        "name": "product_name",
        "prompt": "Generate a product name for {{category}}"
    },
    {
        "name": "category",
        "type": "category",
        "params": {"values": ["electronics", "clothing"]}
    }
]

# ✅ Correct: Define base columns first
columns = [
    {
        "name": "category",
        "type": "category",
        "params": {"values": ["electronics", "clothing"]}
    },
    {
        "name": "product_name",
        "prompt": "Generate a product name for {{category}}"
    }
]
```
Invalid Parameter Ranges#
Problem: Configuration validation fails due to invalid parameter ranges or types.
Symptoms:

- Error: `Invalid parameter range: min value greater than max value`
- Error: `Invalid distribution parameters`
Solutions:

- Validate Number Ranges: Ensure `min < max` for number column types.
- Check Distribution Parameters: Verify statistical distribution parameters are valid.
- Test Parameter Combinations: Use preview mode to validate parameter combinations.
```python
# ❌ Incorrect: Invalid range
{
    "name": "price",
    "type": "number",
    "params": {
        "min": 100,
        "max": 50  # max < min
    }
}

# ✅ Correct: Valid range
{
    "name": "price",
    "type": "number",
    "params": {
        "min": 10,
        "max": 100
    }
}
```
Model Suite Configuration Issues#
Problem: Jobs fail due to incorrect model suite configuration or unavailable models.
Symptoms:

- Error: `Model suite 'custom-model' not found`
- Error: `Failed to connect to model endpoint`
Solutions:

- Verify Model Suite: Check available model suites with the configuration list API.
- Environment Variables: Ensure `NIM_PROXY_URL` is set for custom models.
- Model Availability: Verify custom models are running and accessible.
```bash
# Check available model suites
curl "${DATA_DESIGNER_BASE_URL}/v1beta1/data-designer/configs" | jq '.data[].model_suite' | sort -u

# Verify custom model endpoint
curl "${NIM_PROXY_URL}/health"
```
Job Failure Debugging#
Job Status Monitoring#
Problem: Jobs fail without clear error messages or hang indefinitely.
Debugging Steps:

1. Check Job Status: Use the job status endpoint to get detailed error information.
2. Review Job Logs: Stream job logs to identify failure points.
3. Analyze Error Patterns: Look for common error patterns in the logs.
```python
# Monitor job with detailed error handling
import time

def monitor_job(job_id):
    while True:
        job = client.data_designer.jobs.retrieve(job_id)
        if job.status == "failed":
            print(f"Job failed: {job.error_message}")
            # Get detailed logs and surface ERROR-level entries
            logs = client.data_designer.jobs.logs(job_id, lines=100)
            for log in logs:
                if log.level == "ERROR":
                    print(f"ERROR: {log.message}")
            break
        elif job.status == "completed":
            print("Job completed successfully")
            break
        time.sleep(10)
```
Memory and Resource Errors#
Problem: Jobs fail due to insufficient memory or compute resources.
Symptoms:

- Error: `Out of memory error during generation`
- Error: `Job timeout exceeded`
- Jobs that start but never complete
Solutions:

- Reduce Batch Size: Generate smaller datasets or split large jobs.
- Optimize Configuration: Simplify complex column dependencies.
- Monitor Resource Usage: Check cluster resource availability.
```python
# Split large jobs into smaller batches
def generate_large_dataset(config, total_rows, batch_size=1000):
    results = []
    for i in range(0, total_rows, batch_size):
        batch_rows = min(batch_size, total_rows - i)
        job = client.data_designer.jobs.create(
            config=config,
            rows=batch_rows,
            name=f"batch-{i//batch_size}"
        )
        results.append(job.id)
    return results
```
Model Generation Errors#
Problem: Jobs fail during the data generation phase due to model issues.
Symptoms:

- Error: `Model generation failed for column 'xyz'`
- Error: `Invalid model response format`
- Generated data contains unexpected null values
Solutions:

- Validate Prompts: Ensure prompts are clear and well-formed.
- Check Model Response: Review model outputs for format consistency.
- Adjust Inference Parameters: Modify temperature, top_p, or other generation settings (see the sketch after the preview example below).
```python
# Test prompt quality with preview
def test_prompt_quality(config):
    preview = client.data_designer.preview(config=config)
    for row in preview.data:
        for column, value in row.items():
            if value is None or value == "":
                print(f"Warning: Empty value for column {column}")
            if len(str(value)) > 1000:
                print(f"Warning: Very long value for column {column}")
```
Performance Optimization#
Slow Generation Performance#
Problem: Data generation jobs are slower than expected.
Diagnostic Steps:

1. Profile Configuration: Identify which columns are slow to generate.
2. Optimize Prompts: Simplify complex prompts or reduce dependencies.
3. Adjust Batch Sizes: Experiment with different batch sizes.
Solutions:

- Parallel Generation: Use multiple smaller jobs instead of one large job.
- Optimize Column Order: Place fast-generating columns first.
- Cache Common Values: Use category columns for frequently repeated values (a config sketch follows the parallel-generation example below).
```python
# Optimize performance with parallel generation
import concurrent.futures
import time

def generate_batch(config, rows, batch_id):
    job = client.data_designer.jobs.create(
        config=config,
        rows=rows,
        name=f"parallel-batch-{batch_id}"
    )
    # Wait for completion
    while True:
        status = client.data_designer.jobs.retrieve(job.id)
        if status.status in ["completed", "failed"]:
            return job.id
        time.sleep(5)

# Generate multiple batches in parallel
def parallel_generation(config, total_rows, num_batches=4):
    batch_size = total_rows // num_batches
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_batches) as executor:
        futures = []
        for i in range(num_batches):
            future = executor.submit(generate_batch, config, batch_size, i)
            futures.append(future)
        job_ids = [future.result() for future in futures]
    return job_ids
```
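For the caching suggestion, values that repeat often can be drawn from a fixed category column instead of being regenerated by the model on every row. A minimal sketch reusing the category column type shown earlier in this guide:

```python
# Draw repeated values from a fixed pool via a category column rather
# than generating each one with the model (faster and cheaper)
columns = [
    {
        "name": "department",
        "type": "category",
        "params": {"values": ["electronics", "clothing", "home", "toys"]}
    },
    {
        "name": "tagline",
        "prompt": "Write a one-line tagline for the {{department}} department"
    }
]
```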
Memory Usage Optimization#
Problem: Jobs consume excessive memory during generation.
Solutions:

- Reduce Column Complexity: Simplify column dependencies and prompt length.
- Use Streaming: Process results in chunks rather than loading everything at once.
- Optimize Data Types: Use appropriate data types for each column.
```python
# Stream large results to avoid memory issues
def stream_large_results(job_id, chunk_size=1000):
    offset = 0
    while True:
        results = client.data_designer.jobs.results(
            job_id=job_id,
            limit=chunk_size,
            offset=offset
        )
        if not results.data:
            break
        # Process the chunk (process_chunk is your own handler)
        process_chunk(results.data)
        offset += chunk_size
```
Resource Management#
Cluster Resource Issues#
Problem: Jobs fail due to insufficient cluster resources.
Symptoms:

- Error: `Insufficient CPU/Memory resources`
- Jobs stuck in "created" status
- Long wait times for job execution
Solutions:

- Check Resource Availability: Monitor cluster resource usage.
- Optimize Resource Requests: Adjust job resource requirements.
- Schedule Jobs: Distribute jobs over time to avoid resource contention.
```bash
# Check cluster resource usage
kubectl top nodes
kubectl top pods -n nemo-microservices

# Check Data Designer service status
kubectl get pods -n nemo-microservices -l app=data-designer
kubectl describe pod <data-designer-pod-name> -n nemo-microservices
```
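For jobs stuck in "created" status, recent cluster events often reveal scheduling problems such as unsatisfiable resource requests. A quick check, assuming the same `nemo-microservices` namespace as the commands above:

```bash
# Show recent events, newest last; look for FailedScheduling or OOMKilled
kubectl get events -n nemo-microservices --sort-by=.lastTimestamp | tail -20
```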
Job Queue Management#
Problem: Too many concurrent jobs causing resource exhaustion.
Solutions:

- Implement Job Queuing: Manage job execution order.
- Limit Concurrent Jobs: Set a maximum number of parallel jobs.
- Monitor Queue Status: Track job queue length and processing times.
```python
# Simple job queue manager
class JobQueueManager:
    def __init__(self, max_concurrent=3):
        self.max_concurrent = max_concurrent
        self.active_jobs = []
        self.pending_jobs = []

    def submit_job(self, config, rows, name):
        job_request = {
            "config": config,
            "rows": rows,
            "name": name
        }
        if len(self.active_jobs) < self.max_concurrent:
            job = client.data_designer.jobs.create(**job_request)
            self.active_jobs.append(job.id)
            return job.id
        else:
            self.pending_jobs.append(job_request)
            return None

    def check_completed_jobs(self):
        completed = []
        for job_id in self.active_jobs:
            status = client.data_designer.jobs.retrieve(job_id)
            if status.status in ["completed", "failed"]:
                completed.append(job_id)
        # Remove completed jobs and start pending ones
        for job_id in completed:
            self.active_jobs.remove(job_id)
        while self.pending_jobs and len(self.active_jobs) < self.max_concurrent:
            job_request = self.pending_jobs.pop(0)
            job = client.data_designer.jobs.create(**job_request)
            self.active_jobs.append(job.id)
```
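A typical driver loop submits all jobs up front, then polls the manager until the queue drains. A minimal usage sketch, where `job_specs` is a hypothetical list of `(config, rows)` pairs you prepare yourself:

```python
import time

# Hypothetical driver loop: submit everything, then poll until drained
manager = JobQueueManager(max_concurrent=3)
for i, (config, rows) in enumerate(job_specs):  # job_specs is your own list
    manager.submit_job(config, rows, name=f"queued-job-{i}")

while manager.active_jobs or manager.pending_jobs:
    manager.check_completed_jobs()
    time.sleep(10)
```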
Authentication and Connection Issues#
API Authentication Problems#
Problem: API requests fail with authentication errors.
Symptoms:

- Error: `401 Unauthorized`
- Error: `Invalid API key`
- Error: `Authentication token expired`
Solutions:

- Verify API Key: Check that the API key is correctly set in environment variables.
- Check Key Permissions: Ensure the API key has the necessary permissions.
- Token Refresh: Implement token refresh logic for long-running applications.
```python
# Robust authentication with retry logic
import time
from functools import wraps

def retry_on_auth_error(max_retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if "401" in str(e) and attempt < max_retries - 1:
                        print(f"Authentication error, retrying in {delay} seconds...")
                        time.sleep(delay)
                        # Refresh client/token here if needed
                        continue
                    raise
        return wrapper
    return decorator

@retry_on_auth_error()
def create_job_with_retry(config, rows):
    return client.data_designer.jobs.create(config=config, rows=rows)
```
Network Connectivity Issues#
Problem: Intermittent network failures cause job creation or monitoring to fail.
Solutions:

- Implement Retry Logic: Add exponential backoff for network requests.
- Check Network Connectivity: Verify network access to the Data Designer service.
- Use Connection Pooling: Reuse HTTP connections for better reliability.
```python
# Robust network error handling
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_robust_session():
    session = requests.Session()
    # Configure retry strategy with exponential backoff
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```
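The session can then be used for direct REST calls to the service. A minimal sketch, assuming `DATA_DESIGNER_BASE_URL` is set as in the diagnostics below and reusing the configs endpoint shown earlier:

```python
import os

# Use the retrying session for direct REST calls to the service
session = create_robust_session()
base_url = os.environ["DATA_DESIGNER_BASE_URL"]
response = session.get(f"{base_url}/v1beta1/data-designer/configs", timeout=30)
response.raise_for_status()
```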
Debugging Tools and Techniques#
Enable Debug Logging#
Add verbose logging to diagnose issues:
```python
import logging
from functools import wraps

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Add request/response logging
def log_api_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug(f"API call: {func.__name__} with args={args}, kwargs={kwargs}")
        try:
            result = func(*args, **kwargs)
            logger.debug(f"API response: {result}")
            return result
        except Exception as e:
            logger.error(f"API error: {e}")
            raise
    return wrapper
```
Configuration Validation#
Validate configurations before job submission:
```python
import re

def validate_config(config):
    """Validate a configuration before job creation."""
    errors = []

    # Check required fields
    if "model_suite" not in config:
        errors.append("Missing required field: model_suite")
    if not config.get("columns"):
        errors.append("Missing required field: columns")

    # Check column dependencies: each {{variable}} must reference a
    # column defined earlier in the list
    defined_columns = set()
    for column in config.get("columns", []):
        if "prompt" in column:
            variables = re.findall(r"\{\{(\w+)\}\}", column["prompt"])
            for var in variables:
                if var not in defined_columns:
                    errors.append(
                        f"Column '{column['name']}' references undefined variable '{var}'"
                    )
        defined_columns.add(column["name"])

    if errors:
        raise ValueError("Configuration validation failed: " + "; ".join(errors))
    return True
```
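Call it before every submission so malformed configs fail fast on the client. A minimal usage sketch:

```python
# Fail fast on the client before spending job-queue time
validate_config(config)
job = client.data_designer.jobs.create(config=config, rows=100)
```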
Health Check Utilities#
Monitor service health and availability:
```python
import os

def health_check():
    """Check Data Designer service health."""
    try:
        # Try to list configurations (lightweight operation)
        configs = client.data_designer.configs.list(page_size=1)
        print("✅ Data Designer service is healthy")
        return True
    except Exception as e:
        print(f"❌ Data Designer service health check failed: {e}")
        return False

def diagnostic_report():
    """Generate a diagnostic report."""
    print("=== Data Designer Diagnostic Report ===")

    # Service health
    health_status = health_check()

    # Environment check
    base_url = os.getenv("DATA_DESIGNER_BASE_URL")
    print(f"Base URL: {base_url}")

    # Active jobs
    try:
        jobs = client.data_designer.jobs.list(page_size=10)
        print(f"Active jobs: {len(jobs.data)}")
        for job in jobs.data:
            print(f"  - {job.id}: {job.status}")
    except Exception as e:
        print(f"Failed to get job list: {e}")
```
Getting Help#
If you continue to experience issues:
1. Check Service Status: Verify that the Data Designer service is running and healthy.
2. Review Logs: Examine both client-side and server-side logs for error details.
3. Consult Documentation: Review the API reference and user guides.
4. Contact Support: Provide diagnostic information, error logs, and configuration details.
Support Information Template#
When contacting support, include:
- Data Designer Version: [version]
- Cluster Environment: [environment details]
- Error Message: [full error message]
- Configuration: [sanitized configuration]
- Job ID: [if applicable]
- Timestamp: [when the error occurred]
- Steps to Reproduce: [detailed steps]