Troubleshoot Data Designer#

Learn how to troubleshoot common issues with the NeMo Data Designer microservice.

Common Configuration Errors#

Invalid Column Dependencies#

Problem: Jobs fail with template variable errors when column dependencies are not properly ordered.

Symptoms:

  • Error: Template variable '{{column_name}}' not found

  • Jobs fail during configuration validation

Solutions:

  1. Order Dependencies: Ensure referenced columns are defined before dependent columns

  2. Validate Template Variables: Check that all {{variable}} references match existing column names

  3. Use Preview Mode: Test complex configurations with preview before full job execution (see the preview sketch after the corrected example below)

# ❌ Incorrect: product_name references category before it's defined
columns = [
    {
        "name": "product_name",
        "prompt": "Generate a product name for {{category}}"
    },
    {
        "name": "category",
        "type": "category",
        "params": {"values": ["electronics", "clothing"]}
    }
]

# ✅ Correct: Define base columns first
columns = [
    {
        "name": "category",
        "type": "category", 
        "params": {"values": ["electronics", "clothing"]}
    },
    {
        "name": "product_name",
        "prompt": "Generate a product name for {{category}}"
    }
]
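
Preview mode (step 3 in the solutions above) runs the corrected configuration on a small sample so template errors surface before a full job. A minimal sketch, assuming `client` is already configured and using the `client.data_designer.preview` call shown later in this guide; the "default" model suite name is a placeholder:

# Preview the corrected configuration on a small sample before a full job
config = {
    "model_suite": "default",  # assumption: substitute a model suite available in your deployment
    "columns": columns         # the corrected column list defined above
}

preview = client.data_designer.preview(config=config)
for row in preview.data:
    print(row)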

Invalid Parameter Ranges#

Problem: Configuration validation fails due to invalid parameter ranges or types.

Symptoms:

  • Error: Invalid parameter range: min value greater than max value

  • Error: Invalid distribution parameters

Solutions:

  1. Validate Number Ranges: Ensure min < max for number column types (a client-side check is sketched after the examples below)

  2. Check Distribution Parameters: Verify statistical distribution parameters are valid

  3. Test Parameter Combinations: Use preview mode to validate parameter combinations

# ❌ Incorrect: Invalid range
{
    "name": "price",
    "type": "number",
    "params": {
        "min": 100,
        "max": 50  # max < min
    }
}

# ✅ Correct: Valid range
{
    "name": "price", 
    "type": "number",
    "params": {
        "min": 10,
        "max": 100
    }
}
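
Invalid ranges can also be caught client-side before submission. A minimal sketch over the column dictionary layout shown above, assuming `config["columns"]` holds your column list:

# Pre-submission check: every number column must have min < max
def check_number_ranges(columns):
    problems = []
    for column in columns:
        if column.get("type") == "number":
            params = column.get("params", {})
            if "min" in params and "max" in params and params["min"] >= params["max"]:
                problems.append(f"Column '{column['name']}': min {params['min']} >= max {params['max']}")
    return problems

# Report any invalid ranges before creating the job
for problem in check_number_ranges(config["columns"]):
    print(f"Warning: {problem}")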

Model Suite Configuration Issues#

Problem: Jobs fail due to incorrect model suite configuration or unavailable models.

Symptoms:

  • Error: Model suite 'custom-model' not found

  • Error: Failed to connect to model endpoint

Solutions:

  1. Verify Model Suite: Check available model suites with configuration list API

  2. Environment Variables: Ensure NIM_PROXY_URL is set for custom models

  3. Model Availability: Verify custom models are running and accessible

# Check available model suites
curl "${DATA_DESIGNER_BASE_URL}/v1beta1/data-designer/configs" | jq '.data[].model_suite' | sort -u

# Verify custom model endpoint
curl "${NIM_PROXY_URL}/health"

Job Failure Debugging#

Job Status Monitoring#

Problem: Jobs fail without clear error messages or hang indefinitely.

Debugging Steps:

  1. Check Job Status: Use the job status endpoint to get detailed error information

  2. Review Job Logs: Stream job logs to identify failure points

  3. Analyze Error Patterns: Look for common error patterns in logs (a log summary is sketched after the monitoring example below)

# Monitor job with detailed error handling
import time

def monitor_job(job_id):
    while True:
        job = client.data_designer.jobs.retrieve(job_id)
        
        if job.status == "failed":
            print(f"Job failed: {job.error_message}")
            # Get detailed logs
            logs = client.data_designer.jobs.logs(job_id, lines=100)
            for log in logs:
                if log.level == "ERROR":
                    print(f"ERROR: {log.message}")
            break
        elif job.status == "completed":
            print("Job completed successfully")
            break
        
        time.sleep(10)
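
To surface recurring failure causes (step 3 above), the same log stream can be summarized by message; a minimal sketch reusing the `jobs.logs` call from the monitor above:

from collections import Counter

# Summarize recurring ERROR messages for a failed job
def summarize_errors(job_id, lines=500):
    logs = client.data_designer.jobs.logs(job_id, lines=lines)
    counts = Counter(log.message for log in logs if log.level == "ERROR")
    for message, count in counts.most_common(10):
        print(f"{count}x {message}")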

Memory and Resource Errors#

Problem: Jobs fail due to insufficient memory or compute resources.

Symptoms:

  • Error: Out of memory error during generation

  • Error: Job timeout exceeded

  • Jobs that start but never complete

Solutions:

  1. Reduce Batch Size: Generate smaller datasets or split large jobs

  2. Optimize Configuration: Simplify complex column dependencies

  3. Monitor Resource Usage: Check cluster resource availability

# Split large jobs into smaller batches; returns the job IDs for later collection
def generate_large_dataset(config, total_rows, batch_size=1000):
    job_ids = []
    for i in range(0, total_rows, batch_size):
        batch_rows = min(batch_size, total_rows - i)
        job = client.data_designer.jobs.create(
            config=config,
            rows=batch_rows,
            name=f"batch-{i//batch_size}"
        )
        job_ids.append(job.id)
    return job_ids
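
Once the batches are submitted, the returned job IDs can be polled and their rows collected; a minimal sketch, assuming the `jobs.retrieve` and `jobs.results` calls used elsewhere in this guide:

import time

# Wait for each batch job to finish, then collect its rows
def collect_batches(job_ids, poll_interval=10):
    rows = []
    for job_id in job_ids:
        while True:
            job = client.data_designer.jobs.retrieve(job_id)
            if job.status in ["completed", "failed"]:
                break
            time.sleep(poll_interval)
        if job.status == "completed":
            # For batches larger than the limit, page through with offset as shown later in this guide
            results = client.data_designer.jobs.results(job_id=job_id, limit=1000)
            rows.extend(results.data)
    return rows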

Model Generation Errors#

Problem: Jobs fail during the data generation phase due to model issues.

Symptoms:

  • Error: Model generation failed for column 'xyz'

  • Error: Invalid model response format

  • Generated data contains unexpected null values

Solutions:

  1. Validate Prompts: Ensure prompts are clear and well-formed

  2. Check Model Response: Review model outputs for format consistency

  3. Adjust Inference Parameters: Modify temperature, top_p, or other generation settings

# Test prompt quality with preview
def test_prompt_quality(config):
    preview = client.data_designer.preview(config=config)
    
    for row in preview.data:
        for column, value in row.items():
            if value is None or value == "":
                print(f"Warning: Empty value for column {column}")
            if len(str(value)) > 1000:
                print(f"Warning: Very long value for column {column}")

Performance Optimization#

Slow Generation Performance#

Problem: Data generation jobs are slower than expected.

Diagnostic Steps:

  1. Profile Configuration: Identify which columns are slow to generate

  2. Optimize Prompts: Simplify complex prompts or reduce dependencies

  3. Adjust Batch Sizes: Experiment with different batch sizes

Solutions:

  1. Parallel Generation: Use multiple smaller jobs instead of one large job

  2. Optimize Column Order: Place fast-generating columns first

  3. Cache Common Values: Use category columns for frequently repeated values

# Optimize performance with parallel generation
import concurrent.futures
import time

def generate_batch(config, rows, batch_id):
    job = client.data_designer.jobs.create(
        config=config,
        rows=rows,
        name=f"parallel-batch-{batch_id}"
    )
    
    # Wait for completion
    while True:
        status = client.data_designer.jobs.retrieve(job.id)
        if status.status in ["completed", "failed"]:
            return job.id
        time.sleep(5)

# Generate multiple batches in parallel
def parallel_generation(config, total_rows, num_batches=4):
    batch_size = total_rows // num_batches
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_batches) as executor:
        futures = []
        for i in range(num_batches):
            # Give the last batch any remainder so no rows are dropped
            rows = batch_size + (total_rows % num_batches if i == num_batches - 1 else 0)
            future = executor.submit(generate_batch, config, rows, i)
            futures.append(future)
        
        job_ids = [future.result() for future in futures]
    
    return job_ids

Memory Usage Optimization#

Problem: Jobs consume excessive memory during generation.

Solutions:

  1. Reduce Column Complexity: Simplify column dependencies and prompt length

  2. Use Streaming: Process results in chunks rather than loading all at once

  3. Optimize Data Types: Use appropriate data types for each column

# Stream large results to avoid memory issues
def stream_large_results(job_id, chunk_size=1000):
    offset = 0
    while True:
        results = client.data_designer.jobs.results(
            job_id=job_id,
            limit=chunk_size,
            offset=offset
        )
        
        if not results.data:
            break
            
        # Process chunk
        process_chunk(results.data)
        offset += chunk_size
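
`process_chunk` above stands in for your own handler; a minimal sketch that appends each chunk to a JSON Lines file, assuming each row is a JSON-serializable dict:

import json

# Example chunk handler: append rows to a JSON Lines file instead of holding them in memory
def process_chunk(rows, output_path="results.jsonl"):
    with open(output_path, "a") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")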

Resource Management#

Cluster Resource Issues#

Problem: Jobs fail due to insufficient cluster resources.

Symptoms:

  • Error: Insufficient CPU/Memory resources

  • Jobs stuck in "created" status

  • Long wait times for job execution

Solutions:

  1. Check Resource Availability: Monitor cluster resource usage

  2. Optimize Resource Requests: Adjust job resource requirements

  3. Schedule Jobs: Distribute jobs across time to avoid resource contention

# Check cluster resource usage
kubectl top nodes
kubectl top pods -n nemo-microservices

# Check Data Designer service status
kubectl get pods -n nemo-microservices -l app=data-designer
kubectl describe pod <data-designer-pod-name> -n nemo-microservices

Job Queue Management#

Problem: Too many concurrent jobs cause resource exhaustion.

Solutions:

  1. Implement Job Queuing: Manage job execution order

  2. Limit Concurrent Jobs: Set maximum number of parallel jobs

  3. Monitor Queue Status: Track job queue length and processing times

# Simple job queue manager
class JobQueueManager:
    def __init__(self, max_concurrent=3):
        self.max_concurrent = max_concurrent
        self.active_jobs = []
        self.pending_jobs = []
    
    def submit_job(self, config, rows, name):
        job_request = {
            "config": config,
            "rows": rows, 
            "name": name
        }
        
        if len(self.active_jobs) < self.max_concurrent:
            job = client.data_designer.jobs.create(**job_request)
            self.active_jobs.append(job.id)
            return job.id
        else:
            self.pending_jobs.append(job_request)
            return None
    
    def check_completed_jobs(self):
        completed = []
        for job_id in self.active_jobs:
            status = client.data_designer.jobs.retrieve(job_id)
            if status.status in ["completed", "failed"]:
                completed.append(job_id)
        
        # Remove completed jobs and start pending ones
        for job_id in completed:
            self.active_jobs.remove(job_id)
            
        while self.pending_jobs and len(self.active_jobs) < self.max_concurrent:
            job_request = self.pending_jobs.pop(0)
            job = client.data_designer.jobs.create(**job_request)
            self.active_jobs.append(job.id)
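
A usage sketch for the queue manager above, assuming `config` is an existing configuration dict; jobs beyond the concurrency limit wait in the pending queue until earlier ones finish:

import time

# Submit ten jobs but run at most three at a time
queue = JobQueueManager(max_concurrent=3)
for i in range(10):
    queue.submit_job(config=config, rows=1000, name=f"queued-job-{i}")

# Periodically promote pending jobs as active ones finish
while queue.active_jobs or queue.pending_jobs:
    queue.check_completed_jobs()
    time.sleep(30)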

Authentication and Connection Issues#

API Authentication Problems#

Problem: API requests fail with authentication errors.

Symptoms:

  • Error: 401 Unauthorized

  • Error: Invalid API key

  • Error: Authentication token expired

Solutions:

  1. Verify API Key: Check that API key is correctly set in environment variables

  2. Check Key Permissions: Ensure API key has necessary permissions

  3. Token Refresh: Implement token refresh logic for long-running applications

# Robust authentication with retry logic
import time
from functools import wraps

def retry_on_auth_error(max_retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if "401" in str(e) and attempt < max_retries - 1:
                        print(f"Authentication error, retrying in {delay} seconds...")
                        time.sleep(delay)
                        # Refresh client/token here if needed
                        continue
                    raise
            return None
        return wrapper
    return decorator

@retry_on_auth_error()
def create_job_with_retry(config, rows):
    return client.data_designer.jobs.create(config=config, rows=rows)

Network Connectivity Issues#

Problem: Intermittent network failures cause job creation or monitoring to fail.

Solutions:

  1. Implement Retry Logic: Add exponential backoff for network requests

  2. Check Network Connectivity: Verify network access to Data Designer service

  3. Use Connection Pooling: Reuse HTTP connections for better reliability

# Robust network error handling
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_robust_session():
    session = requests.Session()
    
    # Configure retry strategy
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    
    return session
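
A usage sketch for the session above, calling the configuration list endpoint shown earlier in this guide:

import os

# Reuse one retrying session for direct REST calls to the service
session = create_robust_session()
base_url = os.getenv("DATA_DESIGNER_BASE_URL")

response = session.get(f"{base_url}/v1beta1/data-designer/configs", timeout=30)
response.raise_for_status()
print(f"Retrieved {len(response.json().get('data', []))} configurations")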

Debugging Tools and Techniques#

Enable Debug Logging#

Add verbose logging to diagnose issues:

import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Add request/response logging
from functools import wraps

def log_api_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug(f"API call: {func.__name__} with args={args}, kwargs={kwargs}")
        try:
            result = func(*args, **kwargs)
            logger.debug(f"API response: {result}")
            return result
        except Exception as e:
            logger.error(f"API error: {e}")
            raise
    return wrapper

Configuration Validation#

Validate configurations before job submission:

import re

def validate_config(config):
    """Validate configuration before job creation"""
    errors = []
    
    # Check required fields
    if "model_suite" not in config:
        errors.append("Missing required field: model_suite")
    
    if "columns" not in config or not config["columns"]:
        errors.append("Missing required field: columns")
    
    # Check column dependencies
    defined_columns = set()
    for column in config["columns"]:
        if "prompt" in column:
            # Extract template variables
            import re
            variables = re.findall(r'\{\{(\w+)\}\}', column["prompt"])
            for var in variables:
                if var not in defined_columns:
                    errors.append(f"Column '{column['name']}' references undefined variable '{var}'")
        
        defined_columns.add(column["name"])
    
    if errors:
        raise ValueError("Configuration validation failed: " + "; ".join(errors))
    
    return True
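
A usage sketch: run the validation before job creation so dependency errors fail fast, assuming `config` is the configuration dict you intend to submit:

# Validate locally before submitting to catch dependency errors early
try:
    validate_config(config)
    job = client.data_designer.jobs.create(config=config, rows=100)
except ValueError as e:
    print(f"Fix the configuration before submitting: {e}")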

Health Check Utilities#

Monitor service health and availability:

def health_check():
    """Check Data Designer service health"""
    try:
        # Try to list configurations (lightweight operation)
        configs = client.data_designer.configs.list(page_size=1)
        print("✅ Data Designer service is healthy")
        return True
    except Exception as e:
        print(f"❌ Data Designer service health check failed: {e}")
        return False

def diagnostic_report():
    """Generate diagnostic report"""
    print("=== Data Designer Diagnostic Report ===")
    
    # Service health
    health_check()
    
    # Environment check
    import os
    base_url = os.getenv('DATA_DESIGNER_BASE_URL')
    print(f"Base URL: {base_url}")
    
    # Recent jobs and their statuses
    try:
        jobs = client.data_designer.jobs.list(page_size=10)
        print(f"Recent jobs: {len(jobs.data)}")
        for job in jobs.data:
            print(f"  - {job.id}: {job.status}")
    except Exception as e:
        print(f"Failed to get job list: {e}")

Getting Help#

If you continue to experience issues:

  1. Check Service Status: Verify that the Data Designer service is running and healthy

  2. Review Logs: Examine both client-side and server-side logs for error details

  3. Consult Documentation: Review the API reference and user guides

  4. Contact Support: Provide diagnostic information, error logs, and configuration details

Support Information Template#

When contacting support, include:

Data Designer Version: [version]
Cluster Environment: [environment details]
Error Message: [full error message]
Configuration: [sanitized configuration]
Job ID: [if applicable]
Timestamp: [when error occurred]
Steps to Reproduce: [detailed steps]