Code Validation#

Data Designer can generate code and then validate it. This is useful when creating code examples, documentation, tutorials, or test data for programming applications. Code validation checks the generated code for syntax errors and quality issues, so you can identify and filter out samples that do not meet your standards.

Overview#

Data Designer can generate code in several programming languages and then validate it for quality and correctness. Typical uses include creating:

Code examples for documentation
Test data for programming tutorials
Synthetic implementation examples
Code training datasets

Supported Languages#

Data Designer supports validation for these languages:

Python (CodeLang.PYTHON)
SQL dialects:
- ANSI SQL (CodeLang.SQL_ANSI)
- MySQL (CodeLang.SQL_MYSQL)
- PostgreSQL (CodeLang.SQL_POSTGRES)
- SQLite (CodeLang.SQL_SQLITE)
- T-SQL (CodeLang.SQL_TSQL)
- BigQuery (CodeLang.SQL_BIGQUERY)
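When the target language arrives as a plain string (for example, from a config file), a small lookup can translate it to one of the member names listed above. The sketch below is a hypothetical helper, not part of Data Designer; only the member names come from this page, and the short keys are illustrative.

```python
# Hypothetical helper: the short keys are illustrative; the member names
# are the CodeLang members listed on this page.
CODE_LANG_MEMBER = {
    "python": "PYTHON",
    "ansi": "SQL_ANSI",
    "mysql": "SQL_MYSQL",
    "postgres": "SQL_POSTGRES",
    "sqlite": "SQL_SQLITE",
    "tsql": "SQL_TSQL",
    "bigquery": "SQL_BIGQUERY",
}


def code_lang_member(name: str) -> str:
    """Return the CodeLang member name for a short language key."""
    key = name.strip().lower()
    if key not in CODE_LANG_MEMBER:
        supported = ", ".join(sorted(CODE_LANG_MEMBER))
        raise ValueError(f"Unsupported language {name!r}; expected one of: {supported}")
    return CODE_LANG_MEMBER[key]


print(code_lang_member("Postgres"))
```

With the SDK imported as in the examples on this page, `getattr(P.CodeLang, code_lang_member("postgres"))` would then resolve to `P.CodeLang.SQL_POSTGRES`.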

Generating Code#

To generate code, add an LLMCodeColumn with output_format set to the appropriate CodeLang value:

import os
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import DataDesignerClient, DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P


# Initialize client
data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(base_url=os.environ['NEMO_MICROSERVICES_BASE_URL'])
)

# Define model configuration
model_config = P.ModelConfig(
    alias="code-generation-model",
    model=P.Model(
        api_endpoint=P.ApiEndpoint(
            model_id="meta/llama-3.3-70b-instruct",
            url="https://integrate.api.nvidia.com/v1",
            api_key="your-api-key",
        )
    ),
    inference_parameters=P.InferenceParameters(
        temperature=0.60,
        top_p=0.99,
        max_tokens=2048,
    ),
)

# Create builder with model configuration
config_builder = DataDesignerConfigBuilder(model_configs=[model_config])

# Add an instruction column
config_builder.add_column(
    C.SamplerColumn(
        name="instruction",
        type=P.SamplerType.CATEGORY, 
        params=P.CategorySamplerParams(
            values=[
                "Create a function to calculate factorial",
                "Write a function to sort a list",
                "Generate code to read a CSV file",
                "Create a class for a simple calculator"
            ]
        )
    )
)

# Generate Python code
config_builder.add_column(
    C.LLMCodeColumn(
        name="code_implementation",
        output_format=P.CodeLang.PYTHON,  # Specify code type
        model_alias="code-generation-model",
        system_prompt="You are an expert Python programmer who writes clean, efficient, and well-documented code.",
        prompt="""
        Write Python code for the following instruction:
        Instruction: {{ instruction }}
    
        Important Guidelines:
        * Code Quality: Your code should be clean, complete, self-contained and accurate.
        * Code Validity: Please ensure that your Python code is executable and does not contain any errors.
        * Packages: Remember to import any necessary libraries, and to use all libraries you import.
        """
    )
)
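The `{{ instruction }}` placeholder in the prompt refers to the `instruction` column added earlier, so each record gets its own prompt. The sketch below imitates that substitution with a regular expression purely for illustration; it is an assumption about the observable behavior, not Data Designer's actual template engine, and `render_prompt` is a hypothetical name.

```python
import re

# Matches {{ column_name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, record: dict) -> str:
    """Replace each {{ name }} placeholder with the record's value for that column."""

    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in record:
            raise KeyError(f"Prompt references unknown column: {column}")
        return str(record[column])

    return PLACEHOLDER.sub(substitute, template)


template = "Write Python code for the following instruction:\nInstruction: {{ instruction }}"
record = {"instruction": "Write a function to sort a list"}
print(render_prompt(template, record))
```

Raising on an unknown column makes a misspelled placeholder fail early instead of silently producing an incomplete prompt.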

Validating Generated Code#

After generating code, you can add a validation column to check for errors and quality issues:

# Add code validation
config_builder.add_column(
    C.CodeValidationColumn(
        name="code_validity_result",
        code_lang=P.CodeLang.PYTHON,  # CodeLang is accessed through the params module (P)
        target_column="code_implementation"  # Column containing the code
    )
)
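To get a feel for the most basic kind of check, the sketch below tests whether a string parses as Python using the standard library's `ast` module. It is a local stand-in for illustration only, not the validator Data Designer runs, and `python_syntax_ok` is a hypothetical name.

```python
import ast


def python_syntax_ok(code: str) -> tuple[bool, str]:
    """Return (True, "") if the code parses, else (False, error description)."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, ""


good = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n"
bad = "def factorial(n:\n    return n\n"  # missing closing parenthesis

print(python_syntax_ok(good))
print(python_syntax_ok(bad))
```

Parsing only proves the code is syntactically valid; it does not run the code or catch style and logic problems, which is where a linter-based validation column adds value.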

Validation Output#

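The validation process adds several output columns. In the lists below, code_validity_result is the validation column's name and code_implementation is the target column from the preceding example: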

For Python:

code_validity_result
code_implementation_python_linter_score
code_implementation_python_linter_severity
code_implementation_python_linter_messages

For SQL:

code_validity_result
code_implementation_validator_messages
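The names above appear to follow a pattern: the validation column keeps the name you gave it, and the detail columns are prefixed with the target column's name. The helper below encodes that inferred pattern so downstream code can check which columns to expect. The pattern is an assumption drawn only from the names listed here, and `expected_validation_columns` is hypothetical.

```python
def expected_validation_columns(name: str, target_column: str, is_python: bool) -> list[str]:
    """List the output columns expected for one validation column (inferred naming)."""
    if is_python:
        suffixes = ["python_linter_score", "python_linter_severity", "python_linter_messages"]
    else:
        # For SQL, this page lists a single messages column.
        suffixes = ["validator_messages"]
    return [name] + [f"{target_column}_{suffix}" for suffix in suffixes]


print(expected_validation_columns("code_validity_result", "code_implementation", is_python=True))
print(expected_validation_columns("sql_validity_result", "sql_query", is_python=False))
```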

Complete Python Example#

Here’s a complete example of generating and validating Python code:

import os
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import DataDesignerClient, DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

# Initialize client
data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(base_url=os.environ['NEMO_MICROSERVICES_BASE_URL'])
)

# Define model configuration
model_config = P.ModelConfig(
    alias="python-code-model",
    model=P.Model(
        api_endpoint=P.ApiEndpoint(
            model_id="meta/llama-3.3-70b-instruct",
            url="https://integrate.api.nvidia.com/v1",
            api_key="your-api-key",
        )
    ),
    inference_parameters=P.InferenceParameters(
        temperature=0.60,
        top_p=0.99,
        max_tokens=2048,
    ),
)

# Create a new Data Designer builder
config_builder = DataDesignerConfigBuilder(model_configs=[model_config])

# Add a category for code topics
config_builder.add_column(
    C.SamplerColumn(
        name="code_topic",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Data Processing", "Web Scraping", "API Integration", "Data Visualization"]
        )
    )
)

# Add a complexity level
config_builder.add_column(
    C.SamplerColumn(
        name="complexity_level",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Beginner", "Intermediate", "Advanced"]
        )
    )
)

# Generate an instruction
config_builder.add_column(
    C.LLMTextColumn(
        name="instruction",
        model_alias="python-code-model",
        system_prompt="You are an expert at creating clear programming tasks.",
        prompt="""
        Create a specific Python programming task related to {{ code_topic }} at a {{ complexity_level }} level.
        The task should be clear, specific, and actionable.
        """
    )
)

# Generate Python code implementation
config_builder.add_column(
    C.LLMCodeColumn(
        name="code_implementation",
        output_format=P.CodeLang.PYTHON,
        model_alias="python-code-model",
        system_prompt="You are an expert Python programmer who writes clean, efficient, and well-documented code.",
        prompt="""
        Write Python code for the following instruction:
        Instruction: {{ instruction }}
        Important Guidelines:
        * Code Quality: Your code should be clean, complete, self-contained and accurate.
        * Code Validity: Please ensure that your Python code is executable and does not contain any errors.
        * Packages: Remember to import any necessary libraries, and to use all libraries you import.
        * Complexity: The code should match a {{ complexity_level }} level of expertise.
        """
    )
)

# Add code validation
config_builder.add_column(
    C.CodeValidationColumn(
        name="code_validity_result",
        code_lang=P.CodeLang.PYTHON,
        target_column="code_implementation"
    )
)

# Build configuration and generate preview
preview_result = data_designer_client.preview(config_builder)
print("Generated sample:")
print(preview_result.dataset.head())

# Create full dataset
job_results = data_designer_client.create(config_builder, num_records=100, wait_until_done=True)
dataset = job_results.load_dataset()
print(dataset.head())
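After the dataset is loaded, a common next step is to keep only the records that passed validation. The sketch below works on plain dictionaries so it runs without the service. It assumes `code_validity_result` holds a boolean and `code_implementation_python_linter_score` holds a number where higher is better; neither is confirmed by this page, and the sample rows and threshold are made up.

```python
VALID_COL = "code_validity_result"
SCORE_COL = "code_implementation_python_linter_score"


def filter_valid(records: list[dict], min_score: float = 0.0) -> list[dict]:
    """Keep records flagged valid whose linter score meets the threshold."""
    kept = []
    for row in records:
        # Assumption: validity is a boolean and the score is numeric.
        if row.get(VALID_COL) is True and row.get(SCORE_COL, 0.0) >= min_score:
            kept.append(row)
    return kept


# Made-up rows for illustration only.
sample = [
    {"code_implementation": "print('ok')", VALID_COL: True, SCORE_COL: 9.5},
    {"code_implementation": "print('meh')", VALID_COL: True, SCORE_COL: 4.0},
    {"code_implementation": "print(", VALID_COL: False, SCORE_COL: 0.0},
]

print(len(filter_valid(sample, min_score=7.0)))
```

If `dataset` is a pandas DataFrame (an assumption), `dataset.to_dict("records")` produces rows in this shape.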

LLM-Based Code Evaluation#

In addition to static validation, you can add an LLM-based judge to evaluate code quality more holistically:

from nemo_microservices.beta.data_designer.config.params.rubrics import TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE, PYTHON_RUBRICS

# Add an LLM judge to evaluate code quality
config_builder.add_column(
    C.LLMJudgeColumn(
        name="code_judge_result",
        model_alias="python-code-model",
        prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
        rubrics=PYTHON_RUBRICS
    )
)

The judge evaluates the code against the predefined rubrics in PYTHON_RUBRICS, such as correctness, efficiency, readability, and documentation.

SQL Code Generation and Validation#

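The following snippet generates and validates a PostgreSQL query. It assumes the builder already has a model configuration with the alias sql-model and a column named sql_requirement, neither of which is defined on this page: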

# Generate SQL query
config_builder.add_column(
    C.LLMCodeColumn(
        name="sql_query",
        output_format=P.CodeLang.SQL_POSTGRES,
        model_alias="sql-model",
        system_prompt="You are an expert SQL developer who writes efficient and readable queries.",
        prompt="""
        Write a PostgreSQL query for the following requirement:
        Requirement: {{ sql_requirement }}
        
        Guidelines:
        * Use proper SQL syntax and formatting
        * Include appropriate comments
        * Ensure the query is optimized for performance
        """
    )
)

# Add SQL validation
config_builder.add_column(
    C.CodeValidationColumn(
        name="sql_validity_result",
        code_lang=P.CodeLang.SQL_POSTGRES,
        target_column="sql_query"
    )
)

The validation column checks the syntax of each generated query against the chosen dialect and reports any issues it finds.
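For a quick local sanity check of SQLite-dialect queries, the standard library's `sqlite3` module can compile a statement without running it by prefixing it with `EXPLAIN`. This is a different technique from Data Designer's SQL validator and covers only SQLite; the schema and the name `sqlite_query_ok` are hypothetical.

```python
import sqlite3


def sqlite_query_ok(query: str, schema_sql: str) -> tuple[bool, str]:
    """Compile a single SQLite statement against a schema without executing it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)    # create the tables the query refers to
        conn.execute(f"EXPLAIN {query}")  # compiles the statement; raises on errors
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    finally:
        conn.close()
    return True, ""


# Hypothetical schema for illustration.
schema = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"

print(sqlite_query_ok("SELECT name FROM users WHERE id = 1", schema))
print(sqlite_query_ok("SELEC name FROM users", schema))
```

Because the statement is compiled against a real schema, this check also catches references to missing tables or columns, which a purely syntactic check would not.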