Code Validation#

Data Designer can generate code in many languages and validate it for syntactic correctness, adherence to best practices, and overall quality. This is especially valuable when creating code examples, documentation, tutorials, or test data for programming applications.

Overview#

Data Designer can generate code in various programming languages and then validate it to ensure quality and correctness. This is particularly useful for creating:

  • Code examples for documentation

  • Test data for programming tutorials

  • Synthetic implementation examples

  • Code training datasets

Supported Languages#

Data Designer supports validation for these languages:

  • Python (CodeLang.PYTHON)

  • SQL dialects:

    • ANSI SQL (CodeLang.SQL_ANSI)

    • MySQL (CodeLang.SQL_MYSQL)

    • PostgreSQL (CodeLang.SQL_POSTGRES)

    • SQLite (CodeLang.SQL_SQLITE)

    • T-SQL (CodeLang.SQL_TSQL)

    • BigQuery (CodeLang.SQL_BIGQUERY)

Generating Code#

To generate code, add an LLMCodeColumnConfig column with output_format set to the appropriate CodeLang value:

import os
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    CodeLang,
    CodeValidatorParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMCodeColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
    ValidationColumnConfig,
    ValidatorType,
)

# Initialize client
data_designer_client = NeMoDataDesignerClient(
    base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)


# Define model configuration
model_config = ModelConfig(
    alias="code-generation-model",
    model="meta/llama-3.3-70b-instruct",
    inference_parameters=InferenceParameters(
        temperature=0.60,
        top_p=0.99,
        max_tokens=2048,
    ),
)

# Create builder with model configuration
config_builder = DataDesignerConfigBuilder(model_configs=[model_config])

# Add an instruction column
config_builder.add_column(
    SamplerColumnConfig(
        name="instruction",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Create a function to calculate factorial",
                "Write a function to sort a list",
                "Generate code to read a CSV file",
                "Create a class for a simple calculator"
            ]
        )
    )
)

# Generate Python code
config_builder.add_column(
    LLMCodeColumnConfig(
        name="code_implementation",
        output_format=CodeLang.PYTHON,  # Specify code type
        model_alias="code-generation-model",
        system_prompt="You are an expert Python programmer who writes clean, efficient, and well-documented code.",
        prompt="""
        Write Python code for the following instruction:
        Instruction: {{ instruction }}

        Important Guidelines:
        * Code Quality: Your code should be clean, complete, self-contained and accurate.
        * Code Validity: Please ensure that your Python code is executable and does not contain any errors.
        * Packages: Remember to import any necessary libraries, and to use all libraries you import.
        """
    )
)

Validating Generated Code#

After generating code, you can add a validation column to check for errors and quality issues:

# Add code validation
config_builder.add_column(
    ValidationColumnConfig(
        name="code_validity_result",
        validator_type=ValidatorType.CODE,
        target_columns=["code_implementation"],  # Column containing the code
        validator_params=CodeValidatorParams(
            code_lang=CodeLang.PYTHON,
        ),
        batch_size=100
    )
)
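Static validation of Python code typically starts with a syntax check. As a rough, self-contained analogue of what a code validator does (this is an illustration, not Data Designer's actual implementation), you can parse a snippet with Python's built-in ast module:

```python
import ast

def check_python_syntax(source: str) -> tuple[bool, str]:
    """Return (is_valid, message) for a Python source string."""
    try:
        ast.parse(source)
        return True, "ok"
    except SyntaxError as exc:
        # Report where parsing failed, similar to a validator message.
        return False, f"line {exc.lineno}: {exc.msg}"

valid, _ = check_python_syntax("def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n")
broken, err = check_python_syntax("def broken(:\n    pass\n")
print(valid, broken)  # True False
```

A real validator layers linting (style, unused imports, and so on) on top of this kind of parse check, which is why the output includes linter scores and messages in addition to a validity flag.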

Validation Output#

The validation process creates several output columns, named after the validated target column (here, code_implementation):

For Python:

  • code_validity_result

  • code_implementation_python_linter_score

  • code_implementation_python_linter_severity

  • code_implementation_python_linter_messages

For SQL:

  • code_validity_result

  • code_implementation_validator_messages
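Downstream, these columns can be used to filter or triage generated records. A minimal sketch with pandas, using a mock dataset in place of a real preview result (the "valid"/"invalid" and severity values here are illustrative assumptions, not Data Designer's guaranteed schema):

```python
import pandas as pd

# Mock frame standing in for preview_result.dataset; column names
# follow the Python validation output described above.
df = pd.DataFrame({
    "code_implementation": ["def f():\n    return 1", "def g(:\n    pass"],
    "code_validity_result": ["valid", "invalid"],
    "code_implementation_python_linter_severity": ["info", "error"],
})

# Keep only records that passed validation with no serious lint findings.
clean = df[
    (df["code_validity_result"] == "valid")
    & (df["code_implementation_python_linter_severity"] != "error")
]
print(len(clean))  # 1
```

The same pattern applies to full datasets loaded from a completed job.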

Complete Python Example#

Here’s a complete example of generating and validating Python code:

import os
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    CodeLang,
    CodeValidatorParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMCodeColumnConfig,
    LLMJudgeColumnConfig,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
    Score,
    ValidationColumnConfig,
    ValidatorType,
)

# Initialize client
data_designer_client = NeMoDataDesignerClient(
    base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)

# Define model configuration
model_config = ModelConfig(
    alias="python-code-model",
    model="meta/llama-3.3-70b-instruct",
    inference_parameters=InferenceParameters(
        temperature=0.60,
        top_p=0.99,
        max_tokens=2048,
    ),
)

# Create a new Data Designer builder
config_builder = DataDesignerConfigBuilder(model_configs=[model_config])

# Add a category for code topics
config_builder.add_column(
    SamplerColumnConfig(
        name="code_topic",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Data Processing", "Web Scraping", "API Integration", "Data Visualization"]
        )
    )
)

# Add a complexity level
config_builder.add_column(
    SamplerColumnConfig(
        name="complexity_level",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Beginner", "Intermediate", "Advanced"]
        )
    )
)

# Generate an instruction
config_builder.add_column(
    LLMTextColumnConfig(
        name="instruction",
        model_alias="python-code-model",
        system_prompt="You are an expert at creating clear programming tasks.",
        prompt="""
        Create a specific Python programming task related to {{ code_topic }} at a {{ complexity_level }} level.
        The task should be clear, specific, and actionable.
        """
    )
)

# Generate Python code implementation
config_builder.add_column(
    LLMCodeColumnConfig(
        name="code_implementation",
        output_format=CodeLang.PYTHON,
        model_alias="python-code-model",
        system_prompt="You are an expert Python programmer who writes clean, efficient, and well-documented code.",
        prompt="""
        Write Python code for the following instruction:
        Instruction: {{ instruction }}
        Important Guidelines:
        * Code Quality: Your code should be clean, complete, self-contained and accurate.
        * Code Validity: Please ensure that your Python code is executable and does not contain any errors.
        * Packages: Remember to import any necessary libraries, and to use all libraries you import.
        * Complexity: The code should match a {{ complexity_level }} level of expertise.
        """
    )
)

# Add code validation
config_builder.add_column(
    ValidationColumnConfig(
        name="code_validity_result",
        validator_type=ValidatorType.CODE,
        target_columns=["code_implementation"],
        validator_params=CodeValidatorParams(
            code_lang=CodeLang.PYTHON,
        ),
        batch_size=100
    )
)

# Build configuration and generate preview
preview_result = data_designer_client.preview(config_builder)
print("Generated sample:")
print(preview_result.dataset.head())

# Create full dataset
job_result = data_designer_client.create(config_builder, num_records=100, wait_until_done=True)
dataset = job_result.load_dataset()
print(dataset.head())

LLM-Based Code Evaluation#

In addition to static validation, you can add an LLM-based judge to evaluate code quality more holistically:

text_to_python_judge_template = """\
You are an expert in Python programming, with specialized knowledge in software engineering, data science, and algorithmic problem-solving. \
You think about potential flaws and errors in the code. You are a tough critic, but a fair one.

Take a deep breath and use the Python Code Quality Rubric below to score the **Generated Python Code** based on the INSTRUCTIONS.

#### INSTRUCTIONS
The Generated Python Code should be a valid response to the Natural Language Prompt below.

Natural Language Prompt:
{{ instruction }}

Generated Python Code:
{{ code_implementation }}
"""

python_scoring = [
    Score(
        name="Relevance",
        description="Adherence to INSTRUCTIONS and CONTEXT",
        options={
            "4": "Perfectly meets all specified requirements.",
            "3": "Meets most requirements with minor deviations.",
            "2": "Moderate deviation from the instructions.",
            "1": "Significant deviations from the instructions.",
            "0": "Does not adhere to the instructions.",
        },
    ),
    Score(
        name="Pythonic",
        description="Pythonic Code and Best Practices (Does the code follow Python conventions and best practices?)",
        options={
            "4": "The code exemplifies Pythonic principles, making excellent use of Python-specific constructs, standard library modules and programming idioms; follows all relevant PEPs.",
            "3": "The code closely follows Python conventions and adheres to many best practices; good use of Python-specific constructs, standard library modules and programming idioms.",
            "2": "The code generally follows Python conventions but has room for better alignment with Pythonic practices.",
            "1": "The code loosely follows Python conventions, with several deviations from best practices.",
            "0": "The code does not follow Python conventions or best practices, using non-Pythonic approaches.",
        },
    ),
    Score(
        name="Readability",
        description="Readability and Maintainability (Is the Python code easy to understand and maintain?)",
        options={
            "4": "The code is excellently formatted, follows PEP 8 guidelines, is elegantly concise and clear, uses meaningful variable names, ensuring high readability and ease of maintenance; organizes complex logic well. Docstrings are given in a Google Docstring format.",
            "3": "The code is well-formatted in the sense of code-as-documentation, making it relatively easy to understand and maintain; uses descriptive names and organizes logic clearly.",
            "2": "The code is somewhat readable with basic formatting and some comments, but improvements are needed; needs better use of descriptive names and organization.",
            "1": "The code has minimal formatting, making it hard to understand; lacks meaningful names and organization.",
            "0": "The code is unreadable, with no attempt at formatting or description.",
        },
    ),
    Score(
        name="Efficiency",
        description="Efficiency and Performance (Is the code optimized for performance?)",
        options={
            "4": "The solution is highly efficient, using appropriate data structures and algorithms; avoids unnecessary computations and optimizes for both time and space complexity.",
            "3": "The solution is efficient, with good use of Python's built-in functions and libraries; minor areas for optimization.",
            "2": "The solution is moderately efficient, but misses some opportunities for optimization; uses some inefficient patterns.",
            "1": "The solution shows poor efficiency, with notable performance issues; lacks effective optimization techniques.",
            "0": "The solution is highly inefficient; overlooks fundamental optimization practices, resulting in significant performance issues.",
        },
    ),
]

# Add an LLM judge to evaluate code quality
config_builder.add_column(
    LLMJudgeColumnConfig(
        name="code_judge_result",
        model_alias="python-code-model",
        prompt=text_to_python_judge_template,
        scores=python_scoring
    )
)

The judge will evaluate the code against the rubrics defined above: relevance, Pythonic style, readability, and efficiency.
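Judge output formats vary; as an illustrative sketch (the result structure below is a hypothetical assumption, not Data Designer's actual schema), per-rubric scores can be aggregated into a single quality number for ranking or filtering records:

```python
# Hypothetical judge result for one record: rubric name -> score (0-4),
# matching the rubric names defined in python_scoring above.
judge_result = {"Relevance": 4, "Pythonic": 3, "Readability": 4, "Efficiency": 2}

def overall_score(scores: dict[str, int], max_score: int = 4) -> float:
    """Average the rubric scores, normalized to the 0-1 range."""
    return sum(scores.values()) / (len(scores) * max_score)

print(overall_score(judge_result))  # 0.8125
```

A threshold on this aggregate (for example, keeping records scoring above 0.75) is one simple way to combine LLM judging with the static validation shown earlier.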

SQL Code Generation and Validation#

Here’s an example for SQL code generation and validation:

# Generate SQL query
config_builder.add_column(
    LLMCodeColumnConfig(
        name="sql_query",
        output_format=CodeLang.SQL_POSTGRES,
        model_alias="sql-model",
        system_prompt="You are an expert SQL developer who writes efficient and readable queries.",
        prompt="""
        Write a PostgreSQL query for the following requirement:
        Requirement: {{ sql_requirement }}

        Guidelines:
        * Use proper SQL syntax and formatting
        * Include appropriate comments
        * Ensure the query is optimized for performance
        """
    )
)

# Add SQL validation
config_builder.add_column(
    ValidationColumnConfig(
        name="sql_validity_result",
        validator_type=ValidatorType.CODE,
        target_columns=["sql_query"],
        validator_params=CodeValidatorParams(
            code_lang=CodeLang.SQL_POSTGRES,
        ),
        batch_size=100
    )
)

This will validate the SQL syntax and provide feedback on potential issues with the generated queries.
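Data Designer performs dialect-aware SQL validation server-side. As a local analogue for the SQLite dialect only (not the PostgreSQL validator used above), Python's built-in sqlite3 module can catch syntax errors, because SQLite must parse and plan a statement before executing it:

```python
import sqlite3

def check_sqlite_syntax(query: str) -> tuple[bool, str]:
    """Return (is_valid, message) by asking SQLite to plan the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # EXPLAIN forces SQLite to parse and plan without running the query.
        conn.execute("EXPLAIN " + query)
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()

print(check_sqlite_syntax("SELECT 1 + 1")[0])  # True
print(check_sqlite_syntax("SELEC 1 FROM")[0])  # False
```

Note that this check also rejects queries referencing tables absent from the connection, so for realistic generated queries you would first create stub tables matching the expected schema.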