
Code Filtering


NVIDIA NeMo Curator provides specialized filters for assessing and filtering code snippets and programming files. These filters help ensure that code included in your training dataset meets quality standards and doesn’t contain problematic patterns. Code filtering addresses challenges specific to programming content, including:

  • code quality assessment
  • detection of non-code content mislabeled as code
  • identification of embedded data structures or boilerplate
  • language-specific filtering considerations
  • token efficiency for code

These filters are particularly important when preparing datasets for code language models or programming assistants.

How It Works

Code filtering evaluates programming content based on measurable attributes that correlate with code quality and usability for model training. The filters analyze various aspects of code:

  1. Structure Analysis: Examines lines of code, indentation patterns, and overall file organization
  2. Comment Analysis: Measures the ratio of comments to executable code to identify well-documented code versus automatically generated or tutorial content
  3. Content Verification: Ensures files actually contain code rather than data, configuration, or misclassified content
  4. Language-Specific Patterns: Applies different criteria based on programming language conventions
  5. Token Efficiency: Evaluates how efficiently the code can be tokenized for model training

These filters can be applied individually or in combination to create comprehensive quality assessment pipelines. Each filter typically computes a score or makes a binary decision based on configurable thresholds that can be adjusted to match specific requirements.
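
The score-then-threshold pattern can be illustrated without the library. The sketch below is a standalone simplification (not NeMo Curator's implementation): it scores a document with a naive comment-to-code ratio and turns the score into a keep/discard decision using configurable bounds:

```python
def comment_to_code_ratio(source: str) -> float:
    """Score: fraction of non-empty lines that are '#' comments (simplified)."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(1 for ln in lines if ln.startswith("#")) / len(lines)

def keep_document(source: str, min_ratio: float = 0.01, max_ratio: float = 0.85) -> bool:
    """Binary decision: keep only documents whose score falls inside the bounds."""
    return min_ratio <= comment_to_code_ratio(source) <= max_ratio

snippet = "# add two numbers\ndef add(a, b):\n    return a + b\n"
print(comment_to_code_ratio(snippet))  # 1 comment out of 3 non-empty lines
print(keep_document(snippet))          # True
```

Tightening `min_ratio` rejects undocumented code, while lowering `max_ratio` rejects comment-dominated tutorial content.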


Usage

Here’s an example of applying code filters to a dataset:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import (
    PythonCommentToCodeFilter,
    NumberOfLinesOfCodeFilter,
    AlphaFilter,
)

# Create pipeline
pipeline = Pipeline(name="code_quality_filtering")

# Load your code dataset
reader = JsonlReader(
    file_paths="code_data/*.jsonl",
    fields=["content", "id"],  # Specify fields to read
)
pipeline.add_stage(reader)

# Add filter stages for code quality
pipeline.add_stage(ScoreFilter(
    filter_obj=PythonCommentToCodeFilter(
        min_comment_to_code_ratio=0.01,
        max_comment_to_code_ratio=0.8,
    ),
    text_field="content",
    score_field="comment_ratio",
))
pipeline.add_stage(ScoreFilter(
    filter_obj=NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000),
    text_field="content",
    score_field="line_count",
))
pipeline.add_stage(ScoreFilter(
    filter_obj=AlphaFilter(min_alpha_ratio=0.3),
    text_field="content",
    score_field="alpha_ratio",
))

# Add output stage
writer = JsonlWriter(path="filtered_code/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()
```

Available Code Filters

NeMo Curator offers several specialized filters for code content:

Comment Analysis Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| PythonCommentToCodeFilter | Filters Python files based on comment-to-code ratio | min_comment_to_code_ratio, max_comment_to_code_ratio | min=0.01, max=0.85 |
| GeneralCommentToCodeFilter | Applies the same comment-ratio check to other languages | language, min_comment_to_code_ratio, max_comment_to_code_ratio | min=0.01, max=0.85 |

The comment-to-code ratio is an important metric for code quality. Low comment ratios may indicate poor documentation, while high comment ratios might suggest automatically generated code or tutorials. These ratios should be adjusted based on specific programming languages:

```python
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import PythonCommentToCodeFilter, GeneralCommentToCodeFilter

# For Python files with docstrings
python_filter = ScoreFilter(
    filter_obj=PythonCommentToCodeFilter(
        min_comment_to_code_ratio=0.05,  # At least 5% comments
        max_comment_to_code_ratio=0.7,   # At most 70% comments
    ),
    text_field="content",
)

# For other languages
cpp_filter = ScoreFilter(
    filter_obj=GeneralCommentToCodeFilter(
        language="text/x-c++",  # MIME type for C++
        min_comment_to_code_ratio=0.02,
        max_comment_to_code_ratio=0.6,
    ),
    text_field="content",
)
```

The GeneralCommentToCodeFilter supports various language MIME types:

  • text/x-c++ for C++
  • text/x-java for Java
  • text/javascript for JavaScript
  • text/x-ruby for Ruby
  • text/x-csharp for C#
  • text/x-c for C
  • text/x-asm for Assembly
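
In a mixed-language corpus it can be convenient to resolve the MIME type from the file extension before constructing filters. The mapping below is purely illustrative (the extension-to-MIME pairing is this example's assumption, not part of the NeMo Curator API):

```python
import os

# Illustrative mapping from file extension to the MIME types listed above
EXTENSION_TO_MIME = {
    ".cpp": "text/x-c++",
    ".java": "text/x-java",
    ".js": "text/javascript",
    ".rb": "text/x-ruby",
    ".cs": "text/x-csharp",
    ".c": "text/x-c",
    ".s": "text/x-asm",
}

def mime_for(filename):
    """Return the MIME type for a source file, or None if unrecognized."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TO_MIME.get(ext.lower())

print(mime_for("widget.cpp"))  # text/x-c++
```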

Code Structure Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| NumberOfLinesOfCodeFilter | Filters based on the number of lines | min_lines, max_lines | min_lines=10, max_lines=20000 |
| AlphaFilter | Ensures code has sufficient alphabetic content | min_alpha_ratio | min_alpha_ratio=0.25 |
| TokenizerFertilityFilter | Measures token efficiency | path_to_tokenizer (required), min_char_to_token_ratio | min_char_to_token_ratio=2.5 |

Code structure filters help identify problematic patterns:

```python
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import NumberOfLinesOfCodeFilter, AlphaFilter

# Filter for reasonable line counts
line_filter = ScoreFilter(
    filter_obj=NumberOfLinesOfCodeFilter(
        min_lines=5,     # Filter out tiny snippets
        max_lines=2000,  # Filter out extremely long files
    ),
    text_field="content",
)

# Filter for alphabetic content (avoid large data blobs)
alpha_filter = ScoreFilter(
    filter_obj=AlphaFilter(min_alpha_ratio=0.3),  # At least 30% alphabetic chars
    text_field="content",
)
```

The TokenizerFertilityFilter helps ensure code has efficient token encoding:

```python
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import TokenizerFertilityFilter

# Filter for token efficiency
# Note: path_to_tokenizer is required
tokenization_filter = ScoreFilter(
    filter_obj=TokenizerFertilityFilter(
        path_to_tokenizer="/path/to/code_tokenizer.model",  # Required parameter
        min_char_to_token_ratio=2.5,  # Each token encodes at least 2.5 chars on average
    ),
    text_field="content",
)
```

This filter screens out content that tokenizes inefficiently, which wastes context length and compute during model training.
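
The metric itself is just characters per token. The standalone sketch below uses a whitespace tokenizer as a stand-in for a real trained tokenizer model (an assumption for illustration; the actual filter loads the model given by path_to_tokenizer):

```python
def char_to_token_ratio(text, tokenize):
    """Average number of characters each token encodes."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(text) / len(tokens)

# Stand-in tokenizer for illustration; a real fertility check would use
# the trained tokenizer supplied to TokenizerFertilityFilter
whitespace_tokenize = str.split

code = "def add(a, b):\n    return a + b"
ratio = char_to_token_ratio(code, whitespace_tokenize)
print(ratio >= 2.5)  # True: this snippet clears the default threshold
```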

File Format Filters

| Filter | Description | Key Parameters | Default Values |
|---|---|---|---|
| XMLHeaderFilter | Identifies files that are actually XML | char_prefix_search_length | 100 |
| HTMLBoilerplateFilter | Filters HTML with too much boilerplate | min_lang_content_ratio, min_lang_content_num_chars | ratio=0.2, chars=100 |
| PerExtensionFilter | Applies standards based on file extension | lang, extension, metadata_file | depends on metadata |
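
The idea behind XMLHeaderFilter can be sketched in a few lines: scan the first char_prefix_search_length characters for an XML declaration. The helper below is an illustrative reimplementation, not the library's code:

```python
def looks_like_xml(text, char_prefix_search_length=100):
    """Flag documents whose leading characters contain an XML declaration."""
    return "<?xml" in text[:char_prefix_search_length]

print(looks_like_xml('<?xml version="1.0"?><data/>'))       # True
print(looks_like_xml("def main():\n    print('hello')\n"))  # False
```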

Language-Specific Considerations

Different programming languages have different conventions and characteristics. The PerExtensionFilter applies customized filtering based on file extension:

```python
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import PerExtensionFilter

# Apply language-specific filters
python_specific = ScoreFilter(
    filter_obj=PerExtensionFilter(
        lang="python",
        extension=".py",
        metadata_file="code_meta.csv",  # Contains language-specific thresholds
    ),
    text_field="content",
)
```

The metadata file can specify different thresholds for metrics like:

  • Average line length
  • Comment ratio
  • Empty line ratio
  • Alphabetic content ratio
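
These per-file metrics are straightforward to compute when tuning thresholds. The helper below is illustrative (not part of the library) and uses simplified definitions of each metric:

```python
def file_metrics(source):
    """Compute simple per-file metrics like those a metadata file might threshold."""
    lines = source.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    return {
        "avg_line_length": sum(len(ln) for ln in lines) / len(lines) if lines else 0.0,
        "comment_ratio": (
            sum(1 for ln in non_empty if ln.lstrip().startswith("#")) / len(non_empty)
            if non_empty else 0.0
        ),
        "empty_line_ratio": (len(lines) - len(non_empty)) / len(lines) if lines else 0.0,
        "alpha_ratio": (
            sum(c.isalpha() for c in source) / len(source) if source else 0.0
        ),
    }

metrics = file_metrics("# demo\n\nx = 1\n")
print(metrics["empty_line_ratio"])  # one of the three lines is empty
```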

Best Practices for Code Filtering

When filtering code datasets, consider these best practices:

  1. Language-specific configurations: Adjust thresholds based on the programming language

    ```python
    from nemo_curator.stages.text.modules import ScoreFilter
    from nemo_curator.stages.text.filters import PythonCommentToCodeFilter, GeneralCommentToCodeFilter

    # Python tends to have more comments than C
    python_comment_filter = ScoreFilter(
        filter_obj=PythonCommentToCodeFilter(min_comment_to_code_ratio=0.05),
        text_field="content",
    )
    c_comment_filter = ScoreFilter(
        filter_obj=GeneralCommentToCodeFilter(language="text/x-c", min_comment_to_code_ratio=0.02),
        text_field="content",
    )
    ```
  2. Preserve code structure: Ensure filters don’t inadvertently remove valid coding patterns

    ```python
    from nemo_curator.stages.text.modules import ScoreFilter
    from nemo_curator.stages.text.filters import GeneralCommentToCodeFilter

    # Some languages naturally have low comment ratios
    assembly_filter = ScoreFilter(
        filter_obj=GeneralCommentToCodeFilter(
            language="text/x-asm",
            min_comment_to_code_ratio=0.001,  # Very low minimum for assembly
        ),
        text_field="content",
    )
    ```
  3. Combine with language detection: Verify file extensions match content

    ```python
    # First check if the content is actually Python using FastText language ID
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.stages.text.modules import ScoreFilter
    from nemo_curator.stages.text.filters import FastTextLangId, PythonCommentToCodeFilter

    # Create pipeline for Python code filtering with language detection
    pipeline = Pipeline(name="python_code_filtering")

    # Add language detection stage
    pipeline.add_stage(ScoreFilter(
        filter_obj=FastTextLangId(
            model_path="/path/to/lid.176.bin",  # Download from fasttext.cc
            min_langid_score=0.8,
        ),
        text_field="content",
        score_field="language",
    ))

    # Then apply Python-specific filters
    pipeline.add_stage(ScoreFilter(
        filter_obj=PythonCommentToCodeFilter(),
        text_field="content",
    ))
    ```

    The FastTextLangId filter requires downloading the FastText language identification model from fasttext.cc.

  4. Avoid over-filtering: Track rejection rates and adjust thresholds as needed

    ```python
    # Track filter statistics by running individual filters and measuring results
    from nemo_curator.stages.text.io.reader import JsonlReader
    from nemo_curator.stages.text.filters import (
        PythonCommentToCodeFilter,
        NumberOfLinesOfCodeFilter,
        AlphaFilter,
    )

    # Load dataset for testing
    reader = JsonlReader(file_paths="test_data/*.jsonl")

    # Test individual filters to measure rejection rates
    filters_to_test = {
        "python_comment": PythonCommentToCodeFilter(),
        "line_count": NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000),
        "alpha_content": AlphaFilter(min_alpha_ratio=0.3),
    }

    # Note: Actual statistics collection would require running the pipeline
    # and analyzing the results to determine optimal thresholds
    ```
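
Rejection rates can also be estimated with plain Python before committing to a full pipeline run. The sketch below uses simple stand-in predicates (an assumption for illustration; a real measurement would invoke each filter's own scoring logic) over a small document sample:

```python
def rejection_rates(docs, predicates):
    """Fraction of documents each keep-predicate would reject."""
    rates = {}
    for name, keep in predicates.items():
        rejected = sum(1 for doc in docs if not keep(doc))
        rates[name] = rejected / len(docs) if docs else 0.0
    return rates

sample = [
    "# short\nx = 1",                                 # too few lines
    "\n".join(f"line_{i} = {i}" for i in range(10)),  # passes both checks
    "0101" * 200,                                     # data blob, no letters
]

predicates = {
    "line_count": lambda d: 5 <= len(d.splitlines()) <= 1000,
    "alpha_content": lambda d: sum(c.isalpha() for c in d) / len(d) >= 0.3,
}

print(rejection_rates(sample, predicates))
```

If a filter rejects a surprisingly large fraction of the sample, loosen its threshold before running the full pipeline.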

Use Cases

For example, the following pipeline cleans a dataset of non-functional code snippets:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import NumberOfLinesOfCodeFilter, XMLHeaderFilter, GeneralCommentToCodeFilter

# Create pipeline to filter non-functional code snippets
pipeline = Pipeline(name="code_cleaning")

# Remove extremely short files
pipeline.add_stage(ScoreFilter(
    filter_obj=NumberOfLinesOfCodeFilter(min_lines=3),
    text_field="content",
))

# Remove files with XML preamble (misidentified as code)
pipeline.add_stage(ScoreFilter(
    filter_obj=XMLHeaderFilter(),
    text_field="content",
))

# Ensure reasonable comment-to-code ratio
pipeline.add_stage(ScoreFilter(
    filter_obj=GeneralCommentToCodeFilter(language="text/x-c++"),
    text_field="content",
))
```

By applying these specialized code filters, you can improve the quality of code in your training datasets, leading to better model performance for code-related tasks.