# nemo_curator.stages.text.filters.heuristic.code.code

## Module Contents

### Classes

| Name                                                                                                             | Description                                                                   |
| ---------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| [`AlphaFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-AlphaFilter)                               | This filter tries to identify files that have large tensors or tables stored as raw text within code files. |
| [`GeneralCommentToCodeFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-GeneralCommentToCodeFilter) | -                                                                             |
| [`HTMLBoilerplateFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-HTMLBoilerplateFilter)           | This filter tries to identify HTML that is largely boilerplate.               |
| [`NumberOfLinesOfCodeFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-NumberOfLinesOfCodeFilter)   | -                                                                             |
| [`PerExtensionFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter)                 | This filter applies specific conditions depending on the file extension.      |
| [`PythonCommentToCodeFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-PythonCommentToCodeFilter)   | -                                                                             |
| [`TokenizerFertilityFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-TokenizerFertilityFilter)     | -                                                                             |
| [`XMLHeaderFilter`](#nemo_curator-stages-text-filters-heuristic-code-code-XMLHeaderFilter)                       | This filter tries to identify files that have incorrect file extensions.      |

### API

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-AlphaFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.AlphaFilter(
        min_alpha_ratio: float = 0.25
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  This filter tries to identify files that have large tensors or tables stored
  as raw text within code files.
  (Source: StarCoder, [https://arxiv.org/abs/2305.06161](https://arxiv.org/abs/2305.06161))

  <ParamField path="_name" type="= 'alpha_filter'" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-AlphaFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.AlphaFilter.keep_document(
          score: float
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-AlphaFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.AlphaFilter.score_document(
          source: str
      ) -> float
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
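
The `AlphaFilter` heuristic described above can be illustrated with a minimal, self-contained sketch. It assumes that the score is the fraction of alphabetic characters in the file and that a document is kept when that fraction is at least `min_alpha_ratio`. The function names `alpha_ratio` and `keep` are hypothetical, and this is not the library's implementation.

```python
def alpha_ratio(source: str) -> float:
    """Fraction of characters that are alphabetic (assumed scoring rule)."""
    if not source:
        return 0.0
    return sum(ch.isalpha() for ch in source) / len(source)


def keep(score: float, min_alpha_ratio: float = 0.25) -> bool:
    """Keep the document when enough of it is alphabetic (assumed decision rule)."""
    return score >= min_alpha_ratio


code = "def add(a, b):\n    return a + b\n"
table = "0.12 0.98 0.33\n0.45 0.67 0.21\n"  # numeric table stored as raw text

print(keep(alpha_ratio(code)), keep(alpha_ratio(table)))  # prints: True False
```

A file that is mostly digits and separators scores near zero and is dropped, while ordinary source code stays well above the default threshold.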

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-GeneralCommentToCodeFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.GeneralCommentToCodeFilter(
        language: str,
        min_comment_to_code_ratio: float = 0.01,
        max_comment_to_code_ratio: float = 0.85
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  <ParamField path="_name" type="= 'comment_ratio'" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-GeneralCommentToCodeFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.GeneralCommentToCodeFilter.keep_document(
          score: float
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-GeneralCommentToCodeFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.GeneralCommentToCodeFilter.score_document(
          source: str
      ) -> float
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-HTMLBoilerplateFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.HTMLBoilerplateFilter(
        min_lang_content_ratio: float = 0.2,
        min_lang_content_num_chars: int = 100
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  This filter tries to identify HTML that is largely boilerplate.

  <ParamField path="_name" type="= 'html_boilerplate'" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-HTMLBoilerplateFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.HTMLBoilerplateFilter.keep_document(
          score: float
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-HTMLBoilerplateFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.HTMLBoilerplateFilter.score_document(
          source: str
      ) -> float | None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
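
As a rough illustration of what "largely boilerplate" can mean, the sketch below compares the amount of visible text with the total size of the HTML source. It uses only the standard library. The helper names (`_VisibleText`, `content_ratio`, `keep`) are hypothetical. The rules that `script` and `style` contents are ignored, that short text scores zero, and that empty input yields `None` are assumptions of this sketch, not confirmed behavior of `HTMLBoilerplateFilter`.

```python
from html.parser import HTMLParser
from typing import Optional


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def content_ratio(source: str, min_chars: int = 100) -> Optional[float]:
    """Ratio of visible text to total source length (assumed scoring rule)."""
    if not source:
        return None  # sketch choice: nothing to score
    parser = _VisibleText()
    parser.feed(source)
    parser.close()
    text = "".join(parser.parts).strip()
    if len(text) < min_chars:
        return 0.0  # too little visible text to count as content
    return len(text) / len(source)


def keep(score: Optional[float], min_ratio: float = 0.2) -> bool:
    """Keep the document when the visible-text ratio is high enough."""
    return score is not None and score >= min_ratio


boilerplate = "<html><head><style>" + "a{}" * 50 + "</style></head><body><p>hi</p></body></html>"
article = "<p>" + "word " * 40 + "</p>"
print(keep(content_ratio(boilerplate)), keep(content_ratio(article)))
```

The two defaults in the sketch mirror the constructor parameters `min_lang_content_ratio` and `min_lang_content_num_chars`.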

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-NumberOfLinesOfCodeFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.NumberOfLinesOfCodeFilter(
        min_lines: int = 10,
        max_lines: int = 20000
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  <ParamField path="_name" type="= 'num_lines'" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-NumberOfLinesOfCodeFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.NumberOfLinesOfCodeFilter.keep_document(
          score: int
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-NumberOfLinesOfCodeFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.NumberOfLinesOfCodeFilter.score_document(
          source: str
      ) -> int
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
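
The signature of `NumberOfLinesOfCodeFilter` suggests a simple range check on the line count. The sketch below assumes inclusive bounds and counts lines with `str.splitlines`. Both choices, and the function names, are assumptions made for illustration rather than the library's code.

```python
def count_lines(source: str) -> int:
    """Number of lines in the file (assumed scoring rule)."""
    return len(source.splitlines())


def keep(num_lines: int, min_lines: int = 10, max_lines: int = 20000) -> bool:
    """Keep files whose line count falls inside the inclusive range."""
    return min_lines <= num_lines <= max_lines


short_file = "print('hi')\n"
normal_file = "\n".join(f"x{i} = {i}" for i in range(12))

print(keep(count_lines(short_file)), keep(count_lines(normal_file)))  # prints: False True
```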

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter(
        lang: str,
        extension: str,
        metadata_file: str = 'code_meta.csv'
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  This filter applies specific conditions depending on the file extension.

  <ParamField path="_ext_to_filter" type="= self._load_filter_csv(metadata_file, lang)" />

  <ParamField path="_name" type="= 'per_extension_filter'" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter-_alphanum_fraction">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._alphanum_fraction(
          source: str
      ) -> float
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter-_get_filter_params">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._get_filter_params(
          row: dict
      ) -> tuple[bool, int | None, float | None, float | None, float | None]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Extract filter parameters from a CSV row.
  </Indent>

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter-_language_format_from_dataset">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._language_format_from_dataset(
          lang: str
      ) -> str
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Convert the language field in the dataset to the language field used in the CSV file that defines the filters.
  </Indent>

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter-_line_statistics">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._line_statistics(
          source: str
      ) -> tuple[int, float]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter-_load_filter_csv">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._load_filter_csv(
          path: str,
          language: str | None = None
      ) -> dict
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Load the CSV file that specifies the filter to apply for each (lang, extension) pair.
  </Indent>

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter.keep_document(
          score: float | None
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PerExtensionFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter.score_document(
          source: str
      ) -> float
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Filter files based on line length and the percentage of alphanumeric characters.
    The filtering parameters depend on the file extension and are given by `_ext_to_filter`.
  </Indent>
</Indent>
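
To make the per-extension idea concrete, the following sketch applies line-length and alphanumeric-fraction limits looked up by file extension. Everything here is hypothetical: the `Limits` dataclass, the `EXT_TO_LIMITS` table and its values, and the choice to keep files whose extension has no rule. The real thresholds come from the CSV file passed as `metadata_file`, whose contents are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical thresholds for one extension; None disables a check."""

    max_line_length: Optional[int] = None
    max_mean_line_length: Optional[float] = None
    min_alphanum_fraction: Optional[float] = None


# Illustrative values only; these are NOT taken from code_meta.csv.
EXT_TO_LIMITS = {
    "py": Limits(max_line_length=1000, max_mean_line_length=100.0, min_alphanum_fraction=0.25),
    "json": Limits(min_alphanum_fraction=0.5),
}


def line_statistics(source: str) -> tuple[int, float]:
    """Return (longest line length, mean line length)."""
    lengths = [len(line) for line in source.splitlines()] or [0]
    return max(lengths), sum(lengths) / len(lengths)


def alphanum_fraction(source: str) -> float:
    """Fraction of characters that are letters or digits."""
    if not source:
        return 0.0
    return sum(ch.isalnum() for ch in source) / len(source)


def passes(source: str, extension: str) -> bool:
    """Apply the limits registered for `extension`, if any."""
    limits = EXT_TO_LIMITS.get(extension)
    if limits is None:
        return True  # sketch choice: no rule for this extension, so keep the file
    longest, mean = line_statistics(source)
    if limits.max_line_length is not None and longest > limits.max_line_length:
        return False
    if limits.max_mean_line_length is not None and mean > limits.max_mean_line_length:
        return False
    if (
        limits.min_alphanum_fraction is not None
        and alphanum_fraction(source) < limits.min_alphanum_fraction
    ):
        return False
    return True


print(passes("x = 1\ny = 2\n", "py"), passes("a" * 2000, "py"))  # prints: True False
```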

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PythonCommentToCodeFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.PythonCommentToCodeFilter(
        min_comment_to_code_ratio: float = 0.01,
        max_comment_to_code_ratio: float = 0.85
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  <ParamField path="_name" type="= 'python_comment_ratio'" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PythonCommentToCodeFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PythonCommentToCodeFilter.keep_document(
          score: float
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-PythonCommentToCodeFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.PythonCommentToCodeFilter.score_document(
          source: str
      ) -> float
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
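
The parameter names of `PythonCommentToCodeFilter` suggest that a file is kept when its comment-to-code ratio lies between the two bounds. The sketch below is one way such a ratio could be computed with the standard library: `tokenize` finds `#` comments and `ast` finds docstrings. Counting characters, treating unparsable files as having a ratio of zero, and the function names are assumptions of this sketch, not the library's implementation.

```python
import ast
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Characters in comments and docstrings divided by total characters."""
    if not source:
        return 0.0
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
        tree = ast.parse(source)
    except (SyntaxError, tokenize.TokenError):
        return 0.0  # sketch choice: unparsable files score zero

    comment_chars = sum(len(tok.string) for tok in tokens if tok.type == tokenize.COMMENT)

    docstring_chars = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring_chars += len(ast.get_docstring(node, clean=False) or "")

    return (comment_chars + docstring_chars) / len(source)


def keep(score: float, min_ratio: float = 0.01, max_ratio: float = 0.85) -> bool:
    """Keep files that are neither uncommented nor almost entirely comments."""
    return min_ratio <= score <= max_ratio


documented = '# add two numbers\ndef add(a, b):\n    """Return the sum."""\n    return a + b\n'
bare = "x = 1\ny = 2\n"
print(keep(comment_to_code_ratio(documented)), keep(comment_to_code_ratio(bare)))
```

Under these assumptions the upper bound rejects files that are almost entirely commented out, and the lower bound rejects files with no comments at all.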

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-TokenizerFertilityFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.TokenizerFertilityFilter(
        path_to_tokenizer: str | None = None,
        min_char_to_token_ratio: float = 2.5
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  <ParamField path="_name" type="= 'tokenizer_fertility'" />

  <ParamField path="_tokenizer" type="= sentencepiece.SentencePieceProcessor()" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-TokenizerFertilityFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.TokenizerFertilityFilter.keep_document(
          score: float
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-TokenizerFertilityFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.TokenizerFertilityFilter.score_document(
          source: str
      ) -> float
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
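
The parameter `min_char_to_token_ratio` suggests that the score is the number of characters per token. The sketch below illustrates that ratio with a toy regex tokenizer as a stand-in. The class itself holds a SentencePiece processor, which needs a trained model file and is therefore not used here. The names `toy_tokenize`, `char_to_token_ratio`, and `keep`, and the handling of empty input, are assumptions of this sketch.

```python
import re
from typing import Callable


def toy_tokenize(source: str) -> list[str]:
    """Stand-in tokenizer: words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", source)


def char_to_token_ratio(source: str, tokenize: Callable[[str], list] = toy_tokenize) -> float:
    """Average number of characters per token (assumed scoring rule)."""
    tokens = tokenize(source)
    if not tokens:
        return 0.0  # sketch choice: nothing was tokenized
    return len(source) / len(tokens)


def keep(score: float, min_char_to_token_ratio: float = 2.5) -> bool:
    """Keep text that the tokenizer compresses reasonably well."""
    return score >= min_char_to_token_ratio


readable = "def compute_total(values): return sum(values)"
noisy = "{{[[((;;,,))]]}}"
print(keep(char_to_token_ratio(readable)), keep(char_to_token_ratio(noisy)))  # prints: True False
```

Text that fragments into many short tokens, such as long runs of punctuation or encoded data, has a low ratio and falls below the threshold.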

<Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-XMLHeaderFilter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.filters.heuristic.code.code.XMLHeaderFilter(
        char_prefix_search_length: int = 100
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DocumentFilter](/nemo-curator/nemo_curator/stages/text/filters/doc_filter#nemo_curator-stages-text-filters-doc_filter-DocumentFilter)

  This filter tries to identify files that have incorrect file extensions.
  In many cases, these end up being XML files and we try to identify them
  based on the header.
  (Source: StarCoder, [https://arxiv.org/abs/2305.06161](https://arxiv.org/abs/2305.06161))

  <ParamField path="_name" type="= 'xml_header'" />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-XMLHeaderFilter-keep_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.XMLHeaderFilter.keep_document(
          score: float
      ) -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-filters-heuristic-code-code-XMLHeaderFilter-score_document">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.filters.heuristic.code.code.XMLHeaderFilter.score_document(
          source: str
      ) -> float
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
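
The header check described for `XMLHeaderFilter` can be sketched as a search for an XML declaration within the first `char_prefix_search_length` characters. The marker string, the 1.0/0.0 scoring, and the function names below are assumptions chosen for illustration, not the library's implementation.

```python
XML_MARKER = "<?xml version="  # assumed marker for an XML declaration


def xml_header_score(source: str, char_prefix_search_length: int = 100) -> float:
    """Return 1.0 when the marker appears in the file's prefix, else 0.0."""
    prefix = source[:char_prefix_search_length]
    return 1.0 if XML_MARKER in prefix else 0.0


def keep(score: float) -> bool:
    """Drop files flagged as XML; keep everything else."""
    return score != 1.0


mislabeled = '<?xml version="1.0" encoding="UTF-8"?>\n<project></project>\n'
real_code = "int main(void) { return 0; }\n"
print(keep(xml_header_score(mislabeled)), keep(xml_header_score(real_code)))  # prints: False True
```

Because only the prefix is inspected, a declaration that appears later in the file (for example inside a string literal) does not cause the file to be dropped.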