---
description: >-
  Domain-specific processing for code and advanced curation tasks with
  specialized modules
categories:
  - workflows
tags:
  - specialized-processing
  - code
  - advanced
personas:
  - data-scientist-focused
  - mle-focused
difficulty: advanced
content_type: workflow
modality: text-only
---

# Specialized Processing

Domain-specific processing for code and advanced curation tasks using NeMo Curator's specialized modules.

This section covers processing techniques for data types and use cases that need handling beyond general text processing, such as programming content.

## How It Works

Specialized processing modules in NeMo Curator are designed for specific data types and use cases:

* **Code Processing**: Handles programming languages with syntax-aware filtering

Each specialized processor understands the unique characteristics of its target domain and applies appropriate metrics and thresholds within the broader [data processing framework](/about/concepts/text/data/processing).
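To make "syntax-aware metrics and thresholds" concrete, here is a minimal standalone sketch of a comment-to-code ratio check. It is not NeMo Curator's implementation: the function names `comment_to_code_ratio` and `keep_document` are hypothetical, and the metric definition (comment characters divided by total characters) is an assumption chosen for illustration. It uses Python's standard `tokenize` module so that a `#` inside a string literal is not counted as a comment.

```python
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Return the fraction of characters in `source` that belong to comments.

    Hypothetical helper for illustration; the real filter may define
    the ratio differently.
    """
    if not source:
        return 0.0
    comment_chars = 0
    try:
        # Tokenizing (rather than searching for '#') skips '#' in strings.
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comment_chars += len(tok.string)
    except (tokenize.TokenError, IndentationError, SyntaxError):
        # Input that cannot be tokenized is treated as having no comments.
        return 0.0
    return comment_chars / len(source)


def keep_document(source: str, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    """Keep a document only if its comment ratio falls inside the thresholds."""
    return min_ratio <= comment_to_code_ratio(source) <= max_ratio
```

Under these assumptions, a file with no comments at all falls below the lower threshold and is dropped, while a file that is almost entirely comments exceeds the upper threshold.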

***

## Available Specialized Tools

<Cards>
  <Card title="Code Processing" href="/curate-text/process-data/specialized-processing/code">
    Specialized filters for programming content and source code
    Topics: programming, syntax, comments, languages
  </Card>
</Cards>

## Usage

### Quick Examples

<Tabs>
  <Tab title="Code Processing">
    ```python
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.stages.text.modules import ScoreFilter
    from nemo_curator.stages.text.filters import PythonCommentToCodeFilter, NumberOfLinesOfCodeFilter
    from nemo_curator.stages.text.io.reader import JsonlReader

    # Filter Python code based on quality metrics
    code_pipeline = Pipeline(
        name="code_processing_pipeline",
        stages=[
            # Read only the "content" field from the input JSONL files
            JsonlReader(
                file_paths="code_data/*.jsonl",
                fields=["content"],
            ),
            # Keep documents whose comment-to-code ratio is within bounds
            ScoreFilter(
                filter_obj=PythonCommentToCodeFilter(
                    min_comment_to_code_ratio=0.01,
                    max_comment_to_code_ratio=0.8,
                ),
                text_field="content",
                score_field="comment_ratio",
            ),
            # Keep documents with between 5 and 1000 lines of code
            ScoreFilter(
                filter_obj=NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000),
                text_field="content",
                score_field="line_count",
            ),
        ],
    )

    results = code_pipeline.run()
    ```
  </Tab>
</Tabs>

## When to Use Specialized Processing

* **Code datasets**: When working with programming content that needs syntax-aware filtering

## Performance Considerations

* **Code processing**: Fast heuristic-based filtering, suitable for large code repositories
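As a rough illustration of why heuristic filters of this kind scale well, the sketch below implements a line-count check in a single pass over the text, so the cost grows linearly with document size. The helper names `count_code_lines` and `within_line_bounds` are hypothetical, and counting only non-blank lines is an assumption; `NumberOfLinesOfCodeFilter` may count lines differently.

```python
def count_code_lines(source: str) -> int:
    """Count non-blank lines in one pass over the text (illustrative only)."""
    return sum(1 for line in source.splitlines() if line.strip())


def within_line_bounds(source: str, min_lines: int = 5, max_lines: int = 1000) -> bool:
    """Return True when the non-blank line count lies within [min_lines, max_lines]."""
    return min_lines <= count_code_lines(source) <= max_lines
```

A check like this needs no parsing and holds no state between documents, so it can be applied to each document independently.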
