Copyright © 2026, NVIDIA Corporation.

Specialized Processing

Domain-specific processing for code and advanced curation tasks using NeMo Curator’s specialized modules.

This section covers processing techniques for data types and use cases that need specialized handling beyond general text processing, such as programming content.

How It Works

Specialized processing modules in NeMo Curator are designed for specific data types and use cases:

  • Code Processing: Handles programming languages with syntax-aware filtering

Each specialized processor understands the unique characteristics of its target domain and applies appropriate metrics and thresholds within the broader data processing framework.
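The "metric plus threshold" idea can be illustrated with a minimal, standard-library sketch. This is not NeMo Curator's implementation: `comment_to_code_ratio` and `keep_document` are hypothetical helpers, and the ratio is assumed here to be comment characters divided by total characters. The default thresholds mirror the values used in the quick example on this page.

```python
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Return the fraction of characters in `source` that sit inside `#` comments.

    Hypothetical metric for illustration; the library's own definition may differ.
    """
    if not source:
        return 0.0
    comment_chars = 0
    try:
        # tokenize understands Python syntax, so a '#' inside a string
        # literal is not mistaken for a comment.
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comment_chars += len(tok.string)
    except (tokenize.TokenError, SyntaxError):
        # Input that cannot be tokenized is scored as having no comments.
        return 0.0
    return comment_chars / len(source)


def keep_document(score: float, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    """Keep a document only if its score falls inside the inclusive range."""
    return min_ratio <= score <= max_ratio


sample = "# add two numbers\ndef add(a, b):\n    return a + b\n"
sample_score = comment_to_code_ratio(sample)
sample_kept = keep_document(sample_score)
```

Separating the scoring function from the threshold check keeps the score available for inspection, which is the same split the quick example makes between a filter object and its `score_field`.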


Available Specialized Tools

Code Processing

Specialized filters for programming content and source code. Tags: programming, syntax, comments, languages.

Usage

Quick Examples

Code Processing
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.heuristic.code import (
    NumberOfLinesOfCodeFilter,
    PythonCommentToCodeFilter,
)
from nemo_curator.stages.text.io.reader import JsonlReader

# Filter Python code based on quality metrics
code_pipeline = Pipeline(
    name="code_processing_pipeline",
    stages=[
        # Read the "content" field from every JSONL file in code_data/
        JsonlReader(
            file_paths="code_data/*.jsonl",
            fields=["content"],
        ),
        # Score each document by its comment-to-code ratio and filter on the configured range
        ScoreFilter(
            filter_obj=PythonCommentToCodeFilter(
                min_comment_to_code_ratio=0.01,
                max_comment_to_code_ratio=0.8,
            ),
            text_field="content",
            score_field="comment_ratio",
        ),
        # Score each document by its number of lines and filter on the configured range
        ScoreFilter(
            filter_obj=NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000),
            text_field="content",
            score_field="line_count",
        ),
    ],
)

results = code_pipeline.run()
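The reader in the example above expects JSON Lines files with a `content` field. For a quick smoke test, a small input file can be generated with the standard library alone. The sketch below is illustrative only: `write_sample_jsonl`, the `sample.jsonl` file name, and the two records are hypothetical, and it writes to a temporary directory rather than a real `code_data/` folder.

```python
import json
import tempfile
from pathlib import Path


def write_sample_jsonl(directory: Path, records: list[dict]) -> Path:
    """Write one JSON object per line to <directory>/sample.jsonl and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "sample.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            handle.write(json.dumps(record) + "\n")
    return path


# Hypothetical records; only the "content" field name comes from the example above.
sample_records = [
    {"content": "# greet the user\ndef greet(name):\n    return f'hello {name}'\n"},
    {"content": "x = 1\n"},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = write_sample_jsonl(Path(tmp) / "code_data", sample_records)
    # Read the file back to confirm every line is a standalone JSON object.
    lines = sample_path.read_text(encoding="utf-8").splitlines()
    loaded = [json.loads(line) for line in lines]
```

Pointing `file_paths` at a directory populated this way gives the pipeline a known, tiny dataset to run against before it is applied to a full corpus.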

When to Use Specialized Processing

  • Code datasets: When working with programming content that needs syntax-aware filtering

Performance Considerations

  • Code processing: Fast heuristic-based filtering, suitable for large code repositories
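One reason a heuristic check such as a line-count bound tends to be cheap is that each document is scanned once and then either passed on or dropped. The following sketch shows that pattern in plain Python. It is not the library's code: `count_lines` and `filter_by_line_count` are hypothetical names, and the inclusive bounds are an assumption.

```python
from typing import Iterable, Iterator


def count_lines(source: str) -> int:
    """Count the lines in `source`; an empty string counts as zero lines."""
    return len(source.splitlines())


def filter_by_line_count(
    documents: Iterable[str], min_lines: int = 5, max_lines: int = 1000
) -> Iterator[str]:
    """Lazily yield documents whose line count is within the inclusive bounds."""
    for doc in documents:
        if min_lines <= count_lines(doc) <= max_lines:
            yield doc


# Three synthetic documents: too short, in range, too long.
documents = ["a\n" * 3, "a\n" * 5, "a\n" * 1001]
kept = list(filter_by_line_count(documents))
```

Because the function is a generator, documents are processed one at a time and the full dataset never has to be held in memory by the filter itself.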