---
description: >-
  Guide to leveraging NVIDIA GPU acceleration in NeMo Curator for faster data
  processing and memory optimization
categories:
  - reference
tags:
  - gpu-accelerated
  - cuda
  - rmm
  - performance
  - memory-management
  - optimization
personas:
  - mle-focused
  - admin-focused
difficulty: intermediate
content_type: reference
modality: universal
---

# GPU Processing Guide

This guide explains how to use GPU acceleration in NVIDIA NeMo Curator for faster text data processing.

## Setting Up GPU Support

GPU acceleration requires the following:

1. An NVIDIA GPU with CUDA support
2. The RAPIDS libraries (cuDF, RMM)
3. PyTorch with CUDA support, for model inference
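
Before running a pipeline, it can help to confirm that these prerequisites are actually present. The following is a minimal sketch rather than part of NeMo Curator: the helper name `check_gpu_environment` is hypothetical, and it assumes the packages are importable as `cudf`, `rmm`, and `torch`. It uses only the standard library to detect installed packages and queries CUDA through PyTorch only when PyTorch is available.

```python
import importlib.util

# Import names assumed for this check; adjust them to match your installation.
REQUIRED_PACKAGES = ("cudf", "rmm", "torch")


def check_gpu_environment(packages=REQUIRED_PACKAGES):
    """Return a dict mapping each prerequisite to True (present) or False (missing)."""
    # find_spec locates a top-level package without importing it.
    status = {name: importlib.util.find_spec(name) is not None for name in packages}

    # Query CUDA only if PyTorch is installed and imports cleanly.
    cuda_available = False
    if status.get("torch"):
        try:
            import torch

            cuda_available = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            cuda_available = False
    status["cuda_available"] = cuda_available
    return status


if __name__ == "__main__":
    for name, ok in check_gpu_environment().items():
        print(f"{name}: {'OK' if ok else 'missing'}")
```

If any entry reports `missing`, install the corresponding dependency before running the examples in this guide.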

### Example: GPU-Accelerated Text Classification

```python
from nemo_curator.stages.text.classifiers import QualityClassifier
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
import pandas as pd

# Create sample data
data = pd.DataFrame({
    "text": ["This is high quality text.", "Poor quality text here."]
})
batch = DocumentBatch(data=data, task_id="test_task", dataset_name="test_dataset")

# Set up GPU-accelerated classifier
classifier = QualityClassifier(
    model_inference_batch_size=256,
    autocast=True  # Enable mixed precision for faster inference
)

# Create and run pipeline
pipeline = Pipeline(name="test_pipeline")
pipeline.add_stage(classifier)
result = pipeline.run(initial_tasks=[batch])

print(result)
```

### Example: GPU-Accelerated Fuzzy Deduplication

```python
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Set up GPU-accelerated fuzzy deduplication
workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    # GPU-accelerated MinHash parameters
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
    use_64_bit_hash=False
)

# Run deduplication workflow
workflow.run()
```
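
To build intuition for `num_bands` and `minhashes_per_band`, the sketch below applies the standard MinHash LSH banding analysis: with `b` bands of `r` hashes each, two documents with Jaccard similarity `s` share at least one identical band with probability `1 - (1 - s^r)^b`. It assumes the workflow follows this standard banding scheme. The function names are hypothetical and are not part of the NeMo Curator API.

```python
def candidate_probability(similarity, num_bands=20, minhashes_per_band=13):
    """Probability that two documents share at least one identical LSH band."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    # A single band matches only if all of its hashes agree: similarity ** r.
    band_match = similarity ** minhashes_per_band
    # The pair becomes a candidate if any of the b bands matches.
    return 1.0 - (1.0 - band_match) ** num_bands


def approximate_threshold(num_bands=20, minhashes_per_band=13):
    """Similarity near which the candidate probability rises most steeply."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


if __name__ == "__main__":
    print(f"total hashes per document: {20 * 13}")
    print(f"approximate threshold: {approximate_threshold():.3f}")
    for s in (0.6, 0.7, 0.8, 0.9):
        print(f"Jaccard {s:.1f} -> P(candidate) = {candidate_probability(s):.3f}")
```

Under this analysis, more bands make the workflow more likely to flag less similar pairs, while more hashes per band make it stricter. Treat the output as a rough guide when tuning, not as a guarantee of which pairs are removed.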

## GPU-Accelerated Modules

NVIDIA NeMo Curator provides these GPU-accelerated modules:

### Data Processing

* **Exact deduplication**: GPU-optimized processing for duplicate detection
* **Fuzzy deduplication**: GPU-accelerated MinHash computation for approximate duplicates
* **Semantic deduplication**: GPU embeddings and similarity calculations for content-based deduplication

### Text Classification

* **Domain classification**: English and multilingual content categorization
* **Quality classification**: Content quality assessment using GPU-accelerated models
* **Safety models**: AEGIS and Instruction Data Guard for content safety evaluation
* **Educational content**: FineWeb models for educational value scoring
* **Content type classification**: Automatic content type detection
* **Task and complexity classification**: Instruction complexity assessment
