GPU Processing Guide
This guide explains how to use GPU acceleration in NVIDIA NeMo Curator for faster text data processing.
Setting Up GPU Support
To use GPU acceleration, you’ll need:
- NVIDIA GPU with CUDA support
- RAPIDS libraries installed (cuDF, RMM)
- PyTorch with CUDA support for model inference
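Before running a GPU pipeline, it can help to confirm that these prerequisites are actually present. The following is a minimal sketch rather than part of NeMo Curator: the helper names `missing_modules` and `torch_sees_cuda` are hypothetical, and the import names `cudf`, `rmm`, and `torch` are assumed to correspond to the packages listed above.

```python
import importlib.util

# Import names assumed to match the requirements listed above.
REQUIRED_MODULES = ("cudf", "rmm", "torch")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def torch_sees_cuda():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check also works without PyTorch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
    print("PyTorch sees CUDA:", torch_sees_cuda())
```

An empty list from `missing_modules()` together with `True` from `torch_sees_cuda()` suggests the environment meets the requirements above. It does not verify driver or library version compatibility.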
Example: GPU-Accelerated Text Classification
Example: GPU-Accelerated Fuzzy Deduplication
GPU-Accelerated Modules
NVIDIA NeMo Curator provides these GPU-accelerated modules:
Data Processing
- Exact deduplication: GPU-accelerated detection of identical documents
- Fuzzy deduplication: GPU-accelerated MinHash computation for approximate duplicates
- Semantic deduplication: GPU-computed embeddings and similarity calculations for content-based deduplication
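To illustrate the idea behind MinHash-based fuzzy deduplication, here is a small CPU-only sketch in pure Python. It is not NeMo Curator's implementation and does not use the GPU. The function names, the character 5-gram shingling, and the choice of 128 seeded BLAKE2b hashes are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams in a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, salted with a seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature("The quick brown fox jumps over the lazy dog.")
    b = minhash_signature("The quick brown fox jumped over the lazy dog!")
    c = minhash_signature("GPU acceleration speeds up large-scale text curation.")
    print("near-duplicate pair:", estimated_jaccard(a, b))
    print("unrelated pair:", estimated_jaccard(a, c))
```

Documents whose signatures agree in a large fraction of positions are likely near-duplicates. A GPU implementation computes the same kind of signature for many documents in parallel, which is where the acceleration comes from.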
Text Classification
- Domain classification: English and multilingual content categorization
- Quality classification: Content quality assessment using GPU-accelerated models
- Safety models: AEGIS and Instruction Data Guard for content safety evaluation
- Educational content: FineWeb models for educational value scoring
- Content type classification: Automatic content type detection
- Task and complexity classification: Instruction complexity assessment