GPU Processing Guide

This guide explains how to use GPU acceleration in NVIDIA NeMo Curator for faster text data processing.

Setting Up GPU Support

To use GPU acceleration, you’ll need:

  1. NVIDIA GPU with CUDA support
  2. RAPIDS libraries installed (cuDF, RMM)
  3. PyTorch with CUDA support for model inference
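
The prerequisites above can be checked before running a pipeline. The following is a minimal sketch, not part of NeMo Curator: the helper name `check_gpu_prerequisites` and the report keys are hypothetical. It assumes that finding `nvidia-smi` on the `PATH` is a reasonable sign that an NVIDIA driver is installed.

```python
import importlib.util
import shutil


def check_gpu_prerequisites() -> dict:
    """Report which of the listed GPU prerequisites appear to be present."""
    report = {
        # nvidia-smi ships with the NVIDIA driver, so finding it suggests a usable GPU driver.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }

    # Look up each library without importing it, so a missing package does not raise.
    for module_name in ("cudf", "rmm", "torch"):
        try:
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            found = False
        report[f"{module_name}_installed"] = found

    # Only ask PyTorch about CUDA if PyTorch itself is importable.
    report["torch_cuda_available"] = False
    if report["torch_installed"]:
        try:
            import torch

            report["torch_cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install counts as "not available" in this sketch.
            pass

    return report


for name, ok in check_gpu_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If any entry is reported as missing, the GPU-accelerated stages described below are unlikely to run on that machine.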

Example: GPU-Accelerated Text Classification

from nemo_curator.stages.text.classifiers import QualityClassifier
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
import pandas as pd

# Create sample data
data = pd.DataFrame({
    "text": ["This is high quality text.", "Poor quality text here."]
})
batch = DocumentBatch(data=data, task_id="test_task", dataset_name="test_dataset")

# Set up GPU-accelerated classifier
classifier = QualityClassifier(
    model_inference_batch_size=256,
    autocast=True,  # Enable mixed precision for faster inference
)

# Create and run pipeline
pipeline = Pipeline(name="test_pipeline")
pipeline.add_stage(classifier)
result = pipeline.run(initial_tasks=[batch])

print(result)
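
To give a feel for the `model_inference_batch_size` setting, here is a minimal sketch of fixed-size batching. It assumes the parameter caps how many documents are passed to the model in one forward pass. The helper `iter_batches` is hypothetical and is not how NeMo Curator is known to batch internally.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 600 documents with a batch size of 256 need three passes; the last one is partial.
texts = [f"document {i}" for i in range(600)]
batch_sizes = [len(batch) for batch in iter_batches(texts, 256)]
print(batch_sizes)  # [256, 256, 88]
```

Larger batches generally mean fewer passes but more GPU memory per pass, so the right value depends on the model and the available memory.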

Example: GPU-Accelerated Fuzzy Deduplication

from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Set up GPU-accelerated fuzzy deduplication
workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    # GPU-accelerated MinHash parameters
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
    use_64_bit_hash=False,
)

# Run deduplication workflow
workflow.run()
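
The `num_bands` and `minhashes_per_band` values interact. The sketch below assumes the workflow follows the standard MinHash LSH banding scheme, in which two documents become a candidate pair if all of their hashes match in at least one band. Under that assumption, a pair with Jaccard similarity s is flagged with probability 1 - (1 - s^r)^b, where b is the number of bands and r is the number of hashes per band. The function names are illustrative and are not part of NeMo Curator.

```python
def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that two documents share at least one identical band."""
    band_match = similarity ** minhashes_per_band  # all hashes in one band agree
    return 1.0 - (1.0 - band_match) ** num_bands   # at least one band agrees


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Similarity near which the candidate probability rises most steeply."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


# Values taken from the workflow example: 20 bands x 13 hashes = 260 MinHashes per document.
num_bands, minhashes_per_band = 20, 13
print(f"total hashes: {num_bands * minhashes_per_band}")
print(f"approximate threshold: {approximate_threshold(num_bands, minhashes_per_band):.3f}")
for s in (0.6, 0.7, 0.8, 0.9):
    p = candidate_probability(s, num_bands, minhashes_per_band)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

With these settings the approximate threshold comes out at roughly 0.79. Under this model, for a fixed total number of hashes, using more bands with fewer hashes each lowers the threshold and flags more pairs, while fewer bands with more hashes each makes matching stricter.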

GPU-Accelerated Modules

NVIDIA NeMo Curator provides these GPU-accelerated modules:

Data Processing

  • Exact deduplication: GPU-optimized processing for duplicate detection
  • Fuzzy deduplication: GPU-accelerated MinHash computation for approximate duplicates
  • Semantic deduplication: GPU embeddings and similarity calculations for content-based deduplication

Text Classification

  • Domain classification: English and multilingual content categorization
  • Quality classification: Content quality assessment using GPU-accelerated models
  • Safety models: AEGIS and Instruction Data Guard for content safety evaluation
  • Educational content: FineWeb models for educational value scoring
  • Content type classification: Automatic content type detection
  • Task and complexity classification: Instruction complexity assessment