Memory Management Guide#
This guide explains strategies for managing memory when processing large text datasets with NVIDIA NeMo Curator.
Memory Challenges in Text Curation#
Processing large text datasets presents several challenges:
Datasets larger than available RAM/VRAM
Memory-intensive operations like deduplication
Long-running processes that may leak memory
Balancing memory across distributed systems
Memory Management Strategies#
1. Partition Control#
Control how data is split across workers:
from nemo_curator.datasets import DocumentDataset
# Control partition size when reading
dataset = DocumentDataset.read_json(
files,
blocksize="256MB", # Size of each partition
files_per_partition=10 # Files to group together
)
2. Batch Processing#
Process data in manageable chunks:
from nemo_curator.utils.file_utils import get_batched_files
# Process 50 files at a time
for files in get_batched_files(
"input/", "output/", "jsonl",
batch_size=50
):
batch = DocumentDataset.read_json(files)
processed = pipeline(batch)
processed.to_json("output/")
3. Memory-Aware Operations#
Some operations need special memory handling:
Deduplication#
from nemo_curator.modules import ExactDeduplication
# Control memory usage in deduplication
dedup = ExactDeduplication(
batch_size=1000, # Documents per batch
hash_field="text"
)
Classification#
from nemo_curator.classifiers import QualityClassifier
# Manage classifier memory
classifier = QualityClassifier(
batch_size=32, # Smaller batches use less memory
device="cuda:0" # Control GPU device
)
Memory Monitoring#
CPU Memory#
Monitor system memory:
import psutil
def check_memory():
mem = psutil.virtual_memory()
print(f"Memory usage: {mem.percent}%")
print(f"Available: {mem.available / 1e9:.1f} GB")
GPU Memory#
Monitor GPU memory:
import pynvml
def check_gpu_memory():
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory used: {info.used / 1e9:.1f} GB")
Best Practices#
Monitor Memory Usage
Track memory during development
Set up monitoring for production
Handle out-of-memory gracefully
Optimize Data Loading
Use lazy loading when possible
Control partition sizes
Clean up unused data
Resource Management
Release memory after large operations
Use context managers for cleanup
Monitor long-running processes