Memory Management Guide#
This guide explains strategies for managing memory when processing large text datasets with NVIDIA NeMo Curator.
Memory Challenges in Text Curation#
Processing large text datasets presents several challenges:
Datasets larger than available RAM/VRAM
Memory-intensive operations like deduplication
Long-running processes that may leak memory
Balancing memory across distributed systems
Memory Management Strategies#
1. Partition Control#
Control how data is split across workers using file partitioning:
from nemo_curator.stages.file_partitioning import FilePartitioningStage
# Control partition size when reading
partitioner = FilePartitioningStage(
    file_paths=files,
    blocksize="256MB",  # Target size of each partition in memory
    # files_per_partition=10,  # Alternative: group files by count instead of size
)
2. Batch Processing#
Process data in manageable chunks by controlling file partitioning:
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
# Read with controlled partition sizes
reader = JsonlReader(
    file_paths="input/",
    files_per_partition=50,  # Process 50 files per partition
    # blocksize="1GB",  # Alternative: control memory usage per partition instead
)
# Process and write in batches
writer = JsonlWriter(path="output/")
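To decide between files_per_partition and blocksize, it can help to estimate how many partitions your input will produce. The snippet below is a minimal sketch using only the standard library; the "input/" directory and the 1GB target mirror the example above and are placeholders for your own values.

import glob
import os

def estimate_partitions(input_dir: str, blocksize_bytes: int) -> int:
    """Rough partition count for a directory of JSONL files at a given blocksize."""
    paths = glob.glob(os.path.join(input_dir, "*.jsonl"))
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return max(1, -(-total_bytes // blocksize_bytes))  # ceiling division

print(estimate_partitions("input/", 1 * 2**30))  # expected partitions at a 1GB blocksize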
3. Memory-Aware Operations#
Some operations need special memory handling:
Deduplication#
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
# Control memory usage in deduplication
dedup = ExactDeduplicationWorkflow(
    input_path="input/",
    output_path="output/",
    text_field="text",
    input_blocksize="1GB",  # Control memory usage per input block
)
Classification#
from nemo_curator.stages.text.classifiers import QualityClassifier
# Manage classifier memory
classifier = QualityClassifier(
    model_inference_batch_size=64,  # Smaller batches use less memory (default: 256)
    max_chars=3000,  # Limit text length to reduce memory usage (default: 6000)
)
Memory Monitoring#
CPU Memory#
Monitor system memory:
# Note: Requires installing psutil: pip install psutil
import psutil
def check_memory():
    mem = psutil.virtual_memory()
    print(f"Memory usage: {mem.percent}%")
    print(f"Available: {mem.available / 1e9:.1f} GB")
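Building on this, a simple guard can warn before a stage starts when available RAM is already low. The threshold of 8 GB below is an illustrative assumption; adjust it to your workload.

def warn_if_low_memory(min_free_gb: float = 8.0) -> None:
    """Print a warning when available RAM drops below a threshold."""
    free_gb = psutil.virtual_memory().available / 1e9
    if free_gb < min_free_gb:
        print(f"Warning: only {free_gb:.1f} GB of RAM available")

warn_if_low_memory()  # call before launching a memory-heavy stage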
GPU Memory#
Monitor GPU memory:
# Note: Requires CUDA installation with nemo_curator[cuda12]
import pynvml
def check_gpu_memory():
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory used: {info.used / 1e9:.1f} GB")
    pynvml.nvmlShutdown()  # release the NVML handle when done
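On multi-GPU nodes you typically want the same view for every device. The variant below is a minimal extension of the function above that loops over all visible GPUs.

def check_all_gpus():
    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        info = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
        print(f"GPU {i}: {info.used / 1e9:.1f} / {info.total / 1e9:.1f} GB used")
    pynvml.nvmlShutdown()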
Best Practices#
Monitor Memory Usage
  Track memory during development
  Set up monitoring for production
  Handle out-of-memory errors gracefully
Optimize Data Loading
  Use lazy loading when possible
  Control partition sizes
  Clean up unused data
Resource Management
  Release memory after large operations
  Use context managers for cleanup (see the sketch after this list)
  Monitor long-running processes
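For the resource-management points above, the sketch below shows both habits in plain Python: drop the last reference to a large intermediate and trigger a collection, and use a context manager so cleanup happens even when processing fails. The loader and summarizer names are hypothetical placeholders for your own code.

import gc

# Release memory after a large operation
large_intermediate = build_large_intermediate()  # hypothetical placeholder
summary = summarize(large_intermediate)          # hypothetical placeholder
del large_intermediate  # drop the last reference to the large object
gc.collect()            # ask Python to reclaim it promptly

# A context manager guarantees cleanup even if processing fails
with open("output/summary.json", "w") as f:
    f.write(summary)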