Memory Management | NeMo Curator

This guide describes how NVIDIA NeMo Curator manages memory and the strategies you can use to control memory usage when processing large text datasets.

Memory Challenges in Data Curation

Processing large-scale datasets for LLM training presents unique memory management challenges:

Dataset Scale: Modern LLM training datasets can exceed petabytes, far larger than the RAM/VRAM available on any single machine or even a cluster. Efficient streaming and batching are essential to process data incrementally.
Memory-Intensive Operations: Tasks like fuzzy deduplication, embedding generation, and classification require loading large models into GPU memory while simultaneously processing document batches, creating competing demands for limited resources.
Long-Running Pipelines: Processing billions of documents can take days or weeks. Even small memory leaks accumulate over time, potentially causing worker crashes or degraded performance. Automatic worker recycling helps mitigate this.
Distributed Resource Allocation: In multi-node clusters, balancing CPU, GPU, and memory resources across workers becomes complex. Different pipeline stages have different resource requirements (such as I/O-heavy readers compared to GPU-heavy classifiers), requiring intelligent allocation.
Variable Data Sizes: Individual documents can range from a few bytes to megabytes. Processing batches of highly variable-sized documents can cause unpredictable memory spikes if not properly managed.

NeMo Curator addresses these challenges through automatic resource management, streaming execution, and configurable batching parameters that you’ll learn about in this guide.

Memory Management in Curator

Pipeline and Executor Architecture

NeMo Curator uses a Pipeline and Executor architecture to manage resource allocation and distribute work across compute resources efficiently.

How It Works

1. Pipeline Composition

The Pipeline class provides a high-level abstraction for composing data processing workflows:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(
    name="my_pipeline",
    description="Process text documents"
)
pipeline.add_stage(JsonlReader(file_paths="input/"))

# Add text processing stages
# pipeline.add_stage(...)

pipeline.add_stage(JsonlWriter(path="output/"))

# Execute the pipeline
pipeline.run()

Each stage declares its resource requirements through the Resources class, and the executor uses those declarations to allocate workers.

2. Resource Declaration

Stages declare their computational needs using the Resources dataclass:

from nemo_curator.stages.resources import Resources

# MyCpuStage, MySingleGpuStage, and MyMultiGpuStage are placeholders for your own stages;
# `pipeline` is the Pipeline object created in the previous example.

# CPU-only stage
cpu_only_resources = Resources(cpus=2.0)
pipeline.add_stage(MyCpuStage(...).with_(resources=cpu_only_resources))

# GPU stage with memory requirement
single_gpu_resources = Resources(
    cpus=4.0,
    gpu_memory_gb=8.0  # GPU memory required in GB (only for single-GPU stages)
)
pipeline.add_stage(MySingleGpuStage(...).with_(resources=single_gpu_resources))

# Multi-GPU stage
multi_gpu_resources = Resources(
    cpus=8.0,
    gpus=2.0  # Request 2 full GPUs
)
pipeline.add_stage(MyMultiGpuStage(...).with_(resources=multi_gpu_resources))

Curator then allocates workers and memory automatically, based on these declarations and the available hardware.
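To make the effect of these declarations concrete, the following minimal sketch estimates how many workers of a stage could fit on one node. It is an illustration only, not Curator's scheduling logic: StageResources, Node, and max_workers are hypothetical names, and the real executor also accounts for GPU memory, other stages, and autoscaling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageResources:
    """Hypothetical stand-in for a stage's declared needs (not Curator's Resources class)."""
    cpus: float = 1.0
    gpus: float = 0.0


@dataclass(frozen=True)
class Node:
    """Hypothetical description of the free capacity on one machine."""
    cpus: float
    gpus: float


def max_workers(node: Node, req: StageResources) -> int:
    """Return how many workers with the given requirements fit on the node."""
    limits = []
    if req.cpus > 0:
        limits.append(int(node.cpus // req.cpus))
    if req.gpus > 0:
        limits.append(int(node.gpus // req.gpus))
    # The scarcest resource decides the worker count.
    return min(limits) if limits else 0


node = Node(cpus=32, gpus=4)
cpu_workers = max_workers(node, StageResources(cpus=2.0))                  # CPU-bound stage
gpu_workers = max_workers(node, StageResources(cpus=4.0, gpus=1.0))        # one GPU per worker
multi_gpu_workers = max_workers(node, StageResources(cpus=8.0, gpus=2.0))  # two GPUs per worker
```

On this hypothetical 32-CPU, 4-GPU node, the CPU-only stage is limited by CPUs, while both GPU stages are limited by the number of GPUs, which is why over-declaring GPU needs reduces parallelism quickly.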

3. Executor Backends

Executors handle the actual distribution and execution of work. Curator supports multiple executor backends, with the default being the XennaExecutor:

from nemo_curator.backends.xenna import XennaExecutor

executor = XennaExecutor(config={
    "execution_mode": "streaming",  # or "batch"
    "cpu_allocation_percentage": 0.95,  # Reserve 5% for system
    "autoscale_interval_s": 180,  # Adjust workers every 3 minutes
    "logging_interval": 60  # Log status every minute
})

pipeline.run(executor=executor)

Refer to the Pipeline Execution Backends page for more information about Curator’s executors.

4. Worker Management

Executors automatically manage workers based on stage resource requirements:

Worker Allocation: Creates workers with the exact resources each stage declares
Setup/Teardown: Calls setup() once per worker (for example, to load models) and teardown() for cleanup
Setup on Node: Calls setup_on_node() once per node (for example, to download model weights)
Task Batching: Processes multiple tasks per worker call based on batch_size
Auto-scaling: Dynamically adjusts worker count based on workload
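The lifecycle above can be sketched with a toy worker loop. This is a simplified illustration under stated assumptions, not Curator's implementation: UppercaseStage, process_batch, and run_worker are hypothetical, and only the setup(), teardown(), and batch_size names come from the list above.

```python
from typing import List


class UppercaseStage:
    """Hypothetical stage whose hooks mirror the lifecycle described above."""

    batch_size = 2  # Number of tasks handed to the stage per call

    def __init__(self) -> None:
        self.events: List[str] = []
        self.model = None

    def setup(self) -> None:
        # Expensive one-time work per worker, such as loading a model.
        self.model = str.upper
        self.events.append("setup")

    def process_batch(self, tasks: List[str]) -> List[str]:
        self.events.append(f"process:{len(tasks)}")
        return [self.model(task) for task in tasks]

    def teardown(self) -> None:
        # Release whatever setup() acquired.
        self.model = None
        self.events.append("teardown")


def run_worker(stage: UppercaseStage, tasks: List[str]) -> List[str]:
    """Toy worker loop: set up once, process tasks in batches, always tear down."""
    results: List[str] = []
    stage.setup()
    try:
        for start in range(0, len(tasks), stage.batch_size):
            results.extend(stage.process_batch(tasks[start:start + stage.batch_size]))
    finally:
        stage.teardown()
    return results


stage = UppercaseStage()
output = run_worker(stage, ["a", "b", "c", "d", "e"])
```

Because setup() runs once per worker rather than once per task, the cost of loading a model is amortized over every batch that worker processes.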

5. Memory-Efficient Execution

The executor ensures memory efficiency through:

Lazy Evaluation: Data flows through the pipeline stage-by-stage without materializing entire datasets
Batched Processing: Stages process data in configurable batch sizes to control memory usage
Resource Isolation: Each worker gets isolated resources preventing interference
Automatic Cleanup: Workers are recycled periodically to prevent memory leaks
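Lazy, batched evaluation can be illustrated with plain Python generators. This is a conceptual sketch only; Curator's executors distribute work across processes and nodes, and the function names below are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_batches(num_docs: int, batch_size: int) -> Iterator[List[str]]:
    """Source stage: yield small batches on demand instead of building the full dataset."""
    for start in range(0, num_docs, batch_size):
        yield [f"doc-{i}" for i in range(start, min(start + batch_size, num_docs))]


def keep_even(batches: Iterable[List[str]]) -> Iterator[List[str]]:
    """Filter stage: consumes one batch at a time from the upstream stage."""
    for batch in batches:
        yield [doc for doc in batch if int(doc.split("-")[1]) % 2 == 0]


def count_docs(batches: Iterable[List[str]]) -> int:
    """Sink stage: holds only the current batch while accumulating a count."""
    total = 0
    for batch in batches:
        total += len(batch)
    return total


# Nothing is read until count_docs starts pulling batches through the chain.
kept = count_docs(keep_even(read_batches(num_docs=1_000, batch_size=100)))
```

Peak memory here depends on batch_size rather than on num_docs, which is the property that lets a streaming pipeline process datasets larger than RAM.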

Memory Management Strategies

The previous section discussed how Curator handles resource and worker allocations when executing a pipeline. In most cases, you don’t need to configure Resources or executors directly. Curator automatically:

Allocates appropriate resources for each stage based on its requirements
Uses the XennaExecutor by default when running pipelines
Manages worker lifecycle and scaling

The primary way to control memory usage is by configuring data batch sizes through reader parameters like files_per_partition and blocksize. These settings determine how much data flows into each stage at a time, directly impacting memory consumption across your entire pipeline.

Below, we highlight practical ways to configure batch sizes and memory-aware operations.

1. Batch Processing

Process data in manageable chunks by controlling file partitioning:

from nemo_curator.stages.text.io.reader import JsonlReader

# Read with controlled partition sizes
reader = JsonlReader(
    file_paths="jsonl_input/",
    files_per_partition=50,  # Process 50 files at a time
    # blocksize="1GB"  # Alternative: control memory usage per data batch
)

from nemo_curator.stages.text.io.reader import ParquetReader

# Read with controlled partition sizes
reader = ParquetReader(
    file_paths="parquet_input/",
    files_per_partition=50,  # Process 50 files at a time
    # blocksize="1GB"  # Alternative: control memory usage per data batch
)

Setting an appropriate files_per_partition or blocksize is important because it controls how much data is loaded into memory at once and flows through your pipeline stages. Smaller batches reduce memory usage but may decrease throughput, while larger batches improve processing speed at the cost of higher memory consumption. Choose values based on your available memory and dataset characteristics.
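The difference between the two settings can be sketched as follows. This is a simplified model of count-based versus size-based grouping, assuming file sizes are known up front; plan_partitions is a hypothetical helper, and Curator's readers may partition differently in practice.

```python
from typing import List, Optional, Tuple

FileInfo = Tuple[str, int]  # (path, size in bytes)


def plan_partitions(
    files: List[FileInfo],
    files_per_partition: Optional[int] = None,
    blocksize_bytes: Optional[int] = None,
) -> List[List[str]]:
    """Group files by a fixed count or by a per-partition byte budget."""
    if (files_per_partition is None) == (blocksize_bytes is None):
        raise ValueError("set exactly one of files_per_partition or blocksize_bytes")

    if files_per_partition is not None:
        paths = [path for path, _ in files]
        return [paths[i:i + files_per_partition]
                for i in range(0, len(paths), files_per_partition)]

    partitions: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for path, size in files:
        # Start a new partition when the next file would exceed the budget.
        if current and current_bytes + size > blocksize_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions


MB = 1024 ** 2
files = [("a.jsonl", 400 * MB), ("b.jsonl", 500 * MB),
         ("c.jsonl", 300 * MB), ("d.jsonl", 900 * MB)]
by_count = plan_partitions(files, files_per_partition=3)
by_size = plan_partitions(files, blocksize_bytes=1024 * MB)
```

With a fixed count, the first partition holds about 1.2 GB while the second holds 900 MB; with a byte budget, no multi-file partition exceeds 1 GB. A size-based setting gives more predictable memory use when file sizes vary widely.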

2. Memory-Aware Operations

Some operations need special memory handling:

Deduplication

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Control memory usage in deduplication
dedup = ExactDeduplicationWorkflow(
    input_path="input/",
    output_path="output/",
    text_field="text",
    input_blocksize="1GB"  # Control memory usage per input block
)

Note on Workflows vs. Pipelines: Deduplication uses workflows that automatically handle I/O (reading and writing) internally, rather than requiring explicit reader and writer stages. The input_blocksize parameter controls memory usage in the same way as the blocksize parameter in JsonlReader and ParquetReader. For most other operations, you build pipelines by explicitly composing reader → processing stages → writer.

Classification

from nemo_curator.stages.text.classifiers import QualityClassifier

# Manage classifier memory
classifier = QualityClassifier(
    model_inference_batch_size=64,  # Smaller batches use less memory (default: 256)
    max_chars=3000  # Limit text length to reduce memory usage (default: 6000)
)

Understanding Batch Sizes: Curator has two levels of batching that serve different purposes:

batch_size (stage-level): Controls how many DocumentBatch tasks are processed together by a worker. This affects CPU memory and task scheduling efficiency. Most users don’t need to modify this.
model_inference_batch_size (model-specific): Controls how many individual documents are passed to the model’s forward pass at once. This directly affects GPU memory usage during inference. This is the primary parameter to adjust when encountering GPU out-of-memory errors or optimizing GPU utilization.

If you encounter a torch.OutOfMemoryError during model classification, it is almost always because the model_inference_batch_size is too large. Try smaller batch sizes to resolve the error.
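The "try smaller batch sizes" advice can be expressed as a halving search. The sketch below is a generic illustration, not part of Curator: run_with_backoff and fake_infer are hypothetical, and MemoryError stands in for the out-of-memory error so the example runs without a GPU. With PyTorch you would pass torch.OutOfMemoryError as oom_error instead.

```python
from typing import Callable, List, Sequence, Tuple, Type


def run_with_backoff(
    infer: Callable[[Sequence[str]], List[float]],
    docs: Sequence[str],
    batch_size: int,
    oom_error: Type[BaseException] = MemoryError,
) -> Tuple[List[float], int]:
    """Run infer over docs, halving batch_size whenever oom_error is raised.

    Returns the scores and the batch size that finally worked.
    """
    scores: List[float] = []
    i = 0
    while i < len(docs):
        try:
            scores.extend(infer(docs[i:i + batch_size]))
            i += batch_size
        except oom_error:
            if batch_size == 1:
                raise  # Even a single document does not fit.
            batch_size = max(1, batch_size // 2)
    return scores, batch_size


def fake_infer(batch: Sequence[str]) -> List[float]:
    """Stand-in model that 'runs out of memory' above 64 documents per batch."""
    if len(batch) > 64:
        raise MemoryError("simulated out-of-memory")
    return [float(len(doc)) for doc in batch]


docs = ["x" * n for n in range(300)]
scores, final_batch_size = run_with_backoff(fake_infer, docs, batch_size=256)
```

A search like this is mainly useful during development to find a safe value; once you know it, set model_inference_batch_size directly rather than paying for failed attempts at runtime.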

Memory Monitoring

Monitoring memory is essential for production data curation pipelines, especially when processing large-scale datasets over extended periods. Without monitoring, you may encounter silent performance degradation, unexpected out-of-memory failures, resource waste, and difficult-to-debug crashes.

NeMo Curator integrates with Prometheus and Grafana for pipeline monitoring. Refer to the Monitoring page for setup instructions, key metrics to track, and multi-user cluster configuration.

Best Practices

Monitor Memory Usage
- During Development: Use system monitoring tools (such as htop, nvidia-smi, or watch -n 1 nvidia-smi) to observe memory usage patterns as your pipeline runs. Start with small datasets to identify memory bottlenecks before scaling up.
- In Production: Set up monitoring dashboards using Prometheus and Grafana (refer to Monitoring) to track CPU/GPU memory usage, worker utilization, and pipeline throughput over time.
- Ray Dashboard: If using Ray-based executors, access the Ray dashboard (typically at http://localhost:8265) to view real-time resource usage, task execution, and memory consumption across workers.
Optimize Data Loading
- Split large files into smaller files before curation: If you have individual files that are very large (for example, a single 50 GB JSONL file), split them into smaller files (for example, 100 × 500 MB files) before processing. The blocksize parameter controls how much data is read into memory at once but does not automatically split large files. Pre-splitting ensures better parallelization and prevents memory issues.
- Control partition sizes: Use files_per_partition or blocksize to manage how much data flows through your pipeline.
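One way to pre-split a large JSONL file is to stream it line by line into size-capped output files. The following is a minimal standard-library sketch, not a Curator utility: split_jsonl and its output naming scheme are hypothetical, and the demo uses a tiny temporary file in place of a multi-gigabyte input.

```python
import os
import tempfile
from typing import List


def split_jsonl(src: str, out_dir: str, max_bytes: int) -> List[str]:
    """Split src into numbered JSONL files of at most roughly max_bytes each.

    Splits happen only at line boundaries, so every record stays intact.
    A single line larger than max_bytes is written to its own file.
    """
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(src))[0]
    paths: List[str] = []
    out = None
    written = 0
    try:
        with open(src, "rb") as f:
            for line in f:  # Streams one line at a time; the file is never fully loaded.
                if out is None or (written > 0 and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"{stem}-{len(paths):05d}.jsonl")
                    out = open(path, "wb")
                    paths.append(path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return paths


# Demo on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "big.jsonl")
    with open(src, "w", encoding="utf-8") as f:
        for i in range(100):
            f.write('{"id": %d, "text": "hello"}\n' % i)
    parts = split_jsonl(src, os.path.join(tmp, "parts"), max_bytes=1000)
    total_lines = 0
    for p in parts:
        with open(p, "rb") as part:
            total_lines += sum(1 for _ in part)
    largest_part = max(os.path.getsize(p) for p in parts)
```

For a real 50 GB input you would call split_jsonl with max_bytes set to about 500 * 1024**2 and point the reader at the output directory.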
Resource Management
- Use Context Managers: Always use with statements for file operations and resource allocation to ensure proper cleanup even if errors occur.
- Clean Up Large Objects: When working with large datasets in custom stages, explicitly delete temporary objects (for example, del large_dataframe) and consider calling gc.collect() after processing large batches to free memory immediately rather than waiting for automatic garbage collection.
- GPU Memory: PyTorch caches GPU memory that it has already allocated. If you encounter GPU out-of-memory errors despite having sufficient GPU capacity, try calling torch.cuda.empty_cache() between stages to release unused cached memory.
- Worker Lifecycle: Xenna automatically recycles workers periodically (controlled by worker_max_lifetime_m and worker_restart_interval_m in stage configs) to prevent memory leaks from accumulating during long-running pipelines.
- Worker Recycling (Ray Data): For stages that use C libraries prone to heap fragmentation (such as jusText/lxml for HTML extraction), set max_calls_per_worker on DocumentIterateExtractStage to restart worker processes after a fixed number of tasks. CommonCrawlDownloadExtractStage automatically sets this to 2 for jusText extraction. Refer to the Common Crawl guide for details.
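The explicit cleanup pattern from the list above can be demonstrated with the standard library. In this sketch, Batch is a hypothetical stand-in for a large object such as a DataFrame; it deliberately contains a reference cycle, the case where del alone is not enough and gc.collect() makes a difference.

```python
import gc
import weakref
from typing import Tuple


class Batch:
    """Hypothetical large in-memory object (stands in for a big DataFrame)."""

    def __init__(self, n: int) -> None:
        self.rows = list(range(n))
        self.parent = self  # Reference cycle: reference counting alone cannot free this.


def process(n: int) -> Tuple[int, bool]:
    """Sum a batch, release it explicitly, and report whether it was freed."""
    batch = Batch(n)
    total = sum(batch.rows)
    ref = weakref.ref(batch)  # Lets us observe when the object is actually freed.
    del batch                 # Drop our reference to the large object...
    gc.collect()              # ...and collect cycles now rather than at some later point.
    return total, ref() is None


total, freed = process(1_000)
```

Objects without reference cycles are freed as soon as the last reference is dropped, so del is often sufficient on its own; gc.collect() matters mainly for cyclic structures and adds a small cost per call.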