Resumable Processing#
This guide explains how to implement resumable processing for large-scale data operations that may be interrupted.
Why Resumable Processing Matters#
When processing large datasets, operations can be interrupted due to:
System timeouts
Hardware failures
Network issues
Resource constraints
Scheduled maintenance
NeMo Curator provides built-in functionality for resuming operations from where they left off.
Key Utilities for Resumable Processing#
1. get_remaining_files#
This function identifies files that haven’t been processed yet:
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_remaining_files

# Get only the files that haven't been processed yet
files = get_remaining_files("input_directory/", "output_directory/", "jsonl")
dataset = DocumentDataset.read_json(files, add_filename=True)

# Continue processing with the unprocessed files only
# (my_processor stands in for your own pipeline step)
processed_dataset = my_processor(dataset)
processed_dataset.to_json("output_directory/", write_to_filename=True)
2. get_batched_files#
This function returns an iterator that yields batches of unprocessed files:
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_batched_files

# Process files in batches of 64
for file_batch in get_batched_files("input_directory/", "output_directory/", "jsonl", batch_size=64):
    dataset = DocumentDataset.read_json(file_batch, add_filename=True)

    # Process the batch (my_processor stands in for your own pipeline step)
    processed_batch = my_processor(dataset)

    # Write this batch's results before moving on to the next batch
    processed_batch.to_json("output_directory/", write_to_filename=True)
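Because each batch is written out before the next one is read, rerunning the same loop after an interruption recomputes the remaining set and resumes with the files that never reached the output directory; previously written files are skipped automatically.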
How Resumable Processing Works#
The resumption system works by:
Examining filenames in the input directory
Comparing them with filenames in the output directory
Identifying files that exist in the input but not in the output directory
Processing only those unprocessed files
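Conceptually, this filename diff can be reproduced in a few lines of standard-library Python. The following is a minimal sketch that assumes input and output files share the same base names; the function name remaining_files is illustrative only, and the real implementation is nemo_curator.utils.file_utils.get_remaining_files.

import os

def remaining_files(input_dir, output_dir, extension="jsonl"):
    # Files already present in the output directory count as processed
    processed = set(os.listdir(output_dir))
    # Keep only input files whose names do not appear in the output directory
    return [
        os.path.join(input_dir, name)
        for name in sorted(os.listdir(input_dir))
        if name.endswith(extension) and name not in processed
    ]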
This approach requires:
Using add_filename=True when reading files
Using write_to_filename=True when writing files
Maintaining consistent filename patterns between input and output
Best Practices for Resumable Processing#
Preserve filenames: Use add_filename=True when reading files and write_to_filename=True when writing.
Batch appropriately: Choose batch sizes that balance memory usage and processing efficiency.
Use checkpointing: For complex pipelines, consider writing intermediate results to disk (see the sketch after this list).
Test resumability: Verify that your process can resume correctly after simulated interruptions.
Monitor disk space: Ensure sufficient storage for both input and output files.
Log progress: Maintain logs of processed files to help diagnose issues.
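To illustrate the checkpointing practice above, the sketch below gives each pipeline stage its own output directory and recomputes the remaining work with get_remaining_files, so a rerun after an interruption skips files that were already written. The stage functions clean_text and filter_quality and the directory names are hypothetical placeholders for your own pipeline.

from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_remaining_files

def run_stage(stage_fn, input_dir, output_dir):
    # Recompute the remaining work on every run, so a rerun resumes cleanly
    files = get_remaining_files(input_dir, output_dir, "jsonl")
    if not files:
        return  # Stage already complete
    dataset = DocumentDataset.read_json(files, add_filename=True)
    stage_fn(dataset).to_json(output_dir, write_to_filename=True)

# clean_text and filter_quality are hypothetical stage functions;
# each stage checkpoints its results to its own directory.
run_stage(clean_text, "raw/", "cleaned/")
run_stage(filter_quality, "cleaned/", "filtered/")

Because every stage reads and writes by filename, restarting the same script after a failure redoes only the files that never reached the corresponding stage directory.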