Resumable Processing
This guide explains strategies to make large-scale data operations resumable.
Why Resumable Processing Matters
Processing large datasets can be interrupted by:
- System timeouts
- Hardware failures
- Network issues
- Resource constraints
- Scheduled maintenance
NeMo Curator provides built-in functionality for resuming operations from where they left off.
How it Works
The resumption approach works by:
- Examining filenames in the input directory using `get_all_files_paths_under()`
- Comparing them with filenames in the output directory
- Identifying unprocessed files by comparing file counts or specific file lists
- Rerunning the pipeline on remaining files
This approach works best when you:
- Use consistent directory structures for input and output
- Process files in batches using `files_per_partition` to manage memory usage
- Create checkpoints by writing intermediate results to disk
Practical Patterns for Resumable Processing
1. Process remaining files using directory comparison
Use file listing utilities to identify unprocessed files and process them directly:
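A minimal sketch of the directory-comparison step, using only the Python standard library. The helper name `find_remaining_files` and the assumption that each output file keeps its input's filename are illustrative choices, not part of NeMo Curator's API; in an actual pipeline the input listing would come from `get_all_files_paths_under()`:

```python
from pathlib import Path

def find_remaining_files(input_dir: str, output_dir: str) -> list[str]:
    """Return input files with no matching file in the output directory.

    Assumes the pipeline writes each result under the same filename as
    its input (a common convention, not enforced by this sketch).
    """
    input_files = {p.name: p for p in Path(input_dir).iterdir() if p.is_file()}
    done = {p.name for p in Path(output_dir).iterdir() if p.is_file()}
    # Only files not yet present in the output directory need processing.
    return sorted(str(p) for name, p in input_files.items() if name not in done)
```

Rerunning the pipeline on just the returned list makes the whole job resumable: files that already produced output are skipped automatically.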
2. Batch processing with file partitioning
Control memory usage and enable checkpoint creation by using NeMo Curator’s built-in file partitioning:
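A simplified illustration of the batching idea using only the standard library. In NeMo Curator itself the batch size is controlled by the `files_per_partition` argument when reading a dataset; here `files_per_batch`, `process_in_batches`, and the per-file `process` callback are stand-ins for illustration only:

```python
from pathlib import Path
from typing import Callable

def process_in_batches(
    files: list[str],
    output_dir: str,
    process: Callable[[str], str],
    files_per_batch: int = 2,  # stand-in for files_per_partition
) -> None:
    """Process files in fixed-size batches, writing each batch's results
    to disk before starting the next. Completed files act as checkpoints:
    if the run is interrupted, they are skipped on the next invocation."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(files), files_per_batch):
        for f in files[start : start + files_per_batch]:
            target = out / Path(f).name
            if target.exists():
                continue  # already written in a previous run: resume point
            target.write_text(process(f))
```

Because each batch's output reaches disk before the next batch starts, at most one batch of work is lost on failure, and a rerun picks up where the previous run stopped.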