Data Operations#
Data Copy and Synchronization#
Common tools like rsync and scp can be used to copy and synchronize data, though depending on your dataset they may not be the most effective way to complete the transfer in the minimal amount of time. Both run serially, so if your dataset contains many files and/or subdirectories, this adds up to slow copy times.
Parallel Solutions#
Msrsync#
msrsync was initially written for older versions of RHEL which only shipped Python 2.6.x; a version supporting Python 3.x, msrsync3, is also available. It is distributed on GitHub as a single script.
Example and performance comparison#
msrsync wraps rsync and achieves parallelism by taking the file list for the transfer and splitting it into buckets, each of which is then worked by its own rsync instance. This is especially performant when working with datasets that contain many small files.
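Conceptually, the bucket approach looks roughly like the following manual sketch (the bucket count, file list location, and destination path are illustrative; msrsync handles the splitting, scheduling, and cleanup for you):

cd /tmp/unlabeled2017
# Build a file list, split it into 4 roughly equal buckets, and run one rsync per bucket.
find . -type f > /tmp/filelist.txt
split -n l/4 /tmp/filelist.txt /tmp/bucket.
for bucket in /tmp/bucket.*; do
    rsync -a --files-from="$bucket" /tmp/unlabeled2017/ /home/johndoe/coco/unlabeled2017/uncompressed/ &
done
wait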
This test uses the COCO unlabeled2017 set, a public training dataset that is representative of data seen in the field. We're copying it from local storage on a login node into a home directory on shared storage mounted over NFSv3.
Files: 123,403
Size: 19 GB
cd /tmp
wget --limit-rate=10m http://images.cocodataset.org/zips/unlabeled2017.zip
unzip unlabeled2017.zip
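To confirm the file count and on-disk size after extraction, standard tools can be used; the size reported by du will vary slightly with filesystem block size.

find /tmp/unlabeled2017 -type f | wc -l
du -sh /tmp/unlabeled2017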
rsync
time rsync -a /tmp/unlabeled2017/ /home/johndoe/coco/unlabeled2017/uncompressed
real 6m26.706s
user 0m32.033s
sys 2m16.699s
msrsync3
We’ll configure the number of parallel rsync processes to a static value of 4. Depending on the size of the dataset, this value can be increased, though at some point you’ll hit diminishing returns due to process and filesystem overhead.
wget https://raw.githubusercontent.com/jbd/msrsync/refs/heads/master/msrsync3
chmod +x msrsync3
time ./msrsync3 -p 4 -r "-a" /tmp/unlabeled2017/ /home/johndoe/coco/unlabeled2017/uncompressed
real 2m29.440s
user 0m39.906s
sys 2m43.294s
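With four parallel rsync processes, the same copy completes in roughly 40% of the single-rsync wall-clock time (2m29s versus 6m27s). If you would rather scale the process count with the available cores, something like -p "$(nproc)" could be substituted for the fixed value, though on shared storage the network and filesystem typically become the bottleneck well before CPU does.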