
## Introduction

Over the past decade, and especially in the last 3-4 years, the size of AI datasets has grown significantly, often exceeding the combined capacity of block storage devices that can be attached to a single server machine.

Hence, distributed storage.

There are plenty of distributed-storage options to choose from. However, the choice may appear to be limited, at least in part, due to the following challenges:

- a large [ML dataset](https://labelyourdata.com/articles/what-is-dataset-in-machine-learning#:~:text=A%20machine%20learning%20dataset%20is,same%20way%20as%20humans%20do) contains a mix of large files (images, video) and [small files](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22small+file+problem%22) (e.g., image labels);
- training a model entails random access to the entirety of the (again) extremely large dataset that cannot be cached;
- due to the model's complexity and size, training itself takes many hours, sometimes days, which in turn motivates reducing this time without compromising accuracy.

All of the above is a problem: a performance problem, to be precise, for any distributed storage solution that was _not_ designed to address it in the first place.

## What's in a shard?

In this article, we narrowly focus on one specific aspect that directly affects training performance: _sharding_.

In [AIStore](https://github.com/NVIDIA/aistore), sharding refers to _serializing_ original files (images, labels, etc.) into .tar (or .tgz, .zip, .tar.lz4) formatted objects.

Further, an AIStore shard is an object that also abides by a certain convention that is often referred to as [WebDataset format](https://huggingface.co/docs/hub/en/datasets-webdataset). Ultimately, WebDataset convention is a short way to express the idea that serialized shards must contain batches of samples that, when received, can be immediately distributed between computing workers - one batch at a time.

But we'll talk more about WebDataset convention (or format) later in this text. Serialization itself, though, offers well-known benefits:

- if done right, iterable serialized formats enable purely sequential I/O operations, improving performance on local drives by 3x-10x compared to the random access to huge numbers of files, including [small files](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22small+file+problem%22);
- long-lived compute-to-storage connections established to transmit larger payloads reduce network overhead;
- all of the above, in combination, optimizes end-to-end request handling latency and overall system throughput.
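
As a minimal illustration of the serialization idea (using only Python's standard `tarfile` module, not the AIStore API), the sketch below packs a couple of files into an in-memory `.tar` shard and then reads them back strictly front to back. The file names and contents are placeholders.

```python
import io
import tarfile


def build_shard(files: dict[str, bytes]) -> bytes:
    """Serialize a {name: content} mapping into an in-memory .tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def iter_shard(shard: bytes):
    """Yield (name, content) pairs by reading the shard sequentially."""
    # Mode "r|" treats the input as a non-seekable stream: no random access.
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r|") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


# Placeholder contents standing in for an image and its annotation.
files = {
    "n01440764_10026.JPEG": b"<jpeg bytes>",
    "n01440764_10026.xml": b"<annotation/>",
}
shard = build_shard(files)
recovered = dict(iter_shard(shard))
```

Because the reader never seeks, the same loop works over a network stream as well as over a local file, which is what makes the format a good fit for sequential I/O.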

In the context of machine learning, further advantages can be gained through sharding, particularly when working with large datasets that are several orders of magnitude greater than a server's RAM.

Finally, sharding facilitates efficient, bias-eliminating data shuffling: you can shuffle shard names globally and use a shuffle buffer on the client side to further ensure that training data is adequately randomized.
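
The two-level shuffle described above can be sketched as follows. This is an illustration under stated assumptions, not AIStore or WebDataset code: `read_samples` is a hypothetical stand-in for reading one shard, and the buffer size is arbitrary.

```python
import random
from typing import Iterable, Iterator


def shuffle_buffer(samples: Iterable[str], buffer_size: int,
                   rng: random.Random) -> Iterator[str]:
    """Yield samples in randomized order using a bounded in-memory buffer."""
    buffer: list[str] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left once the input is exhausted.
    rng.shuffle(buffer)
    yield from buffer


def read_samples(shard_name: str) -> list[str]:
    """Hypothetical stand-in: pretend every shard holds four samples."""
    return [f"{shard_name}/sample-{i}" for i in range(4)]


rng = random.Random(0)
shard_names = [f"shard-{i:06d}.tar" for i in range(8)]
rng.shuffle(shard_names)  # level 1: global shuffle of shard order

# Level 2: client-side shuffle buffer over the sequential sample stream.
stream = (s for name in shard_names for s in read_samples(name))
epoch = list(shuffle_buffer(stream, buffer_size=16, rng=rng))
```

Each shard is still read sequentially; randomness comes from the shard order plus the buffer, so a larger buffer trades client memory for better mixing.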

### Background: serialized shards in machine learning

There's a handful of popular serialization formats. Maybe the first one that comes to mind is Google's [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) - the format that efficiently serializes _structured_ data using [Protocol Buffers](https://github.com/protocolbuffers/protobuf).

There's also some existing research that tries to compare the associated pros and cons - see for instance:

* Streaming Technologies and Serialization Protocols: Empirical Performance Analysis, at [https://arxiv.org/abs/2407.13494](https://arxiv.org/abs/2407.13494)

In AIStore, we decided to optimize for unstructured data which broadly includes images, video, audio, sensor-generated streams, logs, web pages, biometric data, and much more. In short, _unstructured data_ was a big part of the motivation but not the entire motivation. The second part of the motivation, or rather self-imposed requirement, was _open_ format - emphasis on _open_.

TFRecord, mentioned above, is tightly integrated with [TensorFlow](https://www.tensorflow.org/), which limits its broader applicability. On the other hand, [.tar](https://en.wikipedia.org/wiki/Tar_(computing)) has existed for about 45 years (since 1979), is absolutely ubiquitous, and absolutely open.

> We strongly advocate using open, ubiquitous, widely-supported, and standard **serialization formats** such as `.tar`, `.zip`, and similar. In addition to optimized performance, you also get your original data _archived_, preserving the (original) directory structure, names, and sizes - everything.

The rest of this text is organized as follows:

- [AIStore sharding API and CLI](#aistore-sharding-api-and-cli)
- [Initial sharding of machine learning datasets](#initial-sharding-of-machine-learning-datasets)
- [`ishard`: integration with machine learning workflow](#ishard-integration-with-machine-learning-workflow)
- [Data loading benchmark](#data-loading-benchmark)
- [AIStore `ishard` vs. AIStore `dsort`: when to use what?](#aistore-ishard-vs-aistore-dsort-when-to-use-what)

## AIStore sharding API and CLI

Sharding becomes especially important when working with petascale datasets.

To support those sizes, AIStore scales linearly with each added storage node and data drive. Recognizing the critical importance of sharding, it also provides APIs to create, read, write, and list archives in various serialized [formats](/aistore/archive).

In addition, there are integrated [batch operations](/aistore/cli/archive#archive-multiple-objects) to run concurrent multi-object _sharding_: to create shards given arbitrary lists or ranges of objects in any AIStore-accessible bucket

> whereby source (or input) objects are not necessarily _present_ in-cluster.

For example, given a bucket that contains objects (`foo`, `bar`, and `baz`) we could use [CLI](/aistore/cli) to create our first `.tar` shard:

```sh
$ ais archive --help     ## see help for options and inline usage examples

$ ais archive ais://src ais://dst/shard1.tar --list foo,bar,baz

$ ais archive ls ais://dst/shard1.tar
NAME                     SIZE
shard1.tar               31.00KiB
    shard1.tar/foo       9.26KiB
    shard1.tar/bar       9.26KiB
    shard1.tar/baz       9.26KiB
```

## Initial sharding of machine learning datasets

Goes without saying: original machine learning datasets can have arbitrary structures, including deeply nested directories, a massive number of small or large files, or task-dependent annotation files.

Despite all this, there is almost always a need to batch associated files that constitute computable samples together for immediate consumption by a model. For example, `train/part1/toyota.jpeg` and `label/20240807/toyota.xml` are training data and corresponding annotations that should be kept together in the same batch. While you could prevent splitting computable samples by manually selecting associated files and individually archiving each of them, this approach is definitely impractical when working on a large, petabyte-scale dataset containing billions of arbitrarily-structured files.

This is where `ishard` comes in.

Initial Sharding utility (`ishard`) is designed to create WebDataset-formatted shards from the original dataset without splitting computable samples. The ultimate goal is to allow users to treat AIStore as a vast data lake, where they can easily upload training data in its raw format, regardless of size and directory structure.

Users can then run `ishard` to shard that data correctly and efficiently as a preprocessing step. The only question they need to address boils down to: **How should `ishard` associate samples with their corresponding annotations/labels?**

In this article, we will demonstrate how simple it is to configure `ishard` to associate samples and how much the overall data loading performance is improved after sharding.

The diagram below illustrates the step-by-step workflow of the `ishard` process:

1. **List Files**: Retrieve all selected files from the source bucket.
2. **Group Samples**: Associate files that constitute computable samples together according to the configured `sample_key_pattern` rule.
3. **Create Shards**: Collect samples until the configured `shard_size` is reached, then request AIStore to asynchronously archive the collection into a shard.
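
The three steps above can be sketched in a few lines of Python. This is a simplified model rather than the actual `ishard` implementation: `sample_key` and `plan_shards` are hypothetical helpers, object sizes are made-up byte counts, and real `ishard` asks AIStore to archive each collection instead of returning a plan.

```python
import os
from collections import defaultdict


def sample_key(path: str) -> str:
    """Base filename without extension (stand-in for the sample key rule)."""
    return os.path.splitext(os.path.basename(path))[0]


def group_samples(objects: dict[str, int]) -> dict[str, list[str]]:
    """Step 2: associate files that share a sample key."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(objects):
        groups[sample_key(path)].append(path)
    return dict(groups)


def plan_shards(objects: dict[str, int], shard_size: int) -> list[list[str]]:
    """Step 3: collect whole samples until shard_size is reached."""
    shards, current, current_size = [], [], 0
    for paths in group_samples(objects).values():
        current.extend(paths)  # a sample is never split across shards
        current_size += sum(objects[p] for p in paths)
        if current_size >= shard_size:
            shards.append(current)
            current, current_size = [], 0
    if current:
        shards.append(current)
    return shards


# Step 1: a listing of the source bucket as {object name: size in bytes}.
objects = {
    "train/n01440764/n01440764_10026.JPEG": 110,
    "annotation/n01440764/n01440764_10026.xml": 1,
    "train/n01440764/n01440764_10027.JPEG": 90,
    "train/n01440764/n01440764_10029.JPEG": 60,
    "annotation/n01440764/n01440764_10029.xml": 1,
}
shards = plan_shards(objects, shard_size=150)
```

Since a shard is closed only after a complete sample has been added, shards in this model can end up slightly larger than `shard_size`.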

![](https://files.buildwithfern.com/aistore.docs.buildwithfern.com/aistore/e649fe5f9c848a8b1fb2c6e14612aca85fa6b6a91e29fde1ff5287b25ea763e0/pages/assets/ishard/ishard_workflow.gif)

It’s important to note that `ishard` doesn’t alter the original dataset but instead transforms it into a new independent bucket. Specifically, `ishard` copies and shards a selected subset of data into an isolated workspace tailored for your specific ML task. This fully isolated on-demand approach ensures that `ishard` provides an exact, independent, and I/O performance-optimized working dataset for each individual ML task.

![](https://files.buildwithfern.com/aistore.docs.buildwithfern.com/aistore/075e5d1d60955a403458a4df7df0d96a3398c4c338271edf5ef6b00c56af44cc/pages/assets/ishard/ishard-ml-buckets.gif)

## `ishard`: integration with machine learning workflow

In this post, we'll demonstrate how to utilize `ishard` and AIStore's PyTorch data loaders to efficiently work with the [ImageNet](https://www.image-net.org/) dataset.

### Prepare the original ImageNet dataset

First, use this [script](/aistore/examples/ishard-imagenet/download.sh) to download the ImageNet dataset. After downloading, you can inspect the resulting directory structure as follows.

```sh
tree $IMAGENET_HOME
.
├── annotation
│   ├── n00007846
│   │   └── Annotation
│   │       └── n00007846
│   │           ├── n00007846_103856.xml
│   │           ├── n00007846_104163.xml
│   │           └── ...
│   ├── n00015319
│   │   └── ...
│   └── ...
├── train
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   └── ...
│   ├── n01443537
│   │   └── ...
│   └── ...
│
├── validation
│   └── val
│       ├── ILSVRC2010_val_00000001.JPEG
│       ├── ILSVRC2010_val_00000002.JPEG
│       └── ...
...
```

Prepare buckets and upload the ImageNet dataset to AIStore.

```sh
$ ais bucket create ais://ImageNet ais://ImageNet-sharded
$ ais put "./" ais://ImageNet --recursive
```

### Execute `ishard`

Install the latest version of the `ishard` executable.

```sh
$ go install github.com/NVIDIA/aistore/cmd/ishard@latest
```

> Because sharding a large dataset can take hours to complete, it is highly recommended to first perform a `dry-run` of your `ishard` command to ensure it produces the expected output shard composition.

### Usage 1: extract only base filename as sample key

In most cases, training data and associated annotation files share the same base filename but have different extensions (e.g., `n01440764_12957.JPEG` for data and `n01440764_12957.xml` for annotations). By default, `ishard` recognizes files sharing the same base filename as an indivisible sample, ensuring they are included in the same shard.

![](https://files.buildwithfern.com/aistore.docs.buildwithfern.com/aistore/aa31dd0d15524f1afd4ca941567a08f5f2b4a2ccea1b3501ea41838219ec6f9b/pages/assets/ishard/ishard-base_file_name.png)

```sh
$ ishard -shard_size=512MiB -src_bck=ImageNet -dst_bck=ImageNet-sharded

shard-10.tar                                                    512.01MiB
    shard-10/train/n01440764/n01440764_12957.JPEG               131.09KiB
    shard-10/train_annotation/n01440764/n01440764_12957.xml     709B
    shard-10/train/n01440764/n01440764_12971.JPEG               58.81KiB
    shard-10/train/n01440764/n01440764_12972.JPEG               164.68KiB
    shard-10/train_annotation/n01440764/n01440764_12972.xml     484B
...
shard-11.tar                                                    512.10MiB
...
```

As shown above, in the ImageNet dataset, some source image files do not have corresponding bounding box annotations (`.xml`). For fully supervised tasks where an annotation file is needed for every training sample, you can specify `sample_exts` to list all desired extensions for each sample, and explicitly set `missing_extension_action="exclude"`. This forces `ishard` to filter out incomplete samples.

```sh
# The created shards size is less than total source objects size, because incomplete samples are excluded
$ ishard -shard_size=512MiB -sample_exts=".JPEG,.xml" -missing_extension_action="exclude" -src_bck=ImageNet -dst_bck=ImageNet-sharded -progress
Source Objects: 148.36GiB
Created Shards Size: 61.3GiB / 62.3GiB [============================================================>-] 98% 11m49s

$ ais archive ls ais://ImageNet-sharded --limit=100
NAME                                                            SIZE
shard-000000.tar                                                517.74MiB
    shard-000000/train/n01440764/n01440764_10040.JPEG           143.06KiB
    shard-000000/train_annotation/n01440764/n01440764_10040.xml 483B
    shard-000000/train/n01440764/n01440764_10048.JPEG           44.15KiB
    shard-000000/train_annotation/n01440764/n01440764_10048.xml 483B
...
```

You can also specify `-missing_extension_action="abort"` to stop if any sample is missing a specified extension. `ishard` will correctly report which sample is missing which extension.

```sh
$ ishard -shard_size=512MiB -sample_exts=".JPEG,.xml" -missing_extension_action="abort" -src_bck=ImageNet -dst_bck=ImageNet-sharded -progress

Source Objects: 148.36GiB
ishard execution failed: sample n01484850.sbow contains extension .mat, not specified in `sample_ext` config
...
```
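
To make the two behaviors concrete, here is a small Python model of the filtering logic. It is a sketch, not `ishard` code: `filter_samples` is a hypothetical function, only the `exclude` and `abort` actions discussed above are modeled, and the error text is illustrative.

```python
import os
from collections import defaultdict


def filter_samples(paths: list[str], sample_exts: list[str],
                   action: str) -> dict[str, list[str]]:
    """Keep only samples that have every extension in sample_exts."""
    by_key: dict[str, dict[str, str]] = defaultdict(dict)
    for path in paths:
        stem, ext = os.path.splitext(os.path.basename(path))
        by_key[stem][ext] = path

    complete = {}
    for key, members in by_key.items():
        missing = set(sample_exts) - set(members)
        if not missing:
            complete[key] = [members[ext] for ext in sample_exts]
        elif action == "abort":
            raise ValueError(f"sample {key} is missing {sorted(missing)}")
        elif action != "exclude":
            raise ValueError(f"action not modeled in this sketch: {action}")
        # action == "exclude": silently drop the incomplete sample
    return complete


paths = [
    "train/n01440764/n01440764_12957.JPEG",
    "train_annotation/n01440764/n01440764_12957.xml",
    "train/n01440764/n01440764_12971.JPEG",  # no matching .xml
    "train/n01440764/n01440764_12972.JPEG",
    "train_annotation/n01440764/n01440764_12972.xml",
]
kept = filter_samples(paths, [".JPEG", ".xml"], action="exclude")
```

With `action="abort"`, the same call raises on the first incomplete sample instead of dropping it.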

### Usage 2: sharding by original directory structure

Sometimes, your dataset might already be hierarchically structured. For example, the ImageNet training dataset is categorized by synset IDs (e.g., `n0xxxxxxx`) used as directory names. These synset IDs correspond to specific labels in the dataset, such as the synset ID `n01440764`, which corresponds to the category "tench" (a type of fish). All images in the directory `n01440764` are labeled as images of tench.

In this case, you can use the `full_name` sample key pattern and specify the prefix `/train` in the source bucket, indicating to `ishard` to only consider files in this directory. Then, `ishard` will reconstruct the original directory structure and peel off every subdirectory to form independent shards, from bottom to top.

![](https://files.buildwithfern.com/aistore.docs.buildwithfern.com/aistore/9c8e5708a3aa38c999469e5d2a0e4b326b8d65021c0cf76f2d59df0f96ddc700/pages/assets/ishard/ishard-full_name.png)

```sh
$ ishard -sample_key_pattern="full_name" -shard_size=512MiB -src_bck=ImageNet/train -dst_bck=ImageNet-sharded
$ ais archive ls ais://ImageNet-sharded --limit=100

shard-10.tar                                            157.60MiB
    shard-10/train/n02097658/n02097658_10013.JPEG       15.18KiB
    shard-10/train/n02097658/n02097658_10015.JPEG       120.11KiB
    shard-10/train/n02097658/n02097658_10020.JPEG       29.57KiB
...
shard-11.tar                                            155.44MiB
    shard-11/train/n03495258/n03495258_10028.JPEG       81.69KiB
    shard-11/train/n03495258/n03495258_1003.JPEG        411.34KiB
...
```

You may notice that, although we configured `shard_size=512MiB`, none of the output shards actually reach this size. This is because, by default, `ishard` maintains clear boundaries between files that belong to different virtual directories, even if an output shard's size doesn't reach the requested `shard_size`.
To disable this default setting and compact each output shard's size closer to `shard_size`, regardless of directories, you can specify the `-collapse` flag. This allows `ishard` to collapse samples into their parent directory if their overall size doesn't reach `shard_size`.

```sh
$ ishard -sample_key_pattern="full_name" -collapse -shard_size=512MiB -src_bck=ImageNet/train -dst_bck=ImageNet-sharded
$ ais archive ls ais://ImageNet-sharded --limit=100

shard-9.tar                                             543.05MiB
...
    shard-9/train/n01631663/n01631663_10036.JPEG        153.73KiB
    shard-9/train/n01631663/n01631663_996.JPEG          245.67KiB
    shard-9/train/n02965783/n02965783_10361.JPEG        78.58KiB
...
```
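
The effect of directory boundaries can be modeled with a deliberately simplified sketch. It assumes a flat list of sibling directories and greedy packing, which is not how `ishard` walks the real directory tree; `pack` is a hypothetical helper and the sizes are arbitrary units.

```python
def pack(dirs: dict[str, list[int]], shard_size: int,
         collapse: bool) -> list[int]:
    """Return planned shard sizes for files grouped by directory."""
    shard_sizes, current = [], 0
    for file_sizes in dirs.values():
        for size in file_sizes:
            current += size
            if current >= shard_size:
                shard_sizes.append(current)
                current = 0
        if current and not collapse:
            # Directory boundary: emit an undersized shard rather than
            # mixing files from different directories.
            shard_sizes.append(current)
            current = 0
    if current:
        shard_sizes.append(current)
    return shard_sizes


dirs = {
    "train/a": [100, 100, 100],
    "train/b": [100, 100],
    "train/c": [100],
}
bounded = pack(dirs, shard_size=400, collapse=False)
collapsed = pack(dirs, shard_size=400, collapse=True)
```

In this model, keeping boundaries yields three undersized shards (one per directory), while collapsing lets leftovers spill into a shared pool and produces fewer shards closer to the requested size.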

### Usage 3: sharding by customized categories

In some cases, you may want to pack samples into shards based on customized categories. For example, with the ImageNet dataset, you may not need to classify images at a detailed level. Instead, you may want to treat both "tench" and "goldfish" as just "fish". In such cases, you can use an external key map (EKM) to specify the exact mapping from samples to output shards.

> If your desired sharding structure cannot be directly derived from the names of the source files using `sample_key_pattern`, we suggest using EKM to specify your desired packing rules.

The following example EKM file will pack all samples matching these specified templates into their corresponding categories.

```json
{
  "fish-%d.tar": [
    "train/n01440764.*",  // tench
    "train/n01443537.*",  // goldfish
    ...
  ],
  "dog-%d.tar": [
    "train/n02084071.*",  // toy terrier
    "train/n02085782.*",  // Japanese spaniel
    "train/n02085936.*",  // Maltese dog
    ...
  ],
  "bird-%d.tar": [
    "train/n01514668.*",  // cock
    "train/n01514859.*",  // hen
    ...
  ]
}
```

To run `ishard` with the EKM file:

```sh
$ ishard -ekm="/path/to/category.json" -sample_key_pattern="base_name" -src_bck=ImageNet/train -dst_bck=ImageNet-sharded
$ ais archive ls ais://ImageNet-sharded --limit=100

NAME                                                     SIZE
bird-0.tar                                               1.08MiB
    bird-0.tar/train/n01514668/n01514668_10004.JPEG      124.09KiB
...
bird-176.tar                                             1.06MiB
    bird-176.tar/train/n01514668/n01514668_9964.JPEG     133.77KiB
    bird-176.tar/train/n01514668/n01514668_9973.JPEG     95.13KiB
...
dog-0.tar                                                1.36MiB
    dog-0.tar/train/n02085782/n02085782_1006.JPEG        1.01KiB
...
fish-0.tar                                               1.01MiB
    fish-0.tar/train/n01440764/n01440764_10026.JPEG      13.38KiB
...
```
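
Conceptually, an EKM is a lookup from object names to shard-name templates. The sketch below assumes the entries are regular expressions matched from the start of the object name and that `%d` is a running shard index; both are assumptions made for illustration, and `shard_template_for` is a hypothetical helper.

```python
import re
from typing import Optional

# A trimmed-down version of the EKM above (comments and elisions removed).
ekm = {
    "fish-%d.tar": ["train/n01440764.*", "train/n01443537.*"],
    "dog-%d.tar": ["train/n02084071.*", "train/n02085782.*"],
}


def shard_template_for(path: str, ekm: dict[str, list[str]]) -> Optional[str]:
    """Return the shard-name template whose patterns match the path."""
    for template, patterns in ekm.items():
        if any(re.match(pattern, path) for pattern in patterns):
            return template
    return None  # not covered by the key map


template = shard_template_for("train/n01440764/n01440764_10026.JPEG", ekm)
first_shard = template % 0  # fill in the shard index
unmatched = shard_template_for("train/n01514668/n01514668_10004.JPEG", ekm)
```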

## Data loading benchmark

We conducted a micro-benchmark to assess the impact of sharding on data loading performance by iterating over one epoch of a 150GB ImageNet dataset. We compared the results before and after applying `ishard` with different shard sizes.

### Bench setup

- OS: Ubuntu 22.04.1 LTS
- CPUs: 16 cores
- Memory: 32GB

AIStore was deployed locally with the following setup using [local playground](/aistore/getting_started#local-playground) deployment script:

```sh
$ ./scripts/clean_deploy.sh --target-cnt 3 --proxy-cnt 1 --mountpath-cnt 3
```

### 1. Loading from the original dataset

Below is the minimal code for using [`AISIterDataset`](/aistore/pytorch#class-aisiterdataset) to iterate through the original `ImageNet` dataset stored in `ais://ImageNet`:

```python
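import os
from timeit import default_timer as timer

from aistore.sdk import Client
from aistore.pytorch import AISIterDataset
from torch.utils.data import DataLoader

# Assumption: the endpoint is taken from the environment; the fallback is the usual local playground address
AIS_ENDPOINT = os.environ.get("AIS_ENDPOINT", "http://localhost:8080")

client = Client(AIS_ENDPOINT)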
dataset = AISIterDataset(ais_source_list=client.bucket("ais://ImageNet"))
loader = DataLoader(dataset, batch_size=256, num_workers=8)

start = timer()
for urls, data in loader: # iterate through the original `ais://ImageNet` dataset
    for idx in range(len(urls)):
        len(data[idx])  # Ensure the data content is read
elapsed_time = timer() - start

print(f"Time spent: {elapsed_time:.2f} seconds")
```

### 2. Loading from the sharded dataset

Here’s the minimal code for using [`AISShardReader`](/aistore/pytorch#class-aisshardreader) to iterate through the sharded `ImageNet` dataset stored in `ais://ImageNet-sharded`:

```python
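import os
from timeit import default_timer as timer

from aistore.sdk import Client
from aistore.pytorch import AISShardReader
from torch.utils.data import DataLoader

# Assumption: the endpoint is taken from the environment; the fallback is the usual local playground address
AIS_ENDPOINT = os.environ.get("AIS_ENDPOINT", "http://localhost:8080")

client = Client(AIS_ENDPOINT)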
shard_reader = AISShardReader(bucket_list=client.bucket("ais://ImageNet-sharded"))
loader = DataLoader(shard_reader, batch_size=256, num_workers=8)

start = timer()
for basenames, content_dict in loader: # iterate through the sharded `ais://ImageNet-sharded` dataset
    for idx, basename in enumerate(basenames):
        for k, v in content_dict.items():
            if v[idx] != b"":
                len(v[idx])  # Ensure the data content is read
elapsed_time = timer() - start

print(f"Time spent: {elapsed_time:.2f} seconds")
```

### Performance comparison

| Shard Avg. Size               | Not Sharded | 128 KiB | 256 KiB | 512 KiB | 2 MiB  | 8 MiB  | 32 MiB | 128 MiB | 512 MiB |
|-------------------------------|-------------|---------|---------|---------|--------|--------|--------|---------|---------|
| Total Time Spent (sec)         | 1184.40     | 959.13  | 682.74  | 476.34  | 348.27 | 369.88 | 407.16 | 528.28  | 598.07  |
| Total Throughput (MiB/s)       | 128.27      | 158.39  | 222.52  | 318.93  | 436.22 | 410.73 | 373.12 | 287.57  | 254.02  |
| Throughput per Worker (MiB/s)  | 16.03       | 19.80   | 27.82   | 39.86   | 54.53  | 51.34  | 46.64  | 35.95   | 31.75   |

![](https://files.buildwithfern.com/aistore.docs.buildwithfern.com/aistore/84b462f855c2728c9c45bace46dc92bcbd64711b333b9694a0eda995dc33eb8b/pages/assets/ishard/ishard-throughput-plot.png)

The benchmark reveals a significant improvement in data loading performance after sharding with `ishard`. Throughput peaks at a shard size of 2 MiB, at roughly 3.4 times the throughput of the unsharded dataset (436.22 vs. 128.27 MiB/s), and then declines gradually as shards grow larger. These findings underscore the importance of selecting a suitable shard size, one that balances I/O efficiency against processing overhead.
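
The headline numbers can be re-derived from the table with a few lines of arithmetic. The snippet only restates the measured values listed above; nothing here is a new measurement.

```python
# shard size -> (total time in seconds, total throughput in MiB/s)
results = {
    "Not Sharded": (1184.40, 128.27),
    "128 KiB": (959.13, 158.39),
    "256 KiB": (682.74, 222.52),
    "512 KiB": (476.34, 318.93),
    "2 MiB": (348.27, 436.22),
    "8 MiB": (369.88, 410.73),
    "32 MiB": (407.16, 373.12),
    "128 MiB": (528.28, 287.57),
    "512 MiB": (598.07, 254.02),
}

baseline_time, baseline_tput = results["Not Sharded"]
best = max(results, key=lambda size: results[size][1])
best_time, best_tput = results[best]

speedup = best_tput / baseline_tput          # throughput ratio
time_ratio = baseline_time / best_time       # same ratio seen from the time side
volume_gib = baseline_time * baseline_tput / 1024  # data read in one epoch

print(f"best shard size: {best}")
print(f"speedup: {speedup:.2f}x, time ratio: {time_ratio:.2f}x")
print(f"data volume: {volume_gib:.2f} GiB")
```

The derived data volume agrees with the `Source Objects: 148.36GiB` figure reported by `ishard` earlier, which is a useful sanity check on the table.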

## AIStore `ishard` vs. AIStore `dsort`: when to use what?

Both `ishard` and [`dsort`](/aistore/dsort) are AIStore extensions designed for dataset sharding, but they serve different purposes at different stages of your workflow. `ishard` is intended to transform an initially flat-formatted dataset into a sharded format, whereas `dsort` is used for reorganizing an already-sharded dataset.

If your dataset is already in a sharded format and you need to reorganize the data within those shards, whether it’s adjusting shard size or reordering/re-shuffling data, you should directly use `dsort` with your desired configuration. However, if your dataset is not yet sharded, you’ll need to start with `ishard` to create the initial shards. You can then use `dsort` for further reorganization if needed. `ishard` also offers inline integration via the `-sort` flag, allowing you to directly execute `dsort` on the dataset immediately after `ishard`-ing it.

## Conclusion

AIStore `ishard` is motivated by the idea of optimally serializing machine learning data that consists of a mix of large and small files. By serializing these original files into shards, `ishard` helps to remove performance bottlenecks almost inevitably associated with random read access and inefficient I/O operations in distributed storage environments.

Additionally, `ishard` offers flexible configuration options to manage arbitrarily structured datasets. It allows users to associate files that constitute computable samples, and group them into easily _consumable_ shards.

Finally, the performance improvements demonstrated in our micro-benchmarks underscore the substantial benefits of using `ishard` for sharding large-scale datasets. This enhances overall data loading efficiency, making it a crucial tool for optimizing large-scale machine learning pipelines.

Looking ahead, we plan to utilize `ishard` in a data center, in large AIStore cluster environments, to benchmark and compare the performance of petascale datasets _with_ and _without_ sharding.