It is probably not much of an exaggeration to say that the majority of popular AI datasets are available on the Internet and in public remote buckets. Those datasets often keep growing in size, thus continuously providing a wealth of information to research and analyze.
It is, therefore, appropriate to ask a follow-up question: how does one work with those datasets efficiently? What happens if the dataset in question is larger than the capacity of a single host? What happens if it is large enough to require a cluster of storage servers?
The often-cited paper Revisiting Unreasonable Effectiveness of Data in Deep Learning Era lists a good number of those large and very popular datasets, as well as the reasons to utilize them for training.
Meet Internet Downloader, an integrated part of AIStore. An AIS cluster can be easily deployed on any commodity hardware, and the AIS downloader can then be used to quickly populate AIS buckets with any content from a given location.
AIStore supports a number of 3rd party Backend providers.
To access remote data (and store it in-cluster), AIStore utilizes the respective provider’s SDK.
For Amazon S3, that would be aws-sdk-go-v2; for Azure, azure-storage-blob-go; and so on. Each SDK can be conditionally linked into the AIS executable; the decision (to link or not to link) is made prior to deployment.
This has a certain implication for the Downloader. Namely:
A downloadable source can be either an Internet link (or links) or a remote bucket accessible via the corresponding backend implementation.
You can, for instance, download a Google Cloud bucket via its Internet location that would look something like: https://www.googleapis.com/storage/.../bucket-name/....
However, when downloading a remote bucket (any remote bucket), it is always preferable to have the corresponding SDK linked in. The Downloader will then detect the SDK's presence at runtime and use the wider range of options available via this SDK.
AIStore includes native support for downloading datasets and models from HuggingFace, providing:
The following example shows downloading a dataset from HuggingFace:
Other supported features include:
The rest of this document describes these and other capabilities in greater detail and illustrates them with examples.
Downloading jobs run asynchronously; you can monitor the progress of each specific job. The following example runs two jobs, each downloading 10 objects (gzipped tarballs in this case) from a given Google Cloud bucket:
For more examples see: Downloader CLI
AIS Downloader supports 4 (four) request types: single, multi, range, and backend, each described below.
Prior to downloading, make sure the destination bucket already exists. To create a bucket using AIS CLI, run ais create.
Also, see AIS API for details on how to create, destroy, and list storage buckets. For Python-based clients, a better starting point could be here.
The rest of this document is structured around supported types of downloading jobs and can serve as an API reference for the Downloader.
The request (described below) downloads a single object and is considered the most basic. On success, the request returns an id that can then be used to check the status of the download job or to abort it.
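As an illustration, the following sketch builds (but does not send) a single-object download request in Python. The POST method, the /v1/download path, and the JSON field names (type, bucket, object_name, link) are assumptions made for this example and are not confirmed by this document; check them against the AIS API reference. The bucket name, object name, and link are hypothetical.

```python
import json
from urllib.request import Request

AIS_GATEWAY = "http://localhost:8080"  # address of an AIS gateway


def build_single_download(bucket: str, object_name: str, link: str,
                          gateway: str = AIS_GATEWAY) -> Request:
    """Build (without sending) a request that starts a single-object download."""
    # ASSUMPTION: these field names are illustrative, not taken from this document.
    body = {
        "type": "single",
        "bucket": {"name": bucket},
        "object_name": object_name,
        "link": link,
    }
    return Request(
        url=f"{gateway}/v1/download",  # ASSUMPTION: job-creation path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # ASSUMPTION: jobs are created with POST
    )


# Hypothetical bucket, object name, and link.
req = build_single_download("my-bucket", "archive.tar.gz",
                            "https://example.com/data/archive.tar.gz")
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would submit the job to the gateway; nothing is sent over the network until then.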
NOTE: localhost:8080 (above and elsewhere in this document) can be replaced with any legitimate (http or https) address of any AIS gateway.
A multi-object download requires either a map or a list in the JSON body:
Map: custom_object_name (key) -> external_link (value). This format lets you choose object names explicitly instead of relying on the automatic naming used by the list format.
List: external_link to resource. Object names are created from the base of the link.
On success, the request returns an id that can then be used to check the status of the download job or to abort it.
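The difference between the two forms can be sketched in Python as follows. This is an approximation of "the base of the link" using the last path component of the URL; the exact rule AIS applies (for example, to query strings) is not specified here, and the links are hypothetical.

```python
import posixpath
from typing import Dict, List
from urllib.parse import urlparse


def name_from_link(link: str) -> str:
    """Derive an object name from the last path component of a link."""
    return posixpath.basename(urlparse(link).path)


def list_to_map(links: List[str]) -> Dict[str, str]:
    """Turn the list form into an equivalent map form: object name -> link."""
    return {name_from_link(link): link for link in links}


# Hypothetical links, used only for illustration.
links = [
    "https://example.com/datasets/train.tar.gz",
    "https://example.com/datasets/test.tar.gz",
]
print(list_to_map(links))
```

Note that two links sharing the same base name would map to the same object name in this sketch, which is one situation where the map form, with explicitly chosen names, is the safer choice.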
A range download retrieves multiple objects in one shot, relying on a commonly used naming convention. On success, the request returns an id that can then be used to check the status of the download job or to abort it.
Namely, the range download expects the object name to consist of prefix + index + suffix, as described below:
Consider a website named randomwebsite.com/some_dir/ that contains the following files:
object1log.txt
object2log.txt
object3log.txt
object1000log.txt
To populate AIStore with objects in the range from object200log.txt to object300log.txt (101 objects total), use the range download.
Tip: use the -g option in curl to turn off the URL globbing parser; this allows { and } to be used without escaping them.
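To make the prefix + index + suffix convention concrete, here is a minimal Python sketch that enumerates the names covered by the range in the example above. The helper name expand_range is hypothetical; the sketch only illustrates the naming arithmetic and does not contact a cluster.

```python
from typing import List


def expand_range(prefix: str, start: int, end: int, suffix: str) -> List[str]:
    """Return the names prefix + index + suffix for every index in start..end, inclusive."""
    if start > end:
        raise ValueError("start must not exceed end")
    return [f"{prefix}{index}{suffix}" for index in range(start, end + 1)]


names = expand_range("object", 200, 300, "log.txt")
print(len(names), names[0], names[-1])
```

Both endpoints are included, which is why the range 200 to 300 yields 101 names rather than 100.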
A backend download prefetches multiple objects whose names match the provided prefix and suffix and that are contained in a given remote bucket.
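The selection rule can be pictured with a small client-side Python sketch. It is an illustration only: the real matching happens inside the cluster, its exact semantics are not specified here, and the object names below are hypothetical.

```python
from typing import Iterable, List


def select_objects(names: Iterable[str], prefix: str = "", suffix: str = "") -> List[str]:
    """Keep only the names that start with prefix and end with suffix."""
    return [name for name in names
            if name.startswith(prefix) and name.endswith(suffix)]


# Hypothetical listing of a remote bucket.
remote_listing = [
    "imagenet/train-0001.tar",
    "imagenet/train-0002.tar",
    "imagenet/val-0001.tar",
    "imagenet/README.md",
]
selected = select_objects(remote_listing, prefix="imagenet/train-", suffix=".tar")
print(selected)
```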
Any download request can be aborted at any time by making a DELETE request to /v1/download/abort with the job id (which is returned upon job creation).
The status of any download request can be queried at any time using a GET request with the job id (which is returned upon job creation).
The list of all download requests can be queried at any time. Note that this has the same syntax as Status except that the id parameter is empty.
Any aborted or finished download request can be removed from the list of downloads by making a DELETE request to /v1/download/remove with the job id (which is returned upon job creation).
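The job-management requests above can be sketched in Python as follows. The DELETE paths for abort and remove come from the text; the GET path for status and list, and the convention of passing the id as a JSON body of the form {"id": ...}, are assumptions made for this example. The job id is hypothetical, and the requests are built but not sent.

```python
import json
from urllib.request import Request

# HTTP method and path per action.
# ASSUMPTION: status (and list) use GET on /v1/download.
ROUTES = {
    "status": ("GET", "/v1/download"),
    "abort": ("DELETE", "/v1/download/abort"),
    "remove": ("DELETE", "/v1/download/remove"),
}


def build_job_request(action: str, job_id: str = "",
                      gateway: str = "http://localhost:8080") -> Request:
    """Build (without sending) a status, abort, or remove request for a job.

    With action="status", an empty job_id corresponds to listing all jobs.
    """
    method, path = ROUTES[action]
    # ASSUMPTION: the id is passed in a JSON body.
    body = json.dumps({"id": job_id}).encode("utf-8")
    return Request(
        url=gateway + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method=method,
    )


abort = build_job_request("abort", "hypothetical-job-id")
print(abort.get_method(), abort.full_url)
```

Keeping the three actions in one table makes it easy to see that status and list differ only in whether the id is empty.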