
Copyright © 2026, NVIDIA Corporation.


Downloader


Why Downloader?

It probably won’t be much of an exaggeration to say that the majority of popular AI datasets are available on the Internet and public remote buckets. Those datasets are often growing in size, thus continuously providing a wealth of information to research and analyze.

It is, therefore, appropriate to ask a follow-up question: how to efficiently work with those datasets? And what happens if the dataset in question is larger than the capacity of a single host? What happens if it is large enough to require a cluster of storage servers?

The often cited paper called Revisiting Unreasonable Effectiveness of Data in Deep Learning Era lists a good number of those large and very popular datasets, as well as the reasons to utilize them for training.

Meet Internet Downloader - an integrated part of AIStore. An AIS cluster can be easily deployed on any commodity hardware, and the AIS Downloader can then be used to quickly populate AIS buckets with content from a given location.

Features

AIStore supports a number of 3rd party Backend providers.

To access remote data (and store it in-cluster), AIStore utilizes the respective provider’s SDK.

For Amazon S3, that would be aws-sdk-go-v2, for Azure - azure-storage-blob-go, and so on. Each SDK can be conditionally linked into AIS executable - the decision (to link or not to link) is made prior to deployment.

This has a certain implication for the Downloader. Namely:

A downloadable source can be either an Internet link (or links) or a remote bucket accessible via the corresponding backend implementation. You can, for instance, download a Google Cloud bucket via its Internet location, which would look something like: https://www.googleapis.com/storage/.../bucket-name/....

However, when downloading a remote bucket (any remote bucket), it is always preferable to have the corresponding SDK linked in. The Downloader will then detect the SDK's presence at runtime and use the wider range of options available via this SDK.

HuggingFace Integration

AIStore includes native support for downloading datasets and models from HuggingFace, providing:

  • Direct dataset downloads - Download entire datasets or specific files with simple CLI commands
  • Model downloads - Access any public or private model repository
  • Authentication support - Use HuggingFace tokens for private/gated content
  • Configurable job batching - Splits large datasets into optimized download jobs based on file sizes
  • Concurrent metadata collection - Parallel HEAD requests for faster dataset discovery

HuggingFace Example

The following example shows downloading a dataset from HuggingFace:

$ ais download --hf-dataset squad ais://datasets/squad/
Warning: destination bucket ais://datasets/squad doesn't exist. Bucket with default properties will be created.
Found 2 parquet files in dataset 'squad'
Started download job dnl-c7nsf2UG9
All 2 files successfully downloaded
$ ais ls ais://datasets/squad/
NAME                  SIZE
train/0.parquet       13.79MiB
validation/0.parquet  1.74MiB

Other supported features include:

  • Can download a single file (object), a range, an entire bucket, or a virtual directory in a given remote bucket.
  • Easy to use from the command-line interface.
  • Versioning and checksum support allow for an optimal download of the same source location multiple times, incrementally updating the AIS destination with source changes (if any).

The rest of this document describes these and other capabilities in greater detail and illustrates them with examples.

Example

Download jobs run asynchronously; you can monitor the progress of each specific job. The following example runs two jobs, each downloading 10 objects (gzipped tarballs in this case) from a given Google Cloud bucket:

$ ais start download "gs://lpr-imagenet/train-{0001..0010}.tgz" ais://imagenet
5JjIuGemR
Run `ais show job download 5JjIuGemR` to monitor the progress of downloading.
$ ais start download "gs://lpr-imagenet/train-{0011..0020}.tgz" ais://imagenet
H9OjbW5FH
Run `ais show job download H9OjbW5FH` to monitor the progress of downloading.
$ ais show job download
JOB ID     STATUS    ERRORS  DESCRIPTION
5JjIuGemR  Finished  0       https://storage.googleapis.com/lpr-imagenet/imagenet_train-{0001..0010}.tgz -> ais://imagenet
H9OjbW5FH  Finished  0       https://storage.googleapis.com/lpr-imagenet/imagenet_train-{0011..0020}.tgz -> ais://imagenet

For more examples see: Downloader CLI

Request to download

AIS Downloader supports 4 (four) request types:

  • Single - download a single object.
  • Multi - download multiple objects provided by JSON map (string -> string) or list of strings.
  • Range - download multiple objects based on a given naming pattern.
  • Backend - given optional prefix and optional suffix, download matching objects from the specified remote bucket.

Prior to downloading, make sure the destination bucket already exists. To create a bucket using the AIS CLI, run ais create, for instance:

$ ais create imagenet

Also, see AIS API for details on how to create, destroy, and list storage buckets. For Python-based clients, the Python SDK may be a better starting point.

The rest of this document is structured around supported types of downloading jobs and can serve as an API reference for the Downloader.

Table of Contents

  • Single (object) download
  • Multi (object) download
  • Range (object) download
  • Backend download
  • Aborting
  • Status (of the download)
  • List of downloads
  • Remove from list

Single Download

The request described below downloads a single object and is the most basic of the four. On success, it returns the job id, which can then be used to check the status of the download job or to abort it.

Request JSON Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| bucket.name | string | Bucket where the downloaded object is saved to. | No |
| bucket.provider | string | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
| bucket.namespace | string | Determines the namespace of the bucket. | Yes |
| description | string | Description for the download request. | Yes |
| timeout | string | Timeout for request to external resource. | Yes |
| limits.connections | int | Number of concurrent connections each target can make. | Yes |
| limits.bytes_per_hour | int | Number of bytes the cluster can download in one hour. | Yes |
| link | string | URL of where the object is downloaded from. | No |
| object_name | string | Name of the object the download is saved as. If no object_name is provided, the name will be the last element in the URL's path. | Yes |
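Because limits.bytes_per_hour is expressed per hour, a target throughput usually has to be converted first. The following is a minimal sketch of that arithmetic; the helper name bytes_per_hour is hypothetical, and nesting the two limits under a "limits" object is an assumption made by analogy with how bucket.name appears as "bucket": {"name": ...} in the sample requests.

```python
def bytes_per_hour(mib_per_second: float) -> int:
    """Convert a sustained rate in MiB/s to a whole number of bytes per hour."""
    return int(mib_per_second * 1024 * 1024 * 3600)


# Assumed shape of the "limits" fragment of a request body (not confirmed
# by the table above, which only lists the dotted parameter names).
limits = {"connections": 4, "bytes_per_hour": bytes_per_hour(100)}
print(limits)
```

Here a cluster-wide cap of 100 MiB/s corresponds to 377487360000 bytes per hour.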

Sample Request

Single object download

$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "single",
  "bucket": {"name": "ubuntu"},
  "object_name": "ubuntu.iso",
  "link": "http://releases.ubuntu.com/18.04.1/ubuntu-18.04.1-desktop-amd64.iso"
}' -X POST 'http://localhost:8080/v1/download'

NOTE:

localhost:8080 (above and elsewhere in this document) can be replaced with any legitimate (http or https) address of any AIS gateway.
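For clients that assemble the request programmatically, the single-download body can be built as in the sketch below. It uses only the Python standard library; the helper names are hypothetical and are not part of any AIS SDK. default_object_name only illustrates the documented rule (the last element of the URL's path) and is not the server's actual implementation.

```python
import json
import posixpath
from typing import Optional
from urllib.parse import urlparse


def default_object_name(link: str) -> str:
    """Last element of the URL path: the documented default when object_name is omitted."""
    return posixpath.basename(urlparse(link).path)


def single_download_body(bucket: str, link: str, object_name: Optional[str] = None) -> str:
    """Build the JSON body for a "single" download request (hypothetical helper)."""
    body = {"type": "single", "bucket": {"name": bucket}, "link": link}
    if object_name is not None:
        # Only sent when the caller wants a name other than the default.
        body["object_name"] = object_name
    return json.dumps(body)


link = "http://releases.ubuntu.com/18.04.1/ubuntu-18.04.1-desktop-amd64.iso"
print(single_download_body("ubuntu", link, "ubuntu.iso"))
print(default_object_name(link))
```

The resulting string can be sent as the POST body to /v1/download on any AIS gateway, exactly as in the curl sample.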

Multi Download

A multi-object download requires either a map or a list in the JSON body:

  • Map - each entry maps custom_object_name (key) -> external_link (value). This format lets you choose object names explicitly instead of relying on the automatic naming used by the list format.
  • List - each entry is an external_link to a resource. Object names are derived from the base of the link.

On success, the request returns the job id, which can then be used to check the status of the download job or to abort it.

Request JSON Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| bucket.name | string | Bucket where the downloaded object is saved to. | No |
| bucket.provider | string | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
| bucket.namespace | string | Determines the namespace of the bucket. | Yes |
| description | string | Description for the download request. | Yes |
| timeout | string | Timeout for request to external resource. | Yes |
| limits.connections | int | Number of concurrent connections each target can make. | Yes |
| limits.bytes_per_hour | int | Number of bytes the cluster can download in one hour. | Yes |
| objects | array or map | The payload with the objects to download. | No |

Sample Request

Multi Download using object map

$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "multi",
  "bucket": {"name": "ubuntu"},
  "objects": {
    "train-labels.gz": "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "t10k-labels-idx1.gz": "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    "train-images.gz": "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
  }
}' -X POST 'http://localhost:8080/v1/download'

Multi Download using object list

$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "multi",
  "bucket": {"name": "ubuntu"},
  "objects": [
    "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
  ]
}' -X POST 'http://localhost:8080/v1/download'
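To see which object names the list form would produce, and to switch to the map form when those names are not what you want, a client can derive the names itself. The sketch below is an illustration only: list_to_map is a hypothetical helper, "base of the link" is interpreted here as the last element of the URL path, and raising on a duplicate name is this sketch's own choice rather than documented AIS behavior.

```python
import posixpath
from urllib.parse import urlparse


def list_to_map(links):
    """Turn a list of links into an {object_name: link} map (hypothetical helper)."""
    objects = {}
    for link in links:
        # Assumed naming rule: the base (last path element) of the link.
        name = posixpath.basename(urlparse(link).path)
        if name in objects:
            raise ValueError(f"duplicate object name: {name}")
        objects[name] = link
    return objects


links = [
    "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
]
for name, link in list_to_map(links).items():
    print(name, "<-", link)
```

The returned dictionary can be edited (for example, to shorten names) and then sent as the "objects" map of a multi download.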

Range Download

A range download retrieves multiple objects in one shot, relying on a commonly used naming convention. On success, the request returns the job id, which can then be used to check the status of the download job or to abort it.

Namely, the range download expects each object name to consist of prefix + index + suffix, as described below:

Range Format

Consider a website named randomwebsite.com/some_dir/ that contains the following files:

  • object1log.txt
  • object2log.txt
  • object3log.txt
  • …
  • object1000log.txt

To populate AIStore with objects in the range from object200log.txt to object300log.txt (101 objects total), use the range download.

Request JSON Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| bucket.name | string | Bucket where the downloaded object is saved to. | No |
| bucket.provider | string | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
| bucket.namespace | string | Determines the namespace of the bucket. | Yes |
| description | string | Description for the download request. | Yes |
| timeout | string | Timeout for request to external resource. | Yes |
| limits.connections | int | Number of concurrent connections each target can make. | Yes |
| limits.bytes_per_hour | int | Number of bytes the cluster can download in one hour. | Yes |
| subdir | string | Subdirectory in the bucket where the downloaded objects are saved to. | Yes |
| template | string | Bash template describing names of the objects in the URL. | No |

Sample Request

Download a (range) list of objects

$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{200..300}log.txt"
}' -X POST 'http://localhost:8080/v1/download'

Download a (range) list of objects into a subdirectory inside a bucket

$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{200..300}log.txt",
  "subdir": "some/subdir/"
}' -X POST 'http://localhost:8080/v1/download'

Download a (range) list of objects, selecting every tenth object

$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{1..1000..10}log.txt"
}' -X POST 'http://localhost:8080/v1/download'

Tip: use the -g option in curl to turn off its URL globbing parser; this lets you use { and } without escaping them.
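Before submitting a range job, it can help to preview which names a template covers. The sketch below expands a single {start..end} or {start..end..step} group in the style of Bash brace expansion. It is an approximation written for illustration: expand_template is a hypothetical helper, it handles only one numeric group with start not greater than end, and the parser AIS actually uses may differ in edge cases.

```python
import re
from typing import List

# One numeric brace group: {start..end} with an optional ..step
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)(?:\.\.(\d+))?\}")


def expand_template(template: str) -> List[str]:
    """Expand the first numeric brace range in template (hypothetical helper)."""
    match = _RANGE.search(template)
    if match is None:
        return [template]
    start_s, end_s, step_s = match.groups()
    step = int(step_s) if step_s else 1
    if step <= 0:
        raise ValueError("step must be positive")
    # Keep zero padding such as {0001..0010}; zfill(0) leaves the number as is.
    width = len(start_s) if start_s.startswith("0") else 0
    prefix, suffix = template[: match.start()], template[match.end():]
    return [
        prefix + str(i).zfill(width) + suffix
        for i in range(int(start_s), int(end_s) + 1, step)
    ]


names = expand_template("randomwebsite.com/some_dir/object{200..300}log.txt")
print(len(names), names[0], names[-1])
```

For the template above, the preview lists 101 names, from object200log.txt through object300log.txt; with {1..1000..10} it lists 100 names, starting at object1log.txt and ending at object991log.txt.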

Backend download

A backend download prefetches multiple objects that are contained in a given remote bucket and whose names match the provided prefix and suffix.

Request JSON Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| bucket.name | string | Bucket where the downloaded object is saved to. | No |
| bucket.provider | string | Determines the provider of the bucket. | Yes |
| bucket.namespace | string | Determines the namespace of the bucket. | Yes |
| description | string | Description for the download request. | Yes |
| sync | bool | Synchronizes the remote bucket: downloads new or updated objects (regular download) + checks and deletes cached objects if they are no longer present in the remote bucket. | Yes |
| prefix | string | Prefix of the object names to download. | Yes |
| suffix | string | Suffix of the object names to download. | Yes |

Sample Request

Download objects from a remote bucket

$ curl -Liv -H 'Content-Type: application/json' -d '{
  "type": "backend",
  "bucket": {"name": "lpr-vision", "provider": "gcp"},
  "prefix": "imagenet/imagenet_train-",
  "suffix": ".tgz"
}' -X POST 'http://localhost:8080/v1/download'
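The selection rule can be pictured as a simple name filter. The sketch below is a client-side illustration only: the matching itself happens inside the cluster, matches is a hypothetical helper, and the object names are made up for the example. An omitted prefix or suffix is modeled as an empty string, which matches every name.

```python
def matches(name: str, prefix: str = "", suffix: str = "") -> bool:
    """True if name starts with prefix and ends with suffix (hypothetical helper)."""
    return name.startswith(prefix) and name.endswith(suffix)


# Made-up listing of a remote bucket, for illustration only.
remote_objects = [
    "imagenet/imagenet_train-000001.tgz",
    "imagenet/imagenet_train-000002.tgz",
    "imagenet/imagenet_val-000001.tgz",
    "imagenet/README.md",
]
selected = [
    name for name in remote_objects
    if matches(name, prefix="imagenet/imagenet_train-", suffix=".tgz")
]
print(selected)
```

With the prefix and suffix from the sample request, only the two imagenet_train-*.tgz names are selected.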

Aborting

Any download request can be aborted at any time by making a DELETE request to /v1/download/abort with the job id (which is returned upon job creation).

Request JSON Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| id | string | Unique identifier of download job returned upon job creation. | No |

Sample Request

Abort download

$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X DELETE 'http://localhost:8080/v1/download/abort'

Status

The status of any download request can be queried at any time by making a GET request to /v1/download with the job id (which is returned upon job creation).

Request JSON Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| id | string | Unique identifier of download job returned upon job creation. | No |

Sample Request

Get download status

$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X GET 'http://localhost:8080/v1/download'

List of Downloads

The list of all download requests can be queried at any time. Note that this has the same syntax as Status except the id parameter is empty.

Request Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| regex | string | Regex for the description of download requests. | Yes |

Sample Requests

Get list of all downloads

$ curl -Li -X GET 'http://localhost:8080/v1/download'

Get list of downloads with description starting with a digit

$ curl -Li -H 'Content-Type: application/json' -d '{"regex": "^[0-9]"}' -X GET 'http://localhost:8080/v1/download'
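To check what a given regex would select before sending it, the pattern can be tried against sample descriptions locally. This is a rough sketch: the descriptions are made up, filter_by_description is a hypothetical helper, and Python's re module is used as a stand-in for whatever regex engine the cluster applies (the two may differ for advanced syntax, though not for a simple pattern like this one).

```python
import re
from typing import List


def filter_by_description(descriptions: List[str], pattern: str) -> List[str]:
    """Return the descriptions in which pattern is found (hypothetical helper)."""
    rx = re.compile(pattern)
    return [d for d in descriptions if rx.search(d)]


# Made-up job descriptions, for illustration only.
descriptions = ["2024 nightly sync", "imagenet shards", "7zip mirror"]
print(filter_by_description(descriptions, "^[0-9]"))
```

Only the descriptions that begin with a digit are kept, which is what the sample request above asks the cluster to list.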

Remove from List

Any aborted or finished download request can be removed from the list of downloads by making a DELETE request to /v1/download/remove with the job id (which is returned upon job creation).

Request JSON Parameters

| Name | Type | Description | Optional? |
| --- | --- | --- | --- |
| id | string | Unique identifier of download job returned upon job creation. | No |

Sample Request

Remove download job from the list

$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X DELETE 'http://localhost:8080/v1/download/remove'
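The status, abort, and remove calls differ only in HTTP method and path, so a client can build all three from one table. The sketch below uses the Python standard library and only the methods and paths shown in the samples above; job_request is a hypothetical helper, it merely constructs the request object, and no assumptions are made about the shape of the response.

```python
import json
import urllib.request

# Method and path per action, taken from the curl samples in this document.
_ACTIONS = {
    "status": ("GET", "/v1/download"),
    "abort": ("DELETE", "/v1/download/abort"),
    "remove": ("DELETE", "/v1/download/remove"),
}


def job_request(base_url: str, action: str, job_id: str) -> urllib.request.Request:
    """Build (but do not send) a job-control request (hypothetical helper)."""
    method, path = _ACTIONS[action]
    data = json.dumps({"id": job_id}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,  # the job id travels in the JSON body, as in the curl samples
        headers={"Content-Type": "application/json"},
        method=method,
    )


req = job_request("http://localhost:8080", "abort", "5JjIuGemR")
print(req.get_method(), req.full_url, req.data.decode())
```

Against a running gateway, the request would be sent with urllib.request.urlopen(req); that call is left out here so the sketch runs without a cluster.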