AIStore + HuggingFace: Distributed Downloads for Large-Scale Machine Learning

Aug 22, 2025·Nihal Nooney
Tags: aistore, huggingface, machine-learning, datasets, cli, performance

Machine learning teams increasingly rely on large datasets from HuggingFace to power their models. But traditional download tools struggle with terabyte-scale datasets containing thousands of files, creating bottlenecks that slow development cycles.

This post introduces AIStore’s new HuggingFace download integration, which enables efficient downloads of large datasets with parallel batch jobs.

Table of contents

  1. Background
  2. CLI Integration: Simplified Workflows
  3. Download Optimizations
  4. Complete Walkthrough: NonverbalTTS Dataset
  5. Next Steps
  6. Conclusion

Background

Sequential downloads create significant bottlenecks when dealing with complex datasets that have hundreds of thousands of files distributed across multiple directories.

AIStore addresses this by parallelizing downloads within each target using multiple workers (one per mountpath), batching jobs based on file size, and collecting file metadata in parallel. This approach leverages the network throughput from each individual target to the HuggingFace servers.

CLI Integration: Simplified Workflows

Prerequisites

The following examples assume an active AIStore cluster. If the destination buckets (e.g., ais://datasets, ais://models) don’t exist, they will be created automatically with default properties.

AIStore’s CLI includes HuggingFace-specific flags for the ais download command that handle distributed operations behind the scenes.

Basic Download Commands

# Download entire dataset
$ ais download --hf-dataset squad ais://datasets/squad/

# Download entire model
$ ais download --hf-model bert-base-uncased ais://models/bert/

# Download specific file
$ ais download --hf-dataset squad --hf-file train/0.parquet ais://datasets/squad/

Authentication and Configuration

# Export your HuggingFace token and use it for private/gated content
$ export HF_TOKEN=your_hf_token_here
$ ais download --hf-dataset private-dataset --hf-auth $HF_TOKEN ais://private-data/

# Control batching with blob threshold
$ ais download --hf-dataset large-dataset --blob-threshold 200MB ais://datasets/large/
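
To illustrate what the token is for, the sketch below attaches it to an HTTP request as a bearer token, which is how the HuggingFace Hub accepts access tokens. This is an illustration only: the helper name `build_hf_request` is hypothetical, and the post does not describe how `ais download` forwards `--hf-auth` internally. No request is actually sent.

```python
import os
import urllib.request


def build_hf_request(url, token=None):
    """Return a HEAD request for `url`, adding a bearer token if one is given.

    Hypothetical helper for illustration; not part of the AIStore CLI or SDK.
    """
    request = urllib.request.Request(url, method="HEAD")
    if token:
        # The HuggingFace Hub accepts access tokens as HTTP bearer tokens.
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Read the token from the environment, as in the shell example above.
req = build_hf_request(
    "https://huggingface.co/api/datasets/squad",
    token=os.environ.get("HF_TOKEN"),
)
print(req.get_method(), req.full_url)
```

Reading the token from the environment rather than hard-coding it keeps it out of scripts and shell history.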

Progress Monitoring

# Real-time progress tracking
$ ais show job --refresh 2s

# Detailed job information
$ ais show job download --verbose
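
In a script, the same refresh-every-few-seconds idea can be expressed as a small polling loop. The sketch below is a generic helper, not an AIStore API: `wait_for` and the `is_done` callback are hypothetical names, and the callback is assumed to wrap whatever status check you use.

```python
import time


def wait_for(is_done, interval=2.0, timeout=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call `is_done()` every `interval` seconds until it returns True.

    Returns True on success and False if `timeout` seconds elapse first.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated job that reports "finished" on the third status check.
states = iter(["pending", "running", "finished"])
done = wait_for(lambda: next(states) == "finished", interval=0.0)
print(done)  # True
```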

Download Optimizations

AIStore uses two techniques to improve download performance:

Job Batching: Size-Based Distribution

Job batching categorizes files based on configurable size thresholds:

# Configure blob threshold for job batching
$ ais download --hf-dataset squad --blob-threshold 100MB ais://ml-datasets/

Files are categorized into two groups:

  • Large files (at or above the blob threshold): Each gets its own download job for maximum parallelism
  • Small files (below the threshold): Batched together into a shared job to reduce per-job overhead

Figure: How AIStore batches files based on the size threshold (100MB in this example)
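
The split can be sketched in a few lines of Python. This is a minimal illustration, not AIStore's implementation: `FileInfo` and `plan_jobs` are hypothetical names, the file sizes are made up, and treating a file exactly at the threshold as "large" is an assumption based on the `>=` wording in the CLI output shown later in this post.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class FileInfo:
    name: str
    size: int  # size in bytes


def plan_jobs(files, blob_threshold):
    """Split files into individual jobs (large) and one batched job (small)."""
    individual = [f for f in files if f.size >= blob_threshold]
    batched = [f for f in files if f.size < blob_threshold]
    return individual, batched


# Made-up sizes, for illustration only.
files = [
    FileInfo("train/0.parquet", 485 * MIB),
    FileInfo("train/1.parquet", 40 * MIB),
    FileInfo("train/2.parquet", 511 * MIB),
    FileInfo("other/0.parquet", 12 * MIB),
]
individual, batched = plan_jobs(files, blob_threshold=100 * MIB)
print([f.name for f in individual])  # ['train/0.parquet', 'train/2.parquet']
print([f.name for f in batched])     # ['train/1.parquet', 'other/0.parquet']
```

A lower threshold moves more files into individual jobs (more parallelism, more jobs to track); a higher one pushes more files into the shared batch.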

Concurrent Metadata Collection

Before downloading files, AIStore makes parallel HEAD requests to the HuggingFace API to collect file metadata (like file sizes) concurrently rather than sequentially. This reduces setup time for datasets with many files.
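
The general pattern looks roughly like the sketch below, which fans out size lookups over a thread pool. It is not AIStore's code: `collect_sizes` is a hypothetical helper, and `fake_fetch_size` stands in for a real HEAD request so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_sizes(urls, fetch_size, max_workers=8):
    """Look up the size of every URL concurrently.

    `fetch_size(url) -> int` is supplied by the caller; in a real client it
    would issue a HEAD request and read the size from the response headers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so sizes line up with urls.
        sizes = list(pool.map(fetch_size, urls))
    return dict(zip(urls, sizes))


def fake_fetch_size(url):
    """Stand-in for a HEAD request; returns a made-up size."""
    return len(url) * 1000


urls = [f"https://example.com/data/{i}.parquet" for i in range(4)]
sizes = collect_sizes(urls, fake_fetch_size)
for url, size in sizes.items():
    print(url, size)
```

With sequential lookups the setup time grows with the number of files times the round-trip latency; with a pool of workers, many of those round trips overlap.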

Complete Walkthrough: NonverbalTTS Dataset

Let’s walk through an example downloading a machine learning dataset and processing it with ETL operations:

Walkthrough Prerequisites

For this walkthrough, we’ll use three buckets:

  • ais://deepvs - for the initial dataset download
  • ais://ml-dataset - for ETL-processed files
  • ais://ml-dataset-parsed - for the final parsed dataset

If these buckets don’t exist, they will be created automatically with default properties.

Step 1: Download Dataset with Configurable Job Batching

# Download deepvk/NonverbalTTS dataset with job batching
$ ais download --hf-dataset deepvk/NonverbalTTS ais://deepvs --blob-threshold 500MB --max-conns 5
Found 11 parquet files in dataset 'deepvk/NonverbalTTS'
Created 7 individual jobs for files >= 500MiB
Started download job dnl-B-oOHruKH9
To monitor the progress, run 'ais show job dnl-B-oOHruKH9 --progress'

Step 2: Monitor Distributed Job Execution

# Watch configurable job distribution across cluster targets
$ ais show job
download jobs
JOB ID            XACTION      STATUS      ERRORS   DESCRIPTION
dnl-B-oOHruKH9    D6JOGa7PH9   1 pending   0        multi-download -> ais://deepvs
dnl-zoOHr7PG3     D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/other/0.parquet -> ais://deepvs/0.parquet
dnl-oJOHruKG3     D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/1.parquet -> ais://deepvs/1.parquet
dnl-F_ogHauKH9    D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/2.parquet -> ais://deepvs/2.parquet
dnl-PoOHr7KG9     D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/3.parquet -> ais://deepvs/3.parquet
....

Step 3: Verify Download Completion

# Check bucket summary after download
$ ais ls ais://deepvs --summary
NAME           PRESENT   OBJECTS   SIZE (apparent, objects, remote)   USAGE(%)
ais://deepvs   yes       6 0       2.76GiB 2.76GiB 0B                 0%

Options for Using Downloaded Data

At this point, you have several options:

  1. Use directly: Work with the downloaded files as-is if they meet your requirements
  2. Transform with ETL: Apply preprocessing for format conversion, file organization, or data standardization
  3. Custom processing: Use your own tools for data preparation

Why transform? HuggingFace datasets often have complex paths or formats that benefit from standardization. This walkthrough demonstrates ETL transformations for file organization (consistent naming) and format conversion (Parquet → JSON for framework compatibility).
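
As a concrete picture of "consistent naming", the sketch below flattens a nested HuggingFace parquet path into a `<split>_<file>` object name. It is a hypothetical illustration: `flat_name` is not the batch-rename transformer's actual logic, which this post does not show.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def flat_name(source_url):
    """Build '<split>_<file>' from the last two path components of a URL."""
    parts = PurePosixPath(urlparse(source_url).path).parts
    split, filename = parts[-2], parts[-1]
    return f"{split}_{filename}"


url = (
    "https://huggingface.co/api/datasets/deepvk/NonverbalTTS"
    "/parquet/default/train/1.parquet"
)
print(flat_name(url))  # train_1.parquet
```

Including the split in the name keeps files from different splits distinct even when their base names (such as `0.parquet`) are the same.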

Step 4: Initialize ETL Transformers

Note: ETL operations require AIStore to be deployed on Kubernetes. See ETL documentation for deployment requirements and setup instructions.

Before applying transformations, initialize the required ETL containers:

# Initialize batch-rename ETL transformer for file organization
$ ais etl init -f https://raw.githubusercontent.com/NVIDIA/ais-etl/main/transformers/batch_rename/etl_spec.yaml

# Initialize parquet-parser ETL transformer for data parsing
$ ais etl init -f https://raw.githubusercontent.com/NVIDIA/ais-etl/main/transformers/parquet-parser/etl_spec.yaml

# Verify ETL transformers are running
$ ais etl show

Step 5: Preprocessing using ETL

# Organize and rename files using batch rename ETL
$ ais etl bucket batch-rename-etl ais://deepvs ais://ml-dataset
etl-bucket[BatchRename] ais://deepvs => ais://ml-dataset

# Verify renamed files with structured naming
$ ais ls ais://ml-dataset/
NAME              SIZE
train_0.parquet   485MiB
train_1.parquet   492MiB
train_2.parquet   511MiB
...

# Convert parquet files to JSON format for easier ML framework integration
$ ais etl bucket parquet-parser-etl ais://ml-dataset ais://ml-dataset-parsed
etl-bucket[xO_sVT3Im] ais://ml-dataset => ais://ml-dataset-parsed

# Verify processed dataset ready for ML training
$ ais ls ais://ml-dataset-parsed --summary
NAME                      PRESENT   OBJECTS   SIZE (apparent, objects, remote)   USAGE(%)
ais://ml-dataset-parsed   yes       7 0       8.68GiB 8.68GiB 0B                 1%

Step 6: ML Pipeline Integration

AIStore integrates seamlessly with popular ML frameworks. Here’s how to use the processed dataset in your training pipeline:

Option A: Direct SDK Usage (Simple)

from aistore.sdk import Client
import json

client = Client("http://localhost:51080")
bucket = client.bucket("ml-dataset-parsed")

# Load processed training data.
# list_all_objects() returns the full list of entries; list_objects() returns a single page.
for entry in bucket.list_all_objects():
    if entry.name.startswith("train_"):
        data = json.loads(bucket.object(entry.name).get_reader().read_all())
        # Process individual training samples
        for sample in data:
            # Your training logic here
            pass

Option B: PyTorch Integration (Recommended for ML Training)

from aistore.sdk import Client
from aistore.pytorch import AISIterDataset
from torch.utils.data import DataLoader
import json

# Create dataset that reads directly from the cluster
client = Client("http://localhost:51080")
dataset = AISIterDataset(ais_source_list=client.bucket("ml-dataset-parsed"))

# Configure DataLoader with multiprocessing
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,  # Parallel data loading across multiple cores
)

# Training loop
for batch_names, batch_data in loader:
    # Parse JSON data
    parsed_samples = [json.loads(data) for data in batch_data]

    # Convert to tensors and train your model
    # model.train_step(parsed_samples)
    pass
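
The JSON parsing step can also move into the data-loading workers by passing a custom `collate_fn` to `DataLoader`. The sketch below assumes each dataset item is a `(name, raw_bytes)` pair, as the loop above suggests; `collate_json` and the sample object names are hypothetical, and the function itself needs only the standard library.

```python
import json


def collate_json(batch):
    """Turn a list of (name, raw_bytes) pairs into (names, parsed_objects)."""
    names = [name for name, _ in batch]
    samples = [json.loads(raw) for _, raw in batch]
    return names, samples


# Stand-in batch with made-up object names and contents.
batch = [
    ("train_0.json", b'{"text": "hello", "label": 0}'),
    ("train_1.json", b'{"text": "world", "label": 1}'),
]
names, samples = collate_json(batch)
print(names)
print(samples[1]["label"])
```

It would be wired in as `DataLoader(dataset, batch_size=32, num_workers=4, collate_fn=collate_json)`, so that each worker parses its own batches instead of the training loop doing so.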

Next Steps

The HuggingFace integration opens up some practical areas for expansion:

Download and Transform API: AIStore supports combining download and ETL transformation in a single API call, eliminating the two-step process shown in the walkthrough. This allows downloading HuggingFace datasets with immediate transformation (e.g., Parquet → JSON) in one operation. CLI integration for this functionality is in development.

Additional Dataset Formats: Beyond the current Parquet support, HuggingFace datasets are available in multiple formats that teams commonly need:

  • JSON format - Direct JSON downloads for frameworks requiring this format
  • CSV format - For traditional data processing workflows
  • WebDataset format - For large-scale ML pipelines using WebDataset

Conclusion

AIStore’s HuggingFace integration addresses common dataset download bottlenecks in machine learning workflows. Job batching and concurrent metadata collection enable efficient, parallel downloads of terabyte-scale datasets that would otherwise overwhelm traditional tools. Once stored in AIStore, teams can leverage local ETL operations to transform and prepare data without additional network transfers. This approach provides a streamlined path from raw downloads to training-ready datasets, eliminating the typical download-wait-process cycle that slows ML development.


References:

AIStore Core Documentation

  • AIStore GitHub
  • AIStore Blog
  • AIStore Downloader Documentation
  • AIStore Python SDK
  • AIStore PyTorch Integration - High-performance data loading for ML training

ETL (Extract, Transform, Load) Resources

  • ETL Documentation - Comprehensive guide to AIStore ETL capabilities and Kubernetes deployment
  • ETL CLI Reference - Command-line interface for ETL operations
  • Batch-Rename Transformer - File organization and renaming
  • Parquet Parser Transformer - Parquet to JSON conversion
  • AIStore Kubernetes Deployment - Production Kubernetes deployment tools and documentation

External Resources

  • HuggingFace Documentation
  • HuggingFace Datasets API Reference
  • Apache Parquet Format Specification