AIStore + HuggingFace: Distributed Downloads for Large-Scale Machine Learning

Machine learning teams increasingly rely on large datasets from HuggingFace to power their models. But traditional download tools struggle with terabyte-scale datasets containing thousands of files, creating bottlenecks that slow development cycles.

This post introduces AIStore’s new HuggingFace download integration, which enables efficient downloads of large datasets with parallel batch jobs.

Table of contents

  1. Background
  2. CLI Integration: Simplified Workflows
  3. Download Optimizations
  4. Complete Walkthrough: NonverbalTTS Dataset
  5. Next Steps
  6. Conclusion

Background

Sequential downloads create significant bottlenecks when dealing with complex datasets that have hundreds of thousands of files distributed across multiple directories.

AIStore addresses this in three ways: each target downloads in parallel using multiple workers (one per mountpath), jobs are batched by file size, and file metadata is collected in parallel. Each target downloads directly from the HuggingFace servers, so the cluster uses the network throughput available to every target.

CLI Integration: Simplified Workflows

Prerequisites

The following examples assume an active AIStore cluster. If the destination buckets (e.g., ais://datasets, ais://models) don’t exist, they will be created automatically with default properties.

AIStore’s CLI includes HuggingFace-specific flags for the ais download command that handle distributed operations behind the scenes.

Basic Download Commands

# Download entire dataset
$ ais download --hf-dataset squad ais://datasets/squad/

# Download entire model
$ ais download --hf-model bert-base-uncased ais://models/bert/

# Download specific file
$ ais download --hf-dataset squad --hf-file train/0.parquet ais://datasets/squad/

Authentication and Configuration

# Export your HuggingFace token and use it for private/gated content
$ export HF_TOKEN=your_hf_token_here
$ ais download --hf-dataset private-dataset --hf-auth "$HF_TOKEN" ais://private-data/

# Control batching with blob threshold
$ ais download --hf-dataset large-dataset --blob-threshold 200MB ais://datasets/large/
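
For scripted pipelines, the same commands can be assembled programmatically. The sketch below is a hypothetical helper (`build_hf_download_cmd` is not part of AIStore) that builds the argument list for `ais download` from the flags shown above. It only constructs the command; running it is left to the caller.

```python
from typing import List, Optional


def build_hf_download_cmd(
    dest: str,
    dataset: Optional[str] = None,
    model: Optional[str] = None,
    hf_file: Optional[str] = None,
    blob_threshold: Optional[str] = None,
    token: Optional[str] = None,
) -> List[str]:
    """Build the argv for `ais download` using only the flags shown in this post."""
    # Exactly one source must be given: a dataset or a model.
    if (dataset is None) == (model is None):
        raise ValueError("pass exactly one of dataset or model")

    cmd = ["ais", "download"]
    cmd += ["--hf-dataset", dataset] if dataset else ["--hf-model", model]
    if hf_file:
        cmd += ["--hf-file", hf_file]
    if token:
        cmd += ["--hf-auth", token]
    if blob_threshold:
        cmd += ["--blob-threshold", blob_threshold]
    cmd.append(dest)  # destination bucket goes last, as in the examples above
    return cmd


if __name__ == "__main__":
    cmd = build_hf_download_cmd(
        "ais://datasets/squad/", dataset="squad", hf_file="train/0.parquet"
    )
    print(" ".join(cmd))
```

Passing the result to `subprocess.run(cmd, check=True)` would execute it. For private content, reading the token from the environment (for example `token=os.environ["HF_TOKEN"]`) keeps it out of scripts and shell history.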

Progress Monitoring

# Real-time progress tracking
$ ais show job --refresh 2s

# Detailed job information
$ ais show job download --verbose

Download Optimizations

AIStore uses two techniques to improve download performance: size-based job batching and concurrent metadata collection.

Job Batching: Size-Based Distribution

Job batching categorizes files based on configurable size thresholds:

# Configure blob threshold for job batching
$ ais download --hf-dataset squad --blob-threshold 100MB ais://ml-datasets/

Files are categorized into two groups:

  • Large files (at or above the blob threshold): Each gets an individual download job for maximum parallelism
  • Small files (below the threshold): Batched together into a shared job to reduce per-job overhead

Figure: How AIStore batches files based on size threshold (100MB in this example)
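
The categorization above can be sketched in a few lines of Python. This is an illustration of the idea, not AIStore's implementation: `plan_jobs` and `JobPlan` are hypothetical names, the `>=` comparison is inferred from the "files >= 500MiB" message in the walkthrough output, and placing all small files in a single shared job is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobPlan:
    individual: List[str]  # files that each get their own download job
    batched: List[str]     # files that share one multi-file download job


def plan_jobs(file_sizes: Dict[str, int], blob_threshold: int) -> JobPlan:
    """Split files into individual and batched jobs by size (in bytes)."""
    individual: List[str] = []
    batched: List[str] = []
    # Sort by name so the plan is deterministic regardless of input order.
    for name, size in sorted(file_sizes.items()):
        if size >= blob_threshold:
            individual.append(name)
        else:
            batched.append(name)
    return JobPlan(individual=individual, batched=batched)


if __name__ == "__main__":
    MiB = 1024 * 1024
    sizes = {"train/0.parquet": 485 * MiB, "train/1.parquet": 20 * MiB}
    print(plan_jobs(sizes, blob_threshold=100 * MiB))
```

A lower threshold produces more individual jobs (more parallelism, more per-job overhead); a higher one pushes more files into the shared job.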

Concurrent Metadata Collection

Before downloading any files, AIStore collects file metadata (such as file sizes) by issuing HEAD requests to the HuggingFace API in parallel rather than one at a time. This reduces setup time for datasets with many files.
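
As a rough illustration of the pattern, the sketch below collects sizes for a list of URLs with a thread pool. It is not AIStore's code: `collect_sizes` and `head_content_length` are hypothetical names, the worker count is arbitrary, and authentication headers and redirect handling are omitted.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def head_content_length(url: str, timeout: float = 10.0) -> int:
    """Issue a HEAD request and return Content-Length (-1 if the header is absent)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return int(resp.headers.get("Content-Length", -1))


def collect_sizes(
    urls: Iterable[str],
    fetch_size: Callable[[str], int] = head_content_length,
    max_workers: int = 8,
) -> Dict[str, int]:
    """Fetch sizes for all URLs concurrently; the result keeps the input order."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_size in parallel but yields results in input order.
        return dict(zip(url_list, pool.map(fetch_size, url_list)))
```

With sequential requests, setup time grows with the number of files times the round-trip latency; with a pool of workers it grows roughly with that product divided by the worker count. The `fetch_size` parameter exists so the function can be exercised without network access.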

Complete Walkthrough: NonverbalTTS Dataset

Let’s walk through an example downloading a machine learning dataset and processing it with ETL operations:

Walkthrough Prerequisites

This walkthrough uses three buckets:

  • ais://deepvs - for the initial dataset download
  • ais://ml-dataset - for ETL-processed files
  • ais://ml-dataset-parsed - for the final parsed dataset

If these buckets don’t exist, they will be created automatically with default properties.

Step 1: Download Dataset with Configurable Job Batching

# Download deepvk/NonverbalTTS dataset with job batching
$ ais download --hf-dataset deepvk/NonverbalTTS ais://deepvs --blob-threshold 500MB --max-conns 5
Found 11 parquet files in dataset 'deepvk/NonverbalTTS'
Created 7 individual jobs for files >= 500MiB
Started download job dnl-B-oOHruKH9
To monitor the progress, run 'ais show job dnl-B-oOHruKH9 --progress'
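
When the download is launched from a script, the job ID can be pulled out of the CLI output and reused for monitoring. The helper below is a hypothetical sketch that assumes the "Started download job ..." line keeps the format shown above, which may differ between AIStore releases.

```python
import re
from typing import Optional

# Matches the line the CLI prints after submitting the download (format assumed).
_JOB_ID_PATTERN = re.compile(r"Started download job (\S+)")


def extract_job_id(cli_output: str) -> Optional[str]:
    """Return the download job ID found in `ais download` output, or None."""
    match = _JOB_ID_PATTERN.search(cli_output)
    return match.group(1) if match else None
```

The returned ID can then be passed to `ais show job <job-id> --progress`, as the CLI output itself suggests.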

Step 2: Monitor Distributed Job Execution

# Watch how the download jobs are distributed across cluster targets
$ ais show job
download jobs
JOB ID           XACTION      STATUS      ERRORS   DESCRIPTION
dnl-B-oOHruKH9   D6JOGa7PH9   1 pending   0        multi-download -> ais://deepvs
dnl-zoOHr7PG3    D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/other/0.parquet -> ais://deepvs/0.parquet
dnl-oJOHruKG3    D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/1.parquet -> ais://deepvs/1.parquet
dnl-F_ogHauKH9   D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/2.parquet -> ais://deepvs/2.parquet
dnl-PoOHr7KG9    D6JOGa7PH9   1 pending   0        https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/3.parquet -> ais://deepvs/3.parquet
....

Step 3: Verify Download Completion

# Check bucket summary after download
$ ais ls ais://deepvs --summary
NAME           PRESENT   OBJECTS   SIZE (apparent, objects, remote)   USAGE(%)
ais://deepvs   yes       6 0       2.76GiB 2.76GiB 0B                 0%

Options for Using Downloaded Data

At this point, you have several options:

  1. Use directly: Work with the downloaded files as-is if they meet your requirements
  2. Transform with ETL: Apply preprocessing for format conversion, file organization, or data standardization
  3. Custom processing: Use your own tools for data preparation

Why transform? HuggingFace datasets often have complex paths or formats that benefit from standardization. This walkthrough demonstrates ETL transformations for file organization (consistent naming) and format conversion (Parquet → JSON for framework compatibility).

Step 4: Initialize ETL Transformers

Note: ETL operations require AIStore to be deployed on Kubernetes. See ETL documentation for deployment requirements and setup instructions.

Before applying transformations, initialize the required ETL containers:

# Initialize batch-rename ETL transformer for file organization
$ ais etl init -f https://raw.githubusercontent.com/NVIDIA/ais-etl/main/transformers/batch_rename/etl_spec.yaml

# Initialize parquet-parser ETL transformer for data parsing
$ ais etl init -f https://raw.githubusercontent.com/NVIDIA/ais-etl/main/transformers/parquet-parser/etl_spec.yaml

# Verify ETL transformers are running
$ ais etl show

Step 5: Preprocessing using ETL

# Organize and rename files using batch rename ETL
$ ais etl bucket batch-rename-etl ais://deepvs ais://ml-dataset
etl-bucket[BatchRename] ais://deepvs => ais://ml-dataset

# Verify renamed files with structured naming
$ ais ls ais://ml-dataset/
NAME              SIZE
train_0.parquet   485MiB
train_1.parquet   492MiB
train_2.parquet   511MiB
...

# Convert parquet files to JSON format for easier ML framework integration
$ ais etl bucket parquet-parser-etl ais://ml-dataset ais://ml-dataset-parsed
etl-bucket[xO_sVT3Im] ais://ml-dataset => ais://ml-dataset-parsed

# Verify processed dataset ready for ML training
$ ais ls ais://ml-dataset-parsed --summary
NAME                      PRESENT   OBJECTS   SIZE (apparent, objects, remote)   USAGE(%)
ais://ml-dataset-parsed   yes       7 0       8.68GiB 8.68GiB 0B                 1%

Step 6: ML Pipeline Integration

AIStore integrates seamlessly with popular ML frameworks. Here’s how to use the processed dataset in your training pipeline:

Option A: Direct SDK Usage (Simple)

from aistore.sdk import Client
import json

client = Client("http://localhost:51080")
bucket = client.bucket("ml-dataset-parsed")

# Load processed training data
# list_all_objects() returns the bucket entries; filter to the training files by prefix
for entry in bucket.list_all_objects(prefix="train_"):
    data = json.loads(bucket.object(entry.name).get_reader().read_all())
    # Process individual training samples
    for sample in data:
        # Your training logic here
        pass

Option B: PyTorch DataLoader

from aistore.sdk import Client
from aistore.pytorch import AISIterDataset
from torch.utils.data import DataLoader
import json

# Create dataset that reads directly from the cluster
client = Client("http://localhost:51080")
dataset = AISIterDataset(ais_source_list=client.bucket("ml-dataset-parsed"))

# Configure DataLoader with multiprocessing
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,  # Parallel data loading across multiple worker processes
)

# Training loop
for batch_names, batch_data in loader:
    # Parse JSON data
    parsed_samples = [json.loads(data) for data in batch_data]

    # Convert to tensors and train your model
    # model.train_step(parsed_samples)
    pass
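
Both options above call `json.loads` on an entire object, which works when each object holds a single JSON document. The post does not state the exact layout the parquet-parser transformer produces, so the hypothetical helper below (not part of the AIStore SDK) is a defensive sketch that accepts either a single JSON document or newline-delimited JSON.

```python
import json
from typing import Any, List


def parse_records(payload: bytes) -> List[Any]:
    """Parse a JSON array, a single JSON value, or newline-delimited JSON into a list."""
    text = payload.decode("utf-8").strip()
    if not text:
        return []
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # Normalize so callers can always iterate over records.
    return doc if isinstance(doc, list) else [doc]
```

In either option, `json.loads(...)` could be swapped for `parse_records(...)` so the training loop always receives a list of records.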

Next Steps

The HuggingFace integration opens up some practical areas for expansion:

Download and Transform API: AIStore supports combining download and ETL transformation in a single API call, eliminating the two-step process shown in the walkthrough. This allows downloading HuggingFace datasets with immediate transformation (e.g., Parquet → JSON) in one operation. CLI integration for this functionality is in development.

Additional Dataset Formats: Beyond the current Parquet support, HuggingFace datasets are available in multiple formats that teams commonly need:

  • JSON format - Direct JSON downloads for frameworks requiring this format
  • CSV format - For traditional data processing workflows
  • WebDataset format - For large-scale ML pipelines using WebDataset

Conclusion

AIStore’s HuggingFace integration addresses common dataset download bottlenecks in machine learning workflows. Job batching and concurrent metadata collection enable efficient, parallel downloads of terabyte-scale datasets that would otherwise overwhelm traditional tools. Once stored in AIStore, teams can leverage local ETL operations to transform and prepare data without additional network transfers. This approach provides a streamlined path from raw downloads to training-ready datasets, eliminating the typical download-wait-process cycle that slows ML development.
