AIStore SDK & ETL: Transform an image dataset with AIS SDK and load into PyTorch

Apr 03, 2023·Aaron Wilson
aistore, etl, pytorch, python

Note: This blog post references init_code which has been removed and replaced with init_class. For the most up-to-date ETL initialization methods, please refer to the init_class documentation.

With recent updates to the Python SDK, it’s easier than ever to load data into AIS, transform it, and use it for training with PyTorch. In this post, we’ll demonstrate how to do that with a small dataset of images.

In a previous series of posts, we transformed the ImageNet dataset using a mixture of CLI and SDK commands. For background, you can view these posts below, but note that much of the syntax is out of date:

  • AIStore & ETL: Introduction
  • AIStore & ETL: Using AIS/PyTorch connector to transform ImageNet (post #2)

Setup

As we did in the posts above, we’ll assume that an instance of AIStore has already been deployed on Kubernetes. All the code below expects an AIS_ENDPOINT environment variable set to the cluster’s endpoint.
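As a minimal sketch, a script might resolve and sanity-check the endpoint before creating a client. The helper name `resolve_endpoint` is an illustrative assumption, not part of the SDK, and the fallback URL simply mirrors the default used in the code later in this post.

```python
import os
from urllib.parse import urlparse

# Fallback mirrors the default used later in this post; adjust for your cluster.
DEFAULT_ENDPOINT = "http://192.168.49.2:8080"


def resolve_endpoint(env=None):
    """Return the endpoint from AIS_ENDPOINT, or the fallback if it is unset.

    `resolve_endpoint` is a hypothetical helper, not an SDK function.
    """
    env = os.environ if env is None else env
    endpoint = env.get("AIS_ENDPOINT", DEFAULT_ENDPOINT)
    parsed = urlparse(endpoint)
    # Fail early on values such as "192.168.49.2:8080" that lack an http(s) scheme.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"AIS_ENDPOINT is not a valid http(s) URL: {endpoint!r}")
    return endpoint


print(resolve_endpoint())
```

Checking the value up front tends to produce a clearer error than a failed connection attempt later in the script.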

To set up a local Kubernetes cluster and deploy AIStore on it, check out the docs here. For more advanced deployments, take a look at our dedicated ais-k8s repository.

We’ll be using PyTorch’s torchvision to transform the Oxford-IIIT Pet Dataset, as illustrated:

[Figure: AIS-ETL Overview]

To interact with the cluster, we’ll be using the AIS Python SDK. Set up your Python environment and install the following requirements:

aistore
torchvision
torch

The Dataset

For this demo we will be using the Oxford-IIIT Pet Dataset since it is less than 1 GB. The ImageNet Dataset is another reasonable choice, but requires much larger downloads.

Once downloaded, the dataset includes an images folder and an annotations folder. For this example we will focus on the images directory, which consists of differently sized .jpg images.

import os
import io
import sys
from PIL import Image
from torchvision import transforms
import torch

from aistore.pytorch import AISDataset
from aistore.sdk import Client
from aistore.sdk.multiobj import ObjectRange

AISTORE_ENDPOINT = os.getenv("AIS_ENDPOINT", "http://192.168.49.2:8080")
client = Client(AISTORE_ENDPOINT)
bucket_name = "images"


def show_image(image_data):
    with Image.open(io.BytesIO(image_data)) as image:
        image.show()


def load_data():
    # First, let's create a bucket and put the data into AIS
    bucket = client.bucket(bucket_name).create()
    bucket.put_files("images/", pattern="*.jpg")
    # Show a random (non-transformed) image from the dataset
    image_data = bucket.object("Bengal_171.jpg").get_reader().read_all()
    show_image(image_data)


load_data()

[Image: example cat image]

The class for this image can also be found in the annotations data:

Bengal_171 6 1 2

Translates to
Class: 6 (ID)
Species: 1 (cat)
Breed: 2 (Bengal)
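To make the record layout concrete, here is a minimal parsing sketch. The names `PetAnnotation` and `parse_annotation` are hypothetical. The text above only confirms that species code 1 means cat; mapping code 2 to dog is an assumption based on the dataset's description.

```python
from dataclasses import dataclass

# Species code 1 (cat) is confirmed above; 2 (dog) is an assumption.
SPECIES = {1: "cat", 2: "dog"}


@dataclass(frozen=True)
class PetAnnotation:
    image: str     # file name without extension, e.g. "Bengal_171"
    class_id: int  # class ID across all breeds
    species: str   # "cat" or "dog"
    breed_id: int  # breed ID within the species


def parse_annotation(line):
    """Parse one whitespace-separated 'image class species breed' record."""
    image, class_id, species_id, breed_id = line.split()
    return PetAnnotation(image, int(class_id), SPECIES[int(species_id)], int(breed_id))


print(parse_annotation("Bengal_171 6 1 2"))
```

Labels parsed this way could later be joined with the image objects by name when assembling training pairs.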

Transforming the data

Now that the data is in place, we need to define the transformation we want to apply before training on the data. Below we will deploy transformation code on an ETL K8s container. Once this code is deployed as an ETL in AIS, it can be applied to buckets or objects to transform them on the cluster.

def etl():
    def img_to_bytes(img):
        # Serialize a PIL image to JPEG bytes
        buf = io.BytesIO()
        img = img.convert('RGB')
        img.save(buf, format='JPEG')
        return buf.getvalue()

    # Read the object from stdin and write the transformed result to stdout
    input_bytes = sys.stdin.buffer.read()
    image = Image.open(io.BytesIO(input_bytes)).convert('RGB')
    preprocessing = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        transforms.ToPILImage(),
        transforms.Lambda(img_to_bytes),
    ])
    processed_bytes = preprocessing(image)
    sys.stdout.buffer.write(processed_bytes)
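Because the function above only reads bytes from stdin and writes bytes to stdout, the same pattern can be exercised locally before anything is deployed. The harness below is a sketch under that assumption: `run_io_transform` is a hypothetical helper, not part of the SDK, and `upper_case` is a trivial stand-in for the image transform so the example runs without torchvision.

```python
import io
import sys


def run_io_transform(transform, input_bytes):
    """Run a stdin/stdout-style transform locally and return its output bytes."""
    in_buf, out_buf = io.BytesIO(input_bytes), io.BytesIO()
    # TextIOWrapper exposes the underlying byte stream as `.buffer`,
    # just like the real sys.stdin and sys.stdout.
    fake_stdin, fake_stdout = io.TextIOWrapper(in_buf), io.TextIOWrapper(out_buf)
    old_stdin, old_stdout = sys.stdin, sys.stdout
    sys.stdin, sys.stdout = fake_stdin, fake_stdout
    try:
        transform()
        fake_stdout.flush()
        return out_buf.getvalue()
    finally:
        # Always restore the real streams, even if the transform raises.
        sys.stdin, sys.stdout = old_stdin, old_stdout


def upper_case():
    """Stand-in transform with the same shape as etl(): stdin bytes in, stdout bytes out."""
    data = sys.stdin.buffer.read()
    sys.stdout.buffer.write(data.upper())


result = run_io_transform(upper_case, b"abc")
print(result)
```

With torchvision and Pillow installed, passing `etl` and the bytes of a local .jpg in place of `upper_case` and `b"abc"` should exercise the real transform the same way.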

Initializing

We will use the python3 (python:3.10) runtime and install the torchvision package to run the etl function above. The Python SDK’s init_code automatically selects the current version of Python (if supported) as the runtime, for compatibility with the code passed in. To use a different runtime, check out the init_spec option.

A runtime is a predefined work environment in which the provided code or script runs. A full list of supported runtimes can be found here.

def create_etl():
    image_etl = client.etl("transform-images")
    image_etl.init_code(
        transform=etl,
        dependencies=["torchvision"],
        communication_type="io")
    # Return the ETL object so later steps can use its name and view()
    return image_etl


image_etl = create_etl()

This initialization may take a few minutes to run, as it must download torchvision and all its dependencies.

def show_etl(etl):
    print(client.cluster().list_running_etls())
    print(etl.view())


show_etl(image_etl)
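Since initialization can take a few minutes, a script may want to wait until the ETL is reported as running before using it. The generic polling helper below is a sketch, not an SDK feature: `wait_until` is a hypothetical name, and the predicate passed to it is up to the caller (the SDK may also provide its own ways to wait).

```python
import time


def wait_until(predicate, timeout=300.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demonstration with a stand-in predicate that succeeds on the third check.
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(ready, timeout=10, interval=0))
```

In practice the predicate would inspect the output of `client.cluster().list_running_etls()`; the exact shape of that output depends on the SDK version, so it is left out here.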

Inline and Offline ETL

AIS supports both inline (applied when getting objects) and offline (bucket to bucket) ETL. For more info see the ETL docs here.
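The difference between the two modes can be illustrated with a toy in-memory model. This is only a conceptual sketch: plain dictionaries stand in for buckets, `transform` is a stand-in function, and none of the names below are AIS APIs.

```python
def transform(data: bytes) -> bytes:
    """Stand-in transformation; the real ETL runs inside the cluster."""
    return data.upper()


def get_inline(bucket: dict, name: str) -> bytes:
    """Inline: transform one object on the fly as it is read.

    The stored object is left unchanged.
    """
    return transform(bucket[name])


def transform_offline(src: dict, dst: dict) -> None:
    """Offline: transform every source object and store the result in a destination."""
    for name, data in src.items():
        dst[name] = transform(data)


source = {"a.txt": b"cat", "b.txt": b"dog"}
destination = {}

print(get_inline(source, "a.txt"))  # transformed on read; source is untouched
transform_offline(source, destination)
print(sorted(destination))          # destination now holds transformed copies
```

Inline suits reads where the transformed result is needed once; offline pays the cost up front so that later reads from the destination need no further transformation.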

Transforming a single object inline

With the ETL defined, we can use it when accessing our data.

def get_with_etl(etl):
    transformed_data = client.bucket(bucket_name).object("Bengal_171.jpg").get_reader(etl_name=etl.name).read_all()
    show_image(transformed_data)


get_with_etl(image_etl)

Post-transform image:

[Image: example image transformed]

Transforming an entire bucket offline

Note that the job below may take a long time to run depending on your machine and the images you are transforming. You can view all jobs with client.cluster().list_running_jobs(). If you’d like to run a shorter example, you can limit which images are transformed with the prefix_filter option in the bucket.transform function:

def etl_bucket(etl):
    dest_bucket = client.bucket("transformed-images").create()
    transform_job = client.bucket(bucket_name).transform(etl_name=etl.name, to_bck=dest_bucket)
    client.job(transform_job).wait()
    # Build a list so the object names are printed rather than a generator object
    print([entry.name for entry in dest_bucket.list_all_objects()])


etl_bucket(image_etl)
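For the shorter run mentioned above, the call might be wrapped as follows. This is a sketch: `transform_with_prefix` is a hypothetical wrapper, and it relies only on the `prefix_filter` option of `bucket.transform` named in the text, alongside the `etl_name` and `to_bck` arguments already used in `etl_bucket`.

```python
def transform_with_prefix(src_bucket, etl_name, dest_bucket, prefix):
    """Start an offline transform limited to objects whose names begin with `prefix`.

    Hypothetical wrapper: `src_bucket` and `dest_bucket` are expected to be
    SDK bucket objects, as in etl_bucket() above.
    """
    return src_bucket.transform(
        etl_name=etl_name,
        to_bck=dest_bucket,
        prefix_filter=prefix,  # option named in the text above
    )
```

A call such as `transform_with_prefix(client.bucket(bucket_name), image_etl.name, dest_bucket, "Bengal_")` would then return a job ID that can be passed to `client.job(...).wait()`, just as in `etl_bucket`.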

Transforming multiple objects offline

We can also utilize the SDK’s object group feature to transform a selection of several objects with the defined ETL.

def etl_group(etl):
    dest_bucket = client.bucket("transformed-selected-images").create()
    # Select a range of objects from the source bucket
    object_range = ObjectRange(min_index=0, max_index=100, prefix="Bengal_", suffix=".jpg")
    object_group = client.bucket(bucket_name).objects(obj_range=object_range)
    transform_job = object_group.transform(etl_name=etl.name, to_bck=dest_bucket)
    client.job(transform_job).wait_for_idle(timeout=300)
    print([entry.name for entry in dest_bucket.list_all_objects()])


etl_group(image_etl)
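To see which object names a range like the one above refers to, it can help to expand it locally. The function below is an illustration only, not the SDK's ObjectRange implementation: `expand_range` is a hypothetical name, and it assumes inclusive bounds and no zero padding.

```python
def expand_range(prefix, min_index, max_index, suffix=""):
    """List the names covered by a prefix/index/suffix range.

    Assumes inclusive bounds and no zero padding (both are assumptions).
    """
    return [f"{prefix}{i}{suffix}" for i in range(min_index, max_index + 1)]


names = expand_range("Bengal_", 0, 100, ".jpg")
print(names[0], names[-1], len(names))
```

Under these assumptions the range spans Bengal_0.jpg through Bengal_100.jpg, which includes the Bengal_171.jpg example only if max_index is raised accordingly.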

AIS/PyTorch connector

In the steps above, we demonstrated a few ways to transform objects, but to use the results we need to load them into a PyTorch Dataset and DataLoader. In PyTorch, a dataset can be defined by inheriting from torch.utils.data.Dataset. Datasets can be fed into a DataLoader, which handles batching, shuffling, etc. (see torch.utils.data.DataLoader).

To implement inline ETL, transforming objects as we read them, you will need to create a custom PyTorch Dataset as described by PyTorch here. In the future, AIS will likely provide some of this functionality directly. For now, we will use the output of the offline ETL (bucket-to-bucket) described above and use the provided AISDataset to read the transformed results. More info on reading AIS data into PyTorch can be found on the AIS blog here.

def create_dataloader():
    # Construct a dataset and dataloader to read data from the transformed bucket
    dataset = AISDataset(AISTORE_ENDPOINT, "ais://transformed-images")
    train_loader = torch.utils.data.DataLoader(dataset, shuffle=True)
    return train_loader


data_loader = create_dataloader()

This data loader can now be used with PyTorch to train a full model.
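For the inline route mentioned earlier, a custom dataset could look roughly like the sketch below. It is hypothetical and kept dependency-free: `InlineEtlDataset` implements only the map-style protocol (`__len__` and `__getitem__`), whereas real code would subclass torch.utils.data.Dataset, and the `fetch` callable is supplied by the caller.

```python
class InlineEtlDataset:
    """Map-style dataset that fetches each object through a caller-supplied function.

    Hypothetical sketch: in real use this would subclass
    torch.utils.data.Dataset, and `fetch` would read an object through the ETL.
    """

    def __init__(self, object_names, fetch):
        self._names = list(object_names)
        self._fetch = fetch

    def __len__(self):
        return len(self._names)

    def __getitem__(self, index):
        # Fetch lazily so each object is transformed only when it is requested.
        name = self._names[index]
        return name, self._fetch(name)


# Demonstration with a stand-in fetch function instead of a cluster call.
dataset = InlineEtlDataset(["a.jpg", "b.jpg"], lambda name: name.encode())
print(len(dataset), dataset[1])
```

Against a cluster, `fetch` could mirror get_with_etl above, for example `lambda name: client.bucket(bucket_name).object(name).get_reader(etl_name=image_etl.name).read_all()`. The returned bytes would still need decoding into tensors before training.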

Full code examples for each action above can be found here.

References

  1. AIStore & ETL: Introduction
  2. GitHub:
    • AIStore
    • Local Kubernetes Deployment
    • AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm
    • AIS-ETL containers and specs
  3. Documentation, blogs, videos:
    • https://aiatscale.org
    • https://github.com/NVIDIA/aistore/tree/main/docs
  4. Deprecated training code samples:
    • ImageNet PyTorch training with aistore.pytorch.Dataset
  5. Full code example
    • Transform Images With SDK
  6. Dataset
    • The Oxford-IIIT Pet Dataset