Archives: read, write, and list

AIStore natively supports four archive/serialization formats across all APIs, batch jobs, and functional extensions: TAR, TGZ (TAR.GZ), TAR.LZ4, and ZIP.

Motivation

Archives address the small-file problem - performance degradation from random access to very large datasets containing many small files.

To qualify “very large” and “small-file”: the datasets we usually see in the field contain 10+ million files with sizes ranging from 1K to 100K.

AIStore’s implementation allows unmodified clients and applications to work efficiently with archived datasets.

Key benefits:

- Improved I/O performance via reduced metadata lookups and network roundtrips
- Seamless integration with existing, unmodified workflows
- Implicit dataset backup: each archive acts as a self-contained, immutable copy of the original files

Because each shard is a self-contained, immutable representation of its original files, sharded datasets are also easy to replicate, snapshot, or version without additional tooling.

Supported Formats

- TAR (.tar) - Unix archive format (since 1979) supporting USTAR, PAX, and GNU TAR variants
- TGZ (.tgz, .tar.gz) - TAR with gzip compression
- TAR.LZ4 (.tar.lz4) - TAR with lz4 compression
- ZIP (.zip) - PKWARE ZIP format (since 1989)
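
As a rough illustration of the extension-to-format mapping listed above, the following sketch infers the format from an object name. The helper `detect_format` and the case-insensitive matching are assumptions made for this example; they are not AIStore's own detection logic.

```python
from typing import Optional

# Extension-to-format table taken from the list above.
# Longer suffixes come first so ".tar.gz" and ".tar.lz4" match before any shorter suffix.
_SUFFIXES = (
    (".tar.lz4", "TAR.LZ4"),
    (".tar.gz", "TGZ"),
    (".tgz", "TGZ"),
    (".tar", "TAR"),
    (".zip", "ZIP"),
)


def detect_format(name: str) -> Optional[str]:
    """Return the archive format implied by a file name, or None if unrecognized.

    Hypothetical helper: matching is case-insensitive by choice of this sketch.
    """
    lowered = name.lower()
    for suffix, fmt in _SUFFIXES:
        if lowered.endswith(suffix):
            return fmt
    return None


for candidate in ("shard-0001.tar", "images.tar.gz", "images.tgz", "logs.tar.lz4", "docs.zip"):
    print(candidate, "->", detect_format(candidate))
```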

Operations

AIStore can natively read, write, append¹, and list archives. Operations include:

- Regular GET and PUT requests:
  - Go API - see “ArchPath” parameter
  - Python SDK - ditto
  - Python SDK/Archive - see archive-related config
- get-batch - efficient multi-object/multi-file retrieval
- list-objects - “opens” archives and includes contained pathnames in results
- dsort - distributed archive creation and transformation
- aisloader - benchmarking with archive workloads
- Concurrent multi-object transactions for bulk archive generation from selected objects
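
To make two of the operations above concrete, here is a minimal local sketch using only Python's standard `tarfile` module. It does not call AIStore: `build_tar`, `list_archived`, and `read_archived` are hypothetical helpers, and the `<object name>/<pathname>` listing layout is an assumption of this example. It only shows what "opening" an archive for listing, and selecting one archived file by its path, amount to conceptually.

```python
import io
import tarfile


def build_tar(files: dict) -> bytes:
    """Pack {pathname: bytes} into an in-memory TAR archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_archived(object_name: str, archive: bytes) -> list:
    """List contained files as "<object name>/<pathname inside the archive>"."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return [f"{object_name}/{m.name}" for m in tar.getmembers() if m.isfile()]


def read_archived(archive: bytes, archpath: str) -> bytes:
    """Return the content of one archived file selected by its pathname."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        member = tar.extractfile(archpath)  # raises KeyError if the path is absent
        if member is None:
            raise KeyError(f"{archpath} is not a regular file")
        return member.read()


shard = build_tar({"train/0001.txt": b"cat", "train/0002.txt": b"dog"})
print(list_archived("shard-000.tar", shard))
print(read_archived(shard, "train/0002.txt"))
```

The point of the sketch is that a client asks for one small file by path while the store keeps a single large object, which is where the savings in metadata lookups and roundtrips come from.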

Default format: TAR is the system default when serialization format is unspecified.

¹ APPEND is supported for the TAR format only. The other formats (ZIP, TGZ, TAR.LZ4) were not designed for true append operations and allow only extract-all-and-recreate emulation, which significantly impacts performance.
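
The difference described in the footnote can be reproduced locally with Python's standard `tarfile` module, shown here for TAR versus TGZ only. This is an illustration of the format limitation, not AIStore's implementation; the helper names and file names are made up for the example.

```python
import io
import os
import tarfile
import tempfile


def add_bytes(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add one in-memory file to an open archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def tar_names(path: str) -> list:
    """Return member names; mode "r" detects compression transparently."""
    with tarfile.open(path, "r") as tar:
        return tar.getnames()


workdir = tempfile.mkdtemp()

# Plain TAR: mode "a" writes new entries after the existing ones, in place.
plain = os.path.join(workdir, "shard.tar")
with tarfile.open(plain, "w") as tar:
    add_bytes(tar, "a.txt", b"first")
with tarfile.open(plain, "a") as tar:
    add_bytes(tar, "b.txt", b"second")
plain_names = tar_names(plain)

# Gzip-compressed TAR: the standard library rejects append mode outright.
packed = os.path.join(workdir, "shard.tgz")
with tarfile.open(packed, "w:gz") as tar:
    add_bytes(tar, "a.txt", b"first")
try:
    tarfile.open(packed, "a:gz")
    gz_append_supported = True
except ValueError:
    gz_append_supported = False

# Emulated append: read every existing entry, then recreate the whole archive.
with tarfile.open(packed, "r:gz") as old:
    entries = [(m, old.extractfile(m).read()) for m in old.getmembers() if m.isfile()]
with tarfile.open(packed, "w:gz") as new:
    for member, data in entries:
        new.addfile(member, io.BytesIO(data))
    add_bytes(new, "b.txt", b"second")

print(plain_names, gz_append_supported, tar_names(packed))
```

The emulated path rewrites and recompresses every existing entry to add a single file, so its cost grows with the size of the archive rather than with the size of the appended data.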
