Working with Archives (Shards)
In AIStore, archives (also called shards) are special objects that contain multiple files packaged together in formats like TAR, TGZ, ZIP, or TAR.LZ4. Working with archives is essential for efficiently managing collections of related files and for operations like distributed sorting.
In this document:
- Commands to read, write, extract, and list archives: objects formatted as TAR, TGZ (or TAR.GZ), ZIP, or TAR.LZ4.
For the most recently updated list of supported archival formats, please refer to this source.
The corresponding subset of CLI commands starts with ais archive, from where you can <TAB-TAB> to the actual (reading, writing, etc.) operation.
Table of Contents
- Subcommands
- Archive Files and Directories (ais archive put)
- Append Files to Existing Archives
- Archive Multiple Objects (ais archive bucket)
- List Archived Content (ais archive ls)
- Get Archived Content (ais archive get)
- Get Archived Content: Multiple-Selection
- Generate Shards for Testing
Subcommands
The corresponding subset of subcommands starts with ais archive, from where you can <TAB-TAB> to the actual operation:
For detailed help on any command, use the --help option:
Archive files and directories (ais archive put)
Archive multiple files.
The operation accepts either an explicitly defined list or template-defined range of file names (to archive).
NOTE:
ais archive put works with locally accessible (source) files and should not be confused with the ais archive bucket command (below).
Also, note that ais put command with its --archpath option provides an alternative way to archive multiple objects:
For the most recently updated list of supported archival formats, please see:
Append files and directories to an existing archive
The APPEND operation appends files to existing archives (shards). As such, APPEND is a variation of PUT (above) with two additional boolean flags:
Example 1: add file to archive
step 1. create archive (by archiving a given source dir)
step 2. add a single file to existing archive
step 3. list entire bucket with an --archive option to show all archived entries
Alternatively, use regex to select:
Example 2: use --template flag to add source files
Generally, the --template option combines (an optional) prefix and/or one or more ranges (e.g., bash brace expansions).
In this case, the template we use is a simple prefix with no ranges.
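As a rough illustration of how a template that combines a prefix with brace-style ranges expands into names, here is a minimal Python sketch. The expand_template helper is hypothetical and handles only numeric {start..end} ranges; it is not the parser AIStore uses, and the real template syntax may support more than is shown here.

```python
import itertools
import re

# Hypothetical helper for illustration only: it understands numeric
# {start..end} ranges and nothing else. Not AIStore's template parser.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_template(template: str) -> list[str]:
    """Expand every {start..end} range in `template`, preserving zero padding."""
    # With two capture groups, re.split yields:
    # literal, start, end, literal, start, end, ..., literal
    parts = _RANGE.split(template)
    literals = parts[0::3]
    ranges = []
    for start, end in zip(parts[1::3], parts[2::3]):
        # Keep the width of a zero-padded start value such as "000".
        width = len(start) if len(start) > 1 and start.startswith("0") else 0
        ranges.append([str(n).zfill(width) for n in range(int(start), int(end) + 1)])

    names = []
    for combo in itertools.product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        names.append("".join(pieces))
    return names
```

In this sketch, a template with no ranges (a plain prefix) expands to itself, while a template such as obj-{0..9} expands to ten names.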
Example 3: add file to archive
In this example, we assume that arch.tar already exists.
Example 4: add file to archive
Archive multiple objects (ais archive bucket)
The ais archive bucket command creates archives (shards) from multiple objects stored in a bucket. This is a powerful operation that:
- Takes objects from a specified source bucket
- Archives them as a single shard in the specified destination bucket
Features
- Source and destination buckets can be the same or different
- Supports multiple selection methods (lists, templates, prefixes)
- Supports all backend providers
- Supports various archival formats (.tar, .tar.gz/.tgz, .zip, .tar.lz4)
- Executes asynchronously and in parallel across all AIS nodes for maximum performance
Usage
Selection Options
The command provides multiple ways to select objects for archiving:
- Template matching: Use patterns with ranges to select objects
- List-based selection: Specify a comma-separated list of objects
- Prefix-based selection: Select objects that share a common prefix
Non-Recursive Option (--nr)
The --nr (or --non-recursive) flag limits the scope of the archiving operation to only include objects at the specified directory level, without descending into subdirectories.
Examples with Non-Recursive Flag
- Archive only the files directly in a directory (not its subdirectories):
  This will only archive objects directly in the aaa/ directory, skipping any objects in subdirectories like aaa/bbb/.
- Compare with recursive archiving (default behavior):
  This will archive all objects under the aaa/ prefix, including those in subdirectories like aaa/bbb/.
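The difference between the two behaviors can be sketched in a few lines of Python. This is only a model of the selection rule described above, assuming that object names use "/" as the virtual-directory separator; select_objects and the sample names are hypothetical and not part of AIStore.

```python
def select_objects(names, prefix, recursive=True):
    """Return object names under `prefix`.

    With recursive=False, names that contain a further "/" after the
    prefix (i.e., objects in nested virtual directories) are skipped.
    """
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if not recursive and "/" in remainder:
            continue
        selected.append(name)
    return selected


# Hypothetical bucket contents, for illustration only.
bucket = ["aaa/1.txt", "aaa/2.txt", "aaa/bbb/3.txt", "ccc/4.txt"]
```

Here, select_objects(bucket, "aaa/", recursive=False) keeps only aaa/1.txt and aaa/2.txt, while the default (recursive) call also includes aaa/bbb/3.txt.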
Visual Example
For a bucket with this structure:
With --nr flag:
Result:
Without --nr flag:
Result:
Additional Options
- --append-or-put: Append to an existing archive if it exists; otherwise create new
- --cont-on-err: Continue archiving despite errors in multi-object transactions
- --dry-run: Preview the results without executing
- --include-src-bck: Prefix archived file names with the source bucket name
- --skip-lookup: Skip checking bucket existence for better performance
- --wait: Wait for the asynchronous operation to complete
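The "append if it exists, otherwise create" semantics can be mimicked locally with Python's standard tarfile module, whose "a" mode appends to an existing uncompressed TAR and creates the file when it is missing. This is a local-filesystem sketch only; append_or_put is a hypothetical helper, and it says nothing about how AIStore implements the flag inside the cluster.

```python
import io
import os
import tarfile
import tempfile


def append_or_put(archive_path: str, files: dict[str, bytes]) -> list[str]:
    """Append `files` to the TAR at `archive_path`, creating it if needed.

    Returns the names of all entries in the archive afterwards.
    """
    # tarfile mode "a": append without compression; create if missing.
    with tarfile.open(archive_path, mode="a") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    with tarfile.open(archive_path, mode="r") as tar:
        return tar.getnames()


# Usage: archive three entries, then append two more to the same archive.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "arch.tar")
    append_or_put(path, {"obj1": b"1", "obj2": b"2", "obj3": b"3"})
    all_names = append_or_put(path, {"obj4": b"4", "obj5": b"5"})
```

Note that Python's append mode works only for uncompressed TAR files; compressed formats would need to be rewritten.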
Complete Examples
1. Archive objects with a specific prefix, non-recursively:
2. Archive objects using a template range:
3. Incrementally append to an existing archive:
4. Archive a list of objects from a given bucket:
Resulting ais://bck/arch.tar contains objects ais://bck/obj1 and ais://bck/obj2.
5. Archive objects from a different bucket, use template (range):
ais://dst/arch.tar now contains 10 objects from bucket ais://src: ais://src/obj-0, ais://src/obj-1 … ais://src/obj-9.
6. Archive 3 objects and then append 2 more:
Notes
- ais archive bucket must not be confused with ais archive put
  - archive bucket archives objects in the cluster (more precisely, objects accessible by the cluster)
  - archive put archives files from your local or locally accessible (NFS, SMB) directories
- The operation runs asynchronously
  - use --wait to wait for completion; see --help for details
- When using the --nr (non-recursive) flag, only the immediate contents of the specified virtual directory are archived
- For more information on multi-object operations, please see:
List archived content
List archived content as a tree with archive (“shard”) name as a root and archived files as leaves. Filenames are always sorted alphabetically.
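To make the tree shape concrete, the following Python sketch builds a small in-memory TAR and prints its listing with the shard name as the root and alphabetically sorted entries as leaves. The helper names and the exact indentation are assumptions made for this illustration; the actual CLI output format may differ.

```python
import io
import tarfile


def build_shard(files: dict[str, bytes]) -> bytes:
    """Pack `files` (name -> content) into an uncompressed in-memory TAR."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_shard(shard_name: str, shard_bytes: bytes) -> list[str]:
    """Return listing lines: the shard as root, sorted archived files as leaves."""
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        members = sorted(m.name for m in tar.getmembers() if m.isfile())
    # Each leaf is shown with its parent shard name as a prefix.
    return [shard_name] + [f"    {shard_name}/{name}" for name in members]
```

For example, "\n".join(list_shard("arch.tar", build_shard({"b.txt": b"2", "a.txt": b"1"}))) lists a.txt before b.txt regardless of the order in which the files were archived.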
Options
Examples
Example: use '--prefix' that crosses shard boundary
For starters, we recursively archive all aistore docs:
To list a virtual subdirectory inside this newly created shard (e.g.):
or, same:
Get archived content (ais archive get)
Example: extract one file
Alternatively, use fully qualified name:
Example: extract one file using its fully-qualified name:
Example: extract all files from a single shard
Let’s say, we have a certain shard in a certain bucket:
We can then GET and extract it to a local directory, e.g.:
But here’s an alternative syntax to achieve the same:
or even:
The difference is that:
- in the first case we ask for a specific shard,
- while in the second (and third) we filter bucket’s content using a certain prefix
- and the fact (the convention) that archived filenames are prefixed with their parent (shard) name.
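The naming convention in the last point is what makes prefix filtering work across shard boundaries. A minimal Python sketch, assuming that an archived file's fully qualified name is simply the shard name, a "/", and the name inside the shard (the helpers and sample names are hypothetical):

```python
def qualified_names(shards: dict[str, list[str]]) -> list[str]:
    """Prefix every archived filename with its parent shard name."""
    return sorted(
        f"{shard}/{name}" for shard, names in shards.items() for name in names
    )


def filter_by_prefix(shards: dict[str, list[str]], prefix: str) -> list[str]:
    """Select fully qualified names that start with `prefix`."""
    return [name for name in qualified_names(shards) if name.startswith(prefix)]


# Hypothetical bucket layout: two shards under the virtual directory abc/.
shards = {
    "abc/A.tar": ["docs/x.md", "img/y.png"],
    "abc/B.tar": ["docs/z.md"],
}
```

In this model, the prefix abc/A.tar selects everything inside one shard, abc/A.tar/docs/ selects a virtual subdirectory inside that shard, and abc/ spans both shards.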
Example: extract all files from all shards (with a given prefix)
Let’s say, there’s a bucket ais://dst with a virtual directory abc/ that in turn contains:
Next, we GET and extract them all in the respective sub-directories (note --verbose option):
Example: use '--prefix' that crosses shard boundary
For starters, we recursively archive all aistore docs:
To list a virtual subdirectory inside this newly created shard (e.g.):
Now, extract matching files from the bucket to /tmp/out:
Get archived content: multiple selection
Generally, both single and multi-selection from a given source shard are realized using one of the following 4 (four) options:
In particular, the '--archregx' and '--archmode' pair defines multiple selection, as the following examples demonstrate.
But first, note that in all multi-selection cases, the result is (currently) invariably formatted as .TAR (that contains the aforementioned selection).
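Conceptually, a multi-selection reads a shard, keeps the entries that match, and packs them into a new TAR. The Python sketch below models that flow with three matching rules (suffix, prefix, WebDataset key). The dictionary keys are labels chosen for this sketch and are not necessarily the values the CLI accepts; the WebDataset key is assumed to be the path up to the first dot of the basename, following the common WebDataset convention.

```python
import io
import posixpath
import tarfile


def wds_key(name: str) -> str:
    """Assumed WebDataset key: the path with all basename extensions removed."""
    directory, base = posixpath.split(name)
    return posixpath.join(directory, base.split(".", 1)[0])


# Illustrative matching rules; the labels are hypothetical.
MATCHERS = {
    "suffix": lambda name, pattern: name.endswith(pattern),
    "prefix": lambda name, pattern: name.startswith(pattern),
    "wdskey": lambda name, pattern: wds_key(name) == pattern,
}


def select_to_tar(shard: bytes, pattern: str, mode: str) -> bytes:
    """Copy entries of `shard` that match `pattern` into a new in-memory TAR."""
    match = MATCHERS[mode]
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src.getmembers():
            if member.isfile() and match(member.name, pattern):
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()
```

Whatever the format of the source shard, the result in this sketch is always an uncompressed TAR, mirroring the note above.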
Example: suffix match
Select all *.jpeg files from a given shard and return them all as 111.tar:
Example: WebDataset key
Select all files that have a given WebDataset key; return the result as 222.tar:
Example: prefix match
Similar to the above, except that in this case the '--archregx' value specifies a virtual subdirectory inside a given named shard:
Generate shards
ais archive gen-shards "BUCKET/TEMPLATE.EXT"
Put randomly generated shards that can be used for dSort testing.
The TEMPLATE must be a bash-like brace expansion (see examples) and .EXT must be one of: .tar, .tar.gz.
Warning: Remember to always quote the argument ("..."); otherwise, the shell will perform the brace expansion itself before the command sees it.
Options
Examples
Generate shards with varying numbers of files and file sizes
Generate 10 shards each containing 100 files of size 256KB and put them inside ais://dsort-testing bucket (creates it if it does not exist).
Shards will be named: shard-0.tar, shard-1.tar, …, shard-9.tar.
Generate shards using custom naming template
Generate 100 shards, each containing 5 files of size 256KB, and put them inside the dsort-testing bucket.
Shards will be compressed and named: super_shard_000_last.tgz, super_shard_001_last.tgz, …, super_shard_099_last.tgz
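For readers who want comparable test data without a cluster, the following Python sketch produces shards of random files entirely in memory using the standard tarfile module. It only imitates the shape of the data described above; the function names, the entry naming inside each shard, and the in-memory approach are assumptions of this sketch, not the behavior of ais archive gen-shards.

```python
import io
import os
import tarfile


def make_shard(file_count: int, file_size: int, compress: bool = False) -> bytes:
    """Build one TAR (optionally gzip-compressed) of random fixed-size files."""
    mode = "w:gz" if compress else "w"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for i in range(file_count):
            # Entry naming is hypothetical; any unique name would do.
            info = tarfile.TarInfo(name=f"file-{i:04d}.bin")
            info.size = file_size
            tar.addfile(info, io.BytesIO(os.urandom(file_size)))
    return buf.getvalue()


def generate_shards(shard_count: int, file_count: int, file_size: int) -> dict[str, bytes]:
    """Return shard-0.tar ... shard-(N-1).tar mapped to their TAR bytes."""
    return {
        f"shard-{n}.tar": make_shard(file_count, file_size)
        for n in range(shard_count)
    }
```

Calling generate_shards(10, 100, 256 * 1024) would correspond in shape to the first example above, but it holds roughly 256MB in memory at once, so smaller values are advisable for quick experiments.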