Working with Archives (Shards)
In AIStore, archives (also called shards) are special objects that contain multiple files packaged together in formats like TAR, TGZ, ZIP, or TAR.LZ4. Working with archives is essential for efficiently managing collections of related files and for operations like distributed sorting.
Supported archival formats: TAR, TGZ (or TAR.GZ), ZIP, and TAR.LZ4.
For the most recently updated list of supported archival formats, please refer to this source.
The corresponding subset of CLI commands starts with ais archive, from where you can <TAB-TAB> to the actual (reading, writing, etc.) operation.
The subcommands covered in this document include ais archive put, ais archive bucket, and ais archive ls.
For detailed help on any command, use the --help option (for example, ais archive put --help).
ais archive put
Archive multiple files.
The operation accepts either an explicitly defined list or template-defined range of file names (to archive).
NOTE:
ais archive put works with locally accessible (source) files and should not be confused with the ais archive bucket command (below).
Also, note that the ais put command with its --archpath option provides an alternative way to archive multiple objects:
For the most recently updated list of supported archival formats, please see:
The APPEND operation appends files to existing archives (shards). As such, APPEND is a variation of PUT (above) with two additional boolean flags:
--archive option to show all archived entries
Alternatively, use regex to select:
--template flag to add source files
Generally, the --template option combines an optional prefix and/or one or more ranges (e.g., bash brace expansions).
In this case, the template we use is a simple prefix with no ranges.
In this example, we assume that arch.tar already exists.
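As a file-level illustration of what appending to an existing TAR involves, here is a minimal local sketch built on Python's standard tarfile module. It is an analogy only: the function names append_or_put and add_bytes are hypothetical, the archive lives on the local filesystem rather than in a bucket, and nothing here reflects how AIStore implements APPEND internally.

```python
import io
import tarfile


def add_bytes(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add an in-memory payload to an open TAR under the given name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def append_or_put(path: str, files: dict) -> list:
    """Append files to the TAR at `path`; create the TAR if it is absent.

    tarfile's "a" mode appends to an uncompressed archive and creates
    the file when it does not exist yet.
    """
    with tarfile.open(path, "a") as tar:
        for name, payload in files.items():
            add_bytes(tar, name, payload)
    # Re-open for reading and report everything the archive now contains.
    with tarfile.open(path, "r") as tar:
        return tar.getnames()
```

Calling append_or_put twice on the same path first creates the archive and then extends it, so the second call returns the names from both calls in insertion order.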
ais archive bucket
The ais archive bucket command creates archives (shards) from multiple objects stored in a bucket. This is a powerful operation that:
The command provides multiple ways to select objects for archiving:
Template matching: Use patterns with ranges to select objects
List-based selection: Specify a comma-separated list of objects
Prefix-based selection: Select objects that share a common prefix
The --nr (or --non-recursive) flag limits the scope of the archiving operation to only include objects at the specified directory level, without descending into subdirectories.
Archive only the files directly in a directory (not its subdirectories):
This will only archive objects directly in the aaa/ directory, skipping any objects in subdirectories like aaa/bbb/.
Compare with recursive archiving (default behavior):
This will archive all objects under the aaa/ prefix, including those in subdirectories like aaa/bbb/.
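The difference between recursive and non-recursive selection can be sketched as a simple filter over object names. This is a minimal illustration, assuming that virtual directories are delimited by "/" and that the prefix ends with "/"; the function name select_by_prefix is hypothetical and is not part of the AIStore API.

```python
def select_by_prefix(names, prefix, non_recursive=False):
    """Return the object names that a prefix-based selection would pick.

    With non_recursive=True, names that continue into a deeper virtual
    directory below `prefix` are skipped.
    """
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # A "/" in the remainder means the object sits in a subdirectory.
        if non_recursive and "/" in remainder:
            continue
        selected.append(name)
    return selected


objects = ["aaa/1.txt", "aaa/2.txt", "aaa/bbb/3.txt", "zzz/4.txt"]
print(select_by_prefix(objects, "aaa/", non_recursive=True))
# -> ['aaa/1.txt', 'aaa/2.txt']
print(select_by_prefix(objects, "aaa/"))
# -> ['aaa/1.txt', 'aaa/2.txt', 'aaa/bbb/3.txt']
```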
For a bucket with this structure:
With --nr flag:
Result:
Without --nr flag:
Result:
--append-or-put: Append to an existing archive if it exists; otherwise, create a new one
--cont-on-err: Continue archiving despite errors in multi-object transactions
--dry-run: Preview the results without executing
--include-src-bck: Prefix archived file names with the source bucket name
--skip-lookup: Skip checking bucket existence for better performance
--wait: Wait for the asynchronous operation to complete
1. Archive objects with a specific prefix, non-recursively:
2. Archive objects using a template range:
3. Incrementally append to an existing archive:
4. Archive a list of objects from a given bucket:
Resulting ais://bck/arch.tar contains objects ais://bck/obj1 and ais://bck/obj2.
5. Archive objects from a different bucket, use template (range):
ais://dst/arch.tar now contains 10 objects from bucket ais://src: ais://src/obj-0, ais://src/obj-1 … ais://src/obj-9.
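To make the range notation concrete, the following sketch expands a bash-style numeric range into the names it stands for. It is a simplified model: the function expand_template is hypothetical, it handles only ascending numeric ranges of the form {start..end}, and it mimics bash zero-padding; it does not claim to cover everything the --template option accepts.

```python
import re

# Matches a simple numeric brace range such as {0..9} or {000..099}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_template(template):
    """Expand every {start..end} range in `template` into concrete names."""
    match = _RANGE.search(template)
    if match is None:
        # No range left: the template is a plain name (or plain prefix).
        return [template]
    start, end = match.group(1), match.group(2)
    # Like bash, pad with zeros only when a bound has a leading zero.
    padded = any(len(s) > 1 and s.startswith("0") for s in (start, end))
    width = max(len(start), len(end)) if padded else 0
    names = []
    for i in range(int(start), int(end) + 1):
        head = template[:match.start()] + str(i).zfill(width)
        # Recurse to expand any further ranges to the right.
        for tail in expand_template(template[match.end():]):
            names.append(head + tail)
    return names


print(expand_template("obj-{0..9}"))
# -> ['obj-0', 'obj-1', 'obj-2', 'obj-3', 'obj-4',
#     'obj-5', 'obj-6', 'obj-7', 'obj-8', 'obj-9']
```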
6. Archive 3 objects and then append 2 more:
ais archive bucket must not be confused with ais archive put
archive bucket archives objects in the cluster
archive put archives files from your local or locally accessible (NFS, SMB) directories
Use --wait to wait for completion; see --help for details
With the --nr (non-recursive) flag, only the immediate contents of the specified virtual directory are archived
ais archive ls
List archived content as a tree with the archive (“shard”) name as the root and archived files as leaves. Filenames are always sorted alphabetically.
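The tree-style listing described above can be approximated locally with Python's tarfile module. This is only a sketch: make_tar and list_archive are hypothetical helpers, and the indentation used here is an assumed layout, not the actual output format of the CLI.

```python
import io
import tarfile


def make_tar(files):
    """Build an in-memory TAR from a {name: bytes} mapping (test helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_archive(shard_name, tar_bytes):
    """Return the shard name as root, then its files sorted alphabetically."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        names = sorted(m.name for m in tar.getmembers() if m.isfile())
    return [shard_name] + ["    " + shard_name + "/" + n for n in names]


shard = make_tar({"b.txt": b"x", "a/c.txt": b"yy"})
for line in list_archive("arch.tar", shard):
    print(line)
```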
For starters, we recursively archive all aistore docs:
To list a virtual subdirectory inside this newly created shard (e.g.):
or, same:
Alternatively, use fully qualified name:
Let’s say, we have a certain shard in a certain bucket:
We can then GET it and extract it to a local directory, e.g.:
But here’s an alternative syntax to achieve the same:
or even:
The difference is that:
Let’s say, there’s a bucket ais://dst with a virtual directory abc/ that in turn contains:
Next, we GET and extract them all in the respective sub-directories (note --verbose option):
For starters, we recursively archive all aistore docs:
To list a virtual subdirectory inside this newly created shard (e.g.):
Now, extract matching files from the bucket to /tmp/out:
Generally, both single and multi-selection from a given source shard are realized using one of the following 4 (four) options:
In particular, the --archregx and --archmode pair defines multiple selection, as the following examples demonstrate.
But first, note that in all multi-selection cases, the result is (currently) invariably formatted as .TAR (that contains the aforementioned selection).
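The idea of selecting several entries from a shard and returning them as a new TAR can be modeled locally as follows. This is a sketch under stated assumptions: the functions matches and select_to_tar are hypothetical, and the mode labels "prefix", "suffix", and "regexp" are illustrative names chosen for this example rather than a confirmed list of --archmode values.

```python
import io
import re
import tarfile


def matches(name, pattern, mode):
    """Decide whether an archived file name matches under a given mode."""
    if mode == "prefix":
        return name.startswith(pattern)
    if mode == "suffix":
        return name.endswith(pattern)
    if mode == "regexp":
        return re.search(pattern, name) is not None
    raise ValueError("unsupported mode: " + mode)


def select_to_tar(src_bytes, pattern, mode):
    """Copy every matching file from the source TAR into a new TAR."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src_bytes), mode="r") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src.getmembers():
            if member.isfile() and matches(member.name, pattern, mode):
                # Reuse the original header and stream the file contents.
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()
```

For instance, select_to_tar(shard_bytes, ".jpeg", "suffix") returns the bytes of a TAR that holds only the entries whose names end in .jpeg.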
Select all *.jpeg files from a given shard and return them all as 111.tar:
Select all files that have a given WebDataset key; return the result as 222.tar:
Similar to the above, except that in this case the --archregx value specifies a virtual subdirectory inside a given named shard:
ais archive gen-shards "BUCKET/TEMPLATE.EXT"
Put randomly generated shards that can be used for dSort testing.
The TEMPLATE must be a bash-like brace expansion (see examples) and .EXT must be one of: .tar, .tar.gz.
Warning: Always quote the argument ("..."); otherwise, the shell will perform the brace expansion itself before the command sees it.
Generate 10 shards each containing 100 files of size 256KB and put them inside ais://dsort-testing bucket (creates it if it does not exist).
Shards will be named: shard-0.tar, shard-1.tar, …, shard-9.tar.
Generate 100 shards each containing 5 files of size 256KB and put them inside the dsort-testing bucket.
Shards will be compressed and named: super_shard_000_last.tgz, super_shard_001_last.tgz, …, super_shard_099_last.tgz
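For readers who want to see what such randomly generated shards look like, here is a small local model using Python's tarfile module. It is an assumption-laden sketch, not the CLI's implementation: gen_shard and gen_shards are hypothetical names, the file names inside each shard (file-0000.bin and so on) are invented for illustration, and the shards are kept in memory instead of being put into a bucket.

```python
import io
import os
import tarfile


def gen_shard(file_count, file_size, compress=False):
    """Build one shard in memory, filled with random fixed-size files."""
    buf = io.BytesIO()
    mode = "w:gz" if compress else "w"
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for i in range(file_count):
            # Hypothetical naming scheme for the files inside a shard.
            info = tarfile.TarInfo(name="file-%04d.bin" % i)
            info.size = file_size
            tar.addfile(info, io.BytesIO(os.urandom(file_size)))
    return buf.getvalue()


def gen_shards(names, file_count, file_size):
    """Return {shard name: shard bytes}; gzip when the name asks for it."""
    return {
        name: gen_shard(file_count, file_size,
                        compress=name.endswith((".tgz", ".tar.gz")))
        for name in names
    }


# Scaled-down usage: 3 uncompressed shards, 5 files of 1 KiB each.
shards = gen_shards(["shard-%d.tar" % i for i in range(3)], 5, 1024)
print(sorted(shards))
# -> ['shard-0.tar', 'shard-1.tar', 'shard-2.tar']
```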