A bucket is a named container for objects - monolithic files or chunked representations - with associated metadata. It is the fundamental unit of data organization and data management.
AIS buckets are categorized by their provider and origin. Native ais:// buckets managed by this cluster are always created explicitly (via ais create or the respective Go and/or Python APIs).
Remote buckets (including s3://, gs://, etc., and ais:// buckets in remote AIS clusters) are usually discovered and auto-added on-the-fly on first access.
In a cluster, every bucket is assigned a unique, cluster-wide bucket ID (BID). Same-name remote buckets with different namespaces get different IDs.
Every object a) belongs to exactly one bucket and b) is identified by a unique name within that bucket.
Bucket properties define data protection (checksums, mirroring, erasure coding), chunked representation, versioning and synchronization with remote sources, access control, backend linkage, feature flags, rate-limit settings, and more.
For types of supported buckets (AIS, Cloud, remote AIS, etc.), bucket identity, properties, lifecycle, and associated policies, storage services and usage examples, see the comprehensive:
It is easy to see all CLI operations on buckets:
For convenience, a few of the most popular verbs are also aliased:
ais create BUCKET [BUCKET...]
Create bucket(s).
Create buckets bucket_name1 and bucket_name2, both with AIS provider.
Create bucket bucket_name in ml namespace.
Create bucket bucket_name in global namespace of AIS remote cluster with Bghort1l UUID.
Create bucket bucket_name in ml namespace of AIS remote cluster with Bghort1l UUID.
Create bucket bucket_name with custom properties specified.
ais bucket rm BUCKET [BUCKET...]
Delete an ais bucket or buckets.
Remove AIS buckets bucket_name1 and bucket_name2.
Remove bucket bucket_name from ml namespace.
Remove bucket bucket_name from global namespace of AIS remote cluster with Bghort1l UUID.
Remove bucket bucket_name from ml namespace of AIS remote cluster with Bghort1l UUID.
Removing remote buckets is not supported.
ais ls PROVIDER:[//BUCKET_NAME] [command options]
Notice the optional [//BUCKET_NAME]. When there’s no bucket, ais ls will list buckets. Otherwise, it’ll list objects.
The options are numerous. Here’s a non-exhaustive list (for the most recent update, run ais ls --help):
ais ls --regex "ngn*": List all buckets matching the ngn* regex expression.
ais ls aws: or (same) ais ls s3: List all existing buckets for the specific provider.
ais ls aws --all or (same) ais ls s3: --all: List absolutely all buckets that cluster can “see” including those that are not necessarily present in the cluster.
ais ls ais:// or (same) ais ls ais: List all AIS buckets.
ais ls ais://#name: List all buckets for the ais provider and name namespace.
ais ls ais://@uuid#namespace: List all remote AIS buckets that have uuid#namespace namespace. Note that:
uuid must be the remote cluster UUID (or its alias)
namespace is optional name of the remote namespace
As a rule of thumb, when a (logical) #namespace in the bucket’s name is omitted we use the global namespace that always exists.
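The naming forms listed above (provider, optional @uuid, optional #namespace, optional bucket name) can be illustrated with a small parser. This is a minimal sketch, not the CLI's actual implementation: `BucketRef` and `parse_bucket_uri` are hypothetical names, and the assumption is that a remote bucket is written as `ais://@uuid#namespace/bucket`.

```python
from typing import NamedTuple


class BucketRef(NamedTuple):
    provider: str   # e.g. "ais", "s3", "gs"
    uuid: str       # remote cluster UUID or alias; "" means this cluster
    namespace: str  # "" means the global namespace
    bucket: str     # "" means no bucket was given (i.e., list buckets)


def parse_bucket_uri(uri: str) -> BucketRef:
    """Split PROVIDER://[@UUID][#NAMESPACE/]BUCKET into its parts (hypothetical helper)."""
    provider, sep, rest = uri.partition("://")
    if not sep:
        # Provider-only forms such as "ais" or "s3:"
        return BucketRef(uri.rstrip(":"), "", "", "")
    uuid = namespace = ""
    if rest.startswith(("@", "#")):
        # The namespace part ends at the first "/"
        ns_part, _, rest = rest.partition("/")
        if ns_part.startswith("@"):
            uuid, _, namespace = ns_part[1:].partition("#")
        else:
            namespace = ns_part[1:]
    # Anything after the bucket name would be an object name or prefix
    bucket = rest.split("/", 1)[0]
    return BucketRef(provider, uuid, namespace, bucket)


if __name__ == "__main__":
    for uri in ("ais://bucket_name", "ais://#ml", "ais://@Bghort1l#ml/bucket_name", "s3:"):
        print(uri, "->", parse_bucket_uri(uri))
```

An empty `namespace` field corresponds to the global namespace, and an empty `bucket` field corresponds to the "list buckets" case of ais ls.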
ais ls is one of those commands that only keeps growing, in terms of supported options and capabilities.
The command:
ais ls PROVIDER:[//BUCKET_NAME] [command options]
can conveniently list buckets (with or without “summarizing” object counts and sizes) and objects.
Notice the optional [//BUCKET_NAME]. When there’s no bucket, ais ls will list buckets. Otherwise, it’ll list objects.
The command’s inline help is also quite extensive, with (inline) examples followed by numerous supported options:
When listing objects, a footer will be displayed showing:
--cached option: number of objects present in-cluster
--paged option: current page number
--count-only option: time elapsed to fetch the list
Examples of footer variations:
Listed 12345 names
Listed 12345 names (in-cluster: 456)
Page 123: 1000 names (in-cluster: none)
1. List objects in the AIS bucket bucket_name.
2. List objects in the remote bucket bucket_name.
3. List objects from a remote AIS cluster with a namespace:
4. List objects with paged output (showing page numbers):
5. List cached objects from a remote bucket:
6. Count objects in a bucket:
7. Count objects with paged output:
With --paged and remote buckets, the footer will show both page number and in-cluster object count when applicable.
The --diff option requires remote backends supporting some form of versioning (e.g., object version, checksum, and/or ETag).
List objects in the bucket bucket_name and ml namespace contained on AIS remote cluster with Bghort1l UUID.
List objects which match given prefix.
Here’s a quick 4-step sequence to demonstrate the functionality:
1. In the beginning, the bucket is accessible (notice --all) and empty as far as its in-cluster content is concerned.
2. The first (remote) list-objects will have the side effect of loading remote inventory.
3. The second and later list-objects will run much faster.
4. Finally, observe that the in-cluster content now includes the inventory (.csv) itself.
For starters, we archive all aistore docs:
To list a certain virtual subdirectory inside this newly created shard:
or, same:
AIS supports multiple storage backends:
See Unified Namespace for details on remote AIS clusters.
One major distinction between an AIS bucket (e.g., ais://mybucket) and a remote bucket (e.g., ais://@cluster/mybucket, s3://dataset, etc.) boils down to the fact that - for a variety of real-life reasons - in-cluster content of the remote bucket may be different from its remote content.
Note that the terms in-cluster and cached are used interchangeably throughout the entire documentation and CLI.
Remote buckets can be prefetched and evicted from AIS, entirely or selectively:
Some of the supported functionality can be quickly demonstrated with the following examples:
Here’s a more complete example that lists remote bucket, then reads and evicts a given object:
ais bucket mv BUCKET NEW_BUCKET
Move (i.e., rename) an AIS bucket.
If the NEW_BUCKET already exists, the mv operation will not proceed.
Cloud bucket move is not supported.
Move AIS bucket bucket_name to AIS bucket new_bucket_name.
ais cp SRC_BUCKET[/OBJECT_NAME_or_TEMPLATE] DST_BUCKET [command options]
Source bucket must exist. When the destination bucket is remote (e.g. in the Cloud) it must also exist and be writeable.
NOTE: there’s no requirement that either of the buckets is present in aistore.
NOTE: do not confuse in-cluster presence with existence: a remote object may exist remotely without being present in the cluster.
NOTE: to fully synchronize in-cluster content with remote backend, please refer to out of band updates.
Moreover, when the destination is AIS (ais://) or remote AIS (ais://@remote-alias) bucket, the existence is optional: the destination will be created on the fly, with bucket properties copied from the source (SRC_BUCKET).
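The destination rules above can be summarized as a small decision function. This is a sketch of the rules as stated in this section only; `plan_copy_destination` and its return strings are hypothetical and do not correspond to any aistore API.

```python
def plan_copy_destination(dst_uri: str, dst_exists: bool) -> str:
    """Decide what happens to the destination of a bucket copy (illustrative only)."""
    provider = dst_uri.partition("://")[0]
    if dst_exists:
        return "use-existing"
    if provider == "ais":
        # Covers both ais://bucket and ais://@remote-alias/bucket:
        # the destination is created on the fly with the source bucket's properties.
        return "create-with-source-props"
    # Any other remote (e.g. Cloud) destination must already exist and be writeable.
    return "error: remote destination must exist"


if __name__ == "__main__":
    print(plan_copy_destination("ais://dst_bucket", dst_exists=False))
    print(plan_copy_destination("s3://dst_bucket", dst_exists=False))
```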
NOTE: similar to delete, evict and prefetch operations, cp also supports embedded prefix - see disambiguating multi-object operation.
Finally, the option to copy remote bucket onto itself is also supported - syntax-wise. Here’s an example that’ll shed some light:
Incidentally, notice the --cached difference:
Copy AIS bucket src_bucket to AIS bucket dst_bucket.
The same as above, but wait until copying is finished.
Copy AWS bucket src_bucket to AWS bucket dst_bucket.
Example 1. Copy objects obj1.tar and obj1.info from bucket ais://bck1 to ais://bck2, and wait until the operation finishes
Example 2. Copy objects matching Bash brace-expansion obj{2..4}; do not wait for the operation to finish.
Example 3. Use --sync option to copy remote virtual subdirectory
In the example, --sync synchronizes destination bucket with its remote (e.g., Cloud) source.
In particular, the option will make sure that aistore has the latest versions of remote objects and may also entail removal of the objects that no longer exist remotely.
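To make the brace-expansion template from Example 2 concrete, the following sketch expands a numeric range such as obj{2..4} into individual object names. It handles only numeric `{lo..hi}` ranges (with optional zero padding) and is an illustration of the naming pattern, not the template parser aistore itself uses; `expand_template` is a hypothetical name.

```python
import re
from typing import List

# Matches a numeric range such as {2..4} or {08..10}
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_template(template: str) -> List[str]:
    """Expand every numeric {lo..hi} range in the template, left to right."""
    m = _RANGE.search(template)
    if m is None:
        return [template]
    lo, hi = int(m.group(1)), int(m.group(2))
    # Preserve zero padding when the lower bound is written with a leading zero
    width = len(m.group(1)) if m.group(1).startswith("0") else 0
    head, tail = template[:m.start()], template[m.end():]
    names = []
    for i in range(lo, hi + 1):
        for rest in expand_template(tail):
            names.append(head + str(i).zfill(width) + rest)
    return names


if __name__ == "__main__":
    print(expand_template("obj{2..4}"))
```

For the template in Example 2, the sketch yields obj2, obj3, and obj4.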
This example demonstrates how to copy objects between buckets using the AIStore CLI, and how to monitor the progress of the copy operation. AIStore supports all possible permutations of copying: Cloud to AIStore, Cloud to another (or same) Cloud, AIStore to Cloud, and between AIStore buckets.
To copy all objects with a common prefix from an S3 bucket to an AIStore bucket:
Note: The “Warning” message is benign and will only appear if the destination bucket does not exist.
You can monitor the progress of the copy operation using the ais show job copy command. Add the --refresh flag followed by a time in seconds to get automatic updates:
The output shows statistics for each node in the AIStore cluster:
The output also includes a “Total” row at the bottom that provides cluster-wide aggregated values for the number of objects processed and bytes transferred. The checkmark (✓) indicates that all nodes are reporting byte statistics.
To stop all in-progress jobs:
In our example, there’d be a single job ID
tco-goDbhCxtf
There’s a script that we use for testing. When run, it produces the following output:
The script executes a sequence of steps (above).
Notice a certain limitation (that also shows up as the last step #15):
As of version 3.22, aistore cp commands will always synchronize deleted and updated remote content.
However, to see content added out of band, you currently need to run a multi-object copy, with multiple source objects specified using --list or --template.
See ais cp --help for the most recently updated options.
ais storage summary PROVIDER:[//BUCKET_NAME] [command options] - show bucket sizes and the respective percentages of used capacity on a per-bucket basis.
ais bucket summary - same as above.
If BUCKET is omitted, the command applies to all AIS buckets.
The output includes the total number of objects in a bucket, the bucket’s size (bytes, megabytes, etc.), and the percentage of the total capacity used by the bucket.
A few additional words must be said about --validate. The option is provided to run integrity checks, namely: locations of objects, replicas, and EC slices in the bucket, the number of replicas (and whether this number agrees with the bucket configuration), and more.
Location of each stored object must at any point in time correspond to the current cluster map and, within each storage target, to the target’s mountpaths. A failure to abide by location rules is called misplacement; misplaced objects - if any - must be migrated to their proper locations via automated processes called global rebalance and resilver:
--validate may take considerable time to execute (depending, of course, on sizes of the datasets in question and the capabilities of the underlying hardware); non-zero misplaced objects in the (validated) output is a direct indication that the cluster requires rebalancing and/or resilvering; an alternative way to execute validation is to run ais storage validate or (simply) ais scrub:
For details and additional examples, please see:
ais start mirror BUCKET --copies <value>
Start an extended action to bring a given bucket to a certain redundancy level (value copies). Read more about this feature here.
ais start ec-encode BUCKET --data-slices <value> --parity-slices <value>
Start an extended action that encodes and recovers all objects and slices in a given bucket. The action enables erasure coding if it is disabled, and runs the encoding for all objects in the bucket in the background. If erasure coding for the bucket was enabled beforehand, the extended action recovers missing objects and slices if possible.
When running the extended action for a bucket that already has erasure coding enabled, you must pass the correct number of parity and data slices on the command line.
Run ais bucket props show <bucket-name> ec to get the current erasure coding settings.
Read more about this feature here.
All options are required and must be greater than 0.
Overall, the topic called “bucket properties” is rather involved and includes sub-topics “bucket property inheritance” and “cluster-wide global defaults”. For background, please first see:
Now, as far as CLI, run the following to list properties of the specified bucket. By default, a certain compact form of bucket props sections is presented.
ais bucket props show BUCKET [PROP_PREFIX] [command options]
When PROP_PREFIX is set, only props that start with PROP_PREFIX will be displayed.
Useful PROP_PREFIX are: access, checksum, ec, lru, mirror, provider, versioning.
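The effect of PROP_PREFIX can be pictured as a prefix filter over dotted property names. The sketch below is an assumption-laden illustration: `flatten` and `show_props` are hypothetical helpers, and the sample property values (including `lru.enabled`) are made up for the example rather than taken from a real bucket.

```python
def flatten(props: dict, parent: str = "") -> dict:
    """Turn nested sections into dotted names, e.g. {"mirror": {"copies": 2}} -> "mirror.copies"."""
    flat = {}
    for key, value in props.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def show_props(props: dict, prefix: str = "") -> dict:
    """Keep only the properties whose dotted name starts with the given prefix."""
    return {k: v for k, v in flatten(props).items() if k.startswith(prefix)}


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only
    sample = {
        "mirror": {"enabled": True, "copies": 2},
        "ec": {"data_slices": 2, "parity_slices": 2},
        "lru": {"enabled": False},
    }
    print(show_props(sample, "mirror"))
```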
ais bucket show is an alias for ais show bucket - both can be used interchangeably.
Show only lru section of bucket props for bucket_name bucket.
ais bucket props set [OPTIONS] BUCKET JSON_SPECIFICATION|KEY=VALUE [KEY=VALUE...]
Set bucket properties. For the available options, see bucket-properties.
If JSON_SPECIFICATION is used, all properties of the bucket are set based on the values in the JSON object.
When JSON specification is not used, some properties support user-friendly aliases:
Set the mirror.enabled and mirror.copies properties to true and 2 respectively, for the bucket bucket_name
Set read-only access to the bucket bucket_name.
All PUT and DELETE requests will fail.
When a bucket is hosted by an S3 compliant backend (such as, e.g., minio), we may want to specify an alternative S3 endpoint, so that AIS nodes use it when reading, writing, listing, and generally, performing all operations on remote S3 bucket(s).
Globally, the S3 endpoint can be overridden for all S3 buckets via the “S3_ENDPOINT” environment variable. If you decide to make the change, you may need to restart the AIS cluster while making sure that “S3_ENDPOINT” is available to the AIS nodes when they are starting up.
But it can also be done - and will take precedence over the global setting - on a per-bucket basis.
Here are some examples:
Global export S3_ENDPOINT=... override is static and readonly. Use it with extreme caution as it applies to all buckets.
On the other hand, for any given s3://bucket its S3 endpoint can be set, unset, and otherwise changed at any time - at runtime. As shown above.
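The precedence between the global and per-bucket settings described in this section can be sketched as follows. `effective_s3_endpoint` is a hypothetical helper, and how the per-bucket value is stored in bucket properties is deliberately left out; only the S3_ENDPOINT variable name comes from the text.

```python
import os
from typing import Mapping, Optional


def effective_s3_endpoint(bucket_endpoint: Optional[str],
                          env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Per-bucket endpoint wins; otherwise fall back to the global S3_ENDPOINT, if any."""
    if bucket_endpoint:
        return bucket_endpoint
    # None means "no override": the default S3 endpoint applies
    return env.get("S3_ENDPOINT") or None


if __name__ == "__main__":
    env = {"S3_ENDPOINT": "http://global.example:9000"}
    print(effective_s3_endpoint("http://minio.example:9000", env))
    print(effective_s3_endpoint(None, env))
```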
Set backend bucket for AIS bucket bucket_name to the GCP cloud bucket cloud_bucket.
Once the backend bucket is set, operations (get, put, list, etc.) with ais://bucket_name will be exactly as we would do with gcp://cloud_bucket.
It’s like a symlink to a cloud bucket.
The only difference is that all objects will be cached into ais://bucket_name (and reflected in the cloud as well) instead of gcp://cloud_bucket.
To disconnect the cloud bucket, run:
To create an erasure-encoded bucket or enable EC for an existing bucket, AIS requires at least ec.data_slices + ec.parity_slices + 1 targets.
At the same time, for small objects (size is less than ec.objsize_limit) it is sufficient to have only ec.parity_slices + 1 targets.
Option --force allows creating erasure-encoded buckets when the number of targets is not enough but the number exceeds ec.parity_slices.
Note that if the number of targets is less than ec.data_slices + ec.parity_slices + 1, the cluster accepts only objects smaller than ec.objsize_limit.
Bigger objects are rejected on PUT.
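The target-count rules above reduce to simple arithmetic, sketched here. `ec_admission` and its return strings are hypothetical and only restate the rules from this section; they are not part of any aistore API.

```python
def ec_admission(targets: int, data_slices: int, parity_slices: int,
                 force: bool = False) -> str:
    """Classify whether an erasure-coded bucket can be created on a cluster of this size."""
    full = data_slices + parity_slices + 1  # enough targets for objects of any size
    small_only = parity_slices + 1          # enough for objects below ec.objsize_limit
    if targets >= full:
        return "all-objects"
    if force and targets >= small_only:
        # Created with --force: only objects smaller than ec.objsize_limit are accepted
        return "small-objects-only"
    return "rejected"


if __name__ == "__main__":
    # A cluster with 6 targets
    print(ec_admission(6, data_slices=3, parity_slices=2))              # 3 + 2 + 1 = 6 targets needed
    print(ec_admission(6, data_slices=4, parity_slices=2))              # 7 needed, no --force
    print(ec_admission(6, data_slices=4, parity_slices=2, force=True))  # 7 needed, --force
```

With 6 targets, a 3+2 configuration fits exactly, while a 4+2 configuration needs 7 targets and is only possible with --force, in which case larger objects are rejected on PUT.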
The examples use a cluster with 6 targets:
Once erasure encoding is enabled for a bucket, the number of data and parity slices cannot be modified.
The minimum object size ec.objsize_limit can be changed on the fly.
To avoid accidental modification when EC for a bucket is enabled, the option --force must be used.
Set all bucket properties for bucket_name bucket based on the provided JSON specification.
If not all properties are mentioned in the JSON, the missing ones are set to zero values (empty / false / nil):
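The "missing ones are set to zero values" behavior can be pictured as overlaying the JSON specification on an all-zero template rather than on the bucket's current properties. The sketch below uses a tiny, hypothetical subset of properties (`ZERO_PROPS`, including the assumed `versioning.enabled` key); the real property schema is much larger.

```python
import copy
import json

# Hypothetical, heavily abbreviated "all zero" template of bucket properties
ZERO_PROPS = {
    "mirror": {"enabled": False, "copies": 0},
    "versioning": {"enabled": False},
}


def props_from_json(spec_json: str, zero: dict = ZERO_PROPS) -> dict:
    """Build the resulting properties: values from the JSON spec, zero values elsewhere."""

    def overlay(base: dict, spec: dict) -> dict:
        for key, value in spec.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                overlay(base[key], value)
            else:
                base[key] = value
        return base

    # Start from zero values, not from the bucket's current properties
    return overlay(copy.deepcopy(zero), json.loads(spec_json))


if __name__ == "__main__":
    print(props_from_json('{"mirror": {"enabled": true, "copies": 2}}'))
```

In this model, a property that was previously enabled but is absent from the JSON ends up disabled, which is why the KEY=VALUE form is the safer choice for partial updates.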
ais archive bucket - Archive selected or matching objects from SRC_BUCKET[/OBJECT_NAME_or_TEMPLATE] as (.tar, .tgz or .tar.gz, .zip, .tar.lz4)-formatted object (a.k.a. shard).
See also:
AIStore supports AWS-specific configuration on a per s3 bucket basis. Any bucket that is backed by an AWS S3 bucket (**) can be configured to use alternative:
For background and usage examples, please see AWS-specific bucket configuration.
(**) Terminology-wise, “s3 bucket” is a shortcut phrase indicating a bucket in an AIS cluster that either (A) has the same name (e.g. s3://abc) or (B) is a differently named AIS bucket that has the backend_bck property specifying the s3 bucket in question.
ais bucket props reset BUCKET
Reset bucket properties to cluster defaults.
ais show cluster bmd
Show bucket metadata (BMD).