
Copyright © 2026, NVIDIA Corporation.

Resilver

First, etymology:

“Resilvering” originally referred to restoring the reflective silver backing of a glass mirror. In modern storage systems the word “resilver” is commonly used to mean “rebuild data redundancy / restore intended layout after a device or topology change” (you’ll see it prominently in ZFS/OpenZFS documentation and tooling).

In AIStore, resilvering (or simply “resilver”) is the mechanism for redistributing objects to their correct locations after volume changes within a storage target.

When mountpaths are attached, detached, enabled, or disabled, objects may no longer reside at their proper HRW locations. Resilver walks all objects and relocates them as needed to restore data placement and redundancy under the current configuration.

HRW is a variant of consistent hashing based on the rendezvous (highest-random-weight) algorithm by Thaler and Ravishankar.

Resilver vs. Global Rebalance

AIStore uses the same conceptual model at two levels:

At the cluster level, global rebalance distributes objects across targets using HRW.

At the (local) node level, resilver distributes objects across mountpaths using the same HRW algorithm.

This symmetry is intentional. The same reasoning applies at both levels:

  • HRW provides deterministic placement.
  • Only objects whose placement changes need to move.
  • Work can be parallelized.
  • Operations are preemptible.
  • Misplaced objects remain accessible.

The difference is scope and mechanics: rebalance moves data over the network across targets, while resilver moves it locally across mountpaths on the same machine.
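The two-level symmetry can be illustrated with a minimal sketch. It assumes a hypothetical `cluster` map of target IDs to mountpaths and uses BLAKE2b as a stand-in hash; neither is taken from AIStore's implementation, and the real hash function and key format may differ.

```python
import hashlib

def weight(key: str, node: str) -> int:
    """Pseudo-random weight for a (key, node) pair. BLAKE2b is a stand-in hash."""
    digest = hashlib.blake2b(f"{node}|{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def hrw(key: str, nodes: list[str]) -> str:
    """Rendezvous hashing: pick the node with the highest weight for this key."""
    return max(nodes, key=lambda node: weight(key, node))

# Hypothetical cluster: target ID -> mountpaths available on that target.
cluster = {
    "t1": ["/mnt/disk1", "/mnt/disk2"],
    "t2": ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"],
}

def locate(obj_name: str) -> tuple[str, str]:
    """Apply the same selection rule twice: across targets, then across mountpaths."""
    target = hrw(obj_name, sorted(cluster))      # cluster level (global rebalance)
    mountpath = hrw(obj_name, cluster[target])   # node level (resilver)
    return target, mountpath

print(locate("images/cat.jpg"))
```

Because each level depends only on the object name and the current membership at that level, a mountpath change on one target affects placement on that target alone.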

This document is structured as follows:

  • When Resilver Runs
  • Object Placement
  • Misplaced Objects
  • Mirrored Objects
  • Chunked Objects
  • Monitoring and Progress
  • CLI Usage
  • Resilver vs Scrubbing
  • References

Resilver is AIStore’s mechanism for restoring correct object placement and redundancy on any given AIS target. The system guarantees that objects end up:

  • on their correct mountpaths,
  • with the configured number of replicas (when mirroring is enabled), and
  • with all of their chunks in place (when objects are chunked).

Resilver walks the target’s data and performs the minimum work needed to get back to a consistent state:

  • Main replica at the correct HRW mountpath for the current volume topology.
  • Mirroring restored to the configured replica count (as much as possible given the number of available mountpaths).
  • Chunked objects validated and repaired so that all chunks are where the system would look for them under the current topology.

Resilver is a local process: it never moves data between targets and never requires cluster-wide coordination.


When Resilver Runs

Resilver runs on demand (via ais storage resilver) and automatically in response to mountpath lifecycle events:

  • attaching a new mountpath,
  • detaching a mountpath,
  • enabling a previously disabled mountpath,
  • disabling a mountpath temporarily.
$ ais show job resilver

resilver[MgyTbYxm7]
NODE        ID           KIND        OBJECTS    BYTES        START       END    STATE
VNgt8085    MgyTbYxm7    resilver    268        273.15MiB    20:15:34    -      Running
------------------------------------------------------------------------
resilver[GgZZGERh7]
NODE        ID           KIND        OBJECTS    BYTES        START       END    STATE
VNgt8085    GgZZGERh7    resilver    40         246.49MiB    20:15:35    -      Running
------------------------------------------------------------------------
resilver[GgZZGERh7]
NODE        ID           KIND        OBJECTS    BYTES        START       END    STATE
VNgt8085    GgZZGERh7    resilver    461        1.97GiB      20:15:35    -      Running

These events change the set of available mountpaths and therefore change HRW placement decisions. Resilver starts immediately to reconcile existing data with the new volume topology.

Operationally, disable/detach reduce the set of available mountpaths. Resilver’s job is to restore the target’s intended placement and redundancy using only the currently-available mountpaths.

Resilver is also preemptible. If a second mountpath event occurs while a resilver is running, the current run is aborted and a new one starts using the updated configuration. This ensures that work is never completed based on stale assumptions.

If a resilver is interrupted — by another mountpath event, a restart, or an abort — AIStore resumes the work later. Resilver is convergent by design: as long as mountpath configuration eventually stabilizes, object placement will converge to the correct state.
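The convergence property can be demonstrated with a toy simulation. This is a sketch under stated assumptions, not AIStore's code: `placement` is a hypothetical in-memory map from object name to mountpath, `max_visits` simulates a run that is aborted partway through, and BLAKE2b stands in for the real hash.

```python
import hashlib

def hrw(key: str, mountpaths: list[str]) -> str:
    """Rendezvous hashing with a stand-in hash (not AIStore's actual function)."""
    def weight(mp: str) -> int:
        digest = hashlib.blake2b(f"{mp}|{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return max(mountpaths, key=weight)

def resilver(placement: dict[str, str], mountpaths: list[str], max_visits=None) -> int:
    """One simulated run: relocate misplaced objects and return how many moved.

    max_visits models preemption: the run stops early and leaves the
    remaining objects wherever they currently are.
    """
    moved = 0
    for visited, name in enumerate(sorted(placement)):
        if max_visits is not None and visited >= max_visits:
            break  # aborted by a newer mountpath event
        want = hrw(name, mountpaths)
        if placement[name] != want:
            placement[name] = want
            moved += 1
    return moved

objects = [f"obj-{i}" for i in range(200)]
mountpaths = ["/mnt/disk1", "/mnt/disk2"]
placement = {name: hrw(name, mountpaths) for name in objects}

mountpaths.append("/mnt/disk3")                  # first event starts a run
resilver(placement, mountpaths, max_visits=50)   # preempted after 50 objects
mountpaths.append("/mnt/disk4")                  # second event aborts it
resilver(placement, mountpaths)                  # fresh run, current topology

misplaced = [n for n in objects if placement[n] != hrw(n, mountpaths)]
extra_moves = resilver(placement, mountpaths)    # nothing left to do
print(len(misplaced), extra_moves)
```

The aborted run does no harm: every decision is recomputed from the current mountpath set, so the final full run lands each object at its current HRW location regardless of what the earlier run did.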

You can also trigger resilver manually using the CLI, for example after recovering from disk failures or interrupted maintenance.


Object Placement

Resilver relies on HRW (Highest Random Weight) to determine where objects belong.

For a given object name and a given set of available mountpaths, HRW deterministically selects a single mountpath. Every component in the system computes the same answer independently; no shared state or coordination is required.

When mountpaths are added or removed, HRW placement changes only for a subset of objects. Resilver identifies those objects and relocates them. Objects whose HRW placement does not change are left untouched.
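A small experiment shows why only a subset of objects moves. The sketch below assumes BLAKE2b as a stand-in hash and made-up object and mountpath names; it illustrates the general HRW property rather than AIStore's exact implementation.

```python
import hashlib

def hrw(key: str, mountpaths: list[str]) -> str:
    """Pick the mountpath with the highest pseudo-random weight for this key."""
    def weight(mp: str) -> int:
        digest = hashlib.blake2b(f"{mp}|{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return max(mountpaths, key=weight)

objects = [f"shard-{i:04d}.tar" for i in range(1000)]
before = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]
after = before + ["/mnt/disk4"]   # a new mountpath is attached

# Objects whose HRW mountpath differs between the two configurations.
moved = [name for name in objects if hrw(name, before) != hrw(name, after)]
destinations = {hrw(name, after) for name in moved}

print(f"{len(moved)} of {len(objects)} objects change placement")
print(f"destinations: {destinations}")
```

Attaching a mountpath changes placement only for objects on which the new mountpath wins the weight comparison, so every relocated object moves to the new mountpath and nothing shuffles among the existing ones.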


Misplaced Objects

An object is considered misplaced if it does not reside on its HRW mountpath under the current configuration.

Misplacement is expected and benign. It occurs naturally when mountpaths change, or when resilvering is interrupted. Reads continue to work: AIStore can locate objects regardless of where they physically reside.

Resilver is the mechanism that restores optimal placement and eliminates long-term imbalance.
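The definition of a misplaced object, and the idea that reads still succeed, can be sketched as follows. The fallback order in `lookup` (HRW mountpath first, then the others) is purely an assumption made for illustration; the source does not describe how AIStore actually locates misplaced objects. `contents` is a hypothetical map of mountpath to stored object names, and BLAKE2b is a stand-in hash.

```python
import hashlib

def hrw(key: str, mountpaths: list[str]) -> str:
    """Rendezvous hashing with a stand-in hash."""
    def weight(mp: str) -> int:
        digest = hashlib.blake2b(f"{mp}|{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return max(mountpaths, key=weight)

def is_misplaced(name: str, stored_on: str, mountpaths: list[str]) -> bool:
    """An object is misplaced if it is not on its HRW mountpath right now."""
    return stored_on != hrw(name, mountpaths)

def lookup(name: str, contents: dict[str, set[str]], mountpaths: list[str]):
    """Hypothetical read path: check the HRW mountpath, then fall back to the rest."""
    home = hrw(name, mountpaths)
    for mp in [home] + [m for m in mountpaths if m != home]:
        if name in contents.get(mp, set()):
            return mp
    return None

mountpaths = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]
name = "dataset/train-0001.tar"
home = hrw(name, mountpaths)
elsewhere = next(mp for mp in mountpaths if mp != home)

contents = {elsewhere: {name}}   # the object sits on a non-HRW mountpath
print(is_misplaced(name, elsewhere, mountpaths), lookup(name, contents, mountpaths))
```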


Mirrored Objects

Buckets may be configured with N-way mirroring. In that case, each object consists of:

  • one main replica, placed at the object’s HRW mountpath, and
  • additional copies, placed on other mountpaths selected using the same HRW principles.

When a mountpath is disabled or detached, copies on that mountpath become unavailable. Resilver removes stale metadata entries for those copies and creates replacements on other mountpaths if possible.

If there are fewer available mountpaths than required copies, resilvering creates as many copies as it can and leaves the system in a degraded but consistent state. When more mountpaths become available later, resilvering completes the replication.
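One common way to extend HRW to several replicas is to rank all mountpaths by weight and take the top N. The text above only says that copies follow "the same HRW principles", so treat the ranking rule below as an assumption; the hash (BLAKE2b) and the names are illustrative as well.

```python
import hashlib

def weight(key: str, mountpath: str) -> int:
    """Pseudo-random weight for a (key, mountpath) pair; stand-in hash."""
    digest = hashlib.blake2b(f"{mountpath}|{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def replica_mountpaths(name: str, mountpaths: list[str], copies: int) -> list[str]:
    """Rank mountpaths by weight; the first is the main replica's HRW location.

    If fewer mountpaths than copies are available, the result is shorter:
    a degraded but consistent layout.
    """
    ranked = sorted(mountpaths, key=lambda mp: weight(name, mp), reverse=True)
    return ranked[:copies]

available = ["/mnt/disk1", "/mnt/disk2"]
degraded = replica_mountpaths("obj", available, copies=3)   # only 2 are possible

available.append("/mnt/disk3")                              # a mountpath returns
restored = replica_mountpaths("obj", available, copies=3)   # full 3-way mirror

print(len(degraded), len(restored))
```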


Chunked Objects

Large objects may be stored as multiple chunks. All chunks are equal in structure and size. None of them is special.

What distinguishes chunk #1 is only its location: it is stored where a non-chunked object of the same name would be stored — at the object’s HRW mountpath. This allows existing lookup logic to locate the object efficiently.

The object itself is defined by its chunk manifest - the metadata that describes all chunks and their placement. Each chunk is placed independently using HRW derived from the object name and chunk index.

During resilvering, chunks are verified independently. A chunked object is considered correct only if all chunks and the manifest are at their correct locations under the current mountpath configuration. If any part is misplaced, the object is repaired as a unit.
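The chunk placement rule can be sketched as below. The key derivation in `chunk_key` is hypothetical: it simply reuses the object name for chunk #1 (so that chunk lands on the object's HRW mountpath) and appends the index for the rest. Manifest placement is left out because the text does not specify it, and BLAKE2b is a stand-in hash.

```python
import hashlib

def hrw(key: str, mountpaths: list[str]) -> str:
    """Rendezvous hashing with a stand-in hash."""
    def weight(mp: str) -> int:
        digest = hashlib.blake2b(f"{mp}|{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return max(mountpaths, key=weight)

def chunk_key(name: str, index: int) -> str:
    """Hypothetical key: chunk #1 uses the object's own name, others add the index."""
    return name if index == 1 else f"{name}#chunk{index}"

def expected_layout(name: str, num_chunks: int, mountpaths: list[str]) -> dict[int, str]:
    """Where each chunk (1-based index) belongs under the current mountpaths."""
    return {i: hrw(chunk_key(name, i), mountpaths) for i in range(1, num_chunks + 1)}

def misplaced_chunks(actual: dict[int, str], name: str, mountpaths: list[str]) -> list[int]:
    """Indices of chunks that are not where the layout says they should be."""
    want = expected_layout(name, len(actual), mountpaths)
    return [i for i in sorted(actual) if actual[i] != want[i]]

mountpaths = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]
layout = expected_layout("big.bin", 5, mountpaths)

# Simulate one chunk ending up on the wrong mountpath.
broken = dict(layout)
broken[3] = next(mp for mp in mountpaths if mp != layout[3])

print(misplaced_chunks(layout, "big.bin", mountpaths))
print(misplaced_chunks(broken, "big.bin", mountpaths))
```

A non-empty result for any chunk marks the whole object as needing repair, which mirrors the "repaired as a unit" rule described above.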

See also: Blob Downloader


Monitoring and Progress

Resilver runs as a batch job (or xaction). Progress is visible through the standard job-monitoring command, ais show job, for example:

ais show job resilver

The primary progress metric counts main replicas restored to their HRW locations. This reflects actual repair work performed, not just objects visited.

Objects skipped due to locking contention are tracked separately. Skipped objects are not lost; they are handled in subsequent resilver runs if needed.


CLI Usage

Resilver is usually triggered implicitly by mountpath operations, but it can also be started manually:

# Resilver all targets
ais storage resilver

# Resilver a specific target
ais start resilver t[XYZ]

# Wait for completion
ais start resilver t[XYZ] --wait --timeout 30m

Mountpath lifecycle commands are the most common trigger:

ais storage mountpath disable t[XYZ]=/mnt/disk1
ais storage mountpath enable t[XYZ]=/mnt/disk1
ais storage mountpath detach t[XYZ]=/mnt/disk1
ais storage mountpath attach t[XYZ]=/mnt/newdisk

Additionally, an advanced-usage capability lets you manipulate mountpaths without triggering resilver, which is useful when batching multiple changes:

# Disable without resilvering
ais storage mountpath disable t[XYZ]=/mnt/disk1 --no-resilver

Resilver vs Scrubbing

In ZFS terminology, scrubbing means deep validation: reading every block, verifying checksums, detecting silent corruption, and repairing bit rot. It’s a thorough health check of data integrity at the physical level.

AIStore’s resilver is intentionally much narrower in scope: it focuses exclusively on data placement and redundancy under the current volume topology, and it does not verify checksums or read object contents.

This makes resilver fast and topology-focused. It runs after mountpath changes (attach/detach/enable/disable) to restore correct layout, not to detect corruption.

AIS provides APIs and configuration to validate checksums during normal I/O operations (reads, writes, copies). Full end-to-end validation - the equivalent of ZFS scrub - would combine resilver’s placement checks with explicit checksum verification of all objects. Such functionality could be added in the future.

For now, AIStore separates concerns clearly:

  • Resilver handles the positive task: restoring correct placement and redundancy.
  • Space cleanup handles the negative task: removing obsolete, orphaned, or no-longer-reachable data:
$ ais space-cleanup --help
NAME:
   ais space-cleanup - (alias for "storage cleanup") Remove:
     - deleted objects and buckets;
     - old/obsolete workfiles;
     - misplaced objects (see command line option below);
     - orphan chunks and partial chunk manifests;
     - optionally, remove zero-size objects as well.

   By default, any stored content with invalid or unrecognized FQN is treated as obsolete and is removed.
   To preserve, use the cluster feature flag 'Keep-Unknown-FQN'.

USAGE:
   ais space-cleanup [BUCKET[/PREFIX]] [PROVIDER] [command options]

OPTIONS:
   force,f          Proceed with removing misplaced objects even if global rebalance (or local resilver) is running or was interrupted,
                    or the node has recently restarted. Does not override the 'dont_cleanup_time' window or other flags
   keep-misplaced   Do not remove misplaced objects (default: remove after 'dont_cleanup_time' grace period)
                    Tip: use 'ais config cluster log.modules space' to enable logging for dry-run visibility
   rm-zero-size     Remove zero size objects (caution: advanced usage only)
   timeout          Maximum time to wait for a job to finish; if omitted: wait forever or until Ctrl-C;
                    valid time units: ns, us (or µs), ms, s (default), m, h
   wait             Wait for an asynchronous operation to finish (optionally, use '--timeout' to limit the waiting time)
   help, h          Show help

References

  • AIStore Overview
  • Buckets: Design and Operations
  • Observability
  • Technical Blog
  • Batch Jobs
  • CLI