Troubleshooting AIStore

This document describes common AIStore (AIS) failure modes and actionable recovery steps. It is intended for operators and developers troubleshooting clusters that fail to start, fail to join, or exhibit integrity errors.

In most cases, the AIS CLI is the first and best tool to use.

Table of Contents

  • First Checks: Cluster State via CLI
  • Error Categories
  • Cluster Integrity Errors (CIE)
    • Common Causes
    • CIE Error Reference
    • Recovery (CIE)
  • Storage Integrity Errors (SIE)
    • Key Concepts
  • SIE Error Reference
  • Target Fails to Start: Lost or Mismatched Mountpath (SIE#50)
    • Symptoms
  • Recovery: Offline VMD Edit (Recommended)
    • Workflow
    • Notes
  • References

Note: Some example paths in this document may reflect local dev deployments. In production, cluster-wide metadata is stored in the node’s config directory, while BMD and VMD - bucket and volume metadata, respectively - live at the root of each mountpath. See xmeta (tool) README for more details and examples.


First Checks: Cluster State via CLI

AIS provides extensive CLI tab-completion and discovery.

Start with:

$ ais show cluster

or explore available subcommands interactively:

$ ais show cluster <TAB><TAB>

Example output:

PROXY            MEM USED %   MEM AVAIL   UPTIME
202446p8082      0.06%        31.28GiB    19m
279128p8080      0.07%        31.28GiB    19m
928059p8081[P]   0.08%        31.28GiB    19m

TARGET           MEM USED %   MEM AVAIL   CAP USED %   CAP AVAIL   CPU USED %   REBALANCE   UPTIME
147665t8084      0.07%        31.28GiB    14%          2.511TiB    0.00%        -           19m
...

At any time there is exactly one primary proxy. If needed, you can change it administratively:

$ ais cluster set-primary <TAB><TAB>
$ ais cluster set-primary p[279128p8080]
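
Before changing the primary, it can help to confirm which proxy currently holds that role. The sketch below is a hypothetical helper (`find_primary` is not part of the AIS CLI) that reads `ais show cluster` output on stdin and prints the ID of the row marked `[P]`. It assumes the output layout matches the example above, which may differ between AIS versions.

```shell
# Print the node ID of the primary proxy from `ais show cluster` output on stdin.
# Assumption: the primary is the row whose first column ends in "[P]",
# as in the example output shown earlier.
find_primary() {
  awk '$1 ~ /\[P\]$/ { sub(/\[P\]$/, "", $1); print $1; exit }'
}

# Usage:
#   ais show cluster | find_primary
```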

Error Categories

AIS integrity errors fall into two distinct categories:

  1. Cluster Integrity Errors (cie#): inconsistent or conflicting cluster-wide metadata (Smap, BMD, etc.)

  2. Storage Integrity Errors (sie#): inconsistent, missing, or invalid mountpath metadata on a target

Understanding which category you are dealing with is critical: CIE errors are cluster-scoped; SIE errors are target-scoped.
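
When scanning node logs, a quick first pass is to sort messages by category. The following is a minimal sketch, assuming error messages carry the `cie#NN` or `sie#NN` tag as in the examples in this document; `classify_integrity_error` is a hypothetical helper name, not an AIS command.

```shell
# Classify a single log line by integrity-error category.
# Prints "cluster" for cie#NN, "target" for sie#NN, and "none" otherwise.
classify_integrity_error() {
  case "$1" in
    *cie#[0-9]*) echo "cluster" ;;
    *sie#[0-9]*) echo "target" ;;
    *)           echo "none" ;;
  esac
}

# Usage:
#   classify_integrity_error 'storage integrity error sie#50: ...'
```

A "cluster" result points at cluster-wide metadata (Smap, BMD); a "target" result points at the mountpaths of one storage target.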


Cluster Integrity Errors (CIE)

Cluster Integrity Errors are raised when a node attempts to join or operate in a cluster with incompatible cluster-wide metadata.

Example:

cluster integrity error `cie#50`:
Smaps have different origins: Smap v9[...] vs p[232268p8080]: Smap v13[...]

These errors usually indicate that a node:

  • belonged to a different AIS cluster in the past, or
  • has stale local metadata that conflicts with the cluster majority.

Common Causes

  • Reusing disks or nodes from a previous cluster
  • Mixing nodes from different deployments
  • Partial cleanup after redeployments

CIE Error Reference

| Error | When | Meaning |
|---|---|---|
| cie#10 | Primary startup | Primary’s local Smap conflicts with other nodes |
| cie#30 | Startup | Targets disagree on cluster UUID |
| cie#40 | Startup or BMD update | Local BMD conflicts with cluster |
| cie#50 | Join / metasync | Node is not permitted to join cluster |
| cie#60 | Primary startup | Conflicting incompatible BMD versions |
| cie#70 | Primary startup | Conflicting BMDs with simple majority |
| cie#80 | Node join | Node believes it belongs to a different cluster |
| cie#90 | Metasync | Split-brain detected during metadata sync |

Recovery (CIE)

Recovery often involves carefully cleaning obsolete metadata:

  • local Smap copies
  • local BMD copies

This must be done with extreme caution. Removing the wrong metadata can permanently orphan data.

CIE recovery is intentionally conservative and usually requires manual inspection and understanding of cluster history.
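
One way to keep such cleanup reversible is to set obsolete metadata aside rather than delete it. The sketch below is an illustration only: `quarantine_metadata` is a hypothetical helper, the quarantine location is your choice, and the files to move must be identified by you after inspection (this document does not prescribe their paths).

```shell
# Move metadata files into a timestamped quarantine directory instead of deleting them.
# $1: quarantine root (hypothetical location chosen by the operator)
# remaining args: metadata files to set aside
# Prints the quarantine directory so the files can be restored if needed.
quarantine_metadata() (
  dest="$1/$(date +%Y%m%d-%H%M%S)"; shift
  mkdir -p "$dest" || exit 1
  for f in "$@"; do
    # Preserve the original directory structure under the quarantine directory.
    mkdir -p "$dest/$(dirname "$f")" && mv "$f" "$dest/$f" || exit 1
  done
  echo "$dest"
)

# Usage:
#   quarantine_metadata /root/ais-quarantine <path-to-obsolete-metadata-file>
```

Restoring is then a matter of moving the file back from the printed directory to its original path.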


Storage Integrity Errors (SIE)

Storage Integrity Errors relate to mountpaths attached to a storage target. Each target maintains Volume Metadata (VMD) describing its mountpaths, their filesystems, and the target’s persistent Node ID.

Example:

storage integrity error sie#50:
lost or missing mountpath "/ais/nvme7n1"

Key Concepts

  • VMD is persisted and replicated across all mountpaths of a target

  • Each mountpath records:

    • filesystem identity
    • filesystem type
    • target Node ID
  • VMD validation happens at target startup, before runtime checks (FSHC)


SIE Error Reference

| Error | When | Meaning |
|---|---|---|
| sie#10 | Startup | Mountpaths record different Node IDs |
| sie#20 | Startup | Target Node ID conflicts with mountpath metadata |
| sie#30 | Startup | Mountpaths disagree on persisted metadata |
| sie#40 | Startup | Corrupted metadata on a mountpath |
| sie#50 | Startup | Mountpath mismatch between config and VMD |

Target Fails to Start: Lost or Mismatched Mountpath (SIE#50)

Symptoms

Target fails during startup with an error similar to:

storage integrity error sie#50:
lost or missing mountpath "<path>"

This commonly occurs after:

  • disk failure or replacement
  • filesystem remounted on the wrong device
  • OS block-device re-enumeration
  • upgrade or restart while a disk is unavailable

In this state:

  • the target cannot reach runtime
  • filesystem health checks (FSHC) cannot run
  • the target exits fatally to prevent data corruption
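
Because the target never reaches runtime, the first diagnosis has to happen at the OS level. The sketch below is a hypothetical helper (`check_mountpaths` is not an AIS tool) that reports which expected mountpaths are absent from a mounts table in `/proc/self/mounts` format. It assumes each mountpath is a real mount point, which holds for typical production layouts but not necessarily for local dev deployments, and it does not verify that the right device is mounted there.

```shell
# Report expected mountpaths that are not listed as mount points.
# $1: mounts table in /proc/self/mounts format (second field = mount point)
# remaining args: mountpaths the target is expected to have
# Prints one "MISSING <path>" line per absent mountpath; exits 1 if any is missing.
check_mountpaths() (
  mounts_file="$1"; shift
  rc=0
  for mp in "$@"; do
    if ! awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"; then
      echo "MISSING $mp"
      rc=1
    fi
  done
  exit "$rc"
)

# Usage:
#   check_mountpaths /proc/self/mounts /ais/nvme0n1 /ais/nvme7n1
```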

Recovery: Offline VMD Edit (Recommended)

This recovery method is safe, explicit, and reversible.

Workflow

  1. If applicable: put the AIS target in maintenance mode (or shut down the entire cluster)
  2. Identify the failed mountpath (e.g., /ais/nvme7n1)
  3. If needed, SSH into the target; then use the xmeta tool to disable the failed mountpath in one selected VMD replica:
xmeta -x -in=/ais/nvme0n1/.ais.vmd -disable /ais/nvme7n1
  4. Copy the updated VMD to all remaining mountpaths, e.g.:
for mp in /ais/nvme{1,2,3,4,5,6,8,9,10,11}n1; do
  cp /ais/nvme0n1/.ais.vmd "$mp/.ais.vmd"
done
  5. Restart the target (or the cluster)

  6. Verify:

ais storage mountpath show    # all storage nodes, all mountpaths

or:

ais storage mountpath TARGET

The target will now restart in a degraded but safe state, with /ais/nvme7n1 disabled.
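
Since a stale VMD copy on any mountpath can reintroduce the mismatch, it may be worth confirming, after the copy step and before the restart, that every remaining mountpath holds the same file as the edited replica. The following is a minimal sketch, assuming the VMD lives at `.ais.vmd` in the root of each mountpath (as in the xmeta example above); `vmd_in_sync` is a hypothetical helper name.

```shell
# Compare the edited (reference) VMD against the copies on other mountpaths.
# $1: mountpath holding the edited VMD
# remaining args: the other enabled mountpaths
# Prints "DIFFERS <file>" for each copy that is missing or not byte-identical;
# exits 1 if any copy differs.
vmd_in_sync() (
  ref="$1/.ais.vmd"; shift
  rc=0
  for mp in "$@"; do
    if ! cmp -s "$ref" "$mp/.ais.vmd"; then
      echo "DIFFERS $mp/.ais.vmd"
      rc=1
    fi
  done
  exit "$rc"
)

# Usage:
#   vmd_in_sync /ais/nvme0n1 /ais/nvme1n1 /ais/nvme2n1
```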

Notes

  • Back up first

Before troubleshooting that involves inspecting or modifying any on-disk metadata:

  • Archive all AIS metadata periodically (and especially before any manual edits).
  • AIS keeps multiple copies of critical metadata, but redundancy is not a substitute for backups.
  • When in doubt: copy first, edit later.

For VMD specifically, at minimum back up each mountpath’s metadata and keep the archive somewhere other than on this node.
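
A VMD backup can be as simple as one tar archive per target. The sketch below is an illustration under stated assumptions: the VMD file is named `.ais.vmd` at the root of each mountpath (as in the xmeta example in this document), and `backup_vmd` is a hypothetical helper name. Copying the resulting archive off the node is left to you.

```shell
# Archive the VMD copy from each given mountpath into one tarball.
# $1: output archive path
# remaining args: mountpaths to collect .ais.vmd from
backup_vmd() (
  out="$1"; shift
  # Rebuild the argument list in place: drop each mountpath from the front
  # and append its VMD file to the end if that file is readable.
  for mp in "$@"; do
    [ -r "$mp/.ais.vmd" ] && set -- "$@" "$mp/.ais.vmd"
    shift
  done
  [ "$#" -gt 0 ] || { echo "no readable VMD found" >&2; exit 1; }
  tar -cf "$out" "$@"
)

# Usage:
#   backup_vmd /root/vmd-backup.tar /ais/nvme0n1 /ais/nvme1n1
```

Mountpaths whose VMD cannot be read (for example, on a failed disk) are skipped, so the archive reflects only the copies that were available at backup time.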

Additionally:

  • Keeping all VMD copies in sync is strongly recommended.
  • Disabled mountpaths are not probed or used.
  • The mountpath can be re-enabled later after disk replacement.
  • Prefer explicit, minimal changes over broad metadata deletion.
  • Avoid ignore-missing-mountpath unless you fully understand the implications.
  • xmeta is a power tool: indispensable for recovery, dangerous if misused.

If unsure, stop and inspect the existing metadata before proceeding, and back it up first.

References

  • AIStore: Terminology
  • On-disk layout
  • xmeta - utility to inspect, extract, format, and (in limited cases) edit internal AIS metadata structures
  • AIS buckets: on-disk layout