Filesystem Health Checker (FSHC)
Robust detection and isolation of faulty local storage
The Filesystem Health Checker (FSHC) is AIStore’s subsystem responsible for detecting failures of local mountpaths (local filesystems used by targets to store objects). When the data path observes an I/O-class error that suggests potential filesystem damage or disappearance, FSHC runs a sequence of validation tests. If the tests confirm the issue, the corresponding mountpath is disabled and removed from all future operations until manually re-enabled.
FSHC is designed to:
FSHC does not attempt automatic recovery of disabled mountpaths. Once a mountpath is disabled, the operator (or orchestration) inspects, repairs, and re-enables it manually.
Correctly tuned, FSHC is one of AIS’s important safety mechanisms, ensuring that cluster nodes remain healthy and consistent even under hardware failures.
FSHC operates on local filesystems mounted on the target node. The mountpath must:
stat, open, read, write, fsync, rename, unlink),

Filesystems backed by network storage (NFS, SMB, FUSE-based cloud mounts, etc.) are supported, but I/O errors may reflect remote or network issues rather than local disk failures. In such setups, consider relaxing FSHC thresholds or disabling FSHC to avoid false positives.
FSHC behavior is controlled via the fshc section of cluster configuration:
- enabled: Boolean. Enables or disables FSHC cluster-wide. Disabling is not recommended except in tightly controlled development setups.
- test_files (default: 4): Number of files to sample per FSHC run: up to test_files files are read (random sampling), and up to test_files temporary files are written and fsync’ed. Increasing this improves confidence at the cost of more I/O during checks.
- error_limit (HardErrs; default: 2): Maximum number of combined read/write I/O errors tolerated during a single FSHC run. Reaching this threshold disables the mountpath.

Note: Despite the generic name, error_limit specifically controls the FSHC run-time behavior (how many sampling errors are tolerated before disabling). It does not control what triggers FSHC in the first place; that is governed by io_err_limit and io_err_time below.

- io_err_limit + io_err_time: These parameters control how often FSHC is triggered, not how it behaves once running.
  - io_err_limit: maximum number of soft I/O errors allowed during io_err_time.
  - io_err_time: sliding time window (e.g., "10s").

These limits prevent a slow trickle of errors from overwhelming a target.
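The interplay of io_err_limit and io_err_time can be pictured as a sliding-window counter. The following is a minimal Python sketch, not AIStore's Go implementation: the class name `SoftErrorWindow` and its methods are hypothetical, and it assumes FSHC is triggered only when the count strictly exceeds io_err_limit.

```python
from collections import deque


class SoftErrorWindow:
    """Counts soft I/O errors inside a sliding time window (hypothetical helper)."""

    def __init__(self, io_err_limit: int, io_err_time: float):
        self.io_err_limit = io_err_limit  # max soft errors tolerated per window
        self.io_err_time = io_err_time    # window length, in seconds
        self._stamps = deque()            # timestamps of recent soft errors

    def record(self, now: float) -> bool:
        """Record one soft error at time `now`; return True if FSHC should run."""
        self._stamps.append(now)
        # Drop errors that have aged out of the window.
        while self._stamps and now - self._stamps[0] > self.io_err_time:
            self._stamps.popleft()
        # Assumption: trigger only when the count exceeds the limit.
        return len(self._stamps) > self.io_err_limit
```

With a limit of 2 and a 10-second window, three errors within a few seconds would trigger a check, while the same three errors spread 20 seconds apart would not.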
FSHC is triggered primarily by I/O-class errors and mountpath/filesystem identity mismatches. In a few internal paths, it may also be invoked for other unexpected local errors that suggest a possible filesystem issue.
Detected via os.SyscallError wrappers and critical errno values. On Linux, these include:
Note: These errno values are Linux-specific. Other platforms may have different error codes.
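As an illustration of errno-based classification, here is a small Python sketch (AIStore itself does this in Go via os.SyscallError). The set below is an assumption for demonstration only: ENODEV and EUCLEAN are mentioned elsewhere in this document, EIO is added as a plausible example, and the authoritative list is whatever the AIStore source defines. The function name `is_critical_io_error` is hypothetical.

```python
import errno

# Assumed, illustrative set; not the authoritative AIStore list.
CRITICAL_ERRNOS = {
    errno.EIO,                       # assumption: generic I/O error
    errno.ENODEV,                    # device disappeared
    getattr(errno, "EUCLEAN", 117),  # "structure needs cleaning"; 117 on Linux
}


def is_critical_io_error(exc: BaseException) -> bool:
    """Return True if `exc` is an OSError carrying one of the critical errnos."""
    return isinstance(exc, OSError) and exc.errno in CRITICAL_ERRNOS
```

A "file not found" error, for instance, is an OSError but is not in the set, so it would not be treated as a sign of filesystem damage in this sketch.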
For example: FSID changed unexpectedly, device disappeared or remounted differently.
Errors encountered during XAttr operations, local LOM operations, marker moves, rename operations, commit/flush failures.
From capacity checks, mpather directory walkers, internal consistency verifications.
If soft I/O errors exceed the configured io_err_limit within io_err_time, FSHC is invoked.
ErrBackendTimeout, ErrHTTP429, etc.)
ErrMv)

FSHC classifies failures into two categories:
FAULTED - Critical failures at the mountpath root level that indicate the filesystem is inaccessible or compromised. These immediately trigger disable without further testing:
- stat() on mountpath root fails (after retry)

DEGRADED - I/O errors during file sampling that exceed the configured threshold. The mountpath is still partially functional but unreliable:
An internal [mpath => ror] structure tracks:
- last: time when last check finished,
- running: non-zero if a test is currently running.

FSHC ensures:
minTimeBetweenRuns) between runs by default.

This prevents cascading failures from snowballing into constant disk hammering.
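The per-mountpath bookkeeping described above (a `last` timestamp plus a `running` flag) can be sketched as follows. This is a hedged Python illustration rather than the actual Go code: `RunTracker`, `try_start`, and `finish` are hypothetical names, and the minimum interval is passed in as a parameter because its real default is not stated here.

```python
class RunTracker:
    """Tracks, per mountpath, when the last check finished and whether one is running."""

    def __init__(self, min_time_between_runs: float):
        self.min_gap = min_time_between_runs  # seconds; actual default not shown here
        self._state = {}  # mpath -> {"last": float or None, "running": bool}

    def try_start(self, mpath: str, now: float) -> bool:
        """Return True and mark the check as running if a new run is allowed."""
        st = self._state.setdefault(mpath, {"last": None, "running": False})
        if st["running"]:
            return False  # a check is already in progress on this mountpath
        if st["last"] is not None and now - st["last"] < self.min_gap:
            return False  # too soon after the previous run
        st["running"] = True
        return True

    def finish(self, mpath: str, now: float) -> None:
        """Record that the check on `mpath` finished at time `now`."""
        st = self._state[mpath]
        st["running"] = False
        st["last"] = now
```

Each mountpath is tracked independently, so a burst of errors on one disk does not delay a check on another.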
FSHC always retries once before declaring a mountpath FAULTED. It performs two passes (see next section) when checking a mountpath for I/O errors. In addition, it applies a single delayed retry to eliminate spurious errors from network-attached storage. This one-shot retry is used for:
- root stat on the mountpath,
- opening the mountpath root,
- depth-0 directory operations in the sampler.
If stat() fails, FSHC waits one second and retries. If it fails again, the mountpath is immediately disabled as FAULTED.
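The one-shot delayed retry can be sketched in a few lines of Python. This is an illustration of the pattern only; the function name `root_stat_ok` and the `delay` parameter are hypothetical, and the one-second default mirrors the wait described above.

```python
import os
import time


def root_stat_ok(mountpath: str, delay: float = 1.0) -> bool:
    """stat() the mountpath root, retrying exactly once after `delay` seconds."""
    for attempt in range(2):
        try:
            os.stat(mountpath)
            return True
        except OSError:
            if attempt == 0:
                time.sleep(delay)  # give a transient condition a chance to clear
    return False  # both attempts failed: caller would treat this as FAULTED
```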
Mountpath.CheckFS() compares the recorded FSID/device ID against current values.
Mismatches imply: device lost, device replaced, filesystem corrupted, or mount remapped. These are FAULTED events.
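A rough Python analogue of the identity check is shown below. It is a sketch under a stated assumption: it uses the device ID reported by stat (`st_dev`) as a stand-in for whatever identifiers Mountpath.CheckFS() actually compares. The function names are hypothetical.

```python
import os


def fs_identity(path: str) -> int:
    """Return the device ID of the filesystem that contains `path`."""
    return os.stat(path).st_dev


def check_fs(path: str, recorded: int) -> bool:
    """Return True if the filesystem under `path` still matches `recorded`."""
    try:
        current = fs_identity(path)
    except OSError:
        return False  # root not reachable at all: treat as a mismatch
    return current == recorded
```

The identity would be recorded once when the mountpath is attached and compared on every check; a different value suggests that something else is now mounted at that location.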
FSHC attempts to open the mountpath root as a file. Failure indicates the filesystem is inaccessible (FAULTED).
FSHC:
FSHC:
test_files count,
FloodWriter to write real data,
fsync,

Write errors often show the earliest signs of underlying corruption.
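The write-sampling idea (write real data to temporary files, fsync, clean up, count failures) can be sketched in Python as follows. This does not reproduce AIStore's FloodWriter; `sample_writes`, the file prefix, and the 4 KiB payload size are assumptions chosen for illustration, and cleanup failures are deliberately not counted.

```python
import os
import tempfile


def sample_writes(dirpath: str, test_files: int, size: int = 4096) -> int:
    """Write, fsync, and remove `test_files` temp files; return the error count."""
    errors = 0
    payload = os.urandom(size)  # real (random) data rather than zeros
    for _ in range(test_files):
        path = None
        try:
            fd, path = tempfile.mkstemp(dir=dirpath, prefix="fshc-sketch-")
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force the data down to the device
        except OSError:
            errors += 1
        finally:
            if path is not None:
                try:
                    os.remove(path)
                except OSError:
                    pass  # cleanup failures are ignored in this sketch
    return errors
```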
FSHC runs exactly two passes:
test_files samples
maxerrs slightly

If readErrors + writeErrors >= error_limit after both passes, FSHC disables the mountpath as DEGRADED.
If errors exist but remain below the limit, FSHC logs a warning but leaves the mountpath enabled.
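The decision rule stated above reduces to a simple comparison. A minimal Python sketch, with a hypothetical function name and return labels:

```python
def classify_sampling(read_errors: int, write_errors: int, error_limit: int) -> str:
    """Map combined sampling errors to an outcome, per the rule described above."""
    total = read_errors + write_errors
    if total >= error_limit:
        return "disable-degraded"  # mountpath is disabled as DEGRADED
    if total > 0:
        return "warn"              # logged, but the mountpath stays enabled
    return "ok"
```

With the default error_limit of 2, one read error plus one write error is enough to disable the mountpath, while a single error of either kind only produces a warning.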
Upon disable:
DisableMpath,
FlagDisabledByFSHC is set on the mountpath,
DiskFault is raised,

The disable reason (FAULTED or DEGRADED) is logged for operator reference.
Disable is immediate and permanent until operator intervention.
Auto-recovery is not implemented.
Re-enable is strictly manual:
This triggers:
FlagDisabledByFSHC,

Because drive quality and workload environments differ, we define two recommended profiles.
Recommended defaults:
- "test_files": 8 or 16
- "error_limit": 2
- "io_err_limit": 5
- "io_err_time": "10s"

Rationale: these systems should not produce sporadic I/O errors; strict detection avoids silent data corruption; impact of FSHC overhead is negligible on NVMe.
Recommended:
- "test_files": 3 or 4
- "error_limit": 2 (default)
- "io_err_limit": 20 or 30
- "io_err_time": "30s"

Cheap consumer SSDs/HDDs often emit transient errors during heavy load. FSHC should stay enabled, but thresholds should be tolerant to reduce false positives.
FSHC is specifically designed to catch:
ENODEV, EUCLEAN),

In all these cases, disabling is the safest course of action.
Check target logs around the time of disable. Look for FAULTED vs DEGRADED to understand severity.
Manually probe the device:
Unmount / remount if needed.
Once fixed:
AIS avoids false positives. Only an operator can safely confirm repairs.
FSHC does not disable mountpaths on OOS (out-of-space). OOS handling has its own logic.
Failures during directory walks trigger FSHC.
XAttr or rename failures trigger FSHC.
EC does not affect disable decisions. Disable prevents these subsystems from using unstable mountpaths.
FSHC functionality is accessed through the mountpath subcommands under ais storage.
Disables the mountpath but keeps it in the node’s volume configuration so that it can be re-enabled later.
After operator intervention (filesystem fix, remount, etc.).
Completely removes the path from the node’s volume. This is not an FSHC action, but operators often confuse the two:
Attach a new filesystem to a target node:
Runs the filesystem health checker on a specific mountpath, on demand:
Useful for: validating a mountpath before enabling it, testing a repaired filesystem, confirming a suspected failure.
Advanced operation (used for debugging or after replacing disks under the same mountpoint):
To see disabled or detached mountpaths:
With additional views:
- ais show storage mountpath --help
- ais search mountpath (for discovering command variants)
- ais search --regex "mountpath|disk" (a superset of the above including “disk” matches)