Split-brain is Inevitable | NVIDIA AIStore

Feb 16, 2025·Alex Aizman

high-availability · split-brain · aistore

Split-brain is inevitable. The way it approaches varies greatly but there are telltale signs that, in hindsight, you wish you’d taken more seriously.

Next time, you certainly will.

But what? What exactly could’ve been done differently when keepalives started failing seemingly at random? When one node after another reported that the primary was unreachable and initiated re-election?

In the moment, dealing with it is quite unsettling. Looking at a jumble of intertwined alerts that keep coming in waves, where seemingly sporadic attempts to elect a new primary get voted out by fragmented majorities.

And that’s going on for a while. Feels like a long while but actually it is not. Because split-brain is arriving. In fact, it’s already here, and it will reshape the cluster.

When exactly will it happen? Well, the sequence entails apparently random node kills (triggered by failing Kubernetes probes), along with users departing and taking their respective workloads with them.

For a new-found stability to emerge, the users must step aside. Needless to say, users typically comply. Eventually, they all do.

And when the smoke clears we see a picture-frame containing two brand-new clusters.

Why only two, one may ask? Honestly, I don’t know. There’s no shame in admitting — I don’t understand why we never see the original cluster splinter into countless infinitesimally small pieces. After all the drama and confusion that unfolded before our eyes…

On the other hand, bifurcation is sort of good news if you look at it from the right perspective. Ultimately, what else could you do when your intra-cluster network starts failing left and right, randomly and unpredictably?

The best you can do is make the best of it — and sometimes that simply means that you go ahead and bifurcate. Plain and simple.

As an aside, it takes time and dedication to develop one’s inner ability to see good news and promise where there is, ostensibly, only despair and devastation. For example, and keeping with the topic at hand, consider the following written argument:

Yes, we have a split-brain situation. Yes, it is maybe-arguably-unfortunate. But (!) we remain available. After all, even as the nodes splintered, each island elected its own leader (its primary), gathered what remained of user content, and immediately resumed operation.

That’s HA by definition. In a roundabout way, that’s a saving grace.

The Double-Edged Sword

HA systems are built to keep running despite failures — whether due to network partitions, hardware malfunctions, software bugs, DDoS attacks, human errors, or a mixed bag of all of the above.

The wisdom, therefore, in striving for high availability is widely accepted, incontrovertible, and ultimately self-evident.

Yet, some of the most entrenched skeptics among us may still attempt to claim that there’s a dark downside, that those same mechanisms can make the system more fragile.

In fact, those same skeptics may further counter that the idea to keep systems operational may have consequences. Such as split-brain, for instance.

After all, the road that leads to you-know-where is firmly paved with good intentions. As was aptly noted almost a millennium ago: L’enfer est plein de bonnes volontés ou désirs.

The Prescription

Conventional split-brain handling wisdom typically entails a variety of configurable heuristics along with associated prevention or isolation (impact-limiting) logic.

The approach relies on implementing failure detection logic that establishes boundary conditions for what constitutes a “normal” vs “abnormal” behavior.

Beware: “failure detection” often rhymes with “false positives”

Now, for illustration purposes let’s make it really simple:

Losing two drives within an hour can be regarded as unlikely but plausible
Losing two nodes during the same interval raises (or must raise) immediate red flags, even when each of those nodes was able to quickly come back up and rejoin
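
To make the two-line example above concrete, here is a minimal sketch of a sliding-window detector. Everything in it is an assumption for illustration: the class name, the event kinds, the one-hour window, and the thresholds are hypothetical and are not AIStore knobs or defaults.

```python
from collections import deque


class WindowedFailureDetector:
    """Flag an event kind as abnormal when too many events of that kind
    land inside a sliding time window. All thresholds are illustrative."""

    def __init__(self, window_sec, max_events):
        self.window_sec = window_sec
        # Largest per-kind count still considered "normal" within the window.
        self.max_events = dict(max_events)
        self.events = {kind: deque() for kind in self.max_events}

    def record(self, kind, timestamp):
        """Record one event; return True if this kind is now abnormal."""
        q = self.events[kind]
        q.append(timestamp)
        # Drop events that have aged out of the window.
        while q and timestamp - q[0] > self.window_sec:
            q.popleft()
        return len(q) > self.max_events[kind]


detector = WindowedFailureDetector(
    window_sec=3600,
    max_events={"drive_lost": 2, "node_lost": 1},
)
# Two drives within an hour: unlikely but plausible, so no flag.
drive_flags = [detector.record("drive_lost", t) for t in (0, 1200)]
# Two nodes within the same hour: red flag, even if both rejoined.
node_flags = [detector.record("node_lost", t) for t in (100, 1500)]
# A node loss more than an hour after the previous one is "normal" again.
late_flag = detector.record("node_lost", 1500 + 3601)
```

Note how much rides on two numbers: nudge the window or a threshold and the same sequence of events flips between "normal" and "abnormal". That sensitivity is where the false positives come from.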

Clearly, the corresponding bit can be written into the system, along with proper knobs and their respective defaults (that I won’t enumerate due to severe space-and-scope limitations), as well as unit tests, integration tests, and user docs. Nine yards, as we always say in such circumstances. The whole nine yards.

Why only nine, by the way?

Anyway, speaking of circumstances — some of those may be quite involved.

Take, for instance, network latency or spikes thereof. Could be intermittent. Could also be misdiagnosed. Misattributed. Misinterpreted. Totally innocuous after all. And could also indicate an impending partition.

Anything’s possible. And if it is possible it will surely happen.

But still, suppose we (eventually) heuristicize our way to detecting the unexpected. Suppose we nail it. Suppose we already have. What do we do about it?

Well, again, the aforementioned conventional wisdom comes to the rescue, offering a menu of decisive militaristic actions:

Brickify — i.e., rapidly convert the entire cluster (and each AIS target in particular) into a read-only brick.
Fence off — the Linux-HA (circa 2001) idea, better known to the broader public as shooting in the head.
Arbitrate — namely, deploy AIS proxies outside the hosting Kubernetes cluster, granting them additional weight to decide primary elections and whether to initiate a global rebalance.
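
As a rough illustration of the "arbitrate" item, the sketch below counts weighted votes and declares a winner only on a strict majority of the total configured weight. It is not AIStore's election protocol; the function name, the node IDs, and the weight given to the external arbiter are all hypothetical.

```python
def elect_primary(votes, weights):
    """Return the candidate with a strict majority of total weight, else None.

    votes:   voter ID -> candidate ID (only the votes this island can see)
    weights: voter ID -> weight, covering every known voter in the cluster
    """
    total = sum(weights.values())
    tally = {}
    for voter, candidate in votes.items():
        if voter not in weights:
            continue  # ignore votes from unknown nodes
        tally[candidate] = tally.get(candidate, 0) + weights[voter]
    for candidate, score in tally.items():
        if 2 * score > total:
            return candidate
    return None


# Six in-cluster proxies, one vote each (hypothetical IDs).
proxies = {f"p{i}": 1 for i in range(1, 7)}
island_a = {"p1": "A", "p2": "A", "p3": "A"}
island_b = {"p4": "B", "p5": "B", "p6": "B"}

# Without an arbiter, a 3/3 partition leaves neither island with a majority.
no_arbiter = (elect_primary(island_a, proxies), elect_primary(island_b, proxies))

# Add an external arbiter "x1" with weight 2 that can only reach island A.
weighted = dict(proxies, x1=2)
with_arbiter = (
    elect_primary(dict(island_a, x1="A"), weighted),
    elect_primary(island_b, weighted),
)
```

In this toy setup the arbiter guarantees that at most one island wins. It also adds one more component that has to stay reachable, which is the complexity argument in miniature.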

Those are some of the recommended actions. The only problem with all of them? Complexity.

Or, more precisely, complexity greatly amplified by limited time.

Evolve!

The old adage goes, “The complexity of a large system tends to grow exponentially.”

Take, for instance, AIStore. Over the past seven years [and counting], it has grown in every measurable dimension: adding modules, components, layers, extensions, APIs, backends; you name it. AIStore is a microcosm of how complexity emerges in a gradually evolving system.

And having emerged, it requires respect. Demands it, in fact, whenever a choice is not consistent with incremental and balanced evolution.

High Availability (HA) is at the core of it. HA isn’t a feature you can “slide in” using semi-automated code-writing mode that comes with long practice.

No, HA is never incremental.

And so we face a binary choice:

Option 1: Support conventional, time-honored, heuristics-driven split-brain detection and prevention.
Option 2: Accept the risk and focus on handling split-brain when it occurs — in the most efficient and user-friendly way possible.

The first option is highly appealing, even seductive. The promise of automated failover and five-nines (or N-nines, for your preferred integer N > 5) availability is compelling — it conforms to the well-established expectation. It even offers peace of mind, if only temporarily.

But again (the inner skeptic wouldn’t shut up!) — each layer of distributed prevention logic creates new failure modes, and every heuristic carries its own risks of false positives. And ultimately, all of it adds up. ‘What if?’ — the eternal question. What if, by instrumenting yet another piece of elaborate logic, we inadvertently make the system fragile? What if??

What Have We Done

And so, in v3.26, we chose option 2: ex post facto cluster reunification.

When writing it, the first challenge was to convince nodes in one cluster to accept the nodes from another. The latter arrive with a different cluster UUID and different, sometimes greater, versions of cluster-level metadata that includes:

cluster map
global configuration, cluster-level defaults
buckets, and their respective configurations and properties

In normal circumstances, any version and/or UUID mismatch would trigger an immediate rejection.

But here we are — recovering from split-brain, merging clusters. Far from any normal.
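
For a sense of what "immediate rejection" versus a forced merge could look like, here is a minimal sketch of an admission check. It is an assumption-laden toy, not the actual implementation: the `ClusterMeta` type, the `check_join` function, the `force` flag, and the choice to treat only a greater incoming version as a problem are all hypothetical. The second UUID and version are made up as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMeta:
    uuid: str
    smap_version: int  # cluster map version


def check_join(local, incoming, force=False):
    """Return (accepted, problems) for a node or cluster asking to join."""
    problems = []
    if incoming.uuid != local.uuid:
        problems.append("cluster UUID mismatch")
    # Assumption: only a *greater* incoming version counts as a mismatch here.
    if incoming.smap_version > local.smap_version:
        problems.append("incoming cluster map version is greater")
    if problems and not force:
        return False, problems  # normal circumstances: reject right away
    return True, problems  # forced merge: accept, but report what was overridden


local = ClusterMeta(uuid="DW0rviO0T", smap_version=16)
other_island = ClusterMeta(uuid="XXXXXXXXX", smap_version=21)  # made-up values

normal = check_join(local, other_island)
forced = check_join(local, other_island, force=True)
```

The check itself is the easy part. Once the merge is admitted, the metadata from both sides still has to be reconciled, and any step along the way may need to be rolled back.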

Long story short, the actual implementation comprises about 10 steps with 3 rollbacks, contained primarily in a single source file — making it easy to revisit and maintain if needed.

Further, to reliably unify splintered clusters (while maintaining data consistency), we needed a way to split them in the first place.

And that was the second task: developing fast and reliable cluster-splitting tooling. The way it’s supposed to work (and does work) is easy to demonstrate.

Here’s a look at a fresh dev cluster, ready for testing:

$ ais show cluster

PROXY            MEM USED(%)     MEM AVAIL    LOAD AVERAGE    UPTIME  STATUS
p[CkipqfqdD]     0.16%           23.96GiB     [0.4 0.5 0.5]   -       online
p[JbJpCIVwB]     0.15%           23.96GiB     [0.4 0.5 0.5]   -       online
p[KzjpEKshn]     0.15%           23.96GiB     [0.4 0.5 0.5]   -       online
p[YHnpZGMEh][P]  0.18%           23.96GiB     [0.4 0.5 0.5]   -       online
p[gOCproqcw]     0.16%           23.96GiB     [0.4 0.5 0.5]   -       online
p[xjTpNwRln]     0.16%           23.96GiB     [0.4 0.5 0.5]   -       online

TARGET           MEM USED(%)     MEM AVAIL    CAP USED(%)     CAP AVAIL       LOAD AVERAGE    REBALANCE   UPTIME  STATUS
t[FRRtFguSh]     0.16%           23.96GiB     17%             361.151GiB      [0.4 0.5 0.5]   -           -       online
t[IzPtBtMQt]     0.17%           23.96GiB     17%             361.151GiB      [0.4 0.5 0.5]   -           -       online
t[JwbtJAsuH]     0.16%           23.96GiB     17%             361.151GiB      [0.4 0.5 0.5]   -           -       online
t[VhDtmHGqR]     0.17%           23.96GiB     17%             361.151GiB      [0.4 0.5 0.5]   -           -       online
t[lLItLeRio]     0.17%           23.96GiB     17%             361.151GiB      [0.4 0.5 0.5]   -           -       online
t[mAXtDAuhj]     0.17%           23.96GiB     17%             361.151GiB      [0.4 0.5 0.5]   -           -       online

Summary:
   Proxies:             6 (all electable)
   Targets:             6
   Cluster Map:         version 16, UUID DW0rviO0T, primary p[YHnp8080]
   Software:            3.27.11a6e3419 (build: 2025-02-15T11:24:12-0500)
   Deployment:          dev
   Status:              12 online
   Rebalance:           n/a
   Version:             3.27.11a6e3419
   Build:               2025-02-15T11:24:12-0500

And here’s the same set of nodes cleanly divided into two separate islands (after the cluster-splitting tool has been applied):

PROXY            MEM USED(%)     MEM AVAIL    LOAD AVERAGE    UPTIME  STATUS
p[KzjpbMDzp]     0.15%           23.76GiB     [0.5 0.5 0.5]   1m40s   online
p[YHnpanVFj]     0.19%           23.76GiB     [0.5 0.5 0.5]   1m40s   online
p[gOCpOerlc][P]  0.16%           23.76GiB     [0.5 0.5 0.5]   1m40s   online

TARGET           MEM USED(%)     MEM AVAIL    CAP USED(%)     CAP AVAIL       LOAD AVERAGE    REBALANCE   UPTIME  STATUS
t[FRRtPOCIo]     0.17%           23.76GiB     17%             361.150GiB      [0.5 0.5 0.5]   -           1m40s   online
t[VhDtgtYKv]     0.17%           23.76GiB     17%             361.150GiB      [0.5 0.5 0.5]   -           1m40s   online
t[lLItiislP]     0.17%           23.76GiB     17%             361.150GiB      [0.5 0.5 0.5]   -           1m40s   online

Summary:
   Proxies:             3 (all electable)
   Targets:             3
...

and

PROXY            MEM USED(%)     MEM AVAIL    LOAD AVERAGE    UPTIME  STATUS
p[CkipCKjBV]     0.16%           23.91GiB     [0.4 0.5 0.5]   4m10s   online
p[JbJpGtObG][P]  0.17%           23.91GiB     [0.4 0.5 0.5]   4m10s   online
p[xjTpiyTkM]     0.16%           23.91GiB     [0.4 0.5 0.5]   4m10s   online

TARGET           MEM USED(%)     MEM AVAIL    CAP USED(%)     CAP AVAIL       LOAD AVERAGE    REBALANCE   UPTIME  STATUS
t[IzPtcbLpi]     0.17%           23.91GiB     17%             361.149GiB      [0.4 0.5 0.5]   -           4m10s   online
t[JwbtULxfI]     0.17%           23.91GiB     17%             361.149GiB      [0.4 0.5 0.5]   -           4m10s   online
t[mAXtzxIcX]     0.17%           23.91GiB     17%             361.149GiB      [0.4 0.5 0.5]   -           4m10s   online

Summary:
   Proxies:             3 (all electable)
   Targets:             3
...

Finally, what about usage? Well, usage is easy:

Run the ais cluster set-primary --force command (but only after carefully reviewing the user manual and accepting all responsibilities).
Observe as the splintered clusters merge back into one.
Optionally, consider ais start rebalance and/or ais prefetch --latest, the latter assuming remote backend(s).

Postscriptum

As for the root cause behind the avalanche of heartbeats and, eventually, split-brain? Well, yes, we experienced that for a while in one of our production setups. It was a simple net.ipv4.tcp_mem misconfiguration.
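
For anyone wanting to check the same setting on their own nodes: on Linux, net.ipv4.tcp_mem holds three values (low, pressure, high) expressed in memory pages, not bytes. Below is a small sketch that reads and converts them. The helper names are hypothetical, and the sample numbers are neither a recommendation nor the values from our incident.

```python
import os


def parse_tcp_mem(text, page_size):
    """Parse the three tcp_mem fields (in pages) and convert them to bytes."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    low, pressure, high = (int(f) for f in fields)
    return {
        "low_bytes": low * page_size,
        "pressure_bytes": pressure * page_size,
        "high_bytes": high * page_size,
    }


def read_tcp_mem(path="/proc/sys/net/ipv4/tcp_mem"):
    """Read the live setting on Linux; return None where the file is absent."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_tcp_mem(f.read(), os.sysconf("SC_PAGE_SIZE"))


# Sample input only: three page counts with an assumed 4 KiB page size.
sample = parse_tcp_mem("190000 250000 380000", page_size=4096)
```

Calling read_tcp_mem() on each node and comparing the resulting byte limits with the memory actually available is a quick sanity check. It is much cheaper than discovering the mismatch through failing keepalives.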

References

[1] Bernard of Clairvaux, 12th century: “Hell is full of good wishes or desires.”