First, thanks for the awesome product. I started using it a couple of days ago and I'm really enjoying it.
Now, on to the actual issue: I deployed the MariaDB Galera cluster (from the catalog) and tried some simple resilience tests: shutting down random servers to see whether the cluster keeps functioning. It seems that my second server (out of the three) acts as a master, in that when I shut down the second server, MariaDB stops working (I'm using the default load balancer to connect to Galera).
When I shut down the other two servers (one and three), MariaDB keeps working. I know the template is still experimental, but am I missing something really obvious here?
I’m not exactly clear on the order of testing here, but I’ll outline what we believe the current behavior is.
It should survive any single server failing in a 3-node cluster. If 2 of the 3 are out of commission, I believe it stops taking writes. If the master fails, Galera should handle the failure and a new leader is selected. The leader, as seen through the proxy, is determined essentially by create order (more specifically, by create_index in the metadata). There is probably a lag in that recovery, because Rancher needs to learn that the previous server is no longer available.
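To see whether a node still considers itself part of the quorate component, you can query Galera's wsrep status variables from the mysql client. A minimal sketch, assuming local client credentials are already configured (e.g. via ~/.my.cnf); the check_galera helper name is mine, not part of the template:

```shell
# Print this node's view of the cluster component.
# "Primary" means it has quorum and accepts writes;
# "non-Primary" means it has been cut off and will refuse them.
check_galera() {
  mysql -N -e "SHOW STATUS LIKE 'wsrep_cluster_status';" | awk '{print $2}'
}

# Related variables worth watching while you kill nodes:
#   wsrep_cluster_size  - how many nodes this node currently sees
#   wsrep_ready         - ON when the node is ready to serve queries
check_galera
```

Running this on each node while shutting servers down should make it clear whether the failure is Galera losing quorum or just the load balancer pointing at a dead node.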
Autorecovering from failure is where the "experimental" really comes from, because in most cases it's lacking the "auto". If all of the nodes go down, you need to manually re-initialize the cluster. If a node drops out, it comes back as a new node in Galera's eyes and, I believe, needs to be added again.
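For the all-nodes-down case, the usual manual re-initialization is to pick the node with the most recent data, mark it safe to bootstrap in grastate.dat, and start it as a new primary component with MariaDB's galera_new_cluster helper. A rough sketch (the bootstrap_galera wrapper and the default data path are my assumptions; adapt them to however the catalog image lays things out):

```shell
# Re-bootstrap a fully stopped Galera cluster from one chosen node.
bootstrap_galera() {
  # Path to Galera's state file; /var/lib/mysql is the MariaDB default.
  grastate="${1:-/var/lib/mysql/grastate.dat}"
  # Mark this node as safe to bootstrap from (do this only on the node
  # with the most recent data, or you can lose transactions).
  sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' "$grastate"
  # Start this node as a new primary component (MariaDB helper script).
  galera_new_cluster
}
```

After the bootstrap node is up, start the remaining nodes normally and they should state-transfer from it and rejoin.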
It's been a while since we have visited this template, and it needs some love.