I’ve been trying to set up a multi-server Vault cluster using Rancher and have run into an issue.
When Vault runs as a cluster, a standby node does not serve requests itself. Instead it redirects each request to the active node by responding with a 307 that points at the active node's advertised public address. The Vault client follows such a redirect only once.
The issue is that if you use a service to define the servers and set the scale to 3, you get 3 identical servers. If you put a load balancer in front of them, requests are distributed round-robin across all of them. So whenever a request hits a standby node and the redirected request then hits another standby node, the Vault client errors out with too many redirects.
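To make the failure concrete, here is a rough simulation of it. This is only a sketch: the node names are made up, and it assumes the advertised address points back at the load balancer, so the redirected request goes through the round-robin again. It does not talk to a real Vault or Rancher.

```python
from itertools import cycle

MAX_REDIRECTS = 1  # the client follows a redirect only once


def send(lb, active, max_redirects=MAX_REDIRECTS):
    """Simulate one client request through a round-robin load balancer.

    lb     -- iterator yielding the backend picked for each hop (hypothetical)
    active -- name of the node that is currently the active one
    """
    redirects = 0
    while True:
        node = next(lb)  # the load balancer picks the next backend
        if node == active:
            return "ok"  # the active node serves the request
        if redirects == max_redirects:
            return "too many redirects"  # standby again, budget used up
        redirects += 1  # standby answered 307, client retries via the LB


# Hypothetical node names; vault-1 is assumed to be the active node.
lb = cycle(["vault-1", "vault-2", "vault-3"])
results = [send(lb, "vault-1") for _ in range(4)]
print(results)
```

With one active node out of three, every request that lands on a standby first gets bounced onto the other standby and fails.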
To get around this, I tried setting a health check that marks only the active node as healthy, with the strategy set to take no action. This works fine except when the service is first launched: the standby nodes never become healthy, so Rancher keeps destroying and recreating those instances (apparently the strategy only applies to a healthy -> unhealthy transition). It also breaks upgrades, because an upgrade never completes while the standby nodes are never marked as healthy.
It appears that what I need is a way to mark an instance as healthy/unhealthy for load balancing purposes, separately from the health of the instance itself. Right now my only recourse is to run a single-server cluster and look into alternative load balancers that perform their own health checks on the backends. That, however, would require integration with Rancher to populate the load balancer's configuration.
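Roughly, the split I have in mind looks like the sketch below. It relies on Vault's sys/health endpoint returning 200 for the active node and 429 for an unsealed standby by default. Codes for sealed or uninitialized nodes vary by Vault version, so anything else is simply treated as not alive here. `NodeState` and `classify` are names I made up for illustration, not part of Vault or Rancher.

```python
from typing import NamedTuple, Optional


class NodeState(NamedTuple):
    instance_alive: bool  # should the orchestrator keep this container?
    lb_eligible: bool     # should the load balancer send traffic to it?


def classify(status_code: Optional[int]) -> NodeState:
    """Map a sys/health HTTP status code to the two separate health notions.

    status_code is None when the node did not answer at all.
    """
    if status_code == 200:  # unsealed and active
        return NodeState(instance_alive=True, lb_eligible=True)
    if status_code == 429:  # unsealed standby: fine, but no traffic
        return NodeState(instance_alive=True, lb_eligible=False)
    # No answer, sealed, uninitialized, or any other code (assumption:
    # all of these are treated as unhealthy in this sketch).
    return NodeState(instance_alive=False, lb_eligible=False)


for code in (200, 429, 503, None):
    print(code, classify(code))
```

A standby would then stay up and count as started for launches and upgrades, while only the active node receives traffic.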
Any thoughts on this?