Catalog deployment fails

My catalog deployment is failing with the error "Failed to find existing service: service10".

The following errors are from the Rancher logs (v1.6.5). I am running Rancher in cluster mode with 2 servers; I have stopped one of the Rancher containers for now and am running with just one server. It happens with 2 servers in cluster mode too, so it's not an issue with the number of servers. What is the minimum number of servers needed to run in fully active-active HA? Also, I have around 20 services and 1000 lines in rancher-compose.yml.

level=error msg="Failed Creating service10 : Get http://localhost:8080/v2-beta/projects/1a86937/services?name=service10&removed_null=%3Cnil%3E&stackId=1st586: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
2017/11/16 17:19:03 http: proxy error: net/http: request canceled
level=error msg="Failed to start: batch : Get http://localhost:8080/v2-beta/projects/1a86937/services?name=service5&removed_null=%3Cnil%3E&stackId=1st586: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
time="2017-11-16T17:19:04Z" level=error msg="Stack Create Event Failed: Failed to find existing service: service10" eventId=c9aca8b3-aabf-4b06-a5e3-be8d5a578322 resourceId=1st586
level=info msg="[base:]: Project created " eventId=c9aca8b3-aabf-4b06-a5e3-be8d5a578322 resourceId=1st586
ERROR [fcea916d-435c-49b1-9649-041689de94fa:7718312] [stack:586] [stack.create] [utorService-925] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [stack.create.reply;handler=rancher-compose-executor]: Failed to find existing service: service10
ERROR [:] [utorService-925] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [stack.create.reply;handler=rancher-compose-executor]: Failed to find existing service: service10

Your error trace provides the clues you need. As for HA, clusters typically use an odd number of servers, starting from 3. A 3-server HA cluster tolerates the failure of 1 node without loss of service, a 5-node cluster tolerates the failure of 2, and so on. In most cases 5 would be plenty, and 3 is more typical. Beyond that number there is an overhead in managing the consistency of the cluster. Of course, you should also take account of how your nodes are distributed within your implementation platform. For example, on AWS you need to consider deployment across AZs and regions.

I’m sure you are aware, but a loss of your HA nodes does not impact the business apps running in your environments, only your ability to change or add to them (which is very important in our CD world, but may not be as catastrophic an outage as it is sometimes perceived to be).

Thank you Fraser_Goffin, it makes sense; I will certainly try three servers. But as I remember, Rancher even supported a single server as HA in old versions, in case we don't want failover, and nowhere in the documentation does it mention the odd-number (minimum 3) concept. In one of the Rancher meetups, I seem to recall hearing an answer supporting two servers. As for the error, I also suspect a websocket service timeout caused by handling larger input from the docker-compose.yml or rancher-compose.yml files.

These are the errors I get nowadays; strangely, sometimes it works too.

http: proxy error: net/http: request canceled
net/http: request canceled (Client.Timeout exceeded while reading body)
net/http: request canceled (Client.Timeout exceeded while awaiting headers)


Of course you can run Rancher server on a single node, but you can’t really call that HA, since if you lose it your cluster is ‘toast’. The HA part is really about how many nodes you can lose before the cluster itself might not behave as expected. This relates in many ways to the idea of a ‘quorum’, which in this context basically means a majority vote. This is why odd numbers are preferred: in a 3-node cluster there would always be a majority vote of 2 members against the remaining 1, and in a 5-node cluster a majority of 3. Without getting too deep into the theory, successful operation of the cluster relies on this mechanism to allocate tasks and to ensure that the cluster doesn’t suffer from ‘split brain’ (i.e. an indecisive vote). Anyway, that’s the basis of HA, and Rancher supports it as a common pattern. Can you run Rancher on a single node, or 2, or 4? Sure you can.
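
To make the majority arithmetic concrete, here is a minimal sketch in Python. It is only generic quorum maths, not anything taken from Rancher itself, and the function names quorum and tolerated_failures are made up for illustration.

```python
def quorum(n: int) -> int:
    """Smallest number of members that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many members can be lost while a majority still remains."""
    return n - quorum(n)


# Print the majority size and failure tolerance for clusters of 1 to 5 nodes.
for n in range(1, 6):
    print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Under this arithmetic a 2-node cluster tolerates no more failures than a single node, and a 4-node cluster no more than a 3-node one, which is why the even sizes buy you nothing extra.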

I am really not worried about failover; the issue I am getting is while deploying catalogs. I don't mind if Rancher itself goes down when I am running with one Rancher container. If this were a case of majority vote, Rancher shouldn't come up and no catalog deployment would be possible, right?

It's not an issue with HA; even a standalone version with remote MySQL is failing. Standalone Rancher with internal MySQL works perfectly fine. All the recommended parameters are set for MySQL, but it still fails for some reason.