Stand up cluster on Rancher v1.1, containers can't connect

Supposedly it has stable HA with 3 hosts, and all containers can be seen.

But there are 2 issues, which could well be interrelated:

  1. The go-machine-service and rancher-compose-executor services each have 1 of their 2 default containers that won’t start and stay up; they loop perpetually while the other, identical containers in the same scale setting keep running. Try to fix it, and everything breaks.
  2. No container in System HA can be reached via the shell or log options, and none will ever connect and display metrics.

Any idea what would cause this?

The stats thing sounds like a bug fixed in later 1.1.x versions. What’s in the logs for RCE and GMS?

There is limited support we and the community are going to be able to offer here… You’re using a 7-month-old version, not even the newest patch of it (1.1.4), and HA works in a completely different, significantly simpler way in 1.2+.

Indeed, we understand support is limited. This cluster (which I will call the Upgrade Cluster) is being stood up to match our current cluster (Live Cluster), because we want to test the upgrade process before doing it where it will affect far more important services orchestrated by Rancher.

The Live Cluster is in a similar config but working, in contrast to the Upgrade Cluster.

We recently identified that the load balancer had a serious issue with the cert applied to it. We are correcting the issue now and retesting.