Hello! I have a problem that I cannot solve, and I do not understand why it arises.
I have a test environment on Hyper-V: 4 hosts running RancherOS and Rancher v1.6.14 (Cattle orchestration), clean installation. Before RancherOS I tried Ubuntu and CoreOS; the problem is the same on all of them.
My steps: on the node named ROS-MASTER I started the Rancher server with the internal MySQL DB. After that I started the Rancher agent on ROS-MASTER and on the 3 nodes ROS-01/02/03. I can see that all infrastructure services are deployed and have a “green” status; the network and health checks work. Then, for example, I leave everything overnight and see the following picture in the morning.
As I understand it: services are in the “Initializing” state because they can’t be health-checked → the healthcheck containers can’t see the other nodes because there are problems with IPsec…
What is the status of the hosts (Infrastructure -> Hosts)? From the logging it looks like there is a network interruption between the hosts, but ipsec should be able to recover from this.
Usually the status of the hosts is Active; this morning they were Disconnected. I checked the connection between the hosts: it is present, but with high latency (20-400 ms). After rebooting the host, the latency returns to normal. What could be the cause, the performance of the host or Hyper-V?
UPD:
I looked at the load average on the host: with 4 virtual cores and 2 GB of RAM, the load average varies from 20 to 30. I will investigate what causes such a load. Should a host with these parameters be enough, or do I need to increase the resources?
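To put those numbers in perspective, one rough check is to divide the load average by the number of cores; sustained values well above 1.0 per core mean tasks are queuing. Below is a minimal Python sketch, assuming the standard Linux /proc/loadavg layout. The helper name `load_per_core` and the sample line are hypothetical, made up to match the range reported above.

```python
from typing import Tuple


def load_per_core(loadavg_text: str, cores: int) -> Tuple[float, float, float]:
    """Return the 1, 5 and 15 minute load averages divided by the core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    # The first three whitespace-separated fields are the load averages.
    one, five, fifteen = (float(v) for v in loadavg_text.split()[:3])
    return (one / cores, five / cores, fifteen / cores)


# Illustrative sample in /proc/loadavg format (made-up values in the
# 20-30 range mentioned above, not captured from a real host).
sample = "25.00 24.00 22.00 3/512 12345"
print(load_per_core(sample, cores=4))  # (6.25, 6.0, 5.5)
```

On a real host the text could come from `open("/proc/loadavg").read()` and the core count from `os.cpu_count()`. Note that on Linux the load average also counts tasks blocked in uninterruptible I/O, so a high value does not by itself say whether CPU or memory/disk is the bottleneck.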
Now we just reboot the server about every 12 hours. I believe you need to get rid of this product. It breaks, and there is no good support. Bad choice. :(
I think the problem is the following:
When it breaks, the agent on the first host changes its IP to 172.17.0.1 (or something like that, I don’t remember exactly).
After a reboot it is restored to the normal IP address. On the other hosts this does not happen.
At the moment I don’t know why the agent changes the IP or how I can prevent it from doing so.
In my case it turned out that the problem was in the containers. We are migrating to ASP.NET Core, and the problem turned out to be in the applications running in the containers. The applications consumed resources in a geometric progression, which made the load average grow. We found a solution, turning off ServerGarbageCollector, and the system has been running stably for a couple of days. But we have not reached production yet)
@nexcode Are you still having trouble with IPSec? Where are your hosts running? Cloud/Datacenter? Can you check the output of cat /proc/net/xfrm_stats inside the ipsec containers? Do you see the errors going up? What version of rancher/server are you running?
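To tell whether the errors are going up, one option is to save the output of that file twice, a few minutes apart, and compare the two snapshots. Below is a minimal Python sketch, assuming the usual one "name value" pair per line layout of /proc/net/xfrm_stats. The function names and the sample numbers are hypothetical, not taken from a real host.

```python
from typing import Dict


def parse_xfrm_stats(text: str) -> Dict[str, int]:
    """Parse 'Name value' lines as found in /proc/net/xfrm_stats."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def growing_counters(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters that increased, with the size of the increase."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value > before.get(name, 0)
    }


# Made-up snapshots for illustration only.
snapshot_1 = """XfrmInError             0
XfrmInNoStates          12
XfrmInStateSeqError     3
XfrmOutError            0
"""
snapshot_2 = """XfrmInError             0
XfrmInNoStates          40
XfrmInStateSeqError     3
XfrmOutError            0
"""

print(growing_counters(parse_xfrm_stats(snapshot_1), parse_xfrm_stats(snapshot_2)))
# {'XfrmInNoStates': 28}
```

Counters that stay flat between snapshots are probably old history; the ones that keep growing are the ones worth posting here.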