Some problems with network

chelius · February 1, 2018, 9:26am

Hello! I have a problem that I cannot solve, and I do not understand why it occurs.
I have a test environment on Hyper-V: 4 hosts running RancherOS and Rancher v1.6.14 (Cattle orchestration), clean installation. Before RancherOS I tried Ubuntu and CoreOS; the problem is the same on all of them.
My steps: on the node named ROS-MASTER I started Rancher server with the internal MySQL DB. After that I started the Rancher agent on ROS-MASTER and on the 3 nodes ROS-01/02/03. I can see all infrastructure services being deployed and reaching a “green” status; networking and healthchecks work. Then, for example, I leave everything overnight and find the services broken in the morning.

As I understand it: services are stuck in the “initializing” state because they can’t be healthchecked → healthcheck containers can’t see the other nodes because there are problems with IPsec…

ipsec-ipsec-router-2 logs: 01.02.2018 11:17:34 08[JOB] CHILD_SA ESP/0x00000000/10.0.20.70 not found for dele - Pastebin.com
healthcheck-healthcheck-3 logs: 01.02.2018 11:22:17 time="2018-02-01T09:22:17Z" level=info msg="Starting haproxy - Pastebin.com

I would be glad of any help! Thx!

superseb · February 1, 2018, 5:55pm

What is the status of the hosts (Infrastructure -> Hosts)? From the logging it looks like there is a network interruption between the hosts, but ipsec should be able to recover from this.

chelius · February 2, 2018, 8:36am

Usually the status of the hosts is Active, but this morning they were Disconnected. I checked connectivity between the hosts: it is there, but with high latency (20-400 ms). After rebooting the host, the latency goes back to normal. What could be the cause, host performance or Hyper-V?

UPD:
I looked at the load average on the host: with 4 virtual cores and 2 GB of RAM, it varies from 20 to 30. I will investigate what causes such a load. Should a host with these specs be enough, or do I need to add resources?
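
To put a load average of 20 to 30 on 4 virtual cores in context, it can help to divide it by the core count: a sustained ratio well above 1 means more runnable work is queued than the CPUs can serve. Below is a minimal Python sketch of that check. The function names and the threshold of 1.0 are illustrative assumptions, not anything provided by Rancher or RancherOS.

```python
import os


def load_per_core(load_avg, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_overloaded(load_avg, cores, threshold=1.0):
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


def current_load_per_core():
    """Read the 1-minute load average of this machine (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return load_per_core(one_minute, cores)
```

With the figures from this post, `load_per_core(20, 4)` gives 5.0, i.e. roughly five runnable tasks competing for each core.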

nexcode · February 5, 2018, 9:26am

We decided to use it in production. IPsec goes down approximately every 12 hours.
Now we very much regret spending time on this unstable solution.

nexcode · February 5, 2018, 9:30am

Now we just reboot the server about every 12 hours. I believe that you need to get rid of this product. It breaks, there is no good support. Bad choice. : (

nexcode · February 5, 2018, 9:35am

We use bare metal. Network between servers is always working perfectly.

nexcode · February 5, 2018, 10:14am

I think the problem is the following:
When it breaks, the agent on the first host changes its IP to 172.17.0.1 (or something like that, I don’t remember exactly).
After a reboot it goes back to the normal IP address. On the other hosts this does not happen.

At the moment I don’t know why the agent changes its IP or how I can prevent it from doing so.
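
For what it's worth, 172.17.0.1 is the default address of Docker's `docker0` bridge, so an agent reporting it may have picked up the bridge interface instead of the host's real NIC. A small Python sketch for flagging such addresses is shown below. It assumes the Docker default bridge subnet 172.17.0.0/16 has not been changed on the host, and the function name is hypothetical.

```python
import ipaddress

# Docker's default bridge subnet; an assumption that holds only if the
# daemon was not configured with a custom bridge IP range.
DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def looks_like_docker_bridge_ip(ip, bridge=DOCKER_DEFAULT_BRIDGE):
    """Return True if `ip` falls inside the given bridge subnet."""
    return ipaddress.ip_address(ip) in bridge
```

Feeding it the address each host shows in Infrastructure -> Hosts would single out the host whose agent has switched to the bridge address.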

nexcode · February 5, 2018, 10:38am

Are the agents on your servers changing their IPs too?

nexcode · February 5, 2018, 10:08pm

This issue contains a dirty workaround for the problem:

github.com/rancher/rancher

"Timeout getting IP address" on container with network "managed"

opened 07:49PM - 14 Feb 17 UTC

closed 07:16PM - 30 May 17 UTC

Kilhog

status/more-info

**Rancher Versions:**
Server: 1.4.1 (same with 1.3.x)
healthcheck: 0.2.3
ipsec: 0.0.4
scheduler: 0.4.0

**Docker Version:** 1.12.6 & 1.13.1

**OS and where are the hosts located? (cloud, bare metal, etc):** Fresh install of Debian 8 bare metal (OVH)

**Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)** single node rancher

**Environment Type: (Cattle/Kubernetes/Swarm/Mesos)** Cattle

**Steps to Reproduce:**
* Clear install of Debian 8 on dedicated-server (Debian 8.7 stable (Jessie) Server HOST-64-H - 64G Xeon D-1540)
* Install docker-engine 1.12.6 or 1.13.1
* `docker run -d --restart=unless-stopped -p 8080:8080 rancher/server`
* `docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.0 http://***:8080/v1/scripts/*******`

**Results:** "Timeout getting IP address" on stack "healthcheck", "ipsec" and "scheduler". If I create a standalone container "redis" with network "bridge" it's start and run perfectly, but if I create one with network "managed" I have a error "Timeout getting IP address"

If you could give me any solution to get around this problem I will be very grateful 😀

(I have already tried `{ "dns": ["8.8.8.8", "8.8.4.4"], "dns-search": ["example.org"] }` in daemon.json, and change DNS in resolv.conf I also try to put the rancher host on another server)

The developers just advised updating…

nexcode · February 5, 2018, 10:09pm

I use rancher/server 1.6.14 and Docker 17.06.2-ce (RancherOS 1.1.3).
So what am I supposed to update?

chelius · February 6, 2018, 8:16am

In my case the problem turned out to be in the containers. We are migrating to ASP.NET Core, and the culprit was the applications running in the containers. Their resource consumption grew geometrically, which drove the load average up. Our fix was to turn off ServerGarbageCollection, and the system has been stable for a couple of days now. But we have not reached production yet :)

leodotcloud · April 28, 2018, 8:32pm

@nexcode Are you still having trouble with IPSec? Where are your hosts running? Cloud/Datacenter? Can you check the output of cat /proc/net/xfrm_stats inside the ipsec containers? Do you see the errors going up? What version of rancher/server are you running?
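
To see whether the errors are going up, one option is to take two snapshots of `/proc/net/xfrm_stats` some time apart and compare them. Here is a rough Python sketch of that comparison. It assumes each line of the file is a counter name followed by an integer, and the helper names are made up for illustration.

```python
def parse_xfrm_stats(text):
    """Parse 'Name <whitespace> value' lines into a dict of int counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        try:
            counters[name] = int(value)
        except ValueError:
            continue  # skip lines whose value is not an integer
    return counters


def growing_counters(before, after):
    """Return {name: increase} for every counter that went up."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


def read_xfrm_stats(path="/proc/net/xfrm_stats"):
    """Read and parse the counters from the running kernel (Linux only)."""
    with open(path) as fh:
        return parse_xfrm_stats(fh.read())
```

Run inside the ipsec container, calling `read_xfrm_stats()` twice a few minutes apart and passing both results to `growing_counters` lists only the counters that increased in between.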
