Over 200 hours of troubleshooting but still unable to get etcd containers to come up

Greetings from England!

Being active in providing technical support elsewhere, I am an avid fan of thrashing the issue out to the best of one’s abilities first. Consequently, I have put in over 200 hours and easily over 1,000 research tabs trying to resolve the issue myself, but feel it is now time to turn to the lovely people here.

Thank you massively in advance if you can spare a few moments to help me out - it would be truly appreciated! Oh, and feel free to let me know if this can be better submitted.

My first ever k8s setup will hopefully use 4x Oracle instances and one private VPS with 6 GB RAM, all of which are running CentOS 7. Ideally, I’d like to test with 3x control planes and 2x workers, but the logs below are for just 3x control planes and 1x worker as I am temporarily using one of the Oracle instances to run rke up.

I’ve tried dozens of things, from ensuring the servers’ times align, replacing firewalld with iptables, and completely disabling firewalld, through to using different bare-metal/virtual machines to run rke up, almost every attempt on a fresh OS install, but all to no avail.

No matter what I try, I always end up with an error like the following:

Host [140.238.87.137] is not able to connect to the following ports: [140.238.67.47:2380, 140.238.67.47:2379, 140.238.65.66:2379, 140.238.65.66:2380, 140.238.87.137:2380, 140.238.87.137:2379]. Please check network policies and firewall rules]

Despite trying to decipher the rke up log below myself, I am sadly not adept enough to work out why the etcd containers never come up.

rke -d up log from one of the Oracle instances with 3x control planes and 1x worker:

https://gist.githubusercontent.com/binvius/00e75fef1e61c3ff1c4198677eafcba8/raw/4032f01b9e147f689b70030aecb4f8c0f8c76621/gistfile1.txt

Please do let me know if further files/logs are required beyond the cluster.yml below and I will get them uploaded.

cluster.yml from one of the Oracle instances with 3x control planes and 1x worker:

https://gist.githubusercontent.com/binvius/ef054bff1229a05d1c79205bf103461e/raw/96c8a2577b386ad91f6908a2a3fedb0d8c659486/gistfile1.txt

kubectl version --client
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

rke --version
rke version v1.0.4

helm version
version.BuildInfo{Version:"v3.1.1", GitCommit:"afe70585407b420d0097d07b21c47dc511525ac8", GitTreeState:"clean", GoVersion:"go1.13.8"}

docker version
Client: Docker Engine - Community
Version: 19.03.6
API version: 1.39 (downgraded from 1.40)
Go version: go1.12.16
Git commit: 369ce74a3c
Built: Thu Feb 13 01:29:29 2020
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:47:25 2019
OS/Arch: linux/amd64
Experimental: false

Not sure why the Docker client and engine versions are different, having used the Rancher install script, but my research suggested that is no longer an issue these days.

Again, I cannot thank you enough in advance for any pointers on how to progress as, as you can probably guess, I’ve been pulling my hair out for some time now.

Warmest regards,

-binvius-

Do you have Rancher Server up and running, and you are trying to provision a workload/user cluster? Or is this cluster the intended cluster for Rancher Server?

The port check runs a Docker container that listens on a port and another container that tries to connect to that port, and it does this for every host and port involved. To get to the root cause of why this error keeps popping up, I would:

  • Test network connectivity from host to host (without Docker)
    Example using netcat
Host 1: nc -l ip_address_of_host1 2379
Host 2: curl ip_address_of_host1:2379

This should show a response on host 1.

  • Test network connectivity using Docker (as this is what RKE does)
Host 1: docker run -d -p ip_address_of_host1:2379:80 nginx
Host 2: docker run appropriate/curl ip_address_of_host1:2379

This should show a response on host 1.
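
The two manual checks above could be repeated for every etcd endpoint named in the error with a small loop. The following is only a sketch, not what RKE itself runs: `endpoints`, `check_endpoint` and `report` are hypothetical helper names, the addresses and ports are copied from the error message earlier in the thread, and the connection test assumes bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available on the host.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; addresses come from the error above.
HOSTS="140.238.67.47 140.238.65.66 140.238.87.137"
PORTS="2379 2380"

# Print every host:port combination, one per line.
endpoints() {
  for host in $1; do
    for port in $2; do
      echo "$host:$port"
    done
  done
}

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
check_endpoint() {
  host=${1%:*}
  port=${1##*:}
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Try each endpoint in turn and print whether the connection succeeded.
report() {
  for ep in $(endpoints "$HOSTS" "$PORTS"); do
    if check_endpoint "$ep"; then
      echo "open    $ep"
    else
      echo "closed  $ep"
    fi
  done
}
```

Calling `report` on each node prints one line per endpoint. Something has to be listening on the target port (for example the `nc -l` command above) for an endpoint to show as open.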

If this actually works, there must be something else going on. In that case I can give you the exact command that is run (which just uses a different listening port inside the container), but looking in the system logs could also reveal more information. Please also share exactly what you executed when replacing firewalld with iptables and when completely disabling firewalld.

Many thanks for your response - very much appreciated!

Sadly, I don’t believe I have even reached that stage yet. The manuals are telling me to use rke to get a high availability cluster up and running. I am guessing once it’s up, that would be the time I’d introduce a separate instance solely for Rancher.

I would most definitely welcome any further thoughts you may have regarding my response to the other lovely poster. I’m sure it must be something simple, yet through my extensive research I have yet to find a solution.

Many thanks once again.

Warmest regards.

Cheers!

Many thanks for your response - very much appreciated!

Apologies for the slight tardiness in my response. I was hoping to show some initiative and report back with a chunk of my findings but alas, despite several more 8-10 hour days grinding away at it, I have failed.

I had previously tried testing the ports by running nginx but your commands look much more magically useful so thank you!

Using your commands, I’ve primarily been attempting to get nginx up, each time on a fresh OS install with nothing but Docker installed. This appears to have worked on the VPS (which was probably erroring previously due to ntp possibly occupying port 80).

On the Oracle instances, when trying to start the nginx container on any opened port, I am receiving the following error:

docker: Error response from daemon: driver failed programming external connectivity on endpoint zen_mclean (ef59aafd228711fcd1940c94d552c32d6aeb7e0533d36b8078a849346e75ea13): Error starting userland proxy: listen tcp 140.238.87.95:2379: bind: cannot assign requested address.

I receive a similar error when trying with nginx outside of Docker.

When running the ncat/curl commands on any opened port, I am receiving the following error:

Ncat: bind to 140.238.87.95:2379: Cannot assign requested address. QUITTING.
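
A `bind: cannot assign requested address` error generally means that the address being bound is not configured on any local interface. One way to check this is to compare the address against the output of `ip -o -4 addr show`. The sketch below assumes the iproute2 `ip` command is present; `has_address` and `check_local` are hypothetical helper names. On cloud instances a public IP is often mapped to the instance by NAT rather than assigned to the interface itself; whether that applies to these Oracle instances is an assumption that this check may help confirm or rule out.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Succeed if the IPv4 address in $1 appears in `ip -o -4 addr show`
# output read from stdin (field 4 of each line is "address/prefix").
has_address() {
  awk -v ip="$1" '
    { split($4, a, "/"); if (a[1] == ip) found = 1 }
    END { exit !found }
  '
}

# Report whether the given address is configured on this host.
check_local() {
  if ip -o -4 addr show | has_address "$1"; then
    echo "$1 is assigned to a local interface"
  else
    echo "$1 is NOT assigned to a local interface"
  fi
}
```

For example, `check_local 140.238.87.95` could be run on the instance that produced the error. If the address turns out not to be local, listening on an address that `ip addr` does list is one thing to try.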

Although I believe it is not necessary given that I manage the rules in firewalld, the error still shows despite also adding ingress and egress rules manually within my Oracle instances’ web UI to cover everything, some examples being:

Ingress: Source: 0.0.0.0/0
IP Protocol: TCP
Source Port Range: All
Destination Port Range: All
Allows: TCP traffic for ports: All

Egress: Source: 0.0.0.0/0
IP Protocol: TCP
Source Port Range: All
Destination Port Range: All
Allows: TCP traffic for ports: All

I have eventually found a channel to reach out to Oracle, but their customer service is notoriously atrocious so I am still awaiting a response.

I suspect there is something awry with how Oracle expects the network to be set up, but I would have thought opening the ports to the protocols would be sufficient, so I shall have to wait to see what they say.

In the meantime and as requested, the commands I used to disable firewalld and/or replace it with iptables were:

sudo systemctl stop firewalld
sudo systemctl disable firewalld
sudo systemctl mask --now firewalld
sudo reboot
sudo yum install iptables-services
sudo systemctl start iptables
sudo systemctl enable iptables
sudo systemctl status iptables
sudo reboot

Should it prove useful, when running sudo netstat -tunlp on host1, it shows nginx on port 80, so when I curl that port from host2, it outputs the CentOS welcome screen as if one were to go to it within a browser.

Also, not sure if it helps, but the output from /sbin/iptables -L is as follows:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain DOCKER (1 references)
target     prot opt source               destination

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-ISOLATION-STAGE-2 (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Again unsure of its usefulness, the output from firewall-cmd --list-all is as follows:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens3
  sources:
  services: dhcpv6-client ssh
  ports: 22/tcp 80/tcp 179/tcp 443/tcp 2376/tcp 2377/tcp 2378/tcp 2379/tcp 2380/tcp 3389/tcp 5473/tcp 6443/tcp 7946/tcp 8472/tcp 8473/tcp 9099/tcp 10250/tcp 10251/tcp 10252/tcp 10253/tcp 10254/tcp 30000-32767/tcp 123/udp 4789/udp 7946/udp 8285/udp 8472/udp 30000-32767/udp
  protocols:
  masquerade: yes
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Finally, I’ve also run /sbin/iptables-save > /tmp/ipsave.txt and the output looks interesting and can be found here:

https://gist.githubusercontent.com/binvius/2d09b0c6e690fa7208935d573929268f/raw/7a0bcbf13cc80ead8ccb810dd1f24ac4020bff91/gistfile1.txt

Bearing in mind I’ve opened all ports to all protocols within Oracle, I would certainly welcome any other magical commands you may have in your arsenal.

I would just like to take this opportunity to thank you greatly for your help - it’s so lovely to receive having given it out so much - teehee.

Warmest regards.

Cheers!

I’d have to dive into whatever Oracle puts in their firewalld config to make things more secure, but when you disabled it and only had iptables, can you share the error from the rke up afterwards? So that’s:

  • Disable firewalld
  • Stop firewalld
  • firewall-cmd --list-all should return that FirewallD is not running
  • Restart Docker
  • Run rke remove to make sure nodes are clean (destroys data)
  • Run rke up to initialize cluster
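
Taken together, the steps above might look like the following sketch. It assumes `cluster.yml` is in the current directory, that the firewalld and Docker commands are run on every node while the `rke` commands are run from the machine used to provision the cluster, and it defaults to a dry run that only prints each command. `run`, `reset_node` and `rebuild_cluster` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# On every node: turn firewalld off, confirm it is off, then restart Docker.
reset_node() {
  run sudo systemctl disable firewalld
  run sudo systemctl stop firewalld
  run sudo firewall-cmd --list-all   # expected to report that FirewallD is not running
  run sudo systemctl restart docker
}

# On the provisioning machine: clean the nodes (destroys data), then bring the cluster up.
rebuild_cluster() {
  run rke remove --config cluster.yml
  run rke up --config cluster.yml
}
```

Running `reset_node` and `rebuild_cluster` as they are only lists the commands. Setting `DRY_RUN=0` executes them, and `rke remove` destroys the cluster data as noted in the list above.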

Given you can curl between the hosts, I think the network should be fine (although it could be more complex if this is some custom setup). The places where you need to allow traffic are the Network Security Group, which is attached to the VCN, and the Security List attached to the subnet.

Any news on this issue?

I have the same issue,
but it came about in a slightly different way.
I installed Rancher with rke on Debian 10. After the cluster came up, my security team installed iptables on the cluster nodes and configured them following the Rancher iptables guide (Rancher Docs: Port Requirements).
I then removed one of the nodes and added it again, and it gave me the same error!

Hi superseb, I have the same issue. I have stopped firewalld and ufw. My servers are in VMware and I have opened all the ports, but it still shows the above error.
Running nc -l 2379 on one host and connecting to it from another host works.
But rke up still gives the same error.