I’m new to Docker/Rancher/Kubernetes in general. I’m setting up a POC for an internal team and they want to try and use Rancher. I’m setting up a Rancher cluster right now but keep running into issues and don’t know what to do. Here’s what I have:
6 VMs (CentOS 7.6), all with DHCP reservations
Docker 17.03.2
Nginx load balancer on one of the nodes
All VMs are in the same host cluster and in the same subnet
I have installed Docker on all 6 nodes, and on the 4th node I installed the Nginx load balancer. I have been going through the directions scattered all over the Rancher website and am now stuck on the rke part. For whatever reason, when I used wget to fetch rke it didn’t download the whole file, so I had to download it manually and use SCP to copy it to each machine (is that needed, or does it only have to be on one machine?). I’m using rke v0.1.15. I’ve modified the file and given it the proper permissions.
When I run the rke config command, I’ve tried specifying 6 nodes and also 3 nodes. With 6 nodes, I didn’t specify anything for node 1 or node 4: I thought node 1 would become the master somehow, and node 4 is the load balancer node, so I didn’t think anything should be installed there. I’m not sure, though, as I can’t find any documentation that’s clear on that. That attempt failed because I didn’t specify anything for node 1. So I tried using just the first 3 nodes, with different combinations of which one was etcd, worker and control plane, but no matter what I do it always fails. I basically get something like this:
[root@Cent7Dock1 bin]# rke up --config cluster.yml --ssh-agent-auth
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.2]: Can’t retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.2:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.3]: Can’t retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.3:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.1]: Can’t retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.1:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
WARN[0000] Removing host [10.0.0.2] from node lists
WARN[0000] Removing host [10.0.0.3] from node lists
WARN[0000] Removing host [10.0.0.1] from node lists
FATA[0000] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [10.0.0.2]
But if I go to the first machine (10.0.0.1), I can SSH to the other machines without issue. Firewalld is stopped and disabled on all machines, and the docker service is started and enabled on all machines as well.
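Being able to SSH interactively is not quite the same check that rke performs. Going by the log above, rke logs in with a private key as the user named in cluster.yml and then queries the Docker daemon. A minimal sketch of reproducing that by hand follows; the function name `check_node`, the user `myuser` and the key path are placeholders, not anything taken from rke itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of rke): reproduce by hand the two things the
# log complains about, key-based SSH login and access to the Docker daemon.
check_node() {
    cn_user="$1"; cn_host="$2"; cn_key="$3"

    # The private key must be readable by the account that runs rke.
    if [ ! -r "$cn_key" ]; then
        echo "key file not readable: $cn_key" >&2
        return 2
    fi

    # BatchMode=yes forbids password prompts; IdentitiesOnly=yes offers only
    # the key given with -i, so an agent or other keys cannot mask a problem.
    ssh -o BatchMode=yes -o IdentitiesOnly=yes -o ConnectTimeout=5 \
        -i "$cn_key" "$cn_user@$cn_host" 'docker info >/dev/null && echo docker-ok'
}

# Example, not run here (placeholder user and address):
#   check_node myuser 10.0.0.2 "$HOME/.ssh/id_rsa"
```

If this fails while a plain `ssh` works, the difference is probably a password fallback or a different key being offered. The first run above also passed `--ssh-agent-auth`, which, as far as the rke documentation describes it, takes keys from a running ssh-agent (via `SSH_AUTH_SOCK`) rather than from a file, so "no key found" would be plausible when no agent is running.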
I’ve googled around for some fixes for this, but nothing seems to work. What am I missing? I was really hoping to get the cluster set up today and finish installing Rancher, integrating it with vCenter, etc., so I can complete the POC and let the other team test it.
I should also mention that I’ve done the part of adding the user to the docker group. I’ve tried it with the user and with root, but there was no difference. When I set everything up initially, I was using the root user (installing Docker, nginx, etc.).
Just tried again after adding the IPs and host names of each machine to /etc/hosts, but no difference. I went through rke config again, chose 3 hosts, and mostly accepted the defaults other than the IPs needed for the machines. I get these errors now:
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.1]: Can’t establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.2]: Can’t establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.3]: Can’t establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
WARN[0000] Removing host [10.0.0.1] from node lists
WARN[0000] Removing host [10.0.0.2] from node lists
WARN[0000] Removing host [10.0.0.3] from node lists
FATA[0000] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [10.0.0.1]
The first node is supposed to be an etcd host along with a control plane. Node 2 is all 3, and node 3 is just worker and etcd.
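The `open /root/.ssh/id_rsa: no such file or directory` message suggests that, with no other key configured, rke looked for `~/.ssh/id_rsa` in the home directory of the account running it (root here). A small sketch for checking that path; `default_key_status` is a made-up helper name, and the default-path behaviour is inferred from the log rather than confirmed.

```shell
#!/bin/sh
# Assumption: with no ssh_key_path set, rke reads ~/.ssh/id_rsa of the account
# that runs it (the log above shows /root/.ssh/id_rsa when run as root).
default_key_status() {
    key="$1/.ssh/id_rsa"
    if [ -f "$key" ]; then
        echo "found $key"
    else
        echo "missing $key"
    fi
}

# Check the home directory of whoever is running this script.
default_key_status "$HOME"
```

If the key is reported missing, the options are to create a key pair for that account or to point `ssh_key_path` in cluster.yml at a key that already works.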
Tried this again with the newest version of rke (v0.2.0-rc5). I see this now:
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] Generating CA kubernetes certificates
INFO[0000] [certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates
INFO[0000] [certificates] Generating Kubernetes API server certificates
INFO[0001] [certificates] Generating Service account token key
INFO[0001] [certificates] Generating Kube Controller certificates
INFO[0001] [certificates] Generating Kube Scheduler certificates
INFO[0002] [certificates] Generating Kube Proxy certificates
INFO[0003] [certificates] Generating Node certificate
INFO[0003] [certificates] Generating admin certificates and kubeconfig
INFO[0003] [certificates] Generating Kubernetes API server proxy client certificates
INFO[0003] [certificates] Generating etcd-10.0.0.1 certificate and key
INFO[0004] [certificates] Generating etcd-10.0.0.2 certificate and key
INFO[0004] [certificates] Generating etcd-10.0.0.3 certificate and key
INFO[0005] Successfully Deployed state file at [./cluster.rkestate]
INFO[0005] Building Kubernetes cluster
But then I get the same errors as I originally stated. Is there any help for this? Any fix? This is really preventing me from completing the setup, and nothing else I’ve found has helped. It makes me disappointed in the product and the lack of clear documentation to get it working properly.
Still running into issues and not finding anything that really helps. I’ve now done this:
Created an SSH key at ~/.ssh/id_rsa with my user account and copied it to the nodes using the ssh-copy-id username@remotehost command. I can then run ssh ‘username@remotehost’ and connect right away, yet I still keep getting the same errors when I try to run sudo ./rke up. This is getting very annoying very quickly, and the lack of any clear documentation makes it hard to complete this. Getting this error:
WARN[0004] Failed to set up SSH tunneling for host [10.0.0.1]: Can’t retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.1:22] using SSH. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
That’s even on the host I’m running the command from. I get the same thing for the other 2 hosts I’m trying to setup as well.
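The `attempted methods [none publickey]` part means the node rejected the key that was offered for the username rke used. Since plain `ssh` works for the user account, one thing worth comparing is the `user` and `ssh_key_path` that ended up in cluster.yml: if the defaults were accepted in `rke config`, they may not match the account whose key was copied. A sketch for listing those fields; `show_ssh_settings` is a hypothetical helper, and the field names are the ones rke uses in its generated file.

```shell
#!/bin/sh
# Hypothetical helper: list the per-node SSH settings in a cluster.yml so they
# can be compared with the account and key that work for plain ssh.
show_ssh_settings() {
    grep -nE '^[[:space:]]*(-[[:space:]]+)?(address|user|ssh_key_path):' "$1"
}

# Only runs if a cluster.yml is present in the current directory.
if [ -f cluster.yml ]; then
    show_ssh_settings cluster.yml
fi
```

Keep in mind that under `sudo` the `~` in `ssh_key_path` may resolve to root’s home directory rather than the user’s, depending on the sudoers configuration.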
If I follow these steps from the Docker website, it just kills the service:
Configuring remote access with systemd unit file
Use the command sudo systemctl edit docker.service to open an override file for docker.service in a text editor.
Add or modify the following lines, substituting your own values.
So does anybody have any idea on this? What am I missing? It sure seems like a lot of effort to get Rancher set up to do this whole Kubernetes install, and I’m not impressed at all.
So I’m just trying the sudo ./rke up right now from the user account. I run sudo ./rke config to create the cluster.yml. I’ve tried specifying 6 hosts, 3 hosts and 1 host. The most recent error I posted above was for 3 hosts.
The OpenSSH version is OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
I didn’t set anything up for an SSH agent. I have looked at that second link you have with those errors and nothing has helped me out.
For id I get: uid=1000(myuser) gid=1000(myuser) groups=1000(myuser) 10(wheel) 993(docker)
docker ps gives:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ls -la /var/run/docker.sock gives:
srw-rw---- 1 root docker 0 Feb 6 09:10 /var/run/docker.sock
Hope this helps. Please let me know if you need something else.
I will still need the cluster.yml posted (you can mask IPs if that’s sensitive info for you). And why are you running rke up using sudo? Do you need elevated rights to access the SSH private key?
Here’s the most recent one I did using a single node. Other nodes were 10.0.0.2 and 10.0.0.3, all basically the same as node 1, except node 2 was Control Plane, Worker and etcd and node 3 was just a Worker and etcd.
I was trying to run it as a regular user instead of as the root user, since there seem to be certain issues with running things as root. The SSH key is located at ~/.ssh/id_rsa under my emcclure account shown below, so I wanted to make sure I ran it under that account. I’m able to do ssh ‘emcclure@remotehost’ without being prompted for sudo or a password of any type, if that’s what you mean. I added my emcclure user to the docker group and can run docker commands without sudo.
I’ve tried running the setup as root and as the emcclure account, and I get the same results either way.
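For reference, a minimal single-node cluster.yml might look like the sketch below. It is an illustration with placeholder values (`myuser`, `10.0.0.1`, the key path), not the file from this post; the field names are the ones rke uses for node entries.

```shell
#!/bin/sh
# Illustrative only: write a minimal single-node cluster.yml into a scratch
# directory. Address, user and key path are placeholders.
WORKDIR=$(mktemp -d)
CLUSTER_FILE="$WORKDIR/cluster.yml"

cat > "$CLUSTER_FILE" <<'EOF'
nodes:
  - address: 10.0.0.1
    user: myuser
    ssh_key_path: ~/.ssh/id_rsa
    role: [controlplane, etcd, worker]
EOF

cat "$CLUSTER_FILE"

# Then, as the same account that owns the key (no sudo), something like:
#   ./rke up --config "$CLUSTER_FILE"
```

The point of the sketch is that `user` and `ssh_key_path` should describe the same account and key that already work for a plain `ssh user@host`, and that the `user` account on the node needs to be in the docker group.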
Do you have public IP addresses for these nodes? You don’t need to copy RKE around to the nodes you’re installing to; it can just be on your local machine, as long as the proper ports are open. For setting up the initial Rancher cluster, you would want either 1 or 3 nodes for the RKE install.
I just have the one IP address for them, no internal/external stuff. Can I run this from my Windows machine to set it up? Should I be running it from one of the nodes? Does that make a difference?
Can your machine resolve those IPs? And do those machines have internet access somehow? Where you run RKE doesn’t matter, as long as you can resolve and SSH to those machines from where you are running it.
They should all be able to. I’ve added the IP and host name in /etc/hosts on each of the machines. They are all in the same subnet as well and have internet access.
Is there a specific way I need to set up the certificates on the nodes? Any certain commands I need to run? Anything I need to copy from node to node? I haven’t found anything that’s totally clear on that, so if that’s something I’m missing somehow, I’d like to eliminate that first.
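On the SSH side, the usual pattern is one key pair on the machine that runs rke, with its public half installed for the login user on every node. The sketch below prints the commands by default (`DRY_RUN=1`) instead of running them. The user name, node list, and the `run` and `setup_keys` helpers are placeholders for illustration, and this covers only the SSH keys, not any Kubernetes certificates.

```shell
#!/bin/sh
# Sketch: create one key pair on the machine that runs rke and install its
# public half on every node. DRY_RUN=1 (the default here) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
SSH_USER="${SSH_USER:-myuser}"              # placeholder account name
KEY="${KEY:-$HOME/.ssh/id_rsa}"
NODES="${NODES:-10.0.0.1 10.0.0.2 10.0.0.3}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_keys() {
    # Generate a key only if none exists yet; -N "" means no passphrase.
    if [ ! -f "$KEY" ]; then
        run ssh-keygen -t rsa -b 4096 -N "" -f "$KEY"
    fi
    for node in $NODES; do
        # Appends the public key to ~/.ssh/authorized_keys on the node.
        run ssh-copy-id -i "$KEY.pub" "$SSH_USER@$node"
    done
}

setup_keys
```

Running it with `DRY_RUN=0` would execute the commands for real. Nothing needs to be copied from node to node in this scheme; only the machine running rke needs the private key.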