Cannot get a Rancher cluster set up


#1

Hello,

I’m new to Docker/Rancher/Kubernetes in general. I’m setting up a POC for an internal team, and they want to try Rancher. I’m setting up a Rancher cluster right now but keep running into issues and don’t know what to do. Here’s what I have:

6 VMs (CentOS 7.6), all with DHCP reservations.
Docker 17.03.2
Nginx load balancer on one of the nodes
All VMs are in the same host cluster and on the same subnet

I have installed Docker on all 6 nodes, and on the 4th node I installed the Nginx load balancer. I have been going through the directions scattered across the Rancher website and am now stuck on the rke part. For whatever reason, wget didn’t copy over the whole rke binary, so I had to download it manually and use SCP to copy it to each machine (is that needed on every machine, or just one?). I’m using rke v0.1.15. I’ve modified the file and given it the proper permissions. When I run the rke config command I’ve tried specifying 6 nodes and also 3 nodes. With 6 nodes I didn’t specify anything for node 1 or node 4: I thought node 1 would somehow become the master, and node 4 is the load balancer node, so I didn’t think anything should be installed on it, but I’m not sure, as I can’t find any documentation that’s clear on that. That failed because I didn’t specify anything on node 1. So I tried creating the cluster on just the first 3 nodes, with different combinations of which one was etcd, worker, and control plane, but no matter what I do it always fails. I basically get something like this:

[root@Cent7Dock1 bin]# rke up --config cluster.yml --ssh-agent-auth
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.2]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.2:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.3]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.3:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.1]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.1:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
WARN[0000] Removing host [10.0.0.2] from node lists
WARN[0000] Removing host [10.0.0.3] from node lists
WARN[0000] Removing host [10.0.0.1] from node lists
FATA[0000] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [10.0.0.2]

But if I go to the first machine (10.0.0.1), I can SSH to the other machines without issue. firewalld is stopped and disabled on all machines, and the docker service is started and enabled on all of them.

I’ve googled around for fixes, but nothing seems to work. What am I missing? I was really hoping to get the cluster set up today and finish installing Rancher, integrating it with vCenter, etc., so I can complete the POC and let the other team test it.


#2

I should also mention that I’ve already done the part about adding the user to the docker group. I’ve tried it with that user and with root, but no difference. When I set everything up initially I was using the root user (installing Docker, Nginx, etc.).


#3

Just tried again after adding the IPs and host names of each machine to /etc/hosts, but no difference. Went through rke config again, chose 3 hosts, and mostly accepted the defaults other than the IPs needed for the machines. I get these errors now:

INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.1]: Can't establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.2]: Can't establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.3]: Can't establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
WARN[0000] Removing host [10.0.0.1] from node lists
WARN[0000] Removing host [10.0.0.2] from node lists
WARN[0000] Removing host [10.0.0.3] from node lists
FATA[0000] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [10.0.0.1]

First node is supposed to be an etcd host along with a control plane. Node 2 is all 3, node 3 is just worker and etcd.
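For reference, that role layout corresponds to a trimmed-down cluster.yml roughly like this (everything omitted falls back to the rke defaults; the IPs and user are just my machines’ values):

```yaml
nodes:
  - address: 10.0.0.1
    user: emcclure
    ssh_key_path: ~/.ssh/id_rsa
    role: [controlplane, etcd]
  - address: 10.0.0.2
    user: emcclure
    ssh_key_path: ~/.ssh/id_rsa
    role: [controlplane, worker, etcd]
  - address: 10.0.0.3
    user: emcclure
    ssh_key_path: ~/.ssh/id_rsa
    role: [worker, etcd]
```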


#4

Tried this again with the newest version of rke (v0.2.0-rc5). I see this now:

INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] Generating CA kubernetes certificates
INFO[0000] [certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates
INFO[0000] [certificates] Generating Kubernetes API server certificates
INFO[0001] [certificates] Generating Service account token key
INFO[0001] [certificates] Generating Kube Controller certificates
INFO[0001] [certificates] Generating Kube Scheduler certificates
INFO[0002] [certificates] Generating Kube Proxy certificates
INFO[0003] [certificates] Generating Node certificate
INFO[0003] [certificates] Generating admin certificates and kubeconfig
INFO[0003] [certificates] Generating Kubernetes API server proxy client certificates
INFO[0003] [certificates] Generating etcd-10.0.0.1 certificate and key
INFO[0004] [certificates] Generating etcd-10.0.0.2 certificate and key
INFO[0004] [certificates] Generating etcd-10.0.0.3 certificate and key
INFO[0005] Successfully Deployed state file at [./cluster.rkestate]
INFO[0005] Building Kubernetes cluster

But then I get the same errors as I originally posted. Is there any help for this? Any fix? This is really blocking me from finishing, and nothing else I’ve found as a potential fix has worked. It makes me disappointed in the product and in the lack of clear documentation to get it working properly.


#5

Still running into issues and not finding anything that really helps. I’ve now done this:

Created an SSH key at ~/.ssh/id_rsa under my user account and copied it to the nodes with ssh-copy-id username@remotehost. I can then run ssh username@remotehost and connect right away, yet I still keep getting the same errors when I try to run sudo ./rke up. This is getting very annoying very quickly, and the lack of any clear documentation makes it hard to complete. Getting this error:

WARN[0004] Failed to set up SSH tunneling for host [10.0.0.1]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.1:22] using SSH. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

That’s even on the host I’m running the command from. I get the same thing for the other 2 hosts I’m trying to set up as well.
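From what I can tell, rke authenticates with the key file alone (no password fallback, and no agent unless --ssh-agent-auth is set), so the closest manual test I know of is something like this (host and user are my values; substitute your own):

```shell
# Test SSH the way rke does it: key file only. BatchMode disables password
# prompts and IdentitiesOnly pins the exact key, so this fails the same way
# rke would if the key is not accepted.
ssh -i ~/.ssh/id_rsa \
    -o IdentitiesOnly=yes -o BatchMode=yes -o ConnectTimeout=3 \
    emcclure@10.0.0.1 'docker version' \
  || echo "key-file-only auth failed"
```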

If I do the following from the Docker website, it just kills the service:

Configuring remote access with systemd unit file

  1. Use the command sudo systemctl edit docker.service to open an override file for docker.service in a text editor.
  2. Add or modify the following lines, substituting your own values.
     [Service]
     ExecStart=
     ExecStart=/usr/bin/dockerd -H fd:// -H tcp://127.0.0.1:2375
  3. Save the file.
  4. Reload the systemctl configuration.
     $ sudo systemctl daemon-reload
  5. Restart Docker.
     $ sudo systemctl restart docker.service

So does anybody have any idea on this? What am I missing? It sure seems like a lot of effort to get Rancher set up to do this whole Kubernetes install, and I’m not impressed at all.


#6

The outputs from all the tries are different, which is odd if you are executing the same sequence of commands each time. If you are specifying --ssh-agent-auth, rke tries to use the SSH agent as described at https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent. Most SSH errors are also described at https://rancher.com/docs/rke/v0.1.x/en/troubleshooting/ssh-connectivity-errors/.
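If you do want agent auth, the environment rke expects looks roughly like this (the /tmp key below is a throwaway, purely for illustration; in practice you would ssh-add your real ~/.ssh/id_rsa):

```shell
# Start an agent, load a key, and confirm what --ssh-agent-auth will see.
eval "$(ssh-agent -s)" > /dev/null
rm -f /tmp/demo_id_rsa /tmp/demo_id_rsa.pub   # throwaway demo key
ssh-keygen -q -t rsa -b 2048 -f /tmp/demo_id_rsa -N ''
ssh-add /tmp/demo_id_rsa 2> /dev/null
env | grep SSH_AUTH_SOCK   # rke reaches the agent through this socket
ssh-add -l                 # the loaded identity should be listed here
```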

I can look into it but need more and consistent info:

  • cluster.yml used
  • OpenSSH version on the host(s) (sshd -V, nc IP 22)
  • id, docker ps, ls -la /var/run/docker.sock output when you are logged in to a host using SSH on the command line
  • If using SSH agent, output of env | grep SSH_AUTH_SOCK and ssh-add -l.

#7

So I’m just running sudo ./rke up right now from the user account. I run sudo ./rke config to create the cluster.yml. I’ve tried specifying 6 hosts, 3 hosts, and 1 host. The most recent error I posted above was for 3 hosts.

OpenSSH version is OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017

I didn’t set anything up for an SSH agent. I have looked at that second link you have with those errors and nothing has helped me out.

For id I get: uid=1000(myuser) gid=1000(myuser) groups=1000(myuser),10(wheel),993(docker)

docker ps gives:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

ls -la /var/run/docker.sock gives:
srw-rw---- 1 root docker 0 Feb 6 09:10 /var/run/docker.sock

Hope this helps. Please let me know if you need something else.


#8

I will still need the cluster.yml posted (you can mask IPs if that’s sensitive info for you). And why are you running rke up using sudo? Do you need elevated rights to access the SSH private key?
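Note that under sudo, "~" in ssh_key_path expands against root's home directory, which would line up with the earlier "open /root/.ssh/id_rsa: no such file or directory" error. A quick demonstration (no sudo needed; the two paths are just this thread's scenario):

```shell
# Tilde expansion follows $HOME, so the same ssh_key_path resolves to
# different files depending on who invokes rke.
HOME=/home/emcclure sh -c 'echo ~/.ssh/id_rsa'   # /home/emcclure/.ssh/id_rsa
HOME=/root sh -c 'echo ~/.ssh/id_rsa'            # /root/.ssh/id_rsa
```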


#9

Here’s the most recent one I did, using a single node. The other nodes were 10.0.0.2 and 10.0.0.3, all basically the same as node 1, except node 2 was control plane, worker, and etcd, and node 3 was just worker and etcd.

I was trying to run it as a regular user instead of as the root user, since there seem to be certain issues with running things as root. The SSH key was located at ~/.ssh/id_rsa under my emcclure account shown below, so I wanted to make sure I ran it under that. I’m able to do ssh emcclure@remotehost without being prompted for a password of any type, if that’s what you mean. I added my emcclure user to the docker group and can run docker commands without sudo.

I’ve tried running the setup as root and as the emcclure account, I get the same results either way.

nodes:
- address: 10.0.0.1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: flannel
  options: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.2.24
  alpine: rancher/rke-tools:v0.1.23
  nginx_proxy: rancher/rke-tools:v0.1.23
  cert_downloader: rancher/rke-tools:v0.1.23
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.23
  kubedns: rancher/k8s-dns-kube-dns-amd64:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny-amd64:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar-amd64:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler-amd64:1.0.0
  coredns: coredns/coredns:1.2.6
  coredns_autoscaler: rancher/cluster-proportional-autoscaler-amd64:1.0.0
  kubernetes: rancher/hyperkube:v1.13.1-rancher1
  flannel: rancher/coreos-flannel:v0.10.0
  flannel_cni: rancher/coreos-flannel-cni:v0.3.0
  calico_node: rancher/calico-node:v3.4.0
  calico_cni: rancher/calico-cni:v3.4.0
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.4.0
  canal_cni: rancher/calico-cni:v3.4.0
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.5.0
  weave_cni: weaveworks/weave-npc:2.5.0
  pod_infra_container: rancher/pause-amd64:3.1
  ingress: rancher/nginx-ingress-controller:0.21.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.4
  metrics_server: rancher/metrics-server-amd64:v0.3.1
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
restore:
  restore: false
  snapshot_name: ""
dns:
  provider: ""
  upstreamnameservers: []
  reversecidrs: []
  node_selector: {}

#10

Do you have public IP addresses for these nodes? You don’t need to copy RKE around to the nodes you’re installing to; it can just be on your local machine as long as the proper ports are open. For setting up the initial Rancher cluster you would want either 1 or 3 nodes for the RKE install.
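A rough reachability check from wherever you run RKE would look like this (using this thread's example IPs; /dev/tcp is a bash feature, so run it with bash):

```shell
# Probe TCP port 22 on each node with a 2-second timeout. rke itself only
# needs SSH to each node; the Kubernetes ports must then be open between
# the nodes themselves.
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/22" 2>/dev/null; then
    echo "$host:22 reachable"
  else
    echo "$host:22 unreachable"
  fi
done
```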


#11

I just have the one IP address for each of them, no internal/external split. Can I run this from my Windows machine to set it up? Should I be running it from one of the nodes? Does that make a difference?


#12

Can your machine resolve those IPs? And do those machines have internet access somehow? Where you run RKE doesn’t matter as long as you can resolve/SSH to those machines from wherever you are running it.


#13

They should all be able to. I’ve added the IP and host name in /etc/hosts on each of the machines. They are all in the same subnet as well and have internet access.


#14

Is there a specific way I need to set up the certificates on the nodes? Any particular commands I need to run? Anything I need to copy from node to node? I haven’t found anything that’s totally clear on that, so if that’s something I’m missing I’d like to eliminate it first.


#15

Ok, I’m making progress, but still stuck. I found an issue similar to my error (https://github.com/hashicorp/terraform/issues/18450), went to https://wiki.centos.org/HowTos/Network/SecuringSSH, and did these steps:

Set permissions on your private key:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa

Copy the public key (id_rsa.pub) to the server and install it to the authorized_keys list:

$ cat id_rsa.pub >> ~/.ssh/authorized_keys

Note: once you’ve imported the public key, you can delete it from the server.

And finally set file permissions on the server:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
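Putting those steps together, this is roughly what the key setup boils down to, sketched in a throwaway directory so it is safe to run anywhere (a real setup targets ~/.ssh on both ends):

```shell
# Generate a passphrase-less key (rke reads the key file directly; an
# encrypted key generally needs the SSH agent instead) and apply the
# permissions sshd's StrictModes checks.
demo=$(mktemp -d)                       # stand-in for ~/.ssh
ssh-keygen -q -t rsa -b 2048 -f "$demo/id_rsa" -N ''
chmod 700 "$demo"
chmod 600 "$demo/id_rsa"
stat -c '%a %n' "$demo" "$demo/id_rsa"  # expect 700 on the dir, 600 on the key
```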

And I created a new yml file with 3 hosts this time and got further but it still fails.

INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
INFO[0000] [network] Deploying port listener containers
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.15] on host [10.0.0.1]
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.15] on host [10.0.0.2]
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.15] on host [10.0.0.3]
INFO[0002] [network] Successfully pulled image [rancher/rke-tools:v0.1.15] on host [10.0.0.1]
INFO[0008] [network] Successfully pulled image [rancher/rke-tools:v0.1.15] on host [10.0.0.3]
INFO[0009] [network] Successfully updated [rke-etcd-port-listener] container on host [10.0.0.3]
INFO[0009] [network] Successfully pulled image [rancher/rke-tools:v0.1.15] on host [10.0.0.2]
INFO[0009] [network] Successfully updated [rke-etcd-port-listener] container on host [10.0.0.2]
FATA[0009] Failed to create [rke-etcd-port-listener] container on host [10.0.0.1]: Error: No such image: rancher/rke-tools:v0.1.15

I’ve also tried this with the latest version but I get the same error.
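Since the pull on 10.0.0.1 apparently succeeded but the container create then says the image is missing, I’m planning to check the image manually on that node with something like this (guarded so it is harmless where docker is absent):

```shell
# Check whether the exact tag actually landed on the host, and re-pull it.
if command -v docker > /dev/null 2>&1; then
  docker images rancher/rke-tools             # is v0.1.15 listed?
  docker pull rancher/rke-tools:v0.1.15 || echo "manual pull failed"
else
  echo "docker not installed here"
fi
```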


#17

And if you need to see the cluster.yml here it is:

nodes:
- address: 10.0.0.1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels: {}
- address: 10.0.0.2
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels: {}
- address: 10.0.0.3
  port: "22"
  internal_address: ""
  role:
  - worker
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels: {}
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    snapshot: null
    retention: ""
    creation: ""
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
authentication:
  strategy: x509
  options: {}
  sans: []
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.2.18
  alpine: rancher/rke-tools:v0.1.15
  nginx_proxy: rancher/rke-tools:v0.1.15
  cert_downloader: rancher/rke-tools:v0.1.15
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.15
  kubedns: rancher/k8s-dns-kube-dns-amd64:1.14.10
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny-amd64:1.14.10
  kubedns_sidecar: rancher/k8s-dns-sidecar-amd64:1.14.10
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler-amd64:1.0.0
  kubernetes: rancher/hyperkube:v1.11.6-rancher1
  flannel: rancher/coreos-flannel:v0.10.0
  flannel_cni: rancher/coreos-flannel-cni:v0.3.0
  calico_node: rancher/calico-node:v3.1.3
  calico_cni: rancher/calico-cni:v3.1.3
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.1.3
  canal_cni: rancher/calico-cni:v3.1.3
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.1.2
  weave_cni: weaveworks/weave-npc:2.1.2
  pod_infra_container: rancher/pause-amd64:3.1
  ingress: rancher/nginx-ingress-controller:0.16.2-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.4
  metrics_server: rancher/metrics-server-amd64:v0.2.1
ssh_key_path: ~/.ssh/id_rsa
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
monitoring:
  provider: ""
  options: {}

#18

Can you share docker info from the hosts as well? I am also on User Slack (https://slack.rancher.io), probably a bit easier/faster.