Cannot get a Rancher cluster set up


#1

Hello,

I’m new to Docker/Rancher/Kubernetes in general. I’m setting up a POC for an internal team, and they want to try Rancher. I’m setting up a Rancher cluster right now but keep running into issues and don’t know what to do. Here’s what I have:

6 VMs (CentOS 7.6), all with DHCP reservations.
Docker 17.03.2
Nginx load balancer on one of the nodes
All VMs are in the same host cluster and on the same subnet

I have installed Docker on all 6 nodes, and on the 4th node I installed the Nginx load balancer. I have been going through the directions scattered across the Rancher website and am now stuck on the rke part. For whatever reason, wget didn’t copy over the whole rke binary, so I had to download it manually and use SCP to copy it to each machine (is that needed on every machine, or just one?). I’m using rke v0.1.15. I’ve modified the file and given it the proper permissions. When I run the rke config command I’ve tried specifying 6 nodes and also 3 nodes. With 6 nodes I didn’t specify anything for node 1 or node 4: I thought node 1 would somehow become the master, and node 4 is the load balancer node, so I didn’t think anything should be installed on it, but I’m not sure, as I can’t find any documentation that’s clear on that. That failed because I didn’t specify anything on node 1. So I tried creating the cluster on just the first 3 nodes, with different combinations of which one was etcd, worker, and control plane, but no matter what I do it always fails. I basically get something like this:

[root@Cent7Dock1 bin]# rke up --config cluster.yml --ssh-agent-auth
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.2]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.2:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.3]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.3:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.1]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.1:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
WARN[0000] Removing host [10.0.0.2] from node lists
WARN[0000] Removing host [10.0.0.3] from node lists
WARN[0000] Removing host [10.0.0.1] from node lists
FATA[0000] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [10.0.0.2]

But if I go to the first machine (10.0.0.1), I can SSH to the other machines without issue. firewalld is stopped and disabled on all machines, and the docker service is started and enabled on all of them.

I’ve googled around for fixes, but nothing seems to work. What am I missing? I was really hoping to get the cluster set up today and finish installing Rancher, integrating it with vCenter, etc., so I can complete the POC and let the other team test it.


#2

I should also mention that I’ve already done the part about adding the user to the docker group. I’ve tried it with that user and with root, but no difference. When I set everything up initially I was using the root user (installing Docker, Nginx, etc.).


#3

Just tried again after adding the IPs and host names of each machine to /etc/hosts, but no difference. Went through rke config again, chose 3 hosts, and mostly accepted the defaults other than the IPs needed for the machines. I get these errors now:

INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.1]: Can't establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.2]: Can't establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
WARN[0000] Failed to set up SSH tunneling for host [10.0.0.3]: Can't establish dialer connection: Error while reading SSH key file: open /root/.ssh/id_rsa: no such file or directory
WARN[0000] Removing host [10.0.0.1] from node lists
WARN[0000] Removing host [10.0.0.2] from node lists
WARN[0000] Removing host [10.0.0.3] from node lists
FATA[0000] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [10.0.0.1]

First node is supposed to be an etcd host along with a control plane. Node 2 is all 3, node 3 is just worker and etcd.
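For reference, that role layout corresponds to a trimmed-down cluster.yml roughly like this (everything omitted falls back to the rke defaults; the IPs and user are just my machines’ values):

```yaml
nodes:
  - address: 10.0.0.1
    user: emcclure
    ssh_key_path: ~/.ssh/id_rsa
    role: [controlplane, etcd]
  - address: 10.0.0.2
    user: emcclure
    ssh_key_path: ~/.ssh/id_rsa
    role: [controlplane, worker, etcd]
  - address: 10.0.0.3
    user: emcclure
    ssh_key_path: ~/.ssh/id_rsa
    role: [worker, etcd]
```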


#4

Tried this again with the newest version of rke (v0.2.0-rc5). I see this now:

INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] Generating CA kubernetes certificates
INFO[0000] [certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates
INFO[0000] [certificates] Generating Kubernetes API server certificates
INFO[0001] [certificates] Generating Service account token key
INFO[0001] [certificates] Generating Kube Controller certificates
INFO[0001] [certificates] Generating Kube Scheduler certificates
INFO[0002] [certificates] Generating Kube Proxy certificates
INFO[0003] [certificates] Generating Node certificate
INFO[0003] [certificates] Generating admin certificates and kubeconfig
INFO[0003] [certificates] Generating Kubernetes API server proxy client certificates
INFO[0003] [certificates] Generating etcd-10.0.0.1 certificate and key
INFO[0004] [certificates] Generating etcd-10.0.0.2 certificate and key
INFO[0004] [certificates] Generating etcd-10.0.0.3 certificate and key
INFO[0005] Successfully Deployed state file at [./cluster.rkestate]
INFO[0005] Building Kubernetes cluster

But then I get the same errors as I originally posted. Is there any help for this? Any fix? This is really blocking me from finishing, and nothing else I’ve found as a potential fix has worked. It makes me disappointed in the product and in the lack of clear documentation to get it working properly.


#5

Still running into issues and not finding anything that really helps. I’ve now done this:

Created an SSH key at ~/.ssh/id_rsa under my user account and copied it to the nodes with ssh-copy-id username@remotehost. I can then run ssh username@remotehost and connect right away, yet I still keep getting the same errors when I try to run sudo ./rke up. This is getting very annoying very quickly, and the lack of any clear documentation makes it hard to complete. Getting this error:

WARN[0004] Failed to set up SSH tunneling for host [10.0.0.1]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [10.0.0.1:22] using SSH. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

That’s even on the host I’m running the command from. I get the same thing for the other 2 hosts I’m trying to set up as well.
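From what I can tell, rke authenticates with the key file alone (no password fallback, and no agent unless --ssh-agent-auth is set), so the closest manual test I know of is something like this (host and user are my values; substitute your own):

```shell
# Test SSH the way rke does it: key file only. BatchMode disables password
# prompts and IdentitiesOnly pins the exact key, so this fails the same way
# rke would if the key is not accepted.
ssh -i ~/.ssh/id_rsa \
    -o IdentitiesOnly=yes -o BatchMode=yes -o ConnectTimeout=3 \
    emcclure@10.0.0.1 'docker version' \
  || echo "key-file-only auth failed"
```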

If I do the following from the Docker website, it just kills the service:

Configuring remote access with systemd unit file

  1. Use the command sudo systemctl edit docker.service to open an override file for docker.service in a text editor.
  2. Add or modify the following lines, substituting your own values.
     [Service]
     ExecStart=
     ExecStart=/usr/bin/dockerd -H fd:// -H tcp://127.0.0.1:2375
  3. Save the file.
  4. Reload the systemctl configuration.
     $ sudo systemctl daemon-reload
  5. Restart Docker.
     $ sudo systemctl restart docker.service

So does anybody have any idea on this? What am I missing? It sure seems like a lot of effort to get Rancher set up to do this whole Kubernetes install, and I’m not impressed at all.


#6

The outputs from all the tries are different, which is odd if you are executing the same sequence of commands each time. If you are specifying --ssh-agent-auth, rke tries to use the SSH agent as described at https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent. Most SSH errors are also described at https://rancher.com/docs/rke/v0.1.x/en/troubleshooting/ssh-connectivity-errors/.
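If you do want agent auth, the environment rke expects looks roughly like this (the /tmp key below is a throwaway, purely for illustration; in practice you would ssh-add your real ~/.ssh/id_rsa):

```shell
# Start an agent, load a key, and confirm what --ssh-agent-auth will see.
eval "$(ssh-agent -s)" > /dev/null
rm -f /tmp/demo_id_rsa /tmp/demo_id_rsa.pub   # throwaway demo key
ssh-keygen -q -t rsa -b 2048 -f /tmp/demo_id_rsa -N ''
ssh-add /tmp/demo_id_rsa 2> /dev/null
env | grep SSH_AUTH_SOCK   # rke reaches the agent through this socket
ssh-add -l                 # the loaded identity should be listed here
```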

I can look into it but need more and consistent info:

  • cluster.yml used
  • OpenSSH version on the host(s) (sshd -V, nc IP 22)
  • id, docker ps, ls -la /var/run/docker.sock output when you are logged in to a host using SSH on the command line
  • If using SSH agent, output of env | grep SSH_AUTH_SOCK and ssh-add -l.

#7

So I’m just running sudo ./rke up right now from the user account. I run sudo ./rke config to create the cluster.yml. I’ve tried specifying 6 hosts, 3 hosts, and 1 host. The most recent error I posted above was for 3 hosts.

OpenSSH version is OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017

I didn’t set anything up for an SSH agent. I have looked at that second link you have with those errors and nothing has helped me out.

For id I get: uid=1000(myuser) gid=1000(myuser) groups=1000(myuser),10(wheel),993(docker)

docker ps gives:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

ls -la /var/run/docker.sock gives:
srw-rw---- 1 root docker 0 Feb 6 09:10 /var/run/docker.sock

Hope this helps. Please let me know if you need something else.


#8

I will still need the cluster.yml posted (you can mask IPs if that’s sensitive info for you). And why are you running rke up using sudo? Do you need elevated rights to access the SSH private key?
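Note that under sudo, "~" in ssh_key_path expands against root's home directory, which would line up with the earlier "open /root/.ssh/id_rsa: no such file or directory" error. A quick demonstration (no sudo needed; the two paths are just this thread's scenario):

```shell
# Tilde expansion follows $HOME, so the same ssh_key_path resolves to
# different files depending on who invokes rke.
HOME=/home/emcclure sh -c 'echo ~/.ssh/id_rsa'   # /home/emcclure/.ssh/id_rsa
HOME=/root sh -c 'echo ~/.ssh/id_rsa'            # /root/.ssh/id_rsa
```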


#9

Here’s the most recent one I did, using a single node. The other nodes were 10.0.0.2 and 10.0.0.3, all basically the same as node 1, except node 2 was control plane, worker, and etcd, and node 3 was just worker and etcd.

I was trying to run it as a regular user instead of as the root user, since there seem to be certain issues with running things as root. The SSH key was located at ~/.ssh/id_rsa under my emcclure account shown below, so I wanted to make sure I ran it under that. I’m able to do ssh emcclure@remotehost without being prompted for a password of any type, if that’s what you mean. I added my emcclure user to the docker group and can run docker commands without sudo.

I’ve tried running the setup as root and as the emcclure account, I get the same results either way.

nodes:
- address: 10.0.0.1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: flannel
  options: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.2.24
  alpine: rancher/rke-tools:v0.1.23
  nginx_proxy: rancher/rke-tools:v0.1.23
  cert_downloader: rancher/rke-tools:v0.1.23
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.23
  kubedns: rancher/k8s-dns-kube-dns-amd64:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny-amd64:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar-amd64:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler-amd64:1.0.0
  coredns: coredns/coredns:1.2.6
  coredns_autoscaler: rancher/cluster-proportional-autoscaler-amd64:1.0.0
  kubernetes: rancher/hyperkube:v1.13.1-rancher1
  flannel: rancher/coreos-flannel:v0.10.0
  flannel_cni: rancher/coreos-flannel-cni:v0.3.0
  calico_node: rancher/calico-node:v3.4.0
  calico_cni: rancher/calico-cni:v3.4.0
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.4.0
  canal_cni: rancher/calico-cni:v3.4.0
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.5.0
  weave_cni: weaveworks/weave-npc:2.5.0
  pod_infra_container: rancher/pause-amd64:3.1
  ingress: rancher/nginx-ingress-controller:0.21.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.4
  metrics_server: rancher/metrics-server-amd64:v0.3.1
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
restore:
  restore: false
  snapshot_name: ""
dns:
  provider: ""
  upstreamnameservers: []
  reversecidrs: []
  node_selector: {}

#10

Do you have public IP addresses for these nodes? You don’t need to copy RKE around to the nodes you’re installing to; it can just be on your local machine as long as the proper ports are open. For setting up the initial Rancher cluster you would want either 1 or 3 nodes for the RKE install.
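A rough reachability check from wherever you run RKE would look like this (using this thread's example IPs; /dev/tcp is a bash feature, so run it with bash):

```shell
# Probe TCP port 22 on each node with a 2-second timeout. rke itself only
# needs SSH to each node; the Kubernetes ports must then be open between
# the nodes themselves.
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/22" 2>/dev/null; then
    echo "$host:22 reachable"
  else
    echo "$host:22 unreachable"
  fi
done
```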


#11

I just have the one IP address for each of them, no internal/external split. Can I run this from my Windows machine to set it up? Should I be running it from one of the nodes? Does that make a difference?


#12

Can your machine resolve those IPs? And do those machines have internet access somehow? Where you run RKE doesn’t matter as long as you can resolve/SSH to those machines from wherever you are running it.


#13

They should all be able to. I’ve added the IP and host name in /etc/hosts on each of the machines. They are all in the same subnet as well and have internet access.


#14

Is there a specific way I need to set up the certificates on the nodes? Any particular commands I need to run? Anything I need to copy from node to node? I haven’t found anything that’s totally clear on that, so if that’s something I’m missing I’d like to eliminate it first.


#15

Ok, I’m making progress, but still stuck. I found an issue similar to my error (https://github.com/hashicorp/terraform/issues/18450), went to https://wiki.centos.org/HowTos/Network/SecuringSSH, and did these steps:

Set permissions on your private key:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa

Copy the public key (id_rsa.pub) to the server and install it to the authorized_keys list:

$ cat id_rsa.pub >> ~/.ssh/authorized_keys

Note: once you’ve imported the public key, you can delete it from the server.

And finally set file permissions on the server:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
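Putting those steps together, this is roughly what the key setup boils down to, sketched in a throwaway directory so it is safe to run anywhere (a real setup targets ~/.ssh on both ends):

```shell
# Generate a passphrase-less key (rke reads the key file directly; an
# encrypted key generally needs the SSH agent instead) and apply the
# permissions sshd's StrictModes checks.
demo=$(mktemp -d)                       # stand-in for ~/.ssh
ssh-keygen -q -t rsa -b 2048 -f "$demo/id_rsa" -N ''
chmod 700 "$demo"
chmod 600 "$demo/id_rsa"
stat -c '%a %n' "$demo" "$demo/id_rsa"  # expect 700 on the dir, 600 on the key
```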

And I created a new yml file with 3 hosts this time and got further but it still fails.

INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.0.1]
INFO[0000] [dialer] Setup tunnel for host [10.0.0.2]
INFO[0000] [dialer] Setup tunnel for host [10.0.0.3]
INFO[0000] [network] Deploying port listener containers
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.15] on host [10.0.0.1]
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.15] on host [10.0.0.2]
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.15] on host [10.0.0.3]
INFO[0002] [network] Successfully pulled image [rancher/rke-tools:v0.1.15] on host [10.0.0.1]
INFO[0008] [network] Successfully pulled image [rancher/rke-tools:v0.1.15] on host [10.0.0.3]
INFO[0009] [network] Successfully updated [rke-etcd-port-listener] container on host [10.0.0.3]
INFO[0009] [network] Successfully pulled image [rancher/rke-tools:v0.1.15] on host [10.0.0.2]
INFO[0009] [network] Successfully updated [rke-etcd-port-listener] container on host [10.0.0.2]
FATA[0009] Failed to create [rke-etcd-port-listener] container on host [10.0.0.1]: Error: No such image: rancher/rke-tools:v0.1.15

I’ve also tried this with the latest version but I get the same error.
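Since the pull on 10.0.0.1 apparently succeeded but the container create then says the image is missing, I’m planning to check the image manually on that node with something like this (guarded so it is harmless where docker is absent):

```shell
# Check whether the exact tag actually landed on the host, and re-pull it.
if command -v docker > /dev/null 2>&1; then
  docker images rancher/rke-tools             # is v0.1.15 listed?
  docker pull rancher/rke-tools:v0.1.15 || echo "manual pull failed"
else
  echo "docker not installed here"
fi
```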


#17

And if you need to see the cluster.yml here it is:

nodes:
- address: 10.0.0.1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels: {}
- address: 10.0.0.2
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels: {}
- address: 10.0.0.3
  port: "22"
  internal_address: ""
  role:
  - worker
  - etcd
  hostname_override: ""
  user: emcclure
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  labels: {}
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    snapshot: null
    retention: ""
    creation: ""
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
authentication:
  strategy: x509
  options: {}
  sans: []
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.2.18
  alpine: rancher/rke-tools:v0.1.15
  nginx_proxy: rancher/rke-tools:v0.1.15
  cert_downloader: rancher/rke-tools:v0.1.15
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.15
  kubedns: rancher/k8s-dns-kube-dns-amd64:1.14.10
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny-amd64:1.14.10
  kubedns_sidecar: rancher/k8s-dns-sidecar-amd64:1.14.10
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler-amd64:1.0.0
  kubernetes: rancher/hyperkube:v1.11.6-rancher1
  flannel: rancher/coreos-flannel:v0.10.0
  flannel_cni: rancher/coreos-flannel-cni:v0.3.0
  calico_node: rancher/calico-node:v3.1.3
  calico_cni: rancher/calico-cni:v3.1.3
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.1.3
  canal_cni: rancher/calico-cni:v3.1.3
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.1.2
  weave_cni: weaveworks/weave-npc:2.1.2
  pod_infra_container: rancher/pause-amd64:3.1
  ingress: rancher/nginx-ingress-controller:0.16.2-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.4
  metrics_server: rancher/metrics-server-amd64:v0.2.1
ssh_key_path: ~/.ssh/id_rsa
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
monitoring:
  provider: ""
  options: {}

#18

Can you share docker info from the hosts as well? I am also on User Slack (https://slack.rancher.io), probably a bit easier/faster.