Not able to set up the Rancher K8s cluster using RKE

I am trying to set up a 3-node cluster with RKE (every node running the controlplane, worker, and etcd roles) with the following (a quick sanity check of these prerequisites is shown after the list):

  1. Docker version 20.10.x
  2. RKE version v1.2.8
  3. RHEL 8.2 OS
  4. A user with sudo privileges on all three nodes, with an SSH key pair whose public key was copied to the user's home directory on each node (ssh-copy-id)
  5. VMs provisioned in Azure infrastructure
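
A minimal sketch of that check, assuming the node IPs and key path used later in this post:

for host in 10.0.1.14 10.0.1.15 10.0.1.16; do
  # Key-based SSH must work, and the SSH user must be able to talk to the Docker daemon
  ssh -i /home/user/.ssh/id_rsa -o BatchMode=yes user@$host "docker version --format '{{.Server.Version}}'"
done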

Below is how I generated my cluster.yml (some details are redacted here):
[username@Rancher-VM ~]$ rke config --name cluster.yml
[+] Cluster Level SSH Private Key Path [~/.ssh/id_rsa]: /home/user/.ssh/id_rsa
[+] Number of Hosts [1]: 3
[+] SSH Address of host (1) [none]: 10.0.1.14
[+] SSH Port of host (1) [22]:
[+] SSH Private Key Path of host (10.0.1.14) [none]: /home/user/.ssh/id_rsa
[+] SSH User of host (10.0.1.14) [ubuntu]: user
[+] Is host (10.0.1.14) a Control Plane host (y/n)? [y]: y
[+] Is host (10.0.1.14) a Worker host (y/n)? [n]: y
[+] Is host (10.0.1.14) an etcd host (y/n)? [n]: y
[+] Override Hostname of host (10.0.1.14) [none]: master
[+] Internal IP of host (10.0.1.14) [none]: 10.0.1.14
[+] Docker socket path on host (10.0.1.14) [/var/run/docker.sock]:
[+] SSH Address of host (2) [none]: 10.0.1.15
[+] SSH Port of host (2) [22]:
[+] SSH Private Key Path of host (10.0.1.15) [none]: /home/user/.ssh/id_rsa
[+] SSH User of host (10.0.1.15) [ubuntu]: user
[+] Is host (10.0.1.15) a Control Plane host (y/n)? [y]: y
[+] Is host (10.0.1.15) a Worker host (y/n)? [n]: y
[+] Is host (10.0.1.15) an etcd host (y/n)? [n]: y
[+] Override Hostname of host (10.0.1.15) [none]: worker1
[+] Internal IP of host (10.0.1.15) [none]: 10.0.1.15
[+] Docker socket path on host (10.0.1.15) [/var/run/docker.sock]:
[+] SSH Address of host (3) [none]: 10.0.1.16
[+] SSH Port of host (3) [22]:
[+] SSH Private Key Path of host (10.0.1.16) [none]: /home/user/.ssh/id_rsa
[+] SSH User of host (10.0.1.16) [ubuntu]: user
[+] Is host (10.0.1.16) a Control Plane host (y/n)? [y]: y
[+] Is host (10.0.1.16) a Worker host (y/n)? [n]: y
[+] Is host (10.0.1.16) an etcd host (y/n)? [n]: y
[+] Override Hostname of host (10.0.1.16) [none]: worker2
[+] Internal IP of host (10.0.1.16) [none]: 10.0.1.16
[+] Docker socket path on host (10.0.1.16) [/var/run/docker.sock]:
[+] Network Plugin Type (flannel, calico, weave, canal, aci) [canal]:
[+] Authentication Strategy [x509]:
[+] Authorization Mode (rbac, none) [rbac]:
[+] Kubernetes Docker image [rancher/hyperkube:v1.20.6-rancher1]:
[+] Cluster domain [cluster.local]:
[+] Service Cluster IP Range [10.43.0.0/16]:
[+] Enable PodSecurityPolicy [n]:
[+] Cluster Network CIDR [10.42.0.0/16]:
[+] Cluster DNS Service IP [10.43.0.10]:
[+] Add addon manifest URLs or YAML files [no]:

Below is the output of
rke up --config ./cluster.yml

INFO[0150] [sync] Syncing nodes Labels and Taints
INFO[0150] [sync] Successfully synced nodes Labels and Taints
INFO[0150] [network] Setting up network plugin: canal
INFO[0150] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0150] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0150] [addons] Executing deploy job rke-network-plugin
INFO[0161] [addons] Setting up coredns
INFO[0161] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0161] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0161] [addons] Executing deploy job rke-coredns-addon
INFO[0172] [addons] CoreDNS deployed successfully
INFO[0172] [dns] DNS provider coredns deployed successfully
INFO[0172] [addons] Setting up Metrics Server
INFO[0172] [addons] Saving ConfigMap for addon rke-metrics-addon to Kubernetes
INFO[0172] [addons] Successfully saved ConfigMap for addon rke-metrics-addon to Kubernetes
INFO[0172] [addons] Executing deploy job rke-metrics-addon
INFO[0177] [addons] Metrics Server deployed successfully
INFO[0177] [ingress] Setting up nginx ingress controller
INFO[0177] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0177] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0177] [addons] Executing deploy job rke-ingress-controller
INFO[0193] [ingress] ingress controller nginx deployed successfully
INFO[0193] [addons] Setting up user addons
INFO[0193] [addons] no user addons defined
FATA[0193] Provisioning incomplete, host(s) [10.0.1.14] skipped because they could not be contacted

I want 10.0.1.14 to be the master and the other two IPs to be workers.
There is passwordless SSH from 10.0.1.14 to .15/.16, but not in the reverse direction; could that be the reason?
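
One way to verify this (hypothetical commands, using the key path from my config) is to force key-only authentication from the machine running rke:

# BatchMode and PreferredAuthentications=publickey disable the password fallback,
# so a failure here means key-based auth to that host is not set up for this user
ssh -v -i /home/user/.ssh/id_rsa -o BatchMode=yes -o PreferredAuthentications=publickey user@10.0.1.14 true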

Please suggest

A redacted cluster.yml and the full log of rke up (or at least the beginning, where the connection is made and where it shows the error for 10.0.1.14) would help here. If you are using the exact same image with the exact same configuration, then there is already something else going on, as that can't happen if they are all 100% identical.
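
For example, capturing the complete output to a file makes it easier to share (a plain shell redirect; adjust the file name as you like):

rke up --config ./cluster.yml 2>&1 | tee rke-up.log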

Hi SuperSeb,

Please find the detailed cluster.yml file after creation:

nodes:
- address: 10.0.1.14
  port: "22"
  internal_address: 10.0.1.14
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: master
  user: user
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/user/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 10.0.1.15
  port: "22"
  internal_address: 10.0.1.15
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: worker1
  user: user
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/user/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 10.0.1.16
  port: "22"
  internal_address: 10.0.1.16
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: worker2
  user: user
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/user/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/mirrored-coreos-etcd:v3.4.15-rancher1
  alpine: rancher/rke-tools:v0.1.74
  nginx_proxy: rancher/rke-tools:v0.1.74
  cert_downloader: rancher/rke-tools:v0.1.74
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.74
  kubedns: rancher/mirrored-k8s-dns-kube-dns:1.15.10
  dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.15.10
  kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.15.10
  kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.1
  coredns: rancher/mirrored-coredns-coredns:1.8.0
  coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.1
  nodelocal: rancher/mirrored-k8s-dns-node-cache:1.15.13
  kubernetes: rancher/hyperkube:v1.20.6-rancher1
  flannel: rancher/coreos-flannel:v0.13.0-rancher1
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/mirrored-calico-node:v3.17.2
  calico_cni: rancher/mirrored-calico-cni:v3.17.2
  calico_controllers: rancher/mirrored-calico-kube-controllers:v3.17.2
  calico_ctl: rancher/mirrored-calico-ctl:v3.17.2
  calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.17.2
  canal_node: rancher/mirrored-calico-node:v3.17.2
  canal_cni: rancher/mirrored-calico-cni:v3.17.2
  canal_controllers: rancher/mirrored-calico-kube-controllers:v3.17.2
  canal_flannel: rancher/coreos-flannel:v0.13.0-rancher1
  canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.17.2
  weave_node: weaveworks/weave-kube:2.8.1
  weave_cni: weaveworks/weave-npc:2.8.1
  pod_infra_container: rancher/mirrored-pause:3.2
  ingress: rancher/nginx-ingress-controller:nginx-0.43.0-rancher3
  ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/mirrored-metrics-server:v0.4.1
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
  aci_cni_deploy_container: noiro/cnideploy:5.1.1.0.1ae238a
  aci_host_container: noiro/aci-containers-host:5.1.1.0.1ae238a
  aci_opflex_container: noiro/opflex:5.1.1.0.1ae238a
  aci_mcast_container: noiro/opflex:5.1.1.0.1ae238a
  aci_ovs_container: noiro/openvswitch:5.1.1.0.1ae238a
  aci_controller_container: noiro/aci-containers-controller:5.1.1.0.1ae238a
  aci_gbp_server_container: noiro/gbp-server:5.1.1.0.1ae238a
  aci_opflex_server_container: noiro/opflex-server:5.1.1.0.1ae238a
ssh_key_path: /home/user/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null

Today, when I tried to bring up the cluster again with rke up, I got the following output:

[radium@Rancher-VM ~]$ rke up -config ./cluster.yml
INFO[0000] Running RKE version: v1.2.8
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.0.1.16]
INFO[0000] [dialer] Setup tunnel for host [10.0.1.14]
INFO[0000] [dialer] Setup tunnel for host [10.0.1.15]
WARN[0000] Failed to set up SSH tunneling for host [10.0.1.14]: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": Unable to access node with address [10.0.1.14:22] using SSH. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
WARN[0000] Removing host [10.0.1.14] from node lists
INFO[0000] Checking if container [cluster-state-deployer] is running on host [10.0.1.16], try #1
INFO[0000] Checking if container [cluster-state-deployer] is running on host [10.0.1.15], try #1
INFO[0000] [certificates] Generating CA kubernetes certificates
INFO[0000] [certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating Kubernetes API server certificates
INFO[0001] [certificates] Generating Service account token key
INFO[0001] [certificates] Generating Kube Controller certificates
INFO[0001] [certificates] Generating Kube Scheduler certificates
INFO[0001] [certificates] Generating Kube Proxy certificates
INFO[0001] [certificates] Generating Node certificate
INFO[0001] [certificates] Generating admin certificates and kubeconfig
INFO[0002] [certificates] Generating Kubernetes API server proxy client certificates
INFO[0002] [certificates] Generating kube-etcd-10-0-1-15 certificate and key
INFO[0002] [certificates] Generating kube-etcd-10-0-1-16 certificate and key
INFO[0002] Successfully Deployed state file at [./cluster.rkestate]
INFO[0002] Building Kubernetes cluster
INFO[0002] [dialer] Setup tunnel for host [10.0.1.16]
INFO[0002] [dialer] Setup tunnel for host [10.0.1.15]
INFO[0002] [network] Deploying port listener containers
INFO[0002] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0002] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0003] Starting container [rke-etcd-port-listener] on host [10.0.1.15], try #1
INFO[0003] Starting container [rke-etcd-port-listener] on host [10.0.1.16], try #1
INFO[0003] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0003] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0004] Starting container [rke-cp-port-listener] on host [10.0.1.16], try #1
INFO[0004] Starting container [rke-cp-port-listener] on host [10.0.1.15], try #1
INFO[0004] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0004] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0005] Starting container [rke-worker-port-listener] on host [10.0.1.16], try #1
INFO[0005] Starting container [rke-worker-port-listener] on host [10.0.1.15], try #1
INFO[0005] [network] Port listener containers deployed successfully
INFO[0005] [network] Running etcd <-> etcd port checks
INFO[0005] [network] Checking if host [10.0.1.15] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [2379 2380], try #1
INFO[0005] [network] Checking if host [10.0.1.16] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [2379 2380], try #1
INFO[0005] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0005] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0005] Starting container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0005] Starting container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0006] [network] Successfully started [rke-port-checker] container on host [10.0.1.16]
INFO[0006] [network] Successfully started [rke-port-checker] container on host [10.0.1.15]
INFO[0006] Removing container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0006] Removing container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0006] [network] Running control plane → etcd port checks
INFO[0006] [network] Checking if host [10.0.1.15] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [2379], try #1
INFO[0006] [network] Checking if host [10.0.1.16] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [2379], try #1
INFO[0006] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0006] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0006] Starting container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0006] Starting container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0007] [network] Successfully started [rke-port-checker] container on host [10.0.1.15]
INFO[0007] Removing container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0007] [network] Successfully started [rke-port-checker] container on host [10.0.1.16]
INFO[0007] Removing container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0007] [network] Running control plane → worker port checks
INFO[0007] [network] Checking if host [10.0.1.15] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [10250], try #1
INFO[0007] [network] Checking if host [10.0.1.16] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [10250], try #1
INFO[0007] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0007] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0007] Starting container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0007] Starting container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0008] [network] Successfully started [rke-port-checker] container on host [10.0.1.16]
INFO[0008] [network] Successfully started [rke-port-checker] container on host [10.0.1.15]
INFO[0008] Removing container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0008] Removing container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0008] [network] Running workers → control plane port checks
INFO[0008] [network] Checking if host [10.0.1.15] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [6443], try #1
INFO[0008] [network] Checking if host [10.0.1.16] can connect to host(s) [10.0.1.15 10.0.1.16] on port(s) [6443], try #1
INFO[0008] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0008] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0008] Starting container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0008] Starting container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0009] [network] Successfully started [rke-port-checker] container on host [10.0.1.15]
INFO[0009] Removing container [rke-port-checker] on host [10.0.1.15], try #1
INFO[0009] [network] Successfully started [rke-port-checker] container on host [10.0.1.16]
INFO[0009] Removing container [rke-port-checker] on host [10.0.1.16], try #1
INFO[0009] [network] Checking KubeAPI port Control Plane hosts
INFO[0009] [network] Removing port listener containers
INFO[0009] Removing container [rke-etcd-port-listener] on host [10.0.1.16], try #1
INFO[0009] Removing container [rke-etcd-port-listener] on host [10.0.1.15], try #1
INFO[0009] [remove/rke-etcd-port-listener] Successfully removed container on host [10.0.1.16]
INFO[0009] [remove/rke-etcd-port-listener] Successfully removed container on host [10.0.1.15]
INFO[0009] Removing container [rke-cp-port-listener] on host [10.0.1.16], try #1
INFO[0009] Removing container [rke-cp-port-listener] on host [10.0.1.15], try #1
INFO[0009] [remove/rke-cp-port-listener] Successfully removed container on host [10.0.1.15]
INFO[0009] [remove/rke-cp-port-listener] Successfully removed container on host [10.0.1.16]
INFO[0009] Removing container [rke-worker-port-listener] on host [10.0.1.16], try #1
INFO[0009] Removing container [rke-worker-port-listener] on host [10.0.1.15], try #1
INFO[0009] [remove/rke-worker-port-listener] Successfully removed container on host [10.0.1.16]
INFO[0009] [remove/rke-worker-port-listener] Successfully removed container on host [10.0.1.15]
INFO[0009] [network] Port listener containers removed successfully
INFO[0009] [certificates] Deploying kubernetes certificates to Cluster nodes
INFO[0009] Checking if container [cert-deployer] is running on host [10.0.1.16], try #1
INFO[0009] Checking if container [cert-deployer] is running on host [10.0.1.15], try #1
INFO[0009] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0009] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0010] Starting container [cert-deployer] on host [10.0.1.16], try #1
INFO[0010] Starting container [cert-deployer] on host [10.0.1.15], try #1
INFO[0010] Checking if container [cert-deployer] is running on host [10.0.1.15], try #1
INFO[0010] Checking if container [cert-deployer] is running on host [10.0.1.16], try #1
INFO[0015] Checking if container [cert-deployer] is running on host [10.0.1.15], try #1
INFO[0015] Removing container [cert-deployer] on host [10.0.1.15], try #1
INFO[0015] Checking if container [cert-deployer] is running on host [10.0.1.16], try #1
INFO[0015] Removing container [cert-deployer] on host [10.0.1.16], try #1
INFO[0015] [reconcile] Rebuilding and updating local kube config
INFO[0015] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml]
WARN[0015] [reconcile] host [10.0.1.15] is a control plane node without reachable Kubernetes API endpoint in the cluster
INFO[0015] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml]
WARN[0015] [reconcile] host [10.0.1.16] is a control plane node without reachable Kubernetes API endpoint in the cluster
WARN[0015] [reconcile] no control plane node with reachable Kubernetes API endpoint in the cluster found
INFO[0015] [certificates] Successfully deployed kubernetes certificates to Cluster nodes
INFO[0015] [file-deploy] Deploying file [/etc/kubernetes/audit-policy.yaml] to node [10.0.1.15]
INFO[0015] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0016] Starting container [file-deployer] on host [10.0.1.15], try #1
INFO[0016] Successfully started [file-deployer] container on host [10.0.1.15]
INFO[0016] Waiting for [file-deployer] container to exit on host [10.0.1.15]
INFO[0016] Waiting for [file-deployer] container to exit on host [10.0.1.15]
INFO[0017] Container [file-deployer] is still running on host [10.0.1.15]: stderr: [], stdout: []
INFO[0018] Waiting for [file-deployer] container to exit on host [10.0.1.15]
INFO[0018] Removing container [file-deployer] on host [10.0.1.15], try #1
INFO[0018] [remove/file-deployer] Successfully removed container on host [10.0.1.15]
INFO[0018] [file-deploy] Deploying file [/etc/kubernetes/audit-policy.yaml] to node [10.0.1.16]
INFO[0018] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0018] Starting container [file-deployer] on host [10.0.1.16], try #1
INFO[0019] Successfully started [file-deployer] container on host [10.0.1.16]
INFO[0019] Waiting for [file-deployer] container to exit on host [10.0.1.16]
INFO[0019] Waiting for [file-deployer] container to exit on host [10.0.1.16]
INFO[0019] Container [file-deployer] is still running on host [10.0.1.16]: stderr: [], stdout: []
INFO[0020] Waiting for [file-deployer] container to exit on host [10.0.1.16]
INFO[0020] Removing container [file-deployer] on host [10.0.1.16], try #1
INFO[0020] [remove/file-deployer] Successfully removed container on host [10.0.1.16]
INFO[0020] [/etc/kubernetes/audit-policy.yaml] Successfully deployed audit policy file to Cluster control nodes
INFO[0020] [reconcile] Reconciling cluster state
INFO[0020] [reconcile] This is newly generated cluster
INFO[0020] Pre-pulling kubernetes images
INFO[0020] Image [rancher/hyperkube:v1.20.6-rancher1] exists on host [10.0.1.15]
INFO[0020] Image [rancher/hyperkube:v1.20.6-rancher1] exists on host [10.0.1.16]
INFO[0020] Kubernetes images pulled successfully
INFO[0020] [etcd] Building up etcd plane…
INFO[0020] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0020] Starting container [etcd-fix-perm] on host [10.0.1.15], try #1
INFO[0021] Successfully started [etcd-fix-perm] container on host [10.0.1.15]
INFO[0021] Waiting for [etcd-fix-perm] container to exit on host [10.0.1.15]
INFO[0021] Waiting for [etcd-fix-perm] container to exit on host [10.0.1.15]
INFO[0021] Removing container [etcd-fix-perm] on host [10.0.1.15], try #1
INFO[0021] [remove/etcd-fix-perm] Successfully removed container on host [10.0.1.15]
INFO[0021] [etcd] Running rolling snapshot container [etcd-snapshot-once] on host [10.0.1.15]
INFO[0021] Removing container [etcd-rolling-snapshots] on host [10.0.1.15], try #1
INFO[0021] [remove/etcd-rolling-snapshots] Successfully removed container on host [10.0.1.15]
INFO[0021] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0022] Starting container [etcd-rolling-snapshots] on host [10.0.1.15], try #1
INFO[0022] [etcd] Successfully started [etcd-rolling-snapshots] container on host [10.0.1.15]
INFO[0027] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0028] Starting container [rke-bundle-cert] on host [10.0.1.15], try #1
INFO[0028] [certificates] Successfully started [rke-bundle-cert] container on host [10.0.1.15]
INFO[0028] Waiting for [rke-bundle-cert] container to exit on host [10.0.1.15]
INFO[0029] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [10.0.1.15]
INFO[0029] Removing container [rke-bundle-cert] on host [10.0.1.15], try #1
INFO[0029] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0029] Starting container [rke-log-linker] on host [10.0.1.15], try #1
INFO[0030] [etcd] Successfully started [rke-log-linker] container on host [10.0.1.15]
INFO[0030] Removing container [rke-log-linker] on host [10.0.1.15], try #1
INFO[0030] [remove/rke-log-linker] Successfully removed container on host [10.0.1.15]
INFO[0030] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.15]
INFO[0030] Starting container [rke-log-linker] on host [10.0.1.15], try #1
INFO[0031] [etcd] Successfully started [rke-log-linker] container on host [10.0.1.15]
INFO[0031] Removing container [rke-log-linker] on host [10.0.1.15], try #1
INFO[0031] [remove/rke-log-linker] Successfully removed container on host [10.0.1.15]
INFO[0031] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0031] Starting container [etcd-fix-perm] on host [10.0.1.16], try #1
INFO[0032] Successfully started [etcd-fix-perm] container on host [10.0.1.16]
INFO[0032] Waiting for [etcd-fix-perm] container to exit on host [10.0.1.16]
INFO[0032] Waiting for [etcd-fix-perm] container to exit on host [10.0.1.16]
INFO[0032] Removing container [etcd-fix-perm] on host [10.0.1.16], try #1
INFO[0032] [remove/etcd-fix-perm] Successfully removed container on host [10.0.1.16]
INFO[0032] [etcd] Running rolling snapshot container [etcd-snapshot-once] on host [10.0.1.16]
INFO[0032] Removing container [etcd-rolling-snapshots] on host [10.0.1.16], try #1
INFO[0032] [remove/etcd-rolling-snapshots] Successfully removed container on host [10.0.1.16]
INFO[0032] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0033] Starting container [etcd-rolling-snapshots] on host [10.0.1.16], try #1
INFO[0033] [etcd] Successfully started [etcd-rolling-snapshots] container on host [10.0.1.16]
INFO[0038] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0039] Starting container [rke-bundle-cert] on host [10.0.1.16], try #1
INFO[0039] [certificates] Successfully started [rke-bundle-cert] container on host [10.0.1.16]
INFO[0039] Waiting for [rke-bundle-cert] container to exit on host [10.0.1.16]
INFO[0039] Container [rke-bundle-cert] is still running on host [10.0.1.16]: stderr: [], stdout: []
INFO[0040] Waiting for [rke-bundle-cert] container to exit on host [10.0.1.16]
INFO[0040] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [10.0.1.16]
INFO[0040] Removing container [rke-bundle-cert] on host [10.0.1.16], try #1
INFO[0040] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0041] Starting container [rke-log-linker] on host [10.0.1.16], try #1
INFO[0041] [etcd] Successfully started [rke-log-linker] container on host [10.0.1.16]
INFO[0041] Removing container [rke-log-linker] on host [10.0.1.16], try #1
INFO[0041] [remove/rke-log-linker] Successfully removed container on host [10.0.1.16]
INFO[0042] Image [rancher/rke-tools:v0.1.74] exists on host [10.0.1.16]
INFO[0042] Starting container [rke-log-linker] on host [10.0.1.16], try #1
INFO[0042] [etcd] Successfully started [rke-log-linker] container on host [10.0.1.16]
INFO[0042] Removing container [rke-log-linker] on host [10.0.1.16], try #1
INFO[0043] [remove/rke-log-linker] Successfully removed container on host [10.0.1.16]
INFO[0043] [etcd] Successfully started etcd plane… Checking etcd cluster health
WARN[0136] [etcd] host [10.0.1.15] failed to check etcd health: failed to get /health for host [10.0.1.15]: Get "https://10.0.1.15:2379/health": remote error: tls: bad certificate
WARN[0228] [etcd] host [10.0.1.16] failed to check etcd health: failed to get /health for host [10.0.1.16]: Get "https://10.0.1.16:2379/health": remote error: tls: bad certificate
FATA[0228] [etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [10.0.1.15,10.0.1.16] failed to report healthy. Check etcd container logs on each host for more information

Please suggest.

Thanks
Ankit

This is a different error at the end now, so there is more going on here.

The SSH error can be debugged by checking the SSH daemon logs on the OS: compare the logs from a manual login against the logs from an rke up run to see why the SSH daemon is not accepting the connection.
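
For example, something along these lines (assuming RHEL's default setup where sshd logs go to the journal):

# On 10.0.1.14, watch the SSH daemon logs while testing
sudo journalctl -u sshd -f

# From the host running rke: first a manual login with the configured key...
ssh -i /home/user/.ssh/id_rsa user@10.0.1.14

# ...then run rke up and compare what sshd logs for the two attempts
rke up --config ./cluster.yml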

The error about the etcd cluster being unhealthy can have a number of causes. Running rke up without the correct cluster.rkestate can do this (especially since it reports a bad certificate), but running docker logs etcd on the affected nodes will reveal more information. This log line also suggests you probably did not have the cluster.rkestate present when you ran rke up:

INFO[0020] [reconcile] This is newly generated cluster
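
To dig further, something along these lines (node IPs taken from your output; etcd is the container name RKE uses):

# On 10.0.1.15 and 10.0.1.16: look for TLS/peer errors in the etcd container logs
docker logs --tail 100 etcd

# On the host running rke: make sure the state file from the original provisioning
# run sits next to cluster.yml before re-running rke up
ls -l cluster.yml cluster.rkestate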

Hi SuperSeb,

I was able to identify the issue :slight_smile:
The issue was with passwordless SSH to the master node itself; the other two nodes could already be reached without a password.
After copying the public SSH key into the user's authorized_keys on the master node, I was able to bring the cluster up successfully (see the sketch below).
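
Roughly what the fix looked like, assuming the same user and key path as in my config:

# On the master node (10.0.1.14), append the public key to the user's authorized_keys
cat /home/user/.ssh/id_rsa.pub >> /home/user/.ssh/authorized_keys
chmod 600 /home/user/.ssh/authorized_keys

# Verify key-only SSH from the host running rke before re-running rke up
ssh -i /home/user/.ssh/id_rsa -o BatchMode=yes user@10.0.1.14 true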

However, if possible, I would request that this be added to the documentation.

Thanks for the Support
Ankit
