RKE on RHEL 8.5 error: failed to set up SSH tunneling for host (SSH correctly configured)

Hello,

I encountered an SSH error with RKE on RHEL 8.5. The same configuration was tested on RHEL 7.6, where it worked smoothly; the problem only occurs on RHEL 8.

RKE version: v1.2.19

Docker version: (docker version,docker info preferred) Docker version 20.10.12, build e91ed57

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Red Hat Enterprise Linux"
VERSION="8.5 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.5 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.5
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.5"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VMware vSphere

cluster.yml file:

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: machineadd
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: docker
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/docker/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    # extra_args: {}
    # Source : https://rancher.com/docs/rke/latest/en/os/#flatcar-container-linux
    extra_args:
      flex-volume-plugin-dir: /opt/kubernetes/kubelet-plugins/volume/exec/
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: calico
  # options: {}
  # Source : https://rancher.com/docs/rke/latest/en/os/#flatcar-container-linux
  options:
    calico_flex_volume_plugin_dir: /opt/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    flannel_backend_type: vxlan
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
# Do not configure system_image because we already configure private_registries.
# Source : https://github.com/rancher/rke/issues/2720#issuecomment-950768397
# system_images:
#   etcd: rancher/mirrored-coreos-etcd:v3.4.15-rancher1
#   alpine: rancher/rke-tools:v0.1.80
#   nginx_proxy: rancher/rke-tools:v0.1.80
#   cert_downloader: rancher/rke-tools:v0.1.80
#   kubernetes_services_sidecar: rancher/rke-tools:v0.1.80
#   kubedns: rancher/mirrored-k8s-dns-kube-dns:1.15.10
#   dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.15.10
#   kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.15.10
#   kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.1
#   coredns: rancher/mirrored-coredns-coredns:1.8.0
#   coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.1
#   nodelocal: rancher/mirrored-k8s-dns-node-cache:1.15.13
#   kubernetes: rancher/hyperkube:v1.20.15-rancher1
#   flannel: rancher/mirrored-coreos-flannel:v0.15.1
#   flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
#   calico_node: rancher/mirrored-calico-node:v3.17.2
#   calico_cni: rancher/mirrored-calico-cni:v3.17.2
#   calico_controllers: rancher/mirrored-calico-kube-controllers:v3.17.2
#   calico_ctl: rancher/mirrored-calico-ctl:v3.17.2
#   calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.17.2
#   canal_node: rancher/mirrored-calico-node:v3.17.2
#   canal_cni: rancher/mirrored-calico-cni:v3.17.2
#   canal_controllers: rancher/mirrored-calico-kube-controllers:v3.17.2
#   canal_flannel: rancher/mirrored-coreos-flannel:v0.15.1
#   canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.17.2
#   weave_node: weaveworks/weave-kube:2.8.1
#   weave_cni: weaveworks/weave-npc:2.8.1
#   pod_infra_container: rancher/mirrored-pause:3.6
#   ingress: rancher/nginx-ingress-controller:nginx-1.1.0-rancher1
#   ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
#   ingress_webhook: rancher/mirrored-ingress-nginx-kube-webhook-certgen:v1.1.1
#   metrics_server: rancher/mirrored-metrics-server:v0.5.0
#   windows_pod_infra_container: rancher/mirrored-pause:3.6
#   aci_cni_deploy_container: noiro/cnideploy:5.1.1.0.1ae238a
#   aci_host_container: noiro/aci-containers-host:5.1.1.0.1ae238a
#   aci_opflex_container: noiro/opflex:5.1.1.0.1ae238a
#   aci_mcast_container: noiro/opflex:5.1.1.0.1ae238a
#   aci_ovs_container: noiro/openvswitch:5.1.1.0.1ae238a
#   aci_controller_container: noiro/aci-containers-controller:5.1.1.0.1ae238a
#   aci_gbp_server_container: noiro/gbp-server:5.1.1.0.1ae238a
#   aci_opflex_server_container: noiro/opflex-server:5.1.1.0.1ae238a
ssh_key_path: /home/docker/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
kubernetes_version: ""
# private_registries: []
# Source : https://rancher.com/docs/rke/latest/en/config-options/private-registries/#default-registry
private_registries:
  - url: "dockerproxy.company"
    is_default: true # All system images will be pulled using this registry. 
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null


Steps to Reproduce:

First, log in as a non-root user that can access the Docker socket (that is, one that can run docker ps successfully).
Here, I tried to create a cluster on the node I was logged in to, treating it as if it were a remote machine.

# login as user 'docker'
ssh docker@machine

# make sure that id_rsa.pub is in authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# At this point, I can do ssh docker@localhost without entering any password

# in the folder where cluster.yml is located
rke up

Results:

[docker@mymachine terraform]$ rke -d up
DEBU[0000] Loglevel set to [debug]
INFO[0000] Running RKE version: v1.2.19
DEBU[0000] audit log policy found in cluster.yml
INFO[0000] Initiating Kubernetes cluster
DEBU[0000] metadataInitialized: [False] []
DEBU[0000] Loading data.json from local source
DEBU[0000] data.json SHA256 checksum: 74664a6ce625a6aeaef8183de2f65f289cd752a80103768c7d2d4359ac423172
DEBU[0000] metadata initialized successfully
DEBU[0000] metadataInitialized: [true] []
DEBU[0000] No DNS provider configured, setting default based on cluster version [1.20.15-rancher1-2]
DEBU[0000] DNS provider set to [coredns]
DEBU[0000] Checking if cluster version [1.20.15-rancher1-2] needs to have kube-api audit log enabled
DEBU[0000] Cluster version [1.20.15-rancher1-2] needs to have kube-api audit log enabled
DEBU[0000] Enabling kube-api audit log for cluster version [v1.20.15-rancher1-2]
DEBU[0000] No input provided for maxUnavailableWorker, setting it to default value of 10 percent
DEBU[0000] No input provided for maxUnavailableControlplane, setting it to default value of 1
DEBU[0000] Host: mymachine.fqdn has role: controlplane
DEBU[0000] Host: mymachine.fqdn has role: worker
DEBU[0000] Host: mymachine.fqdn has role: etcd
DEBU[0000] [state] previous state not found, possible legacy cluster
INFO[0000] [dialer] Setup tunnel for host [mymachine.fqdn]
DEBU[0000] Connecting to Docker API for host [mymachine.fqdn]
DEBU[0000] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0000] Failed to set up SSH tunneling for host [mymachine.fqdn]: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": Unable to access the service on /var/run/docker.sock. The service might be still starting up. Error: ssh: rejected: connect failed (open failed)
WARN[0000] Removing host [mymachine.fqdn] from node lists
WARN[0000] [state] can't fetch legacy cluster state from Kubernetes: Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [mymachine.fqdn]
INFO[0000] [certificates] Generating CA kubernetes certificates
INFO[0000] [certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating Kubernetes API server certificates
INFO[0000] [certificates] Generating Service account token key
INFO[0000] [certificates] Generating Kube Controller certificates
INFO[0000] [certificates] Generating Kube Scheduler certificates
INFO[0001] [certificates] Generating Kube Proxy certificates
INFO[0001] [certificates] Generating Node certificate
INFO[0001] [certificates] Generating admin certificates and kubeconfig
INFO[0001] [certificates] Generating Kubernetes API server proxy client certificates
INFO[0001] Successfully Deployed state file at [./cluster.rkestate]
DEBU[0001] Checking if cluster version [1.20.15-rancher1-2] needs to have kube-api audit log enabled
DEBU[0001] Cluster version [1.20.15-rancher1-2] needs to have kube-api audit log enabled
DEBU[0001] Enabling kube-api audit log for cluster version [v1.20.15-rancher1-2]
INFO[0001] Building Kubernetes cluster
FATA[0001] Cluster must have at least one etcd plane host: please specify one or more etcd in cluster config

However, note that the user ‘docker’ has access to the Docker socket and can connect to the target machine (the same host) as docker with its key pair, so key-based SSH login works.

# docker command OK
[docker@mymachine terraform]$ ll /var/run/docker.sock
srw-rw----. 1 root docker 0 Apr 14 16:44 /var/run/docker.sock
[docker@mymachine terraform]$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


# SSH ok
[docker@mymachine terraform]$ ssh -i ~/.ssh/id_rsa docker@mymachine.fqdn

Last login: Wed Apr 20 17:50:47 2022 from 1.2.3.4
[docker@mymachine ~]$



Could somebody give me some hints? I have no idea which part of my OS configuration causes this. sshd_config seems fine to me, since I can connect via SSH with my public key. It looks as if the rke command does not properly take sshd_config into account.

Thank you in advance for your help.

Regards,
Rahenda

Solved.

On RHEL 8, the following parameters need to be set in sshd_config, after which the SSH service must be restarted (for example with systemctl restart sshd):

AllowTcpForwarding yes
AllowStreamLocalForwarding yes
DisableForwarding no
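
To check that sshd_config really resolves to these three values, the effective configuration can be dumped with sshd -T and filtered. The following is only a minimal sketch: check_forwarding is a hypothetical helper name (not part of OpenSSH or RKE), and it assumes that sshd -T prints one lowercase "keyword value" pair per line.

```shell
# Hypothetical helper: reads `sshd -T` style output on stdin and reports
# whether the three forwarding settings from the fix have the wanted values.
check_forwarding() {
    awk '
        BEGIN {
            want["allowtcpforwarding"] = "yes"
            want["allowstreamlocalforwarding"] = "yes"
            want["disableforwarding"] = "no"
        }
        # Normalise the keyword so mixed-case input is accepted as well.
        { key = tolower($1) }
        (key in want) { seen[key] = tolower($2) }
        END {
            bad = 0
            for (k in want) {
                if (!(k in seen)) {
                    printf "MISSING %s (want %s)\n", k, want[k]
                    bad = 1
                } else if (seen[k] != want[k]) {
                    printf "WRONG %s=%s (want %s)\n", k, seen[k], want[k]
                    bad = 1
                }
            }
            if (!bad) print "OK"
            exit bad
        }
    '
}
```

Typical use would be sudo sshd -T | check_forwarding, which prints OK and exits with status 0 when all three settings match. If the settings live inside a Match block, sshd -T probably needs a -C connection specification (user, host, addr) to evaluate them.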

P.S. The documentation only lists AllowTcpForwarding as required, but in practice the two other parameters have to be set as well.
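
A successful interactive login does not exercise socket forwarding, which is why plain ssh worked while rke failed. The "open failed" message suggests that the server refused to open a forwarded channel to /var/run/docker.sock. The sketch below tries to reproduce that path by hand with OpenSSH's Unix-socket forwarding; it is an approximation of what RKE appears to do, not its actual code. probe_docker_forward is a hypothetical helper name, and the sketch assumes the OpenSSH client and the docker CLI are installed on the machine running it.

```shell
# Hypothetical probe: forward a temporary local Unix socket to the remote
# Docker socket over SSH, then run `docker info` through it.
# Returns 0 if Docker answers through the tunnel, 2 if the SSH session could
# not be set up, and otherwise the exit status of `docker info`.
probe_docker_forward() {
    target=$1                                  # e.g. docker@mymachine.fqdn
    remote_sock=${2:-/var/run/docker.sock}
    tmpdir=$(mktemp -d) || return 2
    ctl=$tmpdir/ctl                            # control socket, used to stop ssh
    local_sock=$tmpdir/docker.sock

    # -f: go to background once connected, -N: no remote command,
    # -M/-S: master mode with a control socket, -L: socket-to-socket forward.
    if ! ssh -f -N -M -S "$ctl" -o ExitOnForwardFailure=yes \
            -L "$local_sock:$remote_sock" "$target"; then
        rm -rf "$tmpdir"
        return 2
    fi

    # If the server refuses stream-local forwarding, this call fails even
    # though the SSH login itself succeeded.
    docker -H "unix://$local_sock" info >/dev/null
    status=$?

    # Shut down the background ssh session and clean up.
    ssh -S "$ctl" -O exit "$target" 2>/dev/null
    rm -rf "$tmpdir"
    return "$status"
}
```

For example, probe_docker_forward docker@mymachine.fqdn would be expected to fail before the sshd_config change and to succeed after it, assuming the forwarding settings were indeed the only obstacle.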