Peer connection errors in cattle pod in new install


#1

I’ve done a fresh install of Rancher 2.1.0 following the instructions for SSL certs terminated at an ALB: https://rancher.com/docs/rancher/v2.x/en/installation/ha/rke-add-on/layer-7-lb/alb/ and https://rancher.com/docs/rancher/v2.x/en/installation/ha/helm-rancher/chart-options/#external-tls-termination

Cluster appears up and running. Web UI is working and I can manage the cluster, setup Github Auth, etc.

However, when I check logs of the cattle pod, I see nothing but errors like below:

2018/10/09 21:26:04 [ERROR] Failed to connect to peer wss://10.42.1.3/v3/connect [local ID=10.42.2.2]: websocket: bad handshake
2018/10/09 21:26:07 [ERROR] Failed to connect to peer wss://10.42.2.3/v3/connect [local ID=10.42.2.2]: websocket: bad handshake
2018/10/09 21:26:07 [ERROR] Failed to connect to peer wss://10.42.0.7/v3/connect [local ID=10.42.2.2]: websocket: bad handshake

The rancher pod logs are filled with messages like:

2018/10/09 21:31:24 [INFO] 2018/10/09 21:31:24 http: multiple response.WriteHeader calls

Any help would be much appreciated. I’m at a loss as to what could be causing these errors.

Thanks,

Alex


#2

It’s odd though because this is not preventing the cluster from operating.

I was just able to spin up an EKS cluster from the Rancher UI, for instance.


#3

So it turns out that the docs are a bit out of date: RKE add-on installs are being deprecated and shouldn’t be used.

I was able to get a fully working install using just a basic RKE rancher.yaml file:


    cluster_name: rancher
    ignore_docker_version: true
    cloud_provider:
      name: aws
    nodes:
      - address:
        user: ubuntu
        role: [controlplane,etcd,worker]
        ssh_key_path: pem
      - address:
        user: ubuntu
        role: [controlplane,etcd,worker]
        ssh_key_path: pem
      - address:
        user: ubuntu
        role: [controlplane,etcd,worker]
        ssh_key_path: pem
    services:
      etcd:
        snapshot: true
        creation: 6h
        retention: 24h
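For reference, assuming the file above is saved as rancher.yaml and the rke binary is installed, the cluster is brought up with the standard RKE workflow:

    rke up --config rancher.yaml

This writes out a kubeconfig alongside the config file, which you can then point kubectl (and the Rancher Helm install) at.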

Also, the AWS ALB needs to be set up with two target groups: one for HTTP 80 forwarding to port 80, and one for HTTPS 443 forwarding to port 443. This worked with the TLS cert attached on the HTTPS side.
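The ALB layout described above looks roughly like this (a sketch; the target-group names are illustrative, and in ALBs the certificate is actually attached to the HTTPS listener):

    Listener HTTP  :80  -> target group rancher-http  (protocol HTTP,  port 80,  targets = the three RKE nodes)
    Listener HTTPS :443 -> target group rancher-https (protocol HTTPS, port 443, targets = the three RKE nodes)
                           ACM wildcard certificate attached to the HTTPS listener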


#4

The docs suggest using a Network Load Balancer (or an alternative Layer 4 load balancer) and terminating TLS at the ingress controller that gets created?


#5

You can see from the links above that there are also docs for using an AWS ALB. And putting SSL certs on the ingress controller doesn’t make much sense when you already have wildcard certs set up in AWS for the domain. It’s just a lot nicer to have the AWS load balancer handle the certs.

Thankfully, I got it working and everything is great now.


#6

Yes, I see that an ALB can be configured, although even those docs recommend a Layer 4 load balancer. I’m not 100% sure why that is, but perhaps because of the extremely low overhead (and thus higher scalability), and because it keeps traffic encrypted further upstream. One thing I dislike about AWS NLBs is that they do not support security groups, so it’s harder to write rules constraining the origin. With an ALB you can reference the ALB’s security group in the instances’ rules, so traffic can only arrive via the ALB and with a valid cert.
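The security-group chaining described above can be sketched like this (group IDs and ports are illustrative):

    sg-alb   (attached to the ALB):    inbound 443 from 0.0.0.0/0
    sg-nodes (attached to the nodes):  inbound 80,443 with source = sg-alb

Because the node rule references the ALB’s security group rather than a CIDR, the nodes only accept traffic that has passed through the ALB. NLBs don’t support security groups, so there is no group to reference in the node rules.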

We too use wildcard certs for hosted zones and, at least up until now, have terminated SSL at the ALB.

TBH I’m a bit unsure which way to go with this now. Time to have a chat with my Rancher tech support guy.


#7

Very interested to know what you find out, if you don’t mind. That’s exactly why we gave up trying to use the suggested NLBs. There doesn’t seem to be a way to get the equivalent of security-group functionality with NLBs (an SG on the LB, an SG on the node hosts, allowing traffic from one to the other). If your node hosts are in private subnets, you’re pretty much forced to accept everything from the VPC’s entire CIDR block.