Peer connection errors in cattle pod in new install

I’ve done a fresh install of Rancher 2.1.0 following the instructions for SSL certs terminated at an ALB: https://rancher.com/docs/rancher/v2.x/en/installation/ha/rke-add-on/layer-7-lb/alb/ and https://rancher.com/docs/rancher/v2.x/en/installation/ha/helm-rancher/chart-options/#external-tls-termination

The cluster appears to be up and running. The web UI is working and I can manage the cluster, set up GitHub auth, etc.

However, when I check the logs of the cattle pod, I see nothing but errors like the ones below:

2018/10/09 21:26:04 [ERROR] Failed to connect to peer wss://10.42.1.3/v3/connect [local ID=10.42.2.2]: websocket: bad handshake
2018/10/09 21:26:07 [ERROR] Failed to connect to peer wss://10.42.2.3/v3/connect [local ID=10.42.2.2]: websocket: bad handshake
2018/10/09 21:26:07 [ERROR] Failed to connect to peer wss://10.42.0.7/v3/connect [local ID=10.42.2.2]: websocket: bad handshake

The rancher pod logs are filled with messages like:

2018/10/09 21:31:24 [INFO] 2018/10/09 21:31:24 http: multiple response.WriteHeader calls

Any help would be much appreciated. I’m at a loss as to what could be causing these errors.

Thanks,

Alex

It’s odd though because this is not preventing the cluster from operating.

I was just able to spin up an EKS cluster from the Rancher UI, for instance.

So it turns out that the docs are a bit out of date: RKE add-on installs are being deprecated and shouldn’t be used.

I was able to get a fully working install using just a basic RKE rancher.yaml file:


cluster_name: rancher
ignore_docker_version: true
cloud_provider:
  name: aws
nodes:
  - address:          # node 1 IP/hostname (omitted in the original post)
    user: ubuntu
    role: [controlplane,etcd,worker]
    ssh_key_path: pem
  - address:          # node 2 IP/hostname (omitted in the original post)
    user: ubuntu
    role: [controlplane,etcd,worker]
    ssh_key_path: pem
  - address:          # node 3 IP/hostname (omitted in the original post)
    user: ubuntu
    role: [controlplane,etcd,worker]
    ssh_key_path: pem
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h

Also, the AWS ALB needs to be set up with two target groups: one for HTTP 80 forwarding to instance port 80, and one for HTTPS 443 forwarding to instance port 443. This worked with the TLS cert set on the HTTPS target group, as sketched below.
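In case it helps anyone wiring this up by hand, here is a rough boto3 sketch of the two target groups and listeners described above. It’s only a sketch under my assumptions: the VPC ID, ALB ARN, and ACM certificate ARN are placeholders, not values from either of our setups.

# Two target groups (HTTP 80 -> 80, HTTPS 443 -> 443) and matching listeners
# on an existing ALB. All IDs/ARNs are placeholders for illustration only.
import boto3

elbv2 = boto3.client("elbv2")

VPC_ID = "vpc-xxxxxxxx"                                                    # placeholder VPC
ALB_ARN = "arn:aws:elasticloadbalancing:...:loadbalancer/app/rancher/..."  # placeholder ALB
CERT_ARN = "arn:aws:acm:...:certificate/..."                               # placeholder wildcard cert

# Target group for HTTP 80 -> instance port 80
tg_http = elbv2.create_target_group(
    Name="rancher-http", Protocol="HTTP", Port=80,
    VpcId=VPC_ID, TargetType="instance",
)["TargetGroups"][0]["TargetGroupArn"]

# Target group for HTTPS 443 -> instance port 443
tg_https = elbv2.create_target_group(
    Name="rancher-https", Protocol="HTTPS", Port=443,
    VpcId=VPC_ID, TargetType="instance",
)["TargetGroups"][0]["TargetGroupArn"]

# Plain HTTP listener forwarding to the port-80 target group
elbv2.create_listener(
    LoadBalancerArn=ALB_ARN, Protocol="HTTP", Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_http}],
)

# HTTPS listener carrying the ACM cert, forwarding to the port-443 target group
elbv2.create_listener(
    LoadBalancerArn=ALB_ARN, Protocol="HTTPS", Port=443,
    Certificates=[{"CertificateArn": CERT_ARN}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_https}],
)

The nodes then get registered into both target groups (register_targets); strictly speaking the cert attaches to the HTTPS listener rather than the target group, which is what the sketch does.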

The docs suggest using a Network Load Balancer (or another layer 4 load balancer) with TLS termination at the ingress controller that gets created, right?

You can see from the links above that there are also docs for using an AWS ALB. And using SSL certs on the ingress controller doesn’t make much sense when you already have wildcard certs set up in AWS for a domain. It’s just a lot nicer to have the AWS load balancer handle the certs.

Thankfully, I got it working and everything is great now.

Yes, I see that an ALB can be configured, although even those docs recommend using a layer 4 load balancer. I’m not 100% sure why that is, but perhaps because of the extremely low overhead (and thus higher scalability) and because it keeps traffic encrypted further upstream. One thing I dislike about AWS NLBs is that they do not implement security groups, so it’s harder to write rules to constrain the origin, whereas with an ALB you can reference the ALB’s security group from the instances’ security group, so traffic can only arrive via the ALB and with a valid cert.
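To make that concrete, the ALB-to-instance pairing I mean looks roughly like this with boto3. The security group IDs are placeholders, not real ones from our account:

# Allow traffic to the node instances only from the ALB's security group.
# Group IDs are placeholders for illustration only.
import boto3

ec2 = boto3.client("ec2")

ALB_SG = "sg-0aaaaaaaaaaaaaaaa"   # security group attached to the ALB
NODE_SG = "sg-0bbbbbbbbbbbbbbbb"  # security group attached to the Rancher nodes

ec2.authorize_security_group_ingress(
    GroupId=NODE_SG,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
         "UserIdGroupPairs": [{"GroupId": ALB_SG}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "UserIdGroupPairs": [{"GroupId": ALB_SG}]},
    ],
)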

We too use wildcard certs for hosted zones and, at least up until now, have terminated SSL at the ALB.

TBH I’m a bit unsure which way to go with this now. Time to have a chat with my Rancher tech support guy.

Very interested to know what you found out, if you don’t mind sharing. That’s exactly why we gave up trying to use the suggested NLBs. There doesn’t seem to be a way to get the equivalent functionality of security groups with NLBs (an SG on the LB, an SG on the node hosts, and rules allowing traffic from one to the other). If you have your node hosts in private subnets, then you’re pretty much forced to accept everything from the entire CIDR block of the VPC.
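For contrast with the SG-to-SG rule sketched earlier in the thread, the best we found with an NLB was a plain CIDR rule along these lines (again just a sketch with placeholder values):

# With an NLB there is no load balancer security group to reference,
# so the node SG has to allow the whole VPC CIDR. Placeholder values only.
import boto3

ec2 = boto3.client("ec2")

NODE_SG = "sg-0bbbbbbbbbbbbbbbb"  # security group attached to the node hosts
VPC_CIDR = "10.0.0.0/16"          # the VPC's CIDR block

ec2.authorize_security_group_ingress(
    GroupId=NODE_SG,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": VPC_CIDR}],
    }],
)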