Agent certificate chain error with custom CA & external TLS termination

I have an HA setup on K3s with an AWS ALB doing external TLS termination, using a certificate issued by our corporate CA. The Rancher pods are up and healthy, and I can log into Rancher, but the cattle-cluster-agent and cattle-node-agent pods are stuck in a crash loop with the following error:

level=fatal msg="Certificate chain is not complete, please check if all needed intermediate certificates are included in the server certificate (in the correct order) and if the cacerts setting in Rancher either contains the correct CA certificate (in the case of using self signed certificates) or is empty (in the case of using a certificate signed by a recognized CA). Certificate information is displayed above. error: Get https://rancher.example.com: x509: certificate signed by unknown authority"

Before that error, the agent connects to the load-balanced URL and lists out the server, intermediate, and root CA certs, so I know the load balancer is serving the complete chain (also verified with openssl s_client).
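For reference, this is roughly the check I ran against the load balancer (hostname is the example one from above; -showcerts prints every certificate the ALB presents, which should be three here: server, issuing CA, root):

# dump the full chain presented by the load balancer
openssl s_client -connect rancher.example.com:443 \
  -servername rancher.example.com -showcerts </dev/null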

I set tls=external and privateCA=true on install per the docs; here's my install command:

helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set tls=external \
  --set additionalTrustedCAs=true \
  --set privateCA=true

Then I added the tls-ca secret with our root CA. I confirmed it's loaded into Rancher's cacerts setting, but the agents still throw the error, even after deleting the agent pods so they get recreated.

sudo k3s kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem

The cacerts.pem file contains the root CA for the rancher.example.com cert on the LB. I also tried intermediate + root, and even the full chain (server + intermediate + root).
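For what it's worth, this is how I confirmed what each attempt actually put into the file (it prints the subject/issuer of every cert bundled in cacerts.pem):

# list subject and issuer for each certificate in the bundle
openssl crl2pkcs7 -nocrl -certfile cacerts.pem | openssl pkcs7 -print_certs -noout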

It appears the agent pods are simply not honoring the cacerts setting as documented. Any assistance is appreciated.

If the CA certificate is correctly configured and visible at /v3/settings/cacerts, and correctly retrieved by the agent, the issue is almost certainly in the served certificate chain. The certificate output in the agent log was added to make exactly this easier to debug, so I would be very interested to see it (if it's sensitive info, you can share it with me on https://slack.rancher.io).

Most important is probably the order of the certificates (given the chain is complete, i.e. the server certificate plus all intermediates). Checking with openssl is only conclusive if the system you run it on doesn't already have the (intermediate) certificates in its OS trust store, because openssl will fall back to that store to verify.
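A quick way to take the OS store out of the equation, assuming OpenSSL 1.1.0 or newer for the -no-CAfile/-no-CApath flags (corp-root-ca.pem stands in for whatever file holds only your root CA):

# verify against ONLY the supplied root CA, ignoring the default trust store
openssl s_client -connect rancher.example.com:443 -servername rancher.example.com \
  -CAfile corp-root-ca.pem -no-CAfile -no-CApath </dev/null 2>/dev/null | grep 'Verify return code'

Anything other than "Verify return code: 0 (ok)" points at the served chain (or its order, or the CA file) as the problem.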

It could be a bug, but I'll need more info to determine that.

To be clear, what exactly is expected to end up in /v3/settings/cacerts via the tls-ca secret: just the root CA, or intermediate + root (I've tried both)? And what format should the file passed to --from-file be in; PEM (Base64-encoded), including the -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- lines? In /v3/settings/cacerts it shows up like this:

"value": "-----BEGIN CERTIFICATE-----\n<cert contents with \n for newlines>\n-----END CERTIFICATE-----"

The newlines do render properly in the GUI Advanced Settings view.
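For reference, this is how I'm reading the setting outside the GUI (the API key here is just a placeholder for one of ours):

# read the cacerts setting via the Rancher API
curl -sk -u "token-xxxxx:<secret>" https://rancher.example.com/v3/settings/cacerts | jq -r '.value'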

Here's a redacted version of the agent pod log. The order looks right: server, then intermediate ("Corp Issuing CA"), then root ("Corp Root CA"). I don't see anything indicating that the agent is querying /v3/settings/cacerts to compare/validate, though.

INFO: https://rancher.example.com/ping is accessible
INFO: rancher.example.com resolves to 192.168.0.10 192.168.1.10
time="2020-07-15T17:45:02Z" level=info msg="Rancher agent version v2.4.5 is starting"
time="2020-07-15T17:45:02Z" level=info msg="Option requestedHostname=ip-192-168-3-4.us-east-1.compute.internal"
time="2020-07-15T17:45:02Z" level=info msg="Option customConfig=map[address:192.1686.3.4 internalAddress: label:map[] roles:[] taints:[]]"
time="2020-07-15T17:45:02Z" level=info msg="Option etcd=false"
time="2020-07-15T17:45:02Z" level=info msg="Option controlPlane=false"
time="2020-07-15T17:45:02Z" level=info msg="Option worker=false"
time="2020-07-15T17:45:02Z" level=info msg="Listening on /tmp/log.sock"
time="2020-07-15T17:45:02Z" level=info msg="Certificate details from https://rancher.example.com"
time="2020-07-15T17:45:02Z" level=info msg="Certificate #0 (https://rancher.example.com)"
time="2020-07-15T17:45:02Z" level=info msg="Subject: CN=rancher.example.com,O=Corp,L=City,ST=State,C=US"
time="2020-07-15T17:45:02Z" level=info msg="Issuer: CN=Corp Issuing CA"
time="2020-07-15T17:45:02Z" level=info msg="IsCA: false"
time="2020-07-15T17:45:02Z" level=info msg="DNS Names: [rancher.example.com]"
time="2020-07-15T17:45:02Z" level=info msg="IPAddresses: <none>"
time="2020-07-15T17:45:02Z" level=info msg="NotBefore: 2020-07-13 18:36:19 +0000 UTC"
time="2020-07-15T17:45:02Z" level=info msg="NotAfter: 2023-07-13 18:36:19 +0000 UTC"
time="2020-07-15T17:45:02Z" level=info msg="SignatureAlgorithm: SHA256-RSA"
time="2020-07-15T17:45:02Z" level=info msg="PublicKeyAlgorithm: RSA"
time="2020-07-15T17:45:02Z" level=info msg="Certificate #1 (https://rancher.example.com)"
time="2020-07-15T17:45:02Z" level=info msg="Subject: CN=Corp Issuing CA"
time="2020-07-15T17:45:02Z" level=info msg="Issuer: CN=Corp Root CA,O=Corp,L=City,ST=State,C=US"
time="2020-07-15T17:45:02Z" level=info msg="IsCA: true"
time="2020-07-15T17:45:02Z" level=info msg="DNS Names: <none>"
time="2020-07-15T17:45:02Z" level=info msg="IPAddresses: <none>"
time="2020-07-15T17:45:02Z" level=info msg="NotBefore: 2017-10-25 18:36:27 +0000 UTC"
time="2020-07-15T17:45:02Z" level=info msg="NotAfter: 2027-10-25 18:46:27 +0000 UTC"
time="2020-07-15T17:45:02Z" level=info msg="SignatureAlgorithm: SHA256-RSA"
time="2020-07-15T17:45:02Z" level=info msg="PublicKeyAlgorithm: RSA"
time="2020-07-15T17:45:02Z" level=info msg="Certificate #2 (https://rancher.example.com)"
time="2020-07-15T17:45:02Z" level=info msg="Subject: CN=Corp Root CA,O=Corp,L=City,ST=State,C=US"
time="2020-07-15T17:45:02Z" level=info msg="Issuer: CN=Corp Root CA,O=Corp,L=City,ST=State,C=US"
time="2020-07-15T17:45:02Z" level=info msg="IsCA: true"
time="2020-07-15T17:45:02Z" level=info msg="DNS Names: <none>"
time="2020-07-15T17:45:02Z" level=info msg="IPAddresses: <none>"
time="2020-07-15T17:45:02Z" level=info msg="NotBefore: 2013-11-11 17:05:26 +0000 UTC"
time="2020-07-15T17:45:02Z" level=info msg="NotAfter: 2037-05-10 15:38:31 +0000 UTC"
time="2020-07-15T17:45:02Z" level=info msg="SignatureAlgorithm: SHA384-RSA"
time="2020-07-15T17:45:02Z" level=info msg="PublicKeyAlgorithm: RSA"
time="2020-07-15T17:45:02Z" level=fatal msg="Certificate chain is not complete, please check if all needed intermediate certificates are included in the server certificate (in the correct order) and if the cacerts setting in Rancher either contains the correct CA certificate (in the case of using self signed certificates) or is empty (in the case of using a certificate signed by a recognized CA). Certificate information is displayed above. error: Get https://rancher.example.com: x509: certificate signed by unknown authority"

OK, thanks for the logging. It seems the agent is not being instructed to download the CA certificate, which is odd, because the environment variable responsible for that (CATTLE_CA_CHECKSUM) is added based on the presence of a certificate in /v3/settings/cacerts. Is this reproducible on a setup built from scratch with new nodes? Or was the cluster created while the certificates were not yet set up properly, and then redeployed? The agent definition won't reload that value dynamically, so can you create a new cluster and add one clean node to it to see if that solves the issue?

Looking at the YAML spec of the agent pods, CATTLE_CA_CHECKSUM indeed has no value.
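In case it matters, this is all I did to check (the resource names are just what I see in cattle-system on this cluster):

# show the CATTLE_CA_CHECKSUM env entry (and its value, if any) on both agents
sudo k3s kubectl -n cattle-system get deployment cattle-cluster-agent -o yaml | grep -A1 CATTLE_CA_CHECKSUM
sudo k3s kubectl -n cattle-system get daemonset cattle-node-agent -o yaml | grep -A1 CATTLE_CA_CHECKSUM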

I did miss the --set privateCA=true option when I first installed Rancher, and re-ran the install over the existing deployment to add it. I'll tear everything down and do a fresh build tomorrow with privateCA set from the beginning, and report back my results. Thanks.

I started from scratch. This time I made sure not to forget privateCA=true in the initial install, and I also created the tls-ca secret before running the install (with just the root CA cert). Everything worked: the cattle-node-agent and cattle-cluster-agent pods came up successfully on the first try, and their spec shows that CATTLE_CA_CHECKSUM is populated.
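For anyone else who lands here, the order that worked for me was roughly this (root-ca.pem is just my local copy of the corporate root CA; I've trimmed the command to the relevant flags):

# secret first, then install with privateCA=true from the start
sudo k3s kubectl create namespace cattle-system
sudo k3s kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem=./root-ca.pem
helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set tls=external \
  --set privateCA=true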

You were right; it seems the problem I hit came from reconfiguring to add privateCA=true and populating /v3/settings/cacerts after the initial deployment. That does raise the question: if we needed to replace the CA cert down the road, for example to switch from a private to a public certificate, would we be in the same situation?

Thanks for the help.

It can be corrected manually, but I was trying to root-cause the issue first. We have https://github.com/rancher/rancher/issues/14731 open to track this.
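If someone needs an interim workaround before that issue is resolved, it would look something like the following; this is an untested sketch that assumes the checksum is the plain SHA-256 of the cacerts value and that the setting is readable without auth (add an API key otherwise):

# compute the checksum the agents expect, then set it on the existing agent workloads
CHECKSUM=$(curl -sk https://rancher.example.com/v3/settings/cacerts | jq -r '.value' | sha256sum | awk '{print $1}')
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_CA_CHECKSUM="$CHECKSUM"
kubectl -n cattle-system set env daemonset/cattle-node-agent CATTLE_CA_CHECKSUM="$CHECKSUM"

Trailing-newline handling can change the checksum, so compare the result against what a working install populates before relying on it.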

Great, thanks again.