HA Rancher + Harvester: Provision Cluster errors with "certificate signed by unknown authority"

Environment

I have an HA Rancher v2.6.3 cluster that is reverse proxied by Traefik, which handles cert termination and routes to the 3 etcd/control-plane/worker nodes of the Rancher install. This all runs in Proxmox VMs. My cert was issued by Let's Encrypt.

I recently added a bare-metal install of Harvester v1.0.0.
It is currently a single machine, with all traffic handled over the management NIC (1 Gb to the switch).

This Harvester node was successfully integrated into my Rancher cluster, with the intent of spinning up new clusters on demand.

Harvester Setup

I set up a network with VLAN ID 1 and gave it access to my whole 192.168.0.0/16 home network. I have tried this both via DHCP and by setting the route statically in Harvester.

Problem

When provisioning a cluster through Harvester (RKE1/RKE2/K3s), the VMs aren’t created and I get the below error:

failing bootstrap machine(s) k3s-pool1-65b4c9d49c-qv42q: failed creating server (HarvesterMachine) in infrastructure provider: CreateError: Failure detected from referenced resource rke-machine.cattle.io/v1, Kind=HarvesterMachine with name “k3s-pool1-d7010ffa-52wvw”: Downloading driver from {{redacted because too many links in post}}/harvester-node-driver/v0.3.4/docker-machine-driver-harvester-amd64.tar.gz
docker-machine-driver-harvester-amd64.tar.gz
docker-machine-driver-harvester-amd64.tar.gz: gzip compressed data, from Unix, original size 36115968
Running pre-create checks…
Error with pre-create check: “Get "{{redacted because too many links in post}}/k8s/clusters/c-m-zrshdjzq/apis/harvesterhci.io/v1beta1/settings/server-version": x509: certificate signed by unknown authority”
The default lines below are for a sh/bash shell, you can specify the shell you’re using, with the --shell flag. and join url to be available on bootstrap node

Troubleshooting

Following these steps: Rancher Helm Chart Options | Rancher
I set the additionalTrustedCAs flag in my Rancher Helm deploy to true:

helm upgrade rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.plmr.cloud \
  --set additionalTrustedCAs=true

I uploaded the ca-additional.pem as a tls-ca-additional secret. The file I used as the CA to trust this cert chain was obtained by opening my cert in Firefox and downloading the PEM for ISRG Root X1.
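For reference, this is roughly the command I used to create that secret, following the Rancher chart docs (the key name ca-additional.pem is what the chart expects):

kubectl -n cattle-system create secret generic tls-ca-additional \
  --from-file=ca-additional.pem=./ca-additional.pem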

I also set the Harvester setting "additional-ca" to this same ISRG Root X1 PEM, which, as I understand it, should act as the root CA that trusts all Let's Encrypt-issued certs.
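If it helps anyone else: that UI setting is backed by a settings.harvesterhci.io object, so it can also be inspected or edited with kubectl against the Harvester cluster (a sketch; the PEM content goes into the value field):

kubectl get settings.harvesterhci.io additional-ca -o yaml
kubectl edit settings.harvesterhci.io additional-ca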

Ask

Can anyone help me understand whether this issue is because I have something misconfigured in Rancher, or in Harvester? It's unclear to me whether the provisioning log error above comes from Rancher attempting to call itself, or whether it's coming back from Harvester. I don't have another cloud provider configured to do side-by-side testing.

I was able to spin up a regular workload cluster in Proxmox VMs and did not encounter this issue when registering it to Rancher using the provided registration command.

If I need to move this to the Harvester section, just let me know.

I think I’ve made a helpful discovery this morning.

If I open a terminal on a cattle-system pod and curl my FQDN (https://rancher.home.lab), I get two different results:

  • local cluster: no errors
  • herd0 (manually provisioned workload cluster):

curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: curl - SSL CA Certificates

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

I am assuming this is equivalent to the Go "certificate signed by unknown authority" error I outlined above. In my mind this means the new cluster attempts to provision itself, but as soon as it creates the pods it needs for orchestration, it bombs out on cert errors because the cert is untrusted. I'm not totally clear on why herd0 has this issue yet still functions without apparent problems, though.
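To check whether this is purely a trust problem, and whether Traefik is actually serving the full chain rather than just the leaf cert, I've been poking at it roughly like this (a sketch, assuming the downloaded root PEM is saved as isrg-root-x1.pem):

# issuer/subject of the first cert the proxy serves
openssl s_client -connect rancher.home.lab:443 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -subject
# verify with only the root in the trust store; this fails the same way curl/go do
# if the Let's Encrypt intermediate is missing from the served chain
curl --cacert isrg-root-x1.pem https://rancher.home.lab/ping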

This leads me to conclude that my tls-ca-additional secret is only providing trust for my local cluster, or that it's trusted for some other reason I don't understand. I tried setting the same secret in herd0 and recycled one of the cattle-system pods, but the curl command was no more successful after that.
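"Recycled" here just means restarting the agents so they pick up new secrets/env; roughly:

kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
kubectl -n cattle-system rollout restart daemonset/cattle-node-agent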

I want to try changing out the ca-cert under Global Settings, but I haven't found a way to do that yet.

This turned out to be solely a Rancher configuration error that I made during setup. Documenting it in case anyone else ever runs across this…

In my setup, with a pre-issued Let's Encrypt cert terminated at my Traefik reverse proxy, I needed to follow the instructions here: download my cert chain, then install it as a secret/private cert chain.

This let me successfully curl my rancher.home.lab URL from the local cluster without a cert error.
I sanity-checked by making sure this matched my intended CA cert at https://rancher.plmr.cloud/v3/settings/cacerts.
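Concretely, the "private cert chain" step boiled down to something like the following, per the Rancher TLS secrets docs (fullchain.pem stands in for whatever chain file you downloaded):

kubectl -n cattle-system create secret generic tls-ca \
  --from-file=cacerts.pem=./fullchain.pem
# sanity check against what Rancher advertises to agents (assumes jq is installed)
curl -sk https://rancher.plmr.cloud/v3/settings/cacerts | jq -r .value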

My herd0 cluster is still having issues, which I diagnosed as follows:
Open a kubectl shell to herd0 and run:

kubectl get settings.management.cattle.io cacerts

The printout does not match my root cert.

I see that my checksums in herd0 look right:

$ kubectl edit -n cattle-system ds/cattle-node-agent
$ kubectl edit -n cattle-system deployment/cattle-cluster-agent
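By "checksums" I mean the CATTLE_CA_CHECKSUM env var in those two manifests. Roughly how I compared it against the advertised CA (a sketch; trailing-newline handling can make the hashes differ, so treat a mismatch as a hint rather than proof):

curl -sk https://rancher.plmr.cloud/v3/settings/cacerts | jq -r .value > cacerts.pem
sha256sum cacerts.pem
kubectl -n cattle-system get deployment cattle-cluster-agent \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="CATTLE_CA_CHECKSUM")].value}'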

I see the correct key under herd0 → storage → secrets → stv-aggregation → ca.crt

I saw a different cert under local → storage → secrets → stv-aggregation → ca.crt
It matched the results of kubectl get settings.management.cattle.io cacerts run on herd0, so I updated it to match my desired CA cert.

Now, when I hit the herd0 Kubernetes API at /v3/settings/cacerts, I get the desired value, but I still don't when I run kubectl get settings.management.cattle.io cacerts.
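For anyone comparing these the same way, this is roughly how I pull the ca.crt out of that secret instead of eyeballing it in the UI:

kubectl -n cattle-system get secret stv-aggregation \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -issuer -subject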

At this point I’m just guessing blindly, because after all of that I still can’t get herd0 to successfully recognize the cert on my reverse proxy.

My next step is to try creating a new test cluster, but beyond that I'm not sure where to go.

It seems like most of my testing has been erroneous.

In any workload cluster I spin up, none of the cattle-system pods can successfully validate the cert.
I think I'm back to square one.
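The check I keep repeating in each new cluster looks roughly like this (only meaningful for pods whose image ships curl):

for p in $(kubectl -n cattle-system get pods -o name); do
  echo "== $p"
  kubectl -n cattle-system exec "$p" -- curl -sS https://rancher.plmr.cloud/ping
done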

After carefully re-reading the documentation, I realized that some actions in my initial setup were not in line with the instructions.

I now have Traefik terminating certs, and forwarding both :80 and :443 TCP to my 3 Rancher nodes.

And this is how I set up my Rancher install:

helm upgrade rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.plmr.cloud \
  --set tls=external \
  --set ingress.tls.source=rancher \
  --set additionalTrustedCAs=true \
  --set bootstrapPassword=admin

Further attention to the logs shows that the failing step for Harvester provisioning is the machine-provision pod in the fleet-default namespace.
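For anyone retracing this, the pod and its output can be pulled like so (names below are from my run):

kubectl -n fleet-default get pods | grep machine-provision
kubectl -n fleet-default logs test-pool1-178f439a-559wp-machine-provision-4wk7l
kubectl -n fleet-default get pod test-pool1-178f439a-559wp-machine-provision-4wk7l -o yaml

The full pod YAML from that last command: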

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 1abeca567ee54901fcf6c3eac37009c0fb6146193a82935b51eaca87300a7be0
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
  creationTimestamp: "2022-04-03T16:16:31Z"
  generateName: test-pool1-178f439a-559wp-machine-provision-
  labels:
    controller-uid: 566710c6-146f-41d8-b00e-07395ec23022
    job-name: test-pool1-178f439a-559wp-machine-provision
    rke.cattle.io/infra-machine-group: rke-machine.cattle.io
    rke.cattle.io/infra-machine-kind: HarvesterMachine
    rke.cattle.io/infra-machine-name: test-pool1-178f439a-559wp
    rke.cattle.io/infra-machine-version: v1
    rke.cattle.io/infra-remove: "false"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:controller-uid: {}
          f:job-name: {}
          f:rke.cattle.io/infra-machine-group: {}
          f:rke.cattle.io/infra-machine-kind: {}
          f:rke.cattle.io/infra-machine-name: {}
          f:rke.cattle.io/infra-machine-version: {}
          f:rke.cattle.io/infra-remove: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"566710c6-146f-41d8-b00e-07395ec23022"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:containers:
          k:{"name":"machine"}:
            .: {}
            f:args: {}
            f:envFrom: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources: {}
            f:securityContext:
              .: {}
              f:runAsGroup: {}
              f:runAsUser: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/etc/ssl/certs/ca-additional.pem"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
                f:subPath: {}
              k:{"mountPath":"/run/secrets/machine"}:
                .: {}
                f:mountPath: {}
                f:name: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"bootstrap"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:optional: {}
              f:secretName: {}
          k:{"name":"tls-ca-additional-volume"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:optional: {}
              f:secretName: {}
    manager: kube-controller-manager
    operation: Update
    time: "2022-04-03T16:16:31Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:cni.projectcalico.org/containerID: {}
          f:cni.projectcalico.org/podIP: {}
          f:cni.projectcalico.org/podIPs: {}
    manager: calico
    operation: Update
    time: "2022-04-03T16:16:32Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.42.0.130"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2022-04-03T16:16:44Z"
  name: test-pool1-178f439a-559wp-machine-provision-4wk7l
  namespace: fleet-default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: test-pool1-178f439a-559wp-machine-provision
    uid: 566710c6-146f-41d8-b00e-07395ec23022
  resourceVersion: "9908570"
  uid: e0e7c115-c565-4f56-96a3-d1efeab62a9d
spec:
  containers:
  - args:
    - --driver-download-url=https://releases.rancher.com/harvester-node-driver/v0.3.4/docker-machine-driver-harvester-amd64.tar.gz
    - --driver-hash=e214c5ba38b83febce25863215f887239afee9b4477aa70b4f76695d53378632
    - --secret-namespace=fleet-default
    - --secret-name=test-pool1-178f439a-559wp-machine-state
    - create
    - --driver=harvester
    - --custom-install-script=/run/secrets/machine/value
    - --harvester-cpu-count=2
    - --harvester-disk-bus=virtio
    - --harvester-disk-size=40
    - --harvester-image-name=default/image-lddfx
    - --harvester-memory-size=4
    - --harvester-network-model=virtio
    - --harvester-network-name=default/wideopen
    - --harvester-network-type=dhcp
    - --harvester-ssh-port=22
    - --harvester-ssh-user=ubuntu
    - --harvester-vm-namespace=default
    - test-pool1-178f439a-559wp
    envFrom:
    - secretRef:
        name: test-pool1-178f439a-559wp-machine-driver-secret
    image: rancher/machine:v0.15.0-rancher73
    imagePullPolicy: Always
    name: machine
    resources: {}
    securityContext:
      runAsGroup: 1000
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run/secrets/machine
      name: bootstrap
    - mountPath: /etc/ssl/certs/ca-additional.pem
      name: tls-ca-additional-volume
      readOnly: true
      subPath: ca-additional.pem
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-x76vp
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 192.168.101.4
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: test-pool1-178f439a-559wp-machine-provision
  serviceAccountName: test-pool1-178f439a-559wp-machine-provision
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: bootstrap
    secret:
      defaultMode: 511
      optional: false
      secretName: test-bootstrap-template-jdzvs-machine-bootstrap
  - name: tls-ca-additional-volume
    secret:
      defaultMode: 292
      optional: true
      secretName: tls-ca-additional
  - name: kube-api-access-x76vp
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-04-03T16:16:31Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-04-03T16:16:44Z"
    message: 'containers with unready status: [machine]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-04-03T16:16:44Z"
    message: 'containers with unready status: [machine]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-04-03T16:16:31Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://0b95f2a6418d8b69e8b6e97e5e6ce0d99b8b60eb78adf84912af582d0f4f0239
    image: rancher/machine:v0.15.0-rancher73
    imageID: docker-pullable://rancher/machine@sha256:3baf7cb8bbc29fe9c16d45cd2b6c1a08a4deeb1bf6caff656fa264a4110c0d74
    lastState: {}
    name: machine
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: docker://0b95f2a6418d8b69e8b6e97e5e6ce0d99b8b60eb78adf84912af582d0f4f0239
        exitCode: 1
        finishedAt: "2022-04-03T16:16:44Z"
        message: |+
          Downloading driver from https://releases.rancher.com/harvester-node-driver/v0.3.4/docker-machine-driver-harvester-amd64.tar.gz
          docker-machine-driver-harvester-amd64.tar.gz
          docker-machine-driver-harvester-amd64.tar.gz: gzip compressed data, from Unix, original size 36115968
          Running pre-create checks...
          Error with pre-create check: "Get \"https://rancher.plmr.cloud/k8s/clusters/c-m-zrshdjzq/apis/harvesterhci.io/v1beta1/settings/server-version\": x509: certificate signed by unknown authority"
          The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

        reason: Error
        startedAt: "2022-04-03T16:16:33Z"
  hostIP: 192.168.101.4
  phase: Failed
  podIP: 10.42.0.130
  podIPs:
  - ip: 10.42.0.130
  qosClass: BestEffort
  startTime: "2022-04-03T16:16:31Z"

You can see that it mounts my tls-ca-additional secret, which contains the root CA for the Let's Encrypt cert being used on Traefik. What I cannot understand is why including that cert in this pod under /etc/ssl/certs does not allow successful validation of the cert.
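One way to reproduce the exact failure outside the pod, using only that root PEM as the trust store (a sketch):

# fails with "unable to get local issuer certificate" whenever the chain the proxy
# serves can't be completed from ISRG Root X1 alone (e.g. a missing intermediate)
curl --cacert ca-additional.pem https://rancher.plmr.cloud/ping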

I ended up upgrading the Kubernetes version my Rancher install was running on.
I can't guarantee that an in-place upgrade will fix this, because I got sloppy and wound up blowing away my old install.

Success with these versions:

Rancher 2.6.4
Kubernetes 1.22.7