Init node not found

Rancher Version: v2.7.5

Hi,

I woke up to this error, on my cluster that is using vSphere as machine-driver. It has worked fine for months.

“[Error] init node not found”

This is the output of the following:

kubectl describe clusters.cluster.x-k8s.io externalprod -n fleet-default


Name:         externalprod
Namespace:    fleet-default
Labels:       objectset.rio.cattle.io/hash=fbfedfb3e63619bb80ec32eb0ab4e7316deed741
Annotations:  objectset.rio.cattle.io/applied:
                H4sIAAAAAAAA/5yST28TMRDFvwqa827I5m+zEqfSUw+gCHFBHMb2c2Pi2Ct7NoCi/e7Iq6KG0kRVj+udN/Peb+ZEBwgbFqb2RBxCFBYXQy6fUf2AlgyZJBcnmkU8Ji6+d4ZaSnvU2v...
              objectset.rio.cattle.io/id: rke-cluster
              objectset.rio.cattle.io/owner-gvk: provisioning.cattle.io/v1, Kind=Cluster
              objectset.rio.cattle.io/owner-name: externalprod
              objectset.rio.cattle.io/owner-namespace: fleet-default
API Version:  cluster.x-k8s.io/v1beta1
Kind:         Cluster
Metadata:
  Creation Timestamp:  2023-09-04T12:49:27Z
  Finalizers:
    cluster.cluster.x-k8s.io
  Generation:  197302
  Managed Fields:
    API Version:  cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:objectset.rio.cattle.io/applied:
          f:objectset.rio.cattle.io/id:
          f:objectset.rio.cattle.io/owner-gvk:
          f:objectset.rio.cattle.io/owner-name:
          f:objectset.rio.cattle.io/owner-namespace:
        f:finalizers:
          .:
          v:"cluster.cluster.x-k8s.io":
        f:labels:
          .:
          f:objectset.rio.cattle.io/hash:
        f:ownerReferences:
          .:
          k:{"uid":"2ecd5d8d-f231-4fa7-9b1e-54bb74169c5f"}:
      f:spec:
        .:
        f:controlPlaneEndpoint:
          .:
          f:host:
          f:port:
        f:controlPlaneRef:
        f:infrastructureRef:
    Manager:      rancher
    Operation:    Update
    Time:         2023-11-13T14:37:17Z
    API Version:  cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:controlPlaneReady:
        f:infrastructureReady:
        f:observedGeneration:
        f:phase:
    Manager:      rancher
    Operation:    Update
    Subresource:  status
    Time:         2023-11-13T14:37:17Z
  Owner References:
    API Version:           provisioning.cattle.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Cluster
    Name:                  externalprod
    UID:                   2ecd5d8d-f231-4fa7-9b1e-54bb74169c5f
  Resource Version:        87108732
  UID:                     cdef4c81-1f6f-4f81-9f7f-92aa5a1733a1
Spec:
  Control Plane Endpoint:
    Host:  localhost
    Port:  6443
  Control Plane Ref:
    API Version:  rke.cattle.io/v1
    Kind:         RKEControlPlane
    Name:         externalprod
    Namespace:    fleet-default
  Infrastructure Ref:
    API Version:  rke.cattle.io/v1
    Kind:         RKECluster
    Name:         externalprod
    Namespace:    fleet-default
Status:
  Conditions:
    Last Transition Time:  2023-09-04T12:49:29Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2023-09-04T13:09:56Z
    Status:                True
    Type:                  ControlPlaneInitialized
    Last Transition Time:  2023-11-10T20:26:56Z
    Message:               init node not found
    Reason:                Error
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2023-09-04T12:49:29Z
    Status:                True
    Type:                  InfrastructureReady
  Control Plane Ready:     true
  Infrastructure Ready:    true
  Observed Generation:     197302
  Phase:                   Provisioned
Events:                    <none>

I checked this issue, which apparently has a workaround for it:

I got the metadata:
kubectl get secret externalprod-pool-c31513ec-rwf2n-machine-state -n fleet-default -o yaml | grep "metadata"

And then tried to label the node:

kubectl label clusters.provisioning.cattle.io local -n fleet-local rke.cattle.io/init-node-machine-id=ef710914b7f47702a96124064d5d7300fd504c13 --overwrite
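(The label can be double-checked afterwards with --show-labels, e.g.:)

kubectl get clusters.provisioning.cattle.io local -n fleet-local --show-labels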


I found these errors on the nodes themselves:

Nov 13 16:30:54 externalprod-pool-81b84bbf-m2x5d rancher-system-agent[816]: time="2023-11-13T16:30:54+01:00" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (87127446 vs 87131617)"
Nov 13 16:30:54 externalprod-pool-81b84bbf-m2x5d rancher-system-agent[816]: time="2023-11-13T16:30:54+01:00" level=error msg="error syncing 'fleet-default/externalprod-bootstrap-template-ljj2b-machine-plan': handler secret-watch: secret received was too old, requeuing"

But that did not help.
I am kind of stuck on where to start debugging this issue.

Thankful for any pointers

So I tried a radical approach - I rebooted all the nodes.
Now, the error I am getting is this:

fixed machine with ID 6bb2e394-xxxx-xxxx-xxxx-49f4996f3f30 not found


So I removed the label rke.cattle.io/init-node-machine-id: 6bb2e394-04e8-4826-9a7c-49f4996f3f30

And we are back to the original error…

Looks like there’s an issue with your bootstrap node.
I faced a similar issue before and was able to resolve it by doing the following:

I see you’ve tried to add the label:

kubectl label clusters.provisioning.cattle.io local -n fleet-local rke.cattle.io/init-node-machine-id=ef710914b7f47702a96124064d5d7300fd504c13 --overwrite

If this is a downstream cluster, then you should look for your cluster in the fleet-default namespace.
Try:

kubectl get clusters.provisioning.cattle.io -n fleet-default

Based on your cluster name, apply the label:

kubectl label clusters.provisioning.cattle.io <cluster_name> -n fleet-default rke.cattle.io/init-node-machine-id=<machine_id> --overwrite

Thank you for your reply!

I’ve actually tried this as well, but I am getting the above error: fixed machine with ID XXXXX not found

The question is, though, whether I have picked the correct machine ID…

How did you get your machine ID?

Do you have access to the Rancher UI? If yes, go to Cluster Management → click on your cluster → select your master node → in the YAML of your master node, under the labels section, you will see a label rke.cattle.io/machine-id. Copy that machine ID and use it in the command.

The machine ID should be that of a healthy master node.
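If you prefer the CLI, the same label can presumably be read from the Machine objects in fleet-default (a sketch — assuming the rke.cattle.io/machine-id label is also set on the Machine objects, which may depend on the Rancher version):

kubectl get machines.cluster.x-k8s.io -n fleet-default \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,MACHINE_ID:.metadata.labels.rke\.cattle\.io/machine-id'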


Thanks for confirming.
I can confirm that I did exactly that even before; however, I am still faced with the same error (fixed machine with ID…) as before.

Can you check if there’s a spec.EtcdSnapshotCreate field in your cluster config YAML? If yes, can you remove it, apply, and then try? (reference)
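If it’s easier than editing the YAML by hand, a JSON patch can remove the field directly (a sketch — the exact field path may differ by Rancher version, e.g. it might live under spec.rkeConfig, so double-check it in your cluster object first):

kubectl patch clusters.provisioning.cattle.io <cluster_name> -n fleet-default --type=json \
  -p '[{"op": "remove", "path": "/spec/rkeConfig/etcdSnapshotCreate"}]'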


Hi,
I’ve removed the value that previously was “1” and set it to “nil”, and I’ve also removed the label that I had set earlier with the following command:
kubectl label clusters.provisioning.cattle.io <cluster_name> -n fleet-default rke.cattle.io/init-node-machine-id=<machine_id> --overwrite
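(For reference, removing a label with kubectl is done by appending a dash to the label key, e.g.:)

kubectl label clusters.provisioning.cattle.io <cluster_name> -n fleet-default rke.cattle.io/init-node-machine-id-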

And it looks like progress is being made: the errors are gone in Rancher, and it can finally provision new nodes and update the cluster!

Thank you very much!
