wioxjk
November 13, 2023, 2:49pm
1
Rancher Version: v2.7.5
Hi,
I woke up to this error on my cluster, which uses vSphere as the machine driver. It has worked fine for months.
“[Error] init node not found”
This is the output of the following:
kubectl describe clusters.cluster.x-k8s.io externalprod -n fleet-default
Name: externalprod
Namespace: fleet-default
Labels: objectset.rio.cattle.io/hash=fbfedfb3e63619bb80ec32eb0ab4e7316deed741
Annotations: objectset.rio.cattle.io/applied:
H4sIAAAAAAAA/5yST28TMRDFvwqa827I5m+zEqfSUw+gCHFBHMb2c2Pi2Ct7NoCi/e7Iq6KG0kRVj+udN/Peb+ZEBwgbFqb2RBxCFBYXQy6fUf2AlgyZJBcnmkU8Ji6+d4ZaSnvU2v...
objectset.rio.cattle.io/id: rke-cluster
objectset.rio.cattle.io/owner-gvk: provisioning.cattle.io/v1, Kind=Cluster
objectset.rio.cattle.io/owner-name: externalprod
objectset.rio.cattle.io/owner-namespace: fleet-default
API Version: cluster.x-k8s.io/v1beta1
Kind: Cluster
Metadata:
Creation Timestamp: 2023-09-04T12:49:27Z
Finalizers:
cluster.cluster.x-k8s.io
Generation: 197302
Managed Fields:
API Version: cluster.x-k8s.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:objectset.rio.cattle.io/applied:
f:objectset.rio.cattle.io/id:
f:objectset.rio.cattle.io/owner-gvk:
f:objectset.rio.cattle.io/owner-name:
f:objectset.rio.cattle.io/owner-namespace:
f:finalizers:
.:
v:"cluster.cluster.x-k8s.io":
f:labels:
.:
f:objectset.rio.cattle.io/hash:
f:ownerReferences:
.:
k:{"uid":"2ecd5d8d-f231-4fa7-9b1e-54bb74169c5f"}:
f:spec:
.:
f:controlPlaneEndpoint:
.:
f:host:
f:port:
f:controlPlaneRef:
f:infrastructureRef:
Manager: rancher
Operation: Update
Time: 2023-11-13T14:37:17Z
API Version: cluster.x-k8s.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:conditions:
f:controlPlaneReady:
f:infrastructureReady:
f:observedGeneration:
f:phase:
Manager: rancher
Operation: Update
Subresource: status
Time: 2023-11-13T14:37:17Z
Owner References:
API Version: provisioning.cattle.io/v1
Block Owner Deletion: true
Controller: true
Kind: Cluster
Name: externalprod
UID: 2ecd5d8d-f231-4fa7-9b1e-54bb74169c5f
Resource Version: 87108732
UID: cdef4c81-1f6f-4f81-9f7f-92aa5a1733a1
Spec:
Control Plane Endpoint:
Host: localhost
Port: 6443
Control Plane Ref:
API Version: rke.cattle.io/v1
Kind: RKEControlPlane
Name: externalprod
Namespace: fleet-default
Infrastructure Ref:
API Version: rke.cattle.io/v1
Kind: RKECluster
Name: externalprod
Namespace: fleet-default
Status:
Conditions:
Last Transition Time: 2023-09-04T12:49:29Z
Status: True
Type: Ready
Last Transition Time: 2023-09-04T13:09:56Z
Status: True
Type: ControlPlaneInitialized
Last Transition Time: 2023-11-10T20:26:56Z
Message: init node not found
Reason: Error
Status: False
Type: ControlPlaneReady
Last Transition Time: 2023-09-04T12:49:29Z
Status: True
Type: InfrastructureReady
Control Plane Ready: true
Infrastructure Ready: true
Observed Generation: 197302
Phase: Provisioned
Events: <none>
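For anyone checking the same thing, here is a rough sketch of how the init-node bookkeeping can be inspected on the management cluster (the label name rke.cattle.io/init-node-machine-id is the one that comes up later in this thread; the exact jsonpath expression is a best-effort guess, not anything authoritative):

```
# List the CAPI machines backing the cluster and the labels they carry
kubectl get machines.cluster.x-k8s.io -n fleet-default --show-labels

# Which machine ID does the provisioning cluster currently point at as init node?
# (dots in the label key are escaped for jsonpath)
kubectl get clusters.provisioning.cattle.io externalprod -n fleet-default \
  -o jsonpath='{.metadata.labels.rke\.cattle\.io/init-node-machine-id}{"\n"}'
```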
I checked this issue, which supposedly contains a workaround for this:
(GitHub issue: opened 25 Apr 2022, closed 22 Oct 2022; labels: kind/bug, area/backend, priority/0, area/rancher-related, require/release-note, not-require/test-plan)
**Describe the bug**
This was spotted when debugging https://github.com/harvester/harvester/issues/2187.
After deleting a server node, a worker node can't become a control plane node.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a 4-node Harvester cluster.
2. Wait for 3 nodes to become control plane nodes (role is `control-plane,etcd,master`).
3. Find which node the rancher-webhook pod is on. Assume nodeX.
4. Delete nodeX.
5. Harvester should promote the remaining worker node, but the job keeps waiting:
```
machine.cluster.x-k8s.io/custom-6bce219ef5d1 labeled
secret/custom-6bce219ef5d1-machine-plan labeled
rkebootstrap.rke.cattle.io/custom-6bce219ef5d1 labeled
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
```
The CAPI cluster remains stuck in the `Provisioning` phase:
```
$ kubectl get cluster -n fleet-local -o yaml
apiVersion: v1
items:
- apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"labels":{"rke.cattle.io/init-node-machine-id":"xkhlp79g4cg8rgdgfsxsbm26ftvhglvzst28r9cr87spst2hcldxdq"},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.21.11+rke2r1","rkeConfig":{"controlPlaneConfig":null}}}
objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURrJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
objectset.rio.cattle.io/id: provisioning-cluster-create
objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
objectset.rio.cattle.io/owner-name: local
objectset.rio.cattle.io/owner-namespace: ""
creationTimestamp: "2022-04-11T08:15:00Z"
finalizers:
- wrangler.cattle.io/provisioning-cluster-remove
- wrangler.cattle.io/rke-cluster-remove
generation: 2
labels:
objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
provider.cattle.io: harvester
rke.cattle.io/init-node-machine-id: xkhlp79g4cg8rgdgfsxsbm26ftvhglvzst28r9cr87spst2hcldxdq
name: local
namespace: fleet-local
resourceVersion: "20689"
uid: 45e02df5-5f70-4845-ae77-0954a4b68fa8
spec:
kubernetesVersion: v1.21.11+rke2r1
localClusterAuthEndpoint: {}
rkeConfig: {}
status:
clientSecretName: local-kubeconfig
clusterName: local
conditions:
- status: "True"
type: Ready
- status: Unknown
type: DefaultProjectCreated
- status: Unknown
type: SystemProjectCreated
- lastUpdateTime: "2022-04-11T08:15:00Z"
status: "False"
type: Reconciling
- lastUpdateTime: "2022-04-11T08:15:00Z"
status: "False"
type: Stalled
- lastUpdateTime: "2022-04-11T08:15:49Z"
status: "True"
type: Created
- lastUpdateTime: "2022-04-25T08:04:12Z"
status: "True"
type: RKECluster
- lastUpdateTime: "2022-04-25T08:04:12Z"
message: 'Operation cannot be fulfilled on secrets "custom-6aa860f10259-machine-plan":
the object has been modified; please apply your changes to the latest version
and try again'
reason: Error
status: "False"
type: Provisioned
observedGeneration: 2
ready: true
kind: List
metadata:
resourceVersion: ""
selfLink: ""
node2:~ # kubectl get machines -A
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
fleet-local custom-6aa860f10259 local node2 rke2://node2 Running 14d
fleet-local custom-6bce219ef5d1 local node4 rke2://node4 Running 14d
fleet-local custom-78e6431db553 local node3 rke2://node3 Running 14d
```
**Expected behavior**
The worker node should be promoted.
**Support bundle**
**Environment:**
- Harvester ISO version: `v1.0.1`, Rancher `v2.6.4`
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): KVM VMs
I got the metadata:
kubectl get secret externalprod-pool-c31513ec-rwf2n-machine-state -n fleet-default -o yaml |grep "metadata"
And then tried to label the node:
kubectl label clusters.provisioning.cattle.io local -n fleet-local rke.cattle.io/init-node-machine-id=ef710914b7f47702a96124064d5d7300fd504c13 --overwrite
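To confirm the label actually landed, a trivial check (nothing cluster-specific assumed here):

```
# Show the labels currently set on the provisioning cluster object
kubectl get clusters.provisioning.cattle.io local -n fleet-local --show-labels
```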
I found these errors on the nodes themselves:
Nov 13 16:30:54 externalprod-pool-81b84bbf-m2x5d rancher-system-agent[816]: time="2023-11-13T16:30:54+01:00" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (87127446 vs 87131617)"
Nov 13 16:30:54 externalprod-pool-81b84bbf-m2x5d rancher-system-agent[816]: time="2023-11-13T16:30:54+01:00" level=error msg="error syncing 'fleet-default/externalprod-bootstrap-template-ljj2b-machine-plan': handler secret-watch: secret received was too old, requeuing"
But that did not help.
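For reference, the agent that logs these messages runs as a systemd service on the node; a minimal sketch for looking at it (restarting it is only my own guess at a next step, not something taken from Rancher documentation):

```
# Follow the agent logs on the affected node
journalctl -u rancher-system-agent -f

# Untested guess: restart the agent so it re-reads the current plan secret
systemctl restart rancher-system-agent.service
```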
I am kind of stuck on where to start debugging this issue.
Thankful for any pointers!
wioxjk
November 14, 2023, 8:52am
2
So I tried a radical approach - I rebooted all the nodes.
Now, the error I am getting is this:
fixed machine with ID 6bb2e394-xxxx-xxxx-xxxx-49f4996f3f30 not found
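A rough way to check where that ID is still referenced, sketched against the object types that appear earlier in this thread:

```
# Does anything on the management cluster still reference the machine ID from the error?
kubectl get machines.cluster.x-k8s.io -n fleet-default -o yaml | grep -i 6bb2e394
kubectl get rkebootstraps.rke.cattle.io -n fleet-default -o yaml | grep -i 6bb2e394
kubectl get clusters.provisioning.cattle.io externalprod -n fleet-default -o yaml | grep -i 6bb2e394
```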
wioxjk
November 15, 2023, 2:29pm
3
So I removed the label rke.cattle.io/init-node-machine-id: 6bb2e394-04e8-4826-9a7c-49f4996f3f30
And we are back to the original error…
Looks like there’s an issue with your bootstrap node.
I had faced a similar issue before… I was able to resolve it by doing the following:
I see you’ve tried to add the label:
kubectl label clusters.provisioning.cattle.io local -n fleet-local rke.cattle.io/init-node-machine-id=ef710914b7f47702a96124064d5d7300fd504c13 --overwrite
If this is a downstream cluster then you should look for your cluster in the fleet-default namespace.
Try:
kubectl get clusters.provisioning.cattle.io -n fleet-default
Based on your cluster name, apply the label:
kubectl label clusters.provisioning.cattle.io <cluster_name> -n fleet-default rke.cattle.io/init-node-machine-id=<machine_id> --overwrite
wioxjk
November 17, 2023, 7:55am
5
Thank you for your reply!
I’ve actually tried this as well, but I am getting the above error: fixed machine with ID XXXXX not found
The question is, though, whether I have picked the correct machine ID…
How did you get your machine ID?
Do you have access to the Rancher UI? If yes, go to Cluster Management → click on your cluster → select your master node → in the YAML of your master node, in the labels section, you will see a label rke.cattle.io/machine-id. Copy that machine ID and use it in the command.
The machine ID should be of the healthy Master node.
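If the UI is not handy, a minimal kubectl equivalent, assuming the label sits on the node object as described above and that this is run against the downstream cluster's kubeconfig (<master-node> is a placeholder):

```
# Read the machine ID label directly from the healthy master node
kubectl get node <master-node> \
  -o jsonpath='{.metadata.labels.rke\.cattle\.io/machine-id}{"\n"}'
```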
wioxjk
November 17, 2023, 9:31am
7
Thanks for confirming.
I can confirm that I did exactly that even before; however, I am faced with the same error (fixed machine with ID …) as before.
Can you check if there's a spec.EtcdSnapshotCreate field in your cluster config.yaml? If yes, can you remove it, re-apply, and then try again? (reference)
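A rough sketch of how to check for and remove that field (the exact path under spec is not confirmed in this thread, so treat it as a starting point only):

```
# Is the field present at all?
kubectl get clusters.provisioning.cattle.io <cluster_name> -n fleet-default -o yaml \
  | grep -i etcdSnapshotCreate

# If so, edit it out of the spec
kubectl edit clusters.provisioning.cattle.io <cluster_name> -n fleet-default
```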
wioxjk
November 20, 2023, 8:03am
9
Hi,
I’ve removed the value that previously was “1” and set it to “nil”, and also removed the label that I had set with the following:
kubectl label clusters.provisioning.cattle.io <cluster_name> -n fleet-default rke.cattle.io/init-node-machine-id=<machine_id> --overwrite
And it looks like progress is being made: the errors are gone and Rancher can finally provision new nodes and update the cluster!
Thank you very much!