wioxjk
November 13, 2023, 2:49pm
1
Rancher Version: v2.7.5
Hi,
I woke up to this error on my cluster, which uses vSphere as the machine driver. It has worked fine for months.
“[Error] init node not found”
This is the output of the following:
kubectl describe clusters.cluster.x-k8s.io externalprod -n fleet-default
Name: externalprod
Namespace: fleet-default
Labels: objectset.rio.cattle.io/hash=fbfedfb3e63619bb80ec32eb0ab4e7316deed741
Annotations: objectset.rio.cattle.io/applied:
H4sIAAAAAAAA/5yST28TMRDFvwqa827I5m+zEqfSUw+gCHFBHMb2c2Pi2Ct7NoCi/e7Iq6KG0kRVj+udN/Peb+ZEBwgbFqb2RBxCFBYXQy6fUf2AlgyZJBcnmkU8Ji6+d4ZaSnvU2v...
objectset.rio.cattle.io/id: rke-cluster
objectset.rio.cattle.io/owner-gvk: provisioning.cattle.io/v1, Kind=Cluster
objectset.rio.cattle.io/owner-name: externalprod
objectset.rio.cattle.io/owner-namespace: fleet-default
API Version: cluster.x-k8s.io/v1beta1
Kind: Cluster
Metadata:
Creation Timestamp: 2023-09-04T12:49:27Z
Finalizers:
cluster.cluster.x-k8s.io
Generation: 197302
Managed Fields:
API Version: cluster.x-k8s.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:objectset.rio.cattle.io/applied:
f:objectset.rio.cattle.io/id:
f:objectset.rio.cattle.io/owner-gvk:
f:objectset.rio.cattle.io/owner-name:
f:objectset.rio.cattle.io/owner-namespace:
f:finalizers:
.:
v:"cluster.cluster.x-k8s.io":
f:labels:
.:
f:objectset.rio.cattle.io/hash:
f:ownerReferences:
.:
k:{"uid":"2ecd5d8d-f231-4fa7-9b1e-54bb74169c5f"}:
f:spec:
.:
f:controlPlaneEndpoint:
.:
f:host:
f:port:
f:controlPlaneRef:
f:infrastructureRef:
Manager: rancher
Operation: Update
Time: 2023-11-13T14:37:17Z
API Version: cluster.x-k8s.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:conditions:
f:controlPlaneReady:
f:infrastructureReady:
f:observedGeneration:
f:phase:
Manager: rancher
Operation: Update
Subresource: status
Time: 2023-11-13T14:37:17Z
Owner References:
API Version: provisioning.cattle.io/v1
Block Owner Deletion: true
Controller: true
Kind: Cluster
Name: externalprod
UID: 2ecd5d8d-f231-4fa7-9b1e-54bb74169c5f
Resource Version: 87108732
UID: cdef4c81-1f6f-4f81-9f7f-92aa5a1733a1
Spec:
Control Plane Endpoint:
Host: localhost
Port: 6443
Control Plane Ref:
API Version: rke.cattle.io/v1
Kind: RKEControlPlane
Name: externalprod
Namespace: fleet-default
Infrastructure Ref:
API Version: rke.cattle.io/v1
Kind: RKECluster
Name: externalprod
Namespace: fleet-default
Status:
Conditions:
Last Transition Time: 2023-09-04T12:49:29Z
Status: True
Type: Ready
Last Transition Time: 2023-09-04T13:09:56Z
Status: True
Type: ControlPlaneInitialized
Last Transition Time: 2023-11-10T20:26:56Z
Message: init node not found
Reason: Error
Status: False
Type: ControlPlaneReady
Last Transition Time: 2023-09-04T12:49:29Z
Status: True
Type: InfrastructureReady
Control Plane Ready: true
Infrastructure Ready: true
Observed Generation: 197302
Phase: Provisioned
Events: <none>
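For anyone checking the same thing, here is a rough sketch of how the init-node bookkeeping can be inspected on the management cluster (the label name rke.cattle.io/init-node-machine-id is the one that comes up later in this thread; the exact jsonpath expression is a best-effort guess, not anything authoritative):

```
# List the CAPI machines backing the cluster and the labels they carry
kubectl get machines.cluster.x-k8s.io -n fleet-default --show-labels

# Which machine ID does the provisioning cluster currently point at as init node?
# (dots in the label key are escaped for jsonpath)
kubectl get clusters.provisioning.cattle.io externalprod -n fleet-default \
  -o jsonpath='{.metadata.labels.rke\.cattle\.io/init-node-machine-id}{"\n"}'
```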
I checked this issue, which supposedly contains a workaround for this:
(GitHub issue: opened 25 Apr 2022, closed 22 Oct 2022; labels: kind/bug, area/backend, priority/0, area/rancher-related, require/release-note, not-require/test-plan)
**Describe the bug**
This was spotted when debugging https://github.com/harvester/harvester/issues/2187.
After deleting a server node, a worker node can't become a control plane node.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a 4-node Harvester cluster.
2. Wait for 3 nodes to become control plane nodes (role is `control-plane,etcd,master`).
3. Find which node the rancher-webhook pod is on. Assume nodeX.
4. Delete nodeX.
5. Harvester should promote the remaining worker node, but the job keeps waiting:
```
machine.cluster.x-k8s.io/custom-6bce219ef5d1 labeled
secret/custom-6bce219ef5d1-machine-plan labeled
rkebootstrap.rke.cattle.io/custom-6bce219ef5d1 labeled
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
```
The CAPI cluster remains stuck in the `Provisioning` phase:
```
$ kubectl get cluster -n fleet-local -o yaml
apiVersion: v1
items:
- apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"labels":{"rke.cattle.io/init-node-machine-id":"xkhlp79g4cg8rgdgfsxsbm26ftvhglvzst28r9cr87spst2hcldxdq"},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.21.11+rke2r1","rkeConfig":{"controlPlaneConfig":null}}}
objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURrJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
objectset.rio.cattle.io/id: provisioning-cluster-create
objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
objectset.rio.cattle.io/owner-name: local
objectset.rio.cattle.io/owner-namespace: ""
creationTimestamp: "2022-04-11T08:15:00Z"
finalizers:
- wrangler.cattle.io/provisioning-cluster-remove
- wrangler.cattle.io/rke-cluster-remove
generation: 2
labels:
objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
provider.cattle.io: harvester
rke.cattle.io/init-node-machine-id: xkhlp79g4cg8rgdgfsxsbm26ftvhglvzst28r9cr87spst2hcldxdq
name: local
namespace: fleet-local
resourceVersion: "20689"
uid: 45e02df5-5f70-4845-ae77-0954a4b68fa8
spec:
kubernetesVersion: v1.21.11+rke2r1
localClusterAuthEndpoint: {}
rkeConfig: {}
status:
clientSecretName: local-kubeconfig
clusterName: local
conditions:
- status: "True"
type: Ready
- status: Unknown
type: DefaultProjectCreated
- status: Unknown
type: SystemProjectCreated
- lastUpdateTime: "2022-04-11T08:15:00Z"
status: "False"
type: Reconciling
- lastUpdateTime: "2022-04-11T08:15:00Z"
status: "False"
type: Stalled
- lastUpdateTime: "2022-04-11T08:15:49Z"
status: "True"
type: Created
- lastUpdateTime: "2022-04-25T08:04:12Z"
status: "True"
type: RKECluster
- lastUpdateTime: "2022-04-25T08:04:12Z"
message: 'Operation cannot be fulfilled on secrets "custom-6aa860f10259-machine-plan":
the object has been modified; please apply your changes to the latest version
and try again'
reason: Error
status: "False"
type: Provisioned
observedGeneration: 2
ready: true
kind: List
metadata:
resourceVersion: ""
selfLink: ""
node2:~ # kubectl get machines -A
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
fleet-local custom-6aa860f10259 local node2 rke2://node2 Running 14d
fleet-local custom-6bce219ef5d1 local node4 rke2://node4 Running 14d
fleet-local custom-78e6431db553 local node3 rke2://node3 Running 14d
```
**Expected behavior**
The worker node should be promoted.
**Support bundle**
**Environment:**
- Harvester ISO version: `v1.0.1`, Rancher `v2.6.4`
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): KVM VMs
I got the metadata:
kubectl get secret externalprod-pool-c31513ec-rwf2n-machine-state -n fleet-default -o yaml |grep "metadata"
And then tried to label the node:
kubectl label clusters.provisioning.cattle.io local -n fleet-local rke.cattle.io/init-node-machine-id=ef710914b7f47702a96124064d5d7300fd504c13 --overwrite
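To confirm the label actually landed, a trivial check (nothing cluster-specific assumed here):

```
# Show the labels currently set on the provisioning cluster object
kubectl get clusters.provisioning.cattle.io local -n fleet-local --show-labels
```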
I found these errors on the nodes themselves:
Nov 13 16:30:54 externalprod-pool-81b84bbf-m2x5d rancher-system-agent[816]: time="2023-11-13T16:30:54+01:00" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (87127446 vs 87131617)"
Nov 13 16:30:54 externalprod-pool-81b84bbf-m2x5d rancher-system-agent[816]: time="2023-11-13T16:30:54+01:00" level=error msg="error syncing 'fleet-default/externalprod-bootstrap-template-ljj2b-machine-plan': handler secret-watch: secret received was too old, requeuing"
But that did not help.
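For reference, the agent that logs these messages runs as a systemd service on the node; a minimal sketch for looking at it (restarting it is only my own guess at a next step, not something taken from Rancher documentation):

```
# Follow the agent logs on the affected node
journalctl -u rancher-system-agent -f

# Untested guess: restart the agent so it re-reads the current plan secret
systemctl restart rancher-system-agent.service
```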
I am kind of stuck on where to start debugging this issue.
Thankful for any pointers!
wioxjk
November 14, 2023, 8:52am
2
So I tried a radical approach - I rebooted all the nodes.
Now, the error I am getting is this:
fixed machine with ID 6bb2e394-xxxx-xxxx-xxxx-49f4996f3f30 not found
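A rough way to check where that ID is still referenced, sketched against the object types that appear earlier in this thread:

```
# Does anything on the management cluster still reference the machine ID from the error?
kubectl get machines.cluster.x-k8s.io -n fleet-default -o yaml | grep -i 6bb2e394
kubectl get rkebootstraps.rke.cattle.io -n fleet-default -o yaml | grep -i 6bb2e394
kubectl get clusters.provisioning.cattle.io externalprod -n fleet-default -o yaml | grep -i 6bb2e394
```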
wioxjk
November 15, 2023, 2:29pm
3
So I removed the label rke.cattle.io/init-node-machine-id: 6bb2e394-04e8-4826-9a7c-49f4996f3f30
And we are back to the original error…
Looks like there’s an issue with your bootstrap node.
I had faced a similar issue before… I was able to resolve it by doing the following:
I see you’ve tried to add the label:
kubectl label clusters.provisioning.cattle.io local -n fleet-local rke.cattle.io/init-node-machine-id=ef710914b7f47702a96124064d5d7300fd504c13 --overwrite
If this is a downstream cluster then you should look for your cluster in the fleet-default namespace.
Try:
kubectl get clusters.provisioning.cattle.io -n fleet-default
Based on your cluster name, apply the label:
kubectl label clusters.provisioning.cattle.io <cluster_name> -n fleet-default rke.cattle.io/init-node-machine-id=<machine_id> --overwrite
wioxjk
November 17, 2023, 7:55am
5
Thank you for your reply!
I’ve actually tried this as well, but I am getting the above error: fixed machine with ID XXXXX not found
The question is, though, whether I have picked the correct machine ID…
How did you get your machine ID?
Do you have access to the Rancher UI? If yes, go to Cluster Management → click on your cluster → select your master node → in the YAML of your master node, in the labels section, you will see a label rke.cattle.io/machine-id. Copy that machine ID and use it in the command.
The machine ID should be of the healthy Master node.
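If the UI is not handy, a minimal kubectl equivalent, assuming the label sits on the node object as described above and that this is run against the downstream cluster's kubeconfig (<master-node> is a placeholder):

```
# Read the machine ID label directly from the healthy master node
kubectl get node <master-node> \
  -o jsonpath='{.metadata.labels.rke\.cattle\.io/machine-id}{"\n"}'
```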
wioxjk
November 17, 2023, 9:31am
7
Thanks for confirming.
I can confirm that I did exactly that even before; however, I am faced with the same error (fixed machine with ID …) as before.
Can you check if there's a spec.EtcdSnapshotCreate field in your cluster config.yaml? If yes, can you remove it, re-apply, and then try again? (reference)
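A rough sketch of how to check for and remove that field (the exact path under spec is not confirmed in this thread, so treat it as a starting point only):

```
# Is the field present at all?
kubectl get clusters.provisioning.cattle.io <cluster_name> -n fleet-default -o yaml \
  | grep -i etcdSnapshotCreate

# If so, edit it out of the spec
kubectl edit clusters.provisioning.cattle.io <cluster_name> -n fleet-default
```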
wioxjk
November 20, 2023, 8:03am
9
Hi,
I’ve removed the value that previously was “1” and set it to “nil”, and also removed the label that I had set with the following:
kubectl label clusters.provisioning.cattle.io <cluster_name> -n fleet-default rke.cattle.io/init-node-machine-id=<machine_id> --overwrite
And it looks like progress is being made: the errors are gone and Rancher can finally provision new nodes and update the cluster!
Thank you very much!