Cluster attempting to use deleted node group/launch template

I’ve recently configured a launch template for a node group in EKS, which is managed by Rancher.

This went through some iterations which required creating a few different node groups in order to get the nodes set up as I wanted.

After cleaning up the old node groups and updating the rancher user with the necessary permissions, the cluster is now stuck in an updating state with the following error:

Controller.FailureMessage{ClusterName:"", Message_:"Launch template could not be found : Could not find the specified version 3 for the launch template with ID lt-xxxx.", NodegroupName:""}

The launch template exists, but the version was removed. This version was also connected to a node group which doesn’t exist.

Anyone come across this before? Or any ideas on how to solve this?

EDIT: Looks like the 'eks-config-operator' is the culprit:

time="2024-04-16T13:56:59Z" level=error msg="error syncing 'cattle-global-data/c-94tkb': handler eks-controller: error creating nodegroup: InvalidParameterException: Launch template could not be found : Could not find the specified version 3 for the launch template with ID lt-xxxxxxxx.\n{\n RespMetadata: {\n StatusCode: 400,\n RequestID: \"e324c1b9-16ec-4788-af55-c3581719fe15\"\n },\n Message_: \"Launch template could not be found : Could not find the specified version 3 for the launch template with ID lt-xxxxxxxx.\"\n}, requeuing"

It looks like the controller is infinitely requeuing the job. Restarting the operator doesn’t help. Not sure if there’s a way to clear this.

I am facing a similar issue.

I recently upgraded EKS from 1.26 to 1.27, and a new launch template was created. On Rancher, it is still looking for the old template.

ERROR: Controller.FailureMessage{ClusterName:"", Message_:"Launch template could not be found : The specified launch template, with template ID lt-xxxx, does not exist.", NodegroupName:""}

Also, when I checked Cluster Management, the Kubernetes version was grayed out.

On Rancher 2.7.9, I solved this by manually patching the clusters.management.cattle.io object for the downstream cluster on the Rancher local instance (namespace fleet-default).

I replaced the inconsistent data under

spec:
  eksConfig:
    nodeGroups:

with the correct data coming from the status field of the object:

status:
  appliedSpec:
    eksConfig:
      nodeGroups:
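In case it helps others, the edit can be sketched roughly like this. The cluster ID, namespace object name, and node group values below are placeholders; verify everything against your own object before saving, since a wrong spec can delete node groups in EKS:

```yaml
# Open the downstream cluster object on the Rancher local cluster:
#   kubectl -n fleet-default edit clusters.management.cattle.io c-xxxxx
# Then make spec.eksConfig.nodeGroups match status.appliedSpec.eksConfig.nodeGroups,
# so it only references templates/versions that still exist, e.g.:
spec:
  eksConfig:
    nodeGroups:
      - nodegroupName: workers      # placeholder name
        launchTemplate:
          id: lt-xxxxxxxx           # must be an existing launch template...
          version: 2                # ...and an existing version of it
```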


Thanks! That worked like a charm.

Note for anyone else making these changes: ensure that all changes have been moved from status.appliedSpec.eksConfig.nodeGroups to spec.eksConfig.nodeGroups

Any changes made to node groups after the initial error will be reverted if the spec isn’t up-to-date, which can result in node groups being deleted from EKS.
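To reduce the risk of overlooking a node group before editing, you could diff the two lists first. A rough sketch (field names follow the eksConfig layout shown above; the cluster object here is made-up sample data, while in practice you would fetch it with `kubectl -n fleet-default get clusters.management.cattle.io <cluster-id> -o json`):

```python
# Hypothetical excerpt of a downstream cluster object (placeholder values).
cluster = {
    "spec": {"eksConfig": {"nodeGroups": [
        {"nodegroupName": "workers-v2",
         "launchTemplate": {"id": "lt-aaaa", "version": 4}},
    ]}},
    "status": {"appliedSpec": {"eksConfig": {"nodeGroups": [
        {"nodegroupName": "workers-v1",
         "launchTemplate": {"id": "lt-aaaa", "version": 3}},
    ]}}},
}

def nodegroup_diff(obj):
    """Names only in spec, only in appliedSpec, and in both but with differences."""
    spec = {ng["nodegroupName"]: ng
            for ng in obj["spec"]["eksConfig"]["nodeGroups"]}
    applied = {ng["nodegroupName"]: ng
               for ng in obj["status"]["appliedSpec"]["eksConfig"]["nodeGroups"]}
    only_spec = sorted(set(spec) - set(applied))
    only_applied = sorted(set(applied) - set(spec))
    changed = sorted(n for n in set(spec) & set(applied) if spec[n] != applied[n])
    return only_spec, only_applied, changed

print(nodegroup_diff(cluster))  # (['workers-v2'], ['workers-v1'], [])
```

If all three lists come back empty after your edit, spec and appliedSpec agree and the controller should have nothing inconsistent left to reconcile.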

We have the same issue here, and it has happened three times now where we created a new node group and deleted the old one. So to us it looks like an error that will always happen.

Does anyone know a way to fix this other than editing clusters.management.cattle.io every time, with the fear of losing node groups because something was overlooked?

Using Rancher v2.8.4

It seems that Rancher 2.8.5 fixes this:

  • Fixed an issue where custom secrets encryption configurations were being stored in plaintext under the cluster's AppliedSpec. This was also causing clusters to continuously reconcile, as the AppliedSpec would never match the desired cluster Spec. The information stored here contains the encryption configuration for secrets within etcd, and could potentially expose sensitive data if the etcd database was exposed directly. For more information, see [#45800] and [CVE-2024-22032].

Rancher Release v2.8.5 - Announcements - Rancher Labs