Cluster attempting to use deleted node group/launch template

I recently configured a launch template for a node group in an EKS cluster that is managed by Rancher.

This went through some iterations which required creating a few different node groups in order to get the nodes set up as I wanted.

After cleaning up the old node groups and updating the Rancher user with the necessary permissions, the cluster is now stuck in an updating state with the following error:

Controller.FailureMessage{ClusterName:"", Message_:"Launch template could not be found : Could not find the specified version 3 for the launch template with ID lt-xxxx.", NodegroupName:""}

The launch template still exists, but that version has been deleted. The version was also tied to a node group that no longer exists.

Anyone come across this before? Or any ideas on how to solve this?

EDIT: Looks like the ‘eks-config-operator’ is the culprit:

time="2024-04-16T13:56:59Z" level=error msg="error syncing 'cattle-global-data/c-94tkb': handler eks-controller: error creating nodegroup: InvalidParameterException: Launch template could not be found : Could not find the specified version 3 for the launch template with ID lt-xxxxxxxx.\n{\n RespMetadata: {\n StatusCode: 400,\n RequestID: \"e324c1b9-16ec-4788-af55-c3581719fe15\"\n },\n Message_: \"Launch template could not be found : Could not find the specified version 3 for the launch template with ID lt-xxxxxxxx.\"\n}, requeuing"

It looks like the controller is infinitely requeuing the job. Restarting the operator doesn’t help. Not sure if there’s a way to clear this.
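For anyone who wants to see exactly what the controller is stuck on, here is a minimal sketch that dumps the object named in that log line ('cattle-global-data/c-94tkb'), assuming it is the eksclusterconfigs.eks.cattle.io custom resource and your kubeconfig points at the Rancher local cluster (the group/version/plural names here are my assumptions, so check them against your install):

```python
# Sketch: dump the object the eks-config-operator keeps requeuing, to confirm
# it still references the deleted launch template version.
# Assumptions: the 'cattle-global-data/c-94tkb' key in the log refers to the
# EKSClusterConfig custom resource (eksclusterconfigs.eks.cattle.io) and the
# kubeconfig points at the Rancher local cluster -- adjust names for your setup.
import json
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

cfg = api.get_namespaced_custom_object(
    group="eks.cattle.io", version="v1",
    namespace="cattle-global-data", plural="eksclusterconfigs",
    name="c-94tkb",                      # cluster ID from the operator log
)

# Look for the stale launch template reference in the node group entries.
print(json.dumps(cfg["spec"].get("nodeGroups", []), indent=2))
```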

I am facing a similar issue.

I recently upgraded EKS from 1.26 to 1.27, and a new launch template was created. Rancher is still looking for the old template.

ERROR: Controller.FailureMessage{ClusterName:"", Message_:"Launch template could not be found : The specified launch template, with template ID lt-xxxx, does not exist.", NodegroupName:""}

Also, when I checked Cluster Management, the Kubernetes version was grayed out.

On Rancher 2.7.9, I solved this by manually patching the clusters.management.cattle.io object for the downstream cluster on the Rancher local instance (namespace fleet-default).

I replaced the inconsistent data under

spec:
  eksConfig:
    nodeGroups:

with the correct data coming from the status field of the same object:

status:
  appliedSpec:
    eksConfig:
      nodeGroups:
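For reference, a minimal sketch of that manual patch with the kubernetes Python client, run against the Rancher local cluster. The scope and names are assumptions: on the setups I have seen, the management.cattle.io/v3 Cluster object is cluster-scoped and named with the downstream cluster ID (c-xxxxx); if yours is namespaced (e.g. under fleet-default), switch to the namespaced variants of these calls.

```python
# Sketch of the manual patch described above, using the kubernetes Python
# client against the Rancher *local* cluster. Assumptions: the
# management.cattle.io/v3 Cluster object is cluster-scoped and named with the
# downstream cluster ID (c-xxxxx); verify the scope and names on your install.
from kubernetes import client, config

config.load_kube_config()                  # kubeconfig for the Rancher local cluster
api = client.CustomObjectsApi()

name = "c-94tkb"                           # downstream cluster ID from the error

cluster = api.get_cluster_custom_object(
    group="management.cattle.io", version="v3", plural="clusters", name=name,
)

# Copy the last successfully applied node group list back over the spec,
# so the operator stops trying to create node groups from deleted templates.
applied = cluster["status"]["appliedSpec"]["eksConfig"]["nodeGroups"]
cluster["spec"]["eksConfig"]["nodeGroups"] = applied

api.replace_cluster_custom_object(
    group="management.cattle.io", version="v3", plural="clusters",
    name=name, body=cluster,
)
```

Replacing the whole fetched object keeps its metadata.resourceVersion, so the call fails loudly on a conflict instead of silently clobbering a concurrent update.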


Thanks! That worked like a charm.

Note for anyone else making these changes: ensure that everything has been copied over from status.appliedSpec.eksConfig.nodeGroups to spec.eksConfig.nodeGroups.

Any changes made to node groups after the initial error will be reverted if the spec isn’t up-to-date, which can result in node groups being deleted from EKS.
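A quick way to verify that before the operator reconciles again is to diff the two fields; a sketch under the same assumptions as above (cluster-scoped Cluster object, cluster ID c-xxxxx):

```python
# Sketch: diff spec.eksConfig.nodeGroups against status.appliedSpec.eksConfig.nodeGroups
# so nothing is silently reverted (and no node group gets deleted from EKS).
# Same assumptions as above: cluster-scoped management.cattle.io/v3 Cluster,
# kubeconfig pointing at the Rancher local cluster, cluster ID c-xxxxx.
import difflib
import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

cluster = api.get_cluster_custom_object(
    group="management.cattle.io", version="v3", plural="clusters", name="c-94tkb",
)

spec_ng = yaml.safe_dump(cluster["spec"]["eksConfig"]["nodeGroups"], sort_keys=True)
applied_ng = yaml.safe_dump(
    cluster["status"]["appliedSpec"]["eksConfig"]["nodeGroups"], sort_keys=True
)

# An empty diff means the spec matches what was last applied successfully.
for line in difflib.unified_diff(
    applied_ng.splitlines(), spec_ng.splitlines(),
    fromfile="status.appliedSpec", tofile="spec", lineterm="",
):
    print(line)
```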