Longhorn helm upgrade to 1.7.1 stuck

My upgrade job in the kube-system namespace is stuck with the log messages below. The pods in the longhorn-system namespace still look intact. However, the failed upgrade has somehow degraded Longhorn, and many PVCs no longer work.

Maybe someone knows a proper way out of this situation? This is not a production k3s installation, but I invested some time in getting it to where it was and would like not to lose the data stored in Longhorn. It’s a 3-server, 3-agent cluster and I am using kube-hetzner.

if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
    echo "KUBERNETES_SERVICE_HOST is using IPv6"
    CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
    CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi
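For reference, the snippet above is plain bash pattern substitution (`${var//pattern/replacement}`) that injects the API server endpoint into the chart URL. A standalone sketch with hypothetical values (the host, port, and chart URL here are made up for illustration; in the real job they come from the pod's environment):

```shell
#!/usr/bin/env bash
# Hypothetical inputs; the real job gets these from the pod environment.
KUBERNETES_SERVICE_HOST="fd00::1"
KUBERNETES_SERVICE_PORT="6443"
CHART="https://%{KUBERNETES_API}%/static/charts/longhorn.tgz"

if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
    # IPv6 addresses must be bracketed inside URLs
    CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
    CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi
echo "${CHART}"
```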

set +v -x
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ false == \t\r\u\e ]]
+ [[ false == \t\r\u\e ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/longhorn.tgz.base64
+ CHART_PATH=/tmp/longhorn.tgz
+ [[ ! -f /chart/longhorn.tgz.base64 ]]
+ return
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ longhorn/longhorn == stable/* ]]
+ [[ -n https://charts.longhorn.io ]]
+ [[ -f /auth/username ]]
+ [[ -f /auth/tls.crt ]]
+ helm repo add longhorn https://charts.longhorn.io
"longhorn" already exists with the same configuration, skipping
+ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "longhorn" chart repository
Update Complete. ⎈Happy Helming!⎈
+ helm_update install --namespace longhorn-system --version '*'
++ ++ jq -r '"\(.[0].chart),\(.[0].status)"'
helm ls --all -f '^longhorn$' --namespace longhorn-system --output json
++ tr '' ''
+ LINE=longhorn-1.7.1,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ longhorn-1.7.1 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback|uninstalling)$ ]]
Previous helm job was interrupted, updating status from uninstalling to failed
+ echo Previous helm job was interrupted, updating status from uninstalling to failed
+ echo 'Resetting helm release status from '\''uninstalling'\'' to '\''failed'\'''
+ helm set-status longhorn failed --namespace longhorn-system
2024/09/24 05:12:14 release longhorn status updated
+ [[ uninstalling == \p\e\n\d\i\n\g\-\u\p\g\r\a\d\e ]]
+ STATUS=failed
+ [[ failed =~ ^deployed$ ]]
+ [[ failed =~ ^(deleted|failed|null|unknown)$ ]]
+ [[ reinstall == \r\e\i\n\s\t\a\l\l ]]
+ echo 'Uninstalling failed helm chart'
+ helm uninstall longhorn --namespace longhorn-system --wait
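The decisive part of the trace is where the job parses the `helm ls` output and finds the release stuck in `uninstalling`. A minimal reproduction of that parsing and status reset (simplified from the trace, not the actual job script):

```shell
#!/usr/bin/env bash
# The job joins chart name and status with a comma, then splits them apart.
LINE="longhorn-1.7.1,uninstalling"   # the value seen in the trace above

IFS=, read -r INSTALLED_VERSION STATUS _ <<< "${LINE}"

# A release stuck in a transitional state means the previous job was
# interrupted; it is reset to "failed" so the failure path can run.
if [[ ${STATUS} =~ ^(pending-install|pending-upgrade|pending-rollback|uninstalling)$ ]]; then
    echo "Previous helm job was interrupted, updating status from ${STATUS} to failed"
    STATUS=failed
fi
echo "${INSTALLED_VERSION},${STATUS}"
```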

This log looks weird… Had you tried to uninstall Longhorn before this upgrade? Or was the upgrade somehow trying to uninstall and then re-install with the new version? Could you describe in more detail how you performed the upgrade?

Thanks, I was finally able to get 1.7.1 to install, but only by using

kubectl edit settings.longhorn.io deleting-confirmation-flag -n longhorn-system

That, as I suspected, removed all volumes, but at least it produced a working Longhorn system.

The upgrade was initiated by k3s, or maybe by something that the kube-hetzner/terraform-hcloud-kube-hetzner project installs - it is quite the black box for me, and I only activated Longhorn in the project’s kube.tf file.

I still have replicas in /var/longhorn but cannot mount them; they show numerous file system errors. I was able to recover some of my data, and since this is a hobby project, not much was lost. It seems like good advice, however, to back up Longhorn data before installing a new release - which I did not do.

We encountered a similar issue before, caused by the helm controller’s FAILURE_POLICY: reinstall. I suspect this is also the culprit in your case, since kube-hetzner uses the Helm controller built into k3s as well. We are investigating how we can improve this.
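For context, the behavior I mean looks roughly like this (a simplified sketch of the failure-policy decision, not the actual klipper-helm code; the variable names mirror the trace above): with FAILURE_POLICY set to reinstall, a release in a failed state is uninstalled and installed fresh instead of being upgraded in place - which is what deletes the Longhorn resources.

```shell
#!/usr/bin/env bash
# Simplified illustration of the failure handling visible in the trace;
# assumes FAILURE_POLICY=reinstall, as in the affected setups.
FAILURE_POLICY=reinstall
STATUS=failed                 # the status after the reset in the trace
ACTION=upgrade

if [[ ${STATUS} =~ ^(deleted|failed|null|unknown)$ ]]; then
    if [[ ${FAILURE_POLICY} == reinstall ]]; then
        # Instead of upgrading in place, the chart is removed and reinstalled.
        ACTION="uninstall-then-install"
    fi
fi
echo "${ACTION}"
```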

Ref the Longhorn ticket: “[FEATURE] Support/verify installing Longhorn Helm chart using helm-controller which is built in k3s and rke2” (longhorn/longhorn#9506 on GitHub)


I’m guessing this crappy forum website software did something to mangle what you posted.
When you ran the kubectl edit command, what did you change? Did you delete the block of YAML that contained “deleting-confirmation-flag”?