Longhorn UI, Volume stuck in deleting

I have a Longhorn volume that the UI shows in the “deleting” state.
Repeated attempts to delete it have no effect; it stays stuck in this state.
I have already removed the corresponding PV/PVC from k8s, but the Longhorn volume is still listed as “deleting” in the UI.

Background info regarding this volume:

  • Initially the volume was a valid PV/PVC following the standard naming convention “pvc-*”
  • At some point we had DiskPressure issues on a node, rendering that volume broken because re-scheduling the longhorn replicas failed due to insufficient resources on other nodes.
  • Restored the broken volume from a backup using the Longhorn UI, but under a different volume name that does not follow the “pvc-*” pattern.
  • Afterwards that Longhorn volume (PV/PVC) was usable by the corresponding POD again.
  • The DiskPressure issue reappeared, and the Longhorn volume broke again due to the resulting POD eviction.
  • Afterwards that longhorn volume was shown in UI as “Detached” even though the rescheduled POD successfully accessed PV/PVC
  • Downscaled corresponding deployment, manually removed PV/PVC.
  • Any attempt to remove the volume via longhorn UI failed, leaving it in “deleting” state.

Any hint on how to get it removed is appreciated!
What can I do?
Where should I gather more debug info to see what the problem is?

Hi @manuel-koch

Can you check the log of Longhorn manager and search for the stuck volume name? It should provide some clue.

Also, which version of Longhorn are you using?

We are using Longhorn 0.7.0.

The name of the stuck Longhorn volume is “sfpl-prod-file-service”.
The “instance-manager-r-af5c615a” POD mentioned in the logs is no longer running; I guess it was evicted when the DiskPressure happened.
The longhorn-manager logs start with the snippet below, and the last errors are repeated over and over again:

time="2020-02-24T16:16:55Z" level=info msg="Start overwriting built-in settings with customized values"
time="2020-02-24T16:16:55Z" level=debug msg="Engine image longhornio/longhorn-engine:v0.7.0 is ready"
time="2020-02-24T16:16:55Z" level=info msg="Listening on 10.42.4.140:9500"
time="2020-02-24T16:16:55Z" level=info msg="Start Longhorn node controller"
time="2020-02-24T16:16:55Z" level=info msg="Start Longhorn replica controller"
time="2020-02-24T16:16:55Z" level=info msg="Start Longhorn engine controller"
time="2020-02-24T16:16:55Z" level=info msg="Start Longhorn websocket controller"
time="2020-02-24T16:16:55Z" level=info msg="Start Longhorn volume controller"
time="2020-02-24T16:16:55Z" level=info msg="Start Longhorn Engine Image controller"
time="2020-02-24T16:16:55Z" level=info msg="Starting Longhorn instance manager controller"
time="2020-02-24T16:16:55Z" level=info msg="Start Longhorn Setting controller"
time="2020-02-24T16:16:55Z" level=info msg="Start kubernetes controller"
time="2020-02-24T16:16:56Z" level=debug msg="Start monitoring pvc-1a6b2d60-2c86-437f-bad4-6c3e01abab4b-e-38079260"
time="2020-02-24T16:16:56Z" level=debug msg="Start monitoring instance manager instance-manager-r-33fd6639"
time="2020-02-24T16:16:56Z" level=warning msg="Error syncing Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
time="2020-02-24T16:16:56Z" level=debug msg="Start monitoring pvc-e580989d-01e3-4b58-bf8b-6fee72b63762-e-15b99770"
time="2020-02-24T16:16:56Z" level=debug msg="Start backup store monitoring for s3://longhorn-backup@us-east-1/"
time="2020-02-24T16:16:56Z" level=debug msg="Start monitoring instance manager instance-manager-e-5f078c81"
time="2020-02-24T16:16:56Z" level=warning msg="Error syncing Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
E0224 16:16:56.073734       1 replica_controller.go:178] fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io "instance-manager-r-af5c615a" not found
time="2020-02-24T16:16:56Z" level=warning msg="Dropping Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022 out of the queue: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
time="2020-02-24T16:17:25Z" level=warning msg="Error syncing Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
time="2020-02-24T16:17:25Z" level=warning msg="Error syncing Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
E0224 16:17:25.795593       1 replica_controller.go:178] fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io "instance-manager-r-af5c615a" not found
time="2020-02-24T16:17:25Z" level=warning msg="Dropping Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022 out of the queue: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
.....
E0225 07:23:56.705452       1 replica_controller.go:178] fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io "instance-manager-r-af5c615a" not found
time="2020-02-25T07:23:56Z" level=warning msg="Dropping Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022 out of the queue: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
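
Since the same error repeats for hours, it may help to condense the log before reading it. Below is a minimal sketch, not part of Longhorn, that filters saved longhorn-manager log lines for the stuck volume name, counts the matching lines per log level, and collects the names of resources reported as "not found". The function name `summarize`, the `SAMPLE` variable, and the "other" bucket for klog-style lines (the ones starting with `E0224 ...`) are my own choices for illustration; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# logrus-style lines as shown above: time="..." level=... msg="..."
LINE_RE = re.compile(r'time="([^"]+)" level=(\w+) msg="(.*)"$')
# Name of a resource reported as missing; the quotes may or may not be backslash-escaped.
MISSING_RE = re.compile(r'\\?"([\w.-]+)\\?" not found')

def summarize(lines, volume):
    """Return (lines per level, missing resource names) for lines mentioning `volume`."""
    levels = Counter()
    missing = set()
    for line in lines:
        if volume not in line:
            continue  # unrelated to the stuck volume
        m = LINE_RE.match(line.strip())
        # Lines in another format (e.g. klog "E0224 ...") are counted as "other".
        levels[m.group(2) if m else "other"] += 1
        missing.update(MISSING_RE.findall(line))
    return levels, missing

# A few lines copied verbatim from the log above (raw string keeps the backslashes).
SAMPLE = r'''
time="2020-02-24T16:16:56Z" level=debug msg="Start monitoring instance manager instance-manager-r-33fd6639"
time="2020-02-24T16:16:56Z" level=warning msg="Error syncing Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
E0224 16:16:56.073734       1 replica_controller.go:178] fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io "instance-manager-r-af5c615a" not found
time="2020-02-24T16:16:56Z" level=warning msg="Dropping Longhorn replica longhorn-system/sfpl-prod-file-service-r-720aa022 out of the queue: fail to sync replica for longhorn-system/sfpl-prod-file-service-r-720aa022: failed to cleanup the related replica process before deleting replica sfpl-prod-file-service-r-720aa022: instancemanager.longhorn.io \"instance-manager-r-af5c615a\" not found"
'''

if __name__ == "__main__":
    levels, missing = summarize(SAMPLE.splitlines(), "sfpl-prod-file-service")
    print(dict(levels))
    print(sorted(missing))
```

On these lines the only missing resource it reports is the instance manager, which matches the observation that the replica cleanup keeps failing because that instance manager no longer exists.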

@manuel-koch

Can you file a ticket for this issue? Also, if you can send us a support bundle (see the UI footer), we can look more into it.

I created issue https://github.com/longhorn/longhorn/issues/1071