I've got several pods that are stuck in the Removing state… How do I debug and remove them?
Can you use kubectl describe on the pods to get some details on them?
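Along those lines, if you want a quick overview of everything that is stuck (just a grep sketch, since Terminating isn't a phase you can filter on directly):
kubectl get pods --all-namespaces | grep Terminating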
./rancher kubectl describe pod hub-857dc8f9f4-j9nqt --namespace platform
Start Time: Sat, 09 Jun 2018 09:28:01 -0400
Status: Terminating (lasts 4d)
Termination Grace Period: 30s
Controlled By: ReplicaSet/hub-857dc8f9f4
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Restart Count: 0
Environment Variables from:
staging-secrets Secret Optional: false
/etc/northpage from staging-secrets (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-c74p6 (ro)
Type: Secret (a volume populated by a Secret)
Type: Secret (a volume populated by a Secret)
QoS Class: BestEffort
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
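Since the describe output above shows the pod has been Terminating for days, it might also be worth checking which node it was scheduled on and whether that node (and its kubelet/Docker) is still healthy. A sketch, reusing the same pod name and namespace as above:
./rancher kubectl get pod hub-857dc8f9f4-j9nqt --namespace platform -o wide
./rancher kubectl get nodes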
Was there ever any resolution to this? I’m having a similar issue; I have two pods that are stuck in the “Terminating” state and I can’t figure out how to get rid of them. The one thing they both have in common is that they both tried (and failed) to mount a secret that didn’t exist as a volume; I made a typo when creating the secret and didn’t notice it. Here’s the status for one of them:
Name:                      catbot-tunnel-7d65bcbc4f-2v5dt
Namespace:                 default
Node:                      ares/184.108.40.206
Start Time:                Mon, 02 Jul 2018 14:30:15 -0500
Labels:                    io.kompose.service=catbot-tunnel
                           pod-template-hash=3821676709
Annotations:               <none>
Status:                    Terminating (lasts 2h)
Termination Grace Period:  0s
IP:
Controlled By:             ReplicaSet/catbot-tunnel-7d65bcbc4f
Containers:
  catbot-tunnel:
    Container ID:
    Image:          jnovack/autossh
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      SSH_HOSTUSER:  omitted
    Mounts:
      /id_rsa from key (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4pc7s (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  key
    Optional:    false
  default-token-4pc7s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4pc7s
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
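For anyone hitting the same missing-secret situation, it might be worth confirming whether the secret the volume references actually exists (the SecretName above is key, so something like):
kubectl get secret key --namespace default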
No resolution, so I just moved on. If I remember correctly I was playing with secrets as well. Sorry I couldn't be more help.
Just in case anybody else comes along, I was able to remove them by first force-removing the pods:
kubectl delete pods/catbot-tunnel-2-798675b7c5-dnw8r --grace-period=0 --force
This did not actually remove them, but instead it caused them to get stuck waiting on a foregroundDeletion event. I used kubectl edit to remove this from their definitions:
finalizers:
- foregroundDeletion
Then they finally died (and hopefully haven’t left any open resources in the system that I’m not aware of).
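As an aside, if you'd rather not open an editor, the same finalizer removal should also be possible with a patch. Just a sketch, using the pod name from above:
kubectl patch pod catbot-tunnel-2-798675b7c5-dnw8r -p '{"metadata":{"finalizers":null}}'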
We're experiencing the same issue. What I've found so far is that it seems to be related to Docker; after we downgraded from 17.03.2 to 1.12.6, the situation became much more stable.
Same issue here, but also appearing with 17.03.2-ce (we too did a downgrade, from 17.12.1 to 17.03.2-ce to be precise). Rancher 2.0.8, Docker 17.12.1-0ubuntu1
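If anyone else wants to compare Docker versions across their nodes, kubectl can show the container runtime per node (a sketch; look at the CONTAINER-RUNTIME column):
kubectl get nodes -o wide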
The container/pod has been in “Removing” state for the past days. When I use kubectl to get the current status:
$ kubectl describe pod importer-8bf85dcc9-r5rtn --namespace gamma --insecure-skip-tls-verify=true
Name:                      importer-8bf85dcc9-r5rtn
Namespace:                 gamma
Node:                      redacted/redacted
Start Time:                Tue, 18 Sep 2018 16:03:59 +0200
Labels:                    pod-template-hash=469418775
                           workload.user.cattle.io/workloadselector=deployment-gamma-importer
Annotations:               cni.projectcalico.org/podIP=10.42.1.39/32
Status:                    Terminating (expires Tue, 18 Sep 2018 16:09:11 +0200)
Termination Grace Period:  30s
IP:                        10.42.1.39
Controllers:               <none>
Containers:
  importer:
    Container ID:   docker://05b93ed9018854067b5ec63ef4929b512cd9f9f2306f9e0ff67ea6ee06478c1b
    Image:          redacted/importer:stage-461
    Image ID:       docker-pullable://redacted/importer@sha256:fa93b2f3359ce7b72823292bbfc2bfb493e912cbf28d87b976ecdeadb6ba3ca7
    Port:
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
Note the Status: Terminating (expires Tue, 18 Sep 2018 16:09:11 +0200). It’s September 20th today…
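To see what the API server still records for the pod's deletion, a jsonpath query along these lines (a sketch, reusing the pod and namespace from above) should print the deletion timestamp and grace period:
kubectl get pod importer-8bf85dcc9-r5rtn --namespace gamma --insecure-skip-tls-verify=true -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.deletionGracePeriodSeconds}{"\n"}'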
I tried to force a deletion using kubectl like @minneyar described:
$ kubectl delete pod importer-8bf85dcc9-r5rtn --now --force --namespace gamma --insecure-skip-tls-verify=true
pod "importer-8bf85dcc9-r5rtn" deleted
But the pod is still shown in Rancher2 UI (Removing) and the pod still shows up when I use the same kubectl describe command from above…
I could not find that information about “finalizers”. What exactly did you edit, @minneyar?
I feel like what I edited was in the pod metadata, but I don’t exactly remember and I actually haven’t used rancher in a while now. Here’s a little bit of documentation on that metadata, though: https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#controlling-how-the-garbage-collector-deletes-dependents
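For what it's worth, the finalizers (if any) live under the pod's metadata, so something like this should show them (a sketch with placeholder names):
kubectl get pod <pod-name> --namespace <namespace> -o jsonpath='{.metadata.finalizers}'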
OK I will take a look at that.
In the meantime I took advantage of the pod being stuck in the "Removing" state to extend the check_rancher2 monitoring plugin (found here: https://github.com/Napsty/check_rancher2). It will now raise an alert when a pod in a non-running state is found within a project:
./check_rancher2.sh -H myrancher2.example.com -U token-XXXXX -P ootaefomai7eeseyoopeeghooxoor1iuvie0Ohvahph5ahrui5Ailee -S -t pod -p c-xxxxx:p-xxxxx
CHECK_RANCHER2 CRITICAL - Pod "importer-8bf85dcc9-r5rtn" is removing -|'pods_total'=8;;;; 'pods_errors'=1;;;;
Thank you @minneyar
kubectl delete pods/podname --grace-period=0 --force
This now worked for me too! It seems --grace-period=0 did the trick (I had tried it with --now before).
The monitoring plugin now returns OK:
$ ./check_rancher2.sh -H myrancher2.example.com -U token-XXXXX -P ootaefomai7eeseyoopeeghooxoor1iuvie0Ohvahph5ahrui5Ailee -S -t pod -p c-xxxxx:p-xxxxx
CHECK_RANCHER2 OK - All pods (7) in project c-xxxxx:p-xxxxx are running|'pods_total'=7;;;; 'pod_errors'=0;;;;