Pods stuck in "Removing" state

I've got several pods that are stuck in the "Removing" state… How do I debug and remove them?

Can you use kubectl describe on the pods to get some details on them?

./rancher kubectl describe pod hub-857dc8f9f4-j9nqt --namespace platform
Name:                      hub-857dc8f9f4-j9nqt
Namespace:                 platform
Node:                      ip-10-1-10-190/10.1.10.190
Start Time:                Sat, 09 Jun 2018 09:28:01 -0400
Labels:                    pod-template-hash=4138749590
                           workload.user.cattle.io/workloadselector=deployment-platform-hub
                           workloadID_ingress-4d741ade6d2bdf8f43d7a491eb645b06=true
                           workloadID_ingress-5cf5f9db88fc56cc41f0def7f145ae65=true
                           workloadID_ingress-c84e860e5a186ef0532c749a7ff8dbc6=true
Annotations:               field.cattle.io/publicEndpoints=[{"addresses":["ip"],"port":30868,"protocol":"TCP","serviceName":"platform:ingress-4d741ade6d2bdf8f43d7a491eb645b06","allNodes":true},{"addresses":["ip
Status:                    Terminating (lasts 4d)
Termination Grace Period:  30s
IP:
Controlled By:             ReplicaSet/hub-857dc8f9f4
Containers:
  hub:
    Container ID:
    Image:          some.url:5000/hub:test2-0982d24dcbc2b26e16bb8500f4f100efae7c45cc
    Image ID:
    Port:
    Host Port:
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      staging-secrets  Secret  Optional: false
    Environment:
    Mounts:
      /etc/northpage from staging-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c74p6 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  staging-secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  staging-secrets
    Optional:    false
  default-token-c74p6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c74p6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:

Was there ever any resolution to this? I’m having a similar issue; I have two pods that are stuck in the “Terminating” state and I can’t figure out how to get rid of them. The one thing they both have in common is that they both tried (and failed) to mount a secret that didn’t exist as a volume; I made a typo when creating the secret and didn’t notice it. Here’s the status for one of them:

Name:                      catbot-tunnel-7d65bcbc4f-2v5dt
Namespace:                 default
Node:                      ares/129.162.199.31
Start Time:                Mon, 02 Jul 2018 14:30:15 -0500
Labels:                    io.kompose.service=catbot-tunnel
                           pod-template-hash=3821676709
Annotations:               <none>
Status:                    Terminating (lasts 2h)
Termination Grace Period:  0s
IP:                        
Controlled By:             ReplicaSet/catbot-tunnel-7d65bcbc4f
Containers:
  catbot-tunnel:
    Container ID:   
    Image:          jnovack/autossh
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Exit Code:    0
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      SSH_HOSTUSER:       omitted
    Mounts:
      /id_rsa from key (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4pc7s (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  key
    Optional:    false
  default-token-4pc7s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4pc7s
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
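A quick way to confirm the missing-secret theory, using the names from the describe output above, is to check whether the referenced secret actually exists (this only reads the object, it changes nothing):

kubectl get secret key --namespace default

If that reports the secret was not found, the typo'd name means the volume mount could never have succeeded.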

No resolution; I just moved on. If I remember correctly, I was playing with secrets as well. Sorry I couldn't be more help.

Just in case anybody else comes along, I was able to remove them by first force-removing the pods:

kubectl delete pods/catbot-tunnel-2-798675b7c5-dnw8r --grace-period=0 --force

This did not actually remove them; instead it left them stuck waiting on the foregroundDeletion finalizer. I used kubectl edit to remove this from their definitions:

finalizers:
- foregroundDeletion

Then they finally died (and hopefully haven’t left any open resources in the system that I’m not aware of).
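For anyone who hits the same thing later: instead of an interactive kubectl edit, the same finalizer removal can be scripted with a metadata patch (a sketch, reusing the pod name from the force-delete above; adjust name and namespace as needed):

kubectl patch pod catbot-tunnel-2-798675b7c5-dnw8r -p '{"metadata":{"finalizers":null}}' --type=merge

Clearing the finalizers list lets the API server complete the deletion that was blocked on foregroundDeletion.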


We're experiencing the same issue. What I've found so far is that it is somewhat related to Docker; after we downgraded from 17.03.2 to 1.12.6, the situation became much more stable.

Same issue here, but also appearing with 17.03.2-ce (we too did a downgrade, from 17.12.1 to 17.03.2 to be precise). Rancher 2.0.8, Docker 17.12.1-0ubuntu1.

The container/pod has been in the "Removing" state for the past few days. When I use kubectl to get the current status:

$ kubectl describe pod importer-8bf85dcc9-r5rtn --namespace gamma --insecure-skip-tls-verify=true
Name:				importer-8bf85dcc9-r5rtn
Namespace:			gamma
Node:				redacted/redacted
Start Time:			Tue, 18 Sep 2018 16:03:59 +0200
Labels:				pod-template-hash=469418775
				workload.user.cattle.io/workloadselector=deployment-gamma-importer
Annotations:			cni.projectcalico.org/podIP=10.42.1.39/32
Status:				Terminating (expires Tue, 18 Sep 2018 16:09:11 +0200)
Termination Grace Period:	30s
IP:				10.42.1.39
Controllers:			<none>
Containers:
  importer:
    Container ID:	docker://05b93ed9018854067b5ec63ef4929b512cd9f9f2306f9e0ff67ea6ee06478c1b
    Image:		redacted/importer:stage-461
    Image ID:		docker-pullable://redacted/importer@sha256:fa93b2f3359ce7b72823292bbfc2bfb493e912cbf28d87b976ecdeadb6ba3ca7
    Port:		
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0

Note the Status: Terminating (expires Tue, 18 Sep 2018 16:09:11 +0200). It’s September 20th today…
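(For the record, the deletion timestamp the API server is still holding on the object can be read directly; the jsonpath below is just one way to print that single field:)

kubectl get pod importer-8bf85dcc9-r5rtn --namespace gamma -o jsonpath='{.metadata.deletionTimestamp}' --insecure-skip-tls-verify=true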

I tried to force a deletion using kubectl like @minneyar described:

$ kubectl delete pod importer-8bf85dcc9-r5rtn --now --force --namespace gamma --insecure-skip-tls-verify=true
pod "importer-8bf85dcc9-r5rtn" deleted

But the pod is still shown in the Rancher 2 UI (Removing), and it still shows up when I use the same kubectl describe command from above…

I could not find that information about “finalizers”. What exactly did you edit, @minneyar?

I feel like what I edited was in the pod metadata, but I don't remember exactly, and I actually haven't used Rancher in a while now. Here's a little bit of documentation on that metadata, though: https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#controlling-how-the-garbage-collector-deletes-dependents
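Before editing anything, it can help to check whether the stuck pod actually carries any finalizers at all (jsonpath prints just that field; substitute your own pod name and namespace):

kubectl get pod importer-8bf85dcc9-r5rtn --namespace gamma -o jsonpath='{.metadata.finalizers}'

If the output is empty, there is no finalizer to remove and the pod is stuck for another reason, e.g. the kubelet or container runtime never confirming the termination.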

OK I will take a look at that.

In the meantime I "profited" from the pod being stuck in the "Removing" state to extend the check_rancher2 monitoring plugin (it can be found here: https://github.com/Napsty/check_rancher2). It now alerts when a pod in a non-running state is found within a project:

./check_rancher2.sh -H myrancher2.example.com -U token-XXXXX -P ootaefomai7eeseyoopeeghooxoor1iuvie0Ohvahph5ahrui5Ailee -S -t pod -p c-xxxxx:p-xxxxx
CHECK_RANCHER2 CRITICAL - Pod "importer-8bf85dcc9-r5rtn" is removing -|'pods_total'=8;;;; 'pods_errors'=1;;;;

Thank you @minneyar

kubectl delete pods/podname --grace-period=0 --force

That worked for me too! It seems --grace-period=0 did the trick (I had tried it with --now before).

The monitoring plugin now returns OK:

$ ./check_rancher2.sh -H myrancher2.example.com -U token-XXXXX -P ootaefomai7eeseyoopeeghooxoor1iuvie0Ohvahph5ahrui5Ailee -S -t pod -p c-xxxxx:p-xxxxx
CHECK_RANCHER2 OK - All pods (7) in project c-xxxxx:p-xxxxx are running|'pods_total'=7;;;; 'pod_errors'=0;;;;