SOLVED: How to remove "Delayed" processes which have been stuck for months?

I need advice to remove some hung Rancher Processes from my Rancher Server, possibly by removing them from the MySQL Schema.

My production Rancher cluster shows 23 processes which have been “Delayed” since either 12/1/2016 or 01/05/2017. The Rancher process name is instance.purge. They are listed at https://rancher.example.org/admin/processes/list?which=delayed

The Rancher Server logs are filled with thousands of errors and Java stack traces like these:

{"log":"2017-03-31 19:41:26,121 ERROR [:] [] [] [] [ecutorService-1] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [instance.purge:1247761] on [9669], canceled by [State [activating] is not valid for process [instancehostmap.remove:null] on resource [9106]] \n","stream":"stdout","time":"2017-03-31T19:41:26.121741677Z"}
{"log":"2017-03-31 19:41:26,124 ERROR [:] [] [] [] [cutorService-14] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [instance.purge:11181739] on [204711], canceled by [State [activating] is not valid for process [instancehostmap.remove:null] on resource [202811]] \n","stream":"stdout","time":"2017-03-31T19:41:26.124572917Z"}
{"log":"2017-03-31 19:41:26,127 ERROR [:] [] [] [] [cutorService-12] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [instance.purge:11184576] on [204754], canceled by [State [activating] is not valid for process [instancehostmap.remove:null] on resource [202867]] \n","stream":"stdout","time":"2017-03-31T19:41:26.128013789Z"}
{"log":"2017-03-31 19:41:41,112 ERROR [657ae095-4d79-4e8f-84a8-b0bb08137e75:11180638] [instance:204677] [instance.purge] [] [ecutorService-7] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception java.lang.IllegalStateException: Attempt to cancel when process is still transitioning\n","stream":"stdout","time":"2017-03-31T19:41:41.11392176Z"}
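With thousands of these lines, it can help to reduce the log to a list of the affected process and resource IDs before deciding what to clean up. The following is a minimal Python sketch, assuming the log is in the Docker JSON-file format shown above (one JSON object per line with a "log" field). The function name stuck_processes and the regular expression are my own illustration, written against the error text in this excerpt only; other Rancher versions may word the message differently.

```python
import json
import re
from collections import Counter

# Matches the dispatcher error text seen in the log excerpt above.
# The pattern is an assumption based on those lines only.
PATTERN = re.compile(
    r"Unknown exception running process "
    r"\[(?P<name>[\w.]+):(?P<pid>\d+)\] on \[(?P<resource>\d+)\]"
)


def stuck_processes(lines):
    """Yield (process_name, process_id, resource_id) for each matching line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)["log"]
        except (ValueError, KeyError, TypeError):
            # Not a JSON log record, or it has no "log" field: skip it.
            continue
        match = PATTERN.search(message)
        if match:
            yield match["name"], int(match["pid"]), int(match["resource"])


# Two shortened lines in the same shape as the excerpt above.
sample = [
    r'{"log":"2017-03-31 19:41:26,121 ERROR Unknown exception running process [instance.purge:1247761] on [9669], canceled by [...] \n","stream":"stdout"}',
    r'{"log":"2017-03-31 19:41:26,124 ERROR Unknown exception running process [instance.purge:11181739] on [204711], canceled by [...] \n","stream":"stdout"}',
]

found = list(stuck_processes(sample))
per_process = Counter(name for name, _, _ in found)

for name, pid, resource in found:
    print(f"{name} process {pid} on resource {resource}")
```

In practice the lines would come from the container's log file rather than a hard-coded list, for example by passing an open file object to stuck_processes.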

If I view the host/instance ID in the API, the purge button is not clickable. If I use the ‘Delete’ button on that screen, Rancher seems to return an error message under “HTTP Response:”, and the process is not removed.

{
"id": "3322515d-5452-441c-90eb-1d3541c605d5",
"type": "error",
"links": { },
"actions": { },
"status": 409,
"code": "Conflict",
"message": "Conflict",
"detail": null,
"baseType": "error"
}

I suppose I could remove these processes from the database. Is there a clear procedure for doing that? Is the MySQL schema documented? I could try deleting the rows from process_instance (after backing up the schema), but I’m unclear about which other tables are involved.

DELETE FROM process_instance WHERE process_name LIKE "instance.purge" AND start_time LIKE "2016-12-01%";
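Before running a DELETE like the one above, it may be worth previewing how many rows match and refusing to delete if the count is higher than expected. Here is a rough Python sketch of that pattern. It uses an in-memory sqlite3 database as a stand-in for MySQL, and a made-up three-column table: only the table name process_instance and the columns process_name and start_time come from the query above, since the real Rancher schema is not documented in this thread. The helper purge_stuck and its expected_max parameter are hypothetical. It also uses an explicit date range instead of LIKE on the timestamp.

```python
import sqlite3


def purge_stuck(conn, process_name, day_start, day_end, expected_max):
    """Delete matching rows only if the preview count is within expected_max."""
    where = "process_name = ? AND start_time >= ? AND start_time < ?"
    params = (process_name, day_start, day_end)

    # Preview: count the rows the DELETE would touch.
    count = conn.execute(
        f"SELECT COUNT(*) FROM process_instance WHERE {where}", params
    ).fetchone()[0]
    if count > expected_max:
        raise RuntimeError(f"{count} rows match, expected at most {expected_max}")

    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute(f"DELETE FROM process_instance WHERE {where}", params)
    return count


# Stand-in table for illustration only; not the real Rancher schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE process_instance (id INTEGER PRIMARY KEY, process_name TEXT, start_time TEXT)"
)
conn.executemany(
    "INSERT INTO process_instance (process_name, start_time) VALUES (?, ?)",
    [
        ("instance.purge", "2016-12-01 10:00:00"),
        ("instance.purge", "2017-01-05 08:30:00"),
        ("instance.start", "2016-12-01 11:00:00"),
    ],
)
conn.commit()

deleted = purge_stuck(conn, "instance.purge", "2016-12-01", "2016-12-02", expected_max=23)
remaining = conn.execute("SELECT COUNT(*) FROM process_instance").fetchone()[0]
print(f"deleted {deleted} row(s), {remaining} remaining")
```

This only illustrates the preview-then-delete idea. Whether deleting rows from process_instance alone is safe on a real Rancher database is exactly the open question in this post.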

For the record, this post is an attempt to fix my issue reported on GitHub as rancher/rancher issue #8316, Dozens of processes named "instance.purge" have been "delayed" since 12/1/2016 and 1/5/2017.


I could really use some advice here. Anyone have any information about this?

FYI, I managed to fix this by running docker run --rm rancher/cleanup-1-1:v0.1.2 and upgrading Rancher from 1.4.2 to 1.5.7 a few days later.

I don’t understand why this would have fixed it, but details are in https://github.com/rancher/rancher/issues/8316 .

When you ran the rancher/cleanup container, did it delete only what had already been purged, or did it also delete important database information?
I’m thinking of using this rancher/cleanup image.

I am having this problem too. The instance table in my database holds 6 GB of data. I cannot even dump the database, because the service stops running when I try.

I am experiencing a very similar issue. The details are reported at https://github.com/rancher/rancher/issues/16694#issuecomment-481190622

So this problem also occurs with Rancher v1.6.14 and with the latest stable version, Rancher v1.6.26.
FYI https://www.claudiokuenzler.com/blog/830/how-to-solve-rancher-1.x-service-stuck-in-removing-in-progress

Can anyone provide support for this bug? It is critical!