Rancher stops responding and restarts regularly

I did a fresh install of Ubuntu 20 server.

Installed Rancher 2.5.3.

Rancher keeps restarting. Deployments stay up and running, but Rancher itself is regularly unavailable. It seems to restart at least every 15 minutes while I'm using the UI, and I presume it's also doing it when I'm not.

Any ideas what is causing this?

Thanks in advance.

Please share more info about the setup. I'm assuming you are running the Docker install (using docker run)? What are the specifications of the machine running the Rancher container, and what is in the log of the Rancher container?

I am running on a Dell PowerEdge R710.
VMware 5.5.
VM: Ubuntu 20 (assigned 1 CPU with 8 cores, 16 GB RAM).

I followed a tutorial to install it.

I installed Docker like this:
sudo curl https://releases.rancher.com/install-docker/19.03.sh | sh

I ran Rancher like this:
sudo docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher

I'm new to Rancher; how do I get the Rancher logs?

The command docker ps -a will give you a list of all running and stopped containers. Running docker logs with either the ID or the name of the container will output that container's logs.
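For example (the container ID/name will differ on your system, so treat CONTAINERID below as a placeholder):

docker ps -a
docker logs --tail 200 CONTAINERID

The --tail flag just limits the output to the last N lines so you are not flooded.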

Sorry, pretty noob here…

The logs are massive… Is there a way to filter with grep or whatever to return just the last 15 minutes?

If it is restarting, the lines just before it starts again will probably help. You can get those by grepping for Rancher version, which matches the line Rancher version v2.5.x (x) is starting that is logged when Rancher starts, and adding -B100 to grep to include the 100 lines before each match.

docker logs CONTAINERID 2>&1 | grep -B100 "Rancher version"
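As a side note, if you would rather limit by time than grep, docker logs also accepts --since with a relative duration (assuming a reasonably recent Docker version), e.g.:

docker logs --since 15m CONTAINERID 2>&1 | less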

I get a ton of these messages:
W1207 21:15:05.811183 8 reflector.go:425] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: watch of *v3.PipelineSetting ended with: very short watch: pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.1/tools/cache/reflector.go:157: Unexpected watch close - watch lasted less than a second and no items received

Then it ends in this:
2020/12/07 21:15:05 [FATAL] k3s exited with: exit status 255
2020/12/07 21:15:14 [INFO] Rancher version v2.5.3 (028c6dd04) is starting

We need a few more log lines to have a better indication, but this usually comes down to a resource problem. What kind of disk is being served to the VMs? How many clusters are you managing in Rancher, and with how many nodes and resources in them? Is this node only being used for Rancher, or also as a node in a cluster?

Hard disk: 8 TB RAID 6 (hardware RAID on the Dell R710's H700 controller).
200 GB is allocated to this VM.

I have just this one VM acting as both Rancher and the single worker node in the cluster.
I plan to add another node, but not until I get this working smoothly.

--
W1208 07:06:14.033907       7 warnings.go:67] v1 ComponentStatus is deprecated in v1.19+
2020-12-08 07:06:14.509127 W | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/k3s\" " with result "range_response_count:1 size:507" took too long (111.546733ms) to execute
2020-12-08 07:06:14.509658 W | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/rancher-controller-lock\" " with result "range_response_count:1 size:581" took too long (154.03619ms) to execute
2020-12-08 07:06:14.510176 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/composeconfigs\" range_end:\"/registry/management.cattle.io/composeconfigt\" count_only:true " with result "range_response_count:0 size:7" took too long (142.826868ms) to execute
2020-12-08 07:06:14.510644 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:576" took too long (119.781303ms) to execute
2020-12-08 07:06:14.510913 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/cloud-controller-manager\" " with result "range_response_count:1 size:596" took too long (113.303314ms) to execute
2020-12-08 07:06:18.479731 W | wal: sync duration of 2.891757985s, expected less than 1s
2020-12-08 07:06:22.595949 W | etcdserver: read-only range request "key:\"/registry/fleet.cattle.io/bundles\" range_end:\"/registry/fleet.cattle.io/bundlet\" count_only:true " with result "range_response_count:0 size:9" took too long (7.951914736s) to execute
2020-12-08 07:06:22.596242 W | etcdserver: read-only range request "key:\"/registry/limitranges\" range_end:\"/registry/limitranget\" count_only:true " with result "range_response_count:0 size:7" took too long (7.047055602s) to execute
2020-12-08 07:06:22.596447 W | etcdserver: request "header:<ID:7587850969354493321 > txn:<compare:<target:MOD key:\"/registry/leases/kube-node-lease/local-node\" mod_revision:2121918 > success:<request_put:<key:\"/registry/leases/kube-node-lease/local-node\" value_size:538 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/local-node\" > >>" with result "size:20" took too long (7.00833105s) to execute
2020-12-08 07:06:22.596928 W | etcdserver: read-only range request "key:\"/registry/jobs/\" range_end:\"/registry/jobs0\" limit:500 " with result "range_response_count:0 size:7" took too long (7.003767322s) to execute
I1208 07:06:22.597503      37 trace.go:116] Trace[327910794]: "List etcd3" key:/jobs,resourceVersion:,limit:500,continue: (started: 2020-12-08 07:06:15.592776247 +0000 UTC m=+458.201130673) (total time: 7.00468331s):
Trace[327910794]: [7.00468331s] [7.00468331s] END
I1208 07:06:22.597652      37 trace.go:116] Trace[94876952]: "List" url:/apis/batch/v1/jobs,user-agent:k3s/v1.18.8+k3s1 (linux/amd64) kubernetes/6b59531/system:serviceaccount:kube-system:cronjob-controller,client:127.0.0.1 (started: 2020-12-08 07:06:15.59274146 +0000 UTC m=+458.201095928) (total time: 7.004878762s):
Trace[94876952]: [7.004786934s] [7.004759318s] Listing from storage done
I1208 07:06:22.599348      37 trace.go:116] Trace[806047525]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-12-08 07:06:15.586109017 +0000 UTC m=+458.194463456) (total time: 7.013200418s):
Trace[806047525]: [7.013168441s] [7.012022303s] Transaction committed
I1208 07:06:22.599550      37 trace.go:116] Trace[840925927]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/local-node,user-agent:k3s/v1.18.8+k3s1 (linux/amd64) kubernetes/6b59531,client:127.0.0.1 (started: 2020-12-08 07:06:15.585876729 +0000 UTC m=+458.194231157) (total time: 7.013635558s):
Trace[840925927]: [7.013559916s] [7.013390408s] Object stored in database
2020-12-08 07:06:25.877176 W | wal: sync duration of 3.258803815s, expected less than 1s
E1208 07:06:26.250053      37 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: Get https://127.0.0.1:6444/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I1208 07:06:26.250288      37 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
I1208 07:06:26.250424      37 event.go:278] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 33a6e2ae8a16_d27772e9-9121-46e9-b20f-ee8a40151208 stopped leading
I1208 07:06:26.250461      37 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"kube-controller-manager", UID:"8953acbb-b98c-45d7-9f8b-f51e943f8404", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"2121971", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 33a6e2ae8a16_d27772e9-9121-46e9-b20f-ee8a40151208 stopped leading
E1208 07:06:26.250325      37 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I1208 07:06:26.250860      37 trace.go:116] Trace[1338418010]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:k3s/v1.18.8+k3s1 (linux/amd64) kubernetes/6b59531/leader-election,client:127.0.0.1 (started: 2020-12-08 07:06:16.251639378 +0000 UTC m=+458.859993809) (total time: 9.999172527s):
Trace[1338418010]: [9.999172527s] [9.999130524s] END
I1208 07:06:26.251222      37 garbagecollector.go:151] Shutting down garbage collector controller
I1208 07:06:26.251309      37 resource_quota_controller.go:291] Shutting down resource quota controller
I1208 07:06:26.251743      37 graph_builder.go:310] stopped 140 of 140 monitors
I1208 07:06:26.251770      37 graph_builder.go:311] GraphBuilder stopping
I1208 07:06:26.251880      37 certificate_controller.go:131] Shutting down certificate controller "csrapproving"
I1208 07:06:26.252436      37 attach_detach_controller.go:387] Shutting down attach detach controller
I1208 07:06:26.252735      37 pvc_protection_controller.go:113] Shutting down PVC protection controller
I1208 07:06:26.262577      37 replica_set.go:193] Shutting down replicaset controller
2020-12-08 07:06:28.367082 W | etcdserver: read-only range request "key:\"/registry/project.cattle.io/apps\" range_end:\"/registry/project.cattle.io/appt\" count_only:true " with result "range_response_count:0 size:7" took too long (2.696494829s) to execute
2020-12-08 07:06:28.367165 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/preferences\" range_end:\"/registry/management.cattle.io/preferencet\" count_only:true " with result "range_response_count:0 size:9" took too long (2.857625837s) to execute
2020-12-08 07:06:28.367195 W | etcdserver: read-only range request "key:\"/registry/cronjobs/\" range_end:\"/registry/cronjobs0\" limit:500 " with result "range_response_count:0 size:7" took too long (5.767541975s) to execute
2020-12-08 07:06:28.367353 W | etcdserver: read-only range request "key:\"/registry/catalog.cattle.io/clusterrepos\" range_end:\"/registry/catalog.cattle.io/clusterrepot\" count_only:true " with result "range_response_count:0 size:9" took too long (2.885289638s) to execute
2020-12-08 07:06:28.367395 W | etcdserver: read-only range request "key:\"/registry/persistentvolumes\" range_end:\"/registry/persistentvolumet\" count_only:true " with result "range_response_count:0 size:7" took too long (11.477060609s) to execute
2020-12-08 07:06:28.367467 W | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/cattle-controllers\" " with result "range_response_count:1 size:541" took too long (12.079885128s) to execute
2020-12-08 07:06:28.367490 W | etcdserver: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:1 size:979" took too long (3.807397366s) to execute
2020-12-08 07:06:28.367598 W | etcdserver: read-only range request "key:\"/registry/clusterrolebindings\" range_end:\"/registry/clusterrolebindingt\" count_only:true " with result "range_response_count:0 size:9" took too long (11.649303468s) to execute
2020-12-08 07:06:28.367645 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:594" took too long (12.115478704s) to execute
2020-12-08 07:06:28.367751 W | etcdserver: read-only range request "key:\"/registry/csidrivers\" range_end:\"/registry/csidrivert\" count_only:true " with result "range_response_count:0 size:7" took too long (5.863154653s) to execute
2020-12-08 07:06:28.367785 W | etcdserver: read-only range request "key:\"/registry/configmaps/fleet-system/fleet-agent-lock\" " with result "range_response_count:1 size:557" took too long (12.266885414s) to execute
2020-12-08 07:06:28.367842 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/cloud-controller-manager\" " with result "range_response_count:1 size:596" took too long (11.827473064s) to execute
2020-12-08 07:06:28.367878 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/templatecontents\" range_end:\"/registry/management.cattle.io/templatecontentt\" count_only:true " with result "range_response_count:0 size:7" took too long (3.832659513s) to execute
2020-12-08 07:06:28.367922 W | etcdserver: read-only range request "key:\"/registry/configmaps/fleet-system/gitjob\" " with result "range_response_count:1 size:528" took too long (12.270517088s) to execute
2020-12-08 07:06:28.367977 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/podsecuritypolicytemplateprojectbindings\" range_end:\"/registry/management.cattle.io/podsecuritypolicytemplateprojectbindingt\" count_only:true " with result "range_response_count:0 size:7" took too long (10.654256631s) to execute
2020-12-08 07:06:28.368002 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:576" took too long (11.829122612s) to execute
2020-12-08 07:06:28.368022 W | etcdserver: read-only range request "key:\"/registry/project.cattle.io/pipelines\" range_end:\"/registry/project.cattle.io/pipelinet\" count_only:true " with result "range_response_count:0 size:9" took too long (5.880214948s) to execute
2020-12-08 07:06:28.368105 W | etcdserver: read-only range request "key:\"/registry/configmaps/fleet-system/fleet-controller-lock\" " with result "range_response_count:1 size:578" took too long (12.722065011s) to execute
2020-12-08 07:06:28.368179 W | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/rancher-controller-lock\" " with result "range_response_count:1 size:581" took too long (11.842638674s) to execute
2020-12-08 07:06:28.368200 W | etcdserver: read-only range request "key:\"/registry/roles\" range_end:\"/registry/rolet\" count_only:true " with result "range_response_count:0 size:9" took too long (4.188591838s) to execute
2020-12-08 07:06:28.368232 W | etcdserver: read-only range request "key:\"/registry/minions\" range_end:\"/registry/miniont\" count_only:true " with result "range_response_count:0 size:9" took too long (12.757146396s) to execute
2020-12-08 07:06:28.368305 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/projectalertgroups\" range_end:\"/registry/management.cattle.io/projectalertgroupt\" count_only:true " with result "range_response_count:0 size:9" took too long (9.453502808s) to execute
2020-12-08 07:06:28.368382 W | etcdserver: read-only range request "key:\"/registry/controllerrevisions\" range_end:\"/registry/controllerrevisiont\" count_only:true " with result "range_response_count:0 size:7" took too long (4.701230007s) to execute
2020-12-08 07:06:28.368532 W | etcdserver: read-only range request "key:\"/registry/namespaces/kube-system\" " with result "range_response_count:1 size:987" took too long (5.28989984s) to execute
2020-12-08 07:06:28.368705 W | etcdserver: read-only range request "key:\"/registry/leases\" range_end:\"/registry/leaset\" count_only:true " with result "range_response_count:0 size:9" took too long (8.407112971s) to execute
2020-12-08 07:06:28.368779 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/clustertemplates\" range_end:\"/registry/management.cattle.io/clustertemplatet\" count_only:true " with result "range_response_count:0 size:7" took too long (6.310269546s) to execute
2020-12-08 07:06:28.368858 W | etcdserver: read-only range request "key:\"/registry/project.cattle.io/sourcecodecredentials\" range_end:\"/registry/project.cattle.io/sourcecodecredentialt\" count_only:true " with result "range_response_count:0 size:7" took too long (6.320361866s) to execute
2020-12-08 07:06:28.368882 W | etcdserver: read-only range request "key:\"/registry/priorityclasses\" range_end:\"/registry/priorityclasset\" count_only:true " with result "range_response_count:0 size:9" took too long (9.2416341s) to execute
2020-12-08 07:06:28.368959 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/clusterroletemplatebindings\" range_end:\"/registry/management.cattle.io/clusterroletemplatebindingt\" count_only:true " with result "range_response_count:0 size:9" took too long (12.012646627s) to execute
2020-12-08 07:06:28.369039 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/clusteralerts\" range_end:\"/registry/management.cattle.io/clusteralertt\" count_only:true " with result "range_response_count:0 size:7" took too long (6.706223716s) to execute
2020-12-08 07:06:28.369062 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/fleetworkspaces\" range_end:\"/registry/management.cattle.io/fleetworkspacet\" count_only:true " with result "range_response_count:0 size:9" took too long (9.674449386s) to execute
2020-12-08 07:06:28.369179 W | etcdserver: read-only range request "key:\"/registry/configmaps/kube-system/k3s\" " with result "range_response_count:1 size:507" took too long (11.842776377s) to execute
2020-12-08 07:06:28.369261 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/globalroles\" range_end:\"/registry/management.cattle.io/globalrolet\" count_only:true " with result "range_response_count:0 size:9" took too long (7.956702269s) to execute
2020-12-08 07:06:28.369356 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/clustercatalogs\" range_end:\"/registry/management.cattle.io/clustercatalogt\" count_only:true " with result "range_response_count:0 size:7" took too long (4.468222622s) to execute
2020-12-08 07:06:28.369411 W | etcdserver: read-only range request "key:\"/registry/management.cattle.io/monitormetrics\" range_end:\"/registry/management.cattle.io/monitormetrict\" count_only:true " with result "range_response_count:0 size:7" took too long (8.068412678s) to execute
2020-12-08 07:06:28.369522 W | etcdserver: read-only range request "key:\"/registry/project.cattle.io/sourcecodeproviderconfigs\" range_end:\"/registry/project.cattle.io/sourcecodeproviderconfigt\" count_only:true " with result "range_response_count:0 size:9" took too long (7.398845748s) to execute
I1208 07:06:28.393453      37 serviceaccounts_controller.go:129] Shutting down service account controller
I1208 07:06:28.396291      37 node_lifecycle_controller.go:593] Shutting down node controller
I1208 07:06:28.396666      37 range_allocator.go:184] Shutting down range CIDR allocator
I1208 07:06:28.396684      37 endpoints_controller.go:199] Shutting down endpoint controller
I1208 07:06:28.396743      37 pv_controller_base.go:311] Shutting down persistent volume controller
I1208 07:06:28.396792      37 pv_controller_base.go:505] claim worker queue shutting down
I1208 07:06:28.396923      37 pv_controller_base.go:448] volume worker queue shutting down
I1208 07:06:28.419928      37 replica_set.go:193] Shutting down replicationcontroller controller
I1208 07:06:28.422938      37 resource_quota_controller.go:260] resource quota controller worker shutting down
I1208 07:06:28.422993      37 resource_quota_controller.go:260] resource quota controller worker shutting down
I1208 07:06:28.423027      37 resource_quota_controller.go:260] resource quota controller worker shutting down
I1208 07:06:28.423060      37 resource_quota_controller.go:260] resource quota controller worker shutting down
E1208 07:06:28.425898      37 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I1208 07:06:28.426347      37 trace.go:116] Trace[1595896535]: "Get" url:/api/v1/namespaces/kube-system,user-agent:rancher/v0.0.0 (linux/amd64) kubernetes/$Format,client:127.0.0.1 (started: 2020-12-08 07:06:23.078096627 +0000 UTC m=+465.686451053) (total time: 5.348196797s):
Trace[1595896535]: [5.348196797s] [5.348166554s] END
I1208 07:06:28.427332      37 trace.go:116] Trace[610236364]: "Get" url:/api/v1/namespaces/default,user-agent:k3s/v1.18.8+k3s1 (linux/amd64) kubernetes/6b59531,client:127.0.0.1 (started: 2020-12-08 07:06:24.554582689 +0000 UTC m=+467.162937153) (total time: 3.872709327s):
Trace[610236364]: [3.872634946s] [3.872613256s] About to write a response
I1208 07:06:28.428335      37 trace.go:116] Trace[442693639]: "Get" url:/api/v1/namespaces/kube-system/configmaps/k3s,user-agent:k3s/v1.18.8+k3s1 (linux/amd64) kubernetes/6b59531,client:127.0.0.1 (started: 2020-12-08 07:06:16.523825029 +0000 UTC m=+459.132179453) (total time: 11.904471343s):
Trace[442693639]: [11.904378235s] [11.904363299s] About to write a response
I1208 07:06:28.428880      37 trace.go:116] Trace[1066877537]: "Get" url:/api/v1/namespaces/kube-system/endpoints/cloud-controller-manager,user-agent:k3s/v1.18.8+k3s1 (linux/amd64) kubernetes/6b59531/leader-election,client:127.0.0.1 (started: 2020-12-08 07:06:16.540048438 +0000 UTC m=+459.148402861) (total time: 11.888793079s):
Trace[1066877537]: [11.888711836s] [11.888702843s] About to write a response
F1208 07:06:28.429291      37 controllermanager.go:279] leaderelection lost
2020-12-08 07:06:31.688627 W | wal: sync duration of 1.730836559s, expected less than 1s
W1208 07:06:42.892852       7 warnings.go:67] v1 ComponentStatus is deprecated in v1.19+
time="2020-12-08T07:06:45.686311162Z" level=info msg="error in remotedialer server [400]: read tcp 172.17.0.2:6443->172.17.0.2:45852: i/o timeout"
E1208 07:06:46.546206       7 leaderelection.go:321] error retrieving resource lock kube-system/cattle-controllers: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s": context deadline exceeded
I1208 07:06:46.546361       7 leaderelection.go:278] failed to renew lease kube-system/cattle-controllers: timed out waiting for the condition
E1208 07:06:46.546543       7 leaderelection.go:297] Failed to release lock: resource name may not be empty
2020/12/08 07:06:46 [FATAL] leaderelection lost for cattle-controllers
2020/12/08 07:06:54 [INFO] Rancher version v2.5.3 (028c6dd04) is starting

It seems like disk IO is an issue. Can you read through https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#disks and run the test to check disk IO on the system?
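Roughly, the test from that guide looks like this (point --directory at a path on the same disk that backs the Rancher container's data; the directory name here is just an example):

fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest

The number to look at is the fsync/fdatasync 99th percentile; as a rule of thumb, etcd wants that to stay under roughly 10ms.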

I'm rebuilding the disk array at the moment, so disk speed is a bit all over the place. I had a drive fail.

So you suspect disk speed is the issue?

fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest

 fio --rw=write --ioengine=sync --fdatasync=1 --directory=hdd-test --size=22m --bs=2300 --name=test2

test2: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
test2: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=1567KiB/s][w=697 IOPS][eta 00m:00s]
test2: (groupid=0, jobs=1): err= 0: pid=582337: Tue Dec 8 17:03:11 2020
write: IOPS=1597, BW=3588KiB/s (3674kB/s)(21.0MiB/6279msec); 0 zone resets
clat (usec): min=10, max=499, avg=16.81, stdev= 7.83
lat (usec): min=10, max=500, avg=17.31, stdev= 7.85
clat percentiles (usec):
| 1.00th=[ 12], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 12],
| 30.00th=[ 13], 40.00th=[ 17], 50.00th=[ 18], 60.00th=[ 18],
| 70.00th=[ 18], 80.00th=[ 19], 90.00th=[ 22], 95.00th=[ 25],
| 99.00th=[ 40], 99.50th=[ 48], 99.90th=[ 81], 99.95th=[ 96],
| 99.99th=[ 172]
bw ( KiB/s): min= 1455, max= 5260, per=100.00%, avg=3624.75, stdev=1316.26, samples=12
iops : min= 648, max= 2342, avg=1613.92, stdev=585.97, samples=12
lat (usec) : 20=87.48%, 50=12.08%, 100=0.39%, 250=0.04%, 500=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=113, max=435565, avg=603.37, stdev=5088.32
sync percentiles (usec):
| 1.00th=[ 126], 5.00th=[ 135], 10.00th=[ 141], 20.00th=[ 153],
| 30.00th=[ 167], 40.00th=[ 215], 50.00th=[ 474], 60.00th=[ 506],
| 70.00th=[ 537], 80.00th=[ 570], 90.00th=[ 644], 95.00th=[ 766],
| 99.00th=[ 4817], 99.50th=[ 9896], 99.90th=[ 18744], 99.95th=[ 32900],
| 99.99th=[168821]
cpu : usr=2.12%, sys=12.28%, ctx=16791, majf=0, minf=13
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=3588KiB/s (3674kB/s), 3588KiB/s-3588KiB/s (3674kB/s-3674kB/s), io=21.0MiB (23.1MB), run=6279-6279msec

Disk stats (read/write):
dm-0: ios=1157/31942, merge=0/0, ticks=34228/8728, in_queue=42956, util=92.45%, aggrios=1163/21479, aggrmerge=3/11571, aggrticks=33943/6916, aggrin_queue=35516, aggrutil=94.49%
sda: ios=1163/21479, merge=3/11571, ticks=33943/6916, in_queue=35516, util=94.49%
jade@shared3:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=hdd-test --size=22m --bs=2300 --name=test2
test2: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [W(1)][-.-%][w=8076KiB/s][w=3595 IOPS][eta 00m:00s]
test2: (groupid=0, jobs=1): err= 0: pid=582775: Tue Dec 8 17:03:31 2020
write: IOPS=3463, BW=7778KiB/s (7965kB/s)(21.0MiB/2896msec); 0 zone resets
clat (usec): min=8, max=20609, avg=105.59, stdev=314.21
lat (usec): min=9, max=20610, avg=106.09, stdev=314.22
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 10], 10.00th=[ 10], 20.00th=[ 11],
| 30.00th=[ 11], 40.00th=[ 14], 50.00th=[ 141], 60.00th=[ 149],
| 70.00th=[ 157], 80.00th=[ 165], 90.00th=[ 182], 95.00th=[ 202],
| 99.00th=[ 277], 99.50th=[ 355], 99.90th=[ 1450], 99.95th=[ 2999],
| 99.99th=[19792]
bw ( KiB/s): min= 6810, max= 8575, per=100.00%, avg=7866.00, stdev=882.71, samples=5
iops : min= 3032, max= 3818, avg=3502.20, stdev=393.06, samples=5
lat (usec) : 10=11.22%, 20=31.93%, 50=0.48%, 100=0.19%, 250=54.71%
lat (usec) : 500=1.14%, 750=0.12%, 1000=0.08%
lat (msec) : 2=0.05%, 4=0.06%, 10=0.01%, 20=0.01%, 50=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=109, max=57452, avg=176.02, stdev=770.69
sync percentiles (usec):
| 1.00th=[ 117], 5.00th=[ 125], 10.00th=[ 129], 20.00th=[ 135],
| 30.00th=[ 141], 40.00th=[ 145], 50.00th=[ 149], 60.00th=[ 153],
| 70.00th=[ 159], 80.00th=[ 169], 90.00th=[ 188], 95.00th=[ 210],
| 99.00th=[ 343], 99.50th=[ 603], 99.90th=[ 2376], 99.95th=[ 4817],
| 99.99th=[35390]
cpu : usr=4.73%, sys=26.84%, ctx=21509, majf=0, minf=15
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=7778KiB/s (7965kB/s), 7778KiB/s-7778KiB/s (7965kB/s-7965kB/s), io=21.0MiB (23.1MB), run=2896-2896msec

Disk stats (read/write):
dm-0: ios=5674/10287, merge=0/0, ticks=1660/2280, in_queue=3940, util=95.38%, aggrios=5688/10233, aggrmerge=0/73, aggrticks=1697/2265, aggrin_queue=1764, aggrutil=96.71%
sda: ios=5688/10233, merge=0/73, ticks=1697/2265, in_queue=1764, util=96.71%
jade@shared3:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=hdd-test --size=22m --bs=2300 --name=test2
test2: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test2: (groupid=0, jobs=1): err= 0: pid=582983: Tue Dec 8 17:03:42 2020
write: IOPS=3858, BW=8667KiB/s (8875kB/s)(21.0MiB/2599msec); 0 zone resets
clat (usec): min=9, max=12240, avg=97.21, stdev=233.34
lat (usec): min=10, max=12240, avg=97.78, stdev=233.36
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 11], 10.00th=[ 11], 20.00th=[ 11],
| 30.00th=[ 11], 40.00th=[ 14], 50.00th=[ 135], 60.00th=[ 141],
| 70.00th=[ 145], 80.00th=[ 153], 90.00th=[ 167], 95.00th=[ 184],
| 99.00th=[ 241], 99.50th=[ 289], 99.90th=[ 971], 99.95th=[ 2245],
| 99.99th=[11076]
bw ( KiB/s): min= 8256, max= 9123, per=100.00%, avg=8667.80, stdev=351.06, samples=5
iops : min= 3676, max= 4062, avg=3859.20, stdev=156.27, samples=5
lat (usec) : 10=3.75%, 20=39.59%, 50=0.36%, 100=0.09%, 250=55.33%
lat (usec) : 500=0.66%, 750=0.07%, 1000=0.06%
lat (msec) : 2=0.04%, 4=0.01%, 10=0.03%, 20=0.02%
fsync/fdatasync/sync_file_range:
sync (usec): min=109, max=45567, avg=154.51, stdev=560.49
sync percentiles (usec):
| 1.00th=[ 117], 5.00th=[ 121], 10.00th=[ 123], 20.00th=[ 128],
| 30.00th=[ 133], 40.00th=[ 135], 50.00th=[ 139], 60.00th=[ 143],
| 70.00th=[ 147], 80.00th=[ 155], 90.00th=[ 172], 95.00th=[ 190],
| 99.00th=[ 260], 99.50th=[ 334], 99.90th=[ 988], 99.95th=[ 2409],
| 99.99th=[32375]
cpu : usr=5.62%, sys=29.72%, ctx=21514, majf=0, minf=13
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=8667KiB/s (8875kB/s), 8667KiB/s-8667KiB/s (8875kB/s-8875kB/s), io=21.0MiB (23.1MB), run=2599-2599msec

Disk stats (read/write):
dm-0: ios=5266/9480, merge=0/0, ticks=656/1060, in_queue=1716, util=94.98%, aggrios=5646/10134, aggrmerge=0/51, aggrticks=795/1273, aggrin_queue=212, aggrutil=94.48%
sda: ios=5646/10134, merge=0/51, ticks=795/1273, in_queue=212, util=94.48%
jade@shared3:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=hdd-test --size=22m --bs=2300 --name=test2
test2: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test2: (groupid=0, jobs=1): err= 0: pid=583157: Tue Dec 8 17:03:52 2020
write: IOPS=3650, BW=8200KiB/s (8397kB/s)(21.0MiB/2747msec); 0 zone resets
clat (usec): min=9, max=22409, avg=100.14, stdev=308.93
lat (usec): min=9, max=22410, avg=100.68, stdev=308.92
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 10], 10.00th=[ 11], 20.00th=[ 11],
| 30.00th=[ 11], 40.00th=[ 14], 50.00th=[ 139], 60.00th=[ 147],
| 70.00th=[ 153], 80.00th=[ 163], 90.00th=[ 178], 95.00th=[ 194],
| 99.00th=[ 251], 99.50th=[ 293], 99.90th=[ 553], 99.95th=[ 652],
| 99.99th=[19792]
bw ( KiB/s): min= 7537, max= 8737, per=100.00%, avg=8218.60, stdev=492.54, samples=5
iops : min= 3356, max= 3890, avg=3659.20, stdev=219.10, samples=5
lat (usec) : 10=8.65%, 20=34.58%, 50=0.52%, 100=0.06%, 250=55.16%
lat (usec) : 500=0.90%, 750=0.10%
lat (msec) : 4=0.01%, 20=0.01%, 50=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=104, max=45464, avg=166.48, stdev=622.43
sync percentiles (usec):
| 1.00th=[ 116], 5.00th=[ 123], 10.00th=[ 127], 20.00th=[ 133],
| 30.00th=[ 137], 40.00th=[ 141], 50.00th=[ 145], 60.00th=[ 149],
| 70.00th=[ 155], 80.00th=[ 163], 90.00th=[ 178], 95.00th=[ 196],
| 99.00th=[ 260], 99.50th=[ 355], 99.90th=[ 2343], 99.95th=[10421],
| 99.99th=[32375]
cpu : usr=5.13%, sys=27.49%, ctx=21558, majf=0, minf=13
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=8200KiB/s (8397kB/s), 8200KiB/s-8200KiB/s (8397kB/s-8397kB/s), io=21.0MiB (23.1MB), run=2747-2747msec

Disk stats (read/write):
dm-0: ios=5442/9715, merge=0/0, ticks=1568/1304, in_queue=2872, util=94.59%, aggrios=5710/10121, aggrmerge=0/49, aggrticks=1749/1367, aggrin_queue=1072, aggrutil=94.34%
sda: ios=5710/10121, merge=0/49, ticks=1749/1367, in_queue=1072, util=94.34%
jade@shared3:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=hdd-test --size=22m --bs=2300 --name=test2
test2: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test2: (groupid=0, jobs=1): err= 0: pid=583439: Tue Dec 8 17:04:01 2020
write: IOPS=3711, BW=8337KiB/s (8537kB/s)(21.0MiB/2702msec); 0 zone resets
clat (usec): min=9, max=10050, avg=100.22, stdev=167.76
lat (usec): min=9, max=10051, avg=100.73, stdev=167.75
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 11], 10.00th=[ 11], 20.00th=[ 11],
| 30.00th=[ 11], 40.00th=[ 14], 50.00th=[ 139], 60.00th=[ 145],
| 70.00th=[ 153], 80.00th=[ 163], 90.00th=[ 178], 95.00th=[ 198],
| 99.00th=[ 265], 99.50th=[ 359], 99.90th=[ 1020], 99.95th=[ 2089],
| 99.99th=[ 8356]
bw ( KiB/s): min= 7385, max= 8885, per=99.04%, avg=8256.20, stdev=616.15, samples=5
iops : min= 3288, max= 3956, avg=3676.00, stdev=274.27, samples=5
lat (usec) : 10=1.06%, 20=42.35%, 50=0.32%, 100=0.09%, 250=54.84%
lat (usec) : 500=1.07%, 750=0.13%, 1000=0.03%
lat (msec) : 2=0.06%, 4=0.03%, 10=0.02%, 20=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=109, max=29635, avg=161.79, stdev=374.24
sync percentiles (usec):
| 1.00th=[ 117], 5.00th=[ 122], 10.00th=[ 126], 20.00th=[ 133],
| 30.00th=[ 137], 40.00th=[ 141], 50.00th=[ 145], 60.00th=[ 149],
| 70.00th=[ 155], 80.00th=[ 165], 90.00th=[ 182], 95.00th=[ 202],
| 99.00th=[ 306], 99.50th=[ 502], 99.90th=[ 1909], 99.95th=[ 4948],
| 99.99th=[13042]
cpu : usr=4.96%, sys=29.58%, ctx=21504, majf=0, minf=13
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=8337KiB/s (8537kB/s), 8337KiB/s-8337KiB/s (8537kB/s-8537kB/s), io=21.0MiB (23.1MB), run=2702-2702msec

Disk stats (read/write):
dm-0: ios=5534/10029, merge=0/0, ticks=2012/1152, in_queue=3164, util=96.42%, aggrios=5682/10212, aggrmerge=0/80, aggrticks=2027/1355, aggrin_queue=1300, aggrutil=96.05%
sda: ios=5682/10212, merge=0/80, ticks=2027/1355, in_queue=1300, util=96.05%
jade@shared3:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=hdd-test --size=22m --bs=2300 --name=test2
test2: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test2: (groupid=0, jobs=1): err= 0: pid=583883: Tue Dec 8 17:04:24 2020
write: IOPS=3715, BW=8346KiB/s (8546kB/s)(21.0MiB/2699msec); 0 zone resets
clat (usec): min=6, max=41417, avg=102.79, stdev=522.78
lat (usec): min=7, max=41418, avg=103.29, stdev=522.78
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 10], 10.00th=[ 10], 20.00th=[ 11],
| 30.00th=[ 11], 40.00th=[ 14], 50.00th=[ 139], 60.00th=[ 145],
| 70.00th=[ 151], 80.00th=[ 159], 90.00th=[ 174], 95.00th=[ 190],
| 99.00th=[ 243], 99.50th=[ 297], 99.90th=[ 840], 99.95th=[ 1827],
| 99.99th=[30802]
bw ( KiB/s): min= 7546, max= 8966, per=99.94%, avg=8340.60, stdev=627.61, samples=5
iops : min= 3360, max= 3992, avg=3713.60, stdev=279.34, samples=5
lat (usec) : 10=12.86%, 20=30.65%, 50=0.22%, 100=0.11%, 250=55.21%
lat (usec) : 500=0.73%, 750=0.09%, 1000=0.04%
lat (msec) : 2=0.04%, 4=0.03%, 50=0.02%
fsync/fdatasync/sync_file_range:
sync (usec): min=109, max=28686, avg=159.17, stdev=416.65
sync percentiles (usec):
| 1.00th=[ 117], 5.00th=[ 123], 10.00th=[ 126], 20.00th=[ 131],
| 30.00th=[ 137], 40.00th=[ 139], 50.00th=[ 143], 60.00th=[ 149],
| 70.00th=[ 153], 80.00th=[ 161], 90.00th=[ 176], 95.00th=[ 194],
| 99.00th=[ 260], 99.50th=[ 318], 99.90th=[ 1418], 99.95th=[ 3589],
| 99.99th=[22152]
cpu : usr=5.86%, sys=27.76%, ctx=21555, majf=0, minf=13
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=8346KiB/s (8546kB/s), 8346KiB/s-8346KiB/s (8546kB/s-8546kB/s), io=21.0MiB (23.1MB), run=2699-2699msec

Disk stats (read/write):
dm-0: ios=5504/9916, merge=0/0, ticks=688/1344, in_queue=2032, util=95.80%, aggrios=5632/10123, aggrmerge=0/22, aggrticks=768/1413, aggrin_queue=248, aggrutil=95.28%
sda: ios=5632/10123, merge=0/22, ticks=768/1413, in_queue=248, util=95.28%
jade@shared3:~$

The question is probably whether there is a lot of fluctuation in disk performance that might be causing this. Based on the logging, at what interval does the container restart? What does the system monitoring say happens at that point?
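(One quick way to check the interval, assuming the standard Docker CLI: the container keeps a restart counter and its last start time, so something like

docker inspect -f 'restarts={{.RestartCount}} started={{.State.StartedAt}}' CONTAINERID

run a few times, or watching docker events --filter container=CONTAINERID, should show how often it is bouncing.)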

It restarts about 11 times an hour.

I'll wait for the RAID array to rebuild. Then I will see if the restarts reduce or stop.

Otherwise I might be chasing my tail.

Should be done in next 24 hours.
I’ll update this thread with how it goes.

Thanks for your help so far.
Jade

Okay, after the RAID array rebuilt, the problem disappeared.

Thanks for your help!