Swarmkit-mon failure

Does the Swarm environment work? I’m unable to keep the swarmkit-mon containers in a stable state, which leaves the environment permanently stuck in a setup state with no CLI. I’ve tested different versions of Docker and Rancher and all of them fail the same way, so I’m trying to figure out what I’m overlooking. I’ve rebuilt this setup multiple times with VMs and each time it ends with the same result: the initial setup will sometimes come up stable, but reboot any of the client VMs and swarmkit-mon starts to fail on all machines. I understand Swarm support is now listed as experimental, but I didn’t hit this issue when I first tested the Rancher Swarm environment six months ago.

Nothing out of the ordinary with the VM setup: five VMs in VirtualBox on the same host, one MySQL DB, one Rancher server (1.3.4), and three clients.

All three clients are set up the same - CentOS 7
Docker version
Client:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built:
OS/Arch: linux/amd64

Server:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built:
OS/Arch: linux/amd64

Firewalls (iptables, firewalld) disabled.
SELinux disabled.
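
For completeness, these are the usual CentOS 7 commands for that (a sketch of the standard steps, nothing Rancher-specific):

# Disable firewalld and SELinux on each CentOS 7 client (standard steps)
systemctl stop firewalld && systemctl disable firewalld
setenforce 0                                                   # SELinux permissive for the running system
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config   # persists across reboots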

Rancher server 1.3.4 currently. I’ve tested it with every version from 1.3.0 all the way to 1.4.0, same results.

Logs from swarmkit-mon-1

2/9/2017 12:12:25 PM time="2017-02-09T19:12:25Z" level=info msg="Listening on port: 2378"
2/9/2017 1:43:30 PM time="2017-02-09T20:43:30Z" level=info msg="Listening on port: 2378"

Logs from swarmkit-mon-3

2/9/2017 1:40:05 PM Deleted host label swarm
2/9/2017 1:40:35 PM Deleted host label swarm
2/9/2017 1:41:06 PM Deleted host label swarm
2/9/2017 1:41:36 PM Deleted host label swarm
2/9/2017 1:42:06 PM Deleted host label swarm
2/9/2017 1:42:36 PM Deleted host label swarm
2/9/2017 1:43:06 PM Deleted host label swarm
2/9/2017 1:43:32 PM time="2017-02-09T20:43:32Z" level=info msg="Listening on port: 2378"
2/9/2017 1:43:33 PM Deleted host label swarm

Logs from swarmkit-mon-2

2/9/2017 2:09:05 PM No active workers present for promotion, add more nodes to enable reconciliation.
2/9/2017 2:09:35 PM Error response from daemon: rpc error: code = 9 desc = attempting to demote the last manager of the swarm
2/9/2017 2:09:35 PM Error response from daemon: rpc error: code = 9 desc = node 4w8j9psi970rfqs7ag3uargxh is a cluster manager and is a member of the raft cluster. It must be demoted to worker before removal
2/9/2017 2:09:35 PM Removed 4w8j9psi970rfqs7ag3uargxh from the swarm.
2/9/2017 2:09:35 PM 1 of 1 manager(s) reachable, 0 worker(s) active
2/9/2017 2:09:35 PM No active workers present for promotion, add more nodes to enable reconciliation.
2/9/2017 2:10:05 PM Error response from daemon: rpc error: code = 9 desc = attempting to demote the last manager of the swarm
2/9/2017 2:10:05 PM Error response from daemon: rpc error: code = 9 desc = node 4w8j9psi970rfqs7ag3uargxh is a cluster manager and is a member of the raft cluster. It must be demoted to worker before removal
2/9/2017 2:10:05 PM Removed 4w8j9psi970rfqs7ag3uargxh from the swarm.
2/9/2017 2:10:05 PM 1 of 1 manager(s) reachable, 0 worker(s) active
2/9/2017 2:10:05 PM No active workers present for promotion, add more nodes to enable reconciliation.
2/9/2017 2:10:35 PM Error response from daemon: rpc error: code = 9 desc = attempting to demote the last manager of the swarm
2/9/2017 2:10:35 PM Error response from daemon: rpc error: code = 9 desc = node 4w8j9psi970rfqs7ag3uargxh is a cluster manager and is a member of the raft cluster. It must be demoted to worker before removal
2/9/2017 2:10:35 PM Removed 4w8j9psi970rfqs7ag3uargxh from the swarm.
2/9/2017 2:10:35 PM 1 of 1 manager(s) reachable, 0 worker(s) active
2/9/2017 2:10:35 PM No active workers present for promotion, add more nodes to enable reconciliation.
2/9/2017 2:11:06 PM Error response from daemon: rpc error: code = 9 desc = attempting to demote the last manager of the swarm
2/9/2017 2:11:06 PM Error response from daemon: rpc error: code = 9 desc = node 4w8j9psi970rfqs7ag3uargxh is a cluster manager and is a member of the raft cluster. It must be demoted to worker before removal
2/9/2017 2:11:06 PM Removed 4w8j9psi970rfqs7ag3uargxh from the swarm.
2/9/2017 2:11:06 PM 1 of 1 manager(s) reachable, 0 worker(s) active
2/9/2017 2:11:06 PM No active workers present for promotion, add more nodes to enable reconciliation.
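
To see what swarmkit-mon is fighting against while this loop runs, you can check the swarm state directly on the host with the plain Docker 1.12 CLI (the node ID below is the one from the errors):

# Run on the host where swarmkit-mon-2 lives
docker info | grep -A 8 'Swarm:'                          # Swarm state, NodeID, Is Manager, Managers/Nodes counts
docker node ls                                            # only works on a manager; lists remaining members
docker node inspect 4w8j9psi970rfqs7ag3uargxh --pretty    # the node the errors keep referring to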

This is found in the logs for the network-services-metadata containers on the failing hosts

2/9/2017 2:28:39 PM time="2017-02-09T21:28:39Z" level=info msg="Error: /self/host/labels/swarm" client=172.17.0.1 version=2015-12-19
2/9/2017 2:29:09 PM time="2017-02-09T21:29:09Z" level=info msg="Error: /self/host/labels/swarm" client=172.17.0.1 version=2015-12-19
2/9/2017 2:29:39 PM time="2017-02-09T21:29:39Z" level=info msg="Error: /self/host/labels/swarm" client=172.17.0.1 version=2015-12-19
2/9/2017 2:30:09 PM time="2017-02-09T21:30:09Z" level=info msg="Error: /self/host/labels/swarm" client=172.17.0.1 version=2015-12-19
2/9/2017 2:30:39 PM time="2017-02-09T21:30:39Z" level=info msg="Error: /self/host/labels/swarm" client=172.17.0.1 version=2015-12-19
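
Those errors line up with the “Deleted host label swarm” messages above: the swarm host label that swarmkit-mon polls for is gone. Assuming the default rancher-metadata endpoint is reachable from a container on the managed network (the hostname may differ in your setup), you can check it directly:

# From a container on the Rancher managed network (endpoint name is the usual default; adjust if needed)
curl -s http://rancher-metadata/latest/self/host/labels          # all labels on this host
curl -s http://rancher-metadata/latest/self/host/labels/swarm    # the key the metadata service is erroring on
# A missing/empty answer here matches the "Error: /self/host/labels/swarm" lines above.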

Hi,

I’m facing the same issue, even with the latest version, 1.4.1.
I modified the swarm template to use swarmkit-mon v1.12.3-3 because v1.13 is said to work only with Docker Engine 1.13.
I’m using Docker Engine v1.12.6.

From what I understand, the cluster is correctly created with 3 manager nodes, but a few seconds later swarmkit-mon tries to remove all nodes from the cluster. It fails to remove the node it’s running on, because the last manager can’t leave the swarm.
I don’t understand why it behaves like this.

NB: I tried a template with one manager instead of 3, without success.
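
For what it’s worth, the “last manager” restriction itself is plain Docker behaviour, not something Rancher adds; you can reproduce it outside Rancher (node ID is a placeholder, error wording approximate):

docker swarm init                      # single-node swarm: this node is the only manager
docker node demote <self-node-id>      # fails: "attempting to demote the last manager of the swarm"
docker swarm leave                     # also refuses on a manager unless you pass --force
docker swarm leave --force             # the only way out for the last manager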

Hi Again,

I was fiddling with swarmkit-mon, trying to understand why the other nodes leave the cluster.
I still don’t know why… but I found a “method” to get swarm running (see the command sketch after the list):

  • add node 1: start swarmkit
    • it keeps trying to remove itself from the cluster but can’t; don’t worry about that
  • add a second node: start swarmkit
    • it joins the cluster and immediately leaves it (that is the part I still don’t understand)
    • just run docker swarm leave on node 2
    • a few seconds later (30-second sleep) swarmkit-mon fires again and rejoins the cluster
    • this time the node stays in the cluster
  • add a third node: start swarmkit
    • it joins the cluster and doesn’t crash
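
Roughly the same sequence as a command sketch (node1/node2/node3 are just placeholders for my three hosts):

# node1: start the swarmkit stack from Rancher; ignore the loop where it tries (and fails) to remove itself
# node2: start swarmkit; it joins and then drops out, so kick it out cleanly yourself:
docker swarm leave            # run on node2 (add --force if it still thinks it is a manager)
# ~30 seconds later swarmkit-mon fires again, node2 rejoins, and this time it stays
# node3: start swarmkit; it joins and stays without any intervention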

Points I don’t understand:

  • my template still specifies that I want 1 manager, yet I have 3
  • why does node 2 leave the cluster the first time, but not the second time?
  • why does it work out of the box on node 3?

My best guess is that swarmkit-mon does not play well when all 3 containers fire at the same time (too fast for Raft gossip?).

Another thing that annoys me is the following line in swarmkit-mon:
for hostname in $(echo $hosts | jq -r .[].hostname | cut -d. -f1); do
Why does it need to strip the domain from the FQDN? All my hosts report FQDNs everywhere (Rancher, swarm…).
Still, I don’t know if it is of any significance.
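
To make concrete what that cut does (a minimal reproduction with made-up hostnames, not swarmkit-mon’s real input):

# Hypothetical host list resembling what the script iterates over
hosts='[{"hostname":"node1.example.lan"},{"hostname":"node2.example.lan"}]'
echo $hosts | jq -r '.[].hostname' | cut -d. -f1
# prints "node1" and "node2": the domain is stripped, so any later comparison against
# the full FQDNs the hosts actually report would no longer match.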

Disclaimer: at the moment I don’t know if swarm really works, but all nodes see each other and the containers are stable.

JoKoT3,

What happens to the environment when you reboot one of the hosts? Does it come back up stable without any user intervention?

I just tried that and the node came back into the cluster without issue.
I still have problems with SELinux, but they don’t have any impact on swarm (only IPsec + healthcheck).

Well, it is not stable at all. I lost the cluster today. I was able to recreate one by leaving the swarm manually on each node.

We are facing the same problem here.

I’m getting the same issue. I tried it with both 1.5.1 and 1.5.2, and it just doesn’t seem stable. I’ve had fewer problems setting up swarms from the command line, so I guess I’m going to go that route. I wanted to like Rancher, but when I tried asking questions on Slack I was told I should wait for their “Swarm expert” to get into the office, and later that he “is not on Slack”, so I’m really not sure what’s going on. I know it’s free software, but if I can’t get it to work for my use case and can’t get support, something has to give. I’m not looking for one more thing I have to maintain and troubleshoot on my own.
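
For anyone going the same route, the manual setup really is just the standard Docker 1.12+ commands (IP and token below are placeholders):

# On the first node (becomes the manager)
docker swarm init --advertise-addr <manager-ip>
# It prints a join command with a token; run it on each additional node
docker swarm join --token <worker-token> <manager-ip>:2377
# Back on the manager, verify membership
docker node ls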

FWIW - the UI doesn’t do a great job of letting you know this, but the default Rancher security group created in AWS only opens port 2376, while swarmkit also requires ports 2377 and 2378. I was able to get everything up and running by opening those.
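
If you’d rather script that than click through the console, something like this against the Rancher-created group does it (group ID and CIDR are placeholders for your own values):

# Open the extra swarmkit ports on the AWS security group Rancher created
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 2377 --cidr 10.0.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 2378 --cidr 10.0.0.0/16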