Rancher agent stuck during run.sh

The issue is the host was not appearing on the GUI. When I use docker logs to see the agent container logs, it shows nothing. It is so wired that nothing is running, not even an exception. Also nothing under /var/lib/rancher/.

firewalld was disabled and stopped.
I did set the environment variable CATTLE_AGENT_IP.
I noticed the docker server container doesn’t use the public IP.

http://docs.rancher.com/rancher/faqs/agents/#where-can-i-find-detailed-logs-of-the-rancher-agent-container

Can you provide more info such as Docker version? The command you used to run your rancher/agent command?

I wasn’t sure I understood “I noticed the docker server container doesn’t use the public IP.”

I am running docker daemon under a service account.

[ts1@EP004 log]$ docker version
Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): linux/amd64
Server version: 1.9.1
Server API version: 1.21
Go version (server): go1.4.2
Git commit (server): a34a1d5
OS/Arch (server): linux/amd64

[ts1@EP004 log]$ cd /var/log/rancher/
-bash: cd: /var/log/rancher/: No such file or directory

[ts1@EP004 log]$ docker inspect 43e7bd817066
[
{
“Id”: “43e7bd81706651478207feda7c36a3373614fae30643de79dfa516826f87af3c”,
“Created”: “2016-04-07T09:12:37.940298505Z”,
“Path”: “/run.sh”,
“Args”: [
http://ep004.dev.fwmrm.net:8080/v1/scripts/B6706404F3D8EAF7C0DA:1460019600000:nJpRiZ8Yhzn6ITb3rnO7S3dib78”,
“debug”
],
“State”: {
“Status”: “running”,
“Running”: true,
“Paused”: false,
“Restarting”: false,
“OOMKilled”: false,
“Dead”: false,
“Pid”: 12048,
“ExitCode”: 0,
“Error”: “”,
“StartedAt”: “2016-04-07T09:12:38.205893568Z”,
“FinishedAt”: “0001-01-01T00:00:00Z”
},
“Image”: “d15ea06f4165f4a07fb435600367f6955285add2c6c3e0229a69dd6656602df6”,
“ResolvConfPath”: “/export/docker/containers/43e7bd81706651478207feda7c36a3373614fae30643de79dfa516826f87af3c/resolv.conf”,
“HostnamePath”: “/export/docker/containers/43e7bd81706651478207feda7c36a3373614fae30643de79dfa516826f87af3c/hostname”,
“HostsPath”: “/export/docker/containers/43e7bd81706651478207feda7c36a3373614fae30643de79dfa516826f87af3c/hosts”,
“LogPath”: “/export/docker/containers/43e7bd81706651478207feda7c36a3373614fae30643de79dfa516826f87af3c/43e7bd81706651478207feda7c36a3373614fae30643de79dfa516826f87af3c-json.log”,
“Name”: “/awesome_knuth”,
“RestartCount”: 0,
“Driver”: “devicemapper”,
“ExecDriver”: “native-0.2”,
“MountLabel”: “”,
“ProcessLabel”: “”,
“AppArmorProfile”: “”,
“ExecIDs”: null,
“HostConfig”: {
“Binds”: [
"/var/run/docker.sock:/var/run/docker.sock",
"/var/lib/rancher:/var/lib/rancher"
],
“ContainerIDFile”: “”,
“LxcConf”: [],
“Memory”: 0,
“MemoryReservation”: 0,
“MemorySwap”: 0,
“KernelMemory”: 0,
“CpuShares”: 0,
“CpuPeriod”: 0,
“CpusetCpus”: “”,
“CpusetMems”: “”,
“CpuQuota”: 0,
“BlkioWeight”: 0,
“OomKillDisable”: false,
“MemorySwappiness”: null,
“Privileged”: true,
“PortBindings”: {},
“Links”: null,
“PublishAllPorts”: false,
“Dns”: null,
“DnsOptions”: null,
“DnsSearch”: null,
“ExtraHosts”: null,
“VolumesFrom”: null,
“Devices”: [],
“NetworkMode”: “bridge”,
“IpcMode”: “”,
“PidMode”: “”,
“UTSMode”: “”,
“CapAdd”: null,
“CapDrop”: null,
“GroupAdd”: null,
“RestartPolicy”: {
“Name”: “no”,
“MaximumRetryCount”: 0
},
“SecurityOpt”: null,
“ReadonlyRootfs”: false,
“Ulimits”: null,
“LogConfig”: {
“Type”: “json-file”,
“Config”: {}
},
“CgroupParent”: “”,
“ConsoleSize”: [
0,
0
],
“VolumeDriver”: “”
},
“GraphDriver”: {
“Name”: “devicemapper”,
“Data”: {
“DeviceId”: “66”,
“DeviceName”: “docker-253:16-105644035-43e7bd81706651478207feda7c36a3373614fae30643de79dfa516826f87af3c”,
“DeviceSize”: “107374182400”
}
},
“Volumes”: {
"/var/lib/rancher": “/var/lib/rancher”,
"/var/run/docker.sock": “/var/run/docker.sock”
},
“VolumesRW”: {
"/var/lib/rancher": true,
"/var/run/docker.sock": true
},
“Config”: {
“Hostname”: “43e7bd817066”,
“Domainname”: “”,
“User”: “”,
“AttachStdin”: false,
“AttachStdout”: false,
“AttachStderr”: false,
“Tty”: false,
“OpenStdin”: false,
“StdinOnce”: false,
“Env”: [
“CATTLE_AGENT_IP=192.168.0.183”,
“PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin”,
“HOST_DOCKER_SOCK=/var/run/docker.sock”,
“RANCHER_AGENT_IMAGE=rancher/agent:v0.11.0”
],
“Cmd”: [
http://ep004.dev.fwmrm.net:8080/v1/scripts/B6706404F3D8EAF7C0DA:1460019600000:nJpRiZ8Yhzn6ITb3rnO7S3dib78”,
“debug”
],
“Image”: “rancher/agent:v0.11.0”,
“Volumes”: null,
“WorkingDir”: “”,
“Entrypoint”: [
"/run.sh"
],
“OnBuild”: null,
“Labels”: {
“io.rancher.container.system”: “rancher-agent”
},
“MacAddress”: “”,
“NetworkDisabled”: false,
“ExposedPorts”: null,
“VolumeDriver”: “”,
“Memory”: 0,
“MemorySwap”: 0,
“CpuShares”: 0,
“Cpuset”: “”
},
“NetworkSettings”: {
“Bridge”: “”,
“SandboxID”: “17408ba2c83cf1299b3d097e3a8af27515934b842a0111686eb1f06f892f22a4”,
“HairpinMode”: false,
“LinkLocalIPv6Address”: “”,
“LinkLocalIPv6PrefixLen”: 0,
“Ports”: {},
“SandboxKey”: “/var/run/docker/netns/17408ba2c83c”,
“SecondaryIPAddresses”: null,
“SecondaryIPv6Addresses”: null,
“EndpointID”: “41c3fc32f642d1edee95a34fb660510e376d79bdecd84b7cc599d53dbe48f390”,
“Gateway”: “172.17.42.1”,
“GlobalIPv6Address”: “”,
“GlobalIPv6PrefixLen”: 0,
“IPAddress”: “172.17.0.1”,
“IPPrefixLen”: 16,
“IPv6Gateway”: “”,
“MacAddress”: “02:42:ac:11:00:01”
}
}

I use the command copied from the GUI to run the agent.

The /var/log/rancher folder doesn’t exist at all. I think it might mean the main process of the agent didn’t get started at all.

Have you read through the troubleshooting docs on agents and registration?

http://docs.rancher.com/rancher/faqs/agents/#how-to-check-your-host-registration-is-set-correctly

On the host that you are running the rancher/agent, have you tried curling your rancher-server? Based on the command of your rancher/agent, it would be:

curl -i http://ep004.dev.fwmrm.net:8080/v1/

Really need help here.

Yes I have followed the instructions on the troubleshooting docs. The really weird thing is that there was no log output from Rancher agent container, no clue what happened there. I looked into the server logs, nothing looks suspicious.

I destroyed the previous Rancher setup and started over again, still I met the same issue.

curl looks good:

curl http://192.168.14.75:8080/v1/ {"id":"9c2cc064-57af-42cf-84c5-9d747ebb084d","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null}

Server logs:
time="2016-04-26T20:46:26Z" level=info msg="Exiting rancher-compose-executor" version=v0.7.4 time="2016-04-26T20:46:28Z" level=info msg="Starting rancher-compose-executor" version=v0.7.4 time="2016-04-26T20:46:28Z" level=info msg="Initializing event router" workerCount=10 time="2016-04-26T20:46:28Z" level=info msg="Connection established" time="2016-04-26T20:48:04Z" level=info msg="Exiting go-machine-service..." time="2016-04-26T20:48:06Z" level=info msg="Setting log level" logLevel=info time="2016-04-26T20:48:06Z" level=info msg="Starting go-machine-service..." gitcommit=dc97268 time="2016-04-26T20:48:06Z" level=info msg="Updating docker-machine-drivers from cattle." time="2016-04-26T20:48:06Z" level=info msg="Active driver packet currently installed at docker-machine-driver-packet." time="2016-04-26T20:48:06Z" level=info msg="Active driver ubiquity currently installed at docker-machine-driver-ubiquity." time="2016-04-26T20:48:07Z" level=error msg="Error from driver: Error dialing to plugin server's address(), err=dial tcp: missing address" time="2016-04-26T20:48:07Z" level=fatal msg="Encourted an error while updating drivers." time="2016-04-26T20:48:08Z" level=info msg="Setting log level" logLevel=info time="2016-04-26T20:48:08Z" level=info msg="Starting go-machine-service..." gitcommit=dc97268 time="2016-04-26T20:48:08Z" level=info msg="Updating docker-machine-drivers from cattle." time="2016-04-26T20:48:08Z" level=info msg="Active driver ubiquity currently installed at docker-machine-driver-ubiquity." time="2016-04-26T20:48:08Z" level=info msg="Active driver packet currently installed at docker-machine-driver-packet." time="2016-04-26T20:48:19Z" level=info msg="Started setting and driver watcher." time="2016-04-26T20:48:19Z" level=info msg="Initializing event router" workerCount=10 time="2016-04-26T20:48:19Z" level=info msg="Connection established"

But looks like there was no log in the agent container:
docker cp 184a7d712931:/var/log/rancher . Error response from daemon: lstat /export/docker/devicemapper/mnt/184a7d712931f45038956c723cc8300c4ef91cc17af52f070bad4d50f46e0809/rootfs/var/log/rancher: no such file or directory

BTW, I am running the docker daemon under a service account, not under root or sudo.

Latest update:

I cloned the Rancher project from github and added some debug information in ENTRPOINT run.sh. It turns out the agent container was stuck at docker inspect. This line:

RUNNING_IMAGE="$(docker inspect -f '{{.Config.Image}}' ${CONTAINER})"

It just stuck there like forever.

Then I modified the Docker file and built the agent container by myself for further debugging. I managed to manually run this line with strace. It turns out the process stuck at a system call while waiting for a response from socket write:

mmap(0xc208500000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc208500000 mmap(0xc207fa0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc207fa0000 ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {c_iflags=0x500, c_oflags=0x5, c_cflags=0xbf, c_lflags=0x8a3b, c_line=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {c_iflags=0x500, c_oflags=0x5, c_cflags=0xbf, c_lflags=0x8a3b, c_line=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 stat("/root/.docker/config.json", 0xc2080467e0) = -1 ENOENT (No such file or directory) stat("/root/.dockercfg", 0xc208046870) = -1 ENOENT (No such file or directory) socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 6 setsockopt(6, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0 connect(6, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0 epoll_ctl(5, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3960682120, u64=140530995639944}}) = 0 getsockname(6, {sa_family=AF_LOCAL, NULL}, [2]) = 0 getpeername(6, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0 read(6, 0xc2084e5000, 4096) = -1 EAGAIN (Resource temporarily unavailable) write(6, "GET /v1.21/version HTTP/1.1\r\nHos"..., 100) = 100 epoll_wait(5, {{EPOLLOUT, {u32=3960682120, u64=140530995639944}}}, 128, 0) = 1 epoll_wait(5,

We currently don’t support running the docker daemon under a service account. We need root or sudo access.

Can you try using that account type and let me know if you still face the same issues?