ELK Stack Demo Issue

I am trying to set up the ELK stack that Rancher demoed this week, using the templates in https://github.com/rancher/compose-templates

I am running on GCE. I have the Rancher cluster set up and launched the ES templates, and they all came up successfully (once I opened firewall rules for ports 8080, 80 and 443 on the nodes). When I try to access Kopf, the screen doesn’t show any ES instances and the following error is logged:

9/18/2015 12:44:01 PM 2015/09/18 16:44:01 [error] 10#10: *33 upstream timed out (110: Connection timed out) while connecting to upstream, client: <my laptop IP>, server: es.dev, request: "GET /es/ HTTP/1.1", upstream: "", host: "<node public IP>", referrer: "http://<node public IP>/"

I have a firewall rule set up to allow all TCP/UDP traffic through on the private network in GCE. Any pointers on how I can troubleshoot further?

@coleca, if you drop into a shell inside the Kopf container, either through the Rancher UI or with docker exec, can you ping es-clients? Also, can you curl http://es-clients:9200 and get the Elasticsearch JSON response? The traffic is sent over the Rancher network and is set up via the links. I'm not sure if you tweaked the templates, but if so, the KOPF_ES_SERVERS env var on the Kopf service needs to equal the link name (the default is es-clients).

The client is what makes the request to Elasticsearch, but it’s going through the proxy on port 80.
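If curl is not available in the container, the same reachability check can be sketched in Python. This is a minimal sketch, assuming a Python 3 interpreter is present wherever you run it; `check_tcp` is a hypothetical helper name, and the host and port are the link name and Elasticsearch HTTP port mentioned above.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Calling `check_tcp("es-clients", 9200)` from inside the Kopf container should return True if the link resolves and the client node accepts connections. A False result does not say which step failed, so ping and curl are still worth running.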

I don’t know if this is related, but I recently had to open up some ICMP rules. It seems there may have been a ping check added, not sure.

I created a new Rancher cluster and re-composed the ES stack.

I can ping / curl to es-clients.

This is the output of the curl:

    root@es_kopf_1:/# curl http://es-clients:9200
    {
      "status" : 503,
      "name" : "es_elasticsearch-clients_1",
      "cluster_name" : "logs",
      "version" : {
        "number" : "1.7.1",
        "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
        "build_timestamp" : "2015-07-29T09:54:16Z",
        "build_snapshot" : false,
        "lucene_version" : "4.10.4"
      },
      "tagline" : "You Know, for Search"
    }

I did add a ping rule per the last suggestion. Now Kopf does come up, but it says “No active master, switching to basic mode”. I am using the templates exactly as they are on GitHub and didn’t make any changes to them.

PS. One other note: when you run rancher-compose, is it ever supposed to complete, or is the default behavior to just tail -f the Docker logs? I have never seen it finish, even though all the services are up and green; I always have to CTRL-C it.
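The "status" : 503 in the curl output above fits the Kopf message: as far as I know, an Elasticsearch 1.x node reports 503 on its root endpoint while it has no elected master, and 200 once it has joined a cluster. A minimal sketch of that check, assuming the response body is passed in as a string (`has_master` is a hypothetical helper name):

```python
import json


def has_master(root_body):
    """Interpret the JSON body returned by GET / on an Elasticsearch 1.x node."""
    info = json.loads(root_body)
    # Assumption: a 503 here means the node is up but sees no elected master.
    return info.get("status") == 200
```

With the response above this returns False, so the client node is reachable but has not found the masters.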

Thanks again for your help!

I figured out the issue. It’s because Rancher isn’t able to use more than 1 host. I killed off the cluster a few times and tried different operating systems (Ubuntu and CoreOS), and it only works if I create a 1-node cluster. Looking at the logs on the Network Agent, I am getting the same errors as in this GitHub issue:

Rebooting the hosts makes no difference (it actually makes things worse, because some of the containers won’t come up after the restart). Note: the racoon error still comes up with one node, but I guess it doesn’t matter since all the containers are forced to live on one node and that works fine.

Has there been any progress on fixing this issue to enable Rancher to use multiple hosts?

Let me know if there is any debugging info I can provide to help troubleshoot this issue.

Regarding rancher-compose, you need to pass -d to up so that it returns instead of staying attached to the logs.

@coleca, Rancher can use multiple hosts; the demo/video we did is on 3 nodes. With the templates as they are today, you cannot have more than one Elasticsearch datanode per host, because they bind mount a volume on the server. If you were running into that, you would see failures coming from the Elasticsearch datanodes. You could check that with rancher-compose -p <stack> logs elasticsearch-datanodes, or in the UI you could see the logs from the fly out. In the UI you would also see the containers starting/stopping a lot.

Still, that doesn’t explain the missing master node. Are there any error messages in the elasticsearch-masters logs?


I think the issue was firewall rules. I found a reference in another ticket saying that Rancher also needs UDP rules opened on the public Ethernet adapter (ports 500 and 4500) or the containers can’t talk to each other. I had rules wide open on the private Ethernet adapters but I guess it only works on the public segment?

Is there a list of the ports / protocols that are needed for Rancher and this demo?

At a minimum you need UDP ports 500 and 4500 open to enable containers to talk to one another. The rules for this stack are port 80 (and optionally 443 if you have SSL enabled) for Kibana and Kopf. If you expose Logstash outside of Rancher, then you also need to open the ports required to take in logs. Out of the box, Logstash listens on port 5000/udp and should only be reachable on the Rancher network.

Your experience makes a bit more sense now. If the nodes couldn’t talk to each other, you were likely only able to reach an ES node on the same host as Kopf.
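To compare a host's firewall rules against the ports in this reply, a small check like the following may help. It is only a sketch: the table holds the ports named in this thread, not an official or complete Rancher list, and `REQUIRED` and `missing_ports` are hypothetical names.

```python
# Ports named in this thread; not an official or complete Rancher list.
REQUIRED = {
    ("udp", 500): "cross-host container traffic",
    ("udp", 4500): "cross-host container traffic",
    ("tcp", 80): "Kibana and Kopf",
}


def missing_ports(open_rules):
    """Return the required (protocol, port, reason) entries not covered.

    open_rules is an iterable of (protocol, first_port, last_port) tuples.
    """
    missing = []
    for (proto, port), reason in REQUIRED.items():
        covered = any(
            rule_proto == proto and first <= port <= last
            for rule_proto, first, last in open_rules
        )
        if not covered:
            missing.append((proto, port, reason))
    return missing
```

For the rules described at the top of the thread (TCP 8080, 80 and 443 only), this reports UDP 500 and UDP 4500 as missing.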

I am also having issues with this demo, and mine are also with Kopf. However the problems I am facing are with Kopf not ‘finding’ the configuration file.

The compose output shows:

elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144122507Z at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:220)
elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144126285Z at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:373)
elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144129819Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144133297Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144136929Z at java.lang.Thread.run(Thread.java:745)
elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144140582Z [2015-09-25 00:08:02,143][WARN ][transport.netty ] [es_elasticsearch-datanodes_1] exception caught on transport layer [[id: 0xdc3b2a8b]], closing connection
elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144144299Z java.nio.channels.UnresolvedAddressException
elasticsearch-datanodes_1 | 2015-09-25T00:08:02.144147696Z at sun.nio.ch.Net.checkAddress(Net.java:101)

and the logs from the Kopf container:

2015/09/24 23:19:10 [emerg] 1#1: host not found in upstream “es-clients:9200” in /etc/nginx/nginx.conf:25
nginx: [emerg] host not found in upstream “es-clients:9200” in /etc/nginx/nginx.conf:25
Error: [Errno 2] No such file or directory: ‘/etc/nginx/nginx.conf.tpl’

I have LOTS of ports open:

TCP 22, 80, 443, 5000, 5601, 8500, 8600, 9090, 9200
UDP 0 - 65535

Am I missing something?

The Kopf container has to be destroyed and recreated. To create the container we are using the upstream Kopf Dockerfile, with a couple of mods that were submitted but haven’t been accepted upstream. It’s using a Python template instead of confd like the others.

The easiest thing to do is run rancher-compose -p es rm -f kopf, then recreate it with rancher-compose -p es up kopf

I followed your instructions, but still receive:

9/30/2015 12:04:50 AM 2015/09/30 04:04:30 [emerg] 1#1: host not found in upstream “es-clients:9200” in /etc/nginx/nginx.conf:25
9/30/2015 12:04:50 AM nginx: [emerg] host not found in upstream “es-clients:9200” in /etc/nginx/nginx.conf:25
9/30/2015 12:04:52 AM Error: [Errno 2] No such file or directory: ‘/etc/nginx/nginx.conf.tpl’

I am still not sure what I am doing wrong.

@tdensmore, it looks like the DNS entries are not getting created or are not resolving. Is the master node coming up in your configuration? If so, can you launch a container on another host through the Rancher UI and link it to the master, to verify that you can ping the container?

You should be able to curl the master node and get a response. If you can’t, can you check the logs on the network agent for each of your hosts? What version of Rancher and of the rancher-compose client are you using?

If you can prove the master is up and reachable from another host, then I’d try launching the datanodes.
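The "host not found in upstream" error means nginx could not resolve es-clients when it started, so it is worth testing name resolution separately from connectivity. A minimal sketch, assuming Python 3 is available in the container you test from (`resolve` is a hypothetical helper name):

```python
import socket


def resolve(name):
    """Return the sorted IP addresses that name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # The name does not resolve from this container.
        return []
    # Each entry ends with a sockaddr tuple whose first item is the address.
    return sorted({info[4][0] for info in infos})
```

If `resolve("es-clients")` returns an empty list inside a linked container, the link's DNS entry is missing, which matches the nginx error.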

Thanks for the help, but unfortunately I have run out of time on my research spike.
You can close this issue. I will pick up on Rancher again in a few weeks.