Default logstash indexer service fails

Hello,

I’ve been trying to deploy an ELK stack using the latest rancher-catalog and Rancher version (non-OS). I pointed the logstash service at my elasticsearch client, but the configuration sidekick for the indexer fails when loading data, presumably from rancher-metadata, which works fine in other custom stacks and standalone containers.

2/23/2016 3:02:09 PM 2016-02-23T20:02:09Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Backend set to rancher
2/23/2016 3:02:09 PM 2016-02-23T20:02:09Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Starting confd
2/23/2016 3:02:09 PM 2016-02-23T20:02:09Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Backend nodes set to
2/23/2016 3:02:09 PM 2016-02-23T20:02:09Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Using Rancher Metadata URL: http://rancher-metadata
2/23/2016 3:02:40 PM 2016-02-23T20:02:40Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: FATAL Get http://rancher-metadata/: dial tcp: lookup rancher-metadata on 10.108.114.2:53: no such host
2/23/2016 3:02:44 PM 2016-02-23T20:02:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Backend set to rancher
2/23/2016 3:02:44 PM 2016-02-23T20:02:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Starting confd
2/23/2016 3:02:44 PM 2016-02-23T20:02:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Backend nodes set to
2/23/2016 3:02:44 PM 2016-02-23T20:02:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Using Rancher Metadata URL: [removed http: due to forum limit] rancher-metadata
2/23/2016 3:02:44 PM 2016-02-23T20:02:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: ERROR template: logstash.conf.tmpl:2:6: executing "logstash.conf.tmpl" at <getv "/self/service/…>: error calling getv: key does not exist
2/23/2016 3:02:44 PM 2016-02-23T20:02:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Target config /opt/logstash/patterns/extra out of sync
2/23/2016 3:02:44 PM 2016-02-23T20:02:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: INFO Target config /opt/logstash/patterns/extra has been updated
2/23/2016 3:12:44 PM 2016-02-23T20:12:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: ERROR template: logstash.conf.tmpl:2:6: executing "logstash.conf.tmpl" at <getv "/self/service/…>: error calling getv: key does not exist
2/23/2016 3:22:44 PM 2016-02-23T20:22:44Z logstash_logstash-indexer_logstash-indexer-config_1 /confd[1]: ERROR template: logstash.conf.tmpl:2:6: executing "logstash.conf.tmpl" at <getv "/self/service/…>: error calling getv: key does not exist

As a result, the logstash config is never written inside the container, which sends it into a restart loop.

Thanks in advance.

Update:

I removed the host (coreos-2) from the Rancher UI, which forced the stack to be installed on coreos-1, where it worked fine. I then reinstalled coreos-2 as a completely bare host and re-added it to Rancher. Now the stack fails to deploy on both coreos-1 and coreos-2. Anything using confd in the catalog is broken.

Very frustrating. It seems rancher-internal DNS is not close to production-ready? Or is my host setup wrong? I can launch an Ubuntu container and the rancher-metadata URL works fine, as it does in other containers I’ve built. The problem is with rancher-confd.

Thanks

I have a slight hunch that delays in rancher-internal DNS may be causing some services to fail… It’s one explanation for why some catalog services deploy “erratically” (I have had issues where deleting and recreating the service worked…)

I had an issue in the past where the container started before internal DNS was updated, which caused the container to fail (it looked up its dependency and could not resolve the name), and it would keep failing from then on…

A simple 5s sleep in the entrypoint script fixed it (albeit an ugly fix)

A somewhat less-ugly fix is something like while ! ping -c1 svc >/dev/null; do sleep 0.5; done, or nc to see that a port is actually open… this has the advantage of also seeing that the thing you want to talk to is actually there, which is the next issue once you can resolve its name.
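
For images where a shell loop is awkward, the same wait-until-reachable idea can be sketched in Python. This is only an illustration: the function name `wait_for_service` and the timeout and interval defaults are made up for the example, and it assumes the dependency accepts plain TCP connections on a known port.

```python
import socket
import time


def wait_for_service(host, port, timeout=30.0, interval=0.5):
    """Block until `host` resolves and accepts a TCP connection on `port`.

    Returns True on success, False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the name before connecting, so a
            # DNS failure (socket.gaierror, a subclass of OSError) is
            # retried in the same way as a refused connection.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical use in an entrypoint, before starting the main process:
#   if not wait_for_service("rancher-metadata", 80):
#       raise SystemExit("rancher-metadata never became reachable")
```

Like the ping/nc loop, this checks both that the name resolves and that something is listening, but it gives up after a bounded time instead of looping forever.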

Yes, you are 100% correct @vincent… In my case, changing the nginx config to ignore failures worked just as well… The healthcheck on the service would weed out any problems with the link “actually being there”, but that’s only my case…

What I meant by “ugly” was that I had to add an entrypoint shell script instead of calling the main process directly… but it’s no biggie…

Wouldn’t a more “foolproof” approach be to add some sort of validation to the container start sequence? (I fixed my own use cases, but some standard images still fail to load because they have no delay and no entrypoint script where one could easily be added… just a thought)

We can (and will, someday) ensure that resolv.conf and the DNS server are updated and responding correctly before the container starts. There used to be no good way to hook into the container start process, but one was added to Docker a while ago. We can’t really ensure that linked services are actually started and available, though (well, unless they have healthchecks…)

Yeah, that would be the best… Healthchecking should be the responsibility of each service imo (as it is today)… One thing that would be neat is to fail a container’s healthcheck if a dependency’s healthcheck fails…

Arguably this can (and perhaps should?) be implemented in the application, i.e. container A’s healthcheck includes checking for Service B… What could add intelligence/edge to Rancher is if it were aware of which services are dependencies and marked a service as failed when its dependencies fail a healthcheck…

An example that illustrates this well could be:

  • webserver service with a db service as a dependency.
    – this webserver serves dynamic content from a data container.
  • unless your webserver health check is looking specifically at a successful rendition of an answer from the DB (which arguably it should), it will stay “up” but the application will show errors…

Also, the above strategy of “dependent health checks” would make things simpler if we want to re-use the same image with a different data volume (such as a static content server) that doesn’t depend on an external service… (it wouldn’t require a different healthcheck definition based on the role it’s taking…)

Complex stuff :stuck_out_tongue: but this sort of intelligence to aid application deployment and monitoring is what I’d really look for in a system like Rancher (and that’s why I’m always voting to use the Docker ecosystem for everything possible and focus on the “hard” stuff lol)
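
To make the “dependent health checks” idea concrete, here is a rough Python sketch. It is not something Rancher provides; the function `effective_health`, the service names and the two dictionaries are all hypothetical, and each service’s own check is reduced to a callable returning True or False.

```python
from typing import Callable, Dict, List, Optional, Set

Check = Callable[[], bool]


def effective_health(
    service: str,
    own_checks: Dict[str, Check],
    depends_on: Dict[str, List[str]],
    _visiting: Optional[Set[str]] = None,
) -> bool:
    """A service is healthy only if its own check passes AND every
    service it depends on is (recursively) healthy too."""
    visiting = _visiting if _visiting is not None else set()
    if service in visiting:
        # Already being evaluated further up the call chain; skip it so a
        # dependency cycle cannot recurse forever.
        return True
    visiting.add(service)
    if not own_checks[service]():
        return False
    return all(
        effective_health(dep, own_checks, depends_on, visiting)
        for dep in depends_on.get(service, [])
    )


# Same image in two roles: "web" needs the db, "static" needs nothing.
checks = {"web": lambda: True, "static": lambda: True, "db": lambda: False}
deps = {"web": ["db"]}

web_ok = effective_health("web", checks, deps)        # db is down
static_ok = effective_health("static", checks, deps)  # no dependencies
```

With the db check failing, `web_ok` is False even though the webserver’s own check passes, while `static_ok` stays True with the very same check definition.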

I just got the same error with a Galera cluster.

It shows up when I try to expand the cluster or create a new one.

27.02.2018 17:33:25 2018-02-27T16:33:25Z galeratest123-galera-1 /confd[1]: INFO Backend set to rancher
27.02.2018 17:33:25 2018-02-27T16:33:25Z galeratest123-galera-1 /confd[1]: INFO Starting confd
27.02.2018 17:33:25 2018-02-27T16:33:25Z galeratest123-galera-1 /confd[1]: INFO Backend nodes set to
27.02.2018 17:33:25 2018-02-27T16:33:25Z galeratest123-galera-1 /confd[1]: INFO Using Rancher Metadata URL: http://rancher-metadata
27.02.2018 17:33:25 2018-02-27T16:33:25Z galeratest123-galera-1 /confd[1]: ERROR template: galera.cnf.tmpl:2:14: executing "galera.cnf.tmpl" at <getv "/self/containe…>: error calling getv: key does not exist


The strange part is that it fails on getv, which it should not, because there is an IF statement that should skip the getv if the key doesn’t exist. Or am I missing something?

Rancher v1.6.14