Cluster Logging Issues


#1

We have Rancher (2.1.7 HA cluster) set up to ship logs to an Elasticsearch (6.6.1) cluster elsewhere on our network. We are currently logging at the cluster level rather than the project level. Overall it seems to be working fine, but we have seen an anomaly a few times that I cannot figure out: a particular pod logs for a while, then some sort of event happens and it stops shipping out logs. The logs for the pod / container are still being written, as verified by looking at the log file in /var/log/containers/etc-etc or at the log console inside the Rancher GUI for the given pod.

When Elasticsearch is used as the target from Rancher, a fluentd and log-aggregator DaemonSet is deployed on the cluster to handle shipping the logs out. I reviewed the logs of the fluentd pod that sits on the worker node where the logs for a particular pod suddenly stop getting sent to our Elasticsearch cluster. Below is a snippet of the logs for this fluentd pod from right around the time that logs stop getting shipped over. From this point forward, the "failed to flush the buffer" message continues to pop up in this fluentd pod's log. I'm assuming this has something to do with the issue of the logs failing to ship over.

Any thoughts or ideas?

2019-03-12 15:12:21 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 27, pod_cache_api_updates: 27, id_cache_miss: 27 
2019-03-12 15:12:42 +0000 [info]: #0 following tail of /var/log/containers/kafka-connect-debezium-5fc4bc76c9-25nlm_kafka_kafka-connect-debezium-manager-062b0233ce29d6c8802f3583830526712419c3dcc814a021b4b5665b96b8c9cb.log 
2019-03-12 15:12:42 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method. 
2019-03-12 15:12:42 +0000 [info]: #0 following tail of /var/log/containers/kafka-connect-debezium-5fc4bc76c9-25nlm_kafka_kafka-connect-debezium-f36a3e00416803a9caefec59d9b679460db2166435f22a6918374c9f89767fe2.log 
2019-03-12 15:12:42 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method. 
2019-03-12 15:12:51 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 29, pod_cache_api_updates: 29, id_cache_miss: 29 
2019-03-12 15:13:20 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2019-03-12 15:13:21 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/transport/base.rb:205:in `__raise_transport_error' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/transport/base.rb:323:in `perform_request' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/client.rb:131:in `perform_request' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/elasticsearch-api-6.1.0/lib/elasticsearch/api/actions/bulk.rb:95:in `bulk' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.11.10/lib/fluent/plugin/out_elasticsearch.rb:621:in `send_bulk' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.11.10/lib/fluent/plugin/out_elasticsearch.rb:523:in `block in write' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.11.10/lib/fluent/plugin/out_elasticsearch.rb:522:in `each' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.11.10/lib/fluent/plugin/out_elasticsearch.rb:522:in `write' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin/output.rb:1096:in `try_flush' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin/output.rb:1329:in `flush_thread_run' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin/output.rb:439:in `block (2 levels) in start' 
  2019-03-12 15:13:20 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create' 
2019-03-12 15:13:21 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 29, pod_cache_api_updates: 29, id_cache_miss: 29 
2019-03-12 15:13:22 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2019-03-12 15:13:21 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:13:22 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:13:24 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2019-03-12 15:13:23 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:13:24 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:13:26 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2019-03-12 15:13:26 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:13:26 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:13:30 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2019-03-12 15:13:30 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:13:30 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:13:39 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2019-03-12 15:13:38 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:13:39 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:13:51 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 29, pod_cache_api_updates: 29, id_cache_miss: 29 
2019-03-12 15:13:55 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2019-03-12 15:13:54 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:13:55 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:14:21 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 29, pod_cache_api_updates: 29, id_cache_miss: 29 
2019-03-12 15:14:26 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2019-03-12 15:14:25 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:14:26 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:14:51 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 29, pod_cache_api_updates: 29, id_cache_miss: 29 
2019-03-12 15:15:21 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 29, pod_cache_api_updates: 29, id_cache_miss: 29 
2019-03-12 15:15:26 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2019-03-12 15:15:26 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:15:26 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:15:51 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 31, pod_cache_api_updates: 31, id_cache_miss: 31 
2019-03-12 15:16:21 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 31, pod_cache_api_updates: 31, id_cache_miss: 31 
2019-03-12 15:16:51 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 31, pod_cache_api_updates: 31, id_cache_miss: 31 
2019-03-12 15:17:21 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 31, pod_cache_api_updates: 31, id_cache_miss: 31 
2019-03-12 15:17:22 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2019-03-12 15:17:21 +0000 chunk="583e720cc180c91fc705e4ef762ab577" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] <html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx/1.13.12</center>\r\n</body>\r\n</html>\r\n" 
  2019-03-12 15:17:22 +0000 [warn]: #0 suppressed same stacktrace 
2019-03-12 15:17:51 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 4, namespace_cache_api_updates: 31, pod_cache_api_updates: 31, id_cache_miss: 31
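The 413 responses above come from the nginx proxy in front of Elasticsearch rejecting fluentd's bulk request as too large; because the same chunk (`583e720cc180c91fc705e4ef762ab577`) is retried forever and never shrinks, delivery stalls from that point on. One way to guard against this on the fluentd side is to cap the output buffer's chunk size so each bulk request stays under the proxy's body-size limit. Rancher generates the deployed fluentd config, so the snippet below is only an illustrative sketch of the relevant fluentd v1 buffer settings, not the actual file:

```
# Illustrative fluentd output config (assumed values, not Rancher's generated file)
<match **>
  @type elasticsearch
  host elasticsearch.example.com   # assumed endpoint
  port 9200
  <buffer>
    # Keep each flushed chunk (and thus each bulk request)
    # well under the proxy's client_max_body_size limit
    chunk_limit_size 8MB
  </buffer>
</match>
```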

#2

I just realized after a bit of digging that I likely need to adjust the ingress that's in front of my Elasticsearch cluster. I am going to attempt that as my next step and see where it lands me.

I am going to add this annotation and adjust the size as I see fit: nginx.ingress.kubernetes.io/proxy-body-size
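For reference, the annotation goes on the Ingress object fronting Elasticsearch and raises the `client_max_body_size` that the nginx ingress controller enforces. A partial sketch (the ingress name and size value here are illustrative, adjust to your own resource):

```yaml
# Partial Ingress manifest; only the annotation is the point here
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: elasticsearch          # illustrative name
  annotations:
    # Allow larger fluentd bulk requests through nginx;
    # requests above this size are rejected with 413
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
```

Setting the value to `"0"` disables the size check entirely, though an explicit limit is usually safer.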