Problem with DNS TTL

Hi, everybody:

We are using Rancher v1.6.2 with Kong as our API gateway. When we upgrade a service in Rancher, the container's IP changes, but Kong cannot resolve the new IP for several minutes. Running dig inside the container returns an A record with a 600s TTL.
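For anyone who wants to check the TTL from a script rather than by eye, here is a minimal sketch that pulls the TTL out of dig's answer section. The sample output, service name, and IP below are made up for illustration; only the `NAME TTL CLASS TYPE RDATA` layout of an answer line is assumed.

```python
def parse_answer_ttl(dig_output, record_type="A"):
    """Return (name, ttl_seconds, value) for the first answer of the given type, or None."""
    for line in dig_output.splitlines():
        line = line.strip()
        # Skip blank lines and dig's comment/question lines, which start with ';'.
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        # Answer lines look like: NAME TTL CLASS TYPE RDATA
        if len(fields) >= 5 and fields[2] == "IN" and fields[3] == record_type:
            return fields[0], int(fields[1]), fields[4]
    return None


# Hypothetical dig output; the name and address are placeholders, not real values.
sample = """
;; QUESTION SECTION:
;myservice.mystack.rancher.internal. IN A

;; ANSWER SECTION:
myservice.mystack.rancher.internal. 600 IN A 10.42.0.5
"""

result = parse_answer_ttl(sample)
print(result)
```

A TTL of 600 here would match the 10-minute delay described above.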

I did some searching and found this PR. It seems rancher-dns changed the default TTL to 1s some time ago, so I am confused about why I am getting a record with a 600s TTL.

The related part of the Kong docs is here. Basically, Kong just uses the external DNS server (which is rancher-dns) to resolve the domain part of upstream_url.

I just noticed that the catalog version of network-services is 0.2.1, and the TTL can be user-defined as of version 0.2.3.
I'll try triggering an upgrade first.

Replying to myself: after upgrading stacks/infrastructure/network-services to 0.2.3, the DNS record is normal, with a default TTL of 10s.

The problem is solved.

There are a few related bugs here:

At some point the default TTL for authoritative records (*.rancher.internal) became 5 min instead of 1 second. That’s being made configurable in the catalog item and defaulted back to a second.

Then there's a cache in the DNS server, which currently ignores the TTL of the record it's given and saves it for a set amount of time. It also returns an exact copy of the record from cache every time, instead of decrementing the TTL to reflect the remaining time. The net effect is that the effective TTL is 20 min worst-case today (10 min in cache, and then a 10 min TTL on it if your client does its own caching). This is a little more involved and will be fixed in a future release, but defaulting to 1 sec mitigates much of the effect for now.
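To make the 20 min worst case concrete, here is a rough sketch (not the actual rancher-dns code) comparing a cache that returns the stored record unchanged for a fixed time with one that decrements the TTL. The record name, address, and function names are hypothetical; the 600s hold time and 600s TTL come from the numbers above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    address: str
    ttl: int  # seconds the receiver is allowed to keep the record


def fixed_cache_answer(record, stored_at, now, hold_seconds):
    """Behaviour described above: serve the stored copy unchanged for a set time."""
    if now - stored_at >= hold_seconds:
        return None  # entry has aged out of the server cache
    return record  # TTL is never reduced


def ttl_aware_answer(record, stored_at, now):
    """Alternative: serve only the TTL that is actually left on the record."""
    remaining = record.ttl - (now - stored_at)
    if remaining <= 0:
        return None
    return Record(record.name, record.address, remaining)


# Hypothetical record, cached by the DNS server at t=0.
record = Record("myservice.mystack.rancher.internal", "10.42.0.5", 600)
ask_at = 599  # client asks just before the server cache would drop the entry

fixed = fixed_cache_answer(record, stored_at=0, now=ask_at, hold_seconds=600)
aware = ttl_aware_answer(record, stored_at=0, now=ask_at)

# A caching client keeps the answer for the TTL it was handed.
fixed_stale_until = ask_at + fixed.ttl
aware_stale_until = ask_at + aware.ttl

print(fixed_stale_until, aware_stale_until)
```

With the fixed-time cache the client can hold the old IP until roughly t=1200s, which is the 10 min + 10 min case; with a decrementing TTL it would give it up at t=600s.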