Rancher just wiped my Route 53 Zone?

Hi,

Been using Route 53 external DNS with great success for a couple of weeks now. However, today my whole environment has failed because the entries I created in my Route 53 hosted zone (as well as the ones Rancher automatically creates) have completely disappeared. The hosted zone now has only the default set of records as if it was freshly created.

If I check the container logs I see many errors like these:

1/12/2016 4:44:47 PM time="2016-01-12T16:44:47Z" level=info msg="Adding dns record: {public.public..mydomain.com. [178.62.20.162 213.52.128.100] A 300}"
1/12/2016 4:44:47 PM time="2016-01-12T16:44:47Z" level=error msg="Failed to add DNS record to provider {public.public..mydomain.com. [178.62.20.162 213.52.128.100] A 300}: Request failed, got status code: 400. Response: <?xml version=\"1.0\"?>\n<ErrorResponse xmlns=\"https://route53.amazonaws.com/doc/2013-04-01/\"><Error><Type>Sender</Type><Code>InvalidChangeBatch</Code><Message>FATAL problem: DomainLabelEmpty encountered at public.public..mydomain.com</Message></Error><RequestId>ca9d2527-b94b-11e5-a084-f7f3c04e7250</RequestId></ErrorResponse>"

It looks like the environment name (the default one) is now missing, and the records it is trying to create are therefore invalid.
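To illustrate what seems to be happening (this is only a sketch, not the actual external-dns code; buildFqdn and hasEmptyLabel are made-up names): the record name looks like it is assembled as service.stack.environment.rootDomain, so an empty environment name leaves an empty label, which is the double dot Route 53 rejects with DomainLabelEmpty.

    package main

    import (
        "fmt"
        "strings"
    )

    // buildFqdn is an illustrative helper, not the actual external-dns code:
    // it joins the service, stack and environment names with the root domain,
    // the same shape as the records Rancher creates (service.stack.env.root.).
    func buildFqdn(service, stack, environment, rootDomain string) string {
        return fmt.Sprintf("%s.%s.%s.%s.", service, stack, environment, rootDomain)
    }

    // hasEmptyLabel reports whether any label in the FQDN is empty, which is
    // exactly what Route 53 rejects with "DomainLabelEmpty".
    func hasEmptyLabel(fqdn string) bool {
        for _, label := range strings.Split(strings.TrimSuffix(fqdn, "."), ".") {
            if label == "" {
                return true
            }
        }
        return false
    }

    func main() {
        // With the environment name present the record name is fine:
        fmt.Println(buildFqdn("public", "public", "default", "mydomain.com"))
        // -> public.public.default.mydomain.com.

        // With an empty environment name you get the double dot from the logs:
        broken := buildFqdn("public", "public", "", "mydomain.com")
        fmt.Println(broken, "empty label?", hasEmptyLabel(broken))
        // -> public.public..mydomain.com. empty label? true
    }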

Restarting the container makes no difference. Nothing major has changed with my Rancher setup since I set up Route 53.

What’s even more worrying is that Rancher deleted the CNAMEs that I created.

Surely Rancher shouldn’t be able to delete DNS records that it didn’t create? This obviously creates huge issues for any production environment.

Any ideas what has happened and how I might get it working again?

Component          Version
Rancher            v0.51.0
Cattle             v0.130.0
User Interface     v0.78.0
Rancher Compose    v0.7.0

I had to delete and recreate the entire stack from the catalog to get it working again.

Surely Rancher is not meant to delete the www.mydomain.com CNAME records one manually creates to point to the Rancher-generated www.website.default.mydomain.com A records, under any circumstances.

Is this a bug or am I doing something wrong?

@djskinner

Surely Rancher is not meant to delete the www.mydomain.com CNAME records one manually creates to point to the Rancher-generated www.website.default.mydomain.com

That’s a side effect of the bug where the environment name is null/empty. Rancher manages envName.rootDomain A records on Route53, but it should skip programming Route53 if the environment name is null for whatever reason. This bug is easy to fix; I’ll work on it.
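Roughly, the guard would look something like the sketch below (illustrative only; Metadata and syncRecords are hypothetical names, not the actual external-dns code):

    package main

    import (
        "fmt"
        "log"
    )

    // Metadata holds the fields the provider would read from Rancher metadata.
    // This is an illustrative type, not the actual external-dns structure.
    type Metadata struct {
        EnvironmentName string
        RootDomain      string
    }

    // syncRecords sketches the guard described above: if the environment name
    // comes back null/empty from metadata, skip programming Route53 entirely
    // instead of building records with an empty label.
    func syncRecords(m Metadata) error {
        if m.EnvironmentName == "" {
            log.Printf("environment name is empty; skipping Route53 update")
            return nil
        }
        // ... build service.stack.<env>.<rootDomain> records and push them ...
        fmt.Printf("would program records under %s.%s\n", m.EnvironmentName, m.RootDomain)
        return nil
    }

    func main() {
        _ = syncRecords(Metadata{EnvironmentName: "", RootDomain: "mydomain.com"})        // skipped
        _ = syncRecords(Metadata{EnvironmentName: "default", RootDomain: "mydomain.com"}) // programmed
    }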

The original bug you are describing is pretty bad. I would need more information from you to debug it. Could you upload a gist with the results of the following mysql statements:

  • select * from service
  • select * from stack
  • select id, state, removed from account

Might it be safer (or even preferable) to have Rancher manage only a sub-domain (e.g. stack.mydomain.com)?

Thanks for getting back to me @alena. I’ll see what I can do about getting you the debug info. Will it be a problem that I already deleted and recreated the stack?

Thanks for getting back to me @alena. I’ll see what I can do about getting you the debug info. Will it be a problem that I already deleted and recreated the stack?

It shouldn’t be a problem; we use soft remove when removing service/stack records, so they should still be present in the DB along with the removed_time timestamp, which should be enough for me to debug.

Might it be safer (or even preferable) to have Rancher manage only a sub-domain (e.g. stack.mydomain.com)?

It would, but the case where the entire stack gets removed wouldn’t be covered, and we wouldn’t be able to clean up the stack’s records on Route53, as Rancher metadata doesn’t have records for removed resources.

The only case that is not covered is when the environment gets removed. In that case, the user has to clean up the Route53 records manually.

@djskinner could you also include the “name” field in the “select id, state, removed from account” query?

Apologies, I haven’t had the chance to get this data to you, but I’ve just discovered it has happened again!

After deleting the stack, starting from scratch, and manually re-adding all the wiped CNAMEs, I have just discovered the hosted zone is once again wiped back to only the basic record sets.

@djskinner if you haven’t re-created the stack yet, could you fetch the following information for me:

Get the /var/lib/cattle/etc/cattle/metadata/answers.json file from the network-agent container of the host where your route53 container is running.
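If it helps, one way to sanity-check that file once you have copied it out of the container is to walk it for empty string values. The sketch below assumes only that answers.json is plain JSON; the checker is a standalone, hypothetical helper, not anything from the Rancher codebase.

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    func main() {
        // Read the copy of answers.json from the current directory.
        data, err := os.ReadFile("answers.json")
        if err != nil {
            fmt.Fprintln(os.Stderr, "read:", err)
            os.Exit(1)
        }

        // Decode into a generic map; no assumptions about the schema.
        var answers map[string]interface{}
        if err := json.Unmarshal(data, &answers); err != nil {
            fmt.Fprintln(os.Stderr, "parse:", err)
            os.Exit(1)
        }

        reportEmpty("", answers)
    }

    // reportEmpty walks nested JSON objects and prints the path of every
    // empty string value it finds, which is what you would expect to see
    // if the environment name is missing.
    func reportEmpty(prefix string, node map[string]interface{}) {
        for key, value := range node {
            path := key
            if prefix != "" {
                path = prefix + "." + key
            }
            switch v := value.(type) {
            case string:
                if v == "" {
                    fmt.Println("empty value at:", path)
                }
            case map[string]interface{}:
                reportEmpty(path, v)
            }
        }
    }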

See here: https://gist.github.com/djskinner/7068f4a5fb300a5aaa92

@djskinner I’ve reproduced the bug, thanks a lot for your help. I’ll put in a fix and release a new external-dns/route53 template tonight.

Awesome! Thanks a lot, I look forward to the fix.

PR reference:

Github ticket reference: