Trouble connecting to Route53

11/9/2015 6:54:06 AM time="2015-11-09T12:54:06Z" level=fatal msg="Failed to list hosted zones: Get https://route53.amazonaws.com/2013-04-01/hostedzone/?maxItems=9223372036854775807: dial tcp: lookup route53.amazonaws.com on 169.254.169.250:53: no such host"

11/9/2015 6:55:03 AM time="2015-11-09T12:55:03Z" level=fatal msg="Failed to list hosted zones: Get https://route53.amazonaws.com/2013-04-01/hostedzone/?maxItems=9223372036854775807: dial tcp: lookup route53.amazonaws.com on 169.254.169.250:53: no such host"

Disconnected

When I do an nslookup on route53.amazonaws.com I get:
Non-authoritative answer:
Name: route53.amazonaws.com
Address: 54.239.19.226

Please advise.

@ebishop this looks very similar to https://github.com/rancher/rancher/issues/2528. The call to Route53 is made before the network is fully set up for the route53 instance. I’ll work on the fix; refer to https://github.com/rancher/rancher/issues/2621 for progress.

I agree, it looks similar. I’ll wait for your fix and then try again.

It looks like there is a new version, but I still get the following error over and over:

11/20/2015 12:44:18 PM time="2015-11-20T18:44:18Z" level=fatal msg="Failed to list hosted zones: Get https://route53.amazonaws.com/2013-04-01/hostedzone/?maxItems=9223372036854775807: dial tcp: lookup route53.amazonaws.com on 169.254.169.250:53: no such host"
11/20/2015 12:44:35 PM time="2015-11-20T18:44:35Z" level=fatal msg="Failed to list hosted zones: Get https://route53.amazonaws.com/2013-04-01/hostedzone/?maxItems=9223372036854775807: dial tcp: lookup route53.amazonaws.com on 169.254.169.250:53: no such host"
11/20/2015 12:44:45 PM time="2015-11-20T18:44:45Z" level=fatal msg="Failed to list hosted zones: Get https://route53.amazonaws.com/2013-04-01/hostedzone/?maxItems=9223372036854775807: dial tcp: lookup route53.amazonaws.com on 169.254.169.250:53: no such host"
11/20/2015 12:44:56 PM time="2015-11-20T18:44:56Z" level=fatal msg="Failed to list hosted zones: Get https://route53.amazonaws.com/2013-04-01/hostedzone/?maxItems=9223372036854775807: dial tcp: lookup route53.amazonaws.com on 169.254.169.250:53: no such host"

@ebishop just to clarify: you’ve re-deployed the route53 service using the v0.1.7 route53 image from the catalog? If so, how soon did the container die? Could you attach the container logs to this post? Thanks!

route53.amazonaws.com on 169.254.169.250:53

When I run nslookup on route53.amazonaws.com, I don’t get 169.254.169.250. So I suppose the error message I’m seeing is correct. Or is it?
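For what it's worth, in a Go error of the form "lookup <name> on <address>:53: no such host", the address after "on" is the DNS server that was queried, not the address the name resolved to. So the message says the resolver at 169.254.169.250 could not resolve route53.amazonaws.com from inside the container, which is a separate question from what nslookup returns elsewhere. A minimal Python sketch that pulls those pieces out of such a message (the helper name parse_lookup_error is made up for illustration):

```python
import re

# Matches the Go resolver error shape: "lookup <name> on <server>:<port>: <reason>"
LOOKUP_ERROR = re.compile(
    r"lookup (?P<name>\S+) on (?P<server>[^:\s]+):(?P<port>\d+): (?P<reason>.+)$"
)

def parse_lookup_error(message):
    """Return name/server/port/reason from a Go DNS lookup error, or None."""
    match = LOOKUP_ERROR.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["port"] = int(info["port"])
    return info

if __name__ == "__main__":
    msg = ("dial tcp: lookup route53.amazonaws.com on "
           "169.254.169.250:53: no such host")
    # "server" is the resolver that was asked, "name" is what it failed to resolve.
    print(parse_lookup_error(msg))
```

Here "server" comes back as 169.254.169.250, i.e. the resolver the container is configured to use.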

In any case, the route53 DNS feature is not working for me.

I just upgraded to the latest rancher versions today.

OK, making some progress. I am able to connect to AWS and list the target zone, but I had to set some proxy environment variables inside my container.

I can’t set these in the UI, so I’m thinking I’ll copy the docker-compose and rancher-compose files out of the UI preview, and add my proxy environment variables.

Do you see any problem with this?

Not great news, but different at least. Now I get this error:

route53_1 | 2015-11-30T21:29:58.349638174Z time="2015-11-30T21:29:58Z" level=fatal msg="Error reading stack info: invalid character '<' looking for beginning of value"

Can you give me any clues from this information?

@ebishop Glad you’ve worked around your other issue.

But I had to set some proxy environment variables inside my container.

Could you share which proxy-related env vars you’ve set inside your container?

@vincent what could be the cause of the following error returned by metadata? A missing “self” section perhaps?

invalid character '<' looking for beginning of value

I set http_proxy and https_proxy, nothing special, and there is no ‘<’ character in the proxy URL.

What is the “self” section? I tried to create a route53 instance, then copied the docker-compose.yml and rancher-compose.yml from the UI after it was created. Then I added the env vars to the list already in the docker-compose.yml.

The stack is created, and I tried starting it both from the UI and with rancher-compose.

@alena I’m guessing it’s getting back an HTML document of some sort (starting with “<”) and failing when it tries to parse that as JSON, or the metadata contains characters that aren’t escaped correctly and the resulting JSON is invalid.
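To illustrate the first guess: a JSON parser fails on the very first character when it is handed an HTML page. The sketch below reproduces the same class of failure in Python (route53 itself is written in Go, so the exact error text differs). The function name parse_metadata_body and the sample bodies are invented for illustration, not taken from the real service.

```python
import json

def parse_metadata_body(body):
    """Parse a response body as JSON, with a clearer error for HTML replies."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as err:
        # A leading "<" usually means an HTML or XML page came back instead of JSON.
        if body.lstrip().startswith("<"):
            raise ValueError(
                "expected JSON but got what looks like HTML/XML: %r" % body[:40]
            ) from err
        raise

if __name__ == "__main__":
    print(parse_metadata_body('{"name": "route53"}'))
    try:
        # Hypothetical error page, standing in for whatever was actually returned.
        parse_metadata_body("<html><body>error page</body></html>")
    except ValueError as err:
        print("failed:", err)
```

Dumping the first few bytes of the raw response like this is often the quickest way to tell an HTML reply apart from badly escaped JSON.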

The error I got originally was that I couldn’t find the host in this url:
https://route53.amazonaws.com/2013-04-01/hostedzone/?maxItems=9223372036854775807

Now that I can get to route53.amazonaws.com, when I try this URL in my boto3 script, it returns an error:
botocore.exceptions.ClientError: An error occurred (InvalidInput) when calling the ListHostedZones operation: maxitems must be a positive integer.

Is it possible that AWS is choking on maxItems=9223372036854775807? It does seem like an invalid number for the max number of pages of hosted zones.

@ebishop it’s fine that you started your service by manually copying docker-compose.yml and rancher-compose.yml; that alone shouldn’t have caused any issues.

The error below:

The error I got originally was that I couldn’t find the host in this url.

doesn’t seem to be related to the error you are observing when you run the AWS API call using boto3:

Is it possible that AWS is choking on maxItems=9223372036854775807? It does seem like an invalid number for the max number of pages of hosted zones.

AWS, as well as the Go AWS client we use (mitchellh/goamz, the Golang Amazon Library), does support int64, but the boto3 client has a limitation.
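For reference, 9223372036854775807 is 2^63 - 1, the largest value a signed 64-bit integer can hold, which suggests it is being passed as an "as many as possible" value rather than a real count. A quick check in Python (fits_in_int64 is just an illustrative helper name):

```python
INT64_MAX = 2**63 - 1
MAX_ITEMS = 9223372036854775807  # value copied from the request URL

def fits_in_int64(value):
    """True if value can be stored in a signed 64-bit integer."""
    try:
        value.to_bytes(8, "big", signed=True)
        return True
    except OverflowError:
        return False

if __name__ == "__main__":
    print(MAX_ITEMS == INT64_MAX)        # the URL value is exactly the int64 maximum
    print(fits_in_int64(MAX_ITEMS))      # still representable as int64
    print(fits_in_int64(MAX_ITEMS + 1))  # one more no longer fits
```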

what is the “self” section?

The “self” section contains client information: service, container, and host information. In the metadata config file, each client IP address is assigned a “self” section, so the metadata server knows which “self” to serve to which client. It looks something like:

 "10.42.165.204": {      >>> client ip address

        "recurse": [
            "169.254.169.254", "10.240.0.1"
        ],
        "a": {
                "rancher-metadata.": {"answer": ["169.254.169.250"]},
                "rancher-metadata.rancher.internal.": {"answer": ["169.254.169.250"]},
                "test688481_test820888_1.": {"answer": ["10.42.247.227"]},
                "test688481_test820888_1.rancher.internal.": {"answer": ["10.42.247.227"]},

I set http_proxy and https_proxy, nothing special, and there is no ‘<’ character in the proxy URL.

Setting these environment variables could perhaps result in the client IP address being reset for the route53 container, so it is no longer recognizable by the metadata server. The result is an empty response being served by the metadata service to the route53 container.

It’s essential that the route53 client IP address belongs to the 10.42.x Rancher managed network for it to use metadata, and access to metadata is required for Route53 programming.
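A quick way to check that requirement for a given address, as a minimal Python sketch. The /16 prefix length is an assumption read off the "10.42.x" wording above, and in_managed_network is a made-up helper name.

```python
import ipaddress

# Assumption: "10.42.x" means the 10.42.0.0/16 range; adjust if your setup differs.
MANAGED_NETWORK = ipaddress.ip_network("10.42.0.0/16")

def in_managed_network(client_ip):
    """True if client_ip falls inside the assumed Rancher managed network range."""
    return ipaddress.ip_address(client_ip) in MANAGED_NETWORK

if __name__ == "__main__":
    # Addresses taken from the config excerpt above.
    for ip in ("10.42.165.204", "169.254.169.250", "10.240.0.1"):
        print(ip, in_managed_network(ip))
```

Of the three sample addresses, only 10.42.165.204 falls inside the assumed range.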

We have to set http_proxy and https_proxy to resolve FQDNs outside of our lab. I don’t know why setting these would change Rancher internal routing, but if you’ll tell me how to check I’ll give it a shot. But the container never starts up so it may be a challenge.

Please advise.

@ebishop sure. Let’s do this test. Log in to the network-agent container using either the UI or docker (docker exec -ti [uuid of agent-instance container] bash), and tail the /var/log/rancher-metadata log file. Then start your route53 service and check the “tail” output. You should see the route53-to-metadata request being logged, something like:

time="2015-12-01T20:14:44Z" level=info msg="OK: /self/stack" client=10.42.254.31 version=latest

Let me know what client IP you see, and what error message, if any.
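If the log is busy, a small filter can help pick out the relevant entries. The sketch below is only an illustration in Python: it assumes the key=value layout of the sample line above, and the helper names parse_log_line and self_requests are invented here.

```python
import shlex

def parse_log_line(line):
    """Split a key=value log line (values may be double-quoted) into a dict."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable rather than crash.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

def self_requests(lines):
    """Yield (client, msg) for entries whose message mentions a /self path."""
    for line in lines:
        fields = parse_log_line(line)
        if "/self" in fields.get("msg", "") and "client" in fields:
            yield fields["client"], fields["msg"]

if __name__ == "__main__":
    sample = ('time="2015-12-01T20:14:44Z" level=info '
              'msg="OK: /self/stack" client=10.42.254.31 version=latest')
    for client, msg in self_requests([sample]):
        print(client, msg)
```

The client value is the address to compare against the 10.42.x managed network.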

The rancher-metadata.log file does not contain the word “self” or “client”.

So I see nothing like what you suggested.

@ebishop this most likely happens because the route53 container fails to send its HTTP request to the rancher-metadata server, so the request never reaches metadata and never gets logged. It can be related to the Rancher managed network not being set up properly on your setup (possible problems with iptables-restore).

I know this is likely not related to the startup problem I’m having, but here is the exception we saw in the rancher server logs:
2015-12-02 13:01:34,935 ERROR [172da064-490c-4bc2-aa49-4fb0382e01df:32906] [instance:232] [instance.purge] [] [cutorService-19] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception io.cattle.platform.eventing.exception.EventExecutionException: 7eb0d25f-297b-4885-a148-55e652b31168 : 500 Server Error: Internal Server Error (“Cannot destroy container 499e63e3b0f3c3b9f413537786c108fad3629843d0cad063d07e15c927a22980: Failed to set container state to RemovalInProgress: Status is already RemovalInProgress”)
at io.cattle.platform.eventing.exception.EventExecutionException.fromEvent(EventExecutionException.java:53) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:72) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:158) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.callSync(AgentBasedProcessHandler.java:170) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.handleEvent(AgentBasedProcessHandler.java:156) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.instance.PostInstancePurge.handle(PostInstancePurge.java:38) ~[cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:456) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:396) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:390) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:390) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:520) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:323) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:259) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:256) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]

@ebishop yea, that exception is a known issue, and it shouldn’t cause the problem you are having.

Ok, some progress…at least a different problem:

I decided to create a new host with Rancher, and then not even log in to it…just to be sure I didn’t mess something up.

I created a new route53 service, and also added a simple Ubuntu container. The Ubuntu container came up just fine. So I added a logspout catalog stack and pointed it at my Ubuntu container (just to see if it would come up). My Route53 container was still struggling, but I was surprised to see some slightly different info in the logs:

It was clearly able to connect to route53 and get a list of hosted zones, and found my target zone. (I changed it to X’s, but it was the correct ID). There are some Healthcheck failures also.

12/2/2015 12:27:34 PM time="2015-12-02T18:27:34Z" level=error msg="Healtcheck failed: unable to reach a provider"

12/2/2015 12:27:36 PM time="2015-12-02T18:27:35Z" level=error msg="Failed to update provider with new DNS records: Provider error reading dns entries: Route53 API call has failed: Get https://route53.amazonaws.com/2013-04-01/hostedzone/XXXXXXXXXXXXXXX/rrset: dial tcp: lookup route53.amazonaws.com on 169.254.169.250:53: no such host"

12/2/2015 12:27:36 PM time="2015-12-02T18:27:36Z" level=error msg="Healtcheck failed: unable to reach a provider"