Dreamhost DNS failures


#1

Since dreamhost isn’t notifying us when their name servers fail (come on guys), I figure I should. Here’s a ticket I filed just now:

As you are probably well aware, your dns servers have been failing often as of late. The last ticket I opened was replied to with “yeah, they’re broken and we have a guy fixing it, but bind takes 40 minutes to start up”. (note: build more DNS servers and distribute the domains if you have to)

This is affecting not only services hosted on your machines, but also my friends’ ability to reach services hosted on my local network, as you host my DNS.

DNS failures are far more severe than failures of other services in terms of impact to customers. You have a total of three name servers, and there is no way to tell which server will be used for any given lookup, because there are a variety of different resolver behaviors. This means that potentially one third of all access to all dreamhost services, and one third of all access to any services for which you are hosting the DNS, will fail.

You do a great job notifying people when individual boxes fail, but I haven’t received a single notice about these name server failures. Please notify customers when DNS services fail.


#2

I haven’t noticed any dns failures, but it wouldn’t surprise me. Dreamhost has a pattern of only notifying us of critical outages when enough people on the forums complain, even if they know about it hours prior.

Just par for the course.

Also, it’s rather concerning that dreamhost would treat the outage of DNS in such a cavalier manner. Can anyone else confirm these DNS outages?


#3

I thought the point of having 3 DNS servers would be that they could act as a round-robin for each other? I guess my knowledge of networking is severely lacking (I haven’t experienced any problems though, which machine are you hosted on?).


#4

Thanks for the info. I’m confused about the part that if one name server goes down, one third of all access to dh will fail.

It’s been about seven years, but I used to be a DNS admin. Our primary name server was in house, and the secondary was off site. The way DNS is set up, the second should automatically (and always did for us) kick in.


#5

As I mentioned, there are a variety of behaviors in different DNS resolvers out there. Though there are three DH name servers, it doesn’t necessarily mean that a given client will keep trying until it gets an answer. Here’s a common scenario.

We’ll assume that everybody’s DNS caches are empty.

A friend tries to ssh to my box at home using a fully qualified domain name, the address record for which is hosted by dreamhost. His local workstation asks its name server to resolve the address (via a recursive DNS request). That name server then performs an iterative DNS request, which basically breaks the DNS name down into its components and processes them from right to left. First, the name server asks a root name server (all DNS servers have a list of roots) for the location of the servers which serve the “.com” domain, gets the answer, and populates its DNS cache with that info. Next, it asks the .com server for the location of the name servers that host the “dreness.com” domain; the .com server responds with ns1, ns2, and ns3.dreamhost.com, and the name server caches these values. Finally, the name server asks DH for the address record corresponding to core.dreness.com, selecting one of the three DH name servers to answer the question, and returns the result to the client which originally asked to have the hostname resolved.
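For anyone who finds that walk easier to follow as code, here is a toy sketch in Python. It never touches the network: the referral tables, the "com-server" name, and the 192.0.2.10 address are all made up for illustration, and it always picks the first DH name server, which real resolvers do not necessarily do.

```python
# Toy model of the iterative lookup described above. Nothing here touches
# the network; every table entry is made up for illustration.
ROOT_REFERRALS = {"com": "com-server"}  # hypothetical name for a .com server
COM_REFERRALS = {
    "dreness.com": ["ns1.dreamhost.com", "ns2.dreamhost.com", "ns3.dreamhost.com"],
}
# 192.0.2.10 is a documentation-only address, not the real record.
DH_ADDRESS_RECORDS = {"core.dreness.com": "192.0.2.10"}


def resolve_iteratively(name, cache):
    """Walk the name from right to left, caching each answer along the way."""
    labels = name.split(".")
    tld = labels[-1]                # "com"
    domain = ".".join(labels[-2:])  # "dreness.com"

    # 1. Ask a root server who serves the TLD.
    cache[tld] = ROOT_REFERRALS[tld]
    # 2. Ask the TLD server who hosts the domain.
    cache[domain] = COM_REFERRALS[domain]
    # 3. Pick one of the domain's name servers and ask it for the address.
    chosen = cache[domain][0]
    cache[name] = DH_ADDRESS_RECORDS[name]
    return chosen, cache[name]


cache = {}
chosen, address = resolve_iteratively("core.dreness.com", cache)
print(chosen, address)
```

The point to notice is step 3: only one of the three DH servers gets asked, and the workstation never sees which one.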

One key thing to keep in mind here is that the client workstation is completely uninvolved in any but the first and last transactions described above. The client is unaware of how many name servers DH has, and does not care, since it’s the name server’s job to hash that out. A standard (recursive) DNS request means “hey name server, resolve this dns name for me, I don’t care how many people you have to ask, just do it and tell me the answer when you’re done”.

The other key thing to keep in mind here is that the client may cache the response of the name server whether it is positive or negative. If positive, the client then knows that hostname / ip pair, and will not have to resolve it again until that cache entry expires. If the result is NEGATIVE, the client will not attempt to resolve it again until the negative cache entry expires.
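A minimal sketch of that caching behavior, in Python. The class name and the TTL values (300 seconds for positive entries, 60 for negative) are assumptions picked for the example, not what any particular operating system uses, and the clock is passed in by hand so the expiry is easy to see.

```python
class ResolverCache:
    """Toy client-side DNS cache with separate positive and negative TTLs."""

    def __init__(self, positive_ttl=300, negative_ttl=60):
        # TTL values are illustrative assumptions, in seconds.
        self.positive_ttl = positive_ttl
        self.negative_ttl = negative_ttl
        self._entries = {}  # name -> (address or None, expiry time)

    def store(self, name, address, now):
        # address=None records a negative answer ("could not resolve").
        ttl = self.positive_ttl if address is not None else self.negative_ttl
        self._entries[name] = (address, now + ttl)

    def lookup(self, name, now):
        """Return ("hit", address), ("negative", None) or ("miss", None)."""
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            self._entries.pop(name, None)  # expired entries are dropped
            return ("miss", None)
        address = entry[0]
        if address is None:
            return ("negative", None)  # no new lookup until this expires
        return ("hit", address)


cache = ResolverCache()
cache.store("core.dreness.com", None, now=0)      # the lookup failed
print(cache.lookup("core.dreness.com", now=30))   # still negative
print(cache.lookup("core.dreness.com", now=60))   # expired, resolve again
```

While the negative entry is alive, the client answers "not found" out of its own cache without asking anybody.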

I suppose I should also mention that in many operating systems, all of the dns cli / troubleshooting tools (nslookup, host, dig) will craft dns requests and send them directly to the DNS server. This is not the same way that an application (e.g. web browser, ssh client) resolves a host name. Applications typically ask the operating system to resolve domain names using standard system libraries (shared code which is part of the OS that all applications can leverage). These libraries are usually configured to use a local DNS cache, which prevents the client from having to resolve the same name over and over again during short periods of time.

Returning to the example, let us say that the DH name server picked for that first lookup was one that is timing out, so the lookup failed, and that shortly afterwards another client asks the same name server to resolve the same domain name. The name server probably has a negative cache entry for this hostname, since it just tried to look it up and failed - which means it probably won’t try again until the cache entry expires. Let’s suppose for the sake of argument that the negative cache entries on the name server are configured with a very low time-to-live, so they will expire quickly (which is a good idea to limit the impact of this exact problem, but you can’t rely on DNS server admins of various ISPs to implement this strategy). This time the name server picks a different DH name server, one that is not timing out, and gets a valid answer. That name server will probably cache that value internally. If we then go back to the first client and do “host core.dreness.com” we should get a valid answer, since the name server already had the value cached, and since the ‘host’ program bypassed the internal DNS cache on the workstation. However, applications on the first client are still stuck because their internal DNS cache contains a negative entry for that hostname, and the system libraries will check the cache before contacting the name server.

This is all stuff that should be well known by any competent DNS administrator, but is generally well beyond the knowledge of average end users. To them, it’s just broken… and they’re not wrong, because they can’t contact the service in question. Trying to distinguish between a DNS failure and a service failure is generally not part of the normal troubleshooting process for average users.

When my names stop resolving, I will typically manually query each of DH’s name servers directly (bypassing even my ISP’s name servers) as follows:

dig @ns1.dreamhost.com core.dreness.com
dig @ns2.dreamhost.com core.dreness.com
dig @ns3.dreamhost.com core.dreness.com

I will generally find that at least one of the servers is not answering. This means that there are likely negative cache entries floating around out there, in the caches of ISP name servers and / or end user workstations. It’s impossible to know all the locations of these negative entries, or how long until they expire.
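If you would rather not type the three dig lines each time, the same check can be wrapped in a few lines of Python. This is only a sketch: it assumes dig is installed and on your PATH, the 2 second timeout and single try are arbitrary choices, and it relies on dig exiting non-zero when it gets no reply from the server.

```python
import subprocess

DH_NAME_SERVERS = ["ns1.dreamhost.com", "ns2.dreamhost.com", "ns3.dreamhost.com"]


def dig_command(server, name, timeout_s=2):
    """Build the dig invocation for one name server."""
    # +short, +time and +tries are standard dig options; the values are arbitrary.
    return ["dig", f"@{server}", name, "+short", f"+time={timeout_s}", "+tries=1"]


def check_servers(name, servers=DH_NAME_SERVERS):
    """Return {server: True/False} depending on whether dig got a reply."""
    results = {}
    for server in servers:
        proc = subprocess.run(dig_command(server, name), capture_output=True, text=True)
        # dig exits 0 when a server replied (even with NXDOMAIN), non-zero otherwise.
        results[server] = proc.returncode == 0
    return results


if __name__ == "__main__":
    for server, answered in check_servers("core.dreness.com").items():
        print(server, "answered" if answered else "NO REPLY")
```

Like the dig commands themselves, this goes straight to each DH server and bypasses every cache in between, so it tells you about the servers and not about what your applications will see.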

So, my previously stated figure of 1/3 is indeed a rough estimation. Once you have a negative cache entry in your DNS resolution path (which begins with the local dns cache on your workstation, and includes all DNS caches between you and the root servers), your applications will probably not be able to resolve the hostname until all of those negative cache entries expire - EVEN IF you can nslookup / host / dig and get a valid result. Again, remember that those tools can bypass ALL of the dns caches. Usually, negative cache entries have a short TTL, and so these sorts of problems typically don’t last a horrendously long time (on the other hand, that is a highly subjective time period… sometimes you need mail RIGHT NOW, etc).

In Mac OS X, you can clear the local DNS cache with sudo lookupd -flushcache, but I can’t speak for other platforms (though obviously a reboot would do it). Also watch out for consumer firewalls / routers; they also often perform DNS caching, though determining their policy on negative caching is left as an exercise to the reader :)

As a final note, I would not be surprised at all if all the recent threads complaining about email were related to this DNS issue. Mail is especially sensitive to DNS, since there are additional DNS lookups involved with smtp transactions (smtp is used to send email from clients to servers, and also to send from servers to other servers). In addition to standard address record lookups, smtp servers perform mail exchange record lookups to determine the location of the smtp server that should receive mail for a given domain. More DNS requests = higher chance of failure when the DNS servers are flaky. I have been noticing the DNS problems with DH only over the last few weeks or so.
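To put a rough number on "more DNS requests = higher chance of failure", here is a back-of-the-envelope calculation in Python. It assumes each lookup fails independently with the same probability, and it plugs in my rough 1/3 figure from earlier; both are simplifications, not measurements.

```python
def chance_of_failure(per_lookup_failure, lookups):
    """Probability that at least one of several independent lookups fails."""
    return 1 - (1 - per_lookup_failure) ** lookups


# One address lookup versus an MX lookup followed by an address lookup.
one_lookup = chance_of_failure(1 / 3, 1)
two_lookups = chance_of_failure(1 / 3, 2)
print(f"{one_lookup:.2f} {two_lookups:.2f}")
```

Under those assumptions a transaction that needs two lookups fails about 56% of the time, compared with about 33% for a single lookup.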

DNS is easily the most important service of all the services that DH provides, since all the other services are accessed by dns name. This is why I’m so concerned at the complete lack of customer notification when a DNS server burps. It can and probably does affect any and all DH services.


#6

Thanks for the well-crafted and very informative post! Good Stuff!
–rlparker


#7

I don’t have enough time to write a better or more complete response, but I’m throwing in my $0.02.

A recursive nameserver will continue checking until one responds (or until it has tried all of the nameservers). If it doesn’t, it’s broken. The DNS was built to tolerate this sort of failure. A lot of what you’re saying is true, but the conclusions you’re drawing are somewhat incorrect. If they weren’t, DH would be getting massive amounts of complaints to support, and would be hemorrhaging customers.

If the failure of a single node is causing a user’s recursive nameserver to return a negative response, something is severely broken with the nameserver they’re using. Yes, if the nameserver already has a negative response cached, it may cache it for up to the time specified in the “minimum TTL” in the zone’s SOA. But if one of the nameservers in your scenario failed, the recursive nameserver MUST ask the others. This may cause a delay, but not an outright failure.

Also, there is a difference between an RCODE 3 (“NXDOMAIN”) response and being unable to reach any of the nameservers, so I don’t think a negative response will be cached, even if all nameservers are unreachable.
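Here is roughly what I mean, as a Python sketch. The `ask` callable and the status strings are hypothetical stand-ins for a real query, 192.0.2.10 is a documentation-only address, and treating NXDOMAIN as the only cacheable negative is a simplification; some resolvers may also cache server failures for a short time.

```python
def resolve_with_failover(name, servers, ask):
    """Try each authoritative server in turn until one answers.

    `ask(server, name)` is a hypothetical callable that returns
    ("ok", address) or ("nxdomain", None), and raises TimeoutError
    when the server does not reply.
    """
    for server in servers:
        try:
            return ask(server, name)
        except TimeoutError:
            continue  # this server is down; move on to the next one
    return ("servfail", None)  # nobody answered at all


def is_negative_cacheable(status):
    # Only an authoritative "no such name" is cached in this sketch.
    return status == "nxdomain"


def fake_ask(server, name):
    # Stand-in for a real query: pretend ns2 is timing out.
    if server == "ns2.dreamhost.com":
        raise TimeoutError(server)
    return ("ok", "192.0.2.10")  # documentation-only address


result = resolve_with_failover(
    "core.dreness.com",
    ["ns2.dreamhost.com", "ns1.dreamhost.com", "ns3.dreamhost.com"],
    fake_ask,
)
print(result)
```

With ns2 down the lookup is slower, because the timeout has to expire first, but it still succeeds via ns1.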

You’re correct that in certain cases (such as a new subdomain being added), a cached negative response will cause new lookups to fail for a period of time… but not in the situation we’re talking about here.

ipconfig /flushdns on Windows 2k and XP, if memory serves.

It’s possible to adjust the DNS caching configuration and / or disable caching entirely on both Windows and Mac OS X.


#8

You’re right about the fact that I neglected to distinguish between NXDOMAIN and failure to reach the name server at all - I stand corrected.

However, the fact remains: I have seen cases where one of the DH servers was not responding but others were available AND ssh connections and host lookups from workstations were failing. If the recursive requests were always attempting to contact all available DH servers, this should never happen, but it does. I suppose this could be an issue with default timeouts on DNS requests on workstations, but at the end of the day, that is what matters: the workstation’s ability to resolve the name.

Next time it happens I’ll collect all the relevant data and post it just to show that my brain isn’t cooking inside my tinfoil hat :)


#9

Don’t worry, you’re not the only one having problems. I’ve had a trouble ticket open with them for 23 days now because of this. Basically the only name server that returns an answer is ns1; ns2 and ns3 will not. They either return SERVFAIL or time out.

This all seemed to start in December when they moved the DNS modifications from the single DNS tab to the manage domains tab. For a couple of years prior to this, I had the domain set up at zoneedit.com as a slave DNS server (basically adding two more servers to the list of DNS servers for the domain). I put in a support request for them to allow zone transfers to zoneedit’s DNS servers, and within a day or so it was done and everything worked fine for a couple of years like that.

Then in late December I added a new subdomain and it wasn’t becoming active. While diagnosing the DNS servers I noticed that ns2 and ns3 weren’t answering for my domain. So I put in a support request, and the first thing they said was that the additional secondary servers at zoneedit.com were causing the problem. This goes against everything I know about DNS. It is designed to allow multiple slave servers from all around the world to contact the master DNS server (ns1.dreamhost.com) for updates. You just need to tell ns1 to allow ns.wherever.com to do a zone transfer request.

Knowing that wasn’t the problem, I still removed them from the whois info for the domain so I just had dreamhost’s name servers on there. Not only was it still not working, but now only 1 of 3 name servers was responding correctly, and even then ns1 occasionally times out. Fortunately I still had the zone file from when it was at zoneedit, even though it’s an old version, so I added that to my work’s DNS server and added that server to whois. So now 2 of 4 DNS servers will respond, and I’ve never seen a SERVFAIL message on our DNS server, although we’re also only hosting about 50 domains or so, compared to the 1,000s and 1,000s that are on dreamhost’s servers.

So, for the past month I’ve had DNS that isn’t stable. I’ve contacted them several times with ideas and suggestions to get this resolved. I’ve even offered to just keep the secondary domain hosting on our company’s name servers, taking that little bit of load off of their name servers (I’d still need ns1.dreamhost.com since that’s the SOA, the master DNS server), with the other three servers being ours, which takes a little strain off of ns2 and ns3. I really am surprised that they still only have the three name servers that every domain hosted with dreamhost is using. Even at zoneedit.com I believe the DNS servers I had were ns18 and ns13 or something like that. Splitting things up would be a nightmare to deal with now, unless you just go forward and give new signups ns4, ns5, and ns6, moving people from ns1 as needed (a really busy site) or when problems arise (as in my case). 40 mins to restart bind doesn’t surprise me, but you also shouldn’t have so many domains that it takes 40 mins to restart bind IMO. Personally I would have started to get worried once you hit about the 5 min mark. Granted, you hardly ever restart a DNS server, since you can issue commands to a running server to do things like reloading zones and configs. But when it does happen, having 40 mins of “dead air” is bad.

It seems support has taken a sharp downfall lately. Before, getting responses within a couple hours was the norm. Now it can take a day or two before someone responds. And then you have my case where it was forwarded to an admin a couple weeks ago and I haven’t heard back since. sigh


Todd Eddy
vrillusions.com


#10

ns2 and ns3 are returning SERVFAIL for vrillusions.com, so that’s most likely a problem specific to that domain (and the non-standard setup). In this case, whoever allowed AXFR to the zoneedit.com nameservers didn’t also explicitly allow AXFR to the DH nameservers as well - the explicit allow-transfer statement trumps the defaults.

Ask support to pass your question along to Tavis - he should be able to get this fixed for you. Should be fairly obvious, but you can tell him to remove the also-notify and allow-transfer parameters for that zone.

Re: load - the load from one individual domain is almost always negligible (on DH’s system, I mean) - though I have seen some problems from misbehaved clients in the case of lame delegations pointed to a nameserver. re: number of nameservers, etc. - just because there’s a single IP doesn’t mean it has to be a single machine. Currently, it is in this case, but it’s entirely possible to use load balancers / anycast, VRRP, and other schemes to provide load balancing and / or failover for a single IP address.


#11

We are aware of the persistent DNS lookup timeouts. We are in the process of moving to multi-threaded bind, which should resolve this issue.


#12

Thanks for the tip, I passed it along to support. I knew it had to be something with the way the secondaries were set up or how they received notifies. That’s probably what happened in this case: once the expire TTL was hit, the secondary servers just stopped giving a response for the zone.

I forgot to mention load balancers and such. We’re actually starting to look into stuff like that at work for a couple of our more mission critical servers so we don’t have to stay up till 5:00 AM before we can do work on them :) Also, like Tavis mentioned, a multi-threaded bind would help immensely if these servers have multiple processors, which I’m sure they do.

Thanks and hopefully this will get dns back up and working properly again.


Todd Eddy
vrillusions.com


#13

Have you looked into NSD? It claims to be WAY faster than BIND.

Here’s a comparison of a multi-threaded BIND 9 versus NSD:

Other references:

http://www.ripe.net/ripe/meetings/ripe-47/presentations/ripe47-dn-dnssec-nsd.pdf (note especially the massively different contour of the performance plots…)

http://www.generic-nic.net/sheets/practical/nameserver-en#benchmark


#14

Perhaps this is related to my issue.

My account on Dreamhost was set up Saturday morning. My DNS info for the domain I am having them host had been waiting 2+ days already so it should have migrated through no problem. I could never log onto my domain (www.domain.com). I figured maybe it needed some more time. I then added another domain to their hosting and changed the DNS for that domain. That domain worked just fine for a few hours and then the main domain came online. Sunday I was unable to access either, and now Monday I cannot access either. Anyone know of a problem? I know this can’t be a DNS issue on my side as the registrars have had more than enough time to make the changes active. Any advice?


#15

I have two sites hosted by DH. Once the accounts were approved I checked to see if I could access them. It took a little time but eventually I was able to access one but I’m still not able to access the other. I set one up in the a.m. and the other one in the p.m., both were approved at approximately the same time (I was pleased) but I can’t for the life of me figure out why I’m not able to pull up both. Assuming that it was a DNS issue… now I’m searching other forums to make sure I didn’t miss anything :(


#16

What should one do? I have this problem with one of my domains (I have 2 domains hosted; one is OK, the other one has “disappeared”, and support keeps sending back messages saying “Outage resolved: No server-wide problem was found.” In the meantime, we are not getting any emails and cannot access the website).

Is there any way to “bypass” Dreamhost to reestablish access to the website? We need the website; it also contains wikis and MySQL databases vital to the functioning of our company.

Enrico


#17

Dreamhost needs to add a section on their contact support page for DNS issues, not just the various server problems.


#18

I’m also having weird DNS issues. After setting up a subdomain off my main domain, it worked fine for a few hours. I went to bed and woke up to find the site was down - DNS errors. It’s now been over 24 hours and the site is still unavailable.

DNSreport.com says:

I was unable to get an answer from the parent servers [ns1.dreamhost.com]

Needless to say I’ve put in a support request but heard nothing back…


#19

Name server outages
(Downtime)

Posted: Feb 13th, 2006 - 07:15:01 AM PST (12 mins 9 secs ago)
We’ve been having intermittent issues with our ns1 name server which caused
some sites not to resolve properly. On top of that, an ill-timed power
outage at our secondary facility in Palo Alto took out ns2 for three hours
this morning. We’re working on fixing ns1 and our Palo Alto facility has
assured us that everything is good on their end now.

We apologize for the inconvenience this has caused. If you have any
questions or concerns, don’t hesitate to contact support@dreamhost.com

The Happy Dreamhost Nameserver Fixing Team.




#20

A question: why has your support been denying there was any problem, rather than being honest and straightforward like you are being? Having a problem is OK; denying it when clients ask for help is very bad. Anyway, it’s good to read you are doing something about it. We’ll see if Dreamhost support apologizes to clients.