As I mentioned, there are a variety of behaviors in different DNS resolvers out there. Though there are three DH name servers, it doesn’t necessarily mean that a given client will keep trying until it gets an answer. Here’s a common scenario.
We’ll assume that everybody’s DNS caches are empty.
A friend tries to ssh to my box at home using a fully qualified domain name, the address record for which is hosted by DreamHost. His local workstation asks its name server to resolve the address (via a recursive DNS request). That name server then performs an iterative DNS request, which basically breaks the DNS name down into its components and processes them from right to left. First, the name server asks a root name server (all DNS servers have a list of roots) for the location of the servers which serve the “.com” domain, gets the answer, and populates its DNS cache with that info. Next, the name server asks the .com server for the location of the name servers that host the “dreness.com” domain; the .com server responds with ns1, ns2, and ns3.dreamhost.com, and the name server caches these values. Finally, the name server asks DH for the address record corresponding to core.dreness.com, selecting one of the three DH name servers to answer the question, and returns the result to the client which originally asked to have the hostname resolved.
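To make the right-to-left walk concrete, here is a minimal Python sketch of it. It is a toy simulation, not a real resolver: the server names "root-server" and "com-server", the address 192.0.2.10 (a documentation-only address), and the `pick` / `down` parameters are all made up for illustration, and nothing touches the network. It also assumes the name server asks exactly one DH name server and does not retry, which is only one of the behaviors real resolvers exhibit.

```python
# Hypothetical stand-in for the DNS hierarchy: which server refers us where.
DELEGATIONS = {
    "root-server": {"com.": ["com-server"]},
    "com-server": {
        "dreness.com.": [
            "ns1.dreamhost.com",
            "ns2.dreamhost.com",
            "ns3.dreamhost.com",
        ]
    },
}
# Hypothetical address record (192.0.2.0/24 is reserved for documentation).
ADDRESS_RECORDS = {"core.dreness.com.": "192.0.2.10"}


def suffixes(fqdn):
    """Return the zones of a name from right to left: com., dreness.com., ..."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) + "." for i in range(len(labels) - 1, -1, -1)]


def resolve_iteratively(name, pick=0, down=()):
    """Follow referrals from the root, then ask ONE authoritative server.

    pick selects which of the authoritative servers gets asked;
    down lists servers that are timing out. No retry is attempted.
    Returns (address or None, cache of referrals learned on the way).
    """
    fqdn = name.rstrip(".") + "."
    cache = {}
    servers = ["root-server"]
    for zone in suffixes(fqdn)[:-1]:
        referral = DELEGATIONS[servers[0]][zone]
        cache[zone] = referral  # the name server caches each referral
        servers = referral
    chosen = servers[pick % len(servers)]
    if chosen in down:
        return None, cache  # the one server we asked never answered
    return ADDRESS_RECORDS.get(fqdn), cache
```

With every server up, `resolve_iteratively("core.dreness.com")` returns the address along with the cached referrals; pass `pick=1, down={"ns2.dreamhost.com"}` and the same lookup comes back empty even though two of the three servers are fine.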
One key thing to keep in mind here is that the client workstation is completely uninvolved in any but the first and last transactions described above. The client is unaware of how many name servers DH has, and does not care, since it’s the name server’s job to hash that out. A standard (recursive) DNS request means “hey name server, resolve this dns name for me, I don’t care how many people you have to ask, just do it and tell me the answer when you’re done”.
The other key thing to keep in mind here is that the client may cache the response of the name server whether it is positive or negative. If positive, the client then knows that hostname / ip pair, and will not have to resolve it again until that cache entry expires. If the result is NEGATIVE, the client will not attempt to resolve it again until the negative cache entry expires.
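Here is a rough Python sketch of what such a cache might look like. The class name, the TTL defaults (300 and 60 seconds), and the "miss" / "positive" / "negative" labels are my own inventions for illustration; real resolver caches take their TTLs from the DNS responses and local policy.

```python
import time


class ResolverCache:
    """Toy DNS cache that remembers both answers and failures."""

    def __init__(self, positive_ttl=300.0, negative_ttl=60.0, clock=time.monotonic):
        # TTL values are illustrative assumptions, not real-world defaults.
        self.positive_ttl = positive_ttl
        self.negative_ttl = negative_ttl
        self.clock = clock
        self._entries = {}  # name -> (address or None, expiry time)

    def store(self, name, address):
        """Cache a result; address=None records a failed lookup."""
        ttl = self.negative_ttl if address is None else self.positive_ttl
        self._entries[name] = (address, self.clock() + ttl)

    def lookup(self, name):
        """Return ("miss" | "positive" | "negative", address or None)."""
        entry = self._entries.get(name)
        if entry is None:
            return ("miss", None)
        address, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[name]  # expired: resolve again next time
            return ("miss", None)
        if address is None:
            return ("negative", None)  # still refusing to retry
        return ("positive", address)
```

The important part is that a "negative" result is served straight from the cache, exactly like a positive one; nothing is asked upstream until the entry expires.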
I suppose I should also mention that in many operating systems, all of the dns cli / troubleshooting tools (nslookup, host, dig) will craft dns requests and send them directly to the DNS server. This is not the same way that an application (e.g. web browser, ssh client) resolves a host name. Applications typically ask the operating system to resolve domain names using standard system libraries (shared code which is part of the OS that all applications can leverage). These libraries are usually configured to use a local DNS cache, which prevents the client from having to resolve the same name over and over again during short periods of time.
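For comparison, this is roughly how an application asks the operating system to resolve a name, sketched in Python. `socket.getaddrinfo` is the standard library wrapper around the system resolver; the function name `resolve_like_an_application` is just something I made up. Whether a local cache actually sits behind this call depends on the platform and its configuration.

```python
import socket


def resolve_like_an_application(hostname):
    """Resolve a name through the OS resolver, the way a browser or ssh would.

    Returns a sorted list of addresses, or an empty list if resolution fails.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # the system resolver reported a failure
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

If `resolve_like_an_application("core.dreness.com")` comes back empty while dig pointed at the name server returns an address, that mismatch suggests a cache somewhere between the application and the name server is holding a stale answer.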
Returning to the example, suppose the DH name server that was picked for the first request was timing out, so that lookup failed. Let us say that shortly after our initial request, another client asks the same name server to resolve the same domain name. The name server probably has a negative cache entry for this hostname, since it just tried to look it up and failed, which means it probably won’t try again until the cache entry expires. Let’s suppose for the sake of argument that the negative cache entries on the name server are configured with a very low time-to-live, so they will expire quickly (which is a good idea to limit the impact of this exact problem, but you can’t rely on DNS server admins of various ISPs to implement this strategy). This time the name server picks a different DH name server, one that is not timing out, and gets a valid answer. That name server will probably cache that value internally. If we then go back to the first client and do “host core.dreness.com” we should get a valid answer, since the name server already has the value cached, and since the ‘host’ program bypasses the internal DNS cache on the workstation. However, applications on the first client are still stuck because their internal DNS cache contains a negative entry for that hostname, and the system libraries will check the cache before contacting the name server.
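The end state of that scenario can be boiled down to a few lines of Python. This is a deliberately tiny model: the two dictionaries, the function names, and the address 192.0.2.10 are all hypothetical, and TTLs are ignored entirely.

```python
NAME = "core.dreness.com"

# State on the first client after the story above (None = negative entry).
local_cache = {NAME: None}
# State on the ISP name server after the second client's successful lookup.
name_server_cache = {NAME: "192.0.2.10"}  # hypothetical address


def tool_lookup(name):
    """What 'host' or 'dig' sees: straight to the name server."""
    return name_server_cache.get(name)


def application_lookup(name):
    """What a browser or ssh sees: the local cache is consulted first."""
    if name in local_cache:
        return local_cache[name]  # a negative entry short-circuits here
    return tool_lookup(name)
```

Here `tool_lookup(NAME)` returns the address while `application_lookup(NAME)` returns nothing, which is exactly the "it works in host but not in my browser" situation. Removing the entry from `local_cache` (the equivalent of flushing the workstation cache or waiting for expiry) makes both agree again.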
This is all stuff that should be well known by any competent DNS administrator, but is generally well beyond the knowledge of average end users. To them, it’s just broken… and they’re not wrong, because they can’t contact the service in question. Trying to distinguish between a DNS failure and a service failure is generally not part of the normal troubleshooting process for average users.
When my names stop resolving, I will typically manually query each of DH’s name servers directly (bypassing even my ISP’s name servers) as follows:
dig @ns1.dreamhost.com core.dreness.com
dig @ns2.dreamhost.com core.dreness.com
dig @ns3.dreamhost.com core.dreness.com
I will generally find that at least one of the servers is not answering. This means that there are likely negative cache entries floating around out there, in the caches of ISP name servers and / or end user workstations. It’s impossible to know all the locations of these negative entries, or how long until they expire.
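If you get tired of typing the three dig commands, the check is easy to script. The following is a rough Python sketch that shells out to dig; it assumes dig is installed and on the PATH. The `+short`, `+time=` and `+tries=` options are standard dig query options, while the function names and the two second timeout are my own choices.

```python
import subprocess

NAME_SERVERS = ["ns1.dreamhost.com", "ns2.dreamhost.com", "ns3.dreamhost.com"]


def dig_command(server, name, timeout_s=2):
    """Build a dig command that asks one specific name server, once."""
    return ["dig", f"@{server}", name, "+short", f"+time={timeout_s}", "+tries=1"]


def check_name_servers(name, servers=NAME_SERVERS, run=subprocess.run):
    """Query each server directly; map server -> answer text, or None."""
    results = {}
    for server in servers:
        try:
            proc = run(
                dig_command(server, name),
                capture_output=True,
                text=True,
                timeout=10,
            )
        except (OSError, subprocess.TimeoutExpired):
            results[server] = None  # dig missing, or it hung
            continue
        answer = proc.stdout.strip()
        # dig exits non-zero when the server never replies.
        results[server] = answer if proc.returncode == 0 and answer else None
    return results
```

Calling `check_name_servers("core.dreness.com")` returns a dictionary keyed by name server; any server mapped to None either did not answer or had no record to return. The `run` parameter exists only so the function can be exercised without a network.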
So, my previously stated figure of 1/3 is indeed a rough estimation. Once you have a negative cache entry in your DNS resolution path (which begins with the local DNS cache on your workstation, and includes all DNS caches between you and the root servers), your applications will probably not be able to resolve the hostname until all of those negative cache entries expire, EVEN IF you can nslookup / host / dig and get a valid result. Again, remember that those tools bypass the local cache on the workstation, and can bypass ALL of the DNS caches when pointed directly at the authoritative name servers. Usually, negative cache entries have a short TTL, and so these sorts of problems typically don’t last a horrendously long time (on the other hand, that is a highly subjective time period… sometimes you need mail RIGHT NOW, etc).
In Mac OS X, you can clear the local DNS cache with sudo lookupd -flushcache, but I can’t speak for other platforms (though obviously a reboot would do it). Also watch out for consumer firewalls / routers; they also often perform DNS caching, though determining their policy on negative caching is left as an exercise to the reader.
As a final note, I would not be surprised at all if all the recent threads complaining about email were related to this DNS issue. Mail is especially sensitive to DNS, since there are additional DNS lookups involved with smtp transactions (smtp is used to send email from clients to servers, and also to send from servers to other servers). In addition to standard address record lookups, smtp servers perform mail exchange record lookups to determine the location of the smtp server that should receive mail for a given domain. More DNS requests = higher chance of failure when the DNS servers are flaky. I have been noticing the DNS problems with DH only over the last few weeks or so.
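To put a rough number on "more DNS requests = higher chance of failure", here is a back-of-the-envelope calculation in Python. It assumes every lookup fails independently with the same probability, which is a simplification (caching makes real lookups far from independent), and it borrows my rough 1/3 figure purely as an example input.

```python
def chance_of_failure(per_lookup_failure, lookups):
    """Probability that at least one of several independent lookups fails."""
    return 1.0 - (1.0 - per_lookup_failure) ** lookups


# One address lookup versus an MX lookup followed by an address lookup.
single = chance_of_failure(1 / 3, 1)
mail = chance_of_failure(1 / 3, 2)
```

Under those assumptions a single lookup fails one time in three, while a delivery that needs two lookups fails five times in nine, so mail gets noticeably worse odds than a plain ssh or web connection.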
DNS is easily the most important service of all the services that DH provides, since all the other services are accessed by dns name. This is why I’m so concerned at the complete lack of customer notification when a DNS server burps. It can and probably does affect any and all DH services.