"Outside view/information on the latest outage" or "The Trouble with Tribblebytes"


#1

Hey guys, I’m dead tired and the topic on dreamhoststatus.com was locked before I could hit enter, but I thought I would post this anyway, since a lot of people seemed to be posting about a supposed lack of redundancy. I have never taken a CS course (by choice; I love coding as a hobby but could never code for a living). I’m sorry if the following (or this prelude) makes no sense, and I hope I’m posting in the right area. Oh, and I do hope it’s somewhat accurate, but feel free to call me out; this is the Internet after all :stuck_out_tongue: .

[quote=A post meant for DreamHostStatus but didn’t make it in time; hopefully still informative!]
I wasn’t on the wrong end of the fault this time, but I have been there before. DreamHost techs are honestly doing their best and get caught in the crossfire the same as us, whether we are running personal, business, non-profit, etc. sites/services (they would also agree that if you are paying for a service, regardless of the amount, they will do their best to provide it). To the people saying there needs to be more redundancy, though: that really doesn’t apply to this specific issue (in the past there were some RAID/HDD corruption issues, and steps were taken to make the likelihood of that happening again microscopic).

I’m willing to bet that they have multi-homed BGP routers set up at each datacenter. However, we are getting into the “nitty gritty” here, as the paths stored in these routing tables are what bind the Internet together (there are ~500k routes in the global table, announced by the Autonomous Systems (AS) that make up the public Internet, and probably half as many are used for internal purposes on top of that; note that is just for IPv4, and DreamHost has listened to its customers and implemented IPv6 across all accounts). Part of the Border Gateway Protocol involves constantly watching for any change in routes (such as when a peering partner goes offline or a country decides to block incoming, outgoing, or all traffic; heh, something we have seen happen a few times in the last 5 years alone).

The problem is, when border routers break, you are totally cut off from the rest of the Internet (this is why datacenters normally have connections meant specifically for emergency use, so they don’t have to rely on the rest of their network). I am guessing they first tried to fix this internally, then after 30 minutes or so decided it would be best to re-route traffic around their Irvine facility to the one in LA (being able to do that at all is an example of how redundant they really are). After that they found a fix that they felt could handle restoring the previous routes, and so far it looks like it’s holding. The cause appears to still be unknown at this time; it could be any number of things, and a bunch of hardware and software will need to be checked. Written but not read as I’m fading fast and there is no caffeine in sight x.x /* random geek minddump */[/quote]
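
For anyone who wants to see the re-routing idea in code, here is a rough Python sketch of it. It is a toy under loud assumptions, not how DreamHost's routers (or any real BGP implementation) work: the `Route` and `RouteTable` names, the `irvine`/`la` labels, and the prefix and AS numbers (taken from the ranges reserved for documentation) are all made up for illustration. Real BGP best-path selection looks at many more attributes than AS-path length; this only shows "prefer the shorter path, fall back when it is withdrawn".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # destination network (documentation range, hypothetical)
    next_hop: str    # hypothetical site label such as "irvine" or "la"
    as_path: tuple   # AS numbers the announcement passed through


class RouteTable:
    """Toy route table: keeps every candidate route per prefix."""

    def __init__(self):
        self._routes = {}  # prefix -> list of candidate Route objects

    def announce(self, route):
        # Remember a new candidate route for its prefix (ignore duplicates).
        candidates = self._routes.setdefault(route.prefix, [])
        if route not in candidates:
            candidates.append(route)

    def withdraw_next_hop(self, next_hop):
        # Drop every route that goes through a site that has gone dark.
        for prefix in list(self._routes):
            self._routes[prefix] = [
                r for r in self._routes[prefix] if r.next_hop != next_hop
            ]

    def best(self, prefix):
        # Simplification: shortest AS path wins; None if nothing is left.
        candidates = self._routes.get(prefix, [])
        if not candidates:
            return None
        return min(candidates, key=lambda r: len(r.as_path))


table = RouteTable()
table.announce(Route("203.0.113.0/24", "irvine", (64500,)))
table.announce(Route("203.0.113.0/24", "la", (64501, 64500)))

print(table.best("203.0.113.0/24").next_hop)  # irvine (shorter path)
table.withdraw_next_hop("irvine")             # simulate the Irvine side failing
print(table.best("203.0.113.0/24").next_hop)  # la (traffic falls back)
```

The point of the sketch is only that the alternate path has to already exist for the fallback to work, which is the kind of redundancy the quoted post is talking about. If both candidates are withdrawn, `best` returns `None`, which is the "totally cut off" case.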