Whenever a server fails to restart, they say “we are sending someone over to the datacenter”. My question is: how come they don’t have a member of staff there all the time? Servers frequently become unresponsive and need to be rebooted. This takes up to an hour (sometimes more) because (and this is deduction, I may be completely wrong) they themselves have to wait for a response from the unresponsive server before they can restart it, since it is being done remotely, no?
They’re pretty darn close to the data center. It’s part of the same building where their offices are; I think they’re up above the actual machines, so they send someone into the bowels of the beast to fix the server.
As far as work being done remotely goes, technically that’s how just about everything is done with that server. It’s not like each one has a monitor and keyboard sitting there. A lot of the time, when a *unix machine needs rebooting, it’s not in the same way as a windoze box or an older Mac. It’s just that some services have stopped responding, but the kernel (core OS) still responds to a forced reboot. When things get really locked up, then it’s the good ol’ reset switch.
Well, that good old reset switch needs a little attention: sepulveda always takes 40-60 minutes to restart, when it could be done in 5 seconds.
One of our two offices is located in the same building as the data center, albeit on a different floor. On any given day you can probably find one of our administrators doing something within the data center itself (we have full, 24/7 access).
This doesn’t impact most issues, though - it’s only in (relatively rare) cases of outright hardware failure that we ever have to have actual “hands on” access to the servers (we can do much of the usual stuff - run commands, etc. from anywhere in the world - you’ve not lived until you’ve power-cycled a server while stuck in LA traffic).
The 40+ minute delay you speak of is mostly made up of the time it takes for us to notice the downtime, plus (in the event of a power-cycle) some time - maybe 10-15 minutes or so - for the server to actually come back up. In some cases it can take a little longer if, for example, an fsck is occurring.
So, really, the greatest amount of time is shared between A) being notified that a given server/service is down and B) finishing up whatever we’re working on at the time (i.e. another problem).
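To make the "being notified" part concrete, here is a minimal sketch of the kind of check a monitor might run. It is an illustration only, not a description of DreamHost's actual monitoring; the function name `service_responds`, the timeout, and the idea of probing a TCP port are all assumptions.

```python
import socket


def service_responds(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration: a successful connect only shows that
    something is listening, not that the service behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A monitor would call something like this on a schedule, so the time to notice an outage depends on how often the check runs and how quickly someone acts on the alert.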
Some issues receive higher priority, of course. If an entire server is down it will be resolved more quickly than if, for example, a single Apache instance has gone offline.
Anyhow… I would say that 40-60 minutes is longer than normal. If a server is dead in the water, we usually find out about it and get it fixed a lot quicker than that. Even dead Apaches don’t usually take that long to fix.
- Jeff @ DreamHost
- DH Discussion Forum Admin
The time I quoted is the average time since the fault was confirmed (the panel has a time since first report->confirmed). It’s not a great problem, but I’m guessing that when a server becomes unresponsive due to overloading (the server load goes to 200 200 200), a simple restart will solve the problem. This does happen of course, and the server has never been down for much more than an hour, but it is quite frequent. When the server is unresponsive it takes me at least 15 minutes to get a successful login through to the shell, so I am guessing you use the same system and that’s why it takes so long. Of course I don’t know exactly how you solve these issues, so it may be unrelated.
Apart from that, all is well in DreamHost land; this was more of a curiosity-type post than a complaint.
PS. Using the shell and the command “top” I get the server statistics. What is the NICE stat in CPU %?
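On the "nice" figure: in top's CPU line it is the share of CPU time spent running user processes whose priority has been adjusted with nice (typically lowered). The sketch below shows how such a percentage can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. It assumes a Linux machine (the thread does not say what sepulveda runs), and the helper names and sample numbers are made up for illustration.

```python
def cpu_fields(line):
    """Parse an aggregate 'cpu ...' line from /proc/stat into integer counters.

    Fields used: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest counters (if present) are skipped because the kernel already
    includes them in user/nice.
    """
    parts = line.split()
    return [int(value) for value in parts[1:9]]


def nice_percent(before, after):
    """Percentage of CPU time spent on niced user processes between two samples."""
    deltas = [b - a for a, b in zip(cpu_fields(before), cpu_fields(after))]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[1] / total  # index 1 is the 'nice' counter


# Made-up sample lines, taken some interval apart.
sample_1 = "cpu 100 0 50 850 0 0 0 0"
sample_2 = "cpu 150 25 75 950 0 0 0 0"
print(nice_percent(sample_1, sample_2))  # 12.5
```

On a real Linux box the two samples would come from reading the first line of `/proc/stat` twice, a second or so apart.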