Contingency plan for server failure


#1

I am considering moving my company’s website to DreamHost. I have been testing DreamHost with my personal domains for about a month, and I am very happy with the control panel and customer support.

I appreciate the transparency and information that DreamHost provides on the dreamhoststatus.com site. However, after reading through past posts, one thing that does concern me a bit is that when an outage hits a specific server and the fix requires FSCK, the server may be down for 3, 4, or 5 hours (or even longer). I realize that DreamHost has zillions of servers, so we are going to see this in the blog occasionally, and statistically it is not happening very often at all.

However, it does concern me that we could be down for that length of time in the event of a RAID problem on the particular server that hosts our website. What steps, if any, can be taken to plan for this contingency? I appreciate the “100% uptime” guarantee, but I don’t really want a credit if our server is down for an extended time. I would rather have a way to temporarily move to another server (assuming I had my site backed up locally, which would be my problem).

Or I would even be willing to pay double to have a second account ready to go with a “hot spare” copy of our website. But of course that doesn’t help unless there was a way to change the DNS to point to the backup server.

Is there enough demand to offer this service profitably? Probably not or they would already be offering it…

I’m also considering the VS plan, but in reality this plan would be subject to the same problems if the host server had RAID issues or needed fsck.

Maybe I am worried about something that is unlikely to happen to me. DreamHost seems to respond to problems quickly, and downtime seems to be minimal when problems do occur, unless you need an fsck run…

Any ideas or options? Thanks.


#2

Your point makes perfect sense. Ideally there should be a standalone database server, a web server, and a backup server. The backup server takes over if anything goes wrong with the main web server.

But the above is not possible on a shared server at DH. I think it all depends on your business requirements. If uptime is crucial, I’d recommend going for a dedicated server. To get the best performance, you should have a professional server administrator maintain the servers for you. It is easy to get a server up and running, but it is not easy to maintain it well.



#3

I’m not really opposed to paying the premium for a dedicated server, but it seems that the same potential issue exists with a dedicated server as well: a file system error can have you offline for 3, 4, or 5 hours or more. Since our needs are otherwise met with a shared plan, a dedicated server gets us nothing extra in this case.

I am not asking for 100% uptime; I know that is not realistic. And it does seem that DreamHost responds fairly quickly when there is a problem. But if the problem turns out to be a RAID issue, or if FSCK has to be run, your website is down for a long, long time.

A control panel option that would allow us to change DNS instantly to point to our website hosted on a “backup” server would mitigate outages caused by server failure. I realize it would be up to us to have a current backup copy of the website sitting on this second account or server, ready to go.

Of all the outages mentioned in the blog, even the most serious ones are resolved fairly quickly. The issues that seem to be both common and lengthy are FSCK issues. Having a premium paid option available to customers might be a solution. I can handle a 30-minute outage; it happens. But a 5-hour or near-complete business day outage would be a problem. Any thoughts on a way to plan for a server failure?
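To make the DNS-switch idea concrete, here is a minimal sketch in Python of the decision logic only. Everything in it is hypothetical: the `FailoverPolicy` name, the threshold of three failed checks, and the IP addresses (taken from the reserved documentation ranges). It does not talk to any real DNS service or DreamHost API; it only decides which address the record should point to.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    """Hypothetical policy: switch to the backup after repeated failed checks."""

    primary_ip: str
    backup_ip: str
    max_failures: int = 3  # assumed threshold of consecutive failed checks
    failures: int = 0      # consecutive failures seen so far

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the IP DNS should serve."""
        if primary_healthy:
            # Any successful check resets the counter and fails back to primary.
            self.failures = 0
        else:
            self.failures += 1
        if self.failures >= self.max_failures:
            return self.backup_ip
        return self.primary_ip


policy = FailoverPolicy(primary_ip="192.0.2.10", backup_ip="198.51.100.20")
for healthy in (True, False, False, False, True):
    print(healthy, policy.record_check(healthy))
```

Even with something like this in place, visitors would only see the switch as quickly as the record’s TTL lets resolvers refresh it, so a short TTL would have to be set ahead of time.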


#4

There are some issues with using a backup server for live sites — most notably, any backup will inevitably be “behind” the live site by some amount of time. Depending on the site, this may or may not be an issue: for instance, an infrequently edited HTML site (or a purely database-driven site!) will probably not notice, but sites which store rapidly changing data on the web server itself will definitely notice the reversion. As such, simply switching to a backup immediately when issues arise isn’t necessarily a viable option for all users.
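As a rough illustration of how far “behind” a backup can be, the sketch below computes the window of changes that would be reverted by switching to the backup. The function name and the timestamps are made up for the example; the real figure depends entirely on how often a given site is backed up.

```python
from datetime import datetime, timedelta


def reverted_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much site history is lost by switching to the backup.

    Anything written to the live server between the last backup and the
    failure is missing from the backup copy.
    """
    if failure < last_backup:
        raise ValueError("failure time is earlier than the last backup")
    return failure - last_backup


# Hypothetical example: nightly backup at 02:00, server fails at 14:30.
lost = reverted_window(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 14, 30))
print(lost)
```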

We’re definitely aware that the downtime from a full fsck can be devastating to businesses that rely on their web sites, though, and we’re looking very seriously into ways that we can either speed up or avoid that process.


#5

That’s a good point there, but unless yours is a rather critical site receiving a large amount of traffic every minute, a backup of this magnitude is far from cost-effective.


#6

I am not good at networking, but as far as I know, load balancing across multiple concurrent servers would achieve what you are after.

How it works is that multiple web servers run at the same time, and a load balancer distributes the requests among them. If one of the servers is down, the rest are still up and running.

If you ever need to run FSCK on one of the web servers, you can take it out of the loop and put it back once FSCK is done.

I am only giving a rough idea here. For more details, you should seek advice from network professionals.
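The rough idea above can be sketched as a toy round-robin balancer in Python. This is only an in-memory illustration under simple assumptions: the `RoundRobinBalancer` class and the server names are hypothetical, and a real load balancer would also handle health checks, sessions, and actual network traffic.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that can take servers in and out of rotation."""

    def __init__(self, servers):
        self.servers = list(servers)          # fixed ordering of all servers
        self.in_rotation = set(self.servers)  # servers currently taking requests
        self._next = 0                        # index of the next server to try

    def take_out(self, server):
        """Remove a server from rotation, e.g. while fsck runs on it."""
        self.in_rotation.discard(server)

    def put_back(self, server):
        """Return a known server to rotation once maintenance is done."""
        if server in self.servers:
            self.in_rotation.add(server)

    def pick(self):
        """Return the next server in rotation, skipping any that are out."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.in_rotation:
                return server
        raise RuntimeError("no servers in rotation")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
lb.take_out("web2")                 # web2 is down for fsck
print([lb.pick() for _ in range(4)])
lb.put_back("web2")                 # fsck finished
```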


#7

We’ve talked about doing something like that. Here’s what’s holding us back:

[list]
[*] It’s expensive. Setting up a bunch of redundant web servers would more than double our hardware costs (for the spare hardware, plus the load balancer hardware itself). Chances are that you’d end up paying the extra costs… and you probably wouldn’t want that. At least, a lot of our customers wouldn’t.
[*] Load balancing is more complex than it’d seem. A lot of software isn’t naturally load-balancer-friendly. For example, any software which allows users to upload files would have to be modified somehow to copy uploaded files to every server behind the load balancer, rather than just to the server that happened to handle the upload. The most common solution to this type of problem is to use network storage, but that would just return us to the original problem of having a single point of failure.
[*] A minor but nasty issue: some customers use software which is only licensed for use on a single server. Using multiple servers on the back end might expose them to legal risk, or at least to increased licensing fees.
[*] Worst of all, it still wouldn’t prevent the server from going down. Disk problems are certainly a common cause, but there are a myriad of other causes, not all of which load balancing could possibly solve.
[/list]
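The upload problem from the list above can be shown with a small in-memory simulation in Python. The `WebServer` class and both helper functions are hypothetical stand-ins; real software would be copying files over the network, which is where the complexity comes from.

```python
class WebServer:
    """Stand-in for one web server with its own local disk."""

    def __init__(self, name):
        self.name = name
        self.files = {}  # path -> file contents on this server's local disk


def upload_local_only(server, path, data):
    """Unmodified software: the file lands only on the server that got the request."""
    server.files[path] = data


def upload_replicated(servers, path, data):
    """Modified software: the file is copied to every server behind the balancer."""
    for server in servers:
        server.files[path] = data


pool = [WebServer("web1"), WebServer("web2")]

upload_local_only(pool[0], "photo.jpg", b"...")
print("photo.jpg" in pool[1].files)   # a request routed to web2 cannot find the file

upload_replicated(pool, "avatar.png", b"...")
print(all("avatar.png" in s.files for s in pool))
```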


#8

Yes, load balancing is expensive.

It comes down to a business decision: you have to balance your budget against performance.