Invest in REAL working redundancy


#1

I’ve been with DreamHost for many years, and every 12-18 months the company goes through some existential crisis: there is an outage, we get a cute newsletter with a kind of apology and a commitment to improve the design and reliability of DreamHost, and every year we seem to repeat the same cycle.

When will DH invest in real redundancy for power, networking, and other essential infrastructure, so that power outages and network brownouts are minimized and these cyclical outages stop?


#2

Can you name the last power outage? I can’t.

If you would like a more perfect world, it’s available for a greater price elsewhere. Even there, though, you’re not going to find much better uptime overall. No matter where you go there will always be some issue over the 525960 minutes in a year =]
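To put that 525960 figure in perspective, here is a rough sketch of how many minutes of downtime per year a given uptime percentage allows. The uptime levels below are purely illustrative assumptions, not figures published by DreamHost or any other host, and `downtime_minutes` is just a name made up for this example.

```python
# 365.25 days * 24 hours * 60 minutes, the figure quoted above
MINUTES_PER_YEAR = 525_960


def downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted at the given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Illustrative uptime levels only (assumptions, not any host's actual numbers)
for pct in (99.0, 99.5, 99.9):
    minutes = downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.0f} minutes "
          f"({minutes / 60:.1f} hours) of downtime per year")
```

On these assumed numbers, going from 99.0% to 99.5% halves the allowed downtime, from roughly 5,260 minutes to roughly 2,630 minutes a year.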


#3

I had a six-hour outage about 18-24 months ago, just before the plan to go to New York. Also a variety of SQL hangs and machine failures over the last 2 years… As an early ex-Cisco employee, I’m reminded of other high-growth, high-tech companies that prioritized growth over operations. After 3 hours of working on bringing up the datacenter after a power failure, they notice THEN that the routers are fried? Given the infrastructure priority, wouldn’t the connection to the outside world come FIRST? It looks like the operations staff is wearing clown shoes…


#4

For the price I pay, I’m happy. I know you can always add half a percentage point to the uptime by paying for something more pricey like Rackspace. If revenue dictated that, then I’m sure I would… =]


#5

I’ve noticed power outages reported at rack level when power strips fail, but never an entire datacenter. The blokes running the show down at Alchemy will have a lot of explaining and begging forgiveness to do. I’d reckon they host more than the DreamHost servers in the facility and no doubt the phone lines would be melting already from some very disgruntled clients who spend a lot of money with them. Alchemy (Irvine) Datacenter


#6

I can’t find the quote I’m looking for about Alchemy, but there is one that’s close on this page: “DreamHost owns a large equity stake in Alchemy.”


#7

We’ve been investigating the router issues pretty much since the power came back on. Getting confirmation from the manufacturer that the problems we were seeing were a hardware issue took some time, though, and while that was ongoing some of our other staff have been trying to bring up other services where possible.

We’ve also gotten word from Alchemy that they’re doing some emergency UPS maintenance at the Irvine facility this evening, running from right around now through midnight. We don’t have any details on what this involves, but I strongly suspect that it’s related to the power outage earlier today.

DreamHost does have equity in Alchemy, which hosts our Irvine and Los Angeles data centers, but the involvement ends there. Alchemy is managed and operated separately from DreamHost, and they host servers for quite a few other customers.


#8

Andrew,

Relying on hardware vendors to confirm the status of their equipment is EXACTLY the kind of issue that this thread is about. Having an operational DOA plan is what grown-up companies do; planning for ‘what if’ a power cycle hits both routers is the point of real operations planning. It’s now 8:11pm. At 5:30 a spare could have been rolled in and the network could have been patched together. I really have no issue with individual servers or power strips or MySQL servers occasionally failing, but paying for useless redundancy is dumb, and paying for no effective working redundancy is dumber.


#9

I am a newer DH customer, so this issue does concern me long-term since I am using the Irvine datacenter. I am patient. I keep regular backups, so for me, site failure is no big deal right now. However, when I start to use my service for more availability-critical things or to host my portfolio, I really worry about future outages.

I don’t expect perfection, but I want to know… what measures are being taken to make sure this doesn’t happen for this period of time again?

An explanation for what exactly went wrong would be very helpful. I kept up with the dreamhoststatus blog and it’s confusing to me as a customer to be honest… as I said, I’m very well versed in IT, so I understand simple failures.

Also I read something about performance/server improvements? Is this Irvine Center going away and being merged into Los Angeles? It sounds like from my reading that a lot of problems have come out of this particular center…

Not trying to be negative. My site is up and everything is fine now. I just wanted to know if we could have some feedback on these issues, so that at the very least we know what is being done to minimize these outages and increase performance in the future.

Thanks in advance guys


#10

Our CEO’s update on the root cause and mitigation of power issues affecting services in our Irvine DataCenter:

http://www.dreamhoststatus.com/2013/03/19/power-disruption-affecting-us-west-data-center-irvine-ca/

Thank you all for your continued patience during this frustrating and unfortunate incident.


#11

I literally just saw it. I was about to post this. Thanks :) I will read it.

Edit: This is good enough for me. I am just glad to know the issue is being resolved and that there is a clear explanation as to what went wrong and what is being done to prevent it in the future… also promises of future information being divulged about this outage. I highly appreciate it.

I’m curious, I noticed this entry:

Network Improvement Series Maintenance – US-West Data Center (Los Angeles, CA) – Part 2/2 of Final Step – Wednesday, March 20th 8pm PST (4 hrs)
http://www.dreamhoststatus.com/2013/03/18/network-improvement-series-maintenance-us-west-data-center-los-angeles-ca-part-22-of-final-step-wednesday-march-20th-8pm-pst-4-hrs/

Is everything being moved from Irvine to Los Angeles? I’m new, so I’m curious: how many data centers exactly does DH work out of? I’m not sure if it means anything for me as a customer, but I’m just curious what will be changed, is all.


#12

We’ve currently got three data centers: our Los Angeles and Irvine data centers on the West Coast, and our Virginia data center on the East Coast. We had a fourth data center (also in Los Angeles) up until recently, when we migrated all customers out of it, mostly to Virginia.

These three data centers, and everything in them, are staying where they are for the time being. The maintenance that was scheduled for yesterday evening was an upgrade of some router hardware in the Los Angeles data center as part of an ongoing effort to improve our network infrastructure. This was not related in any way to the issues in the Irvine data center, but we cancelled it anyway so that we could keep all of our staff focused on getting Irvine back online.


#13

I like the part where an assurance that the root cause of the outage would not be worked on was understood to be a good thing. Funny stuff.


#14

One of the reasons why I picked DreamHost over its competitors as my first host was that, at the time, they appeared to really own up to failures and provide full disclosure of the events. Of course, having the skill and planning to avoid failures in the first place is preferred, but even the big guys fail occasionally, so it’s understandable as long as there is transparency.

I think I’m not the only one that feels that the level of transparency has diminished greatly over the past 2-3 years. There have been important events during that time which have basically been swept under the rug without a satisfying explanation. When there is an explanation, it invariably ends with a statement that measures will be taken to ensure that it never happens again, without actually specifying what those measures are. And when it does happen again, the credibility of the initial explanation is diminished greatly.

I think almost everyone can accept that there will be times when things don’t go to plan. Most of those people are willing to accept some form of compensation and move on with the understanding that tests, checks, and contingency plans will be put in place to ensure that it either doesn’t happen again or that the impact of a similar occurrence will be greatly reduced.

Ideally, the downtime created by my own mistakes should be far greater than that caused by the mistakes of the host.


#15

I can easily second that feeling. After a number of incidents I feel like real systems analysis has not taken place, and DreamHost has outsourced to companies that it inappropriately trusts instead of keeping the operations and responsibility in house. How was a new state-of-the-art facility in Irvine never tested? Or inappropriately tested? Why wasn’t critical spare equipment like a third router kept onsite? After the problem happened once, how in the world did it happen again?

These are system failures of a company that’s been re-focusing on marketing in the last few years and away from backend systems.

My .02


#16

The 7.00pm update was hilarious :D

They handed the reins over to a salesman. It was inevitable.