100% uptime vs reality


#1

How’s your uptime and response time?

Mine’s becoming frustratingly pathetic.
uptime: 99.4% over the past 3-4 months. There’s an outage almost every day recently:

response time: constantly increasing:

Not happy. Not impressed.


#2

Hey bobocat,

You have told us before, but I never set the service up. Remind us where to set up the monitoring service.

Unscientifically, I agree with you that things have gotten worse. I have a cron job that runs once a minute, which I set up to let me know when a stream of weather data gets interrupted. The alarm goes off more frequently now. It was originally set up not to monitor for DreamHost failures but for failures on the other side (the local system and connection that does the uploading to DreamHost), but lately most of the failures have been unexplained momentary interruptions. I would like to set up the more scientific approach.
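For anyone wanting to try the same kind of once-a-minute check, here is a minimal Python sketch of a staleness test. It is not the actual script described above: the 120-second threshold, the function names, and the idea of checking a data file's modification time are all assumptions made for illustration.

```python
import os
import time

# Assumed threshold: treat the stream as interrupted after two missed
# one-minute uploads. Tune this to the real upload interval.
MAX_AGE_SECONDS = 120


def is_stale(last_update: float, now: float, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True when the newest data is older than max_age seconds."""
    return (now - last_update) > max_age


def check_file(path: str) -> bool:
    """Check a data file's modification time; a missing file counts as stale."""
    try:
        last_update = os.path.getmtime(path)
    except OSError:
        return True
    return is_stale(last_update, time.time())
```

A cron entry could run a small wrapper around `check_file` every minute and send a notification whenever it returns `True`. Note that a check like this cannot tell which side failed, which is why an external monitor is the more scientific option.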


#3

I use Pingdom. Their free account allows 1 minute resolution for 1 domain. You can customise it quite a bit. I’ve set it up so that it requests a page that first makes a simple request to the database to produce some text on the page. If the text is not there, then that is reported in my setup as downtime as well. So downtime does not necessarily mean the site is unreachable, but that it is unusable, since a failure in the DB connection renders most of my site useless.
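The same idea can be sketched in a few lines of Python using only the standard library. This is an illustration of the technique, not Pingdom's implementation: the function names and the marker text are hypothetical, and the marker would be whatever string the page only prints after a successful database query.

```python
import urllib.error
import urllib.request


def classify(status: int, body: str, marker: str) -> str:
    """Classify one check: 'up', 'unusable' (page served, DB text missing), or 'down'."""
    if status != 200:
        return "down"
    return "up" if marker in body else "unusable"


def check(url: str, marker: str, timeout: float = 10.0) -> str:
    """Fetch url and classify the response; network and HTTP errors count as 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return classify(resp.status, body, marker)
    except (urllib.error.URLError, OSError):
        return "down"
```

In the setup described above, both "unusable" and "down" would be reported as downtime; the sketch keeps them separate only so the two causes can be counted individually.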

Well, at least it’s corroborating evidence. I almost regret setting up the Pingdom monitor (thanks sXi!) because I used to be blissfully unaware of momentary downtime, but now it’s clear how often it occurs, which brings my satisfaction level way down.

I realise that a 2-10 minute long downtime may seem like nothing major, but today’s 8 minute or so disappearance was right in the middle of a class of 30 students taking a test on my site. It just doesn’t work to have 30+ people happily clicking away and then in the middle get nothing for 8 minutes. I even spent a lot of time making my app less reliant on the database by loading and caching everything that will be needed for the session at the beginning as I’ve been bitten by that under similar circumstances. But these small windows of darkness are becoming intolerable… I’ve had a service disruption every day for the past seven days as can be seen here.
[hr]
I cannot even begin to describe how furious I am after reading the latest reply from DreamHost support regarding the latest outage. They have the gall to claim that Pingdom reports are not accurate. Excuse me? If Pingdom can’t reach my site from its 50+ servers around the world, then neither can anyone else! And I already do personally verify every Pingdom report. This is infuriating:

[quote]Thanks for contacting us but unfortunately Pingdom reports simply aren’t
usable. We see too many false positives with the service as look to be
the case today.

I’ve just checked a few of your sites and apache services as well as the
server itself and I’m not seeing any issues since the server was rebooted
2 weeks ago.

If you’re seeing reports of downtime with Pingdom please confirm them by
checking the site personally. If you are seeing any problem please
contact us and provide any specific error messages you’re seeing. Well be
happy to look into any issue then.

Thanks!
Christopher P[/quote]

I hate to say it DreamHost, but I am now officially looking for a new host.


#4

Hi. Note that Pingdom say in their control panel that

The field they are talking about is “how many failed checks to tolerate before raising an alarm”.

What this means is that, in their opinion, a ‘regular’ website can be expected to fail for “5 or even 10” minutes at a time without the owner necessarily having to become concerned.
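To make the effect of that setting concrete, here is a small Python sketch of a "tolerate N failed checks" rule. It assumes the alarm fires once per outage, on the first failed check beyond the tolerance; that is my reading of the field, not something Pingdom documents in this thread, and the function name is hypothetical.

```python
from typing import Iterable, List


def alarm_indices(results: Iterable[bool], tolerate: int) -> List[int]:
    """Return the indices of the checks at which an alarm would be raised.

    results:  True for a passing check, False for a failed one.
    tolerate: number of consecutive failures to ignore before alarming.
    """
    alarms = []
    streak = 0
    for i, ok in enumerate(results):
        if ok:
            streak = 0
            continue
        streak += 1
        # Alarm once per outage, on the first failure past the tolerance.
        if streak == tolerate + 1:
            alarms.append(i)
    return alarms
```

With one-minute checks, `tolerate=0` reports every blip, while `tolerate=5` stays silent for any outage of five minutes or less, which is the trade-off being argued about here.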

Obviously there’s a question not addressed there, about how frequently 5-minute failures can occur before one’s satisfaction level can reasonably diminish.

But since (it seems) you are running your Pingdom monitor at a level beyond what they themselves recommend (for ‘regular’ websites), are you sure that those comments from Dreamhost support are unreasonable?

It would be good to get some feedback about how often these 5-to-10 minute outages occur. I’ve (temporarily) set my own pingdom monitor to the unreasonably high level which they disrecommend, and I’ll report back if anything alarming transpires; I hope other customers do the same.

~Tom


#5

Tom, I realise there were a lot of words in my post, but if you don’t want to read all of it, then don’t post a reply. I’ve already addressed the reason why even 1 minute of downtime is unacceptable in my use case.


#6

But pingdom allows you to set it for more critical even on a free account… hmmm… they also forgot to define “regular”.

And by the way, what does “their” opinion have to do with anything? And what motivates “their opinion”? Could their statement be traffic/bandwidth, server/bot load motivated? oh.

Anyway, don’t want to drag this thread more off topic than you have already done. :slight_smile:

I’m not sure why I didn’t set this up a long long time ago. Now we just need to let the data collect.

FWIW, my cron alarm went off last night and it coincided exactly with the time of a 2-minute outage reported by Pingdom. I noticed that in the Pingdom analysis of the event they include the text of traceroutes from 2 different locations in the world, both with the last point of response being the same DreamHost router. Support might have more trouble arguing with data like that.


#7

Provisional definition: a regular website, for the context of this discussion, is a website for which it is appropriate to use a shared server at a mid-priced host with a reasonably good reputation.


#8

keeping the thread on topic…

So far I’ve seen 4 outage events since I set up Pingdom yesterday… 3 were minor, with a duration of 2 minutes or less; 1 had a duration of 19 minutes. All 4 events had 2 traceroutes from different locations in the detail analysis, and in each of the 8 traceroutes the failure point appears to be router-0.hq.newdream.net (admittedly the IP didn’t resolve in 2 of the 8 cases, but they were IPs similar to the ones shown for router-0.hq.newdream.net).


#9

Sorry, I should have warned you: Pingdom’s data will cause host dissatisfaction!
I haven’t had a single day without at least a minor disruption for about 10-15 days now, which seems to coincide with DH’s attempt to revamp their network.

As a follow-up, I received two replies from support recently:

[quote]First, I apologize for the use of a mass message to answer your support request, if your questions have not been answered please do not hesitate to let us know. We will be here to help.

We had been experiencing an outage for some customers whose services communicate between two data centers. The problem has been identified as a configuration mismatch between the routers those data centers. Once identified, this was quickly resolved by our Network Engineering team.

As we continue our Network Improvement Series, we will be adding this to the checklist of items to be checked and double-checked to ensure this type of issue does not happen again.[/quote]

So they did finally own up to some of the problem. My setup with Pingdom will report downtime if the site is reachable but the DB is not, which seems to be responsible for about 30% of the downtimes.

Note the traditional ending: we’ll make sure this never happens again. That’s become almost comical lately.

As for the connection issues, after initially blowing me off and claiming Pingdom was reporting false positives, I provided them with more concrete evidence including changing my Pingdom password so that someone from DH could log into my account themselves to get as much data as they needed. I use Google Analytics and frequently have the real-time stats open in my browser, especially if I know that someone is going to be using it with a class full of students (which can show up as a constant stream of 100+ clicks per minute for 20 minutes at a stretch). The clicks will disappear at the same times that Pingdom report downtimes. The logs show this stark drop off as well. I personally check and witness many of these outages. This is why I’m so infuriated that Support can just ignore the problem and claim that Pingdom is giving false positives.

Anyway, their response:

[quote]Since we weren’t able to replicate the issues you’re having on our end,
we’re going to move your account over to a new server entirely.[/quote]

Ok, great, so maybe they can ensure that my server will be in the same data centre as my database (which would seem only logical and has been a source of some outages as of late). When I inquired about that, I got this reply:

[quote]My apologies, I hadn’t realized you were wanting to move it to the same
datacenter as your database.

your sites are currently being moved to the “applewood” server (in Irvine
data center), while your databases are on “inki” (in our LA datacenter).

I suggest you perhaps just monitor your sites after the DNS propagates
and see how they perform.[/quote]

This is a stunning response! They know that this configuration causes errors and they claim to be consolidating servers and DB servers within the same data centres to avoid these issues (some details here). So what is DH’s plan when they know that my configuration is fragile? Oh, well, just watch it and see what happens.

Unbelievable. Simply unbelievable. For a company which has the gall to claim 100% uptime, which they, and the rest of the hosting world, know in advance that they simply cannot provide, it’s shocking to see such a nonchalant attitude to documented uptime of only around 99.4%. Which hosting company would advertise that? Which hosting companies actually stoop that low??

Oh, and I’ve just read one of the automated replies a little more closely:

[quote]WHOOPS! We noticed you included an attachment with your email! We
cannot accept email attachments… if there’s a file you’d like us to see,
please upload it to the web somewhere and re-send your message with the
URL to the attachment! We’d greatly appreciate it![/quote]

So apparently about half of the data I’ve been sending them has not even arrived, although it was referenced in my messages (e.g. “see the attached screenshot”). This is flabbergasting on so many levels.

One, it shows that they are not even reading what I write, just skimming for some key words. Two, if they are reading it, they aren’t sincere enough to care that I’m trying to send them all the data they need because they dismiss my claims as being false positives without more evidence. And three, if you primarily provide support by email and you don’t accept email attachments, then shouldn’t you provide a *%$!ing service to upload attachments yourself rather than asking your clients to upload it somewhere??? DreamHost, you are a hosting company, right? That somewhere should be your service! You accept attachments from the panel, but the only way to follow up on an unresolved issue and maintain the previous conversation is via email. But you don’t accept email attachments???

Ok, I’m boiling over with rage now. This is farcical. The dream is just about extinguished, DreamHost. I’ve been one of your big supporters. I’ve volunteered countless hours on this forum learning and helping others. I’ve written and improved wiki pages. I’ve even helped your customers set up web apps. That’s been a waste of time.


#10

Just to follow up, here’s my DH uptime for 2012. This graph includes one server move around 10 June.

Basic stats @ 1 minute resolution for March — December 2012:
[list]
[*]Uptime: 99.67%
[*]Downtime: 1d 3m
[*]Number of Downtimes: 252
[*]Longest continuous Uptime: 34d 14h 58m
[/list]
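As a sanity check, the uptime percentage follows directly from the downtime and the length of the monitored period. The short Python sketch below assumes monitoring covered all of 1 March to 31 December 2012 (306 days); the function name is just for illustration.

```python
def uptime_percent(downtime_minutes: float, period_days: float) -> float:
    """Uptime as a percentage of the monitored period."""
    period_minutes = period_days * 24 * 60
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


# Assumption: 1 March to 31 December 2012 is 306 days of monitoring.
# "1d 3m" of downtime is 1 * 24 * 60 + 3 = 1443 minutes.
print(round(uptime_percent(1 * 24 * 60 + 3, 306), 2))  # prints 99.67
```

So roughly one day of accumulated downtime over ten months is what a figure of 99.67% looks like in practice.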


#11

My December:

[list]
[*]Uptime: 99.29%
[*]Outages: 5
[/list]

2012-12-10 04:37:12 – 0h 05m 00s
2012-12-10 06:32:12 – 0h 10m 00s
2012-12-10 07:02:12 – 0h 10m 00s
2012-12-10 07:27:12 – 0h 10m 00s
2012-12-10 07:42:12 – 4h 40m 00s

(Server was moved on the day above).
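The 99.29% figure can be reproduced from the outage list itself. The Python sketch below parses the trailing duration on each line, sums them, and divides by the 31 days of December; the regular expression and names are my own and only assume the line format shown above.

```python
import re
from datetime import timedelta

# Matches the trailing "Xh YYm ZZs" duration on each outage line.
DURATION = re.compile(r"(\d+)h (\d+)m (\d+)s\s*$")


def parse_duration(line: str) -> timedelta:
    """Extract the outage duration from one line of the list."""
    match = DURATION.search(line)
    if match is None:
        raise ValueError(f"no duration found in: {line!r}")
    hours, minutes, seconds = (int(g) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


OUTAGES = """\
2012-12-10 04:37:12 – 0h 05m 00s
2012-12-10 06:32:12 – 0h 10m 00s
2012-12-10 07:02:12 – 0h 10m 00s
2012-12-10 07:27:12 – 0h 10m 00s
2012-12-10 07:42:12 – 4h 40m 00s
"""

total = sum((parse_duration(line) for line in OUTAGES.splitlines()), timedelta())
month = timedelta(days=31)  # December

print(total)                                # prints 5:15:00
print(round(100 * (1 - total / month), 2))  # prints 99.29
```

All five outages fall on the day of the server move, so the month's total downtime of 5h 15m comes from that single day.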

The preceding months were 99.92% and better. A couple were 100%.

Pinging @ 5 minute intervals.


#12

It would be really good if DH provided pinging results by default :slight_smile:


#13

As a follow up to my follow up, here are my average response times for March — December 2012 measured @ 1 minute resolution. These times represent a request to a PHP framework which makes a single database query and creates a page. This script has not changed so although the average response time is much higher than a ping, it should remain relatively constant.

Note that prior to mid-April, response times averaged < 500 ms. This is corroborated with Google Analytics data which goes back to around November 2011. Since that time, the average speed has slowed by about 50%. I’m not sure why and I’m not too happy about it, but I can live with 750 ms.
[hr]

Your resolution is too low :stuck_out_tongue:


#14

I’m running out of time at the moment, but later today I’ll post my May to Dec pingdom stats.

I’ll tell you what they show ahead of time tho… a lot of outages early, then a lot of smooth sailing, followed by a lot of outages at the end of the year. I ignored the recent outages for a while, but early Friday morning I opened a ticket; they haven’t replied to it yet.


#15

…and it’s only pinging a static HTML page :slight_smile:


#16

My monitoring started on 6/9/2012 thru 12/31/2012, 1 minute resolution, with just a simple page request for the top level of the domain:

Uptime: 99.65%
Downtime: 17h 25m
Number of downtimes: 262

More alarming is Dec 11th thru today:

Uptime: 98.6%
Downtime: 10h 43min
Number of downtimes: 79

As you can see, there’s been a huge increase in downtime since Dec 11th.

Graph: 6/9 to 1/12: http://i.imgur.com/NcEC2.png

I raised a support ticket Friday and the first response from support was that ‘uptime’ on the server was "19:52:31 up 156 days, 16:54, 3 users, load average: 2.51, 2.63, 2.30"
and the suggestion was that I should contact pingdom to find out how they arrive at their conclusion. :stuck_out_tongue:

If the electricity didn’t go off, does that mean the lights were on the whole time? No, the switch might have gotten turned off, the bulb may have burned out, the fuse might have blown, or the wire to the light might have gotten cut.


#17

Exactly the uptime I got, and I doubt you are on the same server, so those two figures corroborate each other.

[quote=“LakeRat, post:16, topic:57772”]I raised a support ticket Friday and the first response from support was that ‘uptime’ on the server was "19:52:31 up 156 days, 16:54, 3 users, load average: 2.51, 2.63, 2.30"
and the suggestion was that I should contact pingdom to find out how they arrive at their conclusion. :stuck_out_tongue:
[/quote]

It’s this sort of response that gives me the energy to find another host. As you noted in your analogy, just because the server is running, doesn’t mean that the world can reach it. Support’s response is always something along these lines: load is low and server uptime is high, so it must not be our problem.

This is ironic because I’ve already been through a situation some months ago where there was an outage every day at a specific time for 5-10 minutes or so which DH vigorously denied until one day they posted something on their status site noting that they had found some configuration problem between data centres or something like that. After that, the micro-outages disappeared, for a while.

This is why my check includes a DB query. For many websites, it doesn’t matter if the server is running and the connection is up. If the server can’t talk to the DB, no page is going to be served.

My setup tests the entire system and DH would be wise to do something similar rather than just blindly relying on the server uptime report. They have multiple data centres now, so it wouldn’t be hard to set up their own Pingdom-style service which makes requests to each server in other data centres and then displays the results in the panel. In a way, this would even result in fewer wasted resources if DH customers could see trustworthy uptime graphs in the panel.

Otherwise, there may be 1% or so on each server who set up a service such as Pingdom, which means there could be 10 or so requests every minute from various users on the same server just to make sure it’s up, compared to DH running their own service with a single request.

But DH probably won’t do it because then they can’t blame someone else for low uptime figures. They would actually have to own up to their failures or change their claim of 100% uptime to 99.65% uptime.

If I had time, I’d buy dreamhostdowntime.com and set up a small app where DH customers could add their Pingdom accounts and pool the data via the API to provide a more realistic assessment of DH’s uptime. The only problem is that I don’t have the time.


#18

I received a copy/pasta response like that after sending in a note the other day that the server I was logged into was running at 500+ load and that, you know, it might be worth taking a look at. Received the uptime and load bs over 24 hours later. I woulda LOOSING THOUSANDS REPLY!!! at the total lack of logic if the stuff I have here at DreamHost was at all important.

I hear ya. Even Level 0.001 Support should not be allowed to issue those “any response is a good response” type of replies. I’m starting to think that anyone who is savvy enough to understand how to log in to shell should probably be using the “No offence, but I probably know more about this than you do” selection when submitting a message.

…and you’d be relying on your DH uptime to collect, appraise, and output the data :stuck_out_tongue:


#19

I was going to list that as problem two, but Pingdom keeps the records, so if DH was down, that would just mean the aggregate results couldn’t be displayed. They could still be aggregated during the next cronjob after DH was back up.


#20

UPDATE: Support moved the site to a different instance of Apache on the same machine, and the problem seems to be rectified. No more frequent and sometimes long “down time”