Periodic slowdown fixed by Apache reset...but it keeps happening


#107

More than 24 hours after I submitted the ticket, the 30-second (or multiple of 30 seconds) delay problem was finally fixed, and then it came back again.
I enjoyed less than a day of normal operation.
This time, I didn’t wait to see whether DreamHost would catch and correct it; I submitted a ticket right away.

I’m starting to monitor the site myself, as I don’t think DreamHost is monitoring it, and I’m thinking about writing a script to submit a ticket, too.

On the server, we just need a very simple CGI, like this:

#!/bin/sh

echo ""
echo "Testing CGI"

Name it “test.cgi”, for example, and run “chmod 700 test.cgi”.

And we can issue a command like this from any machine:
time wget http://yoursite.com/test.cgi

With the 30 second delay problem, it will take slightly more than 30 seconds.

With a script that invokes another process, it takes twice as long.

#!/bin/sh

echo ""
date

If you use a pipe, it will take even longer, because each command in the pipeline is a separate process.

#!/bin/sh

echo ""
ls -al |tail

The same happens if you issue more commands:
30 seconds for starting the CGI process, plus 30 seconds for every single process it runs, no matter how simple it is.

#!/bin/sh

echo ""
date
date
date
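
The pattern in the scripts above can be written down as a rough rule of thumb: about 30 seconds for the CGI process itself, plus about 30 seconds per process it spawns. A minimal sketch, assuming the penalty is a flat 30 seconds per process (the helper name expected_delay is made up for this illustration):

```sh
#!/bin/sh
# Rough estimate of the total delay in seconds, assuming a flat 30 s
# penalty for the CGI process itself plus 30 s for each process it
# spawns. expected_delay is a hypothetical helper, not a real tool.
expected_delay() {
    subprocesses=$1
    echo $(( 30 * (1 + $subprocesses) ))
}

expected_delay 0   # script with only echo (a shell builtin)
expected_delay 1   # script that runs date once
expected_delay 2   # ls -al | tail (two processes)
expected_delay 3   # script that runs date three times
```

Measured times will be slightly higher than these estimates, since the normal response time is added on top.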

It would be very unusual to get exactly the same delay for each subprocess, every time this problem happens, purely by accident.
I’d understand it if this were by design and the behavior were intended.
In addition, the way the support staff answer tickets about this problem is strange.
They simply say the HTTP server needed to be reset.

Then, why not monitor the server and reset it automatically when this problem happens?
It would reduce the ticket volume and the associated cost.
I don’t understand why they keep this hanging around for months.


#108

A response to my ticket (from post #99 in this thread) has arrived. I’ve bolded the part that readers of this forum might be interested in:

[quote]I apologize for the extreme delay and for the inconvenience. We are aware
of the issues that you and other customers of ours have seen with the
Apache service needing to be restarted. We have begun working on a
permanent fix to stop this from happening.
I have restarted the Apache
service for you once again. Please let us know if you are still seeing
the errors after this restart. We will be more than happy to take another
look. If you have any further questions, please feel free to message us
back. :)[/quote]


#109

Well, I also got a response… pretty much saying they couldn’t find anything… it worked fine for a little while, but now I am back at 30 secs again…


#110

For quite a while I was having problems similar to what some other folks have described in this thread – low-traffic Wordpress blog, fairly standard install without a bunch of weird plugins, and insanely long page load times with frequent downtime. After going back and forth with support a bit, and after a move to a different Apache instance that helped only temporarily, it was suggested that in some cases this problem can result from certain Wordpress plugin/theme combinations not working well with the PageSpeed module (Page Speed Optimization on the domain edit page). I had checked the box because, hey, who wouldn’t check a box to make things faster? Pretty much as soon as I unchecked that box, the problem disappeared. That was about 4 months ago and no trouble since then.

This might have already been suggested – toward the end of the thread I was skimming – but for anyone who has this problem and is running page speed optimization it’s worth a try disabling it to see if it helps.


#111

If it is the “30 seconds delay problem”, the delay depends on the number of subprocesses.

So, if page optimization invokes a subprocess, it might have an effect.
But it’s usually not a good idea to remove optimization and increase the load on the server.

For my site, the problem came back after about a day of normal operation.
So, I set up a simple monitoring script to log the response time.

Here is part of the log of response times for a very simple CGI.
With the automatic logging, I’ll know when the problem starts and when it stops.

Thu Mar  6 04:46:32 PST 2014
 04:46:32 up 100 days, 16:33,  1 user,  load average: 5.55, 7.03, 7.79

real	0m30.114s  <========== 30 seconds + alpha
user	0m0.000s
sys	0m0.024s
---
Thu Mar  6 04:48:18 PST 2014
 04:48:18 up 100 days, 16:35,  1 user,  load average: 9.31, 7.91, 8.02

real	0m30.042s  <========== 30 seconds + alpha
user	0m0.000s
sys	0m0.008s
---
Thu Mar  6 04:50:01 PST 2014
 04:50:01 up 100 days, 16:37,  1 user,  load average: 11.56, 8.78, 8.29

real	0m30.093s  <========== 30 seconds + alpha
user	0m0.008s
sys	0m0.016s
---
Thu Mar  6 05:00:01 PST 2014
 05:00:01 up 100 days, 16:47,  1 user,  load average: 8.09, 10.78, 9.52

real	0m30.130s  <========== 30 seconds + alpha
user	0m0.012s
sys	0m0.000s

Load average is pretty mild for this server, and static pages are loading fast.
But every CGI request has the 30-second (or multiple of 30 seconds) delay.

So far, the support person does not understand the nature of the problem,
as he is asking for ping and traceroute output…
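
A logger along the following lines would record the same kind of information as the entries above (timestamp, uptime with load average, and elapsed time), though in a simpler one-line format. This is only a sketch under assumptions: wget is installed, `date +%s` is available (it is on GNU and BSD systems), and the URL points at a trivial CGI such as test.cgi. The name probe_once is made up for this example.

```sh
#!/bin/sh
# Sketch of a response-time logger. probe_once is a hypothetical name.
# It prints a timestamp, the uptime line, and the whole seconds that
# one request to the given URL took.
probe_once() {
    url=$1
    date                        # timestamp of the probe
    uptime                      # includes the load average
    start=$(date +%s)
    status=0
    wget -q -O /dev/null "$url" || status=$?
    end=$(date +%s)
    echo "elapsed=$(( end - start ))s wget_exit=$status"
    echo "---"
}

# Run a single probe when a URL is given on the command line.
if [ "$#" -ge 1 ]; then
    probe_once "$1"
fi
```

Saved under a name of your choice and run from cron with its output appended to a file, it shows when the delay starts and when it stops.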


#112

Over the last year, I think I’ve had to open tickets about 3 times with reference to this problem. Oddly, it’s like re-inventing the wheel each time. One must first convince support that there is a problem. Then you have to convince them that it’s a problem they have to deal with, and not a self-created problem. Then you have to go thru an iteration or 2 of “resets” of the apache service. Then finally they move you to a new apache instance and all is good again, until a bad site winds up on the same apache instance again. FWIW, I am convinced this problem is nothing more than bad neighbors (that may not know they are a bad neighbor). What I don’t understand tho, is why dreamhost hasn’t come up with an automated detection method, and a protocol to deal with it that involves identifying the offender. Or there might be a solution that keeps the processes from getting hung in the first place (many support replies indicate they cleared out “hung processes” and restarted apache). In any case, the current protocol of moving the victims after they whine loud enough isn’t a scalable long term solution.


#113

[quote=“LakeRat, post:112, topic:58840”]…In any case, the current protocol of moving the victims after they whine loud enough isn’t a scalable long term solution.
[/quote]

A priceless understatement. And if I may be so bold, most of the “victims” I’ve seen complaining about this issue are more than charitable in accepting the “we’ll just restart apache for you” anodyne.

Is this issue a factor in both VPS and shared servers? I’ve tried to divine this by reading all the threads on this problem but it’s not clear to me whether it’s only a VPS issue.

(I am considering migrating from my current host to DH, but until this thing is resolved, making the move now seems a little reckless).

Thanks,
–Bob.


#114

Our forum app for Android and iPhone doesn’t have any problems loading pages of our forum; in fact, it’s always much faster than loading our forum pages in a PC browser. Also, the 30 sec (or multiple) delay doesn’t exist when surfing our forum with a smartphone or similar device that uses our app. That contradicts support’s statement that one of the problems is high CPU usage. Our smartphone users never complain; our PC users complain all the time.


#115

I am on a shared service and this has been going on for a few weeks now for me. It’s getting to the point where it is hard to do any development or have people come to my site, because it’s always having problems. I got one reply 2 days after I sent my support ticket in, saying they had to reboot the apache instance. This is not a problem, just an annoyance. I have been a member for about 5 years or more and the server has been great up to this past year, when they started having some big problems. Now I had issues yesterday that took 2-3 hours to be resolved, and still no reply from DH Support.

I would be fine with it if a reboot is all that’s needed, but can they not automate the check and the reboot?


#116

In my case, it’s about shared hosting servers (and not VPS).

My site became normal 10 hrs after submitting a ticket.
(No response to the ticket)
Then the problem started again 5 hrs later.
This time, it lasted 4 hrs before things became normal.
(Still no response to the ticket)

We can have this problem without high server load (probably because servers handling CGI are different from the server we log on with ssh).

2014-03-07 00:30:01-08:00
 00:30:01 up 101 days, 12:17,  0 users,  load average: 3.55, 3.88, 4.07

real	0m30.100s
user	0m0.012s
sys	0m0.008s

And as far as I’ve observed, it has nothing to do with PC or smart phone, etc.
(Unless the smartphone access goes through caching proxy and the dynamic pages are cached, somehow…)
It has nothing to do with network outside of DH.
The delay still happens when we access the CGI from the DH server itself (with ssh).

As there is no 30 seconds lag on SSH, FTP, etc, the problem is on the CGI backend
or in between HTTP server and CGI backend, most probably.

Also, the 30-second delay exists on multiple DH servers.
I’ve seen this with different accounts and on different servers, and it’s exactly 30 seconds (or a multiple of 30 seconds).
So, I think it’s probably a setup or configuration common to many (if not all) DH shared servers.
I think I’ve seen this when my site was on a server in LA, and then again after we moved to VA or somewhere on the East Coast.


#117

The problem happened, again, last night.
I didn’t submit any ticket as I was sleeping.
But it resolved (by itself? or by DreamHost?), 3 hours later.

2014-03-08 21:50:01-08:00
21:50:01 up 103 days, 9:37, 1 user, load average: 6.08, 5.83, 5.44

real 0m0.132s <== Normal
user 0m0.008s
sys 0m0.008s

2014-03-08 22:00:01-08:00
22:00:01 up 103 days, 9:47, 1 user, load average: 6.12, 7.35, 6.57

real 0m30.244s <=== 30 seconds delay started around 22H00 Pacific time
user 0m0.008s
sys 0m0.008s

2014-03-08 22:10:01-08:00
22:10:01 up 103 days, 9:57, 1 user, load average: 7.28, 6.72, 6.80

real 0m30.092s
user 0m0.016s
sys 0m0.000s

----------- snip ------------


2014-03-09 00:20:01-08:00
00:20:01 up 103 days, 12:07, 0 users, load average: 5.60, 5.04, 6.00 <== Low load average

real 0m30.065s <== Still 30 seconds
user 0m0.012s
sys 0m0.000s

2014-03-09 00:30:02-08:00
00:30:02 up 103 days, 12:17, 0 users, load average: 17.62, 11.72, 8.28 <== High load average

real 9m53.654s <== 9 minutes delay !!!
user 0m0.012s
sys 0m0.000s

2014-03-09 00:40:01-08:00
00:40:01 up 103 days, 12:27, 0 users, load average: 23.98, 19.53, 13.80

real 0m22.694s <== 22 seconds
user 0m0.016s
sys 0m0.000s

2014-03-09 00:50:01-08:00
00:50:01 up 103 days, 12:37, 0 users, load average: 29.83, 25.00, 19.14

real 0m0.166s
user 0m0.004s
sys 0m0.008s

2014-03-09 01:00:01-08:00
01:00:01 up 103 days, 12:47, 0 users, load average: 19.97, 22.17, 20.50 <== Still high

real 0m0.270s <=== But no more 30 seconds delay.
user 0m0.004s
sys 0m0.012s

2014-03-09 01:10:01-08:00
01:10:01 up 103 days, 12:57, 0 users, load average: 27.44, 24.16, 22.22

real 0m0.179s
user 0m0.016s
sys 0m0.000s
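
To pick out where an episode starts and ends in a log like the one above, the `real` lines can be converted to seconds and compared against a threshold. A rough sketch using awk; the function names and the 30-second threshold are assumptions of this example, and the input is assumed to contain `real XmY.YYYs` lines as printed by the shell's time command:

```sh
#!/bin/sh
# Convert a time-style value such as 9m53.654s to seconds.
real_to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Read a log on stdin and print every "real" line whose value is at
# least 30 seconds, prefixed with the duration converted to seconds.
slow_entries() {
    awk '$1 == "real" {
        split($2, t, /[ms]/)
        secs = t[1] * 60 + t[2]
        if (secs >= 30) printf "%.3f\t%s\n", secs, $0
    }'
}
```

For example, `slow_entries < response.log` (with response.log standing in for whatever file the monitoring output is appended to) lists only the delayed probes.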


#118

What I see in your results is really not surprising. Keep in mind the load averages on the server may spike at specific times of the day… for example, this sequence is just after midnight, when you and other users may have cron tasks etc. firing.

Also keep in mind that the apache issue being discussed in this thread has nothing to do with high server load averages, as proven by your test results here.


#119

As far as I can see, SSH and cron jobs aren’t related.
As I said earlier, we have no problem with SSH, FTP, or cron jobs during the 30-second delay problem.

But the 30-second delay problem MAY increase the load on the main server, as many processes can stay alive and clog the system.

Since I started to monitor more closely, 2 episodes ended with higher server load and nearly 10 minutes delay (instead of 30 seconds) at the final moment.
In this case, FTP and other services might have been slow, too.

Anyway, there is a problem and either DH isn’t taking care of it or what DH is doing isn’t enough, so far.

And the crazy problem started, again.

2014-03-09 18:30:01-07:00
18:30:01 up 104 days, 5:17, 0 users, load average: 12.35, 11.13, 10.33

real 0m0.132s
user 0m0.012s
sys 0m0.000s

2014-03-09 18:40:01-07:00
18:40:01 up 104 days, 5:27, 0 users, load average: 15.26, 11.79, 10.76

real 0m30.174s
user 0m0.004s
sys 0m0.008s

2014-03-09 18:50:01-07:00
18:50:01 up 104 days, 5:37, 0 users, load average: 10.19, 12.22, 11.91

real 0m30.079s
user 0m0.004s
sys 0m0.012s


#120

Certainly one would think that dreamhost should have done something differently between the time this thread was started 12/18/2012 and now. In those past 15 months Dreamhost hasn’t solved the problem.

At this point, if this problem is affecting your web site, the only solution that seems to semi-permanently solve the problem (for you only) is to ask support for a different apache instance. Sadly, this isn’t even easy because it’s been my experience, and is illustrated in this thread, that front line support doesn’t understand that the problem exists, how to test for it, etc.

It should also be noted tomtavoy posted a workaround (most recently in post 85 of this thread).

At least in Post 78 (12/10/2013) we get clear admission by dreamhost that it is something they should fix soon.

It’s sad that this issue has lived so long. It would be interesting to know how many customers never figured the issue out and simply packed up and went somewhere else. The threat or actual loss of customers has never seemed to faze dreamhost tho.


#121

I’m really not sure whether it prevents the problem from happening,
but I changed the timing of the crontab from every 10 minutes to every 5 minutes.

So, I’ll see if it works and report back.

The 30 seconds delay is still there, this morning.
2014-03-10 05:40:01-07:00
05:40:01 up 104 days, 16:27, 3 users, load average: 12.08, 13.62, 13.63

real 0m30.146s
user 0m0.004s
sys 0m0.024s

2014-03-10 05:45:01-07:00 <== Now, every 5 minutes to see if the remedy works
05:45:01 up 104 days, 16:32, 3 users, load average: 11.40, 13.58, 13.71

real 0m30.111s
user 0m0.004s
sys 0m0.024s

So, I don’t think DH is monitoring the situation (the problem started early yesterday evening).
If it sometimes resolved more quickly, I guess someone had submitted a ticket.

From now on, I’ll submit a ticket as soon as I see the problem, every time.


#122

When I observe the CGI process, it’s sleeping (I/O or network wait?).
Things stay this way until the end of the delay.

$ ps auxw
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
xxxxxxx  13782  0.0  0.0  68436  1576 ?        SN   07:07   0:00 sshd: xxxxxxx@pts/1
xxxxxxx  13784  0.0  0.0 124072  2040 pts/4    SNs  07:07   0:00 -bash
xxxxxxx  13822  0.0  0.0  17540  1388 ?        D    07:08   0:00 /bin/sh ttt.cgi  <== D
xxxxxxx  14063  0.0  0.0  17540   244 ?        R    07:09   0:00 /bin/sh ttt.cgi  <== R but cpu is 0%
xxxxxxx  14160  0.0  0.0 121104  1168 pts/4    RN+  07:09   0:00 ps auxw

====================
       D    Uninterruptible sleep (usually IO)
       R    Running or runnable (on run queue)
       S    Interruptible sleep (waiting for an event to complete)
       T    Stopped, either by a job control signal or because it is being
	    traced.
       W    paging (not valid since the 2.6.xx kernel)
       X    dead (should never be seen)
       Z    Defunct ("zombie") process, terminated but not reaped by its
	    parent.
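
To see what such a stuck process is waiting on, ps can also be asked for the kernel wait channel. The following is a sketch that assumes the procps version of ps found on most Linux systems; the WCHAN column may show only `-` on systems that hide it from unprivileged users. Both function names are made up for this example.

```sh
#!/bin/sh
# List the current user's processes with their state, kernel wait
# channel, and elapsed time. Assumes procps ps (typical on Linux).
list_my_processes() {
    ps -u "$(id -un)" -o pid,stat,wchan:20,etime,args
}

# Read ps output on stdin and keep the header line plus the rows
# whose STAT column starts with D (uninterruptible sleep).
only_uninterruptible() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

Running `list_my_processes | only_uninterruptible` while a CGI request is hanging would show only the processes in state D.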

NFS related? IPv6? Spawn configuration or throttling?

NFS causing 30sec delay (network issue?)
http://comments.gmane.org/gmane.org.user-groups.linux.tolug/60629

IPv6 causing 30 sec delay
https://bugzilla.novell.com/show_bug.cgi?id=304320#c5


#123

walterd, please email me (using the button below my post) with the domain you’re seeing this issue with.


#124

Hi Andrew.

I’ve sent e-mail.

Now, the delay is getting worse.

========================
2014-03-10 13:15:01-07:00
13:15:01 up 105 days, 2 min, 1 user, load average: 13.81, 14.68, 15.42

real 0m39.023s
user 0m0.004s
sys 0m0.008s

2014-03-10 13:20:01-07:00
13:20:01 up 105 days, 7 min, 1 user, load average: 13.97, 14.73, 15.26

real 0m36.334s
user 0m0.012s
sys 0m0.004s

2014-03-10 13:25:01-07:00
13:25:01 up 105 days, 12 min, 1 user, load average: 12.17, 14.42, 15.03

real 0m44.113s
user 0m0.004s
sys 0m0.008s

2014-03-10 13:30:01-07:00
13:30:01 up 105 days, 17 min, 1 user, load average: 14.25, 15.33, 15.37

real 0m47.170s
user 0m0.012s
sys 0m0.000s

2014-03-10 13:35:01-07:00
13:35:01 up 105 days, 22 min, 1 user, load average: 16.63, 17.07, 16.31

real 0m39.869s
user 0m0.008s
sys 0m0.012s


#125

We had a smooth-running site for 25 hours (on March 8), after it had been down for 40 minutes twice (restart?). It was fast, with no 30s response delays. For a moment we thought it was fixed…

After the 25 hours the problem came back and got worse and worse. We see response delays of multiples of 30s, up to 400 sec, when the site is up. We have regular downtime of 5 to 40 minutes, and even database errors on occasion.

http://i.imgur.com/6totajU.png
http://i.imgur.com/ZccbTmY.png
http://i.imgur.com/r0ZUPy5.jpg


#126

It finally came back to normal after more than a day of slowness.
(No response from the support, yet.)

The suggested remedy does not correct the problem once it happens, most probably.
Now, I’ll see if it works for prevention.

2014-03-11 02:30:01-07:00
02:30:01 up 105 days, 13:17, 2 users, load average: 10.69, 11.20, 11.09

real 0m30.089s <== 30 sec
user 0m0.016s
sys 0m0.004s

2014-03-11 02:35:01-07:00
02:35:01 up 105 days, 13:22, 2 users, load average: 9.84, 11.13, 11.19

real 1m8.337s <== More than 1 minute. Back end getting clogged?
user 0m0.004s
sys 0m0.008s

2014-03-11 02:40:01-07:00
02:40:01 up 105 days, 13:27, 2 users, load average: 8.52, 11.06, 11.29
2014-03-11 02:45:01-07:00
02:45:01 up 105 days, 13:32, 2 users, load average: 6.30, 9.65, 10.77

real 8m32.720s <== 8 minutes!
user 0m0.016s
sys 0m0.000s

2014-03-11 02:50:01-07:00
02:50:01 up 105 days, 13:37, 2 users, load average: 12.99, 12.26, 11.58

real 6m32.208s
user 0m0.000s
sys 0m0.008s

2014-03-11 02:55:01-07:00
02:55:01 up 105 days, 13:42, 2 users, load average: 13.80, 13.07, 12.12
2014-03-11 03:00:01-07:00
03:00:01 up 105 days, 13:47, 2 users, load average: 8.05, 10.27, 11.20

real 14m6.240s <== 14 minutes !!! Current record.
user 0m0.012s
sys 0m0.000s

2014-03-11 03:05:01-07:00
03:05:01 up 105 days, 13:52, 2 users, load average: 19.34, 14.01, 12.34

real 11m15.976s
user 0m0.000s
sys 0m0.008s

real 7m38.568s
user 0m0.012s
sys 0m0.000s

real 4m37.201s
user 0m0.000s
sys 0m0.008s

2014-03-11 03:10:01-07:00
03:10:01 up 105 days, 13:57, 2 users, load average: 10.91, 12.52, 12.16

real 3m50.722s
user 0m0.008s
sys 0m0.008s

2014-03-11 03:15:01-07:00
03:15:01 up 105 days, 14:02, 2 users, load average: 8.61, 12.50, 12.49
2014-03-11 03:20:01-07:00
03:20:01 up 105 days, 14:07, 2 users, load average: 8.97, 10.98, 11.84

real 9m5.485s
user 0m0.004s
sys 0m0.004s

2014-03-11 03:25:01-07:00
03:25:01 up 105 days, 14:12, 2 users, load average: 11.26, 10.46, 11.30
2014-03-11 03:30:01-07:00
03:30:01 up 105 days, 14:17, 2 users, load average: 11.60, 10.47, 10.99

real 1m42.411s
user 0m0.008s
sys 0m0.008s

real 6m43.498s
user 0m0.000s
sys 0m0.020s

real 11m57.525s
user 0m0.004s
sys 0m0.016s

2014-03-11 03:35:01-07:00
03:35:01 up 105 days, 14:22, 2 users, load average: 5.89, 8.61, 10.15

real 0m0.072s <== Finally, back to normal
user 0m0.004s
sys 0m0.016s