For several months now we have been having ongoing issues with the load on our dedicated server. The problem occurs at random, and it is always an Apache or PHP process that starts to consume 100% of the CPU. Websites become less and less responsive when this happens, and eventually they do not respond at all. Sometimes we can wait, and the process will release the CPU and the load will go back down on its own. Other times we have to kill all the Apache and PHP processes and restart Apache, after which everything is OK. Occasionally we even have to reboot the server because a process cannot be killed. The server hosts approximately 140 websites, all based on Joomla! in various versions.
This is causing a lot of service disruptions for our customers and their websites, and we would like to find out what we can do to fix it long term.
We have tried contacting live support, but they simply kill the process or reboot the server and say it is fixed. That is not a good long-term solution for us. It is exactly what we are already doing, and it is temporary at best. Because the problem is random, and we want to restore service to our customers as quickly as possible, putting in a ticket does not work: the response is not fast enough, and the issue is resolved before a higher-level admin has time to investigate. Our last contact with support suggested a cron job to restart Apache every hour. That is too disruptive. They also suggested increasing the memory allocation for Apache, but refused to say which options to change. I just don't see how this could be a memory issue at this point. Swap is never even touched on that server.
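Since the evidence is gone by the time anyone can look, one thing we are considering is logging a snapshot of the offending processes before we kill them. Below is a minimal sketch of that idea, not something we have running in production. It assumes Python 3 and the procps `ps` command are available, and that the runaway processes have names starting with "apache" or "php" (the prefixes, the 90% threshold, and the function names are all our own assumptions). It also tries to read each process's working directory from /proc, on the guess that this might hint at which of the 140 sites is involved.

```python
import os
import subprocess

# Assumed process-name prefixes; adjust to whatever `ps` shows on the server.
WATCHED_PREFIXES = ("apache", "php")


def parse_ps(output):
    """Parse the text printed by `ps -eo pid,pcpu,etime,comm,args`."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 4)
        if len(parts) < 4:
            continue
        try:
            pid, cpu = int(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip lines that do not look like process rows
        args = parts[4].strip() if len(parts) > 4 else ""
        rows.append({"pid": pid, "cpu": cpu, "etime": parts[2],
                     "comm": parts[3], "args": args})
    return rows


def hot_processes(rows, threshold=90.0, prefixes=WATCHED_PREFIXES):
    """Return watched processes at or above the CPU threshold, busiest first."""
    hot = [r for r in rows
           if r["comm"].startswith(prefixes) and r["cpu"] >= threshold]
    return sorted(hot, key=lambda r: r["cpu"], reverse=True)


def working_dir(pid):
    """Best-effort lookup of a process's current directory via /proc."""
    try:
        return os.readlink("/proc/%d/cwd" % pid)
    except OSError:
        return "(unreadable)"  # process exited or we lack permission


def snapshot():
    """Run ps once and return the hot Apache/PHP processes."""
    # Note: ps reports pcpu averaged over the process lifetime, so a
    # long-lived worker that only recently started spinning may read low.
    out = subprocess.check_output(["ps", "-eo", "pid,pcpu,etime,comm,args"])
    return hot_processes(parse_ps(out.decode("utf-8", "replace")))


if __name__ == "__main__":
    for proc in snapshot():
        print(proc["pid"], proc["cpu"], proc["etime"],
              working_dir(proc["pid"]), proc["args"])
```

Run from cron every minute or so with the output appended to a log file, this might at least give us (or a higher-level admin) something to look at after the fact.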
We are running Debian Squeeze (6.0.7) on this server, so not the latest available, but still fairly recent. However, Apache seems to be a custom-compiled version that DH packages for their servers. I know that making config changes can be hairy at best, and that they may be overwritten by the panel, so I'm decidedly cautious about this route.
Is there anything we can do? Has anyone seen this before? Any suggestions on where we might be having problems?
Thank you for any help,