My issue has been resolved. The problem turned out to be between the network attached storage (hard disks) and my server. Unfortunately, support decided the scope of the problem was not large enough to post on the status website, which I disagree with. They have a tough job balancing our need to know (and the value of the time we spend diagnosing problems on our sites) against the alternative, which would be to post every outage on the status site: the equivalent of yelling 'fire' in a crowded theater.
My suggestion was that if they do not wish to generate undue support calls, emails, bad press, etc., they could post smaller-scale outages on our panel, where we would see them when we log in to our accounts.
I think your ideas are great. Ironically, 'information technology' groups are often the last ones to automate themselves. They should be the ones finding outages, not us. It is a big job, but there is probably some 80/20 rule where a simple system that detects CPU starvation and network issues and runs a few simple PHP/MySQL queries would catch most things. As for a way to automate it, there are several UNIX command-line programs that can be used to probe a server or hit some web pages and do automated testing. Just off the top of my head, there are curl, lynx, wget, ping, etc. These could be batched up in a script and run as a cron job every x minutes. Having said all that, I'm sure the people running the server farm are much more capable than we are (or at least than I am) of coming up with a good, scalable solution. I think our part is to shine a light on the issue we wish to see resolved. Let them come up with a solution that solves the problem to our satisfaction and, more importantly, that they can live with and be happy supporting.
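For what it's worth, here is a rough sketch of the kind of cron'd check described above. It is only an illustration, not anything the host actually runs: it uses Python's standard urllib in place of curl or wget, the script name and URLs are made up, and the timeout and "slow" thresholds are arbitrary assumptions. It fetches each URL given on the command line, reports any failures on stderr, and exits non-zero if anything failed.

```python
#!/usr/bin/env python3
"""Minimal uptime probe meant to be run from cron (illustrative sketch)."""
import sys
import time
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 10   # assumption: give up on a page after 10 seconds
SLOW_SECONDS = 5.0     # assumption: treat anything slower than this as a failure


def check(url, timeout=TIMEOUT_SECONDS):
    """Fetch url once and return (ok, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()            # pull the whole body, like curl or wget would
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        return False, f"unreachable: {exc}"
    elapsed = time.monotonic() - start
    if elapsed > SLOW_SECONDS:
        return False, f"slow: {elapsed:.1f}s"
    return True, f"HTTP {status} in {elapsed:.2f}s"


def main(urls):
    """Check every URL; print failures to stderr and return an exit code."""
    failures = 0
    for url in urls:
        ok, detail = check(url)
        if not ok:
            failures += 1
            print(f"FAIL {url}: {detail}", file=sys.stderr)
    return 1 if failures else 0


# Only run when URLs are passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

A crontab entry along the lines of `*/5 * * * * /usr/bin/python3 /home/me/check_sites.py http://example.com/` (path and URL hypothetical) would run it every five minutes. Since it prints nothing when all is well, and cron normally mails a job's output to its owner, you would only hear from it when something is wrong.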
Thanks for your feedback. Hopefully they are listening and watching the forums. :-0