Evaluation server questions

I’ve got a site here that’s been moved to an evaluation server because it’s using too many resources on its shared server. It’s there to be looked at so we can see what to trim to make it more neighborly, if at all possible. On an evaluation server you get access to some resource logs that I’m having trouble reading. I’ve been through all the kbase articles I can find, and there’s no mention anywhere of how to interpret these logs. Can anyone help me out here?

It generates two different files for review per day. One is raw data in this format:

user 0.04 cpu 1573k mem 0 io index.php

about a bazillion lines of that ^. Then there’s an overview/analyzed file that has column headers and looks a bit different:

Process      CPU seconds   user      machine    count   average
index.php    26477.5600    87.641%   110.323%   15485   1.710

The overview/analyzed file ^ looks to take all files and tally up their numbers from the big raw-data file.

It’s hard for me to fully understand these logs without any reference. I think I have a pretty good idea, or at least an overview of the big things to attack, but without a definitive reference I’m a little leery that I don’t know all there is to know.

There is a perl script that does the analysis of the huge data files (which I assume are generated by sa(8), which prints system accounting statistics), but I’m having trouble running down anything that helps me (the completely ignorant *nix admin) with the output.

Right now it looks as though the major problem is search engine spiders; they account for anywhere from 100-400 times the traffic of normal users on the site at any given time. The largest offender is Yahoo Slurp, and we are clamping down on that robot; MSN also has robots directives for us to use (Crawl-delay). Google, however, appears to have nothing in place that would allow us to restrict their crawling of our site in any way (aside from a complete ban), and they are among the very top offenders.

Anyone know of a way to talk to Googlebot like you can to MSNbot and Yahoo Slurp using Crawl-delay? The only thing I’ve been able to determine for Google is that you have to personally contact them and give them a crawl delay to manually feed to the bot…

All help’s appreciated!
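For anyone following along, the robots.txt directives mentioned above look roughly like this. This is only a sketch; the 30-second value is illustrative, and note that Crawl-delay is a non-standard extension honored by Slurp and msnbot but ignored by Googlebot:

```
# robots.txt sketch -- Crawl-delay (in seconds) is honored by
# Yahoo Slurp and msnbot; Googlebot does not support it.
User-agent: Slurp
Crawl-delay: 30

User-agent: msnbot
Crawl-delay: 30
```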


Is that the index.php of a WordPress blog? There are a few WP plugins known to cause horrible spikes like that.

I don’t work here. I’m just your typical support forum volunteer.

The raw data shows information for each execution of the named file. You are really only concerned with the number of CPU seconds (0.04 in this case). The analyzed file is a tally for each named file. Note that ‘index.php’ covers every file called index.php running as your user, so it may actually represent more than one URL.

The ‘count’ is the number of times it was executed in the reporting period and the ‘average’ is the average cpu second usage per execution.

Does that help?

  • Dallas
  • DreamHost Head Honcho/Founder
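As an aside, that tally is simple enough to sanity-check yourself. Here’s a rough sketch, assuming each raw line has the shape shown earlier in the thread (“user &lt;cpu&gt; cpu &lt;mem&gt;k mem &lt;io&gt; io &lt;filename&gt;”; the exact field layout is an assumption from the one example line):

```python
import re
from collections import defaultdict

# Matches raw lines like: "user 0.04 cpu 1573k mem 0 io index.php"
# (field order assumed from the example shown above)
LINE = re.compile(r"user\s+([\d.]+)\s+cpu\s+\d+k\s+mem\s+\d+\s+io\s+(\S+)")

def tally(lines):
    cpu = defaultdict(float)   # total CPU seconds per file name
    count = defaultdict(int)   # number of executions per file name
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue           # skip anything that isn't a raw data line
        cpu[m.group(2)] += float(m.group(1))
        count[m.group(2)] += 1
    # "average" in the analyzed file is CPU seconds per execution
    return {f: (cpu[f], count[f], cpu[f] / count[f]) for f in cpu}
```

So `tally(open("raw.log"))` would give you per-file totals, counts, and averages comparable to the ‘count’ and ‘average’ columns Dallas describes.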

Not a WordPress blog no.

ok, so the rest of the row is inconsequential then? That was mostly what I didn’t understand, thanks.

Kinda figured that, and it’s a mighty sucky deal if you’re trying to find a rogue script. Any way at all to break it down by directory?

You bet, thanks.

If it’s the bots doing the damage, things should clear up fairly quickly. If that’s not the entirety of it, is there any way we can break down CPU time for the unique instances of index.php?


Actually, I’ve thought of another clarification for you. I’m a little confused about the reports, that is, which report is for which day?

It would appear that “0” is always the current day (judging by the time stamp), is this correct?

edit gah! I’m being mailed the “analyzed” numbers daily after the reporting fires, right? Well, to compound my confusion over what’s what (or more accurately, “what’s when”), the “0” log that we’ve been assuming is the most current looks much better today since clamping down on the robots. But the mail I received this morning does not match the numbers from the “0” file; in fact, it’s a duplicate of yesterday’s numbers.

So I guess some kind of explanation on the date thing would be good :slight_smile:


Yeah, most of that row is inconsequential for our needs now. That’s the output from the process accounting tools and it includes extra information.

There’s no way with our current tools (and the Linux implementation of process accounting) to break down the index.php information by directory, I’m afraid. You may want to look at the Analog stats for more information about which parts of your website are being hit, but it still won’t provide the whole picture. We do have more information that includes timestamps for each file access, so you may be able to compare that to your access logs, but it would be a tedious process. Send me a private message if you think that would be of use.

  • Dallas
  • DreamHost Head Honcho/Founder

The .0 should always be ‘yesterday’, and .1 is the day before, .2 the day before that, etc.

I’m not sure what information is emailed so I’ll ask around about that.

  • Dallas
  • DreamHost Head Honcho/Founder
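In other words, under that rotation scheme the suffix maps to a date like this (a sketch, assuming daily rotation with .0 always covering yesterday, as Dallas describes):

```python
from datetime import date, timedelta

def report_date(suffix, today=None):
    """Map a rotated-log suffix (0, 1, 2, ...) to the day it covers,
    assuming .0 is yesterday, .1 the day before, and so on."""
    today = today or date.today()
    return today - timedelta(days=suffix + 1)
```

That would explain the confusing email: a mail sent this morning summarizes the .0 file, which is yesterday’s numbers, not today’s.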

Thanks so much man. I’ll just sneak in with the lull in conversation and say what a class act you guys have been with this process. In fact you have always been pure class in every facet of my dealings with you since I’ve been here. Where other hosts have policy, you guys have people, and good people at that. DH policy seems more along the lines of “we reserve the right to…” but the people always come first with a fair shake, bad fu is only executed after it’s been determined that the user is not willing to address the issue in an honest manner. That’s just right.

I’ve seen it a few times over the years here in the support forums, people freaking out and talking about how DH wronged them. It’s processes like these that make it absolutely impossible to believe them on their word. In fact I’m immediately inclined to write such ranters off as fussy, delusional, panicky, short on reason, whatever. I’d recommend anyone new to Dreamhost do the same.

I’ve always been given above and beyond a fair shake here.



yeah right

case in point :smiley:

d’oh! That was the last of my troll food =[


well things are looking great, we’re getting it whittled down, I think you’ll let us stay! :slight_smile:

You’d better since you just made us DHSOTM! I’m sure the traffic from that newsletter plug will take down your network (edit: actually, it won’t since you left the S off and it’s directing people to a popup ad generation station). You knew it’d be Josh’s new wife making the fatal mistake of pimpin a site on a limbo server =D

congrats to Mr and Mrs Josh


Well, we received a mail that said we’d been moved back to the original spot again, aya! Thanks for the opportunity to take care of things.

We have whittled it down quite a bit, but with the resources you provided and the opportunity to be more diligent we’ve spotted some stuff that we’ll continue to look into to trim things even more, which leads me to a question…

Do you guys have anything monitoring/deflecting Santy worm (or similar) traffic?

We seem to have a lot of that throwing itself at the site, and the requests are generating thousands of errors per day.

Any help thwarting this kind of crap would be very gratefully received, and it’d be a wonderful thing to have on record in the forum/wiki.



We do have some rules in place that attempt to block the santy worm, I believe. They take the form of mod_rewrite rules that should be working on all traffic on all domains. If those aren’t working properly we may have to make some adjustments.

We also regularly add exploits we see into our mod_security rules, accessible via the ‘Extra Web Security’ setting for domains. Enabling that is a good way to protect yourself from the exploits we see occurring; the attacks are blocked before they hit your website, so it doesn’t add to your CPU usage at all.

  • Dallas
  • DreamHost Head Honcho/Founder
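For reference, rules of that general shape can also go in a per-site .htaccess. The sketch below is purely illustrative (DreamHost’s actual rules aren’t published here); Santy-style requests typically stuffed code into the highlight query string of phpBB’s viewtopic.php, using double URL encoding (%2527):

```
# .htaccess sketch -- block query strings that look like Santy-style
# code injection (illustrative patterns, NOT DreamHost's actual rules)
RewriteEngine On
RewriteCond %{QUERY_STRING} highlight=%2527 [NC,OR]
RewriteCond %{QUERY_STRING} (echr|esystem|wget|chmod)\( [NC]
RewriteRule .* - [F]
```

The [F] flag returns 403 Forbidden before any PHP runs, which is why this class of blocking doesn’t add to CPU usage.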

We’re running the mod_security and have been all along as far as I know.

We’re continuing to collect data, looking for patterns. If something pops up that we think should be passed on we definitely will!

Thanks Dallas