Dealing with Web Spam


#1

I am getting a lot of bots and the like going after my site. According to AWStats, my bandwidth was around 188 MB in January; in March it jumped to 3 GB. I found a lot, and I mean a lot, of unresolved IP addresses, plus bots that are using my bandwidth like there is no tomorrow.
I thought I would be smart and block the unresolved IPs, only to find out that one of those IPs was mine; its bandwidth did not match that of my IP address that did resolve.
I want to know what to do before I end up with everybody’s IP address in the block list in my .htaccess file.
BTW, some of the bots are still gaining access even though they have been blocked, including psbot, which has grabbed 41 MB of bandwidth so far this month.
I am getting concerned that this jump in bandwidth will affect my account.
Silk

My website


#2

3 GB for a month? I can understand your concern, but do you really think that using 0.15% of what’s allotted to you is such a big deal?

At any rate, consider using robots.txt for this; well-behaved bots will follow it. Those that don’t, you can block, but at least base the blocking on user agent rather than blocking every IP you see.
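
For instance, a minimal robots.txt along these lines would ask one crawler to stay out entirely while leaving the site open to everyone else (psbot is used as the example token only because it came up in this thread; a bot honors this only if it is well behaved, and the exact token should be checked against the bot’s own documentation):

# Keep one specific crawler out of everything
User-agent: psbot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Disallow: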

emufarmers.com
Very little to do with either emus or farmers!


#3

I can only suggest the one method I use myself, and it involves downloading and maintaining a browser capability file, making sure your PHP installation’s .ini is updated to use it, and some PHP programming.

First, get Gary Keith’s PHP browser capability file “php_browscap.ini” from http://browsers.garykeith.com/downloads.asp and put it somewhere accessible to your web app.
Second, modify your php.ini file’s “browscap” entry to point at that file.
Third, include a routine at the beginning of your pages that uses PHP’s built-in get_browser function to fetch the browser info each time a new session is started. The array it returns has a key for “isbanned”, but you can filter on a number of different flags. If it’s a banned user agent, simply send the session to a low-bandwidth “sorry, you’re banninated” page.
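
A minimal sketch of that routine might look like the following, assuming php.ini already has something like browscap = /path/to/php_browscap.ini and that a hypothetical low-bandwidth /banned.html page exists:

<?php
// Look the user agent up once per new session; get_browser() is
// expensive, so cache the verdict in the session.
session_start();

if (!isset($_SESSION['isbanned'])) {
    $info = get_browser(null, true);   // true => return an array
    // 'isbanned' is one of the flags in the full browscap file;
    // treat a missing key or a failed lookup as "not banned".
    $_SESSION['isbanned'] = is_array($info) && !empty($info['isbanned']);
}

if ($_SESSION['isbanned']) {
    // Shunt banned agents to a cheap static page and stop.
    header('Location: /banned.html');
    exit;
}
?>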

Pros: Once you’ve downloaded the browscap file you can set additional banned flags for agents you don’t want visiting.
Cons: Ya need to keep the file updated as new browsers and revisions thereof are added.

After that, it’s down to banning by IP address, which is a whole 'nother can of worms.


#4

I was mostly concerned with the large jump. I have read several threads about websites being disabled because of large spikes in usage, since CPU and memory are more of a concern than bandwidth.
Even though I have a robots.txt file, I use it mainly to guide legitimate bots away from directories that don’t need to be in the search results.
In my .htaccess file I have IP blocking, a user-agent-blocking rewrite rule, and redirects. I have not read anywhere whether .htaccess files can handle all three at the same time or whether they have to be in a specific order.
Now I am wondering, because psbot got through. Either psbot is using a different user-agent name or the rewrite rule can’t be applied last.
My .htaccess file is in the following order:

# IP blocking
Order allow,deny
Deny from ip address
Allow from all

# Redirect
Redirect permanent webaddress webaddress

# PHP handler
AddHandler phpFive .php
Action phpFive /cgi-bin/php.cgi

# User-agent blocking ([NC] makes the match case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} username [NC]
RewriteRule ^.* - [F]

Silk

My website