Blocking bad robots


#1

I would like to know what measures fellow Dreamhosters take to block bad robots (those that ignore robots.txt) and site downloaders from their sites, and if the measures are effective.

  • marsbar

#2

I don’t do any blocking myself (I’m nowhere near the bandwidth limit), but…

How to block spambots, ban spybots, and tell unwanted robots to go to hell


#3

Thanks for responding, Bob. Thanks for the link to the helpful guide, Mark.
Will the ever-increasing size of my .htaccess file (currently 8KB) seriously impact the speed at which my site loads, and will it severely tax the server itself? If so, are there ways to ‘condense’ the file, or are there better ways to block bad bots and downloaders?
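For instance, would it work to collapse many single-bot lines into one pattern with alternation? Something like this, perhaps (untested, and the bot names are just for illustration):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (WebBandit|SiteSnagger|EmailSiphon) [NC]
RewriteRule .* - [F]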

  • marsbar

#4

Some people in the comments on that article, and on the WebmasterWorld forums, were asking the same question about the load. That’s not my area of expertise, though…

The real question is, what type of robots specifically are you concerned about with your site? (Your concerns most likely don’t coincide exactly with those of the person who wrote that article.) Are you trying to get rid of email address harvesters? Is it moral disagreement with the “trademark, copyright, and plagiarism policebots”? Are certain robots hitting your site with too many requests? Entering private areas of the site? “Borrowing” all of your content without permission? Some of these problems might be easier to deal with than others. Some may be impossible to prevent.

I would look at the specific things you’re concerned about, try to block those, and not worry about the rest unless it becomes a problem. (For example, it’s been a while since that article was written–it’s possible that some of the “misbehaved” robots have been fixed since then, or that the static IP ranges given are out of date.)


#5

Basically, you create a .htaccess file in your site root and can block by user agent and/or IP. Take a look at http://torbie.com/htaccess.txt (you can copy this and rename it to .htaccess).

From the web logs, I block specific user agents that ignore robots.txt. I also block IPs that constantly fish around and try to hit various CGI scripts. This has worked very well for me.
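To give a rough idea, the relevant part looks something like this (the user agent and IPs below are placeholders, not entries from my actual list):

# flag user agents seen ignoring robots.txt
SetEnvIfNoCase User-Agent "BadBot" block_me
Order Allow,Deny
Allow from all
# deny flagged agents, plus IPs that kept probing CGI scripts
Deny from env=block_me
Deny from 192.0.2.1
Deny from 198.51.100.0/24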

I got the original idea from http://www.clockwatchers.com/robots_bad.html


#6

I use .htaccess too, like this:

SetEnvIfNoCase User-Agent "^[Ww]eb[Bb]andit" bad_bot
SetEnvIfNoCase User-Agent "^WWW-Collector-E" bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot

And it seems to work great. Not sure how much bandwidth it actually saves from those “bad bots,” but I don’t like 'em rooting around anyway. I steer clear of banning by IP address, since an evil bot may not always be the one using a particular addy. It wouldn’t be polite to ban a legitimate visitor, would it? :wink:
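(Side note: since SetEnvIfNoCase already matches case-insensitively, the bracket classes are probably redundant; a plain pattern like this should catch the same bots:

SetEnvIfNoCase User-Agent "^webbandit" bad_bot

but the longer version works fine too.)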

~Chell


#7

I only ban by IP when it is consistent and frequent abuse (at that point they are no longer a legitimate visitor :slight_smile:). It doesn’t happen often (the list I have is the result of several years of monitoring).