Leech / spider protection (log monitoring?)

I just had the same IP address leech / spider my entire site two times in a row (1.4 GB * 2), one right after the other. I have had a few other people leech off of my site on the order of a few hundred megabytes, but almost 3.0 GB of bandwidth by a single IP over 1-2 days has been the worst case yet and has slightly irked me. The user-agent was set to look like plain old MSIE, so blocking by user-agent wouldn’t have helped.

Does anyone know of some simple software (a shell script run as a cron job?) that will monitor HTTP logs for certain conditions (“> X requests per minute”, “> Y megabytes downloaded per hour”) and block offending IP addresses (e.g. using .htaccess)?
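For anyone wondering what such a check might look like, here is a rough sketch in Python. It assumes the log is in Apache's common/combined format, and the function name, thresholds, and regex are made up for illustration. It counts per calendar minute and per calendar hour rather than using true sliding windows, so treat it as a starting point only.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed input: Apache common/combined log lines, for example
# 203.0.113.5 - - [10/Oct/2006:13:55:36 -0700] "GET /a.html HTTP/1.1" 200 2326
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d{3} (\d+|-)')

def find_offenders(lines, max_req_per_min=120,
                   max_bytes_per_hour=200 * 1024 * 1024):
    """Return the set of IPs that exceed either (hypothetical) threshold."""
    requests = Counter()    # (ip, minute bucket) -> request count
    bytes_sent = Counter()  # (ip, hour bucket)   -> bytes transferred
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        ip, stamp, size = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        requests[(ip, when.strftime("%Y-%m-%d %H:%M"))] += 1
        bytes_sent[(ip, when.strftime("%Y-%m-%d %H"))] += 0 if size == "-" else int(size)
    offenders = {ip for (ip, _), n in requests.items() if n > max_req_per_min}
    offenders |= {ip for (ip, _), n in bytes_sent.items() if n > max_bytes_per_hour}
    return offenders
```

You would feed it the lines of the access log (for example from `open(path)`) and then decide what to do with the returned IPs.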

Most leech protection I’ve been able to find (e.g. on the DH wiki) has been about hotlinking, which I don’t really mind because it’s easy enough to monitor. Watching for individual IPs that leech is a slightly more advanced problem; I’m just wondering if anybody knows of anything offhand…

This reads like you had a web browsing person who came to your site, cruised through the links on your site, and used them to download things you’ve got accessible from your publicly viewable web pages.

I’m assuming that you’ve got public links available on your site; otherwise the referrer wouldn’t have been your host and you could eliminate the problem by disabling hotlinking. I don’t know how else someone could get the links unless you put them out there. This means the site is still highly vulnerable to download agents, which are easily found for all major browsers. You could require some degree of authentication, which would identify the culprit; no matter how simple it is, it at least makes them log in.

Just my opinion here, but you absolutely don’t want to consider something that will analyze logs and modify the .htaccess file automatically. Those files are sketchy enough as it is, and even the most talented developers occasionally screw them up when attempting legitimate modifications.

I recommend that you just get real vigilant in keeping an eye on your site’s access by monitoring the DH-supplied traffic stats frequently. If the abuse is limited to a few individuals, you can find their IPs and ban them one at a time. Even though daily logs are disabled by default, you can re-enable them manually.
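If you do go the manual route, a tiny helper can at least keep typos out of the ban list. This is only a sketch: `deny_block` is a made-up name, it assumes the older Apache 2.2 `Order`/`Allow`/`Deny` directives (Apache 2.4 uses `Require not ip` instead), and it only prints text for you to review and paste into .htaccess by hand. It does not touch the file.

```python
import ipaddress

def deny_block(ips):
    """Build Apache 2.2-style deny lines for hand-pasting into .htaccess."""
    lines = ["Order Allow,Deny", "Allow from all"]
    for ip in sorted(set(ips)):
        # Raises ValueError if the string is not a valid IPv4/IPv6 address,
        # so a typo never ends up in the generated block.
        ipaddress.ip_address(ip)
        lines.append(f"Deny from {ip}")
    return "\n".join(lines)

# 203.0.113.5 is a documentation address used purely as a placeholder.
print(deny_block(["203.0.113.5"]))
```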

If this is happening so frequently that you can’t keep the problem in check by taking a quick look at your stats once a day, quit hosting all that pr0n! (J/K)

Due to the way my site is set up, there is a lot of content to spider, and it is very easy to reach that content with a spider (i.e. 4-5 hops from the index to reach thousands of pages on the site). People also have an incentive to download the content en masse. One of the main features of my site is a Chinese character dictionary, which provides pronunciation / images / other data for Chinese characters (and there are thousands of Chinese characters!).

Every month there are a few people who try to leech a substantial portion of content from the site; this guy (3.0 GB in 1-2 days) is just one extreme example. I’m fine with people using any of the content on the site, but it’d be nice if I could spot people who are going about it the wrong way.

I’m not real diligent about checking logs (once a week?), so by the time I see that someone has leeched the site it’s been a couple of days, and banning their IP won’t do much good… Maybe I’ll just write a script to monitor the logs for IPs with lots of requests and send me an email when it sees a potential mass downloader.
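A minimal sketch of that kind of script, in Python. It assumes each log line starts with the client IP (as in Apache's common/combined format). The function names, the 5000-request threshold, and the example.com addresses are all placeholders.

```python
import re
from collections import Counter
from email.message import EmailMessage

IP_RE = re.compile(r"^(\S+) ")  # first whitespace-delimited field = client IP

def heavy_hitters(lines, min_requests=5000):
    """Return (ip, count) pairs at or above the threshold, busiest first."""
    counts = Counter(m.group(1) for m in map(IP_RE.match, lines) if m)
    return [(ip, n) for ip, n in counts.most_common() if n >= min_requests]

def build_alert(hitters, sender="monitor@example.com", recipient="me@example.com"):
    """Build (but do not send) a plain-text alert listing the busy IPs."""
    msg = EmailMessage()
    msg["Subject"] = f"Possible mass downloader: {len(hitters)} IP(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{ip}\t{n} requests" for ip, n in hitters))
    return msg
```

Run from cron, it would read the current log, call `heavy_hitters`, and only if the list is non-empty hand the message to something like `smtplib.SMTP("localhost").send_message(msg)`. Whether a local mail server is available on the host is an assumption.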

Thanks for the ideas :)

Conceptually, you could require logins and give visitors “karma” ratings based on how they participate: + points for uploads or postings in a forum, or whatever you consider “good”, and - points for downloads, etc. If a visitor’s karma decreases, slowly increase a delay time for each request… I don’t know of an easy, free script for download.
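Just to make the idea concrete, the delay part could be as simple as the sketch below. The function name, the 0.25-second step, and the 10-second cap are invented for illustration. How karma gets stored and updated is left out entirely.

```python
def request_delay(karma, step=0.25, max_delay=10.0):
    """Seconds to wait before serving a request, given a visitor's karma.

    Visitors with zero or positive karma are not delayed; each point of
    negative karma adds `step` seconds, capped at `max_delay`.
    """
    if karma >= 0:
        return 0.0
    return min(max_delay, -karma * step)

# A visitor at -4 karma would wait 1 second per request with these settings.
print(request_delay(-4))
```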

I have to say I’m a little disappointed, though, with the sentiment of your post and concern. A few thoughts:

I agree 3 GB in 1-2 days is a lot. OTOH, even if that happened every day, you wouldn’t come close to using your bandwidth quota.

It is not clear at a glance what parts are mirrored. That history and sentiment might seem to encourage additional mirroring, which requires downloading. While some crawlers use delays, an individual’s efficiency is best served by quick downloading, if the server can handle it.

Some people who see useful things like to download them for using/reading off-line. Some people who see things that disappeared previously like to download a copy just in case they disappear again.

One person’s “leeching” is another person’s productive use of a resource.

If you can go days or weeks without noticing, until you review logs, does it matter?

I’d rather have somebody downloading information to make good use of it than somebody “hammering” (requesting) again and again and again and again on formmail or guestbook scripts that transfer a trivial number of KB and no longer really function. :)


A karma system might work, but I do not want to require any kind of registration for people to use the site; the site is similar to a dictionary, not a forum or message board.

I know DH has a huge bandwidth quota; all I want is a way to take some action (send me an email, block the IP, etc.) when someone is clearly abusing the site. In this case the guy most certainly was abusing the site.

I have my email address on just about every major page so that people can email me with requests and feedback. I have had people email me asking if they could mirror or use parts of the site offline, and I was glad to provide them with a tgz of the content they needed. I just don’t want people to abuse the site.