I'm writing a script

design

#1

Anyone want to help solve an argument?

In case anyone has heard of the ROBOT 9000 script, it is an IRC robot written in Perl that prevents chatters from saying something that has already been said.
A full explanation can be found here: http://blag.xkcd.com/2008/01/14/robot9000-and-xkcd-signal-attacking-noise-in-chat/

Anyway, I’m adapting (read: rewriting) this in PHP for phpBB3. I’m in a bit of an argument. A PHP friend of mine thinks the best way to go would be to search the MySQL database for a similar post every time a post is made. I think it’d be better all around just to write the post to a flat file, without user data or anything like that, because it’d be easier to edit in the future if I want to and it doesn’t put too much pressure on the SQL server. After that, I want it to treat a match like a flood limit.

So, SQL search or flat file write?


#2

I don’t think I quite understand what your usage scenario is. You want to prevent identical posts from being made to your phpBB forum in the same way the xkcd-signal bot prevents identical lines from being posted to an IRC channel? What sort of editing do you suspect you might want to do later?

If I were responsible for doing this, I would definitely do it with SQL. For one thing, posts are necessarily being written to the database anyway, so kludging together a flat file would require you to save every post twice. Secondly, looking up data is precisely what a database is designed to do, and a database is likely to do a better job of indexing that data for later lookup than you’re likely to be able to do on your own. As an added advantage, using SQL means that you don’t have to deal separately with situations where users edit or delete their posts; unlike IRC, a forum is not always a “post it and move on” medium. Finally, I don’t know if your goal is to make a phpBB MOD out of this for release to the phpBB community, but if that is your intent, a SQL-based solution will probably be far more portable and prevent you from getting emails complaining about installation problems and screwy edge-case bugs.

As a final thought, and I’ll say this for whichever solution you choose, I would suggest that phpbb-signal work on the basis of hashes of posts rather than on the posts themselves. Over the long term, I suspect these will be a lot easier to manage. In a database, it’s a lot easier to index a hash than it is a text blob, which means faster lookups and better performance. In a flat file, it means you don’t need to find some way to delimit posts from one another, and the lookup algorithm becomes vastly simpler, which will make your life a lot less messy.
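To make the database variant concrete, here is a minimal sketch of an indexed hash lookup. It is written in Python with the standard `sqlite3` module standing in for MySQL, purely so it runs on its own; the table name `post_hashes` and the helper `try_post` are hypothetical and not part of phpBB3. The same idea should carry over to PHP with `sha1()` and a unique index on the hash column.

```python
import hashlib
import sqlite3

def post_hash(text: str) -> str:
    """Return the SHA-1 hex digest of the post text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# In-memory SQLite database standing in for the forum's MySQL database.
conn = sqlite3.connect(":memory:")
# Hypothetical table: the PRIMARY KEY gives us an index on the hash.
conn.execute("CREATE TABLE post_hashes (hash TEXT PRIMARY KEY)")

def try_post(text: str) -> bool:
    """Record the post's hash; return False if the same text was seen before."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO post_hashes (hash) VALUES (?)",
                (post_hash(text),),
            )
        return True
    except sqlite3.IntegrityError:
        # The hash already exists, so this is a duplicate post.
        return False
```

Letting the unique index reject the insert keeps the check and the write in a single statement, so two identical posts arriving at the same moment cannot both slip through.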


#3

Of course all of this may be moot. According to the DreamHost Terms of Service, you are not allowed to run IRC robots on a shared account. Do you have DreamHost PS?



#4

OP said he is implementing this for a phpBB3 forum, not IRC. Regardless, having the script check all previous posts will require additional resources, and he may very well have to move to DreamHost PS.



#5

I agree and have a few questions for the OP:

  1. Are you trying to prevent identical posts from going up or are you looking for similar posts?
  2. Is there some timeframe over which you’re looking to prevent identical posts? For example, are you trying to prevent the same post from going up multiple times in a row and only really care about the last minute or so? Or are you going to look at the entire post history?

If you’re just trying to prevent floods of identical posts, the filesystem solution sounds like it would be more efficient. Doing a giant text search in the database across all history will be very slow because the post text can’t be indexed, I believe (too long of a field).



#6

It’s based on an IRC robot; it isn’t itself an IRC robot. I don’t host IRC.


#7

It’s actually meant to keep the community original. It will be a permanent rule: if you post something that has already been posted, you will not be able to post for about 5 minutes. It essentially means you have to try really hard to make sure that your post is not only worthwhile but original.

I’m trying to prevent identical posts, in other words.

Also, it’s not flood protection; that’s built into phpBB3. I’m trying to look at the entire post history, from all boards.
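For anyone following along, the rule described above can be sketched roughly as follows. This is only an illustration in Python, not the actual phpBB3 code: the class `OriginalityFilter`, its method names, and the in-memory set are all hypothetical stand-ins for whatever storage ends up being used.

```python
import hashlib

MUTE_SECONDS = 5 * 60  # "about 5 minutes", per the description above

class OriginalityFilter:
    """Hypothetical sketch: reject repeated posts and mute the poster briefly."""

    def __init__(self):
        self.seen = set()       # hashes of every post accepted so far
        self.muted_until = {}   # user -> timestamp when the mute ends

    def submit(self, user: str, text: str, now: float) -> str:
        # A muted user cannot post anything until the mute expires.
        if now < self.muted_until.get(user, 0.0):
            return "muted"
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in self.seen:
            # Unoriginal post: reject it and start the mute.
            self.muted_until[user] = now + MUTE_SECONDS
            return "duplicate"
        self.seen.add(digest)
        return "accepted"
```

Passing `now` in explicitly (for example from `time.time()`) keeps the sketch easy to test; a real implementation would also need to persist both structures between requests.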

And thank you for your input, that was my conclusion as well. I did get a little bit of disagreement from several parties, which is why I want it discussed before I actually finish off function_posting.php.


#8

Of course, flat files aren’t exactly indexed either :wink:

But that’s where my recommendation of hashes, for either solution, comes in. You basically turn an arbitrary-length post into a short fixed-length hex string (40 characters for SHA1, 32 for MD5), which is small enough that it can be easily indexed, and that’s what you store. A cryptographic hash like SHA1 or MD5, both of which are built into PHP by default, will be unique enough for all practical purposes, as the odds against ever seeing a false duplicate would be astronomical.
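As a quick illustration of the fixed-length point, here is a small Python sketch using the standard `hashlib` module (in PHP the corresponding built-ins would be `sha1()` and `md5()`). The helper names `sha1_hex` and `md5_hex` are made up for this example.

```python
import hashlib

def sha1_hex(text: str) -> str:
    """SHA-1 digest of the text as a hex string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def md5_hex(text: str) -> str:
    """MD5 digest of the text as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

short_post = "first post"
long_post = "lorem ipsum " * 10_000  # roughly 120,000 characters

# The digest length does not depend on the length of the post.
print(len(sha1_hex(short_post)), len(sha1_hex(long_post)))  # 40 40
print(len(md5_hex(short_post)), len(md5_hex(long_post)))    # 32 32
```

Note that a hash only catches byte-for-byte identical text; whether to normalize case, whitespace, or punctuation before hashing is a separate design decision.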


#9

I agree, alpicola.

If it were some sort of flood prevention over, say, an N-minute history, one could easily store files in a directory, deleting those older than N minutes. The filenames could be a hash of the contents.

I suppose you could do something similar in the database…
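A rough sketch of the directory idea, in Python for illustration only: each post becomes an empty file named after its SHA-1 hash, and files older than the window are pruned on every check. The function name `seen_recently` and its parameters are hypothetical.

```python
import hashlib
import os
import tempfile
import time
from typing import Optional

def seen_recently(text: str, directory: str, window_seconds: float,
                  now: Optional[float] = None) -> bool:
    """Return True if the same text was recorded within the window."""
    now = time.time() if now is None else now

    # Prune marker files that have fallen out of the time window.
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) > window_seconds:
            os.remove(path)

    # The filename is the hash of the post, so lookup is a single stat call.
    marker = os.path.join(
        directory, hashlib.sha1(text.encode("utf-8")).hexdigest()
    )
    if os.path.exists(marker):
        return True

    # First sighting: create an empty marker stamped with the current time.
    with open(marker, "w"):
        pass
    os.utime(marker, (now, now))
    return False
```

For example, with `d = tempfile.mkdtemp()`, calling `seen_recently("hi", d, 60)` twice in a row returns `False` and then `True`. This sketch ignores concurrent requests; two identical posts arriving at the same instant could both pass the check.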
