Httpd logs

Couldn't find a thread through extensive searching that answered these questions, and unfortunately I'm a little new to the concept I want to pursue. What I'd like to do is take the server-generated log files for all of my sub-domains, combine them into one larger log, then load that into a DB, or even a server-side MySQL DB. I'm trying to experiment a bit with data mining using Transact-SQL (don't ask, but it's what the place I work at uses), and I've got a local test instance of SQL Server I can load the info into. Since what I would be doing on the job is querying and analysis of time-indexed data, I figure learning off of time-indexed log files of some sort would be ideal.

Now I know where the files are located, $HOME/logs/, $HOME/logs/, etc… what I need to do is append them together. Unfortunately I don’t know too much concerning shell scripting.

Basic concept is a shell script that goes something like this:
(please note: my original notes were pseudocode; roughly translated into shell, it would be something like this. The site names are placeholders, since I'd blanked out my real sub-domains.)

#!/bin/sh
# Placeholder sub-domain names; fill in the real ones.
sites="site1 site2 site3"
# Start the daily log from a blank template.
cp /home/grymwulf/dailylog.blank /home/grymwulf/dailylog.log
# Yesterday's date as yyyymmdd (GNU date).
yesterday=$(date -d yesterday +%Y%m%d)
for site in $sites; do
    cat "/home/grymwulf/logs/$site/httpd/logfile.$yesterday" >> /home/grymwulf/dailylog.log
done

And then just build another script that emails the daily log to an email address, and deletes the daily log.
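If it helps, a rough sketch of the e-mail half of that idea in Python, using the standard library's email and smtplib modules. Everything here is a placeholder: the path, the addresses, and the assumption that a local SMTP server is available.

```python
import smtplib
from email.message import EmailMessage

def build_log_mail(log_path, sender, recipient):
    """Build (not send) an e-mail message with the daily log attached.
    Path and addresses are placeholders, not real settings."""
    msg = EmailMessage()
    msg["Subject"] = "Daily httpd log"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Yesterday's combined access log is attached.")
    with open(log_path) as fh:
        # A str payload defaults to text/plain; disposition is "attachment".
        msg.add_attachment(fh.read(), filename="dailylog.log")
    return msg

# Sending, then deleting the file, would look roughly like:
#   with smtplib.SMTP("localhost") as smtp:
#       smtp.send_message(msg)
#   os.remove(log_path)
```

The send step is left commented out because whether a local SMTP relay is usable depends on the host.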

I read your post several times, and still can’t find any “questions”. Did you have a question, or were you just sharing your thoughts? :wink:


While it might be rejected on Jeopardy for not being in the form of a question, there is indeed a question here IMO. It looks like a request for suggestions on importing raw log files into an RDBMS, which will then serve as a data warehouse for log aggregation and reporting.

First, just because I like to be aware of stuff like this: Transact-SQL (T-SQL) is the proper name for Microsoft's flavor of procedural SQL scripting. Oracle's version is called PL/SQL, but I am not sure what the MySQL equivalent is named.

Data loads of the raw log files can be accomplished with manual scripting, but SQL Server also provides fairly sophisticated utilities for this sort of loading (bcp and BULK INSERT, for example), since it is such a common task. For MySQL I am sure there is a more appropriate method, but I know you can load the raw log into an OpenOffice spreadsheet and get it into a holding database, which you can use to generate MySQL import scripts.
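For the "generate MySQL import scripts" route, a minimal sketch in Python. The `access_log` table name is hypothetical, and the parsed row is assumed to already be a plain dict of column names to values:

```python
def sql_insert(row, table="access_log"):
    """Build one INSERT statement for a parsed log row (a dict of
    column -> value).  Naive quoting: fine for generating an import
    script from your own logs, not for untrusted input."""
    cols = ", ".join(row)
    vals = ", ".join(
        str(v) if isinstance(v, int)
        else "'" + str(v).replace("'", "''") + "'"
        for v in row.values()
    )
    return f"INSERT INTO {table} ({cols}) VALUES ({vals});"

# Hypothetical parsed row; column names are illustrative only.
row = {"host": "66.249.66.1", "status": 301, "bytes_sent": 247}
print(sql_insert(row))
# prints: INSERT INTO access_log (host, status, bytes_sent) VALUES ('66.249.66.1', 301, 247);
```

Writing one statement per line to a `.sql` file gives you something you can feed straight to the `mysql` command-line client.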

You can get a description of the columns contained in the access log by checking out the wiki article on setting up Awstats. You would just need a table to hold the various details in their own columns, as well as a PK and a FK to a table describing your websites.

LogFormat="%host %other %other %time1 %methodurl %code %bytesd %refererquot %uaquot"

Example line: - - [21/Nov/2006:00:52:33 -0800] "GET /robots.txt HTTP/1.1" 301 247 "-" "Exabot/3.0"
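As a sketch of what parsing that layout looks like, here is a small Python regex for the combined/ECLF format above. The field names are my own choices, and the client IP in the sample is made up, since the host was blanked in the quoted example line:

```python
import re

# Combined/ECLF access-log pattern: host, identd, user, time, request,
# status, bytes, referer, user-agent.  Field names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one access-log line, or None if the
    line does not match the expected format."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    row = m.groupdict()
    # Apache logs "-" when no bytes were sent.
    row["bytes"] = 0 if row["bytes"] == "-" else int(row["bytes"])
    return row

# Made-up client IP; the rest is the example line from the thread.
sample = ('66.249.66.1 - - [21/Nov/2006:00:52:33 -0800] '
          '"GET /robots.txt HTTP/1.1" 301 247 "-" "Exabot/3.0"')
print(parse_line(sample)["status"])
# prints: 301
```

Each dict this produces maps naturally onto one row of the table described above, one column per named group plus your PK/FK.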

There might be some other questions in your post, but like rlparker I’m having difficulty identifying them. Maybe I’ll dork out and work on something that might help you. Keep it up though, this kind of exercise is good for keeping yourself useful in industry.

There is no real reason to concatenate your log files first; just load them sequentially. This will make it easier to keep the data separated in case of mistakes. Also, you're not going to be able to delete the daily log files with any scripts; they're owned by root… at least on my machine they are.
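A sketch of that sequential approach in Python, assuming the per-site layout mentioned earlier in the thread (/home/grymwulf/logs/&lt;site&gt;/httpd/logfile.yyyymmdd); the exact directory layout and file names are guesses from this discussion, not a documented standard:

```python
import datetime
import glob
import os

def yesterdays_logs(log_root="/home/grymwulf/logs"):
    """Yield (site, path) for each site's log file from yesterday,
    one at a time, so each file can be loaded and checked on its own
    instead of being concatenated into one big log first."""
    stamp = (datetime.date.today()
             - datetime.timedelta(days=1)).strftime("%Y%m%d")
    for site_dir in sorted(glob.glob(os.path.join(log_root, "*"))):
        path = os.path.join(site_dir, "httpd", f"logfile.{stamp}")
        if os.path.exists(path):
            yield os.path.basename(site_dir), path
```

A loader would then iterate over these paths, parse each file, and commit per file, so a bad file only affects its own batch.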

I guess my question kind of got mis-routed when I threw together my little programming train of thought; when an idea hits me like that, I try to capture it. Basically, what I had been asking for was a way/package/script that would or could automate loading the httpd server logs into a MySQL database. I can easily use the MySQL ODBC driver to import into Access, then push it to SQL Server, or just use a MySQL export to upload to SQL Server.

And yes, Transact-SQL is M$'s version of SQL, based on the ANSI SQL-92 standard. It's also used by Sybase, and the reason M$ uses it is that SQL Server was originally a rebranding of Sybase technology. I've actually read up on it a bit to understand the why & how behind it, as I've always believed that gives me a better handle on something. Plus it impresses the PHBs to no end when you can spout stuff like that :slight_smile:

So basically, the question: does anyone know of a common script that can be run from cron to automate loading the server logs into MySQL? Perl, Python, PHP, C; I don't care about the language, just that it will do it.

Hey, it’s easy to get caught up in the thought process and forget to ask for what you need! :wink: At least Ben (pangea33) made a stab at it and provided some useful information.

While I don’t know of a script that I can guarantee will do exactly what you want “out of the box”, I have played with a couple that should need very little tweaking to allow you to suck up Dreamhost style Apache “Extended Common Log Format” (ECLF) files into a MySQL database:

1. Here is a skeletal PHP application for parsing Apache log files and importing them to MySQL that may need very little tweaking.

2. apacheLogSplit is a "more modern" version that may need a little more tuning for the DH log format, but might be easier to work with.

Sourceforge also has a plethora of apache log related data mining tools, but that would defeat the point of the exercise, as I believe one of your stated goals is to manipulate the sql as a learning exercise in addition to just consuming the “mined” output. :wink:

While you will still need to “gather” the individual files (as has been suggested, I wouldn’t concatenate them if they are headed to a database anyway - why load the server or risk the wrath of the prockiller), hopefully one of these tools will get you started. Good Luck!


Thanks for the links, I’ll take a look in that direction. :slight_smile:

Anyone know a way to pick up on what modules are enabled for apache? I’m assuming that dh uses apache2, and taking a look over at shows an interesting module: mod_log_mysql. It seems to fit the bill for what I want to do.

Logs are rotated daily… so you need to grab the data daily… and I don't think it takes too much CPU…

Something quick, I think:
find ./ -name "access.log" -exec cat {} \; >> big.log
(note the >> to append instead of replace)
find /home/youruser -name "access.log" -exec cat {} \; >> big.log

and in the crontab
0 1 * * * sh /home/youruser/

Maybe you need to chmod u+x the script.

And maybe some day you will need to delete the big.log file…
Since it is Debian,
on crontab:
@monthly sh /home/youruser/
rm /home/your/user/big.log

Or, more fancy:

cat /dev/null > /home/your/user/big.log

As for sending it via e-mail… well, I would prefer to use scp, FTP, or some other method…

In any case, if all your sites are under one user, then all your logs are in that user's logs directory anyway, which makes things easier… If you have different users, well, it will need a little tweaking.

This previous post has information about Apache modules in use at Dreamhost, as well as an explanation as to how I determined which modules are present.

Unfortunately neither mod_log_mysql nor mod_log_sql is available in either version of Apache on Dreamhost :frowning: . It’s a shame, as it would make what you are trying to do a lot easier.


The links from this wiki phpinfo article give some of the info, in the SERVER_SOFTWARE section; it’s less complete than your list, but maybe includes the “big” items?


Yep, that is a start and is good for showing the PHP modules installed but (with the exception of the single entry that runs under mod_php) only shows the “barest” of info concerning the Apache modules installed.

I suppose I should probably put the more "complete" Apache module information in the wiki. Any suggestions as to where I should put it? I don't think it really belongs with the link you referenced, as it is more about the Apache config than the PHP config. It's also likely to change, and if it does and we lose the ability to run/force mod_php usage, I don't know how to tell what modules are installed; I can't find (by design, I'm sure!) the appropriate httpd.conf file on DH to read the info.