Generate Apache-Style HTTP Access Logs via SQL and PHP

Does your web hosting company block access to access.log, the text file containing raw website log files? If so, you’re like me, and it sucks. There’s a plethora of gorgeous and extremely insightful website traffic analyzers, but all of them require access to raw HTTP access logs. Today I propose a semi-efficient way to generate such logs utilizing PHP to determine page load data (time, user IP, requested page, referring page, user client, etc) and SQL to save such data for easy retrieval later. Note that this method is a HUGE improvement of my previous project which used PHP scripts to store HTTP access logs as flat files. Although it worked in theory, in all practicality the process of opening, writing to, and closing a text file (which grew a few MB a week) was too cumbersome for my server to comfortable handle. The method described on this page utilizes SQL, a database engine well-suited to meet these exact demands. When we’re done, you’ll be able to use a web interface to view your access log (pictured, converting long, complicated search queries to web search and image search strings automatically), or have the option to export it directly to an access.log text file in a standard Apache-style format.
sql_php_http_log_viewer

First, make sure your database is structured appropriately. This page is written for those with a working knowledge of PHP and SQL, but if you’re new to the field I encourage you to learn! W3Schools.com is an awesome resource to rapidly learn new languages. Also, when starting-out with SQL (like me), phpMyAdmin is a awesome. The code, as it’s currently written (below) is designed to store data in the “nibjb” database under the “logs” table. Briefly, it uses PHP to determine user data (time, ip, requested page, etc.) and injects this information into the SQL database. In fact, it’s doing it to you right now! Don’t believe me? View the source of this web page and scroll to the bottom. BAM! There you are.

// logme.php
<?php

if ( !isset($wp_did_header) ) {
	$wp_did_header = true;
	require_once( '/home/content/n/i/b/nibjb/html/blog/wp-load.php' );
	//wp();
	//require_once( '/home/content/n/i/b/nibjb/html/blog/wp-includes/template-loader.php' );
}

function logwriter_handlevar($varname,$defaultvalue){
    $tempvar = getenv($varname);
    if(!empty($tempvar)) {
        return $tempvar;
    } else {
        return $defaultvalue;
    }
}

if (!empty($REMOTE_HOST)) {
$logwriter_remote_vistor = $REMOTE_HOST;
}else{
$logwriter_remote_vistor = logwriter_handlevar("REMOTE_ADDR","-");
}

$logwriter_remote_ident = logwriter_handlevar("REMOTE_IDENT","-");
$logwriter_remote_user = logwriter_handlevar("REMOTE_USER","-");
$logwriter_date = date("d/M/Y:H:i:s");
$logwriter_request_method = logwriter_handlevar("REQUEST_METHOD","GET");
$logwriter_request_uri = logwriter_handlevar("REQUEST_URI","");
$logwriter_server_protocol = logwriter_handlevar("SERVER_PROTOCOL","HTTP/1.1");
$logwriter_http_referer = logwriter_handlevar("HTTP_REFERER","-");
$logwriter_http_user_agent = logwriter_handlevar("HTTP_USER_AGENT","");
$logwriter_logstring = "$logwriter_remote_vistor $logwriter_remote_ident $logwriter_remote_user [$logwriter_date $logwriter_timezone] "$logwriter_request_method $logwriter_request_uri $logwriter_server_protocol" 200 - "$logwriter_http_referer" "$logwriter_http_user_agent"n";
?>

<?php
$username="YOUR_USERNAME";
$password="YOUR_PASSWORD";
$database="nibjb";
mysql_connect('mysql157.secureserver.net',$username,$password);
//mysql_connect(localhost,$username,$password);

$query = "INSERT INTO logs VALUES ('','$logwriter_date','$logwriter_remote_vistor','$logwriter_request_method','$logwriter_request_uri','$logwriter_server_protocol','$logwriter_http_referer','$logwriter_http_user_agent')";
mysql_query($query);
mysql_close();
?>

<!--
LOG DETAILS:
time: <?php echo($logwriter_date); ?>
vistor: <?php echo($logwriter_remote_vistor); ?>
method: <?php echo($logwriter_request_method); ?>
request: <?php echo($logwriter_request_uri); ?>
protocol: <?php echo($logwriter_server_protocol); ?>
referrer: <?php echo($logwriter_http_referer); ?>
agent: <?php echo($logwriter_http_user_agent); ?>
HTML LOG LINE:
<?php echo($logwriter_logstring); ?>
 -->

All right, that was easy. Every time we load logme.php, it adds the data to the SQL database. To add data every time you go to a particular web page, you could use a PHP include() statement in each webpage, or you could take advantage of the PHP’s auto_append_file feature! Simply insert the following line into your php.ini file if you have access to yours:

auto_append_file = "/path/to/html/logme.php"

How do we access this data once it’s been loaded into the database? There are many different ways, but I’ve chosen to get a little creative with a sleek, yet minimalistic web-based fronted. It basically just shows the last [x] number of entries in the access log. You can adjust the number of entries displayed by slapping on some arguments to the URL, transforming viewLast.php into viewLast.php?limit=123 or something (see the screenshot above). I won’t discuss the details of this script. It’s self-explanatory.

// viewLast.php
<html>
<head>
<style type="text/css">
td {
font-family: verdana, arial;
font-size:10px;
}
</style>
</head>
<body>
<?php

$limit = (int)$_GET['limit'];
if ($limit===0) {$limit=25;}

$username="YOUR_USERNAME";
$password="YOUR_PASSWORD";
$database="nibjb";
mysql_connect('mysql157.secureserver.net',$username,$password);
mysql_select_db($database) or die( "Unable to select database");
$query="
SELECT * FROM logs WHERE
request NOT LIKE "%testlog.php%"
AND request NOT LIKE  "%/logs/%"
AND request NOT LIKE "%/wp-admin/%"
ORDER BY ID DESC LIMIT 0,$limit
";
//$query="SELECT * FROM logs WHERE referrer LIKE "%&q=%" or referrer LIKE "%&prev=%" ";
$result=mysql_query($query);
$num=mysql_numrows($result);
mysql_close();
?>

<b><?php echo($query); ?></b>
<table border="1">
<tr>
<td>id</td>
<td>time</td>
<td>visitor</td>
<td>request</td>
<td>referrer</td>
</tr>

<?php
$i=1;
while ($i<$num) {
$id=mysql_result($result,$i,"id");
$time=mysql_result($result,$i,"time");
$visitor=mysql_result($result,$i,"visitor");
$method=mysql_result($result,$i,"method");
$request=mysql_result($result,$i,"request");
$protocol=mysql_result($result,$i,"protocol");
$referrer=mysql_result($result,$i,"referrer");
$referrer2=str_replace("&", "& ", $referrer);
$agent=mysql_result($result,$i,"agent");
$searchWords="";
$searchEngine="";
if (strpos($referrer, "q=")>0 and strpos($referrer, "google")>0) {$searchEngine="Google Web Search: ";}
if (strpos($referrer, "prev=/images")>0 and strpos($referrer, "google")>0) {$searchEngine="Google Image Search: ";}

// SEARCH EXTRACTION //
$j=0;
$rTemp=str_replace("prev=/images%3Fq%3D", "q=", $referrer);
$rTemp=str_replace("?q=","&q=", $rTemp);
$rTemp=str_replace("%2B"," ", $rTemp);
$rTemp=str_replace("%26"," ", $rTemp);
$rTemp=str_replace("%3D"," ", $rTemp);
$rTemp=str_replace("+"," ", $rTemp);
$wvars=split("&",$rTemp);
while ($j<count($wvars)){
	if (substr($wvars[$j],0,2) === "q=") {
		$searchWords = $searchWords . $wvars[$j] . " ";
		}
	$j++;
}

$searchWords=substr($searchWords,strpos($searchWords, "q=")+2);
if (strlen($searchWords)<3) {$searchWords=$referrer;}
////////////////////////

echo "
<tr>
<td>$id</td>
<td>$time</td>
<td>$visitor</td>
<td><a href='$request'>$request</a></td>
<td>$searchEngine <a href='$referrer'>$searchWords</a></td>
</td>
";
$i++;
}
?>
</table>
</body>
</html>

And you’re done! This example is a simplified, bare bones example. You can take this a long way if you’d like. My goal is lite & flexible. A quick query from Python and Matplotlib (for example) yields gorgeous visual representations of otherwise-convoluted data!

If you have any questions, or end-up developing something awesome with this code, shoot me an email! It’s not luxurious, but this code works for me, and I share it with the best of intentions.


     

Removing Textile From WordPress

I realized that the C code from yesterday wasn’t showing-up properly because of textile, a rapid, inline, tag-based formatting system. The app converted blog code from [“text”:http://www.SWHarden.com/ *like* _this_] to [text like this. ] While it’s fun and convenient to use, it’s not always practical. The problem I was having was that in C code, variable names (such as _delay_) were becoming irrevocably italicized, and nothing I did could prevent textile from ignoring code while styling text. The kicker is that I couldn’t disable it easily, because I’ve been writing in this style for over four years! I decided that the time was now to put my mad Python skills to the test and write code to handle the conversion from textile-format to raw HTML.
I accomplished this feat in a number of steps. Yeah, I could have done hours of research to find a “faster way”, but it simply wouldn’t have been as creative. In a nutshell, I backed-up the SQL database using PHPMyAdmin to a single “x.sql” file. I then wrote a pythons script to parse this [massive] file and output “o.sql”, the same data but with all of the textile tags I commonly used replaced by their HTML equivalent. It’s not 100% perfect, but it’s 99.999% perfect. I’ll accept that. The output? You’re viewing it! Here’s the code I used to do it:

## This Python (1.0) script removes *SOME* textile formatting from WordPress
## backups in plain text SQL format (dumped from PHP MyAdmin). Specifically,
## it corrects bold and itallic fonts and corrects links. It should be easy
## to expand if you need to do something else with it.
## Enjoy! --Scott Harden (www.SWHarden.com)

infile = 'x.sql' # < < THIS IS THE INPUT FILE NAME!

replacements=   ["r"," "],["n"," n "],["*:","* :"],["_:","_ :"],
                ["n","<br>n"],[">*","> *"],["*< ","* <"],
                [">_","> _"],["_< ","_ <"],
                [" *"," <b>"],["* "," "],[" _"," <i>"],["_ ","</i> "]
                #These are the easy replacements

def fixLinks(line):
    ## replace ["links":URL] with [<a href="URL">links</a>]. ##
    words = line.split(" ")
    for i in range(len(words)):
        word = words[i]
        if '":' in word:
            upto=1
            while (word.count('"')&lt;2):
                word = words[i-upto]+" "+word
                upto+=1
            word_orig = word
            extra=""
            word = word.split('":')
            word[0]=word[0][1:]
            for char in ".),'":
                if word[1][-1]==char: extra=char
            if len(extra)>0: word[1]=word[1][:-1]
            word_new='<a href="%s">%s</a>'%(word[1],word[0])+extra
            line=line.replace(word_orig,word_new)
    return line

def stripTextile(orig):
    ## Handle the replacements and link fixing for each line. ##
    if not orig.count("', '") == 13: return orig #non-normal post
    line=orig
    temp = line.split
    line = line.split("', '",5)[2]
    if len(line)&lt;10:return orig #non-normal post
    origline = line
    line = " "+line
    for replacement in replacements:
        line = line.replace(replacement[0],replacement[1])
    line=fixLinks(line)
    line = orig.replace(origline,line)
    return line

f=open(infile)
raw=f.readlines()
f.close
posts=0
for raw_i in range(len(raw)):
    if raw[raw_i][:11]=="INSERT INTO":
        if "wp_posts" in raw[raw_i]: #if it's a post, handle it!
            posts+=1
            print "on post",posts
            raw[raw_i]=stripTextile(raw[raw_i])


print "WRITING..."
out = ""
for line in raw:
    out+=line
f=open('o.sql','w')
f.write(out)
f.close()

I certainly held my breath while the thing ran. As I previously mentioned, this thing modified SQL tables. Therefore, when I uploaded the “corrected” versions, I kept breaking the site until I got all the bugs worked out. Here’s an image from earlier today when my site was totally dead (0 blog posts)

hostingwork


     

PHP-Generated Access.log is a Success

THIS CODE HAS BEEN UPDATED!
THIS CODE HAS BEEN UPDATED!
THIS CODE HAS BEEN UPDATED!

>>> CHECK OUT THE NEW CODE < << [Generate Apache-Style HTTP Access Logs via SQL and PHP]

OBSOLETE CODE IS BELOW…

A few months ago I wrote about a way I use PHP to generate apache-style access.log files since my web host blocks access to them. Since then I’ve forgotten it was even running! I now have some pretty cool-looking graphs generated by Python and Matplotlib. For details (and the messy script) check the original posting.

This image represents the number of requests (php pages) made per hour since I implemented the script. It might be a good idea to perform some linear data smoothing techniques (which I love writing about), but for now I’ll leave it as it is so it most accurately reflects the actual data.


     

Using PHP to Create Apache-Style Access.log

THIS CODE HAS BEEN UPDATED!
THIS CODE HAS BEEN UPDATED!
THIS CODE HAS BEEN UPDATED!

>>> CHECK OUT THE NEW CODE < << [Generate Apache-Style HTTP Access Logs via SQL and PHP]

OBSOLETE CODE IS BELOW…

My web server blocks access to my apache-generated visitor logs (commonly stored in “access.log”). Therefore, many great site usage stats generators (such as awstats – see this example) cannot be used to analyze web traffic to my site. (How many people go what pages? Where do they come from? What search phrases do they type into Google to find my website?) My web host does allow PHP, and access to php.ini, so I figured that I could generate my own access.log using PHP code. I succeeded, but had a hard time doing this because it’s not clearly documented elsewhere – so I’ll make it clear.

Sample line from access.log generated by my PHP script:
132.170.10.227 – – [22/Jan/2009:11:58:49 +0800] “GET /blog/2005-06-29-eva-05-attack-scotts-sanity/ HTTP/1.1” 200 – “http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=8Lk&q=swharden+eva-05&btnG=Search” “Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5”

All I had to do was insert the following line at the end of my php.ini file:

  

 auto_append_file = "/home/content/n/i/b/nibjb/html/logme.php"  

 

And I placed logme.php in my root folder with the following code:

  

 $logwriter_logformat = "combined"; // log format,combined or common  

 $logwriter_logdir = "/home/content/n/i/b/nibjb/html/logs/"; // physical path where your log file located  

 $logwriter_logfilename = "access.log"; // your log file's filename  

 $logwriter_timezone = "+0800"; // your server's time zone. +0800 means GMT+8  

   

 function logwriter_writelog($logstring){  

   

 global $logwriter_logdir,$logwriter_logfilename;  

 $fullpathfilename = $logwriter_logdir.$logwriter_logfilename;  

   

 if (!is_file($fullpathfilename)) {  

 print "Log file doesn't exist or file is corrupt.";  

 return;  

 }  

   

 if (!is_writeable($fullpathfilename)) {  

 print "Log file is not writable,please change its permission.";  

 return;  

 }  

   

 if($fp = @fopen($fullpathfilename, "a")) {  

 flock($fp, 2);  

 fputs($fp, $logstring);  

 fclose($fp);  

 }  

 }  

   

 function logwriter_handlevar($varname,$defaultvalue) {  

 $tempvar = getenv($varname);  

 if(!empty($tempvar)) {  

 return $tempvar;  

 } else {  

 return $defaultvalue;  

 }  

 }  

   

 if (!empty($REMOTE_HOST)) {  

 $logwriter_remote_vistor = $REMOTE_HOST;  

 }else{  

 $logwriter_remote_vistor = logwriter_handlevar("REMOTE_ADDR","-");  

 }  

   

 $logwriter_remote_ident = logwriter_handlevar("REMOTE_IDENT","-");  

 $logwriter_remote_user = logwriter_handlevar("REMOTE_USER","-");  

 $logwriter_date = date("d/M/Y:H:i:s");  

   

 $logwriter_server_port = logwriter_handlevar("SERVER_PORT","80");  

 if($logwriter_server_port!="80") {  

 $logwriter_server_port = <a href="".$logwriter_server_port;"></a>  

 }else{  

 $logwriter_server_port = "";  

 }  

   

 $logwriter_request_method = logwriter_handlevar("REQUEST_METHOD","GET");  

 $logwriter_request_uri = logwriter_handlevar("REQUEST_URI","");  

 $logwriter_server_protocol = logwriter_handlevar("SERVER_PROTOCOL","HTTP/1.1");  

   

 if ($logwriter_logformat=="common") {  

 $logwriter_logstring = "$logwriter_remote_vistor $logwriter_remote_ident $logwriter_remote_user [$logwriter_date $logwriter_timezone] "$logwriter_request_method $logwriter_request_uri $logwriter_server_protocol" 200 - 

 ";  

 }else{  

   

 $logwriter_http_referer = logwriter_handlevar("HTTP_REFERER","-");  

 $logwriter_http_user_agent = logwriter_handlevar("HTTP_USER_AGENT","");  

   

 $logwriter_logstring = "$logwriter_remote_vistor $logwriter_remote_ident $logwriter_remote_user [$logwriter_date $logwriter_timezone] "$logwriter_request_method $logwriter_request_uri $logwriter_server_protocol" 200 - "$logwriter_http_referer" "$logwriter_http_user_agent" 

 ";  

   

 }  

   

 logwriter_writelog($logwriter_logstring);  

 

Note that the PHP code must be surrounded with < ? php ?> as demonstrated here

The result? As you can tell, my logme.php dumps data to www.swharden.com/logs/access.log – if you browse a few pages on my website, or even use Google to search for me (ie: google for ‘swharden’ and ‘minidisc’) you can see yourself in the logfile – pretty cool huh? Once I have a good volume of log data I’ll demonstrate how to turn it into useful information.