25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more...
Home   Projects   Up   Sitemap   Search   Blog   Forum+Chat   About Us   Privacy   Terms of Use   Feedback   FAQ   Images   Services   Ads   Donate   Humor

WebStats.mdb - CLF web site log statistics

WebStats.mdb is a Microsoft Access 2003 database for analyzing raw web server access logs that are in Combined Log Format (CLF). It is not designed to be an all-in-one solution to replace Webalizer or other statistics programs, but it does:

  • Generate a few specific reports and statistics that appear to be difficult or impossible to generate with Webalizer.
  • Provide the flexibility of having the data stored in a general-purpose database. You can easily create new custom filters and reports as you need them.

This is the database I use for web statistics reports. It is a work in progress with new features added as I need them.

Click here to download webstats.zip (59 KB). It contains:

  • WebStats.mdb -- the database in MS Access 2003 format.
  • WebStats.doc -- this page of instructions in MS Word 2003 format.
  • Copying.txt -- the GNU GPL license.

Changes in Version 2.0 (3/12/2007):

  1. CDate is used to interpret the CLF date/time string. CVDate used in Version 1.0 didn't properly handle date calculations when the log contained entries spanning multiple years.
  2. Two fundamental queries now request start and end dates so you can obtain a report for a specific time period. All queries based on those base queries now also request start and end dates.
  3. There are several new queries and 1 new form.
  4. You should be able to import your Log table from a Version 1 database into Version 2 with no difficulties.

The questions the database is supposed to answer are:

  1. How many real, human, people are looking at my site?
  2. What real, HTML, pages are they viewing?

Real People

A server log file shows all the requests for files that were made to the server. It includes all the requests by your site visitors, intermingled with all the requests by automated crawlers and by you, the site owner. If your site is new or has a low amount of traffic, the large number of accesses by crawlers and bots and by yourself during building and testing may make it difficult to answer the question, "How many real, human, people are actually looking at my site?" So the first purpose of this database was to allow filtering out web crawlers and yourself so only real human visitors are reported.

Real Pages

The server also logs every request for every file. If an HTML page contains references to 50 small .gif files used on its navigation bars, then every request for that page will generate 51 log file entries. The second purpose of this database was to allow defining what is a Real Page, and filter out the rest.

 

Both of these things can be done to some degree with Webalizer when you run it locally on your own computer and customize its .conf file. It generates more useful reports locally than it does when run on a remote server where customization of the .conf file is not allowed.

However, even run locally, some of the reports I wanted to see would:

  1. require modifying Webalizer filters to get the desired result for one report but with unwanted side effects for other reports generated during the same run.
  2. require creating multiple Webalizer .conf files for different purposes, with the difficulties of maintaining multiple .conf files when modifying settings common to all of them.

So, to get more flexibility, I created this database.

Instructions for use:

  1. Download your raw access log from your server and unzip it, if necessary.
  2. Rename the log file so its extension is .txt (.csv, .tab, or .asc would also be ok)
  3. In WebStats.mdb, click File > Get External Data > Import
  4. Select Files Of Type: Text Files, navigate to the file, select it, and click Import.
  5. At the Security Warning about the Microsoft Office ACWZLIB.MDE wizard (which is a built-in module that Access should probably know enough to trust), click Open.
  6. In the Import Text Wizard dialog, click Delimited.  Next...
  7. Select Delimiter = space, and Text Qualifier = ".  First row does not contain field names.  Next...
  8. Store your data in an existing table: Log.
  9. At the next page, click Finish. You can display Help if you want to, but you don't want the table analyzed.

There will frequently be import errors, saved to a separate table, which you will discover are truncation of Referrer fields that were longer than 255 characters. This can be resolved by making Referrer a Memo field, but that would make it a non-indexable field, so I didn't do it.

Customization is required before use

Examine the following queries in Design View before using them:

  1. The RealPagesByRealPeople query requires customization:
Host You might be able to set criteria for this field to exclude your own visits to your site.

If any part of your own IP address never changes, enter it here in the form of: Not Like "xxx.yyy.*". This will also filter out anyone else whose IP matches that pattern. If you're with a large ISP, that's not desirable. The current text in the query, Not Like "000.000.*", is a dummy placeholder to show where the line goes.

The criteria here also specify some robots that I routinely filter out.

Agent This field is where you can filter out bots, crawlers, and anyone else you consider not a real visitor.
FileName This is where you filter out file extensions you do not consider to be Real Pages. There are no filters provided here because after importing data I now permanently delete all records that do not refer to Real Pages. The action query for doing that is described below.
  1. The Server Error Report query also has a couple of criteria where you might want to make modifications regarding which files you want to exclude and which server result codes you consider errors or not.
  2. You should look at all the other queries, too. Since this is the database I actually use, it is likely I've set a number of criteria that won't suit your purposes.

Database contents:

Tables:

Log Table designed to allow direct import of CLF log files without any preprocessing.

I have discovered to my surprise that not all CLF log files have the same format. This database was designed around the Apache log files that I receive from my web host. However, the logs generated by Apache 2 on my local computer lack the last two fields provided for in this database. It is possible you might need to adjust the Log table to match the fields provided in your own log files.

Log Table Fields:

Host IP address of the computer that made this request to the server
RFCId RFC 1413 identity of the remote host (usually a dash, meaning not known)
UserID The REMOTE_USER, the HTTP authenticated login userid of the requester. Usually a dash, or your own userid.
Time Date and time (server time)
Zone Time zone (where the server is located)
File The HTTP request line, containing the filename requested
Result The server status result code
Bytes Number of bytes actually sent by the server
Referrer The web page containing the link that was followed to this page of your site
Agent User-Agent info about the remote host browser and operating system

Queries

The most useful ones are highlighted in yellow.

The queries that have {} around their descriptions are data sources for other queries, and are otherwise not interesting.

AllPagesAccessedByOneHostIPA Find all the accesses attempted from a given IP address.
Browser Tabulation Crude tabulation of browsers & versions. Set the ones you want to recognize in Module1.
Daily Stats - Real People Real Pages (Chart Source) Totals for each day: Unique human visitors and Real pages viewed.
Delete Trivial Files GIF JPG PNG CSS PERMANENTLY DELETE tiny files that are not real pages, to reduce database size. Set your criteria carefully. Preview before deleting records.
Hack Attempts Possible attempts to hack the server. (If you discover a successful hack, see Step By Step Repair After a Website Hack and How To Prevent It.)
Last Access Date For Each Page The most recent access date for each of your site pages.
PagesViewedPerHost {Number of pages viewed by each unique visitor (IP).}
PagesViewedPerHost - Average, StdDev Average number of pages viewed by each unique visitor (IP), and standard deviation.
PagesViewedPerHost-Tabulation How many visitors viewed 1 page, how many viewed 2, ...
PopularPages-RealPeopleOnly The most popular pages on your site, with the count of PageViews of each.
PopularPages-RealPeopleOnly-BySearchString The most popular pages whose path+filenames match a substring you enter.
Raw Log in Chronological Order Raw log, no filtering, in chronological order.
Real Visitors and Real Page Views - Tabulation By Month Total Unique Visitors and Total PageViews for each MONTH.
Real Visitors and Real Page Views - Tabulation By Month, Day Total Unique Visitors and Total PageViews for each DATE.
RealPagesByRealPeople Chronological Log of Real Pages by Real People. A source for several other queries. It is like cpanel's Latest Visitors report, but filtered.
RealPagesByRealPeopleSortedByFile For each file on your site, shows all accesses of it, in chronological order.
RealPagesByRealPeopleSortedByHost For each unique Visitor (IP), shows a chronological log of their visits, and what they viewed.
Redirects Page requests resulting in code 301 or 302 redirects
Server Error Report Server access errors to investigate.
Unique Agent Strings {Tabulates the incidence count of each unique Agent string and extracts the Browser used.}
Unique Visitors - By Month {For each MONTH, shows all the Unique Visitors (IPs) and how many pages each viewed.}
Unique Visitors - By Month, Day {For each DATE, shows all the Unique Visitors (IPs) and how many pages each viewed.}

Form

Daily Stats - Real People Real Pages 2-series line chart tracking daily a) Real Visitors and b) PageViews of real pages
PagesViewedPerHost - Tabulation Bar chart showing how many visitors viewed 1 page, how many viewed 2,...

 

 

Valid HTML 4.01 Transitional Valid CSS
View content labeling at ICRA.
Copyright ©2008 Steven Whitney. Last modified 06/30/2008.